## Friday, August 19, 2011

### Correlation and Causality

I'm writing a paper on the topic of "From correlation to causal inference" for a workshop I'm planning to attend next month at the University of Michigan. A couple of recent papers featured on blogs make nice examples to use in my paper.

Basically, we can only make causal statements based on a simple regression analysis if:

1. We can establish from theory that an explanatory variable is exogenous. In other words, it isn't itself caused to some degree by the variable we are trying to explain.

2. We are sure we haven't omitted any variables that affect the variable we are trying to explain and that might be correlated with the variables included in the regression.

Instrumental variables and Granger causality testing are two approaches that go beyond the simple regression model to try to make causal statements.
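To make the instrumental variables idea concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption for illustration: the instrument `w`, the unobserved confounder `u`, the coefficients, and the function name `simulate_iv` are all hypothetical. With a single instrument, the IV estimate reduces to cov(w, y) / cov(w, x), which is what the code computes.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists (divides by n)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n


def simulate_iv(n=20000, beta=1.0, seed=0):
    """Compare the OLS and simple IV slopes when x is endogenous.

    Hypothetical data-generating process:
      w: instrument, affects x but has no direct effect on y
      u: unobserved confounder, affects both x and y
      x = w + u + noise
      y = beta * x + 2 * u + noise
    """
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]
    x = [wi + ui + rng.gauss(0, 1) for wi, ui in zip(w, u)]
    y = [beta * xi + 2.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

    ols = cov(x, y) / cov(x, x)  # biased, because x is correlated with u
    iv = cov(w, y) / cov(w, x)   # consistent, because w is unrelated to u
    return ols, iv


if __name__ == "__main__":
    ols, iv = simulate_iv()
    print(f"true slope: 1.0, OLS: {ols:.2f}, IV: {iv:.2f}")
```

Under these assumptions the OLS slope converges to 5/3 rather than the true value of 1, while the IV estimate should land close to 1. The catch is that the instrument's validity (no direct effect on `y`, no link to `u`) has to be argued from theory; it is built into the simulation here, not tested.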

In some instances, though, exogeneity and the direction of causality are obvious. For example, in a joke paper discussed by Andrew Gelman, Bezimeni (2011) claims* to regress individual ages from survey data on responses to a survey question on trust, a factor derived from a factor analysis of various variables, and the percentage of overqualified women in national parliaments’ cafeterias. Clearly, individual age is exogenous and cannot be caused by any of the explanatory variables, so the supposed regression is nonsense. Instead, age might explain some of the responses.

Average age in a location, however, might be an endogenous variable, and researchers need to be cautious about using it as an explanatory variable in a regression. For example, if we regressed income per capita in local government areas in Australia on average age, we could not necessarily interpret the results causally, because the age composition of a location will depend to some degree on the economic opportunities available, and vice versa.
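The two-way dependence in the average age example can be sketched with a small simulation. This is an illustration under made-up assumptions, not an analysis of Australian data: the structural coefficients `a` and `b`, the shocks `u` and `v`, and the function names are all hypothetical. Here `y` stands in for income and `x` for average age, and each affects the other.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate_simultaneity(n=20000, a=0.5, b=0.5, seed=0):
    """Regress y on x when each variable causes the other.

    Hypothetical structural equations:
      y = a * x + u   (effect of x on y, the one we want to estimate)
      x = b * y + v   (feedback from y to x)
    Solving the pair gives the reduced forms used to generate the data.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        xs.append((b * u + v) / (1 - a * b))
        ys.append((a * v + u) / (1 - a * b))
    return ols_slope(xs, ys)


if __name__ == "__main__":
    slope = simulate_simultaneity()
    print(f"true effect of x on y: 0.5, OLS slope: {slope:.2f}")
```

With `a = b = 0.5` and unit-variance shocks, the regression slope converges to 0.8 rather than 0.5, because it mixes the effect of `x` on `y` with the feedback from `y` to `x`. Setting `b = 0` removes the feedback, and the slope then recovers `a`.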

Then there are cases where an explanatory variable is clearly exogenous and appears to have a significant effect on the dependent variable, and yet theory suggests that the relationship is spurious and due to omitted variables that happen to be correlated with the explanatory variable in question. In a paper discussed by the blog Economic Logic, Westling (2011) regresses national economic growth rates on average reported penis lengths and other variables and finds that shorter penises are correlated with more rapid economic growth over the sample period. Penis length would seem to be exogenous, but obviously this relationship would not have held in earlier periods, when economic growth was more rapid in Europe and its Western offshoots and slower in Asia. So the result is likely due to omitted variable bias, unless the effect only became relevant in recent decades.
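A small simulation shows how an exogenous regressor with no effect at all can still pick up a sizeable coefficient. It is only a sketch with invented numbers and says nothing about what the omitted variable in Westling's regression actually is: the variable `z`, the coefficients, and the function names are hypothetical. The "long" regression that controls for `z` is computed by partialling `z` out of both `x` and `y` (the Frisch-Waugh result), which gives the same coefficient on `x` as a regression of `y` on both variables.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def residuals(y, z):
    """Residuals from a simple regression of y on z (with an intercept)."""
    n = len(y)
    mean_y, mean_z = sum(y) / n, sum(z) / n
    slope = ols_slope(z, y)
    return [yi - mean_y - slope * (zi - mean_z) for yi, zi in zip(y, z)]


def simulate_omitted_variable(n=20000, seed=0):
    """Return (short, long) slopes on x, with z omitted and then included.

    Hypothetical data-generating process:
      z: omitted variable that drives y
      x: exogenous trait, correlated with z but with no effect on y
      y = 1.5 * z + noise
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [-0.8 * zi + rng.gauss(0, 1) for zi in z]
    y = [1.5 * zi + rng.gauss(0, 1) for zi in z]

    short = ols_slope(x, y)                              # z omitted
    long = ols_slope(residuals(x, z), residuals(y, z))   # z controlled for
    return short, long


if __name__ == "__main__":
    short, long = simulate_omitted_variable()
    print(f"slope on x without z: {short:.2f}, with z: {long:.2f}")
```

Under these assumptions the slope on `x` is clearly negative when `z` is left out and close to zero once `z` is controlled for, even though `x` is not caused by `y` in any way.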

* Though regression results are reported, it is obvious from the variables named that no regression analysis was in fact conducted.