## Sunday, August 9, 2015

### The Extent and Consequences of P-Hacking in Science

Interesting paper from my biology colleagues at ANU on the effects of "p-hacking" on reported science. P-hacking means searching for more significant results by trying various statistical models or samples and reporting only the most significant ones. They conclude that when there is a strong real effect it can be detected despite p-hacking by looking at the "p-curve", which is the distribution of p-values across all the studies collected in a meta-analysis. If the curve is right-skewed, meaning there is a peak at very high significance levels (p-values a lot smaller than 5%), then there is a real effect. However, p-hacking can inflate the estimated size of the effect if we use a simple average of effect sizes in the literature. I think the main novelty of their paper is that they used text-mining to collect a large number of p-values from various fields of science and test these ideas in the empirical literature.

In meta-analysis in economics, a popular approach is to use regression analysis to test the effect of degrees of freedom or precision (the inverse of the standard error) on the values of the reported test statistics. This effect is called the power trace. The idea is that if there is a true effect, then, due to increasing statistical power, reported test statistics will be more significant the higher the degrees of freedom in the underlying study.* Some of these methods can also be used to estimate the true effect size adjusted for publication bias.
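As a rough illustration of the power trace idea, here is a minimal Python sketch that regresses test statistics on the square root of the degrees of freedom. The data are made up, and both the choice of regressor and the helper name `ols_slope_intercept` are my assumptions for this example rather than the specification used in any particular paper.

```python
import math

def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical degrees of freedom for seven underlying studies.
dfs = [16, 25, 36, 49, 64, 81, 100]
root_df = [math.sqrt(d) for d in dfs]

# Case 1 (hypothetical): a genuine effect, so the test statistic
# grows in proportion to sqrt(df) as statistical power increases.
z_genuine = [0.3 * r for r in root_df]

# Case 2 (hypothetical): test statistics hover around 1 regardless of df,
# the pattern you would expect from excess significance without a real effect.
z_flat = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.0]

slope_genuine, _ = ols_slope_intercept(root_df, z_genuine)
slope_flat, _ = ols_slope_intercept(root_df, z_flat)

print("slope with a genuine effect:", round(slope_genuine, 3))
print("slope with no power trace:  ", round(slope_flat, 3))
```

A clearly positive slope is the signature of a genuine effect under this approach, while a slope near zero suggests the significant results are not driven by increasing power. A real meta-regression would also need standard errors for the slope and controls for study characteristics, which this sketch leaves out.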

In our meta-analysis of energy-GDP Granger causality tests we also present graphs of the distribution of the test statistics. These seemed to be roughly normal with a mean of about 1, which means there is excess significance in this literature but that the mean test statistic is not statistically significant (the solid histogram in the background is the standard normal distribution):

To help interpret these graphs, note that a normal test statistic (-probit(p)) of zero means that the original Granger causality test p-value was 0.5. A test statistic of 1.65 implies that the original p-value was 0.05 and a test statistic of -1.65 implies that the p-value was 0.95. The econometric analysis in the paper showed that there was no statistically significant relationship between these test statistics and degrees of freedom, also suggesting that there was no genuine effect. However, we also showed in the paper that there did seem to be a robust effect from GDP to energy when the underlying studies controlled for energy prices.
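The conversion between p-values and normal test statistics is easy to check numerically. Here is a small Python sketch using the standard library's `NormalDist`; the function names `p_to_z` and `z_to_p` are just illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def p_to_z(p):
    """Convert a p-value to a normal test statistic: z = -probit(p)."""
    return -STD_NORMAL.inv_cdf(p)

def z_to_p(z):
    """Invert the transformation: p = Phi(-z)."""
    return STD_NORMAL.cdf(-z)

# The three reference points discussed in the text.
for p in (0.5, 0.05, 0.95):
    print(f"p = {p:.2f}  ->  z = {p_to_z(p):+.2f}")
```

Note that `inv_cdf` is only defined for p strictly between 0 and 1, so p-values reported as exactly 0 or 1 would need to be handled separately.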

We didn't report the actual p-values though, and so I am curious what the p-curves look like. First I made a couple of histograms with bins for each 1% increment of p-values:

Uh-oh! The mode is the 0-1% bin! According to Head et al.'s methodology that means there is a true effect in each direction of causality. When I broke down the range from p=0 to p=0.1 into 100 bins, the mode was again the smallest bin. So, what does it mean when the overwhelming majority of studies report results that are not significant at the 1% or 0.1% level and yet the modal bin is 0-1% or 0-0.1%? And when these results come from not particularly large samples? Either the p-curve or the meta-regression/power trace method is wrong here. One hypothesis is that non-stationarity in macro-economic time series and the over-fitting problem discussed in our paper result in many spuriously significant test statistics in relatively small samples that wouldn't arise with more classically behaved data.
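For anyone who wants to try the same binning on their own collection of p-values, here is a minimal Python sketch. The p-values in it are invented purely for illustration (they are not our data), and the function name `p_curve_counts` is my own. It also shows how the mode can sit in the smallest bin even though most results are nowhere near significant.

```python
def p_curve_counts(p_values, upper=1.0, n_bins=100):
    """Count p-values in n_bins equal-width bins covering [0, upper]."""
    counts = [0] * n_bins
    for p in p_values:
        if p < 0 or p > upper:
            continue  # outside the range being plotted
        # p == upper is placed in the last bin rather than overflowing.
        idx = min(int(p * n_bins / upper), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical p-values, chosen to sit away from bin edges.
p_values = [0.003, 0.007, 0.004, 0.015, 0.032,
            0.125, 0.255, 0.415, 0.585, 0.775, 0.935]

counts = p_curve_counts(p_values)            # 1% bins over [0, 1]
modal_bin = counts.index(max(counts))        # index of the fullest bin
share_in_mode = counts[modal_bin] / len(p_values)

print("modal bin index:", modal_bin)
print("share of p-values in the modal bin:", round(share_in_mode, 2))

# Zooming in, as in the text: 100 bins of width 0.1% over [0, 0.1].
fine_counts = p_curve_counts(p_values, upper=0.1, n_bins=100)
```

Because the bin index is computed with floating-point arithmetic, a p-value lying exactly on a bin edge can occasionally land in the neighbouring bin, which is harmless for a histogram but worth knowing about.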

* Though this method can detect a "genuine effect" there is no guarantee that this is a "causal effect". If no studies control for the relevant variables or effects to identify a causal effect then the meta-analyst won't be able to detect a causal effect either. Similarly, if the meta-analyst doesn't control for all the relevant variables included in the underlying studies they may also fail to identify a causal effect when some papers do identify one. All the meta-analyst can find is a robust partial correlation in the underlying studies if one exists.