Sunday, February 22, 2015

Peer Review vs. Citation Analysis in Research Assessment Exercises

Existing research finds strong correlations between the rankings produced by UK research assessment exercises (RAE) and bibliometric analyses for several humanities and social science disciplines (e.g. Colman et al., 1995; Oppenheim, 1996; Norris and Oppenheim, 2003), including economics (Süssmuth et al., 2006). Clerides et al. (2011) compare the 1996 and 2001 RAE ratings of economics departments with independent rankings from the academic literature. They find RAE ratings to be largely in agreement with the profession’s view of research quality as documented by independent rankings, although the latter appear to be more focused on research quality at the top end of academic achievement. This is because most rankings of departments in the economics literature are based only on publications in top journals, of which lower-ranked departments have very few.

Mryglod et al. (2013) analyse the correlations between the Thomson Reuters Normalised Citation Impact (NCI) indicator and RAE 2008 peer-review scores in several academic disciplines, from the natural sciences to the social sciences and humanities. The NCI measures the normalised citation impact of a unit of assessment (an academic discipline at a given university) based only on the publications actually submitted to the RAE. Mryglod et al. (2013) compute both average (or quality) and total (or strength) values of these two indicators for each institution, where strength is the average multiplied by the number of staff submitted to the RAE. They find very high correlations between the strength indicators for some disciplines and much poorer correlations between the quality indicators for all disciplines. This means that, although the citation-based scores could help to describe institution-level strength (which is quality times size), in particular for the so-called hard sciences, they should not be used as a proxy for ranking or comparing the quality of research groups. Moreover, the correlation between peer-evaluated and citation-based scores is weaker for the “soft” sciences. Spearman rank correlation coefficients for their quality indicators range from 0.18 (mechanical engineering) to 0.62 (chemistry). However, for strength, the correlations range from 0.88 (history and sociology) to 0.97 (biology). This is because quality is correlated with size, and so the two factors reinforce each other.
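
To make the quality/strength distinction concrete, here is a minimal sketch with invented data showing why the strength correlations come out so much higher: both strength measures share the department-size factor. The numbers, distributions, and variable names are assumptions for illustration, not values from Mryglod et al.

```python
# Illustrative only: synthetic departments with invented score distributions.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 50                                             # number of departments
staff = rng.integers(5, 60, n)                     # staff submitted to the RAE
rae_quality = rng.normal(2.5, 0.5, n)              # peer-review quality (average score)
nci_quality = rae_quality + rng.normal(0, 0.5, n)  # citation-based quality, with noise

rae_strength = rae_quality * staff                 # strength = quality x size
nci_strength = nci_quality * staff

print("quality rank correlation: ", spearmanr(rae_quality, nci_quality).correlation)
print("strength rank correlation:", spearmanr(rae_strength, nci_strength).correlation)
# The common size factor pushes the strength correlation far above the
# quality correlation, mirroring the pattern Mryglod et al. report.
```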

Mryglod et al. (2014) attempt to predict the 2008 RAE results retrospectively and the 2014 Research Excellence Framework (REF) results before they were released, examining biology, chemistry, physics, and sociology. Of the indicators they trial, they find that the departmental h-index has the best fit to the 2008 results. The departmental h-index is based on all publications published by a department in the time window assessed by the relevant assessment exercise. The rank correlation ranges from 0.58 in sociology to 0.83 in chemistry. They also find that the correlation with the RAE results for the h-index computed immediately at the end of the assessment period is as good as the correlation for the h-index of the same set of publications computed in later years, after more citations have accumulated.
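
For reference, the departmental h-index itself is simple to compute: it is the largest number h such that the department has h publications with at least h citations each, taken over all of the department’s publications in the assessment window. A minimal sketch, with invented citation counts:

```python
def departmental_h_index(citation_counts):
    """Largest h such that at least h of the department's papers have h or more citations."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Invented citation counts for all papers a department published in the window
print(departmental_h_index([25, 17, 12, 8, 8, 5, 3, 1, 0]))  # prints 5
```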

Bornmann and Leydesdorff (2014) argue that one of the downsides of bibliometrics as a research assessment instrument is that citations take time to accumulate while research assessment exercises are designed to assess recent performance:

“This disadvantage of bibliometrics is chiefly a problem with the evaluation of institutions where the research performance of recent years is generally assessed, about which bibliometrics—the measurement of impact based on citations—can say little…. the standard practice of using a citation window of only 3 years nevertheless seems to be too small.” (1230)

They argue further that bibliometrics:

“can be applied well in the natural sciences, but its application to TSH (technical and social sciences and humanities) is limited.” (1231)

But rather than assuming that peer review is the preferred approach to research assessment and that citation analysis should only be used to reduce its cost, we can ask whether the review conducted in research assessments such as the REF and the Australian ERA meets normal academic standards for peer review. Research does show that peer review at journals has predictive validity: accepted papers go on to receive more citations than rejected ones. However, evidence for the predictive validity of peer review of grant and fellowship applications is more mixed (Bornmann, 2011). Therefore, further research is warranted on the use of citation analysis to rank academic departments or universities in research assessment exercises.

Sayer (2014) argues that the peer review undertaken in research assessment exercises does not meet normal standards for peer review. He compares university and national-level REF processes against actual practices of scholarly review as found in academic journals, university presses, and North American tenure procedures. He finds that the peer review process used by the REF falls far short of the level of scrutiny or accuracy of these more familiar peer review processes. The sheer number of items each reviewer has to assess means that the review cannot be of the same quality as reviews for publication, and reviewers have to assess much material outside their area of specific expertise. Sayer argues that, though metrics may have problems, a process that gives such extraordinary gatekeeping power to individual panel members is far worse.

Given the large number of items that panels need to review, they are likely to focus on the venue of publication, and, at least in business and economics, handy mappings of journals to REF grades exist (Hudson, 2013). Regibeau and Rockett (2014) build imaginary economics departments composed entirely of Nobel Prize winners and evaluate them using standard journal rankings geared to the UK RAE. Performing the same evaluation on existing departments, they find that the ratings of the Nobel Prize departments do not stand out from those of other good departments. Compared to recent research evaluations, the Nobel Prize departments’ rankings are also less stable. This suggests a significant effect of score “targeting” induced by the rankings exercise. They find some evidence that modifying the assessment criteria to increase the total number of publications considered can help distinguish the top departments. But if departments composed entirely of Nobel Prize winners perform no better than current departments, then it is hard to know what such an assessment means.

Sgroi and Oswald (2013) examine how research assessment panels could most effectively use citation data to replace peer review. They suggest a Bayesian approach that combines prior information on where an item was published with observations on its citations to derive a posterior distribution for the quality of the paper. We could then estimate, for example, the probability that a paper belongs in the 4* category given where it was published and the early citations it has received. Stern (2014) and Levitt and Thelwall (2011) show that the journal impact factor has strong explanatory power in the year of publication but that this power declines very quickly as citations accumulate. So, this approach would be most useful for papers published in the last year or two before the assessment; for earlier research outputs, the added value over simply counting citations would be minimal.
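
A toy version of this kind of calculation, using Bayes’ rule with a journal-based prior over REF quality classes and a Poisson likelihood for early citations, might look like the sketch below. This is not Sgroi and Oswald’s actual model; the prior probabilities and expected citation rates are invented for illustration.

```python
# Toy Bayesian updating: journal-based prior over quality classes, updated
# with observed early citations. All numbers are invented for illustration.
from math import exp, factorial

classes = ["4*", "3*", "2*", "1*"]
prior = {"4*": 0.30, "3*": 0.40, "2*": 0.20, "1*": 0.10}    # implied by the publishing journal
mean_cites = {"4*": 12.0, "3*": 6.0, "2*": 3.0, "1*": 1.0}  # assumed mean early citations per class

def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

def posterior(observed_citations):
    unnorm = {c: prior[c] * poisson_pmf(observed_citations, mean_cites[c]) for c in classes}
    total = sum(unnorm.values())
    return {c: p / total for c, p in unnorm.items()}

# Probability the paper is 4*, 3*, etc., given its journal and 10 early citations
print(posterior(10))
```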

References

Bornmann, L. (2011). ‘Scientific peer review’, Annual Review of Information Science and Technology, vol. 45, pp. 199–245.

Bornmann, L. and Leydesdorff, L. (2014). ‘Scientometrics in a changing research landscape’, EMBO Reports, vol. 15(12), pp. 1228–32.

Clerides, S., Pashardes, P. and Polycarpou, A. (2011). ‘Peer review vs metric-based assessment: testing for bias in the RAE ratings of UK economics departments’, Economica, vol. 78(311), pp. 565–83.

Colman, A.M., Dhillon, D. and Coulthard, B. (1995). ‘A bibliometric evaluation of the research performance of British university politics departments: Publications in leading journals’, Scientometrics, vol. 32(1), pp. 49–66.

Hudson, J. (2013). ‘Ranking journals’, Economic Journal, vol. 123, pp. F202–22.

Levitt, J.M. and Thelwall, M. (2011). ‘A combined bibliometric indicator to predict article impact’, Information Processing and Management, vol. 47, pp. 300–8.

Mryglod, O., Kenna, R., Holovatch, Y. and Berche, B. (2013). ‘Comparison of a citation-based indicator and peer review for absolute and specific measures of research-group excellence’, Scientometrics, vol. 97, pp. 767–77.

Mryglod, O., Kenna, R., Holovatch, Y. and Berche, B. (2014). ‘Predicting Results of the Research Excellence Framework Using Departmental H-Index’, arXiv:1411.1996v1.

Norris, M. and Oppenheim, C. (2003). ‘Citation counts and the research assessment exercise V: Archaeology and the 2001 RAE’, Journal of Documentation, vol. 59(6), pp. 709–30.

Oppenheim, C. (1996). ‘Do citations count? Citation indexing and the research assessment exercise’, Serials, vol. 9, pp. 155–61.

Regibeau, P. and Rockett, K.E. (2014). ‘A tale of two metrics: Research assessment vs. recognized excellence’, University of Essex, Department of Economics, Discussion Paper Series 757.

Sayer, D. (2014). Rank Hypocrisies: The Insult of the REF. Sage.

Sgroi, D. and Oswald, A.J. (2013). ‘How should peer-review panels behave?’, Economic Journal, vol. 123, pp. F255–78.

Stern, D.I. (2014). ‘High-ranked social science journal articles can be identified from early citation information’, PLoS ONE, vol. 9(11), art. e112520.

Süssmuth, B., Steininger, M. and Ghio, S. (2006). ‘Towards a European economics of economics: Monitoring a decade of top research and providing some explanation’, Scientometrics, vol. 66(3), pp. 579–612.
