Mismeasuring scientific quality (and an argument in favour of diversity of measurement systems)
There was a short piece here recently on the misuse of impact factors to measure scientific quality, and how this in turn leads to dependence on drugs like Sciagra™ and other dangerous variants such as Psyagra™ and Genagra™.
Here’s an interesting and important post from Michael Nielsen on the mismeasurement of science. The essence of his argument is straightforward: collapsing a multidimensional set of variables into a single dimension is going to lose a significant amount of important information (or at least that’s how I read it):
My argument … is essentially an argument against homogeneity in the evaluation of science: it’s not the use of metrics I’m objecting to, per se, rather it’s the idea that a relatively small number of metrics may become broadly influential. I shall argue that it’s much better if the system is very diverse, with all sorts of different ways being used to evaluate science. Crucially, my argument is independent of the details of what metrics are being broadly adopted: no matter how well-designed a particular metric may be, we shall see that it would be better to use a more heterogeneous system.
Nielsen notes three problems with centralised metrics (whether that means relying solely on the h-index, citations, publication counts, or whatever else you fancy):
Centralized metrics suppress cognitive diversity: Over the past decade the complexity theorist Scott Page and his collaborators have proved some remarkable results about the use of metrics to identify the “best” people to solve a problem (ref,ref).
Centralized metrics create perverse incentives: Imagine, for the sake of argument, that the US National Science Foundation (NSF) wanted to encourage scientists to use YouTube videos as a way of sharing scientific results. The videos could, for example, be used as a way of explaining crucial-but-hard-to-verbally-describe details of experiments. To encourage the use of videos, the NSF announces that from now on they’d like grant applications to include viewing statistics for YouTube videos as a metric for the impact of prior research. Now, this proposal obviously has many problems, but for the sake of argument please just imagine it was being done. Suppose also that after this policy was implemented a new video service came online that was far better than YouTube. If the new service was good enough then people in the general consumer market would quickly switch to the new service. But even if the new service was far better than YouTube, most scientists – at least those with any interest in NSF funding – wouldn’t switch until the NSF changed its policy. Meanwhile, the NSF would have little reason to change their policy, until lots of scientists were using the new service. In short, this centralized metric would incentivize scientists to use inferior systems, and so inhibit them from using the best tools.
Centralized metrics misallocate resources: One of the causes of the financial crash of 2008 was a serious mistake made by rating agencies such as Moody’s, S&P, and Fitch. The mistake was to systematically underestimate the risk of investing in financial instruments derived from housing mortgages. Because so many investors relied on the rating agencies to make investment decisions, the erroneous ratings caused an enormous misallocation of capital, which propped up a bubble in the housing market. It was only after homeowners began to default on their mortgages in unusually large numbers that the market realized that the ratings agencies were mistaken, and the bubble collapsed. It’s easy to blame the rating agencies for this collapse, but this kind of misallocation of resources is inevitable in any system which relies on centralized decision-making. The reason is that any mistakes made at the central point, no matter how small, then spread and affect the entire system.
What is breathtaking, of course, is that scientists, who spend so much time devising sensitive measurements of complex phenomena, can sometimes suffer a bizarre cognitive pathology when it comes to how the quality of science itself should be measured. The sudden rise of the h-index is surely proof of that. Nothing can substitute for the hard work of actually reading the papers and judging their quality and creativity. Grillner and colleagues recommend that “Minimally, we must forego using impact factors as a proxy for excellence and replace them with indepth analyses of the science produced by candidates for positions and grants. This requires more time and effort from senior scientists and cooperation from international communities, because not every country has the necessary expertise in all areas of science.” Nielsen makes a similar recommendation.
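To make the information-loss point concrete, here is a minimal sketch in Python of how a single number can flatten very different publication records. It uses the standard definition of the h-index (the largest h such that h papers have at least h citations each). The two citation lists are hypothetical, invented purely for illustration, and the function name `h_index` is my own.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts per paper (illustrative, not real data).
steady = [5, 5, 5, 5, 5]                     # five modestly cited papers
breakthrough = [900, 40, 12, 6, 5, 1, 0, 0]  # one hugely cited paper plus a long tail

for name, record in [("steady", steady), ("breakthrough", breakthrough)]:
    print(name, "h-index:", h_index(record), "total citations:", sum(record))
```

Both records come out with the same h-index even though their total citations differ by well over an order of magnitude, and neither number says anything about what is actually in the papers. That is only a toy case, but it is the kind of collapse the argument above is worried about.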