Using the Google PageRank algorithm as an alternative citation metric
The Journal of Neuroscience is publishing a series of invited commentaries exploring the current use of impact metrics to measure scientific achievement (and how misleading the metrics can be) and what alternatives may be available. I’m sure that this will give us plenty to discuss over the coming weeks.
Essentially, the algorithm treats a body of literature as a network in which the nodes are papers and the connections between them are citations. A “random searcher” is placed at each node and then jumps to a connected node. There is a certain probability that the searcher will “get bored” and start a new “search”, with this value determined by the actual habits of users; in Google’s case, it is the behavior of web searchers/surfers, and in this case, it is a scientist looking for a particular citation. Starting a new search places the searcher at an entirely different node to start the process over. This continues until the number of searchers at each node reaches a stable value, which is that node’s Google number. The Google numbers are then sorted to provide the PageRank, or in this case, the CiteRank.
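To make the random-searcher picture concrete, here is a minimal sketch in Python. It is my own illustration, not the authors' code: the function name `google_numbers`, the toy citation graph, and the `restart_prob` value of 0.15 are all assumptions chosen for the example, as is the choice to send a searcher at a paper with no references off on a fresh search.

```python
def google_numbers(citations, restart_prob=0.15, iterations=100):
    """Estimate the stable share of random searchers at each paper.

    `citations` maps each paper to the list of papers it cites.
    `restart_prob` is the (assumed) chance the searcher "gets bored".
    """
    papers = sorted(set(citations) | {p for refs in citations.values() for p in refs})
    n = len(papers)
    share = {p: 1.0 / n for p in papers}  # start with searchers spread evenly

    for _ in range(iterations):
        # Bored searchers restart at a paper chosen uniformly at random.
        new_share = {p: restart_prob / n for p in papers}
        for paper in papers:
            refs = citations.get(paper, [])
            if refs:
                # The rest follow one reference, split evenly among them,
                # so a citation from a short bibliography carries more weight.
                passed = (1.0 - restart_prob) * share[paper] / len(refs)
                for cited in refs:
                    new_share[cited] += passed
            else:
                # Assumption: with nothing to follow, the searcher restarts.
                passed = (1.0 - restart_prob) * share[paper] / n
                for p in papers:
                    new_share[p] += passed
        share = new_share
    return share


# Hypothetical toy network: A is the oldest paper and is cited by everyone.
toy_citations = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "C"]}
toy_numbers = google_numbers(toy_citations)
toy_ranking = sorted(toy_numbers, key=toy_numbers.get, reverse=True)
```

Sorting the resulting values, as in `toy_ranking`, gives the rank order. In this toy network the most-cited paper, A, comes out on top.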
Looking “under the hood” at this updated algorithm, received citations (node jumping) are weighted according to whether the connection arises from another important research study (because of its own high number of received citations) and/or whether the citation comes from a paper with fewer total references. For the latter, I guess the assumption is that within a constrained, smaller bibliography, only the best papers will be cited, so if a paper makes it into a reduced list, it must be important. This brings in a certain amount of scientific economics.
The final tweak that these authors gave to the algorithm was to factor in time. This matters more for citation networks than for web content because citations cannot be updated after publication, and only earlier works may be cited within the network. Thus, time-ordering factors were introduced, since “aging effects” will be greater in the citation network. The premise for introducing time is that when looking for a paper, scientists typically start with something more recent and then work chronologically backwards through previous related studies. How does this affect the CiteRank calculations? Well, the authors initially distribute random searchers exponentially with age, in favor of more recent publications.
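As a rough illustration of that time weighting, the sketch below starts searchers at each paper with probability proportional to exp(-age / decay_time) and then lets them follow references as before. This is my own simplified reading, not the authors' published formulation: the names `age_weighted_start` and `time_aware_numbers`, the `decay_time` and `restart_prob` values, and the assumption that bored searchers restart from the same age-weighted distribution are all hypothetical.

```python
import math


def age_weighted_start(ages, decay_time=2.0):
    """Starting distribution that favors recent papers exponentially.

    `ages` maps each paper to its age (e.g. in years); `decay_time` is an
    assumed characteristic time, not a value taken from the paper.
    """
    weights = {p: math.exp(-age / decay_time) for p, age in ages.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def time_aware_numbers(citations, ages, restart_prob=0.5, decay_time=2.0,
                       iterations=100):
    """Random-searcher shares when every new search starts at a recent paper.

    Every paper that appears in `citations` must also appear in `ages`.
    """
    start = age_weighted_start(ages, decay_time)
    share = dict(start)
    for _ in range(iterations):
        # New searches begin according to the age-weighted distribution.
        new_share = {p: restart_prob * start[p] for p in ages}
        for paper in ages:
            refs = citations.get(paper, [])
            passed = (1.0 - restart_prob) * share[paper]
            if refs:
                for cited in refs:
                    new_share[cited] += passed / len(refs)
            else:
                # Assumption: nothing left to follow, so begin a new search.
                for p in ages:
                    new_share[p] += passed * start[p]
        share = new_share
    return share


# Hypothetical example: two uncited papers that differ only in age.
example_ages = {"old": 10.0, "new": 1.0}
example_numbers = time_aware_numbers({}, example_ages)
```

With no citations at all, the shares simply reproduce the starting distribution, so the newer paper holds more searchers. Once citations are added, older papers can regain ground through the references that recent papers make to them.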
Thus, the researchers claim that using their CiteRank algorithm can provide a measure of study relevance within current research directions, while the classical PageRank value provides more of a “lifetime achievement rank” for a particular study. Combining information from both calculations can be a powerful measure of impact, both overall and within the current scientific environment, they argue. An example processed citation dataset is provided here.
The dataset they chose for the current analysis was from the physics community. Since I don’t have a gut sense of this literature, it is hard for me to glean the usefulness or power of their calculation. I think I would need to see this applied to my broad field and examine the output before I could pass judgment. Nevertheless, it is an interesting proposal.
Maslov, S., & Redner, S. (2008). Promise and Pitfalls of Extending Google’s PageRank Algorithm to Citation Networks. Journal of Neuroscience, 28(44), 11103-11105. DOI: 10.1523/JNEUROSCI.0002-08.2008