Exploring the Astronomy Literature Landscape
The body of literature in astronomy and physics is enormous, and growing rapidly. How do you “separate wheat from chaff ” and ﬁnd your way in this maze of scholarly papers, conference proceedings, grey literature and technical reports? When you use a general search engine (Google, Yahoo, Live Search, etc.), you will probably ﬁnd thousands of documents, ranked according to some algorithm. Even specialized versions of these tools often leave us awash in information. To search the electronic, scholarly literature, scientists need to be able to zoom in on bibliographic data using additional descriptors and search logic. The SAO/NASA Astrophysics Data System (ADS) offers a bibliographic service with a sophisticated query interface, offering users a wide set of ﬁlters (Kurtz et al. 2000). In addition, the ADS has the myADS service, a fully customizable newspaper covering all (journal) research for astronomy, physics and/or the arXiv e-prints. The myADS-arXiv service (Henneken et al. 2006) is a tailor-made, Open Access, virtual journal, covering the most important papers of the past week in physics and astronomy. Although powerful, these bibliographic services are essentially list-oriented and therefore have their limitations.
Lists, because of their one-dimensionality, are unable to provide a rich context for a given paper. A deeper understanding of the data structure formed by the collection of meta data in the ADS database will allow us to develop tools that provide a fundamentally different way of navigating through this bibliographic universe. As a tool for navigating, the best analogy is probably that of a map. A point on a map has a certain contextual meaning, depending on the information being displayed on that map. How can we form a landscape and subsequently a map, based on the astronomy literature? A set of papers can be regarded as an ensemble of points that “interact” with each other in a certain way. This interaction can, for example, represent the citations between papers, the number of keywords papers have in common, a similarity between abstracts of papers or a combination of these. In this way, these papers form a network with weighted connections.
Depending on the character of the “interaction” between the papers, this network is directed or undirected. A citation network, for example, is directed. The papers, in this network representation, form highly connected modules that are only weakly connected to each other. Using the modules to describe the network is equivalent to ﬁltering out all unimportant details in the network and just using the regularities. An example to illustrate this approach is the difference between a aerial map and a schematic of the subway system of a city (credit goes to Martin Rosvall). Describing a network in terms of modules is equivalent to a lossy compression of the original network. Having established the modular structure for a set of papers provides us with powerful knowledge of how our bibliographic universe is organized. This knowledge can help us with regular search engine queries to give more meaningful results, by making the search engine aware that papers belong to communities. It also allows us to create thematic maps of scholarly literature, allowing an essentially new way of navigating the literature.
How do you get from a network to a map? Our network consists of N papers, each having a certain amount of meta data labels (like keywords, reference sections, etc.). Additionally, there is a relationship that determines whether two papers are connected in this network, and how strong this interaction is. As a result of this relation, there will be l links in this network. Our choice for the relation is: paper i cites paper j . This automatically means that the weight of each link is 1. The (uncompressed) network is described by the adjacency matrix, as usual. There are many ways to compress this network into a modular description. Which one ﬁts our purposes? Obviously, we want a compressed version of the original network that best characterizes the original network. How do we translate this into mathematics? The uncompressed network is the realization of a random variable X and the compressed description is that of a random variable Y . Information theory stipulates that the description Y of random variable X that best characterizes X is the one that maximizes the mutual information I (X ; Y ) between description and network. An often-used alternative is the approach that maximizes the so-called modularity. However, this method has a bias towards equal-sized modules (Rosvall & Bergstrom 2007), which is a partition that is not realistic for our bibliographic network. In our approach, we only use information that is present in the network representation. The key ingredient in our approach is the fact that a random surfer will represent the ﬂow in the network in a natural way. A random surfer is a random walker who makes a random jump within the network with a given probability. This is to allow weighted, directed networks in our analysis. The node visit frequencies of the random surfer naturally tend to the underlying probability distribution within the network. How do we describe paths in our network? In theory we could use a Huffman code, but that is not necessary. If L(C ) represents the expected code length, then according to Shannon’s coding theorem (Shannon 1948): L(C) ≥ H(X) (where H(X ) is the entropy of the random variable X ). So, on average, the number of bits that we need to describe one step taken by the random surfer is always greater or equal to H(P), the entropy defined by the distribution P of (ergodic) visit frequencies to the network nodes. In our approach, the lower bound on the code length will be the “description length” L. In this framework, we partition are network into modules that minimize the expected description length of the random walk. Once we have established the partition into modules, the map will be the graphical representation of this partition. The map will graphically represent partition characteristics like the amount of time a random surfer spends in a module, the citation ﬂow within a module and the citation ﬂow out of a module. This method has been described in detail in Rosvall, M. & Bergstrom, C. T. 2007 (see bibliography).
How do you create an algorithm out of this? The ergodic node visit frequencies are calculated using the fact that the journey of the random surfer is a Markov chain. Because the surfer can get from any point in the network to any other point, it is an irreducible Markov chain. Because this Markov chain is also aperiodic, according to the Perron-Frobenius Theorem, it has a unique steady state solution. This is done using the Power Method (Perra & Fortunato 2008).
The exit probability for a given module follows from the teleportation probability and the probability of leaving the module from any of the nodes in the module.
Next, the network is partitioned into modules by means of a greedy search and simulated annealing. The greedy search merges pairs of modules that give the largest decrease in description length until further merging increases the description length. The result of the greedy search is improved by simulated annealing. Here a conﬁguration that is “near” is explored. If it results in a smaller description length, it is accepted, if not, it is accepted with a probability that depends on a system parameter called “temperature”. The simulated
annealing algorithm used is the “heat bath algorithm” (Newman & Barkema 1999). By using this algorithm at several different temperatures, the description length can be improved by up to several percent over that found by the greedy search alone.
Henneken, E., et al. 2006, in ASP Conference Series, 377, 106
Kurtz, M. J., et al. 2002, Proc. SPIE, 4847, 238
Kurtz, M. J., et al. 2000, A&AS, 143, 41
Newman, M. E. J. & Barkema, G. T. 1999, “Monte Carlo methods in statistical physics”, Oxford University Press
Perra, N. & Fortunato, S. 2008, arXiv:0805.3322
Rosvall, M. & Bergstrom, C. T. 2007, Pub. Nat. Acad. Sc., 104, 7327
Shannon, C. E. 1948, The Bell System Technical Journal, 27, 379