Submission of E-prints – Versioning

•July 6, 2010

Here’s an interesting trend: the fraction of e-prints with multiple versions has been increasing steadily in a number of categories. The figure below shows these trends for four major arXiv categories.

This figure shows the fraction of e-prints with multiple versions for the arXiv categories astro-ph, cond-mat, hep-ph and nucl-th

I think that, over time, authors have started to care more about replacing the initial version with the final version, or at least a more recent one (some publishers still don’t allow the final version to be made available as an e-print). Since e-prints on arXiv are read so heavily, it is in the authors’ interest to replace them with corrected or updated versions. Researchers in some disciplines will only read and cite e-prints, perhaps because their library cannot afford the subscription fees or perhaps by choice, and it clearly benefits them if an e-print is an accurate representation of the end product. The Institute of Mathematical Statistics takes the following standpoint with respect to e-printing IMS journal articles:
"IMS wishes to demonstrate by example that high quality journals supported by the academic community can provide adequate revenue to their publishers even if all of their content is placed on open access digital repository such as arXiv. A steady flow of IMS content into the PR (probability) section and the new ST (statistics) section of arXiv should help create an eprint culture in probability and statistics, and be of general benefit to these fields. By guaranteeing growth of these sections of arXiv, IMS will support the practice of authors self-archiving their papers by placing them on arXiv. This practice should put some bound on the prices of subscriptions to commercial journals." (For more information, see IMS Journals on arXiv.) They literally give their authors the following advice: "… when a final version is accepted by a journal, update your preprint to incorporate changes made in the refereeing process, so a post-refereed pre-press version of your article is also available on arXiv". There are probably other journals and societies with the same standpoint.
We’re just seeing another symptom of the (necessary) paradigm shift in scholarly publishing.

Recommending Literature in a Digital Library

•July 2, 2010

I started yesterday’s post by saying that authors publish because they want to transfer information, and that an essential ingredient for this transfer is being able to find that information. Of course, any organization running a search engine, or any publisher with a substantial online presence, is another example where the art of “discovery” is as essential as wind to a sailboat. Clearly, this is becoming more and more of a challenge as the information universe (the literature universe, in our case) expands rapidly. The amount of potentially interesting, searchable literature grows continuously. Besides this normal expansion, there is an additional influx of literature because interdisciplinary boundaries are becoming more and more diffuse. Hence, the need for accurate, efficient and intelligent search tools is bigger than ever.

A look at the holdings of the SAO/NASA Astrophysics Data System (ADS) gives a good indication of this expansion. As of April 19, 2010, there were 1,730,210 records in the astronomy database and 5,437,973 in the physics database, distributed over publication years as shown in the figure below.

This figure shows the number of records in the astronomy and physics databases in the ADS, as a function of publication year

In astronomy, as in other fields, the Literature Universe expands more rapidly because of dissolving boundaries with other fields. Astronomers are publishing in journals and citing articles from journals that had little or no astronomy content not too long ago.
How do you find what you are looking for and, more importantly, information you could not have found using the normal information discovery model? When you have some prior information (like author names and/or subject keywords), you can use your favorite search engine and apply that information as filters. There are also more sophisticated services, like myADS (part of your ADS account), that do intelligent filtering for you and provide customized suggestions. Alternatively, you can ask somebody you consider to be an expert. This aspect emphasizes that “finding” is essentially a bi-directional process. Wouldn’t it be nice to have an electronic process that mimics this type of discovery? It is exactly this type of information discovery that recommender systems have been designed for.

Recommender systems can be characterized in the following way. Recommender systems for literature recommendation…

  • are a technological proxy for a social process
  • are a way of suggesting like or similar articles to a user-specific way of thinking
  • try to automate aspects of a completely different information discovery model where people try to find other people considered to be experts and ask them to suggest related articles

In other words, the main goal of a literature recommender system is to help visitors find information (in the form of articles) that was previously unknown to them.

What are the key elements needed to build such a recommender system? The most important ingredient is a “proximity concept”. You want to be able to say that two articles are related because they are “closer together” than articles that are less similar. You also want to be able to say that an article is of interest to a person because of its proximity to that person. The following approach will allow you to do just that:

  • build a “space” in which documents and persons can be placed
  • determine a document clustering within this space (“thematic map”)

How do you build such a space? Assigning labels to documents allows us to associate a “topic vector” with each document. This, in turn, allows us to assign labels to persons as well (an “interest vector”), using the documents they read. Placing persons in this document space can be used in essentially two different ways: to provide personalized recommendations directly, or to use the usage patterns (“reads”) of expert users as proxies for making recommendations to other users (“collaborative filtering”). As far as the labels themselves are concerned, there are various sources to distill them from. The most straightforward approach is to use keywords. One drawback that comes to mind immediately is that no keywords are available for historical literature. However, keywords are an excellent labeling agent for current and recent literature.
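As an illustration of the topic/interest vector idea (not the actual ADS implementation; all keywords, documents and readers below are hypothetical), here is a minimal content-based sketch in Python:

    # Illustrative sketch only -- not the ADS implementation.
    # Documents are labeled with (hypothetical) keywords; a person's "interest
    # vector" is the average of the topic vectors of the documents they read.
    import numpy as np

    KEYWORDS = ["galaxies", "cosmology", "instrumentation", "stellar dynamics"]

    def topic_vector(doc_keywords):
        """Binary topic vector over the global keyword list."""
        return np.array([1.0 if k in doc_keywords else 0.0 for k in KEYWORDS])

    def cosine(u, v):
        norm = np.linalg.norm(u) * np.linalg.norm(v)
        return 0.0 if norm == 0 else float(np.dot(u, v) / norm)

    # Hypothetical document collection: id -> keywords
    docs = {
        "paper_A": {"galaxies", "stellar dynamics"},
        "paper_B": {"cosmology"},
        "paper_C": {"instrumentation", "galaxies"},
        "paper_D": {"galaxies", "cosmology"},
    }
    doc_vectors = {d: topic_vector(kw) for d, kw in docs.items()}

    # Interest vector of a reader who read paper_A and paper_C
    read = ["paper_A", "paper_C"]
    interest = np.mean([doc_vectors[d] for d in read], axis=0)

    # Recommend unread documents closest to the interest vector
    candidates = [d for d in docs if d not in read]
    ranked = sorted(candidates, key=lambda d: cosine(interest, doc_vectors[d]), reverse=True)
    print(ranked)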

Whether keywords describe the document universe with sufficient accuracy is directly related to the question of whether a keyword system is sufficiently detailed to classify articles. I assume the latter is true, but only when you include the keywords from all papers in the bibliography. Having said this, I do realize that a keyword system can never be static, because of developments within a field and because of diffusing boundaries with other fields. I use the keywords provided by the publishers, so the scope and evolution of the keyword spectrum is out of our hands. It also means that a recommender system based on publisher-provided keywords has one obvious vulnerability: if a major publisher decided to stop using keywords (e.g. PACS identifiers), it would pose a significant problem.

The figure below shows a highly simplified representation of that document space, but it explains the general idea. Imagine a two-dimensional space where one axis represents a topic ranging from galactic to extra-galactic astronomy, and the other ranges from experimental/observational to theoretical. In this space, a paper titled “Gravitational Physics of Stellar and Galactic Systems” would get placed towards the upper right, because its content is mostly theoretical, with an emphasis on galactic astronomy. A paper titled “Topological Defects in Cosmology” would end up towards the upper left, because it is purely theoretical and about extra-galactic astronomy.

A simplistic, two-dimensional representation of a "topic space"

A person working in the field of observational/experimental extra-galactic astronomy will most likely read mostly papers related to this subject, and therefore gets placed in the lower left region of this space. A clustering is a document grouping super-imposed on this space, which groups together documents about similar subjects. As a result, this clustering defines a “thematic map”. As mentioned, this is a highly simplified example. In reality the space has many dimensions (100 to 200), and these cannot be named as intuitively as “level of theoretical content”. However, naming the various directions in this “topic space” is not something I worry about; the document clustering is the tool I will be working with. Note that to establish this “thematic map”, you could very well use the approach I described earlier this week in my post Exploring the Astronomy Literature Landscape, and a simple sketch of the clustering step follows below.
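Purely as an illustration of building such a “thematic map” (the approach in the Exploring the Astronomy Literature Landscape post is information-theoretic; this sketch simply uses k-means on made-up topic vectors):

    # Illustrative only: group documents into thematic clusters with k-means.
    # In practice the document space has 100-200 dimensions; the vectors here
    # are random placeholders, not real topic vectors.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Hypothetical: 200 documents in a 150-dimensional topic space
    X = rng.random((200, 150))

    kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
    labels = kmeans.labels_          # cluster ("theme") assigned to each document
    centroids = kmeans.cluster_centers_

    # Assign a new document to the nearest cluster centroid
    new_doc = rng.random(150)
    cluster = int(np.argmin(np.linalg.norm(centroids - new_doc, axis=1)))
    print("new document falls in cluster", cluster)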

Knowing to which cluster a new article has been assigned allows us to find the papers closest to it within that cluster. The first couple of papers in this list can be used as a first recommendation. The more interesting recommendations, however, arise when you combine the cluster information with usage information. The body of usage information is rather specific: it consists of usage data for “frequent visitors”. Defining “frequent visitors” as people who read between 80 and 300 articles in a period of 6 months seems reasonable. I assume this group represents professional scientists or people active in the field in another capacity. People who visit less frequently are not good proxies, because they are most likely incidental readers.
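A minimal sketch of combining cluster and usage information, assuming a hypothetical usage log and cluster assignment (the thresholds follow the 80-300 reads definition above):

    # Illustrative collaborative-filtering step: recommend articles from the same
    # cluster that are read most often by "frequent visitors".
    from collections import Counter

    # Hypothetical usage log: reader id -> list of article ids read in 6 months
    reads = {
        "reader_1": ["art_1", "art_2", "art_3"] * 40,   # 120 reads -> frequent
        "reader_2": ["art_2", "art_4"] * 10,            # 20 reads  -> incidental
        "reader_3": ["art_3", "art_4", "art_5"] * 30,   # 90 reads  -> frequent
    }
    cluster_of = {"art_1": 0, "art_2": 0, "art_3": 0, "art_4": 1, "art_5": 0}

    def frequent(articles, lo=80, hi=300):
        return lo <= len(articles) <= hi

    def recommend(seed_article, top_n=2):
        target = cluster_of[seed_article]
        counts = Counter()
        for reader, articles in reads.items():
            if not frequent(articles):
                continue                       # incidental readers are ignored
            for art in articles:
                if art != seed_article and cluster_of.get(art) == target:
                    counts[art] += 1
        return [art for art, _ in counts.most_common(top_n)]

    print(recommend("art_1"))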

The technique used to build the recommender system has been around for quite a while. As early as 1934, Louis Thurstone wrote his paper “Vectors of the Mind” which addressed the problem of “classifying the temperaments and personality types”. Peter Ossorio (1965) used and built on this technique to develop what he called a “Classification Space”, which he characterized as “a Euclidean model for mapping subject matter similarity within a given subject matter domain”. Michael Kurtz applied this “Classification Space” technique to obtain a new type of search method. Where the construction of the “Classification Space” in the application by Ossorio relied on data input by human subject matter experts, the method proposed by Michael Kurtz builds the space from a set of classified data. Our recommender system is a direct extension of the “Statistical Factor Space” described in the appendix “Statistical Factor Spaces in the Astrophysical Data System” of this paper by Michael Kurtz.

Bibliography

  • Kurtz, M. J. 1993, Intelligent Information Retrieval: The Case of Astronomy and Related Space Sciences, 182, 21
  • Ossorio, P. G. 1965, J. Multivariate Behavioral Research, 2, 479
  • Thurstone, L. L. 1934, Psychological Review, 41, 1

Publication Trends – Authors – Astronomy

•June 30, 2010

Authors publish because they want to transfer information. An essential ingredient for this transfer is being able to find this information. This means that this information, for example articles in scholarly journals, needs to be indexed properly and enriched with relevant meta data and links. Enhanced information retrieval tools, like recommender systems, have become indispensable. Besides the actual content of the information offered for dispersal, the information comes with another piece of essential meta data: the author list.

The importance of the author list is essentially bidirectional. Having your name appear on articles is an essential ingredient of any scholarly career and plays an important role when seeking, for example, tenure or jobs. The role of the first author depends on the discipline, so the first author isn’t necessarily the “most authoritative” author; some disciplines use alphabetical author lists, for example. Co-authorship with a prominent expert clearly makes a difference and sometimes gives you “measurable status”, like the Erdős number in mathematics, which is the “collaborative distance” between a person and Paul Erdős (a number of 1 means you published a paper together with him).

To me, co-authorship is the most normal thing in the world. In a lot of ways, doing science is like learning a “trade”: you start off as an apprentice, you pass an exam showing that you have mastered the basic skills of the trade, and then you find your own way. As an aside: I think the doctoral thesis and its subsequent defense is that “test of ability”. In some disciplines it now seems to have become a requirement that doctoral research results in something original and new. Please correct me if that observation is incorrect.

In the past, at least in astronomy and physics, it was more common to publish papers just by yourself once you had mastered your field. Initially this was entirely feasible: in the early days of science there were no budgets being slashed and there were no enormous projects like the LHC. Most scientists had their own little “back yard” where they could grow whatever they felt like growing. As the 20th century progressed, especially in roughly its second half, collaborations became more and more unavoidable. Enter collaborations, and therefore growing numbers of co-authors. From this moment on we see “The demise of the lone author” (Mott Greene, Nature, Volume 450, Issue 7173, p. 1165). The figure below illustrates how the distribution of the number of authors has changed over time.

The figure shows the distribution of the relative frequency of the number of authors per paper in the main astronomy journals for a number of years

This figure illustrates a couple of things. First of all, it shows the “demise of the lone author”: the fraction of single-author papers dropped from about 60% in 1960 to about 6% in 2009! The widening of the distribution shows that, on average, the number of co-authors has increased. This appears to be an ongoing process that has not yet reached a saturation point.

The figure below highlights the “demise of the lone author” by showing the change in the fraction of single author papers in the main astronomy and physics journals.

The figure shows the fraction of papers by single authors in the main astronomy and physics journals


The drop in the astronomy journals is more dramatic than in the physics journals: a factor of about 10 versus a factor of about 3 or 4.

Exploring the Astronomy Literature Landscape

•June 29, 2010

The body of literature in astronomy and physics is enormous, and growing rapidly. How do you “separate the wheat from the chaff” and find your way in this maze of scholarly papers, conference proceedings, grey literature and technical reports? When you use a general search engine (Google, Yahoo, Live Search, etc.), you will probably find thousands of documents, ranked according to some algorithm. Even specialized versions of these tools often leave us awash in information. To search the electronic scholarly literature, scientists need to be able to zoom in on bibliographic data using additional descriptors and search logic. The SAO/NASA Astrophysics Data System (ADS) offers a bibliographic service with a sophisticated query interface, providing users with a wide set of filters (Kurtz et al. 2000). In addition, the ADS offers the myADS service, a fully customizable newspaper covering all (journal) research in astronomy, physics and/or the arXiv e-prints. The myADS-arXiv service (Henneken et al. 2006) is a tailor-made, Open Access, virtual journal covering the most important papers of the past week in physics and astronomy. Although powerful, these bibliographic services are essentially list-oriented and therefore have their limitations.
Lists, because of their one-dimensionality, are unable to provide a rich context for a given paper. A deeper understanding of the data structure formed by the collection of meta data in the ADS database will allow us to develop tools that provide a fundamentally different way of navigating through this bibliographic universe. As a tool for navigating, the best analogy is probably that of a map. A point on a map has a certain contextual meaning, depending on the information being displayed on that map. How can we form a landscape and subsequently a map, based on the astronomy literature? A set of papers can be regarded as an ensemble of points that “interact” with each other in a certain way. This interaction can, for example, represent the citations between papers, the number of keywords papers have in common, a similarity between abstracts of papers or a combination of these. In this way, these papers form a network with weighted connections.

An illustration for filtering out unimportant detail: an aerial photograph doesn't help you when you want to take the subway across Boston into Cambridge. In the subway map all relevant directional information has been compressed into one image. The basic geographical information is still there, but now you can see immediately how to get from one point to another with the subway

Depending on the character of the “interaction” between the papers, this network is directed or undirected. A citation network, for example, is directed. In this network representation, the papers form highly connected modules that are only weakly connected to each other. Using the modules to describe the network is equivalent to filtering out all the unimportant details and keeping just the regularities. An example that illustrates this approach is the difference between an aerial map and a schematic of a city’s subway system (credit goes to Martin Rosvall). Describing a network in terms of modules amounts to a lossy compression of the original network. Having established the modular structure of a set of papers gives us powerful knowledge of how our bibliographic universe is organized. This knowledge can help regular search engine queries return more meaningful results, by making the search engine aware that papers belong to communities. It also allows us to create thematic maps of the scholarly literature, enabling an essentially new way of navigating it.

How do you get from a network to a map? Our network consists of N papers, each with a certain amount of meta data (keywords, reference sections, etc.). Additionally, there is a relationship that determines whether two papers are connected in this network, and how strong this interaction is. As a result of this relation, there will be l links in the network. Our choice for the relation is: paper i cites paper j. This automatically means that the weight of each link is 1. The (uncompressed) network is described by the adjacency matrix, as usual.

There are many ways to compress this network into a modular description. Which one fits our purposes? Obviously, we want the compressed version that best characterizes the original network. How do we translate this into mathematics? The uncompressed network is the realization of a random variable X, and the compressed description is that of a random variable Y. Information theory stipulates that the description Y that best characterizes X is the one that maximizes the mutual information I(X;Y) between description and network. An often-used alternative is to maximize the so-called modularity. However, that method has a bias towards equal-sized modules (Rosvall & Bergstrom 2007), a partition that is not realistic for our bibliographic network. In our approach, we only use information that is present in the network representation.

The key ingredient in our approach is the fact that a random surfer represents the flow in the network in a natural way. A random surfer is a random walker who, with a given probability, jumps to a randomly chosen node in the network; this allows weighted, directed networks in our analysis. The node visit frequencies of the random surfer naturally tend to the underlying probability distribution within the network. How do we describe paths in our network? In theory we could use a Huffman code, but that is not necessary. If L(C) represents the expected code length, then according to Shannon’s source coding theorem (Shannon 1948): L(C) ≥ H(X), where H(X) is the entropy of the random variable X. So, on average, the number of bits needed to describe one step taken by the random surfer is always greater than or equal to H(P), the entropy defined by the distribution P of (ergodic) visit frequencies of the network nodes. In our approach, this lower bound on the code length is the “description length” L. We partition our network into the modules that minimize the expected description length of the random walk.

Once we have established the partition into modules, the map is the graphical representation of this partition. The map graphically represents characteristics of the partition, such as the amount of time a random surfer spends in a module, the citation flow within a module and the citation flow out of a module. This method is described in detail in Rosvall & Bergstrom 2007 (see bibliography).
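Restating the two quantities used above in LaTeX notation (just a restatement for clarity, no new material): Shannon’s source coding bound and the entropy of the visit-frequency distribution.

    % Shannon's bound on the expected per-step code length L(C),
    % and the entropy of the visit-frequency distribution P = (p_1, ..., p_N):
    L(C) \;\ge\; H(X), \qquad
    H(P) \;=\; -\sum_{i=1}^{N} p_i \log_2 p_i

Here p_i is the ergodic visit frequency of node i, so H(P) is the lower bound that serves as the description length L.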

How do you create an algorithm out of this? The ergodic node visit frequencies are calculated using the fact that the journey of the random surfer is a Markov chain. Because the surfer can get from any point in the network to any other point, it is an irreducible Markov chain. Because this Markov chain is also aperiodic, the Perron-Frobenius theorem guarantees that it has a unique steady-state solution. This solution is calculated using the power method (Perra & Fortunato 2008), sketched below.
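A minimal Python sketch of the power method for the ergodic visit frequencies, with a teleportation step for the random surfer; the matrix, teleportation probability and tolerances are illustrative assumptions, not the production values:

    # Power method for the stationary (ergodic) visit frequencies of a random
    # surfer: follow a citation link with probability (1 - tau), teleport to a
    # random node with probability tau.  Illustrative values only.
    import numpy as np

    def visit_frequencies(adj, tau=0.15, tol=1e-10, max_iter=10000):
        """adj[i, j] = 1 if paper i cites paper j (directed, unweighted)."""
        n = adj.shape[0]
        out_degree = adj.sum(axis=1)
        # Row-stochastic transition matrix; dangling nodes teleport uniformly.
        T = np.where(out_degree[:, None] > 0,
                     adj / np.maximum(out_degree, 1)[:, None],
                     1.0 / n)
        p = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            p_new = (1 - tau) * p @ T + tau / n
            if np.abs(p_new - p).sum() < tol:
                break
            p = p_new
        return p_new

    # Tiny hypothetical citation network
    A = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)
    print(visit_frequencies(A))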
The exit probability for a given module follows from the teleportation probability and the probability of leaving the module from any of the nodes in the module.
Next, the network is partitioned into modules by means of a greedy search followed by simulated annealing. The greedy search merges pairs of modules that give the largest decrease in description length, until further merging would increase the description length. The result of the greedy search is then improved by simulated annealing: a “nearby” configuration is explored, and if it results in a smaller description length it is accepted; if not, it is accepted with a probability that depends on a system parameter called “temperature”. The simulated annealing algorithm used is the “heat bath algorithm” (Newman & Barkema 1999). By running this algorithm at several different temperatures, the description length can be improved by up to several percent over that found by the greedy search alone.
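A rough sketch of the accept/reject step described above, using a simple Metropolis-style rule for clarity; the actual implementation uses the heat-bath algorithm of Newman & Barkema (1999), and description_length() and propose_nearby() are hypothetical placeholders:

    # Rough sketch of simulated annealing on module assignments.  The acceptance
    # rule shown is the simple Metropolis form, standing in for the heat-bath
    # algorithm actually used.  description_length() and propose_nearby() are
    # hypothetical placeholders supplied by the caller.
    import math
    import random

    def anneal(partition, description_length, propose_nearby, temperatures, steps=1000):
        current = partition
        current_len = description_length(current)
        for T in temperatures:                      # e.g. a decreasing schedule
            for _ in range(steps):
                candidate = propose_nearby(current)          # a "nearby" configuration
                delta = description_length(candidate) - current_len
                if delta < 0 or random.random() < math.exp(-delta / T):
                    current, current_len = candidate, current_len + delta
        return current, current_len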

Bibliography
Henneken, E., et al. 2006, in ASP Conference Series, 377, 106
Kurtz, M. J., et al. 2002, Proc. SPIE, 4847, 238
Kurtz, M. J., et al. 2000, A&AS, 143, 41
Newman, M. E. J. & Barkema, G. T. 1999, “Monte Carlo methods in statistical physics”, Oxford University Press
Perra, N. & Fortunato, S. 2008, arXiv:0805.3322
Rosvall, M. & Bergstrom, C. T. 2007, Proc. Natl. Acad. Sci., 104, 7327
Shannon, C. E. 1948, The Bell System Technical Journal, 27, 379

This figure shows the results for a citation network based on all papers published in The Astrophysical Journal (including Letters), The Astronomical Journal, Astronomy & Astrophysics and Monthly Notices of the Royal Astronomical Society for the period of 2000-2005. Only citations between these journals and within the time period were taken into account. The resulting network consists of 35,941 papers (network nodes) and 325,353 citations (network edges).

Astronomy Journals – Network Properties

•June 28, 2010

Articles form a graph in which the edges represent a relationship between the articles. For some of those relationships the graph is directed; the relationship “X cites Y” is an example. Representing articles as a graph allows us to make a number of interesting observations. Can we detect clustering within the graph? I will discuss this at a later time. Another question is: how does the topology of the graph change over time? More specifically: can we detect densification, i.e. does the number of edges grow faster than the number of nodes? The diagram below compares the number of nodes (articles) with the number of edges (references) for the major astronomy journals in the period 1980 through 2006. Citations to articles outside the network (outside the time window, or to journals not in the network) were disregarded.

The function fitted to this relationship is

e(t) = a × n(t)^b

to illustrate the power-law form of network densification. In this relation e(t) represents the number of edges at time t, and n(t) the number of nodes. An exponent of 1 would indicate linear growth. The fit results in an exponent of about 1.9 (with a correlation of 0.99), indicating strong densification over time, i.e. non-linear growth. Another implication is that, in an average sense, bibliographies have grown longer over time.
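As a hedged sketch of how such an exponent can be estimated, here is a log-log linear fit; the node and edge counts below are made-up placeholders, not the actual journal data:

    # Fit the densification power law e(t) = a * n(t)**b by linear regression
    # in log-log space.  The counts below are placeholders, not the real data.
    import numpy as np

    n = np.array([2000, 4000, 8000, 16000, 32000], dtype=float)   # nodes per snapshot
    e = np.array([9e3, 3.4e4, 1.3e5, 4.9e5, 1.8e6], dtype=float)  # edges per snapshot

    b, log_a = np.polyfit(np.log(n), np.log(e), 1)   # slope = exponent b
    a = np.exp(log_a)
    r = np.corrcoef(np.log(n), np.log(e))[0, 1]      # correlation in log-log space
    print(f"e(t) ≈ {a:.3g} * n(t)^{b:.2f}  (correlation {r:.3f})")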

Journal Citation Statistics – Inter-Citation – Astronomy

•June 25, 2010

Just as articles form a complex network, where some relationship (like “x cites y”) determines the network topology, journals themselves form a similar network, but with less granularity. It’s a little bit like the “thermodynamics” of the article universe. One big difference between the journal and article universes is that in the journal universe there are “loops”: journals cite themselves. You can argue that articles, in some sense, cite themselves too, but there this is a flow with a constant amplitude across all nodes, while for journals it clearly is not. With journal inter-citation (a measure of how often journal X is cited by journal Y), we basically look at the out-degree of the nodes.

In this entry I’ll look at the main astronomy journals (The Astrophysical Journal, The Astronomical Journal, Monthly Notices of the R.A.S. and Astronomy & Astrophysics). For a given publication year and journal, I’ve taken the bibliographies of all articles that appeared and determined what percentage of those citations went where. The results are shown below.
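Before turning to the figures, here is a minimal sketch of this bookkeeping, assuming a hypothetical list of (citing journal, cited journal) pairs rather than the actual ADS data:

    # Sketch of the inter-citation bookkeeping: for one citing journal and one
    # publication year, count what fraction of its outgoing citations goes to
    # each cited journal.  The citation pairs below are hypothetical.
    from collections import Counter

    # (citing journal, cited journal) for every reference in every article
    # published by the citing journal in the chosen year -- hypothetical sample.
    citations = [
        ("ApJ", "ApJ"), ("ApJ", "ApJL"), ("ApJ", "A&A"),
        ("ApJ", "MNRAS"), ("ApJ", "ApJ"), ("ApJ", "PRD"),
    ]

    counts = Counter(cited for citing, cited in citations if citing == "ApJ")
    total = sum(counts.values())
    for journal, n in counts.most_common():
        print(f"{journal}: {100.0 * n / total:.1f}%")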

This diagram shows for a given publication year which journals were cited the most by articles in the Astrophysical Journal

No dramatic changes in a period spanning just over a decade. The strongest citation flow goes back into The Astrophysical Journal. The ApJ Letters were second for a while, but have now been overtaken (by a small percentage) by Astronomy & Astrophysics and the Monthly Notices of the R.A.S.

This diagram shows for a given publication year which journals were cited the most by articles in the Astronomical Journal

The two largest citation flows here have very similar amplitudes: one looping back to the journal itself and one to The Astrophysical Journal. The next biggest flow is to A&A, followed by MNRAS. Unlike with ApJ, where A&A and MNRAS have roughly the same amplitude, A&A gets significantly more citations from AJ than MNRAS does. There are very few citations going from AJ to the Physical Review D.

This diagram shows for a given publication year which journals were cited the most by articles in Astronomy & Astrophysics

A&A looks very similar to AJ: again ApJ has roughly the same amplitude as the self-citation flow. MNRAS is the next largest flow. ApJ Letters has roughly the same amplitude as AJ. There is a small, but steadily increasing flow to the Physical Review D.

This diagram shows for a given publication year which journals were cited the most by articles in the Monthly Notices of the R.A.S.

The largest citation flow is to the ApJ, closely followed (though smaller by about 5%) by the self-citation flow. The third largest flow is to A&A. It is interesting to see that, of these four astronomy journals, MNRAS has the largest citation flow to the Physical Review D. Maybe this is an indicator that this journal has, percentage-wise, the largest cosmology content (or at least theoretical cosmology)?

Clearly, ApJ represents the largest citation flows. It is the most “international” journal: it receives the largest number of citations from both American and European journals. Perhaps this means that it is easier for a European astronomer to publish in ApJ than for an American astronomer to publish in either MNRAS or A&A? It will be interesting to see how these citation flows differ per discipline.

Most Cited Journal Articles – E-prints – Astronomy

•June 24, 2010

In my post “Journal Articles – E-prints – Astronomy” of June 17, I looked at the fraction of e-printed journal articles for a number of astronomy journals. An interesting question is: what is this fraction within the top 100 most cited articles in these journals? The concept of “most important” or “most influential” articles is clearly a somewhat charged one, but it seems reasonable that the more an article gets cited, the more influential it becomes. When you look at articles as forming a directed graph, where the edges represent the relationship “x cites y”, a vertex with a high in-degree becomes more “central”. Anyhow, that will be the subject of a blog post by itself!

For a number of astronomy journals I took all the articles that appeared in a given year, sorted them by citation count and, for the top 100, determined how many also appeared as an e-print. This is shown in the figure below.
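As a minimal sketch of this measurement (with hypothetical records standing in for the real citation and e-print data):

    # Sketch of the measurement: sort a journal's articles for one year by
    # citation count, take the top 100, and compute the fraction that also
    # appeared as an arXiv e-print.  The records below are hypothetical.
    def eprint_fraction(articles, top_n=100):
        """articles: list of dicts with 'citations' (int) and 'has_eprint' (bool)."""
        top = sorted(articles, key=lambda a: a["citations"], reverse=True)[:top_n]
        return sum(a["has_eprint"] for a in top) / len(top)

    # Hypothetical input
    sample = [{"citations": i, "has_eprint": i % 10 != 0} for i in range(500)]
    print(f"{100 * eprint_fraction(sample):.0f}% of the top 100 appeared as e-print")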

Fraction of e-printed papers for the most cited papers in: The Astrophysical Journal (including Letter and Supplement), The Astronomical Journal, Monthly Notices, A&A, PASP, Solar Physics and Icarus

When you compare the numbers in the figure above with those in my post of June 17, you’ll see that they are much higher. For Monthly Notices of the R.A.S. the fraction is even more than 90%! Monthly Notices is closely followed by The Astrophysical Journal and the ApJ Letters. The fact that most of the highly cited papers first appeared as an e-print raises the question: did they get cited more because they appeared first as an e-print? This question lies at the heart of the discussion about the influence of Open Access, Early Access and Self-Selection Bias (see, for example, the papers “The Effect of Use and Access on Citations” and “Open Access does not increase citations for research articles from The Astrophysical Journal”). It will be interesting to recreate some of the diagrams and see whether some trends have changed.