Linking to Data – Effect on Citation Rates in Astronomy

•June 3, 2011 • 5 Comments

In the paper Effect of E-printing on Citation Rates in Astronomy and Physics we asked whether the introduction of the arXiv e-print repository had any influence on citation behavior. We found significant increases in citation rates for papers that appear as e-prints prior to being published in scholarly journals.

This is just one example of how publication practices influence article metrics (citation rates, usage, obsolescence, to name a few). Here we will be examining one practice that is very relevant to astronomy: is there a difference, from a bibliometric point of view, between articles that link to data and articles that do not? Specifically, is there a difference in citation rates between these classes of articles?

Besides being interesting from a purely academic point of view, this question is also highly relevant for the process of “furthering science”. Data sharing not only helps the process of verification of claims, but also the discovery of new findings in archival data. There seems to be a consensus that sharing data is a Good Thing. Let’s ignore the “why” and “how”, and focus on the sharing. You need to have both a willingness and a publication mechanism in order to create a “practice”. This is where citation rates come in: if we can say that papers with links to data get higher citation rates, this might increase the willingness of scientists to take the extra steps of linking data sources to their publications.

Using the data holdings of the SAO/NASA Astrophysics Data System we can do the analysis and see if articles with links to data have different citation rates. For the analysis, we used the articles published in The Astrophysical Journal (including Letters and Supplement), The Astronomical Journal, The Monthly Notices of the R.A.S. and Astronomy & Astrophysics (including Supplement), during the period 1995 through 2000. Next we determined the set of 50 most frequently used keywords in articles with data links. The articles to be used for the analysis were obtained by requiring that they have at least 3 keywords in common with that set of 50 keywords. This resulted in a set of 3814 articles with data links and 7218 articles without data links. A random selection of 3814 articles was extracted from this set of 7218 articles.
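This selection step is easy to sketch in Python. The record layout below (a `has_data_link` flag plus a keyword list per article) is a hypothetical stand-in for the actual ADS data, purely for illustration:

```python
from collections import Counter

def select_articles(articles, n_top=50, min_overlap=3):
    """Select articles that share at least `min_overlap` keywords with
    the `n_top` most frequent keywords among articles with data links.

    `articles` is a list of dicts with 'keywords' (a list of strings)
    and 'has_data_link' (a boolean) -- stand-ins for real ADS records.
    """
    # Count keyword frequencies over the articles that link to data
    counts = Counter(kw for a in articles if a["has_data_link"]
                        for kw in a["keywords"])
    top_keywords = {kw for kw, _ in counts.most_common(n_top)}
    # Keep only articles with enough overlap with that top set
    return [a for a in articles
            if len(top_keywords & set(a["keywords"])) >= min_overlap]
```

The same routine, with `min_overlap=3`, would be applied to both the articles with and without data links to obtain comparable samples.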

First, we’ll create a diagram just like the one in figure 4 of the paper Effect of E-printing on Citation Rates in Astronomy and Physics, which shows the number of citations after publication as an ensemble average. In that figure we used the mean number of citations (over the entire data set) to normalize the citations. For our current analysis we will use the total number of citations for normalization.

Our analysis shows that articles with data links are indeed cited more than articles without these links. We can say a little bit more by looking at the cumulative citation distribution. The figure below shows this cumulative distribution, normalized by the total number of citations for articles without data links, 120 months after publication.

This graph shows that for this data set, articles with data links acquired 20% more citations (compared to articles without these links).
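The normalization behind this curve is simple enough to write down; a minimal sketch, with made-up citation counts purely for illustration:

```python
def cumulative_normalized(citations_per_month, norm_total):
    """Cumulative citation counts, normalized by `norm_total` (e.g. the
    total number of citations, 120 months after publication, for the
    articles without data links)."""
    total, curve = 0, []
    for c in citations_per_month:
        total += c
        curve.append(total / norm_total)
    return curve

# Toy numbers: a sample that ends up 20% above the normalization total
# produces a curve whose final value is 1.2
curve = cumulative_normalized([30, 40, 50], norm_total=100)
```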

Google Books and the Importance of Quality Control

•November 10, 2010 • Leave a Comment

I’ve stopped counting the times I used Google Books and cringed. To be honest, I have to say that I have mostly limited myself to digitized serials, serials in astronomy and physics, to be precise. I’m going to ignore bad metadata, which in itself would be a source of teeth grinding and hair pulling. I regularly find myself laughing out loud at the subject headings they came up with. Actually, it’s pretty sad.

No, my main source of frustration is bad digitization. Missing pages, partially scanned pages, pages showing body parts (so far, I’ve only seen fingers and hands), etc etc. Here you see a fine example of what I am referring to. I don’t know whose hand this is, but I would feel deeply ashamed if I were this person. Digitization is serious business, especially when your goal is preservation. When publications contain fold-outs, these need to be properly scanned, for example. I totally realize that with an enormous digitization effort like Google’s, quality control is bound to be hard, if not impossible. In the last year, about half a million scans went through my hands (figuratively speaking). I know how hard it is to check for missing pages and I also know that you simply cannot check every single image.

In addition to bad scans, I think that the search interface of Google Books, well… errr.. sucks. The results returned seem inconsistent, probably as a result of bad metadata (and bad indexing?). Navigating through results and trying to drill down or find out which other volumes were digitized is a major undertaking and often impossible.

Clearly this was a “quantity over quality” project, and quality clearly lost.

Indexing Matters – The Importance of Search Engine Behavior

•July 21, 2010 • Leave a Comment

What a search engine returns on a user query largely, if not completely, determines its usefulness for that user. Looking at usage bibliometrics, for example, allows us to classify the behavior of different types of users (see e.g. Usage Bibliometrics by Michael J. Kurtz and Johan Bollen). There are voices claiming that Google Scholar is a “threat” to scholarly information retrieval services (like the ADS and WoS, for example). The main reason why this is not the case becomes clear when we look at usage statistics. Here I will make a comparison of readership patterns from ADS and Google Scholar queries, as observed in ADS’s access logs. These readership patterns will give us the obsolescence of astronomy articles by ADS and Google Scholar users. In order to zoom in on people who use ADS professionally, I will only regard ADS users who query ADS 10 or more times per month. The journals I have used in the analysis are the main astronomy journals: Astrophysical Journal, Astronomical Journal, Monthly Notices of the R.A.S. and Astronomy & Astrophysics. In the figure below, a comparison is made between readership of frequent ADS users (read “professional astronomers”) and Google Scholar users.

Comparison of readership patterns from ADS and Google Scholar queries, as observed in ADS’s access logs. The red line marked with open circles shows the readership use by people using the ADS search engine. The blue line marked with 'x' corresponds to the readership use by people who used the Google Scholar engine. The orange line marked with closed circles shows the citation rate to the articles, while the purple line marked with '+' represents their total number of citations.

All the quantities in the figure above are on a per article basis and have been normalized by the 1987 value. This was done so that we can compare apples with apples.
The fact that the obsolescence through Google Scholar is strongly correlated with the total number of citations is no coincidence: this is a direct consequence of the correlation between the PageRank and the total number of citations (see e.g. Chen et al. (2007) and Fortunato et al. (2006)). The consequence of this correlation is the following: Google Scholar does not provide what professional astronomers (and other frequent users) want. Google Scholar readership correlates with the reading habit of students. In short, Google Scholar currently is no threat to scholarly information retrieval services.
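The “frequent user” cut described above boils down to counting queries per user per month; a minimal sketch, assuming a simplified (user, month) record format that is not the actual ADS log layout:

```python
from collections import Counter

def frequent_users(log_records, min_queries=10):
    """Return the ids of users with at least `min_queries` queries in
    a single month.  `log_records` is a list of (user_id, month)
    tuples -- a hypothetical stand-in for parsed access-log entries."""
    counts = Counter(log_records)  # number of queries per (user, month)
    return {user for (user, month), n in counts.items()
            if n >= min_queries}
```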


  • Kurtz, M. J. & Bollen, J. (2010), “Usage Bibliometrics”, Annual Review of Information Science and Technology, 44, 3-64
  • Henneken, E. et al. (2009), “Use of astronomical literature – A report on usage patterns”, Journal of Informetrics, 3(1), 1
  • Fortunato, S., Flammini, A., & Menczer, F. (2006), “Scale-Free Network Growth by Ranking”, Physical Review Letters, 96, 218701
  • Chen, P., Xie, H., Maslov, S., & Redner, S. (2007), “Finding scientific gems with Google’s PageRank algorithm”, Journal of Informetrics, 1, 8

The Art of Parsing – Python – Removing Duplicates

•July 13, 2010 • Leave a Comment

When processing large amounts of data, for example when building a recommender system or an index, there is often a need to remove duplicates from a list of e.g. words. As always, there are many ways to solve a problem, even when you stick to one programming language (which in my case is Python). It is always good to ask yourself: how does this method scale? Especially when you work with large data sets, this is something to keep in mind. I was pretty happy with the following method to remove duplicates from a list:

def uniq(inlist):
    # note: this removes consecutive duplicates only,
    # so it assumes a sorted input list
    if not inlist:
        return inlist
    outlist = [inlist[0]]
    for i in range(1, len(inlist)):
        if inlist[i] != inlist[i-1]:
            outlist.append(inlist[i])
    return outlist

(ok, indentation doesn’t really work with this free version of wordpress AFAIK). But then I decided to try

from sets import Set   # Python 2; in Python 3, use the built-in set
def uniq(inlist):
    return list(Set(inlist))

which turned out to give a significant speedup. And the code is much cleaner too :-) The graphs below show the speedup:

This graph compares two Python methods for removing duplicates from a list

The graph above shows the processing time for removing duplicates from a list as a function of list size, for the two methods described above (“Method 1” is the second method, using the sets module). The graph below shows the relative speedup:

This graph shows how much faster Method 1 is (the method using the sets module)
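For the curious, here is a self-contained way to reproduce such a timing comparison with `timeit`; note that in modern Python the built-in `set` replaces the long-deprecated `sets` module:

```python
import timeit

def uniq_loop(inlist):
    """Method from the first snippet: removes consecutive duplicates,
    so it assumes a sorted input list."""
    if not inlist:
        return inlist
    outlist = [inlist[0]]
    for i in range(1, len(inlist)):
        if inlist[i] != inlist[i - 1]:
            outlist.append(inlist[i])
    return outlist

def uniq_set(inlist):
    """Method from the second snippet, using the built-in set.
    Order is not preserved."""
    return list(set(inlist))

data = sorted(list(range(10000)) * 3)  # a sorted list with duplicates
t_loop = timeit.timeit(lambda: uniq_loop(data), number=20)
t_set = timeit.timeit(lambda: uniq_set(data), number=20)
```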

Submission of E-prints – Versioning

•July 6, 2010 • Leave a Comment

Here’s an interesting trend: the fraction of e-prints with multiple versions has been increasing steadily in a number of categories. The figure below shows these trends for 4 major arXiv categories.

This figure shows the fraction of e-prints with multiple versions for the arXiv categories astro-ph, cond-mat, hep-ph and nucl-th

I think that authors, over time, started to care more about replacing the initial version with the final version, or at least a more recent version (as some publishers still don’t allow the final version to be made available as an e-print). Since the e-prints on arXiv are read so heavily, it is in the authors’ interest to replace their e-prints with corrected/updated versions. There are researchers in some disciplines who will only read and cite e-prints, maybe because their library cannot afford the subscription fees or maybe by choice, and it is clearly beneficial to them if an e-print is an accurate representation of the end product. The Institute of Mathematical Statistics has the following standpoint with respect to e-printing IMS journal articles:
“IMS wishes to demonstrate by example that high quality journals supported by the academic community can provide adequate revenue to their publishers even if all of their content is placed on open access digital repository such as arXiv. A steady flow of IMS content into the PR (probability) section and the new ST (statistics) section of arXiv should help create an eprint culture in probability and statistics, and be of general benefit to these fields. By guaranteeing growth of these sections of arXiv, IMS will support the practice of authors self-archiving their papers by placing them on arXiv. This practice should put some bound on the prices of subscriptions to commercial journals.” (for more info, see IMS Journals on arXiv). They literally give their authors the following advice: “… when a final version is accepted by a journal, update your preprint to incorporate changes made in the refereeing process, so a post-refereed pre-press version of your article is also available on arXiv“. There are probably other journals and societies with the same standpoint.
We’re just seeing another symptom of the (necessary) paradigm shift in scholarly publishing.

Recommending Literature in a Digital Library

•July 2, 2010 • Leave a Comment

I started yesterday’s post by saying that authors publish because they want to transfer information and that an essential ingredient for this transfer is being able to find this information. Of course, any organization running a search engine, or a publisher with a substantial online presence, is another example where the art of “discovery” is as essential as wind to a sailboat. Clearly, this is becoming more and more of a challenge with the rapidly expanding information universe (literature universe, in our case). The amount of potentially interesting, searchable literature is expanding continuously. Besides the normal expansion, there is an additional influx of literature because interdisciplinary boundaries are becoming more and more diffuse. Hence, the need for accurate, efficient and intelligent search tools is bigger than ever.

When you just look at the holdings of the SAO/NASA Astrophysics Data System (ADS), you’ll get a good indicator for this expansion. As of April 19, 2010, there are 1,730,210 records in the astronomy database, and 5,437,973 in the physics database, distributed over publication years as shown in the figure below.

This figure shows the number of records in the astronomy and physics databases in the ADS, as a function of publication year

In astronomy, as in other fields, the Literature Universe expands more rapidly because of dissolving boundaries with other fields. Astronomers are publishing in journals and citing articles from journals that had little or no astronomy content not too long ago.
How do you find what you are looking for and, more importantly, information you could not have found using the normal information discovery model? When you have some prior information (like author names and/or subject keywords), you can use your favorite search engine and apply that information as filters. There are also more sophisticated services, like myADS (part of your ADS account), that do intelligent filtering for you and provide you with customized suggestions. Alternatively, you can ask somebody you consider to be an expert. This aspect emphasizes that “finding” essentially is a bidirectional process. Wouldn’t it be nice to have an electronic process that tries to mimic this type of discovery? It is exactly this type of information discovery that recommender systems have been designed for.

Recommender systems can be characterized in the following way. Recommender systems for literature recommendation…

  • are a technological proxy for a social process
  • are a way of suggesting like or similar articles to a user-specific way of thinking
  • try to automate aspects of a completely different information discovery model where people try to find other people considered to be experts and ask them to suggest related articles

In other words, the main goal of a literature recommender system is to help visitors find information (in the form of articles) that was previously unknown to them.

What are the key elements needed to build such a recommender system? The most important ingredient is a “proximity concept”. You want to be able to say that two articles are related because they are “closer together” than articles that are less similar. You also want to be able to say that an article is of interest to a person because of its proximity to that person. The following approach will allow you to do just that:

  • build a “space” in which documents and persons can be placed
  • determine a document clustering within this space (“thematic map”)

How do you build such a space? Assigning labels to documents will allow us to associate a “topic vector” with each document. This will allow us to assign labels to persons as well (“interest vector”), using the documents they read. Placing persons in this document space can be used in essentially two different ways: use this information to provide personalized recommendations or use usage patterns (“reads”) of expert users as proxies for making recommendations to other users (“collaborative filtering”). As far as the labels themselves are concerned, there are various sources you can distill them from. The most straightforward approach is to use keywords for these labels. One drawback that comes to mind immediately, is the fact that there are no keywords available for historical literature. However, keywords are an excellent labeling agent for current and recent literature.
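A minimal sketch of the two vector types, using keywords as labels (the vocabulary and reading history below are invented, purely for illustration):

```python
def topic_vector(keywords, vocabulary):
    """Binary 'topic vector' for a document: one component per label."""
    kws = set(keywords)
    return [1.0 if label in kws else 0.0 for label in vocabulary]

def interest_vector(read_documents, vocabulary):
    """'Interest vector' for a person: the average of the topic
    vectors of the documents they read."""
    vectors = [topic_vector(doc, vocabulary) for doc in read_documents]
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Toy example: a person who read one paper on galaxies and one on
# galaxies + cosmology
vocab = ["galaxies", "cosmology", "instrumentation"]
reads = [["galaxies"], ["galaxies", "cosmology"]]
person = interest_vector(reads, vocab)
```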

Whether keywords really describe the document universe with sufficient accuracy is directly related to the question whether a keyword system is sufficiently detailed to classify articles. I assume the latter is true, but only when you include the keywords from all papers in the bibliography. Having said this, I do realize that a keyword system can never be static because of developments within a field and because of diffusing boundaries with other fields. I use the keywords provided by the publishers, so the scope and the evolution of the keyword spectrum is out of our hands. It also means that a recommender system based on publisher-provided keywords has one obvious vulnerability: if a major publisher would decide to stop using keywords (e.g. PACS identifiers), it would pose a significant problem.

The figure below shows a highly simplified representation of that document space, but it explains the general idea. Imagine a two-dimensional space where one axis represents a topic ranging from galactic to extra-galactic astronomy, and where the other ranges from experimental/observational to theoretical. In this space, a paper titled “Gravitational Physics of Stellar and Galactic Systems” would get placed towards the upper right because its content is mostly about theory, with an emphasis on galactic astronomy. A paper titled “Topological Defects in Cosmology” would end up towards the upper left, because it is purely theoretical and about extra-galactic astronomy.

A simplistic, two-dimensional representation of a "topic space"

A person working in the field of observational/experimental extra-galactic astronomy will most likely read mostly papers related to this subject, and therefore get placed in the lower left region of this space. A clustering is a document grouping that is superimposed upon this space, which groups together documents that are about similar subjects. As a result, this clustering defines a “thematic map”. As mentioned, this is a highly simplified example. In reality the space has many dimensions (100 to 200), and these cannot be named as intuitively as “level of theoretical content”. However, the naming of various directions in this “topic space” is not something I worry about. The document clustering is the tool that I will be working with. Note that to establish this “thematic map”, you could very well use the approach I described earlier this week in my post Exploring the Astronomy Literature Landscape.

Knowing to which cluster a new article has been assigned will allow us to find papers that are the closest to this article within the cluster. The first couple of papers in this list can be used as a first recommendation. The more interesting recommendations, however, arise when you combine the information we have about the cluster with usage information. The body of usage information is rather specific: it consists of usage information for “frequent visitors”. People who read between 80 and 300 articles in a period of 6 months seems like a reasonable definition for the group of “frequent visitors”. I assume that this group of frequent visitors represents either professional scientists or people active in the field in another capacity. People who visit less frequently are not good proxies because they are most likely incidental readers.
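Given topic vectors and a cluster assignment, this first recommendation step reduces to a nearest-neighbour search within the new article’s cluster; a sketch using cosine similarity (the document ids, vectors and cluster labels are all made up):

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u)) *
            math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

def recommend(new_vec, new_cluster, docs, k=2):
    """Return the ids of the `k` documents closest to `new_vec`
    within its cluster.  `docs` maps doc id -> (cluster, vector)."""
    in_cluster = [(doc_id, cosine(new_vec, vec))
                  for doc_id, (cluster, vec) in docs.items()
                  if cluster == new_cluster]
    # Most similar documents first
    in_cluster.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in in_cluster[:k]]
```

The more interesting recommendations mentioned above would then re-rank or extend this list using the reads of the frequent visitors.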

The technique used to build the recommender system has been around for quite a while. As early as 1934, Louis Thurstone wrote his paper “Vectors of the Mind” which addressed the problem of “classifying the temperaments and personality types”. Peter Ossorio (1965) used and built on this technique to develop what he called a “Classification Space”, which he characterized as “a Euclidean model for mapping subject matter similarity within a given subject matter domain”. Michael Kurtz applied this “Classification Space” technique to obtain a new type of search method. Where the construction of the “Classification Space” in the application by Ossorio relied on data input by human subject matter experts, the method proposed by Michael Kurtz builds the space from a set of classified data. Our recommender system is a direct extension of the “Statistical Factor Space” described in the appendix “Statistical Factor Spaces in the Astrophysical Data System” of this paper by Michael Kurtz.


  • Kurtz, M. J. (1993), Intelligent Information Retrieval: The Case of Astronomy and Related Space Sciences, 182, 21
  • Ossorio, P. G. (1965), Multivariate Behavioral Research, 2, 479
  • Thurstone, L. L. (1934), Psychological Review, 41, 1

Publication Trends – Authors – Astronomy

•June 30, 2010 • 1 Comment

Authors publish because they want to transfer information. An essential ingredient for this transfer is being able to find this information. This means that this information, for example articles in scholarly journals, needs to be indexed properly and enriched with relevant metadata and links. Enhanced information retrieval tools, like recommender systems, have become indispensable. Besides the actual content of the information offered for dispersal, the information comes with another piece of essential metadata: the author list.

The importance of the author list is essentially bidirectional. Having your name appear on articles is an essential ingredient of any scholarly career and plays an important role in the process of seeking e.g. tenure or jobs. The role of first author depends on discipline, so the first author isn’t necessarily the “most authoritative” author. Some disciplines use alphabetical author lists, for example. Co-authorship with a prominent expert clearly makes a difference and sometimes gives you “measurable status”, like the Erdős number in mathematics, which is the “collaborative distance” between a person and Paul Erdős (if your number is 1, it means you published a paper together with him).

To me, co-authorship is the most normal thing in the world. In a lot of ways, doing science is like learning a “trade”. You start off being an apprentice, you do an exam showing that you have mastered the basic skills of the “trade” and then you find your own way. As an aside: I think the doctoral thesis and its subsequent defense is that “test of ability”. In some disciplines it now seems to have become a requirement that doctoral research should result in something original and new. Please correct me if that observation is incorrect.

In the past, at least in astronomy and physics, it was more common to publish papers just by yourself, once you had mastered your field. And this was initially totally feasible. In the early days of science there were no budgets being slashed and there were no enormous projects like the LHC. Most scientists had their own little “back yard” where they could grow whatever they felt like growing. As the 20th century progressed, especially in roughly the second half, collaborations became more and more unavoidable. Enter collaborations and therefore growing numbers of co-authors. From this moment on we see The demise of the lone author (Mott Greene, Nature, Volume 450, Issue 7173, p. 1165). The figure below is an illustration of how the distribution of the number of authors has changed over time.

The figure shows the distribution of the relative frequency of the number of authors per paper in the main astronomy journals for a number of years

This figure illustrates a couple of things. First of all, it shows the “demise of the lone author”, where the fraction of lone author papers dropped from about 60% in 1960 to about 6% in 2009! The widening of the distribution shows that on average the number of co-authors has increased. It seems that this is still an ongoing process that hasn’t reached a saturation point yet.

The figure below highlights the “demise of the lone author” by showing the change in the fraction of single author papers in the main astronomy and physics journals.

The figure shows the fraction of papers by single authors in the main astronomy and physics journals

The drop in the astronomy journals is more dramatic than for the physics journals: a factor of about 10 versus a factor of about 3 or 4.
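For completeness, the quantity plotted in these figures is straightforward to compute from a list of per-paper author counts (the numbers in the example are fabricated, purely for illustration):

```python
def single_author_fraction(author_counts):
    """Fraction of papers with exactly one author.
    `author_counts` lists the number of authors of each paper."""
    if not author_counts:
        return 0.0
    return sum(1 for n in author_counts if n == 1) / len(author_counts)

# Toy sample of ten papers, five of them single-author
fraction = single_author_fraction([1, 1, 3, 5, 1, 2, 4, 1, 1, 10])
```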

