Recommending Literature in a Digital Library
I started yesterday’s post by saying that authors publish because they want to transfer information, and that an essential ingredient for this transfer is being able to find that information. Of course, any organization running a search engine, or a publisher with a substantial online presence, is another example where the art of “discovery” is as essential as wind to a sailboat. Clearly, this is becoming more and more of a challenge with the rapidly expanding information universe (literature universe, in our case). The amount of potentially interesting, searchable literature is expanding continuously. Besides the normal expansion, there is an additional influx of literature because interdisciplinary boundaries are becoming more and more diffuse. Hence, the need for accurate, efficient and intelligent search tools is bigger than ever.
The holdings of the SAO/NASA Astrophysics Data System (ADS) alone give a good indication of this expansion. As of April 19, 2010, there are 1,730,210 records in the astronomy database and 5,437,973 in the physics database, distributed over publication years as shown in the figure below.
In astronomy, as in other fields, the Literature Universe expands more rapidly because of dissolving boundaries with other fields. Astronomers are publishing in journals and citing articles from journals that had little or no astronomy content not too long ago.
How do you find what you are looking for and, more importantly, information you could not have found using the normal information discovery model? When you have some prior information (like author names and/or subject keywords), you can use your favorite search engine and apply that information as filters. There are also more sophisticated services like myADS (part of your ADS account) that do intelligent filtering for you and provide you with customized suggestions. Alternatively, you can ask somebody you consider to be an expert. This aspect emphasizes that “finding” is essentially a bi-directional process. Wouldn’t it be nice to have an electronic process that tries to mimic this type of discovery? It is exactly this type of information discovery that recommender systems have been designed for.
Recommender systems can be characterized in the following way. Recommender systems for literature recommendation…
- are a technological proxy for a social process
- are a way of suggesting like or similar articles to a user-specific way of thinking
- try to automate aspects of a completely different information discovery model where people try to find other people considered to be experts and ask them to suggest related articles
In other words, the main goal of a literature recommender system is to help visitors find information (in the form of articles) that was previously unknown to them.
What are the key elements needed to build such a recommender system? The most important ingredient is a “proximity concept”. You want to be able to say that two articles are related because they are “closer together” than articles that are less similar. You also want to be able to say that an article is of interest to a person because of its proximity to that person. The following approach will allow you to do just that:
- build a “space” in which documents and persons can be placed
- determine a document clustering within this space (“thematic map”)
How do you build such a space? Assigning labels to documents will allow us to associate a “topic vector” with each document. This will allow us to assign labels to persons as well (“interest vector”), using the documents they read. Placing persons in this document space can be used in essentially two different ways: to provide personalized recommendations, or to use the usage patterns (“reads”) of expert users as proxies for making recommendations to other users (“collaborative filtering”). As far as the labels themselves are concerned, there are various sources you can distill them from. The most straightforward approach is to use keywords for these labels. One drawback that comes to mind immediately is that there are no keywords available for historical literature. However, keywords are an excellent labeling agent for current and recent literature.
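To make the idea of topic and interest vectors concrete, here is a minimal sketch in Python. It assumes binary keyword vectors scaled to unit length, an interest vector that is simply the mean of the topic vectors of the papers a person read, and cosine similarity as the proximity measure. The paper identifiers, keywords, and function names are hypothetical and only serve as illustration; the actual system may weight keywords differently.

```python
import math

def topic_vector(keywords, vocabulary):
    """Binary keyword vector over the vocabulary, scaled to unit length."""
    present = set(keywords)
    raw = [1.0 if term in present else 0.0 for term in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

def interest_vector(read_vectors):
    """Mean of the topic vectors of the documents a person has read."""
    n = len(read_vectors)
    dims = len(read_vectors[0])
    return [sum(v[i] for v in read_vectors) / n for i in range(dims)]

def cosine(a, b):
    """Cosine similarity: closer to 1 means 'closer together' in the space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical papers with publisher-style keywords (illustration only).
papers = {
    "paper_a": ["galaxies", "dynamics", "theory"],
    "paper_b": ["cosmology", "theory"],
    "paper_c": ["galaxies", "observations"],
}
vocabulary = sorted({kw for kws in papers.values() for kw in kws})
vectors = {pid: topic_vector(kws, vocabulary) for pid, kws in papers.items()}

# A hypothetical reader who has read paper_a and paper_c.
reader = interest_vector([vectors["paper_a"], vectors["paper_c"]])
ranking = sorted(papers, key=lambda pid: cosine(reader, vectors[pid]), reverse=True)
```

In this toy setup the reader ends up closer to the two papers about galaxies than to the cosmology paper, which is the proximity notion described above applied to a person instead of a document.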
Whether keywords really describe the document universe with sufficient accuracy is directly related to the question of whether a keyword system is sufficiently detailed to classify articles. I assume the latter is true, but only when you include the keywords from all papers in the bibliography. Having said this, I do realize that a keyword system can never be static, because of developments within a field and because of diffusing boundaries with other fields. I use the keywords provided by the publishers, so the scope and the evolution of the keyword spectrum are out of our hands. It also means that a recommender system based on publisher-provided keywords has one obvious vulnerability: if a major publisher decided to stop using keywords (e.g. PACS identifiers), it would pose a significant problem.
The figure below shows a highly simplified representation of that document space, but it explains the general idea. Imagine a two-dimensional space where one axis represents a topic ranging from extra-galactic to galactic astronomy, and where the other ranges from experimental/observational to theoretical. In this space, a paper titled “Gravitational Physics of Stellar and Galactic Systems” would get placed towards the upper right because its content is mostly about theory, with an emphasis on galactic astronomy. A paper titled “Topological Defects in Cosmology” would end up towards the upper left, because it is purely theoretical and about extra-galactic astronomy.
A person working in the field of observational/experimental extra-galactic astronomy will most likely read papers that are mostly related to this subject, and therefore get placed in the lower left region of this space. A clustering is a document grouping that is superimposed upon this space, which groups together documents that are about similar subjects. As a result, this clustering defines a “thematic map”. As mentioned, this is a highly simplified example. In reality the space has many dimensions (100 to 200), and these cannot be named as intuitively as “level of theoretical content”. However, the naming of the various directions in this “topic space” is not something I worry about. The document clustering is the tool that I will be working with. Note that to establish this “thematic map”, you could very well use the approach I described earlier this week in my post Exploring the Astronomy Literature Landscape.
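The grouping step can be sketched with a small k-means routine on the two-dimensional toy space described above. Note that k-means is an assumption made for illustration: the clustering method actually used is not specified here. The coordinates, the number of clusters, and the choice to seed the centroids with the first k points (to keep the run deterministic) are all hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every document to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical documents: x runs from extra-galactic (-1) to galactic (+1),
# y runs from observational (-1) to theoretical (+1).
documents = [
    (0.8, 0.9),    # theoretical, galactic
    (-0.9, 0.8),   # theoretical, extra-galactic
    (-0.7, -0.8),  # observational, extra-galactic
    (0.7, 0.8),
    (-0.8, 0.9),
    (-0.9, -0.7),
]
labels, centroids = kmeans(documents, k=3)
```

Each label identifies one region of the “thematic map”. In the real space, with 100 to 200 dimensions, the same idea applies, although initialization and the choice of the number of clusters need far more care than in this toy example.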
Knowing to which cluster a new article has been assigned will allow us to find the papers that are closest to this article within the cluster. The first couple of papers in this list can be used as a first recommendation. The more interesting recommendations, however, arise when you combine the information we have about the cluster with usage information. The body of usage information is rather specific: it consists of usage information for “frequent visitors”. Reading between 80 and 300 articles in a period of 6 months seems like a reasonable definition for the group of “frequent visitors”. I assume that this group of frequent visitors represents either professional scientists or people active in the field in another capacity. People who visit less frequently are not good proxies because they are most likely incidental readers.
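A rough sketch of how these two ingredients might be combined is given below. The 80 to 300 reads threshold comes from the definition above; everything else is an assumption made for illustration: the function names, the toy data, and in particular the “also read” tally, which is one simple way of using reads by frequent visitors and not necessarily how the actual system combines them.

```python
import math
from collections import Counter

def is_frequent_visitor(read_count, low=80, high=300):
    """Frequent visitor: between 80 and 300 reads in a six-month period."""
    return low <= read_count <= high

def closest_in_cluster(target, cluster_vectors, n=2):
    """IDs of the n papers in the cluster nearest to the target vector."""
    ranked = sorted(
        cluster_vectors,
        key=lambda pid: math.dist(target, cluster_vectors[pid]),
    )
    return ranked[:n]

def also_read(seed_papers, reads_by_user, read_counts, n=2):
    """Papers most often read by frequent visitors who read a seed paper."""
    seeds = set(seed_papers)
    tally = Counter()
    for user, papers in reads_by_user.items():
        if not is_frequent_visitor(read_counts[user]):
            continue  # incidental readers are not used as proxies
        if seeds & set(papers):
            tally.update(p for p in set(papers) if p not in seeds)
    return [pid for pid, _ in tally.most_common(n)]

# Hypothetical cluster contents and a new article assigned to that cluster.
cluster_vectors = {"p1": (0.9, 0.8), "p2": (0.7, 0.9), "p3": (0.1, 0.2)}
new_article = (0.85, 0.8)
seeds = closest_in_cluster(new_article, cluster_vectors, n=2)

# Hypothetical usage data: a sample of reads, and total reads in six months.
reads_by_user = {
    "u1": ["p1", "p7", "p8"],
    "u2": ["p2", "p7"],
    "u3": ["p1", "p9"],
}
read_counts = {"u1": 120, "u2": 250, "u3": 12}
suggestions = also_read(seeds, reads_by_user, read_counts, n=1)
```

Here the seeds are the first recommendation (the nearest papers within the cluster), and the suggestions add a paper that frequent visitors read alongside those seeds. The reads of the incidental visitor `u3` are ignored.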
The technique used to build the recommender system has been around for quite a while. As early as 1934, Louis Thurstone wrote his paper “Vectors of the Mind”, which addressed the problem of “classifying the temperaments and personality types”. Peter Ossorio (1965) used and built on this technique to develop what he called a “Classification Space”, which he characterized as “a Euclidean model for mapping subject matter similarity within a given subject matter domain”. Michael Kurtz applied this “Classification Space” technique to obtain a new type of search method. Where the construction of the “Classification Space” in the application by Ossorio relied on data input by human subject matter experts, the method proposed by Michael Kurtz builds the space from a set of classified data. Our recommender system is a direct extension of the “Statistical Factor Space” described in the appendix “Statistical Factor Spaces in the Astrophysical Data System” of the paper by Michael Kurtz (1993).
- Kurtz, M. J. 1993, Intelligent Information Retrieval: The Case of Astronomy and Related Space Sciences, 182, 21
- Ossorio, P. G. 1965, J. Multivariate Behavioral Research, 2, 479
- Thurstone, L. L. 1934, Psychological Review, 41, 1