<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Meters, Metrics and More</title>
	<atom:link href="http://anopisthographs.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://anopisthographs.wordpress.com</link>
	<description>This blog will be a semi-coherent selection of observations from informetric, bibliometric and other journeys</description>
	<lastBuildDate>Sun, 18 Dec 2011 12:55:01 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='anopisthographs.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Meters, Metrics and More</title>
		<link>http://anopisthographs.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://anopisthographs.wordpress.com/osd.xml" title="Meters, Metrics and More" />
	<atom:link rel='hub' href='http://anopisthographs.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Linking to Data &#8211; Effect on Citation Rates in Astronomy</title>
		<link>http://anopisthographs.wordpress.com/2011/06/03/linking-to-data-effect-on-citation-rates-in-astronomy/</link>
		<comments>http://anopisthographs.wordpress.com/2011/06/03/linking-to-data-effect-on-citation-rates-in-astronomy/#comments</comments>
		<pubDate>Fri, 03 Jun 2011 18:54:10 +0000</pubDate>
		<dc:creator>anopisthographs</dc:creator>
				<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[Digital Libraries]]></category>
		<category><![CDATA[informetrics]]></category>
		<category><![CDATA[publication]]></category>
		<category><![CDATA[trends and practices]]></category>

		<guid isPermaLink="false">http://anopisthographs.wordpress.com/?p=118</guid>
		<description><![CDATA[In the paper Effect of E-printing on Citation Rates in Astronomy and Physics we asked ourselves the question whether the introduction of the arXiv e-print repository had any influence on citation behavior. We found significant increases in citation rates for papers that appear as e-prints prior to being published in scholarly journals. This is just [...]]]></description>
				<content:encoded><![CDATA[<p>In the paper <a title="Effect of E-printing on Citation Rates in Astronomy and Physics" href="http://labs.adsabs.harvard.edu/ui/abs/2006JEPub...9....2H" target="_blank">Effect of E-printing on Citation Rates in Astronomy and Physics</a> we asked ourselves the question whether the introduction of the arXiv e-print repository had any influence on citation behavior. We found significant increases in citation rates for papers that appear as e-prints prior to being published in scholarly journals.</p>
<p>This is just one example of how publication practices influence article metrics (citation rates, usage, obsolescence, to name a few). Here we will be examining one practice that is very relevant to astronomy: is there a difference, from a bibliometric point of view, between articles that link to data and articles that do not? Specifically, is there a difference in citation rates between these classes of articles?</p>
<p>Besides being interesting from a purely academic point of view, this question is also highly relevant for the process of &#8220;furthering science&#8221;. Data sharing not only helps the process of verification of claims, but also the discovery of new findings in archival data. There seems to be a consensus that sharing data is a Good Thing. Let&#8217;s ignore the &#8220;why&#8221; and &#8220;how&#8221;, and focus on the sharing. You need to have both a willingness and a publication mechanism in order to create a &#8220;practice&#8221;. This is where citation rates come in: if we can say that papers with links to data get higher citation rates, this might increase the willingness of scientists to take the extra steps of linking data sources to their publications.</p>
<p>Using the data holdings of the <a title="SAO/NASA Astrophysics Data System" href="http://ads.harvard.edu" target="_blank">SAO/NASA Astrophysics Data System</a> we can do the analysis and see if articles with <a title="ADS Link Definitions" href="http://doc.adsabs.harvard.edu/abs_doc/help_pages/results.html#List_of_Links" target="_blank">links to data</a> have different citation rates. For the analysis, we used the articles published in <em>The Astrophysical Journal</em> (including <em>Letters</em> and <em>Supplement</em>), <em>The Astronomical Journal</em>, <em>The Monthly Notices of the R.A.S.</em> and <em>Astronomy &amp; Astrophysics</em> (including <em>Supplement</em>), during the period 1995 through 2000. Next we determined the set of the 50 most frequently used keywords in articles with data links. The articles to be used for the analysis were obtained by requiring that they have at least 3 keywords in common with that set of 50 keywords. This resulted in a set of 3814 articles with data links and 7218 articles without data links. To compare sets of equal size, a random selection of 3814 articles was extracted from this set of 7218 articles.</p>
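<p>The selection step described above can be sketched in Python. This is only an illustration under stated assumptions, not the code used for the analysis: the record layout (a dict with hypothetical <code>keywords</code> and <code>has_data_link</code> fields) and the function names are invented for the example, and the thresholds are left as parameters.</p>

```python
import random
from collections import Counter


def top_keywords(articles, n=50):
    """Return the n most frequent keywords among articles with data links."""
    counts = Counter()
    for art in articles:
        if art["has_data_link"]:
            # Count each keyword once per article.
            counts.update(set(art["keywords"]))
    return {kw for kw, _ in counts.most_common(n)}


def select_sample(articles, n_top=50, min_common=3, seed=0):
    """Return equally sized lists of matching articles with and without data links."""
    top = top_keywords(articles, n_top)
    # Keep articles sharing at least min_common keywords with the top set.
    matching = [a for a in articles
                if len(top.intersection(a["keywords"])) >= min_common]
    linked = [a for a in matching if a["has_data_link"]]
    unlinked = [a for a in matching if not a["has_data_link"]]
    # Randomly reduce both groups to the size of the smaller one
    # (the smaller group is returned whole, in shuffled order).
    rng = random.Random(seed)
    size = min(len(linked), len(unlinked))
    return rng.sample(linked, size), rng.sample(unlinked, size)
```

<p>Fixing the seed makes the random selection reproducible, which helps when the comparison has to be rerun later.</p>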
<p>First, we&#8217;ll create a diagram just like the one in figure 4 of the paper <a title="Effect of E-printing on Citation Rates in Astronomy and Physics" href="http://labs.adsabs.harvard.edu/ui/abs/2006JEPub...9....2H" target="_blank">Effect of E-printing on Citation Rates in Astronomy and Physics</a>, which shows the number of citations after publication as an ensemble average. In that figure we used the mean number of citations (over the entire data set) to normalize the citations. For our current analysis we will use the total number of citations for normalization.</p>
<p style="text-align:center;"><a href="http://anopisthographs.files.wordpress.com/2011/06/set_1995_2000_av.jpg"><img class="aligncenter size-large wp-image-123" title="set_1995_2000_av" src="http://anopisthographs.files.wordpress.com/2011/06/set_1995_2000_av.jpg?w=491&#038;h=218" alt="" width="491" height="218" /></a></p>
<p style="text-align:left;">Our analysis shows that articles with data links are indeed cited more than articles without these links. We can say a little bit more by looking at the cumulative citation distribution. The figure below shows this cumulative distribution, normalized by the total number of citations for articles without data links, 120 months after publication.</p>
<p style="text-align:left;"><a href="http://anopisthographs.files.wordpress.com/2011/06/set_1995_2000_cumul1.jpg"><img class="aligncenter size-large wp-image-126" title="set_1995_2000_cumul" src="http://anopisthographs.files.wordpress.com/2011/06/set_1995_2000_cumul1.jpg?w=491&#038;h=218" alt="" width="491" height="218" /></a><br />
This graph shows that for this data set, articles with data links acquired 20% more citations than articles without these links.</p>
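<p>The normalization used for the cumulative distribution can be written down in a few lines. The sketch below assumes that, for each set of articles, we have a list of citation counts per month after publication (summed over the set); the function names and the toy numbers in the usage note are invented for illustration and are not the data behind the figure.</p>

```python
from itertools import accumulate


def cumulative_citations(monthly_counts, norm_total):
    """Running sum of citations per month after publication, divided by norm_total."""
    return [c / norm_total for c in accumulate(monthly_counts)]


def relative_gain(with_links, without_links):
    """Fractional excess of citations for the linked set over the whole window."""
    return sum(with_links) / sum(without_links) - 1.0
```

<p>With this normalization the curve for articles without data links ends at 1, and the end point of the other curve gives the relative gain directly. For invented monthly counts <code>[6, 6]</code> and <code>[5, 5]</code>, the gain is 12/10 - 1, or 20%.</p>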
]]></content:encoded>
			<wfw:commentRss>http://anopisthographs.wordpress.com/2011/06/03/linking-to-data-effect-on-citation-rates-in-astronomy/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e3414a44be11ac6609619a4ee391541?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">anopisthographs</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2011/06/set_1995_2000_av.jpg?w=1024" medium="image">
			<media:title type="html">set_1995_2000_av</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2011/06/set_1995_2000_cumul1.jpg?w=1024" medium="image">
			<media:title type="html">set_1995_2000_cumul</media:title>
		</media:content>
	</item>
		<item>
		<title>Google Books and the Importance of Quality Control</title>
		<link>http://anopisthographs.wordpress.com/2010/11/10/google-books-and-the-importance-of-quality-control/</link>
		<comments>http://anopisthographs.wordpress.com/2010/11/10/google-books-and-the-importance-of-quality-control/#comments</comments>
		<pubDate>Wed, 10 Nov 2010 19:26:07 +0000</pubDate>
		<dc:creator>anopisthographs</dc:creator>
				<category><![CDATA[Digital Libraries]]></category>
		<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://anopisthographs.wordpress.com/?p=109</guid>
		<description><![CDATA[I&#8217;ve stopped counting the times when I used Google Books and cringed. To be honest, I have to say that I have mostly limited myself to digitized serials, serials in astronomy and physics, to be precise. I&#8217;m going to ignore bad metadata, which in itself would be a source of teeth grinding and hair [...]]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve stopped counting the times when I used Google Books and cringed. To be honest, I have to say that I have mostly limited myself to digitized serials, serials in astronomy and physics, to be precise. I&#8217;m going to ignore bad metadata, which in itself would be a source of teeth grinding and hair pulling. I regularly find myself laughing out loud at the subject headings they came up with. Actually, it&#8217;s pretty sad.</p>
<p>No, my main source of frustration is bad digitization. Missing pages, partially scanned pages, pages showing body parts (so far, I&#8217;ve only seen fingers and hands), etc etc. <a href="http://anopisthographs.files.wordpress.com/2010/11/books.jpeg"><img class="alignright size-medium wp-image-110" title="Google Books woes" src="http://anopisthographs.files.wordpress.com/2010/11/books.jpeg?w=237&#038;h=300" alt="" width="237" height="300" /></a>Here you see a fine example of what I am referring to. I don&#8217;t know whose hand this is, but I would feel deeply ashamed if I were this person. Digitization is serious business, especially when your goal is preservation. When publications contain fold-outs, these need to be properly scanned, for example. I totally realize that with an enormous digitization effort like Google&#8217;s, quality control is bound to be hard, if not impossible. In the last year, about half a million scans went through my hands (figuratively speaking). I know how hard it is to check for missing pages and I also know that you simply cannot check every single image.</p>
<p>In addition to bad scans, I think that the search interface of Google Books, well&#8230; errr.. sucks. The results returned seem inconsistent, probably as a result of bad metadata (and bad indexing?). Navigating through results and trying to drill down or find out which other volumes were digitized is a major undertaking and often impossible.</p>
<p>Clearly this was a &#8220;quantity over quality&#8221; project, and quality clearly lost.</p>
]]></content:encoded>
			<wfw:commentRss>http://anopisthographs.wordpress.com/2010/11/10/google-books-and-the-importance-of-quality-control/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e3414a44be11ac6609619a4ee391541?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">anopisthographs</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/11/books.jpeg?w=237" medium="image">
			<media:title type="html">Google Books woes</media:title>
		</media:content>
	</item>
		<item>
		<title>Indexing Matters &#8211; The Importance of Search Engine Behavior</title>
		<link>http://anopisthographs.wordpress.com/2010/07/21/indexing-matters-the-importance-of-search-engine-behavior/</link>
		<comments>http://anopisthographs.wordpress.com/2010/07/21/indexing-matters-the-importance-of-search-engine-behavior/#comments</comments>
		<pubDate>Wed, 21 Jul 2010 13:05:44 +0000</pubDate>
		<dc:creator>anopisthographs</dc:creator>
		
		<guid isPermaLink="false">http://anopisthographs.wordpress.com/?p=99</guid>
		<description><![CDATA[What a search engine returns for a user query largely, if not completely, determines its usefulness for that user. Looking at usage bibliometrics allows us to classify the behavior of different types of users, for example (see e.g. Usage Bibliometrics by Michael J. Kurtz and Johan Bollen). There are voices claiming that Google Scholar is a [...]]]></description>
				<content:encoded><![CDATA[<p>What a search engine returns for a user query largely, if not completely, determines its usefulness for that user. Looking at usage bibliometrics allows us to classify the behavior of different types of users, for example (see e.g. <a href="http://adsabs.harvard.edu/abs/2010ARIST..44....3K"><em>Usage Bibliometrics</em></a> by Michael J. Kurtz and Johan Bollen). There are voices claiming that Google Scholar is a &#8220;threat&#8221; to scholarly information retrieval services (like the ADS and WoS, for example). The main reason why this is not the case becomes clear when we look at usage statistics. Here I will compare readership patterns from ADS and Google Scholar queries, as observed in ADS’s access logs. These readership patterns show the obsolescence of astronomy articles as read by ADS and Google Scholar users. In order to zoom in on people who use ADS professionally, I will only consider ADS users who query ADS 10 or more times per month. The journals I have used in the analysis are the main astronomy journals: <em>Astrophysical Journal</em>, <em>Astronomical Journal</em>, <em>Monthly Notices of the R.A.S.</em> and <em>Astronomy &amp; Astrophysics</em>. In the figure below, a comparison is made between readership of frequent ADS users (read &#8220;professional astronomers&#8221;) and Google Scholar users.</p>
<div id="attachment_100" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/07/crocodile.jpg"><img class="size-full wp-image-100" title="Comparison of readership patterns from ADS and Google Scholar queries" src="http://anopisthographs.files.wordpress.com/2010/07/crocodile.jpg?w=497&#038;h=220" alt="" width="497" height="220" /></a><p class="wp-caption-text">Comparison of readership patterns from ADS and Google Scholar queries, as observed in ADS’s access logs. The red line marked with open circles shows the readership use by people using the ADS search engine. The blue line marked with &#039;x&#039; corresponds to the readership use by people who used the Google Scholar engine. The orange line marked with closed circles shows the citation rate to the articles, while the purple line marked with ’+’ represents their total number of citations.</p></div>
<p>All the quantities in the figure above are on a per article basis and have been normalized by the 1987 value. This was done so that we can compare apples with apples.<br />
The fact that the obsolescence through Google Scholar is strongly correlated with the total number of citations is no coincidence: this is a direct consequence of the correlation between the PageRank and the total number of citations (see e.g. Chen et al. (2007) and Fortunato et al. (2006)). The consequence of this correlation is the following: <strong>Google Scholar does not provide what professional astronomers (and other frequent users) want</strong>. Google Scholar readership correlates with the reading habits of students. In short, <strong>Google Scholar currently is no threat to scholarly information retrieval services</strong>.</p>
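<p>For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two steps involved: scaling each curve by its value in a reference year, and measuring how strongly two curves correlate. It assumes each curve is stored as a mapping from year to a per-article value; the function names are hypothetical and this is not the code that produced the figure.</p>

```python
from math import sqrt


def normalize_to_year(series, ref_year=1987):
    """Divide a {year: value} series by its value in ref_year."""
    ref = series[ref_year]
    if ref == 0:
        raise ValueError("reference-year value is zero; cannot normalize")
    return {year: value / ref for year, value in series.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

<p>After normalization every curve passes through 1 in the reference year, so curves with very different absolute scales (reads, citation rates, citation totals) can share one axis.</p>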
<p><strong>Bibliography</strong></p>
<ul>
<li>Kurtz, M. J., &amp; Bollen, J. (2010), &#8220;Usage Bibliometrics&#8221;, Annual Review of Information Science and Technology, vol. 44, p. 3-64</li>
<li>Henneken, E. et al. (2009), &#8220;Use of astronomical literature &#8211; A report on usage patterns&#8221;, Journal of Informetrics, vol. 3, iss. 1, p. 1</li>
<li>Fortunato, S., Flammini, A., &amp; Menczer, F. (2006), &#8220;Scale-Free Network Growth by Ranking&#8221;, Physical Review Letters, vol. 96, 218701</li>
<li>Chen, P., Xie, H., Maslov, S., &amp; Redner, S. (2007), &#8220;Finding scientific gems with Google&#8217;s PageRank algorithm&#8221;, Journal of Informetrics, vol. 1, p. 8</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://anopisthographs.wordpress.com/2010/07/21/indexing-matters-the-importance-of-search-engine-behavior/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e3414a44be11ac6609619a4ee391541?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">anopisthographs</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/07/crocodile.jpg" medium="image">
			<media:title type="html">Comparison of readership patterns from ADS and Google Scholar queries</media:title>
		</media:content>
	</item>
		<item>
		<title>The Art of Parsing &#8211; Python &#8211; Removing Duplicates</title>
		<link>http://anopisthographs.wordpress.com/2010/07/13/the-art-of-parsing-python-removing-duplicates/</link>
		<comments>http://anopisthographs.wordpress.com/2010/07/13/the-art-of-parsing-python-removing-duplicates/#comments</comments>
		<pubDate>Tue, 13 Jul 2010 16:15:43 +0000</pubDate>
		<dc:creator>anopisthographs</dc:creator>
				<category><![CDATA[data parsing]]></category>
		<category><![CDATA[General]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[scripting]]></category>

		<guid isPermaLink="false">http://anopisthographs.wordpress.com/?p=91</guid>
		<description><![CDATA[When processing large amounts of data, for example when building a recommender system or an index, there is often a need to remove duplicates from a list of e.g. words. As always, there are many ways to solve a problem, even when you stick to one programming language (which in my case is Python). It [...]]]></description>
				<content:encoded><![CDATA[<p>When processing large amounts of data, for example when building a recommender system or an index, there is often a need to remove duplicates from a list of e.g. words. As always, there are many ways to solve a problem, even when you stick to one programming language (which in my case is Python). It is always good to ask yourself the question: how does this method scale? Especially when you work with large data sets, this is something to keep in mind. I was pretty happy with the following method to remove duplicates from a list:</p>
<p>def uniq(inlist):</p>
<blockquote><p>if not inlist:<br />
return inlist<br />
inlist.sort()<br />
outlist = [inlist[0]]<br />
for i in range(1,len(inlist)):<br />
if inlist[i]!=inlist[i-1]:<br />
outlist.append(inlist[i])</p></blockquote>
<p>return outlist</p>
<p>(ok, indentation doesn&#8217;t really work with this free version of wordpress AFAIK). But then I decided to try</p>
<p>from sets import Set<br />
def uniq(inlist):</p>
<blockquote><p>return list(Set((item for item in inlist)))</p></blockquote>
<p>which turned out to be a significant speedup. And the code is much cleaner too <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  The graphs below show the speed up:</p>
<div id="attachment_92" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/07/testuniq_timing.jpg"><img class="size-full wp-image-92" title="comparison of two methods" src="http://anopisthographs.files.wordpress.com/2010/07/testuniq_timing.jpg?w=497&#038;h=221" alt="" width="497" height="221" /></a><p class="wp-caption-text">This graph compares two Python methods for removing duplicates from a list</p></div>
<p>The graph above shows the processing time for removing duplicates from a list as a function of list size for the two methods described above (&#8220;Method 1&#8221; is the second method, using the sets module). The graph below shows the relative speedup:</p>
<div id="attachment_93" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/07/testuniq_relative.jpg"><img class="size-full wp-image-93" title="Comparison of two Python methods, relative speedup" src="http://anopisthographs.files.wordpress.com/2010/07/testuniq_relative.jpg?w=497&#038;h=221" alt="" width="497" height="221" /></a><p class="wp-caption-text">This graph shows how much faster Method 1 is (the method using the sets module)</p></div>
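<p>A note for current Python versions: the <code>sets</code> module was deprecated in Python 2.6 and no longer exists in Python 3, where the built-in <code>set</code> type does the same job. The sketch below restates both methods with proper indentation and adds an order-preserving variant. The function names are mine, items must be hashable for the set and dict versions, and I quote no timings because they depend on the machine and the data.</p>

```python
import timeit


def uniq_sorted(inlist):
    """Sort a copy, then keep each item that differs from its predecessor."""
    if not inlist:
        return []
    ordered = sorted(inlist)
    outlist = [ordered[0]]
    for item in ordered[1:]:
        if item != outlist[-1]:
            outlist.append(item)
    return outlist


def uniq_set(inlist):
    """Use the built-in set type; the order of the result is arbitrary."""
    return list(set(inlist))


def uniq_ordered(inlist):
    """Keep the first occurrence of each item, preserving input order."""
    return list(dict.fromkeys(inlist))


if __name__ == "__main__":
    # Time each variant on a list with many repeated words.
    words = [str(i % 1000) for i in range(100000)]
    for func in (uniq_sorted, uniq_set, uniq_ordered):
        seconds = timeit.timeit(lambda: func(words), number=5)
        print(func.__name__, round(seconds, 4))
```

<p>Unlike the original first method, <code>uniq_sorted</code> sorts a copy, so the caller&#8217;s list is left untouched. <code>uniq_ordered</code> relies on dictionaries keeping insertion order, which is guaranteed from Python 3.7 on.</p>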
]]></content:encoded>
			<wfw:commentRss>http://anopisthographs.wordpress.com/2010/07/13/the-art-of-parsing-python-removing-duplicates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e3414a44be11ac6609619a4ee391541?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">anopisthographs</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/07/testuniq_timing.jpg" medium="image">
			<media:title type="html">comparison of two methods</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/07/testuniq_relative.jpg" medium="image">
			<media:title type="html">Comparison of two Python methods, relative speedup</media:title>
		</media:content>
	</item>
		<item>
		<title>Submission of E-prints &#8211; Versioning</title>
		<link>http://anopisthographs.wordpress.com/2010/07/06/submission-of-e-prints-versioning/</link>
		<comments>http://anopisthographs.wordpress.com/2010/07/06/submission-of-e-prints-versioning/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 11:34:31 +0000</pubDate>
		<dc:creator>anopisthographs</dc:creator>
				<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[e-print metrics]]></category>
		<category><![CDATA[publication]]></category>
		<category><![CDATA[trends and practices]]></category>

		<guid isPermaLink="false">http://anopisthographs.wordpress.com/?p=87</guid>
		<description><![CDATA[Here&#8217;s an interesting trend: the fraction of e-prints with multiple versions has been increasing steadily in a number of categories. The figure below shows these trends for 4 major arXiv categories. I think that authors, over time, started to care more about replacing the initial version with the final version, or at least a more [...]]]></description>
				<content:encoded><![CDATA[<p>Here&#8217;s an interesting trend: the fraction of e-prints with multiple versions has been increasing steadily in a number of categories. The figure below shows these trends for 4 major arXiv categories.</p>
<div id="attachment_88" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/07/versioning.jpg"><img class="size-full wp-image-88" title="e-prints with multiple=" src="http://anopisthographs.files.wordpress.com/2010/07/versioning.jpg?w=497&#038;h=220" alt="" width="497" height="220" /></a><p class="wp-caption-text">This figure shows the fraction of e-prints with multiple versions for the arXiv categories astro-ph, cond-mat, hep-ph and nucl-th</p></div>
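<p>The quantity plotted here is simple to compute once you have one record per e-print. A minimal sketch, assuming hypothetical records with <code>category</code>, <code>year</code> and <code>versions</code> fields (this is not the actual arXiv metadata layout, and not the script behind the figure):</p>

```python
from collections import defaultdict


def multi_version_fraction(eprints):
    """Map (category, year) to the fraction of e-prints with more than one version."""
    totals = defaultdict(int)
    multi = defaultdict(int)
    for ep in eprints:
        key = (ep["category"], ep["year"])
        totals[key] += 1
        if ep["versions"] > 1:
            multi[key] += 1
    return {key: multi[key] / totals[key] for key in totals}
```

<p>Plotting these fractions per category against submission year gives curves of the kind shown above.</p>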
<p>I think that authors, over time, started to care more about replacing the initial version with the final version, or at least a more recent version (as some publishers still don&#8217;t allow the final version to be made available as an e-print). Since the e-prints on arXiv are read so heavily, it is in the authors&#8217; interest to replace their e-prints with corrected/updated versions. There are researchers in some disciplines who will only read and cite e-prints, maybe because their library cannot afford the subscription fees or maybe by choice, but it will clearly be beneficial to them if an e-print is an accurate representation of the end product. The Institute of Mathematical Statistics has the following standpoint with respect to e-printing IMS journal articles:<br />
&#8220;<em>IMS wishes to demonstrate by example that high quality journals supported by the academic community can provide adequate revenue to their publishers even if all of their content is placed on open access digital repository such as arXiv. A steady flow of IMS content into the PR (probability) section and the new ST (statistics) section of arXiv should help create an eprint culture in probability and statistics, and be of general benefit to these fields. By guaranteeing growth of these sections of arXiv, IMS will support the practice of authors self-archiving their papers by placing them on arXiv. This practice should put some bound on the prices of subscriptions to commercial journals.</em>&#8221; (for more info, see <a href="http://www.imstat.org/publications/arxiv.html">IMS Journals on arXiv</a>). They literally give their authors the following advice: &#8220;&#8230; <em>when a final version is accepted by a journal, update your preprint to incorporate changes made in the refereeing process, so a post-refereed pre-press version of your article is also available on arXiv</em>&#8221;. There are probably other journals and societies with the same standpoint.<br />
We&#8217;re just seeing another symptom of the (necessary) paradigm shift in scholarly publishing.</p>
]]></content:encoded>
			<wfw:commentRss>http://anopisthographs.wordpress.com/2010/07/06/submission-of-e-prints-versioning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e3414a44be11ac6609619a4ee391541?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">anopisthographs</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/07/versioning.jpg" medium="image">
			<media:title type="html">e-prints with multiple=</media:title>
		</media:content>
	</item>
		<item>
		<title>Recommending Literature in a Digital Library</title>
		<link>http://anopisthographs.wordpress.com/2010/07/02/recommending-literature-in-a-digital-library/</link>
		<comments>http://anopisthographs.wordpress.com/2010/07/02/recommending-literature-in-a-digital-library/#comments</comments>
		<pubDate>Fri, 02 Jul 2010 09:59:40 +0000</pubDate>
		<dc:creator>anopisthographs</dc:creator>
				<category><![CDATA[Digital Libraries]]></category>
		<category><![CDATA[information retrieval]]></category>

		<guid isPermaLink="false">http://anopisthographs.wordpress.com/?p=73</guid>
		<description><![CDATA[I started yesterday&#8217;s post by saying that authors publish because they want to transfer information and that an essential ingredient for this transfer is being able to find this information. Of course, any organization running a search engine or a publisher with a substantial online presence is another example where the art of &#8220;discovery&#8221; is [...]]]></description>
				<content:encoded><![CDATA[<p>I started yesterday&#8217;s post by saying that authors publish because they want to transfer information and that an essential ingredient for this transfer is being able to find this information. Of course, any organization running a search engine or a publisher with a substantial online presence is another example where the art of &#8220;discovery&#8221; is as essential as wind to a sailboat. Clearly, this is becoming more and more of a challenge with the rapidly expanding information universe (literature universe, in our case). The amount of potentially interesting, searchable literature is expanding continuously. Besides the normal expansion, there is an additional influx of literature because interdisciplinary boundaries are becoming more and more diffuse. Hence, the need for accurate, efficient and intelligent search tools is greater than ever.</p>
<p>When you just look at the holdings of the SAO/NASA Astrophysics Data System (ADS), you&#8217;ll get a good indication of this expansion. As of April 19, 2010, there are 1,730,210 records in the astronomy database, and 5,437,973 in the physics database, distributed over publication years as shown in the figure below.</p>
<div id="attachment_74" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/07/ast_phy_pubyear.jpg"><img class="size-full wp-image-74" title="Numbers of records in the ADS as function of publication year" src="http://anopisthographs.files.wordpress.com/2010/07/ast_phy_pubyear.jpg?w=497&#038;h=256" alt="" width="497" height="256" /></a><p class="wp-caption-text">This figure shows the number of records in the astronomy and physics databases in the ADS, as a function of publication year</p></div>
<p>In astronomy, as in other fields, the Literature Universe expands more rapidly because of dissolving boundaries with other fields. Astronomers are publishing in journals and citing articles from journals that had little or no astronomy content not too long ago.<br />
<a href="http://anopisthographs.files.wordpress.com/2010/07/delphi1.jpg"><img class="alignleft size-full wp-image-80" title="Ask The Expert" src="http://anopisthographs.files.wordpress.com/2010/07/delphi1.jpg?w=497" alt=""   /></a>How do you find what you are looking for and more importantly, information you could not have found using the normal information discovery model? When you have some prior information (like author names and/or subject keywords), you can use your favorite search engine and apply that information as filters. There are also more sophisticated services like myADS (as part of your ADS account), that do intelligent filtering for you and provide you with customized suggestions. Alternatively, you can ask somebody you consider to be an expert. This aspect emphasizes that &#8220;finding&#8221; essentially is a bi-directional process. Wouldn’t it be nice to have an electronic process that tries to mimic this type of discovery? It is exactly this type of information discovery that <em>recommender systems</em> have been designed for.</p>
<p><em>Recommender systems</em> can be characterized in the following way. Recommender systems for literature recommendation&#8230;</p>
<ul>
<li>are a technological proxy for a social process</li>
<li>are a way of suggesting like or similar articles to a user-specific way of thinking</li>
<li>try to automate aspects of a completely different information discovery model where people try to find other people considered to be experts and ask them to suggest related articles</li>
</ul>
<p>In other words, the main goal of a literature recommender system is to help visitors find information (in the form of articles) that was previously unknown to them.</p>
<p>What are the key elements needed to build such a recommender system? The most important ingredient is a &#8220;proximity concept&#8221;. You want to be able to say that two articles are related because they are &#8220;closer together&#8221; than articles that are less similar. You also want to be able to say that an article is of interest to a person because of its proximity to that person. The following approach will allow you to do just that:</p>
<ul>
<li>build a &#8220;space&#8221; in which documents and persons can be placed</li>
<li>determine a document clustering within this space (&#8220;thematic map&#8221;)</li>
</ul>
<p>How do you build such a space? Assigning labels to documents will allow us to associate a &#8220;topic vector&#8221; with each document. This will allow us to assign labels to persons as well (&#8220;interest vector&#8221;), using the documents they read. Placing persons in this document space can be used in essentially two different ways: use this information to provide personalized recommendations or use usage patterns (&#8220;reads&#8221;) of expert users as proxies for making recommendations to other users (&#8220;collaborative filtering&#8221;). As far as the labels themselves are concerned, there are various sources you can distill them from. The most straightforward approach is to use keywords for these labels. One drawback that comes to mind immediately is that there are no keywords available for historical literature. However, keywords are an excellent labeling agent for current and recent literature.</p>
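<p>To make the idea of topic and interest vectors concrete, here is a minimal sketch. It assumes that a topic vector is simply a count of keywords and that an interest vector is the sum of the topic vectors of the documents a person has read; the function names, the keywords and the use of cosine similarity as the proximity measure are illustrative assumptions, not the system described in this post.</p>

```python
import math
from collections import Counter

def topic_vector(keywords):
    """Bag-of-keywords vector for one document (keyword -> count)."""
    return Counter(keywords)

def interest_vector(read_documents):
    """Sum the topic vectors of all documents a person has read."""
    total = Counter()
    for keywords in read_documents:
        total.update(topic_vector(keywords))
    return total

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm = norm_u * norm_v
    return dot / norm if norm else 0.0

# Hypothetical keywords, for illustration only.
reader = interest_vector([["galaxies", "redshift"], ["galaxies", "clusters"]])
paper_a = topic_vector(["galaxies", "clusters"])
paper_b = topic_vector(["stellar dynamics"])
print(cosine(reader, paper_a), cosine(reader, paper_b))
```

<p>In this toy example the first paper ends up closer to the reader than the second, because it shares keywords with what the reader has read before.</p>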
<p>Whether keywords really describe the document universe with sufficient accuracy is directly related to the question whether a keyword system is sufficiently detailed to classify articles. I assume the latter is true, but only when you include the keywords from all papers in the bibliography. Having said this, I do realize that a keyword system can never be static because of developments within a field and because of diffusing boundaries with other fields. I use the keywords provided by the publishers, so the scope and the evolution of the keyword spectrum are out of our hands. It also means that a recommender system based on publisher-provided keywords has one obvious vulnerability: if a major publisher decided to stop using keywords (e.g. PACS identifiers), it would pose a significant problem.</p>
<p>The figure below shows a highly simplified representation of that document space, but it explains the general idea. Imagine a two-dimensional space where one axis represents a topic ranging from galactic to extra-galactic astronomy, and where the other ranges from experimental/observational to theoretical. In this space, a paper titled &#8220;Gravitational Physics of Stellar and Galactic Systems&#8221; would get placed towards the upper right because its content is mostly about theory, with an emphasis on galactic astronomy. A paper titled &#8220;Topological Defects in Cosmology&#8221; would end up towards the upper left, because it is purely theoretical and about extra-galactic astronomy.</p>
<div id="attachment_83" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/07/figure3_ehenneken.jpg"><img class="size-full wp-image-83" title="Example of a Topic Space" src="http://anopisthographs.files.wordpress.com/2010/07/figure3_ehenneken.jpg?w=497&#038;h=328" alt="" width="497" height="328" /></a><p class="wp-caption-text">A simplistic, two-dimensional representation of a &quot;topic space&quot;</p></div>
<p>A person working in the field of observational/experimental extra-galactic astronomy will most likely read papers mostly related to this subject, and therefore get placed in the lower left region of this space. A clustering is a document grouping that is super-imposed upon this space, which groups together documents that are about similar subjects. As a result, this clustering defines a &#8220;thematic map&#8221;. As mentioned, this is a highly simplified example. In reality the space has many dimensions (100 to 200), and these cannot be named as intuitively as &#8220;level of theoretical content&#8221;. However, the naming of various directions in this &#8220;topic space&#8221; is not something I worry about. The document clustering is the tool that I will be working with. Note that to establish this &#8220;thematic map&#8221;, you could very well use the approach I described earlier this week in my post <a href="http://anopisthographs.wordpress.com/2010/06/29/exploring-the-astronomy-literature-landscape/">Exploring the Astronomy Literature Landscape</a>.</p>
<p>Knowing to which cluster a new article has been assigned will allow us to find papers that are the closest to this article within the cluster. The first couple of papers in this list can be used as a first recommendation. The more interesting recommendations, however, arise when you combine the information we have about the cluster with usage information. The body of usage information is rather specific: it consists of usage information for &#8220;frequent visitors&#8221;. Reading between 80 and 300 articles in a period of 6 months seems like a reasonable definition for the group of &#8220;frequent visitors&#8221;. I assume that this group of frequent visitors represents either professional scientists or people active in the field in another capacity. People who visit less frequently are not good proxies because they are most likely incidental readers.</p>
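<p>As a rough sketch of how usage information could enter the picture: the first function below selects &#8220;frequent visitors&#8221; using the 80 to 300 reads window mentioned above, and the second ranks papers by how many of those visitors read them together with a given paper. The data layout, the function names and the simple co-read count are assumptions made for illustration; they are not the actual recommender.</p>

```python
from collections import Counter

def frequent_visitors(read_counts, low=80, high=300):
    """Users whose six-month read count lies between low and high (inclusive)."""
    return sorted(user for user, n in read_counts.items() if high >= n >= low)

def co_read_recommendations(target, reads_by_user, experts, top=3):
    """Rank papers by how many expert readers of `target` also read them."""
    counts = Counter()
    for user in experts:
        papers = reads_by_user.get(user, set())
        if target in papers:
            counts.update(p for p in papers if p != target)
    return [paper for paper, _ in counts.most_common(top)]

# Hypothetical usage data: reads per user over six months, and what they read.
read_counts = {"u1": 120, "u2": 15, "u3": 300, "u4": 450}
reads_by_user = {"u1": {"A", "B", "C"}, "u2": {"A", "D"}, "u3": {"A", "B"}}

experts = frequent_visitors(read_counts)
print(experts)
print(co_read_recommendations("A", reads_by_user, experts))
```

<p>Incidental readers (here the user with 15 reads) do not contribute to the ranking, which is the point of restricting the usage data to frequent visitors.</p>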
<p>The technique used to build the recommender system has been around for quite a while. As early as 1934, Louis Thurstone wrote his paper &#8220;Vectors of the Mind&#8221; which addressed the problem of &#8220;classifying the temperaments and personality types&#8221;. Peter Ossorio (1965) used and built on this technique to develop what he called a &#8220;Classification Space&#8221;, which he characterized as &#8220;a Euclidean model for mapping subject matter similarity within a given subject matter domain&#8221;. Michael Kurtz applied this &#8220;Classification Space&#8221; technique to obtain a new type of search method. Where the construction of the &#8220;Classification Space&#8221; in the application by Ossorio relied on data input by human subject matter experts, the method proposed by Michael Kurtz builds the space from a set of classified data. Our recommender system is a direct extension of the &#8220;Statistical Factor Space&#8221; described in the appendix &#8220;Statistical Factor Spaces in the Astrophysical Data System&#8221; of this paper by Michael Kurtz.</p>
<p><strong>Bibliography</strong></p>
<ul>
<li>Kurtz, M. J. 1993, Intelligent Information Retrieval: The Case of Astronomy and Related Space Sciences, 182, 21</li>
<li>Ossorio, P. G. 1965, J. Multivariate Behavioral Research, 2, 479</li>
<li>Thurstone, L. L. 1934, Psychological Review, 41, 1</li>
</ul>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/anopisthographs.wordpress.com/73/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/anopisthographs.wordpress.com/73/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=anopisthographs.wordpress.com&#038;blog=14231106&#038;post=73&#038;subd=anopisthographs&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://anopisthographs.wordpress.com/2010/07/02/recommending-literature-in-a-digital-library/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e3414a44be11ac6609619a4ee391541?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">anopisthographs</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/07/ast_phy_pubyear.jpg" medium="image">
			<media:title type="html">Numbers of records in the ADS as function of publication year</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/07/delphi1.jpg" medium="image">
			<media:title type="html">Ask The Expert</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/07/figure3_ehenneken.jpg" medium="image">
			<media:title type="html">Example of a Topic Space</media:title>
		</media:content>
	</item>
		<item>
		<title>Publication Trends &#8211; Authors &#8211; Astronomy</title>
		<link>http://anopisthographs.wordpress.com/2010/06/30/publication-trends-authors-astronomy/</link>
		<comments>http://anopisthographs.wordpress.com/2010/06/30/publication-trends-authors-astronomy/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 12:23:38 +0000</pubDate>
		<dc:creator>anopisthographs</dc:creator>
				<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[informetrics]]></category>
		<category><![CDATA[publication]]></category>
		<category><![CDATA[trends and practices]]></category>

		<guid isPermaLink="false">http://anopisthographs.wordpress.com/?p=66</guid>
		<description><![CDATA[Authors publish because they want to transfer information. An essential ingredient for this transfer is being able to find this information. This means that this information, for example articles in scholarly journals, needs to be indexed properly and enriched with relevant meta data and links. Enhanced information retrieval tools, like recommender systems, have become indispensable. [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=anopisthographs.wordpress.com&#038;blog=14231106&#038;post=66&#038;subd=anopisthographs&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Authors publish because they want to transfer information. An essential ingredient for this transfer is being able to find this information. This means that this information, for example articles in scholarly journals, needs to be indexed properly and enriched with relevant meta data and links. Enhanced information retrieval tools, like recommender systems, have become indispensable. Besides the actual content of the information offered for dispersal, the information comes with another piece of essential meta data: the author list.</p>
<p>The importance of the author list is essentially bidirectional. Having your name appear on articles is an essential ingredient of any scholarly career and plays an important role in the process of seeking, for example, tenure or jobs. The role of first author depends on discipline, so the first author isn&#8217;t necessarily the &#8220;most authoritative&#8221; author. Some disciplines use alphabetical author lists, for example. Co-authorship with a prominent expert clearly makes a difference and sometimes gives you &#8220;measurable status&#8221;, like the Erdős number in mathematics, which is the &#8220;collaborative distance&#8221; between a person and Paul Erdős (if your number is 1, it means you published a paper together with him).</p>
<p>To me, co-authorship is the most normal thing in the world. In a lot of ways, doing science is like learning a &#8220;trade&#8221;. You start off being an apprentice, you take an exam showing that you have mastered the basic skills for the &#8220;trade&#8221; and then you find your own way. As an aside: I think the doctoral thesis and its subsequent defense are that &#8220;test of ability&#8221;. In some disciplines it now seems to have become a requirement that doctoral research should result in something original and new. Please correct me if that observation is incorrect.</p>
<p>In the past, at least in astronomy and physics, it was more common to publish papers just by yourself, once you had mastered your field. And this was initially totally feasible. In the early days of science there were no budgets being slashed and there were no enormous projects like the LHC. Most scientists had their own little &#8220;back yard&#8221; where they could grow whatever they felt like growing. As the 20th century progressed, especially in roughly the second half, collaborations became more and more unavoidable. With collaborations came growing numbers of co-authors. From this moment on we see <a href="http://adsabs.harvard.edu/abs/2007Natur.450.1165G">The demise of the lone author</a> (Mott Greene, Nature, Volume 450, Issue 7173, pp. 1165). The figure below is an illustration of how the distribution of the number of authors has changed over time.<br />
<div id="attachment_67" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/06/author_relfreqs_ast.jpg"><img src="http://anopisthographs.files.wordpress.com/2010/06/author_relfreqs_ast.jpg?w=497&#038;h=220" alt="" title="Relative frequencies of numbers of authors" width="497" height="220" class="size-full wp-image-67" /></a><p class="wp-caption-text">The figure shows the distribution of the relative frequency of the number of authors per paper in the main astronomy journals for a number of years</p></div></p>
<p>This figure illustrates a couple of things. First of all, it shows the &#8220;demise of the lone author&#8221;, where the fraction of lone author papers dropped from about 60% in 1960 to about 6% in 2009! The widening of the distribution shows that on average the number of co-authors has increased. It seems that this is still an ongoing process that hasn&#8217;t reached a saturation point yet.</p>
<p>The figure below highlights the &#8220;demise of the lone author&#8221; by showing the change in the fraction of single author papers in the main astronomy and physics journals.<br />
<div id="attachment_68" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/06/singleauthorpapers.jpg"><img src="http://anopisthographs.files.wordpress.com/2010/06/singleauthorpapers1.jpg?w=497&#038;h=220" alt="" title="The fraction of single author papers in the main astronomy and physics journals" width="497" height="220" class="size-full wp-image-68" /></a><p class="wp-caption-text">The figure shows the fraction of papers by single authors in the main astronomy and physics journals</p></div><br />
The drop in the astronomy journals is more dramatic than for the physics journals: a factor of about 10 versus a factor of about 3 or 4.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/anopisthographs.wordpress.com/66/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/anopisthographs.wordpress.com/66/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=anopisthographs.wordpress.com&#038;blog=14231106&#038;post=66&#038;subd=anopisthographs&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://anopisthographs.wordpress.com/2010/06/30/publication-trends-authors-astronomy/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e3414a44be11ac6609619a4ee391541?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">anopisthographs</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/06/author_relfreqs_ast.jpg" medium="image">
			<media:title type="html">Relative frequencies of numbers of authors</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/06/singleauthorpapers1.jpg" medium="image">
			<media:title type="html">The fraction of single author papers in the main astronomy and physics journals</media:title>
		</media:content>
	</item>
		<item>
		<title>Exploring the Astronomy Literature Landscape</title>
		<link>http://anopisthographs.wordpress.com/2010/06/29/exploring-the-astronomy-literature-landscape/</link>
		<comments>http://anopisthographs.wordpress.com/2010/06/29/exploring-the-astronomy-literature-landscape/#comments</comments>
		<pubDate>Tue, 29 Jun 2010 11:08:00 +0000</pubDate>
		<dc:creator>anopisthographs</dc:creator>
				<category><![CDATA[Digital Libraries]]></category>
		<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[network analysis]]></category>

		<guid isPermaLink="false">http://anopisthographs.wordpress.com/?p=58</guid>
		<description><![CDATA[The body of literature in astronomy and physics is enormous, and growing rapidly. How do you “separate wheat from chaff” and find your way in this maze of scholarly papers, conference proceedings, grey literature and technical reports? When you use a general search engine (Google, Yahoo, Live Search, etc.), you will probably find thousands [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=anopisthographs.wordpress.com&#038;blog=14231106&#038;post=58&#038;subd=anopisthographs&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>The body of literature in astronomy and physics is enormous, and growing rapidly. How do you “separate wheat from chaff” and find your way in this maze of scholarly papers, conference proceedings, grey literature and technical reports? When you use a general search engine (Google, Yahoo, Live Search, etc.), you will probably find thousands of documents, ranked according to some algorithm. Even specialized versions of these tools often leave us awash in information. To search the electronic, scholarly literature, scientists need to be able to zoom in on bibliographic data using additional descriptors and search logic. The SAO/NASA Astrophysics Data System (ADS) offers a bibliographic service with a sophisticated query interface, offering users a wide set of filters (Kurtz et al. 2000). In addition, the ADS has the myADS service, a fully customizable newspaper covering all (journal) research for astronomy, physics and/or the arXiv e-prints. The myADS-arXiv service (Henneken et al. 2006) is a tailor-made, Open Access, virtual journal, covering the most important papers of the past week in physics and astronomy. Although powerful, these bibliographic services are essentially list-oriented and therefore have their limitations.<br />
Lists, because of their one-dimensionality, are unable to provide a rich context for a given paper. A deeper understanding of the data structure formed by the collection of meta data in the ADS database will allow us to develop tools that provide a fundamentally different way of navigating through this bibliographic universe. As a tool for navigating, the best analogy is probably that of a map. A point on a map has a certain contextual meaning, depending on the information being displayed on that map. How can we form a landscape and subsequently a map, based on the astronomy literature? A set of papers can be regarded as an ensemble of points that “interact” with each other in a certain way. This interaction can, for example, represent the citations between papers, the number of keywords papers have in common, a similarity between abstracts of papers or a combination of these. In this way, these papers form a network with weighted connections.</p>
<div id="attachment_59" class="wp-caption alignleft" style="width: 289px"><a href="http://anopisthographs.files.wordpress.com/2010/06/network_map.jpg"><img class="size-full wp-image-59" title="Illustration of information compression: a map" src="http://anopisthographs.files.wordpress.com/2010/06/network_map.jpg?w=497" alt=""   /></a><p class="wp-caption-text">An illustration for filtering out unimportant detail: an aerial photograph doesn&#039;t help you when you want to take the subway across Boston into Cambridge. In the subway map all relevant directional information has been compressed into one image. The basic geographical information is still there, but now you can see immediately how to get from one point to another with the subway</p></div>
<p>Depending on the character of the “interaction” between the papers, this network is directed or undirected. A citation network, for example, is directed. The papers, in this network representation, form highly connected modules that are only weakly connected to each other. Using the modules to describe the network is equivalent to filtering out all unimportant details in the network and just using the regularities. An example to illustrate this approach is the difference between an aerial map and a schematic of the subway system of a city (credit goes to Martin Rosvall). Describing a network in terms of modules is equivalent to a lossy compression of the original network. Having established the modular structure for a set of papers provides us with powerful knowledge of how our bibliographic universe is organized. This knowledge can help regular search engine queries give more meaningful results, by making the search engine aware that papers belong to communities. It also allows us to create thematic maps of scholarly literature, allowing an essentially new way of navigating the literature.</p>
<p><strong>How do you get from a network to a map?</strong> Our network consists of N papers, each having a certain number of meta data labels (like keywords, reference sections, etc.). Additionally, there is a relationship that determines whether two papers are connected in this network, and how strong this interaction is. As a result of this relation, there will be l links in this network. Our choice for the relation is: paper i cites paper j. This automatically means that the weight of each link is 1. The (uncompressed) network is described by the adjacency matrix, as usual. There are many ways to compress this network into a modular description. Which one fits our purposes? Obviously, we want a compressed version of the original network that best characterizes the original network. How do we translate this into mathematics? The uncompressed network is the realization of a random variable <em>X</em> and the compressed description is that of a random variable <em>Y</em>. Information theory stipulates that the description <em>Y</em> of random variable <em>X</em> that best characterizes <em>X</em> is the one that maximizes the mutual information <em>I(X; Y)</em> between description and network. An often-used alternative is the approach that maximizes the so-called modularity. However, this method has a bias towards equal-sized modules (Rosvall &amp; Bergstrom 2007), which is a partition that is not realistic for our bibliographic network. In our approach, we only use information that is present in the network representation. The key ingredient in our approach is the fact that a random surfer will represent the flow in the network in a natural way. A random surfer is a random walker who makes a random jump within the network with a given probability. This is to allow weighted, directed networks in our analysis. The node visit frequencies of the random surfer naturally tend to the underlying probability distribution within the network.
How do we describe paths in our network? In theory we could use a Huffman code, but that is not necessary. If <em>L(C)</em> represents the expected code length, then according to Shannon’s coding theorem (Shannon 1948): <em>L(C) ≥ H(X)</em> (where <em>H(X)</em> is the entropy of the random variable <em>X</em>). So, on average, the number of bits that we need to describe one step taken by the random surfer is always greater than or equal to <em>H(P)</em>, the entropy defined by the distribution <em>P</em> of (ergodic) visit frequencies to the network nodes. In our approach, the lower bound on the code length will be the “description length” <em>L</em>. In this framework, we partition our network into modules that minimize the expected description length of the random walk. Once we have established the partition into modules, the map will be the graphical representation of this partition. The map will graphically represent partition characteristics like the amount of time a random surfer spends in a module, the citation flow within a module and the citation flow out of a module. This method has been described in detail in Rosvall, M. &amp; Bergstrom, C. T. 2007 (see bibliography).</p>
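<p>The lower bound from Shannon&#8217;s theorem is easy to compute once the visit frequencies are known. The sketch below only evaluates the entropy of a given set of visit frequencies; it does not compute the module-based description length of Rosvall and Bergstrom, and the two example distributions are made up for illustration.</p>

```python
import math

def entropy_bits(visit_frequencies):
    """Shannon entropy in bits; entries with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in visit_frequencies if p > 0)

# Two hypothetical four-node networks: one visited uniformly, one skewed.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.25, 0.125, 0.125]

print(entropy_bits(uniform))  # 2.0
print(entropy_bits(skewed))   # 1.75
```

<p>The skewed distribution needs fewer bits per step on average, which is why regularities in the flow allow a shorter description of the random walk.</p>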
<p><strong>How do you create an algorithm out of this?</strong> The ergodic node visit frequencies are calculated using the fact that the journey of the random surfer is a Markov chain. Because the surfer can get from any point in the network to any other point, it is an irreducible Markov chain. Because this Markov chain is also aperiodic, according to the Perron-Frobenius Theorem, it has a unique steady state solution. This steady state is computed using the Power Method (Perra &amp; Fortunato 2008).<br />
The exit probability for a given module follows from the teleportation probability and the probability of leaving the module from any of the nodes in the module.<br />
Next, the network is partitioned into modules by means of a greedy search and simulated annealing. The greedy search merges pairs of modules that give the largest decrease in description length until further merging increases the description length. The result of the greedy search is improved by simulated annealing. Here a configuration that is “near” the current one is explored. If it results in a smaller description length, it is accepted; if not, it is accepted with a probability that depends on a system parameter called “temperature”. The simulated annealing algorithm used is the “heat bath algorithm” (Newman &amp; Barkema 1999). By using this algorithm at several different temperatures, the description length can be improved by up to several percent over that found by the greedy search alone.</p>
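<p>A minimal sketch of the power method for the random surfer is given below. It assumes a citation network stored as a dictionary from each paper to the list of papers it cites, a teleportation probability of 0.15 (the post does not state the value that was used) and a uniform jump from papers that cite nothing inside the network. The function name and the tiny example network are hypothetical.</p>

```python
def visit_frequencies(links, teleport=0.15, iterations=100):
    """Power iteration for the steady state visit frequencies of a random surfer.

    links maps each node to the list of nodes it cites.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    p = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives an equal share of the teleportation probability.
        nxt = {node: teleport / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = (1.0 - teleport) * p[node] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node (cites nothing): spread its weight uniformly.
                share = (1.0 - teleport) * p[node] / n
                for t in nodes:
                    nxt[t] += share
        p = nxt
    return p

# Hypothetical three-paper network: a and b both cite c.
citations = {"a": ["c"], "b": ["c"], "c": []}
print(visit_frequencies(citations))
```

<p>The frequencies sum to one after every iteration, and the cited paper collects more of the surfer&#8217;s visits than the two citing papers.</p>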
<p><strong>Bibliography</strong><br />
Henneken, E., et al. 2006, in ASP Conference Series, 377, 106<br />
Kurtz, M. J., et al. 2002, Proc. SPIE, 4847, 238<br />
Kurtz, M. J., et al. 2000, A&amp;AS, 143, 41<br />
Newman, M. E. J. &amp; Barkema, G. T. 1999, “Monte Carlo methods in statistical physics”, Oxford University Press<br />
Perra, N. &amp; Fortunato, S. 2008, arXiv:0805.3322<br />
Rosvall, M. &amp; Bergstrom, C. T. 2007, Pub. Nat. Acad. Sc., 104, 7327<br />
Shannon, C. E. 1948, The Bell System Technical Journal, 27, 379</p>
<div id="attachment_60" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/06/literature_map.jpg"><img class="size-full wp-image-60" title="A map of astronomy based on citations between articles in the main astronomy journals in the period 2000-2005" src="http://anopisthographs.files.wordpress.com/2010/06/literature_map.jpg?w=497&#038;h=431" alt="" width="497" height="431" /></a><p class="wp-caption-text">This figure shows the results for a citation network based on all papers published in The Astrophysical Journal (including Letters), The Astronomical Journal, Astronomy &amp; Astrophysics and Monthly Notices of the Royal Astronomical Society for the period of 2000-2005. Only citations between these journals and within the time period were taken into account. The resulting network consists of 35,941 papers (network nodes) and 325,353 citations (network edges). </p></div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/anopisthographs.wordpress.com/58/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/anopisthographs.wordpress.com/58/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=anopisthographs.wordpress.com&#038;blog=14231106&#038;post=58&#038;subd=anopisthographs&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://anopisthographs.wordpress.com/2010/06/29/exploring-the-astronomy-literature-landscape/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e3414a44be11ac6609619a4ee391541?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">anopisthographs</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/06/network_map.jpg" medium="image">
			<media:title type="html">Illustration of information compression: a map</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/06/literature_map.jpg" medium="image">
			<media:title type="html">A map of astronomy based on citations between articles in the main astronomy journals in the period 2000-2005</media:title>
		</media:content>
	</item>
		<item>
		<title>Astronomy Journals &#8211; Network Properties</title>
		<link>http://anopisthographs.wordpress.com/2010/06/28/astronomy-journals-network-properties/</link>
		<comments>http://anopisthographs.wordpress.com/2010/06/28/astronomy-journals-network-properties/#comments</comments>
		<pubDate>Mon, 28 Jun 2010 12:18:23 +0000</pubDate>
		<dc:creator>anopisthographs</dc:creator>
				<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[journal metrics]]></category>
		<category><![CDATA[network analysis]]></category>

		<guid isPermaLink="false">http://anopisthographs.wordpress.com/?p=53</guid>
		<description><![CDATA[Articles form a graph where the edges represent a relationship between the articles. With some of those relationships, the graph will be a directed graph. The relationship &#8220;X cites Y&#8221; is an example. Representing articles as a graph allows us to make a number of interesting observations. Can we detect clustering within the graph? I [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=anopisthographs.wordpress.com&#038;blog=14231106&#038;post=53&#038;subd=anopisthographs&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Articles form a graph where the edges represent a relationship between the articles. With some of those relationships, the graph will be a directed graph. The relationship &#8220;X cites Y&#8221; is an example. Representing articles as a graph allows us to make a number of interesting observations. Can we detect clustering within the graph? I will discuss this at a later time. Another question is: how does the topology of the graph change over time? More specifically: can we detect densification? Does the number of edges increase? The diagram below compares the number of nodes (=articles) with the number of edges (=references) for the major astronomy journals in the time period of 1980 through 2006. Citations to articles outside the network (temporally or to other journals, not in the network) were disregarded.</p>
<a href="http://anopisthographs.files.wordpress.com/2010/06/densification1.jpg"><img class="size-full wp-image-55" title="network densification" src="http://anopisthographs.files.wordpress.com/2010/06/densification1.jpg?w=497&#038;h=220" alt="" width="497" height="220" /></a>
<p>The function fitted to this relationship is</p>
<p>e(t) = a×n(t)<sup>b</sup></p>
<p>to illustrate the power-law form of network densification. In this relation e(t) is the number of edges at time t and n(t) is the number of nodes. An exponent of b = 1 would mean that the number of edges grows linearly with the number of nodes. The fit gives an exponent of about 1.9 (with a correlation of 0.99), indicating strong densification over time, i.e. non-linear growth. Another implication is that, on average, bibliographies have become longer over time.</p>
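<p>One common way to fit a relation of this form is ordinary least squares on the logarithms, since log e = log a + b log n is a straight line. The sketch below takes that approach. It is an assumption that this matches the fit quoted above, and the numbers in the example are synthetic, generated from an exact power law rather than taken from the journal data.</p>

```python
import math


def fit_power_law(nodes, edges):
    """Fit e = a * n**b by least squares in log-log space.

    Returns (a, b, r), where r is the correlation coefficient of
    log(n) and log(e). All inputs must be positive.
    """
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count

    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    b = sxy / sxx                         # slope = exponent
    a = math.exp(mean_y - b * mean_x)     # intercept = log(a)
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


# Synthetic, noise-free data generated from e = 0.01 * n**1.9.
demo_nodes = [1000, 5000, 20000, 80000]
demo_edges = [0.01 * n ** 1.9 for n in demo_nodes]
a, b, r = fit_power_law(demo_nodes, demo_edges)
```

<p>With noise-free input the fit recovers the exponent used to generate the data. Real counts scatter around the line, and the correlation then drops below 1.</p>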
]]></content:encoded>
			<wfw:commentRss>http://anopisthographs.wordpress.com/2010/06/28/astronomy-journals-network-properties/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e3414a44be11ac6609619a4ee391541?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">anopisthographs</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/06/densification1.jpg" medium="image">
			<media:title type="html">network densification</media:title>
		</media:content>
	</item>
		<item>
		<title>Journal Citation Statistics &#8211; Inter-Citation &#8211; Astronomy</title>
		<link>http://anopisthographs.wordpress.com/2010/06/25/journal-cititation-statistics-inter-citation-astronomy/</link>
		<comments>http://anopisthographs.wordpress.com/2010/06/25/journal-cititation-statistics-inter-citation-astronomy/#comments</comments>
		<pubDate>Fri, 25 Jun 2010 11:47:25 +0000</pubDate>
		<dc:creator>anopisthographs</dc:creator>
				<category><![CDATA[bibliometrics]]></category>
		<category><![CDATA[journal metrics]]></category>
		<category><![CDATA[network analysis]]></category>

		<guid isPermaLink="false">http://anopisthographs.wordpress.com/?p=40</guid>
		<description><![CDATA[Just as articles form a complex network, where some relationship (like &#8220;x cites y&#8221;) determines the network topology, journals themselves form a similar network, but with less granularity. It&#8217;s a little bit like the &#8220;thermodynamics&#8221; of the article universe. One big difference between the journal and the article universes is that in the journal universe, [...]]]></description>
				<content:encoded><![CDATA[<p>Just as articles form a complex network, where some relationship (like &#8220;x cites y&#8221;) determines the network topology, journals themselves form a similar network, but with less granularity. It&#8217;s a little bit like the &#8220;thermodynamics&#8221; of the article universe. One big difference between the journal and the article universes is that in the journal universe, there are &#8220;loops&#8221;: journals cite themselves. You can argue that articles, in some sense, cite themselves too, but this is a flow with a constant amplitude across all nodes, while for journals this is clearly not the case. With journal inter-citation (a measure of &#8220;how often is journal X cited by journal Y&#8221;) we basically look at the out-degree of the nodes.</p>
<p>In this entry I&#8217;ll look at the main astronomy journals (The Astrophysical Journal, The Astronomical Journal, Monthly Notices of the R.A.S. and Astronomy &amp; Astrophysics). For a given publication year and journal, I&#8217;ve taken the bibliographies of all articles that appeared and determined what percentage of those citations went where. The results are shown below.</p>
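<p>The percentages described here amount to a simple tally. Below is a minimal Python sketch, assuming each reference has already been resolved to the journal it points to. The function name, the journal labels and the sample list are made up for illustration and are not the data behind the diagrams.</p>

```python
from collections import Counter


def citation_shares(cited_journals):
    """Map each cited journal to its percentage of all references.

    cited_journals is the flattened list of journals cited by all articles
    of one citing journal in one publication year.
    """
    counts = Counter(cited_journals)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the journals from largest to smallest flow.
    return {journal: 100.0 * n / total for journal, n in counts.most_common()}


# Hypothetical references from one journal-year: ten citations in total.
sample_refs = [
    "ApJ", "ApJ", "ApJ", "ApJ",
    "MNRAS", "MNRAS",
    "AJ", "AJ",
    "ApJL",
    "PhRvD",
]
shares = citation_shares(sample_refs)
```

<p>Repeating this for every publication year of a citing journal gives one curve per cited journal.</p>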
<div id="attachment_47" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/06/apjrefs.jpg"><img src="http://anopisthographs.files.wordpress.com/2010/06/apjrefs.jpg?w=497&#038;h=220" alt="" title="Citations from the Astrophysical Journal going to other journals" width="497" height="220" class="size-full wp-image-47" /></a><p class="wp-caption-text">This diagram shows for a given publication year which journals were cited the most by articles in the Astrophysical Journal</p></div>
<p>There are no dramatic changes over a period spanning just over a decade. The strongest citation flow goes back into The Astrophysical Journal itself. The ApJ Letters were second for a while, but have now been overtaken (by a small percentage) by Astronomy &amp; Astrophysics and the Monthly Notices of the R.A.S.</p>
<div id="attachment_48" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/06/ajrefs.jpg"><img src="http://anopisthographs.files.wordpress.com/2010/06/ajrefs.jpg?w=497&#038;h=220" alt="" title="Citations from the Astronomical Journal going to other journals" width="497" height="220" class="size-full wp-image-48" /></a><p class="wp-caption-text">This diagram shows for a given publication year which journals were cited the most by articles in the Astronomical Journal</p></div>
<p>The two largest citation flows here have very similar amplitudes: one looping back to the journal itself and one to The Astrophysical Journal. The next biggest flow is to A&amp;A, followed by MNRAS. Unlike with ApJ, where the flows to A&amp;A and MNRAS have roughly the same amplitude, A&amp;A gets significantly more citations from AJ than MNRAS does. There are very few citations going from AJ to the Physical Review D.</p>
<div id="attachment_49" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/06/aarefs.jpg"><img src="http://anopisthographs.files.wordpress.com/2010/06/aarefs.jpg?w=497&#038;h=220" alt="" title="Citations from A&amp;A going to other journals" width="497" height="220" class="size-full wp-image-49" /></a><p class="wp-caption-text">This diagram shows for a given publication year which journals were cited the most by articles in Astronomy &amp; Astrophysics</p></div>
<p>A&amp;A looks very similar to AJ: again the flow to ApJ has roughly the same amplitude as the self-citation flow. MNRAS is the next largest flow. The flow to ApJ Letters has roughly the same amplitude as the flow to AJ. There is a small but steadily increasing flow to the Physical Review D.</p>
<div id="attachment_50" class="wp-caption aligncenter" style="width: 507px"><a href="http://anopisthographs.files.wordpress.com/2010/06/mnrasrefs.jpg"><img src="http://anopisthographs.files.wordpress.com/2010/06/mnrasrefs.jpg?w=497&#038;h=220" alt="" title="Citations from MNRAS going to other journals" width="497" height="220" class="size-full wp-image-50" /></a><p class="wp-caption-text">This diagram shows for a given publication year which journals were cited the most by articles in the Monthly Notices of the R.A.S.</p></div>
<p>The largest citation flow is to the ApJ, closely followed by the self-citation flow (which is nonetheless significantly smaller, by ~5%). The third largest flow is to A&amp;A. It is interesting to see that MNRAS has the largest citation flow to the Physical Review D of these four astronomy journals. Maybe this is an indicator that this journal has, percentage-wise, the largest cosmology content (or at least theoretical cosmology)?</p>
<p>Clearly, ApJ receives the largest citation flows. It is the most &#8220;international&#8221; journal: it receives the largest number of citations from both American and European journals. Perhaps this means that it is easier for a European astronomer to publish in ApJ than for an American astronomer to publish in either MNRAS or A&amp;A? It will be interesting to see how these citation flows differ per discipline.</p>
]]></content:encoded>
			<wfw:commentRss>http://anopisthographs.wordpress.com/2010/06/25/journal-cititation-statistics-inter-citation-astronomy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/0e3414a44be11ac6609619a4ee391541?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">anopisthographs</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/06/apjrefs.jpg" medium="image">
			<media:title type="html">Citations from the Astrophysical Journal going to other journals</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/06/ajrefs.jpg" medium="image">
			<media:title type="html">Citations from the Astronomical Journal going to other journals</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/06/aarefs.jpg" medium="image">
			<media:title type="html">Citations from A&#38;A going to other journals</media:title>
		</media:content>

		<media:content url="http://anopisthographs.files.wordpress.com/2010/06/mnrasrefs.jpg" medium="image">
			<media:title type="html">Citations from MNRAS going to other journals</media:title>
		</media:content>
	</item>
	</channel>
</rss>
