Discovering “Long-Fuse” Papers – Physics Results

November 20, 2015

I ended the previous blog with a list of “long-fuse” astronomy papers and I realized that, since I have similar results for physics papers too, why not show them here as well? So, that’s what this blog will be about. Just a list of those “long-fuse” physics papers. There is another blog in the making that will explore the statistics a little bit more, so stay tuned…

Here’s a list of 31 “long-fuse” physics papers. There are definitely more, but this is a nice illustrative set with some true gems…

  1. Buckingham, E. (1914), “On Physically Similar Systems; Illustrations of the Use of Dimensional Equations”, Physical Review, vol. 4, Issue 4, pp. 345-376
  2. Washburn, Edward W. (1921), “The Dynamics of Capillary Flow”, Physical Review, vol. 17, Issue 3, pp. 273-283
  3. Robertson, H. P. (1929), “The Uncertainty Principle”, Physical Review, vol. 34, Issue 1, pp. 163-164
  4. Einstein, A.; Podolsky, B.; Rosen, N. (1935), “Can Quantum-Mechanical Description of Physical Reality Be Considered Complete?”, Physical Review, vol. 47, Issue 10, pp. 777-780
  5. Beth, Richard A. (1936), “Mechanical Detection and Measurement of the Angular Momentum of Light”, Physical Review, vol. 50, Issue 2, pp. 115-125
  6. Tolman, Richard C. (1939), “Static Solutions of Einstein’s Field Equations for Spheres of Fluid”, Physical Review, vol. 55, Issue 4, pp. 364-373
  7. Patterson, A. L. (1939), “The Scherrer Formula for X-Ray Particle Size Determination”, Physical Review, vol. 56, Issue 10, pp. 978-982
  8. Snyder, Hartland S. (1947), “Quantized Space-Time”, Physical Review, vol. 71, Issue 1, pp. 38-41
  9. Wallace, P. R. (1947), “The Band Theory of Graphite”, Physical Review, vol. 71, Issue 9, pp. 622-634
  10. Birch, Francis (1947), “Finite Elastic Strain of Cubic Crystals”, Physical Review, vol. 71, Issue 11, pp. 809-824
  11. Snyder, Hartland S. (1947), “The Electromagnetic Field in Quantized Space-Time”, Physical Review, vol. 72, Issue 1, pp. 68-71
  12. Dresselhaus, G. (1955), “Spin-Orbit Coupling Effects in Zinc Blende Structures”, Physical Review, vol. 100, Issue 2, pp. 580-586
  13. Dyakonov, M. I.; Perel, V. I. (1971), “Current-induced spin orientation of electrons in semiconductors”, Physics Letters A, Volume 35, Issue 6, p. 459-460.
  14. Polder, D.; van Hove, M. (1971), “Theory of Radiative Heat Transfer between Closely Spaced Bodies”, Physical Review B, vol. 4, Issue 10, pp. 3303-3314
  15. Vainshtein, A. I. (1972), “To the problem of nonvanishing gravitation mass”, Physics Letters B, Volume 39, Issue 3, p. 393-394.
  16. Boulware, David G.; Deser, S. (1972), “Can Gravitation Have a Finite Range?”, Physical Review D, vol. 6, Issue 12, pp. 3368-3382
  17. Minkowski, Peter (1977), “μ→e γ at a rate of one out of 10 9 muon decays?”, Physics Letters B, Volume 67, Issue 4, p. 421-428.
  18. Konetschny, W.; Kummer, W. (1977), “Nonconservation of total lepton number with scalar bosons”, Physics Letters B, Volume 70, Issue 4, p. 433-435.
  19. Deshpande, Nilendra G.; Ma, Ernest (1978), “Pattern of symmetry breaking with two Higgs doublets”, Physical Review D (Particles and Fields), Volume 18, Issue 7, 1 October 1978, pp.2574-2576
  20. Magg, M.; Wetterich, Ch. (1980), “Neutrino mass problem and gauge hierarchy”, Physics Letters B, Volume 94, Issue 1, p. 61-64.
  21. Sriram Shastry, B.; Sutherland, Bill (1981), “Exact ground state of a quantum mechanical antiferromagnet”, Physica B+C, Volume 108, Issue 1, p. 1069-1070.
  22. Ginsparg, Paul H.; Wilson, Kenneth G. (1982), “A remnant of chiral symmetry on the lattice”, Physical Review D (Particles and Fields), Volume 25, Issue 10, 15 May 1982, pp.2649-2657
  23. Keung, Wai-Yee; Senjanović, Goran (1983), “Majorana Neutrinos and the Production of the Right-Handed Charged Gauge Boson”, Physical Review Letters, Volume 50, Issue 19, May 9, 1983, pp.1427-1430
  24. Silveira, Vanda; Zee, A. (1985), “Scalar Phantoms”, Physics Letters B, Volume 161, Issue 1-3, p. 136-140.
  25. Zee, A. (1986), “Quantum numbers of Majorana neutrino masses”, Nuclear Physics, Section B, Volume 264, p. 99-110.
  26. Wilczek, Frank (1987), “Two applications of axion electrodynamics”, Physical Review Letters (ISSN 0031-9007), vol. 58, May 4, 1987, p. 1799-1802. NASA-supported research.
  27. Haldane, F. D. M. (1988), “Model for a quantum Hall effect without Landau levels: Condensed-matter realization of the “parity anomaly””, Physical Review Letters, Volume 61, Issue 18, October 31, 1988, pp.2015-2018
  28. Deutsch, J. M. (1991), “Quantum statistical mechanics in a closed system”, Physical Review A (Atomic, Molecular, and Optical Physics), Volume 43, Issue 4, February 15, 1991, pp.2046-2049
  29. Siegel, W. (1993), “Superspace duality in low-energy superstrings”, Physical Review D (Particles, Fields, Gravitation, and Cosmology), Volume 48, Issue 6, 15 September 1993, pp.2826-2837
  30. Takeda, Kyozaburo; Shiraishi, Kenji (1994), “Theoretical possibility of stage corrugation in Si and Ge analogs of graphite”, Physical Review B (Condensed Matter), Volume 50, Issue 20, November 15, 1994, pp.14916-14922
  31. Srednicki, Mark (1994), “Chaos and quantum thermalization”, Physical Review E (Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics), Volume 50, Issue 2, August 1994, pp.888-901



Discovering “Long-Fuse” Papers – A First Exploration

November 16, 2015

On average, articles get cited in a pretty predictable pattern. An article gets published, it takes a little while to get “absorbed” by the scientific community, and then, if it resonates, it starts getting citations. Something along the lines of figure 4 in my paper “Effect of E-printing on Citation Rates in Astronomy and Physics”, Journal of Electronic Publishing, vol. 9, p. 2:


In the pre-Internet days, this meant that people had to work their way through abstract books and read the tables of contents of volumes on the shelves of their local library. The principle, nevertheless, is the same, just with a different time scale: when you have to physically work your way through shelves and volumes, it obviously takes more time to gather the bibliography for a paper. Even when you allow for these longer time scales, there are still papers that take longer, way longer, to be cited than others. Todd Lauer coined the phrase “long-fuse” papers on Twitter (in a discussion with Joss Bland-Hawthorn) and wondered about how to detect these. That is what this blog is about: one attempt at finding them. The figure below shows one fine example of such a “long-fuse” publication: “On the Masses of Nebulae and of Clusters of Nebulae”, by F. Zwicky (1937), Astrophysical Journal, vol. 86, p.217

citations for 1937ApJ....86..217Z

Before starting this little journey, first a caveat: when you go back into the pre-Internet era, especially decades before its inception, it is much harder to compile citation data. Some journals had references in footnotes, rather than in designated bibliographies, and in all cases references need to be manually typed in or extracted from the OCR of digitized material. In short, there are bound to be significant gaps in older citation data. But, even with this taken into account, there are still publications that can be described as “long-fuse” publications: something discussed in the publication becomes (highly) relevant well after its appearance, up to a decade, or even decades, later. The Zwicky paper is a prime example.

So, how do you go about finding these publications? The key ingredient here is the fact that publications really are nodes in a directed graph, specifically a directed graph with a very distinct “flow”: publications cite material that has already been published, i.e. back in time. What makes “long-fuse” papers different within the context of the citation graph? Temporally speaking, the “distance” to the bulk of their citations is relatively large. The next step is to turn this observation into a quantitative statement. Since we are looking at publications spread out over decades, it seems logical to apply some sort of normalization or scaling. Let’s rescale all data in terms of “age” rather than absolute years. Within this context, it seems to make sense to compare the age of a publication to the average age of its associated citation distribution. Just to state the obvious: the “paper age” of a publication from, say, 1937, is 79 years (taking the current year, 2015, as reference point, and starting with an age of 1). The “citation age” for a citation from a 2012 publication is 76 years (it was published 75 years after the cited paper, and citations from the same year as the cited paper are taken to have an age of 1 year).
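As a minimal sketch of these two definitions (the function names are mine, purely for illustration), both ages are simple year differences offset by one:

```python
def paper_age(publication_year, reference_year):
    """Age of a paper in years, counting the publication year itself as age 1."""
    return reference_year - publication_year + 1


def citation_age(cited_year, citing_year):
    """Age of a citation, counting a citation from the publication year as age 1."""
    return citing_year - cited_year + 1


# The 1937 example from the text, with 2015 as the reference year
example_paper_age = paper_age(1937, 2015)        # 79
example_citation_age = citation_age(1937, 2012)  # 76
```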

Let’s explore the quantity F, defined as the average citation age divided by the paper age. What is its distribution like? To explore this, I took the following set of journals: The Astrophysical Journal (including Letters and Supplement Series), The Astronomical Journal, Monthly Notices of the R.A.S., Astronomy & Astrophysics, Physical Review D, Physical Review E, Reviews of Modern Physics, Nuclear Physics A and Nuclear Physics B. All the publications in this set were filtered on the following two, rather arbitrary, criteria: the minimum paper age is 20 years and the minimum number of citations is 300. These filters were chosen for no particular reason, just to turn a set of about half a million publications into a much smaller set for this initial exploration. The resulting filtered publication set consists of 2750 publications. For this set, the quantity F turns out to have the frequency distribution shown in the figure below.

frequency distribution of average citation age over paper age
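To make the definition of F and the two filters concrete, here is a rough sketch. It assumes that for each paper you have its publication year and a plain list with the publication year of every citing paper; the function names and that data layout are illustrative assumptions, not how the actual analysis was coded.

```python
from statistics import mean

MIN_PAPER_AGE = 20    # years, as in the text
MIN_CITATIONS = 300   # as in the text


def fuse_ratio(publication_year, citing_years, reference_year):
    """F = (average citation age) / (paper age), with both ages counted from 1."""
    paper_age = reference_year - publication_year + 1
    citation_ages = [year - publication_year + 1 for year in citing_years]
    return mean(citation_ages) / paper_age


def passes_filters(publication_year, citing_years, reference_year):
    """Apply the minimum paper age and minimum citation count filters."""
    paper_age = reference_year - publication_year + 1
    return paper_age >= MIN_PAPER_AGE and len(citing_years) >= MIN_CITATIONS
```

A paper that collects all of its citations in its first few years ends up with a small F, while a paper whose citations arrive mostly in recent years has an F close to 1.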

The phenomenon of “long-fuse” papers is not very likely to be the result of some sort of “phase transition” in the citation graph evolution; there probably is a smooth transition from the realm of “long-fuse” papers to that of “regularly cited” papers. We can’t point to a region in the figure above and say: this region represents the component of “long-fuse” papers. Nevertheless, it seems plausible that the “long-fuse” papers live in the right tail of this distribution. Since this is just a first exploration, I’ll just pick a threshold and see what follows.

How many papers remain from the 2750 papers if I add the additional requirement that F > 0.75? A total of 60 papers remain, 32 of which are from the set of core astronomy journals. Let’s explore a couple of them! The following plot shows the normalized number of citations (normalized by the number of citations on November 16, 2015) as a function of year for 5 papers on astronomy-related subjects.

citations for 5 long-fuse astronomy papers
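The threshold step and the normalization used for the plot can be sketched as follows. This is an illustration under two assumptions of mine: F values are held in a dictionary keyed by some paper identifier, and the plotted quantity is the number of citations per year divided by the total number of citations to date (rather than a cumulative count).

```python
from collections import Counter

F_THRESHOLD = 0.75  # the cut-off picked in the text


def long_fuse_candidates(f_by_paper, threshold=F_THRESHOLD):
    """Identifiers of papers whose F exceeds the threshold, highest F first."""
    selected = [paper for paper, f in f_by_paper.items() if f > threshold]
    return sorted(selected, key=lambda paper: f_by_paper[paper], reverse=True)


def normalized_citation_history(citing_years):
    """Citations per year, divided by the total number of citations to date."""
    counts = Counter(citing_years)
    total = len(citing_years)
    return {year: counts[year] / total for year in sorted(counts)}
```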

These 5 papers are:

  1. Plummer, H. C. (1911), “On the problem of distribution in globular star clusters”, Monthly Notices of the Royal Astronomical Society, Vol. 71, p.460
  2. Bondi, H. (1947), “Spherically symmetrical models in general relativity”, Monthly Notices of the Royal Astronomical Society, Vol. 107, p.410
  3. Gödel, Kurt (1949), “An Example of a New Type of Cosmological Solutions of Einstein’s Field Equations of Gravitation”, Reviews of Modern Physics, vol. 21, Issue 3, pp. 447
  4. Boulware, David G. & Deser, S. (1972), “Can Gravitation Have a Finite Range?”, Physical Review D, vol. 6, Issue 12, p. 3368
  5. Wetterich, C. (1988), “Cosmology and the fate of dilatation symmetry”, Nuclear Physics B, Volume 302, Issue 4, p. 668

Below is a similar plot for 5 physics papers.

Normalized citations for 5 long-fuse physics papers

The 5 papers are:

  1. Everett, Hugh (1957), “‘Relative State’ Formulation of Quantum Mechanics”, Reviews of Modern Physics, vol. 29, Issue 3, pp. 454
  2. Bell, John S. (1966), “On the Problem of Hidden Variables in Quantum Mechanics”, Reviews of Modern Physics, vol. 38, Issue 3, pp. 447
  3. van Dam, H. & Veltman, M. (1970), “Massive and mass-less Yang-Mills and gravitational fields”, Nuclear Physics B, Volume 22, Issue 2, p. 397
  4. Nambu, Yoichiro (1973), “Generalized Hamiltonian Dynamics”, Physical Review D, vol. 7, Issue 8, pp. 2405
  5. Deshpande, Nilendra G. & Ma, Ernest (1978), “Pattern of symmetry breaking with two Higgs doublets”, Physical Review D, Volume 18, Issue 7, pp.2574

So, this initial exploration looks promising! Next, I should attempt to look in more detail at the various assumptions and seemingly arbitrary choices of variables. The choices for minimum paper age and number of citations are probably arbitrary by nature, but it seems the selected cut-off value of 0.75 for F can definitely be explored a bit further.

Finally, here is the full set of 32 papers from the astronomy data set (note that the Zwicky paper was reproduced in this analysis):

  1. Plummer, H. C. (1911), “On the problem of distribution in globular star clusters”, Monthly Notices of the Royal Astronomical Society, Vol. 71, p.460-470
  2. von Zeipel, H. (1924), “The radiative equilibrium of a rotating system of gaseous masses”, Monthly Notices of the Royal Astronomical Society, Vol. 84, p.665-683
  3. Hubble, E. P. (1926), “Extragalactic nebulae.”, Astrophysical Journal, 64, 321-369 (1926)
  4. Zwicky, F. (1937), “On the Masses of Nebulae and of Clusters of Nebulae”, Astrophysical Journal, vol. 86, p.217
  5. Henyey, L. G.; Greenstein, J. L. (1941), “Diffuse radiation in the Galaxy”, Astrophysical Journal, vol. 93, p. 70-83 (1941).
  6. Chandrasekhar, S. (1943), “Dynamical Friction. I. General Considerations: the Coefficient of Dynamical Friction.”, Astrophysical Journal, vol. 97, p.255
  7. Bondi, H.; Hoyle, F. (1944), “On the mechanism of accretion by stars”, Monthly Notices of the Royal Astronomical Society, Vol. 104, p.273
  8. Bondi, H. (1947), “Spherically symmetrical models in general relativity”, Monthly Notices of the Royal Astronomical Society, Vol. 107, p.410
  9. Bondi, H. (1952), “On spherically symmetrical accretion”, Monthly Notices of the Royal Astronomical Society, Vol. 112, p.195
  10. Salpeter, Edwin E. (1955), “The Luminosity Function and Stellar Evolution.”, Astrophysical Journal, vol. 121, p.161
  11. Bonnor, W. B. (1956), “Boyle’s Law and gravitational instability”, Monthly Notices of the Royal Astronomical Society, Vol. 116, p.351
  12. Schmidt, Maarten (1959), “The Rate of Star Formation.”, Astrophysical Journal, vol. 129, p.243
  13. Kozai, Yoshihide (1962), “Secular perturbations of asteroids with high inclination and eccentricity”, Astronomical Journal, Vol. 67, p. 591
  14. Refsdal, S. (1964), “On the possibility of determining Hubble’s parameter and the masses of galaxies from the gravitational lens effect”, Monthly Notices of the Royal Astronomical Society, Vol. 128, p.307
  15. Neupert, Werner M. (1968), “Comparison of Solar X-Ray Line Emission with Microwave Emission during Flares”, Astrophysical Journal, vol. 153, p.L59
  16. Bardeen, James M.; Press, William H.; Teukolsky, Saul A. (1972), “Rotating Black Holes: Locally Nonrotating Frames, Energy Extraction, and Scalar Synchrotron Radiation”, Astrophysical Journal, Vol. 178, pp. 347-370 (1972)
  17. Sneden, C. (1973), “The nitrogen abundance of the very metal-poor star HD 122563.”, Astrophysical Journal, Vol. 184, p. 839 – 849
  18. Purcell, Edward M.; Pennypacker, Carlton R. (1973), “Scattering and Absorption of Light by Nonspherical Dielectric Grains”, Astrophysical Journal, Vol. 186, pp. 705-714 (1973)
  19. Whelan, John; Iben, Icko, Jr. (1973), “Binaries and Supernovae of Type I”, Astrophysical Journal, Vol. 186, pp. 1007-1014 (1973)
  20. Tayler, R. J. (1973), “The adiabatic stability of stars containing magnetic fields-I.Toroidal fields”, Monthly Notices of the Royal Astronomical Society, Vol. 161, p. 365 (1973)
  21. Petrosian, V. (1976), “Surface brightness and evolution of galaxies”, Astrophysical Journal, vol. 209, Oct. 1, 1976, pt. 2, p. L1-L5.
  22. Blandford, R. D.; Znajek, R. L. (1977), “Electromagnetic extraction of energy from Kerr black holes”, Monthly Notices of the Royal Astronomical Society, vol. 179, May 1977, p. 433-456.
  23. Weidenschilling, S. J. (1977), “Aerodynamics of solid bodies in the solar nebula”, Monthly Notices of the Royal Astronomical Society, vol. 180, July 1977, p. 57-70. Research supported by the Carnegie Corp.
  24. Gingold, R. A.; Monaghan, J. J. (1977), “Smoothed particle hydrodynamics – Theory and application to non-spherical stars”, Monthly Notices of the Royal Astronomical Society, vol. 181, Nov. 1977, p. 375-389.
  25. Cash, W. (1979), “Parameter estimation in astronomy through application of the likelihood ratio”, Astrophysical Journal, Part 1, vol. 228, Mar. 15, 1979, p. 939-947.
  26. Hut, P. (1981), “Tidal evolution in close binary systems”, Astronomy and Astrophysics, vol. 99, no. 1, June 1981, p. 126-140.
  27. Arnett, W. D. (1982), “Type I supernovae. I – Analytic solutions for the early part of the light curve”, Astrophysical Journal, Part 1, vol. 253, Feb. 15, 1982, p. 785-797.
  28. Soltan, A. (1982), “Masses of quasars”, Monthly Notices of the Royal Astronomical Society, vol. 200, July 1982, p. 115-122.
  29. Milgrom, M. (1983), “A modification of the Newtonian dynamics as a possible alternative to the hidden mass hypothesis”, Astrophysical Journal, Part 1 (ISSN 0004-637X), vol. 270, July 15, 1983, p. 365-370. Research supported by the U.S.-Israel Binational Science Foundation.
  30. Li, T.-P.; Ma, Y.-Q. (1983), “Analysis methods for results in gamma-ray astronomy”, Astrophysical Journal, Part 1 (ISSN 0004-637X), vol. 272, Sept. 1, 1983, p. 317-324.
  31. Lin, D. N. C.; Papaloizou, John (1986), “On the tidal interaction between protoplanets and the protoplanetary disk. III – Orbital migration of protoplanets”, Astrophysical Journal, Part 1 (ISSN 0004-637X), vol. 309, Oct. 15, 1986, p. 846-857.
  32. O’Donnell, James E. (1994), “Rnu-dependent optical and near-ultraviolet extinction”, Astrophysical Journal, Part 1 (ISSN 0004-637X), vol. 422, no. 1, p. 158-163

Amazon Container Services, Micro Services, Consul – Backing Up Your Key/Value Store

November 8, 2015

Here’s the situation in a nutshell: you have a service running in the Amazon EC2 Container Services environment. Your architecture consists of a set of nodes, as shown in the sketch below.

Everything “lives” inside a private address space “in the Cloud”, on Amazon Web Services (AWS). The cluster that resides in the AWS EC2 Container Services (ECS) consists of “nodes”, each of which is an “instance” launched from the AWS EC2 environment. These instances are of a certain “type” (e.g. “t2.medium”), which determines, for example, how much disk space and computing power will be available, and they have networking characteristics attributed to them (like private and potentially public address spaces). Within each of these nodes inside our cluster, a number of Docker containers are running. You don’t have to use Docker as a means of provisioning/launching services, but that’s a different topic. Each of these Docker containers, in this example, represents a micro service. That’s the 20,000 ft view, more or less.

These micro services need to get their configuration information from somewhere. Assuming that the recipe for building one of those micro services is stored on GitHub, for example, some of the configuration information can be packaged with that repository. Sensitive information, like API keys and other secrets, obviously cannot be stored in such a publicly accessible location. With Python projects, people often use one module for general information and a second module (listed in the .gitignore file) for, e.g., overwriting stub keys with the actual secret keys. When you deploy your micro service, you need to take care that you deploy (the right version of) that git-ignored module along with your code. When you have a lot of micro services, this clearly is a pain and mistakes are bound to happen. That is where a service like Consul is very handy. Consul is a lot more than just a key/value store for configuration information, by the way. When you deploy Consul alongside your micro services on AWS, you have a central configuration service, accessible to all your micro services. Assuming you have configured things correctly, the Consul service is only accessible within the private address space used within your ECS cluster. In other words, nobody in the outside world is able to access your configuration stored in Consul.

It would be nice to make frequent backups of your Consul key/value store. Consul does not come with something built in, so you need to roll your own backup solution. In essence you need something that is able to create a dump of the contents of the store. You can write your own utility around an HTTP request like

curl http://<Consul host>:<Consul port>/v1/kv/?recurse

and do all the “dirty work” yourself. Or, you can use a Python client like consulate to do that “dirty work” for you. In other words, getting the data needed for your backup is pretty easy. How you want to implement it is something that needs a little bit more thought. You want your backups done frequently, so a cron job seems like the natural solution. Where do you run your cron job? Do you package it with one of your micro services? Do you package it with your Consul setup? Do you create a separate “stuff” container that runs all kinds of management scripts? How about initiating the backup outside the Cloud? A minor additional question is: where do I store my backups? On AWS (S3) or locally?
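For the do-it-yourself route, the “dirty work” is mostly decoding: the key/value endpoint returns a JSON list in which every value is base64-encoded. Below is a minimal sketch using only the Python standard library. The function names are mine, and it assumes that all stored values are UTF-8 text.

```python
import base64
import json
from urllib.request import urlopen


def decode_kv_entries(entries):
    """Turn the JSON list returned by /v1/kv/?recurse into a {key: value} dict.

    Consul returns each value base64-encoded; a null value means the key
    has no value. Values are assumed to be UTF-8 text here.
    """
    decoded = {}
    for entry in entries:
        raw = entry.get("Value")
        decoded[entry["Key"]] = None if raw is None else base64.b64decode(raw).decode("utf-8")
    return decoded


def dump_consul_kv(host, port=8500):
    """Fetch all key/value pairs over HTTP (8500 is Consul's default HTTP port)."""
    url = "http://{}:{}/v1/kv/?recurse".format(host, port)
    with urlopen(url) as response:
        return decode_kv_entries(json.loads(response.read().decode("utf-8")))
```

The resulting dictionary can be written out with json.dump, which already gives you a readable backup file.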

To start with the minor additional question of “where”: it seems that AWS S3 is the most convenient answer. Your backup will live in the same private address space, more readily available for a restore, and storage on AWS is dirt cheap. Storing it locally involves a little bit more coding, but works fine too. If you want to look inside a backup file, you would have to download it first, from S3, so that may be a slight advantage of having it stored locally.

Where to run the cron job is to some degree a matter of taste too. Packaging it with an existing service (like a micro service or Consul itself) seems like bad design to me. The whole idea of a micro service is that it does X and X alone, no room for “oh, and a little bit of Y too”. How about having a “management container” or “management node” within your cluster? Besides doing backups, you probably want to run all kinds of health checking, metrics gathering and other scripts. This choice is a bit more philosophical. Unless you have some sort of management UI on top of your AWS environment (like Rancher), which does all the communicating with AWS “under the hood”, you probably will need to run a local script if you want to do something like restoring a Consul backup. If that is the case, maybe you want to have your backup utility run locally too, just from a “completeness” point of view. Backup and restore are probably just two modes of one utility. Sometimes containers crash, which could theoretically stop your backups. But there is a solution for that: a new container will get spun up, with all your management services. My personal choice was to have the backup initiated locally, through a cron job running on an on-prem server. How do you make that happen?

That is where the concept of “Task Definitions” comes in. A “Task Definition” in the AWS ECS environment is a recipe to “do something”. This “something” can either be a “service” or a “task”. A “service” is in general something you expect to keep running; think “micro service” here, for example. You start the service and it keeps running till “something happens”; normally this means that you stop the service or restart the service (after an update). A “task” is in general a more transient event; you start it, it makes something happen and then exits. This is exactly what we need. So, what ingredients do you need for a “Task Definition”? The main ingredient, really, is an “image”, specifically a “Docker image”. A Docker image is a read-only template. For example, an image could contain an Ubuntu operating system with Apache and your web application installed. Images are used to create Docker containers. Docker images are the build component of Docker. In its turn, a Docker image is created from the recipe listed in a file called the Dockerfile. Docker images can be stored in various ways: on Docker Hub, in a local repository, in a third-party repository or within AWS itself.

In our case, when the Docker container is created, it has one very clear purpose:

  1. connect to the Consul key/value store,
  2. retrieve all records,
  3. store the records in a file
  4. ship that file to a well-defined bucket on AWS S3
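The four steps above can be sketched in a few lines of Python. This is an illustration, not the script from the repository: the function name, the file naming scheme and the idea of passing in the “fetch” and S3 pieces as arguments are my assumptions. The S3 object only needs an upload_file(filename, bucket, key) method, which is what the boto3 S3 client provides.

```python
import json
import os
import time


def backup_to_s3(fetch_records, s3_client, bucket, directory="/tmp"):
    """Dump the key/value store to a timestamped JSON file and ship it to S3.

    fetch_records: callable returning a JSON-serializable dump of the store
    s3_client:     object with upload_file(filename, bucket, key),
                   for example boto3.client("s3")
    """
    records = fetch_records()                       # steps 1 and 2
    key = time.strftime("consul-backup-%Y%m%dT%H%M%SZ.json", time.gmtime())
    path = os.path.join(directory, key)
    with open(path, "w") as handle:                 # step 3
        json.dump(records, handle, indent=2, sort_keys=True)
    s3_client.upload_file(path, bucket, key)        # step 4
    return path, key
```

Passing the S3 client in as an argument makes it easy to try the function locally with a stand-in object before pointing it at a real bucket.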

To make this happen, we need to translate this into a Dockerfile. In the FROM clause in the Dockerfile, you specify the base image on which the Docker container is built (like Ubuntu, CentOS, Debian, …). I chose phusion, but that is not necessarily the best choice. I probably could have chosen something like busybox. One difference between the various choices is the size of the resulting Docker image; this could be a critical factor. I’ll probably spend a future blog on this.

The rest of the ingredients is listed in this GitHub repo. Using the Dockerfile, I created a Docker image and stored it on Docker Hub. Now we can use this image in the Task Definition:

{
  "requiresAttributes": [],
  "taskDefinitionArn": "arn:aws:ecs:<region>:<identifier>:task-definition/consul-backup:12",
  "status": "ACTIVE",
  "revision": 12,
  "containerDefinitions": [
    {
      "mountPoints": [
        {
          "containerPath": "/tmp",
          "sourceVolume": "tmp"
        }
      ],
      "name": "consul-backup",
      "environment": [
        {
          "name": "SERVICE_TAGS",
          "value": "staging"
        }
      ],
      "image": "adsabs/consul-backup:v1.0.8"
    }
  ],
  "volumes": [
    {
      "host": {
        "sourcePath": "/tmp"
      },
      "name": "tmp"
    }
  ],
  "family": "consul-backup"
}

I have removed a lot of details from this, to keep things simple and focus on the most important aspects. The Docker image is the value of the “image” entry above. By just listing adsabs/consul-backup, AWS “knows” that it needs to look on Docker Hub if it cannot find the image locally. By adding a label after the colon, a specific version will get downloaded. The Docker container that gets created when you execute “Run Task” within AWS ECS for this particular Task Definition will mount “/tmp” from the node on “/tmp” in the container. I wanted this in order to keep a log file that would stick around, even after the Docker container was removed.

When “Run Task” is executed, the Docker container is created, and the command specified after “CMD” in the Dockerfile is executed. This runs the Python backup script, which looks at the environment variables that tell it to do either a backup or a restore. I think the source code of that script is pretty self-explanatory. Instead, let’s look at how to make this setup into a local cron job.

The essence of this is the “boto3” Python module, specifically the component that deals with AWS ECS. In essence you would need a method along the following lines:

def run_task(cluster, desiredCount, taskDefinition):
    """
    Thin wrapper around boto3 ecs.update_service;
    :param cluster: The short name or full Amazon Resource Name (ARN) of the cluster that your service is running on. If you do not specify a cluster, the default cluster is assumed.
    :param desiredCount: The number of instantiations of the task that you would like to place and keep running in your service.
    :param taskDefinition: The family and revision (family:revision ) or full Amazon Resource Name (ARN) of the task definition that you want to run in your service. If a revision is not specified, the latest ACTIVE revision is used. If you modify the task definition with UpdateService , Amazon ECS spawns a task with the new version of the task definition and then stops an old task after the new version is running.
    """
    client = get_boto_session().client('ecs')

where cluster refers to the name of the cluster and taskDefinition would be something like “consul-backup:12” (where 12 refers to the “revision number” of the Task Definition). The method “get_boto_session()” is something like

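from boto3.session import Session
def get_boto_session():
    """
    Gets a boto3 session using credentials stored in app.config; assumes an
    app context is active
    :return: boto3.session instance
    """
    return Session(
        # credential keyword arguments read from app.config (elided in the original)
    )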

Now we have all the ingredients to initiate backups of the Consul key/value store from a local server.
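To round this off, here is a hedged sketch of what the call that actually starts the task could look like. It is my illustration rather than the code from the repository: it assumes the ECS client’s run_task call (with its cluster, taskDefinition and count arguments) is the one to use, and the function name start_backup_task is made up. The client is passed in as an argument, so the function can be tried out without AWS credentials.

```python
def start_backup_task(ecs_client, cluster, task_definition, count=1):
    """Ask ECS to run the one-off backup task; return the ARNs of the started tasks.

    ecs_client is expected to behave like boto3's ECS client, for example
    the object returned by get_boto_session().client('ecs').
    """
    response = ecs_client.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        count=count,
    )
    # ECS reports tasks it could not place in a "failures" list
    failures = response.get("failures", [])
    if failures:
        raise RuntimeError("ECS could not start the task: {}".format(failures))
    return [task["taskArn"] for task in response.get("tasks", [])]
```

The cron job on the local server then only has to run a small script that creates the client and calls this function with the cluster name and something like “consul-backup:12”.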

Linking to Data – Effect on Citation Rates in Astronomy

June 3, 2011

In the paper Effect of E-printing on Citation Rates in Astronomy and Physics we asked ourselves the question whether the introduction of the arXiv e-print repository had any influence on citation behavior. We found significant increases in citation rates for papers that appear as e-prints prior to being published in scholarly journals.

This is just one example of how publication practices influence article metrics (citation rates, usage, obsolescence, to name a few). Here we will be examining one practice that is very relevant to astronomy: is there a difference, from a bibliometric point of view, between articles that link to data and articles that do not? Specifically, is there a difference in citation rates between these classes of articles?

Besides being interesting from a purely academic point of view, this question is also highly relevant for the process of “furthering science”. Data sharing not only helps the process of verification of claims, but also the discovery of new findings in archival data. There seems to be a consensus that sharing data is a Good Thing. Let’s ignore the “why” and “how”, and focus on the sharing. You need to have both a willingness and a publication mechanism in order to create a “practice”. This is where citation rates come in: if we can say that papers with links to data get higher citation rates, this might increase the willingness of scientists to take the extra steps of linking data sources to their publications.

Using the data holdings of the SAO/NASA Astrophysics Data System we can do the analysis and see if articles with links to data have different citation rates. For the analysis, we used the articles published in The Astrophysical Journal (including Letters and Supplement), The Astronomical Journal, The Monthly Notices of the R.A.S. and Astronomy & Astrophysics (including Supplement), during the period 1995 through 2000. Next we determined the set of 50 most frequently used keywords in articles with data links. The articles to be used for the analysis were obtained by requiring that they have at least 3 keywords in common with that set of 50 keywords. This resulted in a set of 3814 articles with data links and 7218 articles without data links. A random selection of 3814 articles was then extracted from this set of 7218 articles, so that both groups have the same size.
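The selection procedure can be sketched as follows. This is only an illustration of the steps described above: it assumes each article is a dictionary with a "keywords" list, and the function names are made up for this example.

```python
import random
from collections import Counter


def top_keywords(articles_with_data_links, n=50):
    """The n most frequent keywords among articles that have data links."""
    counts = Counter(
        keyword
        for article in articles_with_data_links
        for keyword in set(article["keywords"])
    )
    return {keyword for keyword, _ in counts.most_common(n)}


def select_comparable(articles, reference_keywords, min_overlap=3):
    """Keep articles sharing at least min_overlap keywords with the reference set."""
    return [
        article for article in articles
        if len(set(article["keywords"]) & reference_keywords) >= min_overlap
    ]


def matched_control_sample(control_articles, size, seed=None):
    """Random subset of the control group, as large as the data-link group."""
    return random.Random(seed).sample(control_articles, size)
```

Applying select_comparable to both groups, and then matched_control_sample to the articles without data links, gives two equally sized sets on comparable subjects.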

First, we’ll create a diagram just like the one in figure 4 of the paper Effect of E-printing on Citation Rates in Astronomy and Physics, which shows the number of citations after publication as an ensemble average. In this figure 4 we used the mean number of citations (over the entire data set) to normalize the citations. For our current analysis we will use the total number of citations for normalization.

Our analysis shows that articles with data links are indeed cited more than articles without these links. We can say a little bit more by looking at the cumulative citation distribution. The figure below shows this cumulative distribution, normalized by the total number of citations for articles without data links, 120 months after publication.

This graph shows that for this data set, articles with data links acquired 20% more citations (compared to articles without these links).
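The normalization behind this graph can be sketched as follows, assuming you have, for each group, the number of citations received in each month after publication (summed over all articles in the group). The function names and the input layout are assumptions made for this illustration.

```python
from itertools import accumulate


def cumulative_citations(monthly_counts):
    """Running total of citations, one entry per month since publication."""
    return list(accumulate(monthly_counts))


def normalized_cumulative(with_links_monthly, without_links_monthly):
    """Both cumulative curves, divided by the final total of the group without links."""
    norm = sum(without_links_monthly)
    with_curve = [count / norm for count in cumulative_citations(with_links_monthly)]
    without_curve = [count / norm for count in cumulative_citations(without_links_monthly)]
    return with_curve, without_curve
```

With this normalization the curve for articles without data links ends at 1 by construction, so the end point of the other curve directly gives the relative difference: a final value of 1.2 corresponds to 20% more citations.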

Google Books and the Importance of Quality Control

November 10, 2010

I’ve stopped counting the times when I used Google Books and cringed. To be honest, I have to say that I have mostly limited myself to digitized serials, serials in astronomy and physics, to be precise. I’m going to ignore bad metadata, which in itself would be a source of teeth grinding and hair pulling. I regularly find myself laughing out loud at the subject headings they came up with. Actually, it’s pretty sad.

No, my main source of frustration is bad digitization. Missing pages, partially scanned pages, pages showing body parts (so far, I’ve only seen fingers and hands), etc etc. Here you see a fine example of what I am referring to. I don’t know whose hand this is, but I would feel deeply ashamed if I were this person. Digitization is serious business, especially when your goal is preservation. When publications contain fold-outs, these need to be properly scanned, for example. I totally realize that with an enormous digitization effort like Google’s, quality control is bound to be hard, if not impossible. In the last year, about half a million scans went through my hands (figuratively speaking). I know how hard it is to check for missing pages and I also know that you simply cannot check every single image.

In addition to bad scans, I think that the search interface of Google Books, well… errr… sucks. The results returned seem inconsistent, probably as a result of bad metadata (and bad indexing?). Navigating through results and trying to drill down or find out which other volumes were digitized is a major undertaking and often impossible.

Clearly this was a “quantity over quality” project, and quality clearly lost.

Indexing Matters – The Importance of Search Engine Behavior

•July 21, 2010 • Leave a Comment

What a search engine returns on a user query largely, if not completely, determines its usefulness for that user. Looking at usage bibliometrics allows us to classify the behavior of different types of users, for example (see e.g. Usage Bibliometrics by Michael J. Kurtz and Johan Bollen). There are voices claiming that Google Scholar is a “threat” to scholarly information retrieval services (like the ADS and WoS, for example). The main reason why this is not the case becomes clear when we look at usage statistics. Here I will make a comparison of readership patterns from ADS and Google Scholar queries, as observed in ADS’s access logs. These readership patterns show the obsolescence of astronomy articles, that is, how their use declines with age, for ADS and Google Scholar users. In order to zoom in on people who use ADS professionally, I will only consider ADS users who query ADS 10 or more times per month. The journals I have used in the analysis are the main astronomy journals: Astrophysical Journal, Astronomical Journal, Monthly Notices of the R.A.S. and Astronomy & Astrophysics. In the figure below, a comparison is made between readership of frequent ADS users (read “professional astronomers”) and Google Scholar users.

Comparison of readership patterns from ADS and Google Scholar queries, as observed in ADS’s access logs. The red line marked with open circles shows the readership use by people using the ADS search engine. The blue line marked with ‘x’ corresponds to the readership use by people who used the Google Scholar engine. The orange line marked with closed circles shows the citation rate to the articles, while the purple line marked with ‘+’ represents their total number of citations.

All the quantities in the figure above are on a per article basis and have been normalized by the 1987 value. This was done so that we can compare apples with apples.
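
The normalization itself is simple enough to sketch. The example below assumes each curve is stored as a dictionary keyed by year; that layout, the function name and the toy numbers are made up for illustration and are not taken from the ADS logs.

```python
def normalize_to_reference(series, reference_year=1987):
    """Divide every value in a per-article series by its reference-year value."""
    ref = series[reference_year]
    return {year: value / ref for year, value in series.items()}


# Toy values only, invented for this example.
reads_per_article = {1987: 4.0, 1997: 8.0, 2007: 20.0}
print(normalize_to_reference(reads_per_article))
```

After this step every curve passes through 1 at 1987, so reads and citations, which live on very different scales, can be drawn in the same figure.
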
The fact that the obsolescence through Google Scholar is strongly correlated with the total number of citations is no coincidence: this is a direct consequence of the correlation between the PageRank and the total number of citations (see e.g. Chen et al. (2007) and Fortunato et al. (2006)). The consequence of this correlation is the following: Google Scholar does not provide what professional astronomers (and other frequent users) want. Google Scholar readership correlates with the reading habits of students. In short, Google Scholar currently is no threat to scholarly information retrieval services.


  • Kurtz, Michael J. and Bollen, Johan (2010), “Usage Bibliometrics”, Annual Review of Information Science and Technology, vol 44, p. 3-64
  • Henneken, E. et al. (2009), “Use of astronomical literature – A report on usage patterns”, Journal of Informetrics, vol. 3, iss. 1, p. 1
  • Fortunato, S., Flammini, A., & Menczer, F. (2006), “Scale-Free Network Growth by Ranking”, Physical Review Letters, 96, 218701
  • Chen, P., Xie, H., Maslov, S., and Redner, S. (2007), “Finding scientific gems with Google’s PageRank algorithm”, Journal of Informetrics, 1, 8

The Art of Parsing – Python – Removing Duplicates

•July 13, 2010 • Leave a Comment

When processing large amounts of data, for example when building a recommender system or an index, there is often a need to remove duplicates from a list of e.g. words. As always, there are many ways to solve a problem, even when you stick to one programming language (which in my case is Python). It is always good to ask yourself the question: how does this method scale? Especially when you work with large data sets, this is something to keep in mind. I was pretty happy with the following method to remove duplicates from a list:

    def uniq(inlist):
        # Only adjacent duplicates are dropped, so the input list must be sorted
        if not inlist:
            return inlist
        outlist = [inlist[0]]
        for i in range(1, len(inlist)):
            if inlist[i] != inlist[i-1]:
                outlist.append(inlist[i])
        return outlist

But then I decided to try

    def uniq(inlist):
        # The built-in set replaces sets.Set (the sets module is gone in Python 3)
        # Note: a set does not preserve the original order of the items
        return list(set(inlist))

which turned out to be a significant speedup. And the code is much cleaner too 🙂 The graphs below show the speedup:

This graph compares two Python methods for removing duplicates from a list

The graph above shows the processing time for removing duplicates from a list as a function of list size for the two methods described above (“Method 1” is the second method, the one using a set). The graph below shows the relative speedup:

This graph shows how much faster Method 1 is (the method using a set)
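
If you want to repeat this kind of comparison yourself, something along the following lines should work. This is a rough sketch and not the script behind the graphs above: the function names, the list sizes and the way the test data is generated are all assumptions. It uses the built-in set type, which replaced the old sets module, and it hands the loop-based method a list that is already sorted, so the cost of sorting is not included in the timing.

```python
import random
import timeit


def uniq_sorted(inlist):
    """Loop-based method: drop adjacent duplicates (input must be sorted)."""
    if not inlist:
        return inlist
    outlist = [inlist[0]]
    for i in range(1, len(inlist)):
        if inlist[i] != inlist[i - 1]:
            outlist.append(inlist[i])
    return outlist


def uniq_set(inlist):
    """Set-based method: order of the result is not guaranteed."""
    return list(set(inlist))


def time_methods(size, repeats=5, seed=0):
    """Return the best-of-`repeats` run time in seconds for both methods."""
    rng = random.Random(seed)
    # Sorted integers with plenty of duplicates (an arbitrary choice of test data).
    data = sorted(rng.randrange(size // 2 + 1) for _ in range(size))
    t_loop = min(timeit.repeat(lambda: uniq_sorted(data), number=1, repeat=repeats))
    t_set = min(timeit.repeat(lambda: uniq_set(data), number=1, repeat=repeats))
    return t_loop, t_set


if __name__ == "__main__":
    for size in (1000, 10000, 100000):
        t_loop, t_set = time_methods(size)
        print(size, t_loop, t_set)
```

Taking the minimum over a few repeats reduces the influence of whatever else the machine happens to be doing. The actual numbers will depend on the Python version and on the data, so don’t expect them to match the graphs.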