Discovering “Long-Fuse” Papers – Physics Results

November 20, 2015

I ended the previous blog with a list of “long-fuse” astronomy papers and I realized that, since I have similar results for physics papers too, why not show them here as well? So, that’s what this blog will be about. Just a list of those “long-fuse” physics papers. There is another blog in the making that will explore the statistics a little bit more, so stay tuned…

Here’s a list of 31 “long-fuse” physics papers. There are definitely more, but this is a nice illustrative set with some true gems…

  1. Buckingham, E. (1914), “On Physically Similar Systems; Illustrations of the Use of Dimensional Equations”, Physical Review, vol. 4, Issue 4, pp. 345-376
  2. Washburn, Edward W. (1921), “The Dynamics of Capillary Flow”, Physical Review, vol. 17, Issue 3, pp. 273-283
  3. Robertson, H. P. (1929), “The Uncertainty Principle”, Physical Review, vol. 34, Issue 1, pp. 163-164
  4. Einstein, A.; Podolsky, B.; Rosen, N. (1935), “Can Quantum-Mechanical Description of Physical Reality Be Considered Complete?”, Physical Review, vol. 47, Issue 10, pp. 777-780
  5. Beth, Richard A. (1936), “Mechanical Detection and Measurement of the Angular Momentum of Light”, Physical Review, vol. 50, Issue 2, pp. 115-125
  6. Tolman, Richard C. (1939), “Static Solutions of Einstein’s Field Equations for Spheres of Fluid”, Physical Review, vol. 55, Issue 4, pp. 364-373
  7. Patterson, A. L. (1939), “The Scherrer Formula for X-Ray Particle Size Determination”, Physical Review, vol. 56, Issue 10, pp. 978-982
  8. Snyder, Hartland S. (1947), “Quantized Space-Time”, Physical Review, vol. 71, Issue 1, pp. 38-41
  9. Wallace, P. R. (1947), “The Band Theory of Graphite”, Physical Review, vol. 71, Issue 9, pp. 622-634
  10. Birch, Francis (1947), “Finite Elastic Strain of Cubic Crystals”, Physical Review, vol. 71, Issue 11, pp. 809-824
  11. Snyder, Hartland S. (1947), “The Electromagnetic Field in Quantized Space-Time”, Physical Review, vol. 72, Issue 1, pp. 68-71
  12. Dresselhaus, G. (1955), “Spin-Orbit Coupling Effects in Zinc Blende Structures”, Physical Review, vol. 100, Issue 2, pp. 580-586
  13. Dyakonov, M. I.; Perel, V. I. (1971), “Current-induced spin orientation of electrons in semiconductors”, Physics Letters A, Volume 35, Issue 6, p. 459-460.
  14. Polder, D.; van Hove, M. (1971), “Theory of Radiative Heat Transfer between Closely Spaced Bodies”, Physical Review B, vol. 4, Issue 10, pp. 3303-3314
  15. Vainshtein, A. I. (1972), “To the problem of nonvanishing gravitation mass”, Physics Letters B, Volume 39, Issue 3, p. 393-394.
  16. Boulware, David G.; Deser, S. (1972), “Can Gravitation Have a Finite Range?”, Physical Review D, vol. 6, Issue 12, pp. 3368-3382
  17. Minkowski, Peter (1977), “μ→eγ at a rate of one out of 10^9 muon decays?”, Physics Letters B, Volume 67, Issue 4, p. 421-428.
  18. Konetschny, W.; Kummer, W. (1977), “Nonconservation of total lepton number with scalar bosons”, Physics Letters B, Volume 70, Issue 4, p. 433-435.
  19. Deshpande, Nilendra G.; Ma, Ernest (1978), “Pattern of symmetry breaking with two Higgs doublets”, Physical Review D (Particles and Fields), Volume 18, Issue 7, 1 October 1978, pp.2574-2576
  20. Magg, M.; Wetterich, Ch. (1980), “Neutrino mass problem and gauge hierarchy”, Physics Letters B, Volume 94, Issue 1, p. 61-64.
  21. Sriram Shastry, B.; Sutherland, Bill (1981), “Exact ground state of a quantum mechanical antiferromagnet”, Physica B+C, Volume 108, Issue 1, p. 1069-1070.
  22. Ginsparg, Paul H.; Wilson, Kenneth G. (1982), “A remnant of chiral symmetry on the lattice”, Physical Review D (Particles and Fields), Volume 25, Issue 10, 15 May 1982, pp.2649-2657
  23. Keung, Wai-Yee; Senjanović, Goran (1983), “Majorana Neutrinos and the Production of the Right-Handed Charged Gauge Boson”, Physical Review Letters, Volume 50, Issue 19, May 9, 1983, pp.1427-1430
  24. Silveira, Vanda; Zee, A. (1985), “Scalar Phantoms”, Physics Letters B, Volume 161, Issue 1-3, p. 136-140.
  25. Zee, A. (1986), “Quantum numbers of Majorana neutrino masses”, Nuclear Physics, Section B, Volume 264, p. 99-110.
  26. Wilczek, Frank (1987), “Two applications of axion electrodynamics”, Physical Review Letters (ISSN 0031-9007), vol. 58, May 4, 1987, p. 1799-1802. NASA-supported research.
  27. Haldane, F. D. M. (1988), “Model for a quantum Hall effect without Landau levels: Condensed-matter realization of the “parity anomaly””, Physical Review Letters, Volume 61, Issue 18, October 31, 1988, pp.2015-2018
  28. Deutsch, J. M. (1991), “Quantum statistical mechanics in a closed system”, Physical Review A (Atomic, Molecular, and Optical Physics), Volume 43, Issue 4, February 15, 1991, pp.2046-2049
  29. Siegel, W. (1993), “Superspace duality in low-energy superstrings”, Physical Review D (Particles, Fields, Gravitation, and Cosmology), Volume 48, Issue 6, 15 September 1993, pp.2826-2837
  30. Takeda, Kyozaburo; Shiraishi, Kenji (1994), “Theoretical possibility of stage corrugation in Si and Ge analogs of graphite”, Physical Review B (Condensed Matter), Volume 50, Issue 20, November 15, 1994, pp.14916-14922
  31. Srednicki, Mark (1994), “Chaos and quantum thermalization”, Physical Review E (Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics), Volume 50, Issue 2, August 1994, pp.888-901

 

Discovering “Long-Fuse” Papers – A First Exploration

November 16, 2015

On average, articles get cited in a pretty predictable pattern. An article gets published, it takes a little while to get “absorbed” by the scientific community, and then, if it resonates, it starts getting citations. Something along the lines of figure 4 in my paper “Effect of E-printing on Citation Rates in Astronomy and Physics”, Journal of Electronic Publishing, vol. 9, p. 2:

[Figure: figure 4 of the paper cited above]

In the pre-Internet days, keeping up with the literature meant that people had to work their way through abstract books and read the tables of contents of volumes on the shelves of their local library. The principle, nevertheless, is the same, just with a different time scale. When you have to physically work your way through shelves and volumes, it obviously takes more time to gather the bibliography for a paper. Even when you allow for these longer time scales, there are still papers that take longer, way longer, to be cited than others. Todd Lauer coined the phrase “long-fuse” papers on Twitter (in a discussion with Joss Bland-Hawthorn) and wondered how to detect them. That is what this blog is about: one attempt at finding them. The figure below shows one fine example of such a “long-fuse” publication: “On the Masses of Nebulae and of Clusters of Nebulae”, by F. Zwicky (1937), Astrophysical Journal, vol. 86, p.217

citations for 1937ApJ....86..217Z

Before starting this little journey, first a caveat: when you go back into the pre-Internet era, especially decades before its inception, it is much harder to compile citation data. Some journals had references in footnotes, rather than in designated bibliographies, and in all cases references need to be manually typed in or extracted from the OCR of digitized material. In short, there are bound to be significant gaps in older citation data. But, even with this taken into account, there are still publications that can be described as “long-fuse” publications: something discussed in the publication becomes (highly) relevant well after its appearance, sometimes a decade or even several decades later. The Zwicky paper is a prime example.

So, how do you go about finding these publications? The key ingredient here is the fact that publications really are nodes in a directed graph, specifically a directed graph with a very distinct “flow”: publications cite material that has already been published, i.e. back in time. What makes “long-fuse” papers different within the context of the citation graph? Temporally speaking, the “distance” to the bulk of their citations is relatively large. The next step is to turn this observation into a quantitative statement. Since we are looking at publications spread out over decades, it seems logical to apply some sort of normalization or scaling. Let’s rescale all data in terms of “age” rather than absolute years. Within this context, it seems to make sense to compare the age of a publication to the average age of its associated citation distribution. Just to state the obvious: the “paper age” of a publication from, say, 1937, is 79 years (taking the current year, 2015, as reference point, and starting with an age of 1). The “citation age” for a citation from a 2012 publication is 76 years (2012 − 1937 + 1, because citations from the same year as the cited paper are taken to have an age of 1 year).

Let’s explore the quantity F defined as the average citation age, divided by the paper age. What is its distribution like? To explore this, I took the following set of journals: The Astrophysical Journal (including Letters and Supplement Series), The Astronomical Journal, Monthly Notices of the R.A.S., Astronomy & Astrophysics, Physical Review D, Physical Review E, Reviews of Modern Physics, Nuclear Physics A and Nuclear Physics B. All the publications in this set were filtered on the following two, rather arbitrary, criteria: the minimum paper age is 20 years and the minimum number of citations is 300. These filters were chosen for no particular reason, just to turn a set of about half a million publications into a much smaller set for this initial exploration. The resulting filtered publication set consists of 2750 publications. For this set, the quantity F turns out to have the frequency distribution shown in the figure below.

frequency distribution of average citation age over paper age
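
To make the definition of F concrete, here is a minimal Python sketch of the calculation described above. The function names and the example citing years are my own illustrative assumptions, not the code behind the actual analysis; the only things taken from the text are the age convention (same year counts as age 1) and the three thresholds (paper age of at least 20, at least 300 citations, F > 0.75).

```python
from statistics import mean


def paper_age(pub_year, ref_year):
    """Age of a paper in years, counting the publication year itself as age 1."""
    return ref_year - pub_year + 1


def citation_age(pub_year, citing_year):
    """Age of a citation, counting citations from the publication year as age 1."""
    return citing_year - pub_year + 1


def long_fuse_ratio(pub_year, citing_years, ref_year):
    """F = (average citation age) / (paper age)."""
    if not citing_years:
        raise ValueError("F is undefined for a paper without citations")
    average_age = mean(citation_age(pub_year, year) for year in citing_years)
    return average_age / paper_age(pub_year, ref_year)


def is_long_fuse_candidate(pub_year, citing_years, ref_year,
                           min_age=20, min_citations=300, threshold=0.75):
    """Apply the filters from the text: minimum age, minimum citations, F cut-off."""
    if paper_age(pub_year, ref_year) < min_age:
        return False
    if len(citing_years) < min_citations:
        return False
    return long_fuse_ratio(pub_year, citing_years, ref_year) > threshold


# Hypothetical example: a 1937 paper cited in 1990, 2000 and 2012,
# evaluated with 2015 as the reference year.
example_F = long_fuse_ratio(1937, [1990, 2000, 2012], ref_year=2015)
```

In this toy example the citation ages are 54, 64 and 76 years, so F is their mean divided by the paper age of 79.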

The phenomenon of “long-fuse” papers is not very likely to be the result of some sort of “phase transition” in the citation graph evolution; there probably is a smooth transition from the realm of “long-fuse” papers to “regularly cited” papers. We can’t point to a region in the figure above and say: this region represents the component of “long-fuse” papers. Nevertheless, it seems plausible that the “long-fuse” papers live in the right tail of this distribution. Since this is just a first exploration, I’ll just pick a threshold and see what follows.

How many papers remain from the 2750 papers if I add the additional requirement that F > 0.75? A total of 60 papers remain, 32 of which are from the set of core astronomy journals. Let’s explore a couple of them! The following plot shows the normalized number of citations (normalized by the number of citations on November 16, 2015) as a function of year for 5 papers on astronomy-related subjects.

citations for 5 long-fuse astronomy papers

These 5 papers are:

  1. Plummer, H. C. (1911), “On the problem of distribution in globular star clusters”, Monthly Notices of the Royal Astronomical Society, Vol. 71, p.460
  2. Bondi, H. (1947), “Spherically symmetrical models in general relativity”, Monthly Notices of the Royal Astronomical Society, Vol. 107, p.410
  3. Gödel, Kurt (1949), “An Example of a New Type of Cosmological Solutions of Einstein’s Field Equations of Gravitation”, Reviews of Modern Physics, vol. 21, Issue 3, pp. 447
  4. Boulware, David G. & Deser, S. (1972), “Can Gravitation Have a Finite Range?”, Physical Review D, vol. 6, Issue 12, p. 3368
  5. Wetterich, C. (1988), “Cosmology and the fate of dilatation symmetry”, Nuclear Physics B, Volume 302, Issue 4, p. 668

Below is a similar plot for 5 physics papers.

Normalized citations for 5 long-fuse physics papers

The 5 papers are:

  1. Everett, Hugh (1957), “‘Relative State’ Formulation of Quantum Mechanics”, Reviews of Modern Physics, vol. 29, Issue 3, pp. 454
  2. Bell, John S. (1966), “On the Problem of Hidden Variables in Quantum Mechanics”, Reviews of Modern Physics, vol. 38, Issue 3, pp. 447
  3. van Dam, H. & Veltman, M. (1970), “Massive and mass-less Yang-Mills and gravitational fields”, Nuclear Physics B, Volume 22, Issue 2, p. 397
  4. Nambu, Yoichiro (1973), “Generalized Hamiltonian Dynamics”, Physical Review D, vol. 7, Issue 8, pp. 2405
  5. Deshpande, Nilendra G. & Ma, Ernest (1978), “Pattern of symmetry breaking with two Higgs doublets”, Physical Review D, Volume 18, Issue 7, pp.2574

So, this initial exploration looks promising! Next, I should attempt to look in more detail at the various assumptions and seemingly arbitrary choices of variables. The choices for minimum paper age and number of citations are probably arbitrary by nature, but it seems the selected cut-off value of F > 0.75 can definitely be explored a bit further.

Finally, here is the full set of 32 papers from the astronomy data set (note that the Zwicky paper was reproduced in this analysis):

  1. Plummer, H. C. (1911), “On the problem of distribution in globular star clusters”, Monthly Notices of the Royal Astronomical Society, Vol. 71, p.460-470
  2. von Zeipel, H. (1924), “The radiative equilibrium of a rotating system of gaseous masses”, Monthly Notices of the Royal Astronomical Society, Vol. 84, p.665-683
  3. Hubble, E. P. (1926), “Extragalactic nebulae.”, Astrophysical Journal, 64, 321-369 (1926)
  4. Zwicky, F. (1937), “On the Masses of Nebulae and of Clusters of Nebulae”, Astrophysical Journal, vol. 86, p.217
  5. Henyey, L. G.; Greenstein, J. L. (1941), “Diffuse radiation in the Galaxy”, Astrophysical Journal, vol. 93, p. 70-83 (1941).
  6. Chandrasekhar, S. (1943), “Dynamical Friction. I. General Considerations: the Coefficient of Dynamical Friction.”, Astrophysical Journal, vol. 97, p.255
  7. Bondi, H.; Hoyle, F. (1944), “On the mechanism of accretion by stars”, Monthly Notices of the Royal Astronomical Society, Vol. 104, p.273
  8. Bondi, H. (1947), “Spherically symmetrical models in general relativity”, Monthly Notices of the Royal Astronomical Society, Vol. 107, p.410
  9. Bondi, H. (1952), “On spherically symmetrical accretion”, Monthly Notices of the Royal Astronomical Society, Vol. 112, p.195
  10. Salpeter, Edwin E. (1955), “The Luminosity Function and Stellar Evolution.”, Astrophysical Journal, vol. 121, p.161
  11. Bonnor, W. B. (1956), “Boyle’s Law and gravitational instability”, Monthly Notices of the Royal Astronomical Society, Vol. 116, p.351
  12. Schmidt, Maarten (1959), “The Rate of Star Formation.”, Astrophysical Journal, vol. 129, p.243
  13. Kozai, Yoshihide (1962), “Secular perturbations of asteroids with high inclination and eccentricity”, Astronomical Journal, Vol. 67, p. 591
  14. Refsdal, S. (1964), “On the possibility of determining Hubble’s parameter and the masses of galaxies from the gravitational lens effect”, Monthly Notices of the Royal Astronomical Society, Vol. 128, p.307
  15. Neupert, Werner M. (1968), “Comparison of Solar X-Ray Line Emission with Microwave Emission during Flares”, Astrophysical Journal, vol. 153, p.L59
  16. Bardeen, James M.; Press, William H.; Teukolsky, Saul A. (1972), “Rotating Black Holes: Locally Nonrotating Frames, Energy Extraction, and Scalar Synchrotron Radiation”, Astrophysical Journal, Vol. 178, pp. 347-370 (1972)
  17. Sneden, C. (1973), “The nitrogen abundance of the very metal-poor star HD 122563.”, Astrophysical Journal, Vol. 184, p. 839 – 849
  18. Purcell, Edward M.; Pennypacker, Carlton R. (1973), “Scattering and Absorption of Light by Nonspherical Dielectric Grains”, Astrophysical Journal, Vol. 186, pp. 705-714 (1973)
  19. Whelan, John; Iben, Icko, Jr. (1973), “Binaries and Supernovae of Type I”, Astrophysical Journal, Vol. 186, pp. 1007-1014 (1973)
  20. Tayler, R. J. (1973), “The adiabatic stability of stars containing magnetic fields-I.Toroidal fields”, Monthly Notices of the Royal Astronomical Society, Vol. 161, p. 365 (1973)
  21. Petrosian, V. (1976), “Surface brightness and evolution of galaxies”, Astrophysical Journal, vol. 209, Oct. 1, 1976, pt. 2, p. L1-L5.
  22. Blandford, R. D.; Znajek, R. L. (1977), “Electromagnetic extraction of energy from Kerr black holes”, Monthly Notices of the Royal Astronomical Society, vol. 179, May 1977, p. 433-456.
  23. Weidenschilling, S. J. (1977), “Aerodynamics of solid bodies in the solar nebula”, Monthly Notices of the Royal Astronomical Society, vol. 180, July 1977, p. 57-70. Research supported by the Carnegie Corp.
  24. Gingold, R. A.; Monaghan, J. J. (1977), “Smoothed particle hydrodynamics – Theory and application to non-spherical stars”, Monthly Notices of the Royal Astronomical Society, vol. 181, Nov. 1977, p. 375-389.
  25. Cash, W. (1979), “Parameter estimation in astronomy through application of the likelihood ratio”, Astrophysical Journal, Part 1, vol. 228, Mar. 15, 1979, p. 939-947.
  26. Hut, P. (1981), “Tidal evolution in close binary systems”, Astronomy and Astrophysics, vol. 99, no. 1, June 1981, p. 126-140.
  27. Arnett, W. D. (1982), “Type I supernovae. I – Analytic solutions for the early part of the light curve”, Astrophysical Journal, Part 1, vol. 253, Feb. 15, 1982, p. 785-797.
  28. Soltan, A. (1982), “Masses of quasars”, Monthly Notices of the Royal Astronomical Society, vol. 200, July 1982, p. 115-122.
  29. Milgrom, M. (1983), “A modification of the Newtonian dynamics as a possible alternative to the hidden mass hypothesis”, Astrophysical Journal, Part 1 (ISSN 0004-637X), vol. 270, July 15, 1983, p. 365-370. Research supported by the U.S.-Israel Binational Science Foundation.
  30. Li, T.-P.; Ma, Y.-Q. (1983), “Analysis methods for results in gamma-ray astronomy”, Astrophysical Journal, Part 1 (ISSN 0004-637X), vol. 272, Sept. 1, 1983, p. 317-324.
  31. Lin, D. N. C.; Papaloizou, John (1986), “On the tidal interaction between protoplanets and the protoplanetary disk. III – Orbital migration of protoplanets”, Astrophysical Journal, Part 1 (ISSN 0004-637X), vol. 309, Oct. 15, 1986, p. 846-857.
  32. O’Donnell, James E. (1994), “R_V-dependent optical and near-ultraviolet extinction”, Astrophysical Journal, Part 1 (ISSN 0004-637X), vol. 422, no. 1, p. 158-163

Amazon Container Services, Micro Services, Consul – Backing Up Your Key/Value Store

November 8, 2015

Here’s the situation in a nutshell: you have a service running in the Amazon EC2 Container Services environment. Your architecture consists of a set of nodes, as shown in the sketch below.

Everything “lives” inside a private address space “in the Cloud”, on Amazon Web Services (AWS). The cluster that resides in the AWS EC2 Container Services (ECS) consists of “nodes”, each of which is an “instance” launched from the AWS EC2 environment. These instances are of a certain “type” (like e.g. “t2.medium”), which determines, for example, how much disk space and computing power will be available, and have networking characteristics attributed to them (like private and potentially public address spaces). Within each of these nodes inside our cluster, a number of Docker containers are running. You don’t have to use Docker as a means of provisioning/launching services, but that’s a different topic. Each of these Docker containers, in this example, represents a micro service. That’s the 20,000 ft view, more or less.

These micro services need to get their configuration information from somewhere. Assuming that the recipe for building one of those micro services is stored on GitHub, for example, some of the configuration information can be packaged with that repository. Sensitive information, like API keys and other secrets, obviously cannot be stored in such a publicly accessible location. With Python projects, people often use a config.py module for general information and a local_config.py module (listed in the .gitignore file) for e.g. overwriting stub keys with the actual secret keys. When you deploy your micro service, you need to take care that you deploy (the right) local_config.py along with your code. When you have a lot of micro services, this clearly is a pain and mistakes are bound to happen. That is where a service like Consul is very handy. Consul is a lot more than just a key/value store for configuration information, by the way. When you deploy Consul alongside your micro services on AWS, you have a central configuration service, accessible to all your micro services. Assuming you have configured things correctly, the Consul service is only accessible within the private address space used within your ECS cluster. In other words, nobody in the outside world is able to access your configuration stored in Consul.

It would be nice to make frequent backups of your Consul key/value store. Consul does not come with something built in, so you need to roll your own backup solution. In essence you need something that is able to create a dump of the contents of the store. You can write your own utility around an HTTP request like

curl http://<Consul host>:<Consul port>/v1/kv/?recurse

and do all the “dirty work” yourself. Or, you can use a Python client like consulate to do that “dirty work” for you. In other words, getting the data needed for your backup is pretty easy. How you want to implement it is something that needs a little bit more thought. You want your backups done frequently, so a cron job seems like the natural solution. Where do you run your cron job? Do you package it with one of your micro services? Do you package it with your Consul setup? Do you create a separate “stuff” container that runs all kinds of management scripts? How about initiating the backup outside the Cloud? A minor additional question is: where do I store my backups? On AWS (S3) or locally?
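
As an illustration of the “dirty work” route, here is a minimal Python sketch built around the same HTTP request, using only the standard library. The function names are mine and purely illustrative. The one Consul-specific detail it relies on is that the key/value endpoint returns a JSON list of entries whose “Value” field is base64-encoded (or null for an empty value).

```python
import base64
import json
from urllib.request import urlopen


def fetch_kv(host, port):
    """Fetch all key/value entries from Consul via GET /v1/kv/?recurse."""
    url = "http://%s:%s/v1/kv/?recurse" % (host, port)
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def decode_kv(entries):
    """Turn the raw Consul entries into a plain {key: value} dictionary.

    Consul returns values base64-encoded; an empty value comes back as null.
    """
    records = {}
    for entry in entries:
        raw = entry.get("Value")
        if raw is None:
            records[entry["Key"]] = None
        else:
            records[entry["Key"]] = base64.b64decode(raw).decode("utf-8")
    return records


# Usage, assuming a Consul agent reachable on a hypothetical host and port:
#   records = decode_kv(fetch_kv("consul.internal", 8500))
```

Keeping the decoding separate from the HTTP call makes it easy to check the backup logic without a running Consul agent.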

Let’s start with the minor additional question of “where”. It seems that AWS S3 is the most convenient answer. Your backup will live in the same private address space, more readily available for a restore, and storage on AWS is dirt cheap. Storing it locally involves a little bit more coding, but works fine too. If you want to look inside a backup file, you would have to download it first, from S3, so that may be a slight advantage of having it stored locally.

Where to run the cron job is to some degree a matter of taste too. Packaging it with an existing service (like a micro service or Consul itself) seems like bad design to me. The whole idea of a micro service is that it does X and X alone, no room for “oh, and a little bit of Y too”. How about having a “management container” or “management node” within your cluster? Besides doing backups, you probably want to run all kinds of health checking, metrics gathering and other scripts. This choice is a bit more philosophical. Unless you have some sort of management UI on top of your AWS environment (like Rancher), which does all the communicating with AWS “under the hood”, you probably will need to run a local script if you want to do something like restoring a Consul backup. If that is the case, maybe you want to have your backup utility run locally too, just from a “completeness” point of view. Backup and restore are probably just two modes of one utility. Sometimes containers crash, which could theoretically stop your backups. But there is a solution for that: a new container will get spun up, with all your management services. My personal choice was to have the backup initiated locally, through a cron job running on an on-prem server. How do you make that happen?

That is where the concept of “Task Definitions” comes in. A “Task Definition” in the AWS ECS environment is a recipe to “do something”. This “something” can either be a “service” or a “task”. A “service” is in general something you expect to keep running; think “micro service” here, for example. You start the service and it keeps running till “something happens”; normally this means that you stop the service or restart the service (after an update). A “task” is in general a more transient event; you start it, it makes something happen and then exits. This is exactly what we need. So, what ingredients do you need for a “Task Definition”? The main ingredient, really, is an “image”, specifically a “Docker image”. A Docker image is a read-only template. For example, an image could contain an Ubuntu operating system with Apache and your web application installed. Images are used to create Docker containers. Docker images are the build component of Docker. In its turn, a Docker image is created from a recipe, listed in a file called the Dockerfile. Docker images can be stored in various ways: on Docker Hub, in a local repository, in a third-party repository like Quay.io, or within AWS itself.

In our case, when the Docker container is created, it has one very clear purpose:

  1. connect to the Consul key/value store,
  2. retrieve all records,
  3. store the records in a file
  4. ship that file to a well-defined bucket on AWS S3
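
Steps 3 and 4 of this list could look roughly like the Python sketch below. It is not the actual backup.py from the repository; the function names and the file naming scheme are assumptions of mine. The only AWS call used is the boto3 S3 client method upload_file(Filename, Bucket, Key), and the client can be passed in so the logic can be exercised without AWS credentials.

```python
import json
import os
import tempfile
import time


def write_backup(records, directory=None):
    """Step 3: store the records in a timestamped JSON file and return its path."""
    if directory is None:
        directory = tempfile.gettempdir()
    name = "consul-backup-%s.json" % time.strftime("%Y%m%dT%H%M%S")
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        json.dump(records, handle, indent=2, sort_keys=True)
    return path


def ship_backup(path, bucket, s3_client=None):
    """Step 4: upload the backup file to the given S3 bucket."""
    if s3_client is None:
        import boto3  # only needed when no client is passed in
        s3_client = boto3.client("s3")
    key = os.path.basename(path)
    s3_client.upload_file(path, bucket, key)
    return key


# Usage, with a hypothetical bucket name:
#   path = write_backup({"config/service/API_KEY": "secret"})
#   ship_backup(path, "my-consul-backups")
```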

To make this happen, we need to translate this into a Dockerfile. In the FROM clause in the Dockerfile, you specify the base image on which the Docker container is built (like Ubuntu, CentOS, Debian, …). I chose phusion, but that is not necessarily the best choice. I probably could have chosen something like busybox. One difference between various choices is the size of the resulting Docker image; this could be a critical factor. I’ll probably spend a future blog on this.

The rest of the ingredients is listed in this GitHub repo. Using the Dockerfile, I created a Docker image and stored it on Docker Hub. Now we can use this image in the Task Definition:


{
  "requiresAttributes": [],
  "taskDefinitionArn": "arn:aws:ecs:<region>:<identifier>:task-definition/consul-backup:12",
  "status": "ACTIVE",
  "revision": 12,
  "containerDefinitions": [
    {
      "mountPoints": [
        {
          "containerPath": "/tmp",
          "sourceVolume": "tmp"
        }
      ],
      "name": "consul-backup",
      "environment": [
        {
          "name": "SERVICE_TAGS",
          "value": "staging"
        }
      ],
      "image": "adsabs/consul-backup:v1.0.8"
    }
  ],
  "volumes": [
    {
      "host": {
        "sourcePath": "/tmp"
      },
      "name": "tmp"
    }
  ],
  "family": "consul-backup"
}

I have removed a lot of details from this, to keep things simple and focus on the most important aspects. The Docker image is specified in the “image” entry above. By just listing adsabs/consul-backup AWS “knows” that it needs to look on Docker Hub if it cannot find the image locally. By adding a tag after the colon, a specific version will get downloaded. The Docker container that will get created when you execute “Run Task” within AWS ECS for this particular Task Definition will mount “/tmp” from the node on “/tmp” in the container. I wanted this so that I could keep a log file that sticks around, even after the Docker container has been removed.

When “Run Task” is executed, the Docker container is created from the image, and the command specified after “CMD” in the Dockerfile is executed. This will run the Python script backup.py. The script looks at the environment variables, which tell it whether to do a backup or a restore. I think the source code of backup.py is pretty self-explanatory. Instead, let’s look at how to make this setup into a local cron job.

The essence of this is the “boto3” Python module, specifically the component that deals with AWS ECS. In essence you would need a method along the following lines:


def run_task(cluster, count, taskDefinition):
    """
    Thin wrapper around boto3 ecs.run_task;
    # http://boto3.readthedocs.org/en/latest/reference/services/ecs.html#ECS.Client.run_task
    :param cluster: The short name or full Amazon Resource Name (ARN) of the cluster on which to run the task. If you do not specify a cluster, the default cluster is assumed.
    :param count: The number of instantiations of the task to place on the cluster.
    :param taskDefinition: The family and revision (family:revision) or full Amazon Resource Name (ARN) of the task definition to run. If a revision is not specified, the latest ACTIVE revision is used.
    :return: the response of ecs.run_task, listing the started tasks and any failures
    """
    client = get_boto_session().client('ecs')
    # run_task takes "count"; "desiredCount" belongs to create_service/update_service
    return client.run_task(
        cluster=cluster,
        count=count,
        taskDefinition=taskDefinition
    )

where cluster refers to the name of the cluster and taskDefinition would be something like “consul-backup:12” (where 12 refers to the “revision number” of the Task Definition). The method “get_boto_session()” is something like


from boto3.session import Session
from flask import current_app  # assumption: app.config belongs to a Flask application


def get_boto_session():
    """
    Gets a boto3 session using credentials stored in app.config; assumes an
    app context is active
    :return: boto3.session.Session instance
    """
    return Session(
        aws_access_key_id=current_app.config.get('AWS_ACCESS_KEY'),
        aws_secret_access_key=current_app.config.get('AWS_SECRET_KEY'),
        region_name=current_app.config.get('AWS_REGION')
    )

Now we have all the ingredients to initiate backups of the Consul key/value store from a local server.
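
For the cron job itself, a small entry point along the following lines would do. This is a sketch under assumptions: the function name, cluster name, script path and schedule are all hypothetical. It only relies on the boto3 ECS client method run_task(cluster=..., taskDefinition=..., count=...), whose response contains a “tasks” list and a “failures” list. The client is passed in as an argument, so the function works with any object that offers the same method.

```python
def start_backup(ecs_client, cluster, task_definition):
    """Start one instance of the backup task and return the ARNs of the started tasks."""
    response = ecs_client.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        count=1,
    )
    failures = response.get("failures", [])
    if failures:
        # ECS reports placement problems here instead of raising an exception
        raise RuntimeError("ECS could not start the backup task: %s" % failures)
    return [task["taskArn"] for task in response.get("tasks", [])]


# Usage from a script called by cron (names below are placeholders):
#   import boto3
#   client = boto3.client("ecs")
#   start_backup(client, "my-cluster", "consul-backup:12")
#
# with a crontab entry such as:
#   0 3 * * * /usr/bin/python /path/to/start_backup.py
```

Checking the “failures” list matters for an unattended job: a task that could not be placed would otherwise go unnoticed until you need the backup.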

Linking to Data – Effect on Citation Rates in Astronomy

June 3, 2011

In the paper Effect of E-printing on Citation Rates in Astronomy and Physics we asked ourselves the question whether the introduction of the arXiv e-print repository had any influence on citation behavior. We found significant increases in citation rates for papers that appear as e-prints prior to being published in scholarly journals.

This is just one example of how publication practices influence article metrics (citation rates, usage, obsolescence, to name a few). Here we will be examining one practice that is very relevant to astronomy: is there a difference, from a bibliometric point of view, between articles that link to data and articles that do not? Specifically, is there a difference in citation rates between these classes of articles?

Besides being interesting from a purely academic point of view, this question is also highly relevant for the process of “furthering science”. Data sharing not only helps the process of verification of claims, but also the discovery of new findings in archival data. There seems to be a consensus that sharing data is a Good Thing. Let’s ignore the “why” and “how”, and focus on the sharing. You need to have both a willingness and a publication mechanism in order to create a “practice”. This is where citation rates come in: if we can say that papers with links to data get higher citation rates, this might increase the willingness of scientists to take the extra steps of linking data sources to their publications.

Using the data holdings of the SAO/NASA Astrophysics Data System we can do the analysis and see if articles with links to data have different citation rates. For the analysis, we used the articles published in The Astrophysical Journal (including Letters and Supplement), The Astronomical Journal, The Monthly Notices of the R.A.S. and Astronomy & Astrophysics (including Supplement), during the period 1995 through 2000. Next we determined the set of 50 most frequently used keywords in articles with data links. The articles to be used for the analysis were obtained by requiring that they have at least 3 keywords in common with that set of 50 keywords. This resulted in a set of 3814 articles with data links and 7218 articles without data links. A random selection of 3814 articles was extracted from this set of 7218 articles, so that both samples have the same size.
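
The selection procedure described above can be sketched in a few lines of Python. This is an illustration only, not the code used for the analysis: the record layout (a dictionary with a “keywords” list), the function names and the fixed random seed are all assumptions of mine.

```python
import random
from collections import Counter


def top_keywords(articles, n=50):
    """Return the n most frequently used keywords in a set of articles."""
    counts = Counter(kw for article in articles for kw in set(article["keywords"]))
    return {kw for kw, _ in counts.most_common(n)}


def select_matching(articles, reference, min_shared=3):
    """Keep the articles sharing at least min_shared keywords with the reference set."""
    return [a for a in articles
            if len(reference & set(a["keywords"])) >= min_shared]


def matched_samples(with_links, without_links, n=50, min_shared=3, seed=0):
    """Build two equally sized samples: articles with and without data links."""
    reference = top_keywords(with_links, n)
    linked = select_matching(with_links, reference, min_shared)
    unlinked = select_matching(without_links, reference, min_shared)
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    sample = rng.sample(unlinked, min(len(linked), len(unlinked)))
    return linked, sample
```

Selecting both samples against the same keyword set is what makes the comparison fair: the two groups cover similar subject matter, so differences in citation rates are less likely to reflect differences in topic.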

First, we’ll create a diagram just like the one in figure 4 of the paper Effect of E-printing on Citation Rates in Astronomy and Physics, which shows the number of citations after publication as an ensemble average. In this figure 4 we used the mean number of citations (over the entire data set) to normalize the citations. For our current analysis we will use the total number of citations for normalization.

Our analysis shows that articles with data links are indeed cited more than articles without these links. We can say a little bit more by looking at the cumulative citation distribution. The figure below shows this cumulative distribution, normalized by the total number of citations for articles without data links, 120 months after publication.


This graph shows that for this data set, articles with data links acquired 20% more citations (compared to articles without these links).
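For readers who want to build a similar curve, here is a minimal sketch of the normalization. It assumes two equally sized article sets and, for each set, a list with the number of citations received in each month after publication; the function name and the toy numbers are invented for illustration.

```python
from itertools import accumulate

def cumulative_normalized(monthly_with, monthly_without):
    """Cumulative citations per month, both scaled by the final total of the 'without' set."""
    cum_with = list(accumulate(monthly_with))
    cum_without = list(accumulate(monthly_without))
    norm = cum_without[-1]  # total citations of the reference set in the last month
    if norm == 0:
        raise ValueError("reference set has no citations")
    return [c / norm for c in cum_with], [c / norm for c in cum_without]

# Toy numbers (invented), one entry per month after publication
with_links = [0, 2, 4, 6]
without_links = [0, 2, 3, 5]
curve_with, curve_without = cumulative_normalized(with_links, without_links)
```

By construction the curve for articles without data links ends at 1, so the end point of the other curve can be read directly as the relative citation excess. In the analysis above the lists would have 120 entries.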

Google Books and the Importance of Quality Control

•November 10, 2010 • Leave a Comment

I’ve stopped counting the times when I used Google Books and cringed. To be honest, I have to say that I have mostly limited myself to digitized serials, serials in astronomy and physics, to be precise. I’m going to ignore bad meta data, which in itself would be a source of teeth grinding and hair pulling. I regularly find myself laughing out loud at the subject headings they came up with. Actually, it’s pretty sad.

No, my main source of frustration is bad digitization. Missing pages, partially scanned pages, pages showing body parts (so far, I’ve only seen fingers and hands), etc etc. Here you see a fine example of what I am referring to. I don’t know whose hand this is, but I would feel deeply ashamed if I were this person. Digitization is serious business, especially when your goal is preservation. When publications contain fold-outs, these need to be properly scanned, for example. I totally realize that with an enormous digitization effort like Google’s, quality control is bound to be hard, if not impossible. In the last year, about half a million scans went through my hands (figuratively speaking). I know how hard it is to check for missing pages and I also know that you simply cannot check every single image.

In addition to bad scans, I think that the search interface of Google Books, well… errr.. sucks. The results returned seem inconsistent, probably as a result of bad meta data (and bad indexing?). Navigating through results and trying to drill down or find out which other volumes were digitized is a major undertaking and often impossible.

Clearly this was a “quantity over quality” project, and quality clearly lost.

Indexing Matters – The Importance of Search Engine Behavior

•July 21, 2010 • Leave a Comment

What a search engine returns on a user query largely, if not completely, determines its usefulness for that user. Looking at usage bibliometrics allows us to classify the behavior of different types of users, for example (see e.g. Usage Bibliometrics by Michael J. Kurtz and Johan Bollen). There are voices claiming that Google Scholar is a “threat” to scholarly information retrieval services (like the ADS and WoS, for example). The main reason why this is not the case becomes clear when we look at usage statistics. Here I will make a comparison of readership patterns from ADS and Google Scholar queries, as observed in ADS’s access logs. These readership patterns will give us the obsolescence of astronomy articles as read by ADS and Google Scholar users. In order to zoom in on people who use ADS professionally, I will only consider ADS users who query ADS 10 or more times per month. The journals I have used in the analysis are the main astronomy journals: Astrophysical Journal, Astronomical Journal, Monthly Notices of the R.A.S. and Astronomy & Astrophysics. In the figure below, a comparison is made between readership of frequent ADS users (read “professional astronomers”) and Google Scholar users.

Comparison of readership patterns from ADS and Google Scholar queries, as observed in ADS’s access logs. The red line marked with open circles shows the readership use by people using the ADS search engine. The blue line marked with ’x’ corresponds with the readership use by people who used the Google Scholar engine. The orange line marked with closed circles shows the citation rate to the articles, while the purple line marked with ’+’ represents their total number of citations.

All the quantities in the figure above are on a per article basis and have been normalized by the 1987 value. This was done so that we can compare apples with apples.
The fact that the obsolescence through Google Scholar is strongly correlated with the total number of citations is no coincidence: this is a direct consequence of the correlation between the PageRank and the total number of citations (see e.g. Chen et al. (2007) and Fortunato et al. (2006)). The consequence of this correlation is the following: Google Scholar does not provide what professional astronomers (and other frequent users) want. Google Scholar readership correlates with the reading habit of students. In short, Google Scholar currently is no threat to scholarly information retrieval services.
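The two preparation steps mentioned above (selecting frequent users and normalizing by the 1987 value) are simple enough to sketch. The data layout and function names below are assumptions for illustration only; the real analysis works on ADS access logs.

```python
def frequent_users(queries_per_month, min_queries=10):
    """Keep users with at least min_queries queries in the month."""
    return {user for user, n in queries_per_month.items() if n >= min_queries}

def normalize_to_year(per_article_by_year, ref_year=1987):
    """Divide a {publication year: per-article value} series by its value at ref_year."""
    ref = per_article_by_year[ref_year]
    if ref == 0:
        raise ValueError("reference value is zero")
    return {year: value / ref for year, value in sorted(per_article_by_year.items())}

# Toy input (invented): reads per article, by publication year
reads = {1986: 2.0, 1987: 4.0, 1988: 6.0}
normalized = normalize_to_year(reads)
```

After this step every curve passes through 1 at the reference year, which is what makes the readership and citation curves directly comparable.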

Bibliography

  • Kurtz, Michael J. and Bollen, Johan (2010), “Usage Bibliometrics”, Annual Review of Information Science and Technology, vol 44, p. 3-64
  • Henneken, E. et al. (2009), “Use of astronomical literature – A report on usage patterns”, Journal of Informetrics, vol. 3, iss. 1, p. 1
  • Fortunato, S., Flammini, A., & Menczer, F. (2006), “Scale-Free Network Growth by Ranking”, Physical Review Letters, 96, 218701
  • Chen, P., Xie, H., Maslov, S., & Redner, S. (2007), “Finding scientific gems with Google’s PageRank algorithm”, Journal of Informetrics, 1, 8

The Art of Parsing – Python – Removing Duplicates

•July 13, 2010 • Leave a Comment

When processing large amounts of data, for example when building a recommender system or an index, there is often a need to remove duplicates from a list of e.g. words. As always, there are many ways to solve a problem, even when you stick to one programming language (which in my case is Python). It is always good to ask yourself the question: how does this method scale? Especially when you work with large data sets, this is something to keep in mind. I was pretty happy with the following method to remove duplicates from a list:

def uniq(inlist):
    # Note: this sorts the input list in place
    if not inlist:
        return inlist
    inlist.sort()
    outlist = [inlist[0]]
    for i in range(1, len(inlist)):
        # Keep an item only if it differs from its predecessor
        if inlist[i] != inlist[i-1]:
            outlist.append(inlist[i])
    return outlist

But then I decided to try

def uniq(inlist):
    # The built-in set type supersedes sets.Set (the sets module is gone in Python 3)
    return list(set(inlist))

which turned out to be a significant speedup. And the code is much cleaner too 🙂 The graphs below show the speed up:

This graph compares two Python methods for removing duplicates from a list

The graph above shows the processing time for removing duplicates from a list as a function of list size for the two methods described above (“Method 1” is the second method, using the sets module). The graph below shows the relative speedup:

This graph shows how much faster Method 1 is (the method using the sets module)
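The comparison shown in the graphs above can be reproduced, roughly, with the standard timeit module. The sketch below is not the script used for the graphs: the function names, list sizes and random test data are my own choices, and the built-in set type is used in place of sets.Set, because the sets module is not available in Python 3.

```python
import random
import timeit

def uniq_sorted(inlist):
    """Sort-based method: sort a copy, then drop items equal to their predecessor."""
    if not inlist:
        return []
    items = sorted(inlist)
    out = [items[0]]
    for prev, cur in zip(items, items[1:]):
        if cur != prev:
            out.append(cur)
    return out

def uniq_set(inlist):
    """Set-based method; the order of the result is not guaranteed."""
    return list(set(inlist))

def compare(size, repeats=3, seed=0):
    """Return (seconds_sorted, seconds_set) for a random list of the given size."""
    rng = random.Random(seed)
    data = [rng.randrange(size) for _ in range(size)]
    # Take the best of several runs to reduce timing noise
    t_sorted = min(timeit.repeat(lambda: uniq_sorted(data), number=1, repeat=repeats))
    t_set = min(timeit.repeat(lambda: uniq_set(data), number=1, repeat=repeats))
    return t_sorted, t_set

if __name__ == "__main__":
    for size in (1_000, 10_000, 100_000):
        t_sorted, t_set = compare(size)
        print(size, t_sorted, t_set)
```

The scaling difference is what you would expect: sorting costs O(n log n), while building a set is O(n) on average. The price is that the set-based method needs hashable items and does not keep the original order; in Python 3.7 and later, list(dict.fromkeys(inlist)) removes duplicates while keeping first occurrences in order.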

Submission of E-prints – Versioning

•July 6, 2010 • Leave a Comment

Here’s an interesting trend: the fraction of e-prints with multiple versions has been increasing steadily in a number of categories. The figure below shows these trends for 4 major arXiv categories.

This figure shows the fraction of e-prints with multiple versions for the arXiv categories astro-ph, cond-mat, hep-ph and nucl-th

I think that authors, over time, started to care more about replacing the initial version with the final version, or at least a more recent version (as some publishers still don’t allow the final version to be made available as e-print). Since the e-prints on arXiv are read so heavily, it is in the authors’ interest to replace their e-prints with corrected/updated versions. There are researchers in some disciplines who will only read and cite e-prints, maybe because their library cannot afford the subscription fees or maybe by choice, but it will clearly be beneficial to them if an e-print is an accurate representation of the end product. The Institute of Mathematical Statistics has the following standpoint with respect to e-printing IMS journal articles:
“IMS wishes to demonstrate by example that high quality journals supported by the academic community can provide adequate revenue to their publishers even if all of their content is placed on open access digital repository such as arXiv. A steady flow of IMS content into the PR (probability) section and the new ST (statistics) section of arXiv should help create an eprint culture in probability and statistics, and be of general benefit to these fields. By guaranteeing growth of these sections of arXiv, IMS will support the practice of authors self-archiving their papers by placing them on arXiv. This practice should put some bound on the prices of subscriptions to commercial journals.” (for more info, see IMS Journals on arXiv). They literally give their authors the following advice: “… when a final version is accepted by a journal, update your preprint to incorporate changes made in the refereeing process, so a post-refereed pre-press version of your article is also available on arXiv”. There are probably other journals and societies with the same standpoint.
We’re just seeing another symptom of the (necessary) paradigm shift in scholarly publishing.

Recommending Literature in a Digital Library

•July 2, 2010 • Leave a Comment

I started yesterday’s post by saying that authors publish because they want to transfer information and that an essential ingredient for this transfer is being able to find this information. Of course, any organization running a search engine, or a publisher with a substantial online presence, is another example where the art of “discovery” is as essential as wind to a sail boat. Clearly, this is becoming more and more of a challenge with the rapidly expanding information universe (literature universe, in our case). The amount of potentially interesting, searchable literature is expanding continuously. Besides the normal expansion, there is an additional influx of literature because of interdisciplinary boundaries becoming more and more diffuse. Hence, the need for accurate, efficient and intelligent search tools is bigger than ever.

When you just look at the holdings of the SAO/NASA Astrophysics Data System (ADS), you’ll get a good indicator for this expansion. As of April 19, 2010, there are 1,730,210 records in the astronomy database, and 5,437,973 in the physics database, distributed over publication years as shown in the figure below.

This figure shows the number of records in the astronomy and physics databases in the ADS, as a function of publication year

In astronomy, as in other fields, the Literature Universe expands more rapidly because of dissolving boundaries with other fields. Astronomers are publishing in journals and citing articles from journals that had little or no astronomy content not too long ago.
How do you find what you are looking for and more importantly, information you could not have found using the normal information discovery model? When you have some prior information (like author names and/or subject keywords), you can use your favorite search engine and apply that information as filters. There are also more sophisticated services like myADS (as part of your ADS account), that do intelligent filtering for you and provide you with customized suggestions. Alternatively, you can ask somebody you consider to be an expert. This aspect emphasizes that “finding” essentially is a bi-directional process. Wouldn’t it be nice to have an electronic process that tries to mimic this type of discovery? It is exactly this type of information discovery that recommender systems have been designed for.

Recommender systems can be characterized in the following way. Recommender systems for literature recommendation…

  • are a technological proxy for a social process
  • are a way of suggesting like or similar articles to a user-specific way of thinking
  • try to automate aspects of a completely different information discovery model where people try to find other people considered to be experts and ask them to suggest related articles

In other words, the main goal of a literature recommender system is to help visitors find information (in the form of articles) that was previously unknown to them.

What are the key elements needed to build such a recommender system? The most important ingredient is a “proximity concept”. You want to be able to say that two articles are related because they are “closer together” than articles that are less similar. You also want to be able to say that an article is of interest to a person because of its proximity to that person. The following approach will allow you to do just that:

  • build a “space” in which documents and persons can be placed
  • determine a document clustering within this space (“thematic map”)

How do you build such a space? Assigning labels to documents will allow us to associate a “topic vector” with each document. This will allow us to assign labels to persons as well (“interest vector”), using the documents they read. Placing persons in this document space can be used in essentially two different ways: use this information to provide personalized recommendations or use usage patterns (“reads”) of expert users as proxies for making recommendations to other users (“collaborative filtering”). As far as the labels themselves are concerned, there are various sources you can distill them from. The most straightforward approach is to use keywords for these labels. One drawback that comes to mind immediately, is the fact that there are no keywords available for historical literature. However, keywords are an excellent labeling agent for current and recent literature.
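To make the “topic vector” and “interest vector” idea concrete, here is a minimal sketch in Python. It is an illustration under simple assumptions, not a description of the actual system: documents are represented by raw keyword counts, a person by the sum of the vectors of the documents they read, and proximity by cosine similarity. All function names and keywords are invented for the example.

```python
import math
from collections import Counter

def topic_vector(keywords):
    """Represent a document as keyword counts (a sparse topic vector)."""
    return Counter(keywords)

def interest_vector(read_documents):
    """Sum the topic vectors of the documents a person has read."""
    total = Counter()
    for keywords in read_documents:
        total.update(topic_vector(keywords))
    return total

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either one is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy example (invented keywords)
reader = interest_vector([["galaxies", "redshift"], ["galaxies", "surveys"]])
paper_a = topic_vector(["galaxies", "surveys"])
paper_b = topic_vector(["stars", "abundances"])
closer = cosine(reader, paper_a) > cosine(reader, paper_b)
```

In this toy case paper_a shares keywords with what the reader has read and paper_b does not, so paper_a ends up closer to the reader in the keyword space.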

Whether keywords really describe the document universe with sufficient accuracy is directly related to the question whether a keyword system is sufficiently detailed to classify articles. I assume the latter is true, but only when you include the keywords from all papers in the bibliography. Having said this, I do realize that a keyword system can never be static because of developments within a field and because of diffusing boundaries with other fields. I use the keywords provided by the publishers, so the scope and the evolution of the keyword spectrum are out of our hands. It also means that a recommender system based on publisher-provided keywords has one obvious vulnerability: if a major publisher decided to stop using keywords (e.g. PACS identifiers), it would pose a significant problem.

The figure below shows a highly simplified representation of that document space, but it explains the general idea. Imagine a two-dimensional space where one axis represents a topic ranging from galactic to extra-galactic astronomy, and where the other ranges from experimental/observational to theoretical. In this space, a paper titled “Gravitational Physics of Stellar and Galactic Systems” would get placed towards the upper right because its content is mostly about theory, with an emphasis on galactic astronomy. A paper titled “Topological Defects in Cosmology” would end up towards the upper left, because it is purely theoretical and about extra-galactic astronomy.

A simplistic, two-dimensional representation of a "topic space"

A person working in the field of observational/experimental extra-galactic astronomy will most likely read mostly papers related to this subject, and therefore get placed in the lower left region of this space. A clustering is a document grouping that is super-imposed upon this space, which groups together documents that are about similar subjects. As a result, this clustering defines a “thematic map”. As mentioned, this is a highly simplified example. In reality the space has many dimensions (100 to 200), and these cannot be named as intuitively as “level of theoretical content”. However, the naming of the various directions in this “topic space” is not something I worry about. The document clustering is the tool that I will be working with. Note that to establish this “thematic map”, you could very well use the approach I described earlier this week in my post Exploring the Astronomy Literature Landscape.

Knowing to which cluster a new article has been assigned will allow us to find papers that are the closest to this article within the cluster. The first couple of papers in this list can be used as a first recommendation. The more interesting recommendations, however, arise when you combine the information we have about the cluster with usage information. The body of usage information is rather specific: it consists of usage information for “frequent visitors”. Reading between 80 and 300 articles in a period of 6 months seems like a reasonable definition for the group of “frequent visitors”. I assume that this group of frequent visitors represents either professional scientists or people active in the field in another capacity. People who visit less frequently are not good proxies because they are most likely incidental readers.
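One way to picture the combination of cluster and usage information is sketched below. The post does not spell out how the two are combined, so the co-read counting used here is purely my assumption for the sake of illustration, as are the function names and the data layout (a dict mapping each user to the articles read in the 6 month period, and a set of article identifiers for the cluster).

```python
from collections import Counter

def frequent_visitors(reads_by_user, low=80, high=300):
    """Users who read between low and high distinct articles in the period."""
    return {u for u, reads in reads_by_user.items() if low <= len(set(reads)) <= high}

def recommend(article, cluster, reads_by_user, low=80, high=300, n=5):
    """Rank other articles in the cluster by co-reads among frequent visitors."""
    experts = frequent_visitors(reads_by_user, low, high)
    co_reads = Counter()
    for user in experts:
        reads = set(reads_by_user[user])
        if article in reads:
            # Count only papers from the same cluster, excluding the article itself
            co_reads.update(reads & (cluster - {article}))
    return [paper for paper, _ in co_reads.most_common(n)]

# Toy data (invented); thresholds lowered so that the example stays small
toy_reads = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "B"],
    "u3": ["A"],
    "u4": ["A", "C", "D", "E", "F"],
}
toy_cluster = {"A", "B", "C", "D"}
suggestions = recommend("A", toy_cluster, toy_reads, low=2, high=3)
```

In the toy data only u1 and u2 fall inside the (lowered) thresholds, so the suggestions for article A are based on their reads alone.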

The technique used to build the recommender system has been around for quite a while. As early as 1934, Louis Thurstone wrote his paper “The Vectors of Mind”, which addressed the problem of “classifying the temperaments and personality types”. Peter Ossorio (1965) used and built on this technique to develop what he called a “Classification Space”, which he characterized as “a Euclidean model for mapping subject matter similarity within a given subject matter domain”. Michael Kurtz applied this “Classification Space” technique to obtain a new type of search method. Where the construction of the “Classification Space” in the application by Ossorio relied on data input by human subject matter experts, the method proposed by Michael Kurtz builds the space from a set of classified data. Our recommender system is a direct extension of the “Statistical Factor Space” described in the appendix “Statistical Factor Spaces in the Astrophysical Data System” of this paper by Michael Kurtz.

Bibliography

  • Kurtz, M. J. (1993), Intelligent Information Retrieval: The Case of Astronomy and Related Space Sciences, 182, 21
  • Ossorio, P. G. (1965), J. Multivariate Behavioral Research, 2, 479
  • Thurstone, L. L. (1934), Psychological Review, 41, 1

Publication Trends – Authors – Astronomy

•June 30, 2010 • 1 Comment

Authors publish because they want to transfer information. An essential ingredient for this transfer is being able to find this information. This means that this information, for example articles in scholarly journals, needs to be indexed properly and enriched with relevant meta data and links. Enhanced information retrieval tools, like recommender systems, have become indispensable. Besides the actual content of the information offered for dispersal, the information comes with another piece of essential meta data: the author list.

The importance of the author list is essentially bidirectional. Having your name appear on articles is an essential ingredient of any scholarly career and plays an important role in the process of seeking e.g. tenure or jobs. The role of first author depends on discipline, so the first author isn’t necessarily the “most authoritative” author. Some disciplines use alphabetical author lists, for example. Co-authorship with a prominent expert clearly makes a difference and sometimes gives you “measurable status”, like the Erdős number in mathematics, which is the “collaborative distance” between a person and Paul Erdős (if your number is 1, it means you published a paper together with him).

To me, co-authorship is the most normal thing in the world. In a lot of ways, doing science is like learning a “trade”. You start off being an apprentice, you do an exam showing that you have mastered the basic skills for the “trade” and then you find your own way. As an aside: I think the doctoral thesis and its subsequent defense is that “test of ability”. In some disciplines it now seems to have become a requirement that doctoral research should result in something original and new. Please correct me if that observation is incorrect.

In the past, at least in astronomy and physics, it was more common to publish papers just by yourself, once you had mastered your field. And this was initially totally feasible. In the early days of science there were no budgets being slashed and there were no enormous projects like LHC. Most scientists had their own little “back yard” where they could grow whatever they felt like growing. As the 20th century progressed, especially in roughly the second half, collaborations became more and more unavoidable. Enter collaborations and therefore growing numbers of co-authors. From this moment on we see “The demise of the lone author” (Mott Greene, Nature, Volume 450, Issue 7173, p. 1165). The figure below is an illustration of how the distribution of the number of authors has changed over time.

The figure shows the distribution of the relative frequency of the number of authors per paper in the main astronomy journals for a number of years

This figure illustrates a couple of things. First of all, it shows the “demise of the lone author”, where the fraction of lone author papers dropped from about 60% in 1960 to about 6% in 2009! The widening of the distribution shows that on average the number of co-authors has increased. It seems that this is still an ongoing process that hasn’t reached a saturation point yet.

The figure below highlights the “demise of the lone author” by showing the change in the fraction of single author papers in the main astronomy and physics journals.

The figure shows the fraction of papers by single authors in the main astronomy and physics journals


The drop in the astronomy journals is more dramatic than for the physics journals: a factor of about 10 versus a factor of about 3 or 4.
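Both quantities shown in the figures of this post (the distribution of the number of authors per paper, and the fraction of single author papers) are straightforward to compute from a list of papers. The sketch below assumes the input is a list of (publication year, number of authors) pairs; that layout, the function names and the toy numbers are invented for illustration.

```python
from collections import Counter, defaultdict

def author_count_distribution(papers):
    """Relative frequency of the number of authors per paper, per publication year."""
    counts = defaultdict(Counter)
    for year, n_authors in papers:
        counts[year][n_authors] += 1
    return {
        year: {n: c / sum(dist.values()) for n, c in sorted(dist.items())}
        for year, dist in sorted(counts.items())
    }

def single_author_fraction(papers):
    """Fraction of papers with exactly one author, per publication year."""
    return {
        year: dist.get(1, 0.0)
        for year, dist in author_count_distribution(papers).items()
    }

# Toy input (invented): (publication year, number of authors)
toy_papers = [(1960, 1), (1960, 1), (1960, 2), (2009, 1), (2009, 3), (2009, 4), (2009, 6)]
fractions = single_author_fraction(toy_papers)
```

Dividing the fraction for an early year by the fraction for a recent year gives the kind of “factor of about 10” comparison quoted above.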