Edited extract from: Hitchcock et al. (2000) Developing services for open eprint archives: globalisation, integration and the impact of links. 5th ACM Conference on Digital Libraries, San Antonio, Texas, June

Citation link analysis

Working initially just within the Los Alamos physics eprint archive, we analysed citations from a subset of the papers submitted to one section of the archive, hep-th (theoretical high-energy physics), between the start of 1999 and the end of October that year, a total of 2170 papers. These papers contained over 65 000 citations.

Citations in physics are notoriously terse. In papers in the archive they typically give the author names, followed by either an archive identifier number or a standard abbreviation for the journal title and some undifferentiated numbers, usually the volume number, start page number and year of publication (see Figure 2). Sometimes both the archive ID and the journal data are included.
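The two citation forms described above can be recognised with pattern matching. The following is a minimal sketch in Python, not the project's actual parser: the regular expressions, the function name `parse_citation` and the rule that the parenthesised number is the year are all assumptions made for illustration, and real reference lists vary far more than this allows for.

```python
import re
from typing import Dict, Optional

# Old-style archive identifier, e.g. hep-th/9907001 (section, slash, seven digits).
ARCHIVE_ID = re.compile(r"\b([a-z]+(?:-[a-z]+)*/\d{7})\b")

# Abbreviated journal title followed by three numbers. Assumption: the number
# in parentheses is the year; the other two are read as volume, then start page.
JOURNAL_REF = re.compile(
    r"((?:[A-Z][A-Za-z]*\.\s*)+(?:[A-Z]\s*)?)"  # e.g. "Nucl. Phys. B"
    r"(\d+)[\s,]*"                               # volume
    r"(\(\d{4}\)|\d+)[\s,]*"                     # start page or (year)
    r"(\(\d{4}\)|\d+)"                           # (year) or start page
)


def parse_citation(text: str) -> Dict[str, Optional[str]]:
    """Pull an archive ID and/or journal data out of one citation string."""
    parsed: Dict[str, Optional[str]] = dict.fromkeys(
        ["archive_id", "journal", "volume", "page", "year"]
    )
    id_match = ARCHIVE_ID.search(text)
    if id_match:
        parsed["archive_id"] = id_match.group(1)
    ref_match = JOURNAL_REF.search(text)
    if ref_match:
        journal, volume, first, second = ref_match.groups()
        # Whichever of the last two numbers is parenthesised is taken as the year;
        # if neither is, fall back to the order volume, start page, year.
        if first.startswith("("):
            year, page = first, second
        else:
            page, year = first, second
        parsed.update(
            journal=journal.strip(),
            volume=volume,
            page=page.strip("()"),
            year=year.strip("()"),
        )
    return parsed
```

Calling `parse_citation("E. Witten, Nucl. Phys. B 460 (1996) 335, hep-th/9510135")` fills in both the archive ID and the journal fields, which corresponds to the case where a citation carries both kinds of data.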

Figure 2.  Reference section of article hep-th/9907001 with added links indicated by coloured boxes. For a colour key see Figure 5
 

For our subset of documents, the relative success in automatically recognising citations and resolving them to the corresponding documents in the bibliographic database compiled from the archive can be gauged from Figure 5. The colours in this chart correspond to the link colours in Figure 2. Some links were derived simply from explicitly cited archive IDs; others were derived purely from the bibliographic data in the citation. Where archive IDs are not cited directly, they can instead be derived from bibliographic data in the archive journal-ref metadata or from other more intensively maintained, overlapping bibliographic databases in physics such as SPIRES. (SPIRES is maintained by the Stanford Linear Accelerator Center (SLAC), an associate partner in the OpCit project.)
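The two resolution routes just described, a direct match on a cited archive ID and a fallback match on journal data against journal-ref metadata, can be sketched as a pair of lookups. This is an illustrative sketch only: the class `ArchiveIndex`, the helper `journal_key`, the record layout and the normalisation rule are hypothetical, and the archive IDs used in any example data are invented placeholders rather than real papers.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

JournalKey = Tuple[str, str, str]


def journal_key(journal: str, volume: object, page: object) -> JournalKey:
    """Normalise journal data so 'Nucl. Phys. B' and 'Nucl.Phys.B' compare equal."""
    squashed = "".join(ch for ch in journal.lower() if ch.isalnum())
    return (squashed, str(volume).strip(), str(page).strip())


class ArchiveIndex:
    """Toy stand-in for the bibliographic database compiled from the archive."""

    def __init__(self, records: Iterable[Dict[str, str]]) -> None:
        self.ids: Set[str] = set()
        self.by_journal_ref: Dict[JournalKey, str] = {}
        for rec in records:
            self.ids.add(rec["archive_id"])
            # journal-ref metadata is optional: not every eprint has it.
            if rec.get("journal"):
                key = journal_key(rec["journal"], rec["volume"], rec["page"])
                self.by_journal_ref[key] = rec["archive_id"]

    def resolve(self, citation: Dict[str, Optional[str]]) -> Optional[Tuple[str, str]]:
        """Return (archive_id, route used), or None if nothing matches."""
        cited_id = citation.get("archive_id")
        if cited_id and cited_id in self.ids:
            return (cited_id, "explicit-id")
        journal = citation.get("journal")
        if journal:
            key = journal_key(journal, citation.get("volume"), citation.get("page"))
            found = self.by_journal_ref.get(key)
            if found:
                return (found, "journal-ref")
        return None
```

Trying the explicit ID first keeps the cheap, unambiguous case separate from the fuzzier bibliographic match. A second index built from an external database could be consulted in the same way when the archive's own journal-ref metadata has no entry.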

Figure 5. Resolving and linking citations in a subset of hep-th papers: what proportion could be linked, what could not and why. For an example of a single reference list showing this range of results see Figure 2

Where resolution of reference data against the database, and therefore linking, was unsuccessful, there could be a number of reasons, and the relative occurrence of these problems is also shown in Figure 5. In some cases a citation was correctly recognised but the reference is too old to feature in the archive; in others a citation was recognised but not found in the database (the cited paper is not in the archive). The remaining citations could not be resolved, possibly because of poor formatting, incorrect data, etc. It can be seen that just over half of the references were successfully linked within the archive. The number of successful links could be increased significantly if the archives were supplemented with other, older sources, such as online archival journals. About 16 per cent of citations from this subset may never be resolvable.
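A breakdown of this kind can be produced by assigning each citation to one outcome and tallying the results. The sketch below is an assumption-laden illustration, not the analysis behind Figure 5: the category labels, the functions `classify` and `outcome_shares`, and the cut-off `ARCHIVE_START_YEAR` are hypothetical choices, and any input passed to it here would be toy data.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Assumption for illustration: citations dated before this year are treated
# as too old to appear in the archive.
ARCHIVE_START_YEAR = 1991


def classify(citation: Dict[str, Optional[str]], linked: bool) -> str:
    """Assign one parsed citation to an outcome category."""
    if linked:
        return "linked"
    recognised = bool(citation.get("archive_id") or citation.get("journal"))
    if not recognised:
        return "not recognised"
    year = citation.get("year")
    if year and year.isdigit() and int(year) < ARCHIVE_START_YEAR:
        return "recognised, too old for archive"
    return "recognised, not in archive"


def outcome_shares(outcomes: Iterable[str]) -> Dict[str, float]:
    """Percentage of citations falling into each outcome category."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}
```

Keeping "too old" and "not in archive" apart matters for the point made above: only the first group would benefit from supplementing the archive with older sources.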

These percentages might be generalisable across the physics archive but not necessarily to other archives or applications. It is nonetheless interesting to compare broad measures of success in citation linking, such as this example (52 per cent of citations linked within the archive), with that reported by Electronic Press for Medline linking, in which "on average only about 60% of references in a typical medical paper are contained in the Medline data". Of these Medline citations only 85 per cent were reliably resolved by the linking software. Factors that control these figures include the size and accessibility of the archive and other document sources, and the accuracy, quality and completeness of the reference data.
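To put the Medline figures on the same footing as the 52 per cent quoted for the archive, the two reported percentages can be combined. The short sketch below assumes, as the wording above suggests, that the 85 per cent resolution rate applies only to the roughly 60 per cent of references that Medline contains; the function name `effective_link_rate` is hypothetical.

```python
def effective_link_rate(coverage: float, resolver_accuracy: float) -> float:
    """Fraction of all references that end up linked.

    coverage: share of references present in the target database.
    resolver_accuracy: share of those that the software resolves reliably.
    """
    return coverage * resolver_accuracy


# Figures quoted in the text for Medline linking.
medline = effective_link_rate(0.60, 0.85)
print(f"Medline: about {medline:.0%} of all references linked")
```

On that reading roughly 51 per cent of the references in a typical medical paper would be linked, close to the proportion linked within the physics archive.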




This page produced and maintained by Zhuoan Jiao for the Open Citation project