Open Archive Citation Linking Report

The Open Archive Citation Linking Project (OpCit, formerly known as EPrintLinks) is concerned with adding hypertext features to EPrint Archives. Initially, the arXiv.org archive is being targeted: analyses are being undertaken to discover how the archive is being used, how articles are being written and cited, and how these practices are changing over time.

It is increasingly common for article citations to directly quote the arXiv.org archive identifier (e.g. hep-th/9907123) along with, or even instead of, the normal bibliographic data. Over 80% of recent hep-th articles include at least one of these direct citations. A LaTeX style file (hyperref) allows authors to add hypertext links to these citations, but it is very rarely used.
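As an illustration of how such direct citations might be spotted, the following is a minimal sketch. It assumes identifiers of the form archive/seven digits (as in hep-th/9907123); the pattern and the function name find_arxiv_ids are illustrative and are not taken from the project's own code.

```python
import re
from typing import List

# Assumed shape of an identifier: lower-case archive name, optionally
# hyphenated, then a slash and seven digits (e.g. hep-th/9907123).
ARXIV_ID = re.compile(r"\b([a-z]+(?:-[a-z]+)?)/(\d{7})\b")


def find_arxiv_ids(citation: str) -> List[str]:
    """Return every archive/number identifier found in a citation string."""
    return [f"{m.group(1)}/{m.group(2)}" for m in ARXIV_ID.finditer(citation)]


if __name__ == "__main__":
    # Hypothetical citation text, used only to exercise the pattern.
    print(find_arxiv_ids("A. Author, Nucl. Phys. B 500 (1999) 1, hep-th/9907123"))
```

A pattern this simple would not cope with badly formatted or hyphenated identifiers, which the results below report as a real source of recogniser errors.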

Using post-hoc hypertext techniques from the Open Journals Project [Carr & Hitchcock] we can add hypertext links from these direct citations. We can also derive arXiv.org archive identifiers from citations which do not include them directly, by looking up the bibliographic data in the arXiv.org journal-ref metadata, or in other bibliographic databases such as SPIRES.
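The lookup of bibliographic data against journal-ref metadata can be pictured roughly as follows. This is only a sketch under stated assumptions: the (journal, volume, first page) key, the normalisation rule, and the names normalise, build_index and resolve are all hypothetical, and the single record shown is made-up sample data rather than real arXiv.org metadata.

```python
import re
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (journal, volume, first page), an assumed key


def normalise(journal: str, volume: str, page: str) -> Key:
    """Lower-case the journal name and drop everything except letters."""
    squeezed = re.sub(r"[^a-z]", "", journal.lower())
    return (squeezed, volume.strip().lower(), page.strip())


def build_index(records: Dict[str, Tuple[str, str, str]]) -> Dict[Key, str]:
    """Map normalised journal-ref keys to archive identifiers."""
    return {normalise(*ref): arxiv_id for arxiv_id, ref in records.items()}


def resolve(index: Dict[Key, str], journal: str, volume: str, page: str) -> Optional[str]:
    """Return the archive identifier for a citation, or None if unmatched."""
    return index.get(normalise(journal, volume, page))


if __name__ == "__main__":
    # Hypothetical journal-ref record, for illustration only.
    index = build_index({"hep-th/9907123": ("Nucl. Phys. B", "500", "1")})
    print(resolve(index, "Nucl.Phys.B", "500", "1"))
```

Normalising both sides before comparison lets differently punctuated journal abbreviations match the same record; a citation that yields no match simply returns None.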

Results

A program, linkcites, has been written which analyses the References section of a PDF document from arXiv.org and adds links to online versions of the cited information (see figures 1-3). The links are coloured according to the state of the recognition; this provides feedback to the implementers for testing. (It is intended that a graphical icon will be used in the final version of the system to allow the user to see that a separate service is adding the links.) The destination of each link is not the ultimate URL of the online data, but an intermediate page provided by the SFX service [van de Sompel], which offers a variety of instantiations of the cited article. The article shown in the figures can be accessed online at http://jounals.ecs.soton.ac.uk/eprintlinks/99070001.pdf.

Figure 1: The references section of article hep-th/9907001 with a number of added links.

Figure 2: Activating one of the links returns an SFX page which allows the user to choose where to look up the cited article.

Figure 3: Choosing "Download from Universal Preprint Archive" yields the article's master page from arXiv.org.

The link colours reflect the success of the citation recogniser and the subsequent resolution against the arXiv.org bibliographic database. An orange link is one which was successfully derived from pure bibliographic data; a red link indicates that the explicit arXiv.org number overrode the derivation process. A blue link indicates that the presence of a citation was correctly recognised, but that the reference is too old to feature in arXiv.org. A greeny-yellow link (the "green" column in the table below) indicates a citation which is in the correct date range, but which failed to match an entry in the database. Finally, a brown link indicates that the recognition stage failed by seeing too little or too much data for a single citation.
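The colour scheme can be summarised as a small decision function. The sketch below is an interpretation of the description above, not the linkcites implementation: the function name link_colour, its parameters, the order of the checks and the cut-off year are all assumptions (the report does not state the date before which a reference counts as too old).

```python
from typing import Optional

# Assumed cut-off year; the report does not give one.
ARCHIVE_START_YEAR = 1991


def link_colour(recognised: bool,
                explicit_id: Optional[str],
                year: Optional[int],
                matched_id: Optional[str]) -> str:
    """Pick a link colour from the outcome of recognition and resolution."""
    if not recognised:
        return "brown"          # too little or too much data for one citation
    if explicit_id is not None:
        return "red"            # explicit arXiv.org number takes precedence
    if year is not None and year < ARCHIVE_START_YEAR:
        return "blue"           # recognised, but too old to be in the archive
    if matched_id is not None:
        return "orange"         # derived from pure bibliographic data
    return "greeny-yellow"      # in date range, but no database match


if __name__ == "__main__":
    print(link_colour(True, None, 1998, None))
```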

The program was tested on 35 hep-th articles: the results are as follows.

Chart: QUIKCITE Link Types

article  num  red  orange  overridden  blue  green  brown  missed  badness  Comments
9907001    9    1       3           0     4      1      0               0%
9907003   13   12       1           6     0      0      0               0%
9907005    6    0       0           0     1      0      5              83%  Bad citing practice (footnote style, mostly books)
9907006   57   53       1          29     2      0      1               2%  Incorrectly formatted arXiv.org id confuses bib recogniser
9907007   22    2      14           0     1      1      4              18%  Two recogniser faults (numbers in paper title), 1 confusing citation and 1 bad one
9907009   17    0       0           0     9      0      8              47%  Old citations, books. Fails to work after a semicolon; fails on the last citation if it has no page number
9907011   21    9       6           3     1      1      4              19%  Genuine bad formatting. Also confused by ref squishing
9907012   43   11       5           0     4      2     21              49%  Confused by numbering of citations; also multiple citation squashing
9907013   19    3       0           1     3      3     10              53%  Book citations. arXiv.org id hyphenation
9907014   12   10       0           6     2      0      0               0%
9907015   30   17       0           5     0      0     13              43%  Significant numbers separated by colons
9907016   12    0       3           0     9      0      0               0%
9907017    3    2       0           0     0      0      1      25      93%  Untitled refs section & bad formatting (no spaces)
9907018    6    1       0           0     4      0      1              17%
9907019   82   38      19           0     8      5     12              15%
9907021    9    9       0           1     0      0      0               0%
9907022   28    6      19           0     0      2      1               4%
9907023   57    8       6           0    14      2     27              47%  Confused by numbering of citations
9907025   28    0      14           0     4      2      8       5      39%  Untitled refs section, books & ref squishing
9907026   79   41      11           3     7      3     17              22%
9907027   18    0      10           0     4      3      1               6%  Ref squishing
9907029   39   16      14           0     3      0      6              15%
9907030   34    2       9           0    20      0      3               9%  Ref squishing
9907032   39   13      18           0     5      2      1               3%
9907034   10    7       1           0     2      0      0               0%
9907035   21    0       2           0    18      1      0               0%
9907036   24   21       2          11     0      0      1               4%  Untitled refs section
9907038   45   31       1          17     7      3      3               7%
9907039   24    0       3           0    11      3      7              29%  Books, reports and ref squishing
9907041   14    4       2           0     4      1      3              21%  Ref squishing
9907044   11    0       0           0     2      0      9              82%  Untitled refs section (mainly books anyway)
9907045   10    0       2           0     2      3      3              30%  Books
9907046   37    0       5           0    10      3     19              51%  Numbers in titles + missed [arXiv.org id]
9907048   21    0       1           0    10      2      8              38%  Books & ref squishing
9907049   12    3       6           0     1      0      2              17%
Totals    912  320     178          82   172     43    199              22%
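
The report does not define the badness column. The figures in the table are, however, consistent with the share of citations that were either wrongly delimited (brown) or missed altogether, taken over all citations including the missed ones. The sketch below encodes that inferred formula; the function name badness and the formula itself are assumptions, checked only against the rows of the table.

```python
def badness(num: int, brown: int, missed: int = 0) -> int:
    """Inferred badness percentage: (brown + missed) / (num + missed).

    num    -- citations the recogniser found
    brown  -- citations where recognition saw too little or too much data
    missed -- citations not found at all (blank in the table means 0 here)
    """
    total = num + missed
    if total == 0:
        return 0
    return round(100 * (brown + missed) / total)


if __name__ == "__main__":
    # Rows 9907005, 9907017 and 9907025 from the table.
    for row in [(6, 5, 0), (3, 1, 25), (28, 8, 5)]:
        print(row, badness(*row))
```

The Totals row (22%) matches this formula only when the missed citations are left out, as its missed cell is blank in the table.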