The Open Citation Project is developing automated reference linking services for open eprint archives. The project has produced Citation.pm to extract metadata from references within the scholarly papers deposited in the archives, for example:
[2] W.S. Wilburn and J.D. Bowman, Phys. Rev. C57 (1998) 3425.
Currently, Citation.pm attempts to parse the following metadata from references to other journal papers (it is not good at parsing metadata from references to books, conference proceedings, theses, etc.):
Citation.pm performs better when metadata are separated by commas (i.e. a comma between the authors and the article or journal title, and a comma between the article title and the journal title). If there is no comma to separate the author name(s) from the journal title, for example, it is difficult for Citation.pm to parse the (last) author's name and the journal title correctly:
[1] V. E. Zakharov and A. B. Shabat. J. Funct. Anal. Appl. 8, 226 (1974)
Authors: V.E.Zakharov
First Author: V.E.Zakharov
Journal: J.FUNCT.ANAL.APPL.
Volume: 8
Start Page: 226
Year: 1974
[2] Michoel T., Momont B., and Verbeure A. Reports on Mathematical Physics, 41(3):361-395, 1998.
Authors: T.Michoel:B.Momont:A.Verbeure
First Author: T.Michoel
Journal: A.REPORTS ON MATHEMATICAL PHYSICS
Volume: 41
Issue: 3
Start Page: 361
Year: 1998
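To illustrate why the comma matters, here is a minimal Python sketch of a comma-dependent parser. It is not the Citation.pm implementation (which is a Perl module whose internals are not shown here); the pattern, the function name parse_reference and the single reference layout it handles are assumptions made for illustration only.

```python
import re

# Hypothetical pattern covering ONE reference layout only:
#   "<authors>, <journal> <volume> (<year>) <start page>."
# It relies on the comma to find where the authors end.
REF_PATTERN = re.compile(
    r"^(?:\[\d+\]\s*)?"          # optional label such as "[2] "
    r"(?P<authors>.+?),\s+"      # authors, up to the separating comma
    r"(?P<journal>.+?)\s+"       # journal title
    r"(?P<volume>[A-Z]?\d+)\s+"  # volume, e.g. "C57"
    r"\((?P<year>\d{4})\)\s+"    # year in parentheses
    r"(?P<page>\d+)\.?$"         # start page
)

def parse_reference(text):
    """Return a dict of fields, or None if the layout is not recognised."""
    match = REF_PATTERN.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Split the author string on commas and on the word "and".
    fields["authors"] = [
        name.strip()
        for name in re.split(r",\s*|\s+and\s+", fields["authors"])
        if name.strip()
    ]
    return fields

with_comma = parse_reference(
    "[2] W.S. Wilburn and J.D. Bowman, Phys. Rev. C57 (1998) 3425."
)
without_comma = parse_reference(
    "[2] W.S. Wilburn and J.D. Bowman Phys. Rev. C57 (1998) 3425."
)
```

With the comma present, the sketch finds both authors and the journal title; with the comma removed, it has no reliable boundary between the last author and the journal and gives up. A real parser needs many more layouts and fallbacks than this single pattern.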
In the OpCit project we are using the extracted metadata to build a citation database for eprints. By matching extracted data with existing entries in the database, we can determine whether one or more eprint IDs are known for the identified reference and return links to them in the original document. The database can be structured so that references are not only linked directly to the referenced papers, but also to other papers that have subsequently cited a given paper or referenced paper. This is known as 'forward linking' because such links take the user forward in time. We are also using the citation database as the basis for a prototype citation-ranked search engine for eprint archives.
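The matching and forward-linking ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OpCit database design: the class CitationDatabase, its method names, the choice of (journal, volume, start page) as the matching key, and the eprint IDs such as "eprint-A" are all hypothetical.

```python
from collections import defaultdict

def citation_key(journal, volume, start_page):
    """Normalise extracted fields into a lookup key (assumed key choice)."""
    journal = "".join(journal.split()).upper()  # "Phys. Rev." -> "PHYS.REV."
    return (journal, str(volume).upper(), str(start_page))

class CitationDatabase:
    """Toy in-memory citation database (hypothetical structure)."""

    def __init__(self):
        self.eprint_by_key = {}           # citation key -> eprint ID
        self.cited_by = defaultdict(set)  # cited ID -> IDs of citing eprints

    def register_eprint(self, eprint_id, journal, volume, start_page):
        """Record where a deposited eprint was published."""
        key = citation_key(journal, volume, start_page)
        self.eprint_by_key[key] = eprint_id

    def link_reference(self, citing_id, journal, volume, start_page):
        """Match one extracted reference; return the cited eprint ID or None."""
        key = citation_key(journal, volume, start_page)
        cited_id = self.eprint_by_key.get(key)
        if cited_id is not None:
            self.cited_by[cited_id].add(citing_id)  # enables forward linking
        return cited_id

    def forward_links(self, eprint_id):
        """Eprints known to cite the given eprint."""
        return sorted(self.cited_by.get(eprint_id, ()))

    def rank_by_citations(self, eprint_ids):
        """Order eprints by how many known citations each has received."""
        return sorted(
            eprint_ids,
            key=lambda e: len(self.cited_by.get(e, ())),
            reverse=True,
        )

db = CitationDatabase()
db.register_eprint("eprint-A", "Phys. Rev.", "C57", 3425)
db.link_reference("eprint-B", "PHYS.REV.", "C57", "3425")
db.link_reference("eprint-C", "Phys. Rev.", "C57", 3425)
```

After these calls, db.forward_links("eprint-A") lists the two citing eprints, and rank_by_citations orders a result set by citation count, which is the basic idea behind a citation-ranked search.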
As you can see, Citation.pm serves just one part of this process. We will make other components available whenever we can, and plan to provide advice and guidance about building citation databases for reference linking applications.
Of course, it would be useful for other applications if metadata could also be extracted from references to other types of publications (proceedings, books, ...), although it is unlikely we would be able to provide links to such works, which usually require a fee to access. This might be future work. For now we are concentrating on parsing citations to journal articles.
The code has been tested on some of the references found in papers deposited in the Los Alamos arXiv.org physics archives. It seems to work well for most references to journal papers. Take a look at more example results of metadata extraction produced using Citation.pm.