About Parser::Citation and how to use it

A Perl module for extracting reference metadata from scholarly eprint papers.
Citation.pm is free software made available under the terms of the GNU General Public License.

The Open Citation Project is developing automated reference linking services for open eprint archives. The project has produced Citation.pm to extract metadata from references within the scholarly papers deposited in the archives, for example

      [2] W.S. Wilburn and J.D. Bowman, Phys. Rev. C57 (1998) 3425.

Currently, Citation.pm attempts to parse metadata from references to journal papers; it is not good at parsing references to books, conference proceedings, theses, etc. The fields it targets, as shown in the sample output below, are the authors, first author, journal, volume, issue, start page and year. Sometimes the title of the referenced paper is also extracted if it is in an easy-to-recognise form (e.g. enclosed in double quotes). These data are sufficient to identify a journal paper uniquely for reference linking purposes.
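To make the idea of field extraction concrete, here is a minimal sketch in Python. It is not how Citation.pm works internally and does not use its API; the pattern, the function name parse_reference and the returned field names are all assumptions made for illustration. It handles only the single layout "authors, journal volume (year) page" used by the example reference above.

```python
import re

# Hypothetical pattern for one common layout only:
#   "[n] <authors>, <journal> <volume> (<year>) <start page>."
REFERENCE_PATTERN = re.compile(
    r"""
    ^\s*(?:\[\d+\]\s*)?        # optional reference label such as "[2]"
    (?P<authors>[^,]+?),\s*    # author block, up to the first comma
    (?P<journal>.+?)\s+        # journal title, usually abbreviated
    (?P<volume>[A-Z]?\d+)\s*   # volume, optionally with a series letter
    \((?P<year>\d{4})\)\s*     # four-digit year in parentheses
    (?P<page>\d+)              # start page
    \.?\s*$                    # optional trailing full stop
    """,
    re.VERBOSE,
)


def parse_reference(text):
    """Return a dict of fields, or None if the layout is not recognised."""
    match = REFERENCE_PATTERN.match(text)
    if match is None:
        return None
    fields = match.groupdict()
    # Split the author block on the word "and" to get individual names.
    fields["authors"] = [
        name.strip() for name in re.split(r"\s+and\s+", fields["authors"])
    ]
    fields["journal"] = fields["journal"].strip()
    return fields


if __name__ == "__main__":
    print(parse_reference(
        "[2] W.S. Wilburn and J.D. Bowman, Phys. Rev. C57 (1998) 3425."
    ))
```

In this sketch the series letter stays attached to the volume ("C57"); whether it belongs to the journal title instead is a design decision the sketch does not try to settle.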

Citation.pm performs better when metadata are separated by commas, i.e. a comma between the authors and the article or journal title, and a comma between the article title and the journal title. Without a comma between the author names and the journal title, it is difficult for Citation.pm to parse the (last) author's name and the journal title correctly. In the two examples below, the first loses the second author and the second loses the journal title:

       [1] V. E. Zakharov and A. B. Shabat. J. Funct. Anal. Appl. 8, 226 (1974)

		Authors:        V.E.Zakharov
		First Author:   V.E.Zakharov
		Journal:        J.FUNCT.ANAL.APPL.
		Volume:         8
		Start Page:     226
		Year:           1974

       [2] Michoel T., Momont B., and Verbeure A. Reports on Mathematical Physics, 41(3):361-395, 1998.

		Authors:        T.Michoel:B.Momont:A.Verbeure
		First Author:   T.Michoel
		Volume:         41
		Issue:          3
		Start Page:     361
		Year:           1998             
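The reason a missing comma hurts can be shown with a small Python sketch. It is an illustration only, not part of Citation.pm, and the helper name candidate_splits is hypothetical. The sketch lists every place where a reference could be cut into an author part and a remainder. When full stops are the only delimiter, the stops after initials and journal abbreviations all become candidate boundaries; with a comma after the authors there are far fewer to choose from.

```python
def candidate_splits(reference, separator):
    """Return every (left, right) pair obtained by cutting at `separator`."""
    parts = reference.split(separator)
    return [
        (separator.join(parts[:i]).strip(), separator.join(parts[i:]).strip())
        for i in range(1, len(parts))
    ]


if __name__ == "__main__":
    # Reference [1] as printed above: only a full stop ends the author block.
    no_comma = "V. E. Zakharov and A. B. Shabat. J. Funct. Anal. Appl. 8, 226 (1974)"
    # The same reference rewritten with a comma after the authors (hypothetical).
    with_comma = "V. E. Zakharov and A. B. Shabat, J. Funct. Anal. Appl. 8, 226 (1974)"

    print(len(candidate_splits(no_comma, ". ")), "possible cuts at a full stop")
    print(len(candidate_splits(with_comma, ", ")), "possible cuts at a comma")
    print(candidate_splits(with_comma, ", ")[0])
```

A real parser needs more than this (for instance, knowledge of journal abbreviations), but the count of candidate boundaries gives a feel for why comma-separated references are easier.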

In the OpCit project we are using the extracted metadata to build a citation database for eprints. By matching extracted data against existing entries in the database, we can determine whether one or more eprint IDs are known for the identified reference and, if so, return links to them in the original document. The database can be structured so that references link directly to the referenced papers and also to other papers that have subsequently cited a given paper or referenced paper. This is known as 'forward linking' because such links take the user forward in time. We are also using the citation database as the basis for a prototype citation-ranked search engine for eprint archives.
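As a rough illustration of the matching and forward linking just described, the following Python sketch keeps two in-memory maps: one from a normalised (journal, volume, start page, year) key to a known eprint ID, and one from the same key to the set of eprints that cite it. The class name CitationIndex, its methods, the key layout and the eprint IDs are all hypothetical; the actual OpCit database design is not described here and may differ.

```python
from collections import defaultdict


class CitationIndex:
    """Toy citation database: reference matching plus forward links."""

    def __init__(self):
        self.eprint_by_key = {}           # key -> eprint ID of the cited paper
        self.cited_by = defaultdict(set)  # key -> IDs of eprints citing it

    @staticmethod
    def key(journal, volume, start_page, year):
        # Upper-case the journal and drop whitespace so that
        # "Phys. Rev." and "PHYS.REV." produce the same key.
        journal = "".join(journal.upper().split())
        return (journal, str(volume), str(start_page), str(year))

    def register_eprint(self, eprint_id, journal, volume, start_page, year):
        """Record that a published paper is available as `eprint_id`."""
        self.eprint_by_key[self.key(journal, volume, start_page, year)] = eprint_id

    def add_reference(self, citing_id, journal, volume, start_page, year):
        """Store one extracted reference; return the cited eprint ID or None."""
        k = self.key(journal, volume, start_page, year)
        self.cited_by[k].add(citing_id)
        return self.eprint_by_key.get(k)

    def forward_links(self, journal, volume, start_page, year):
        """Return the IDs of eprints known to cite the given paper."""
        return sorted(self.cited_by.get(self.key(journal, volume, start_page, year), ()))


if __name__ == "__main__":
    index = CitationIndex()
    index.register_eprint("eprint-A", "Phys. Rev.", "C57", 3425, 1998)  # made-up ID
    print(index.add_reference("eprint-B", "PHYS.REV.", "C57", 3425, 1998))
    print(index.forward_links("Phys. Rev.", "C57", 3425, 1998))
```

Counting the entries returned by forward_links would give a simple citation count, which is one possible input to a citation-ranked search.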

As you can see, Citation.pm serves just one part of this process. We will make other components available whenever we can, and plan to provide advice and guidance about building citation databases for reference linking applications.

Of course, it would be useful for other applications if metadata could also be extracted from references to other types of publication (proceedings, books, ...), although it is unlikely that we would be able to provide links to such works, since access to them usually requires a fee. This might be future work. For now we are concentrating on parsing citations to journal articles.

How to use Citation.pm

There seems to be increasing interest in reference linking, not just from large publishers but also from researchers who are building information resources and see reference linking as a means of making their projects more useful. In fact, we are making Citation.pm available in response to a couple of requests from other researchers. However, please bear in mind that Citation.pm is still far from a final product, so it may fail on some references. Documentation on how to use the module is included in the distribution in pod format, and an online version generated by pod2html is also available.

The code has been tested on some of the references found in papers deposited in the Los Alamos arXiv.org physics archives. It seems to work well for most references to journal papers. More example results of metadata extraction produced using Citation.pm are also available.

(Z. Jiao, 20/06/01)