NB: This document consists of:
              Donna's original version (text in black)
          + comments from Zhuoan (text in blue, mainly )
          + Donna's original text marked by Zhuoan (text in red)

Processing an Item in the Archive (Pseudo Code)

One test of the API is to see if it can be used to construct a collection of surrogate objects in the first place. This page details the use of the API to construct such a collection.

We assume two databases are available. The first one holds creation-related data (URN, OAMS metadata). (Note: for Open Archives, the database can be loaded up with existing metadata records.) The other database is the citation database consisting of source and target pairs of document ids. A document id could be the index in the first database of where metadata for the citing and cited creations can be found.
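
As a rough illustration of the two databases, the sketch below uses in-memory stand-ins. The names TwoDatabases, CreationDb, CitationDb, findOrAdd and citingIdsOf are assumptions made for this example and are not part of the API. A document id is taken to be the index of the metadata record in the first database, as suggested above, and a metadata record is reduced to a plain string.

```java
import java.util.ArrayList;
import java.util.List;

class TwoDatabases {
    /** Hypothetical creation database: a document id is the index of its metadata record. */
    static class CreationDb {
        private final List<String> metadata = new ArrayList<>();

        /** Returns the id of a matching record, adding the record first if it is absent. */
        int findOrAdd(String oamsRecord) {
            int id = metadata.indexOf(oamsRecord);
            if (id < 0) {
                metadata.add(oamsRecord);
                id = metadata.size() - 1;
            }
            return id;
        }

        String get(int id) {
            return metadata.get(id);
        }
    }

    /** Hypothetical citation database: a list of (source, target) document id pairs. */
    static class CitationDb {
        private final List<int[]> pairs = new ArrayList<>();

        void add(int sourceId, int targetId) {
            pairs.add(new int[] {sourceId, targetId});
        }

        /** Document ids of every creation that cites the given target. */
        List<Integer> citingIdsOf(int targetId) {
            List<Integer> sources = new ArrayList<>();
            for (int[] p : pairs) {
                if (p[1] == targetId) {
                    sources.add(p[0]);
                }
            }
            return sources;
        }
    }
}
```

The exact-match lookup in findOrAdd is only a placeholder; a real search would have to tolerate incomplete or differently formatted metadata.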

  1. Get the next paper in the archive.
  2. Convert the full text of the paper to text/plain. If this can't be done, give up: without plain text we will never determine the linking relationships.
  3. Instantiate a new Surrogate object, passing it the addresses of this item in the archive.
  4. During the instantiation, we extract the available metadata from the ASCII version of the document. Extracted information includes titles, authors, etc. The original text fragments are saved in the MIMEfile localMetaData.  If there is a Dienst harvesting interface for this archive, use it to get the OAMS, and try to find the matching text fragments.
  5. See if this paper is already in the database, using any available data to conduct the search. If so, save the document id. If not, construct an OAMS metadata MIMEfile and add it to the database and save the new document id. A new BibData is constructed from the saved document id.
  6. We now have the following private fields defined:

    BibData   myData         // our document id
    String    myURL          // Network address of our item
    MIMEfile  localMetaData  // Original text fragments
  7. Next collect the references from the raw text. For each reference, save the original string, set its reference type (it may or may not be linkable), and save the contexts of the reference. If it makes sense, save the ordinal number. Parse the reference text to collect some metadata.
  8. For each reference, use collected metadata to look up the reference in the database. If it is already there, save the document id. At this point, you might also have a chance to correct/add some more metadata to the database for this document, if the reference contains more information than was in the database.
  9. ZJ: This lookup process may not be simple if the metadata collected from the reference is incomplete or contains errors. CiteSeer can 'predict' that the following three citations refer to the same paper (http://citeseer.nj.nec.com/check/814):

    Quinlan J. R.: C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California 1993
    Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann.
    J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

    This kind of 'grouping' process is necessary in order to obtain the correct number of citations of
    a particular paper (i.e. forward links). Also, only after this process is it possible to "have a chance to correct/add some more metadata to the database for this document."

    If the reference was in the database already, then it has a URN. Construct a new Citation out of this reference by giving it the context[] and type of citation (REFERENCE). Use the reference's URN to locate the surrogates for copies of this creation. (This involves a call to a handle system.) For each surrogate on the list invoke its addCitation method, handing it the new Citation object.

    ZJ: Maybe "construct a new Citation out of the current paper" is what was really meant? From the point of view of this reference, the current paper is its citation, so we need to turn this paper into a Citation object. Effectively, each paper creates a Citation object as long as it has at least one reference. I think this step and the addCitation call are also needed for references not already in the database (see below).

    If the reference is not already in the database, then we need to construct an OAMS metadata MIMEfile and add it to the database. Save the newly generated document id.

    Finally, construct a BibData from the reference's document id and store it in referenceData.

    For each reference, construct a new CiteRef from this document's id and the reference id, and add it to the citation database. (ResearchIndex would also generate a unique CID for this citation.)

    At this point, we have a completed Reference object:

    BibData  referenceData   // pointer into the creation database; a doc id
    int      ordinalNumber   // which reference this is in this item
    String   origRef         // how the reference was spelled in the text
    String   context[]       // context strings from the text for this reference
    RefEnum  refType         // NATURAL, AMBIGUOUS, CLEAR, or LINKABLE
    Process each reference in the same way until the Surrogate's refList[] is complete.
  10. Finally we do the citations. Go to the citation database. For each record where our document id is the target, use the source document id to retrieve the associated creation from the first database. This is the creation that cites us.
  11. ZJ: Steps 8-9 do not seem necessary? (Please see the reasons below.)

    If this citation is already in our currentCitations we are done with this CiteRef. (We know it's in our list by matching up document ids.)

    (ZJ: currentCitations is defined as knownCitations in the Java API specification.)

    If it is not on our list, then we must construct a new Citation object and add it to our currentCitations. Constructing a new Citation object requires a document id, a set of context strings, and a citation type. We have the document id. Use it to access the surrogate corresponding to the citing creation.

    ZJ: As I see it, the only situation in which a <source_id, target_id> pair exists in the citation database (i.e. one of the two existing databases) but is not a CiteRef object is this: the pair was created from information found in SCI (or possibly CiteSeer too?) and there are no 'surrogates' corresponding to the citing creations yet, right? If a <source_id, target_id> pair is also a CiteRef object, then the source document should already have been processed and turned into a Citation object, which was added to the currentCitations list of the target document's surrogate in step 7, when the surrogate for the source document was created. Therefore, performing addCitation in step 7 when dealing with each reference is important. From the referenced documents' point of view, addCitation actually performs the jobs described in steps 8-9, i.e. building up a list of known citations for the surrogates of the referenced documents.

    How do we do this access? First, we feed the citing creation's URN to our name server, which gives us URLs for all the surrogates for the creation that cites us. Pick one of the surrogates. Invoking its getRefID(MIMEfile citation-BibData) will return the complete Reference in the citing creation for which we were the target.

    Turn that Reference into a Citation by invoking the static Surrogate.buildCitation( Reference ) method.

  12. Take this Citation and add it to our currentCitations. We are now done with this CiteRef. Repeat until all CiteRefs for which we are the target have been handled. At this point, our currentCitations is complete. It may grow as other surrogate constructors invoke our addCitation method.
  13. Done building the Surrogate object for this item. Store the FEDORA object in the "repository" and go to step 1.
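
The reference lookup in the steps above depends on recognising differently formatted reference strings as the same paper, as ZJ points out. The sketch below shows one very crude way to group such strings, by comparing their sets of lower-cased alphanumeric tokens (Jaccard similarity). This is an assumed stand-in chosen for illustration, not the method CiteSeer uses; the class ReferenceGrouping, its method names, and the threshold value are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ReferenceGrouping {
    /** Lower-cases a reference string and splits it into a set of alphanumeric tokens. */
    static Set<String> tokens(String reference) {
        Set<String> result = new HashSet<>();
        for (String t : reference.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    /** Jaccard similarity of the two token sets: |A intersect B| / |A union B|. */
    static double similarity(String a, String b) {
        Set<String> shared = tokens(a);
        Set<String> other = tokens(b);
        Set<String> union = new HashSet<>(shared);
        union.addAll(other);
        if (union.isEmpty()) {
            return 0.0;
        }
        shared.retainAll(other);
        return (double) shared.size() / union.size();
    }

    /** Greedy grouping: a reference joins the first group whose first member is similar enough. */
    static List<List<String>> group(List<String> references, double threshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String ref : references) {
            List<String> home = null;
            for (List<String> g : groups) {
                if (similarity(g.get(0), ref) >= threshold) {
                    home = g;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                groups.add(home);
            }
            home.add(ref);
        }
        return groups;
    }
}
```

With the three Quinlan strings quoted above, each pair shares 13 of 17 distinct tokens (about 0.76), so a threshold of, say, 0.6 would place them in one group. Token overlap alone will confuse different papers by the same author and publisher, so a usable lookup would need to weight title, author and year separately.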
It looks as though the API, with two databases and a document id resolver, will let us build a collection of surrogates. Here are the particular methods in the API that were used:
Constructors for Surrogate, Reference, Citation, BibData, Creation, CiteRef
getID()
getRefED()  (ZJ: getRefID)
Protected methods used:
addCitation()
buildCitation()
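
To show how these methods might fit together, here is a much-simplified sketch in which processing a reference also registers the forward link on the cited surrogate through addCitation, which is the point ZJ makes above. Everything about the signatures is an assumption: document ids are plain ints rather than BibData, the name server or handle system lookup is skipped, cite() and getKnownCitations() are hypothetical helpers, and buildCitation takes the citing surrogate as an extra argument (the text above gives it only a Reference) so that the Citation can carry the citing document's id.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogateSketch {
    /** Simplified Reference: target document id plus context strings (other fields omitted). */
    static class Reference {
        final int targetId;
        final String[] context;

        Reference(int targetId, String... context) {
            this.targetId = targetId;
            this.context = context;
        }
    }

    /** Simplified Citation: the citing document id and the contexts in which it cites us. */
    static class Citation {
        final int citingId;
        final String[] context;

        Citation(int citingId, String[] context) {
            this.citingId = citingId;
            this.context = context;
        }
    }

    static class Surrogate {
        private final int id;
        private final List<Reference> refList = new ArrayList<>();
        private final List<Citation> knownCitations = new ArrayList<>();

        Surrogate(int id) {
            this.id = id;
        }

        int getID() {
            return id;
        }

        /** Returns our Reference whose target is the given document, or null if there is none. */
        Reference getRefID(int targetId) {
            for (Reference r : refList) {
                if (r.targetId == targetId) {
                    return r;
                }
            }
            return null;
        }

        /** Records a citing document once; duplicates are detected by document id. */
        void addCitation(Citation c) {
            for (Citation known : knownCitations) {
                if (known.citingId == c.citingId) {
                    return;
                }
            }
            knownCitations.add(c);
        }

        /** Hypothetical helper: adds a reference and registers the forward link on the target. */
        void cite(Surrogate target, String... context) {
            Reference r = new Reference(target.getID(), context);
            refList.add(r);
            target.addCitation(buildCitation(this, r));
        }

        /** Assumed two-argument form: the citing surrogate supplies the Citation's document id. */
        static Citation buildCitation(Surrogate citing, Reference r) {
            return new Citation(citing.getID(), r.context);
        }

        List<Citation> getKnownCitations() {
            return knownCitations;
        }
    }
}
```

In this arrangement the later pass over the citation database only has to handle source and target pairs whose citing document has no surrogate yet, since addCitation ignores a citing document it already knows.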

bergmark/private/DLRG/ReferenceLinking/API.html 2000-03-27