[librecat-dev] MARC authority record lookup

Patrick Hochstenbach Patrick.Hochstenbach at UGent.be
Wed Sep 19 11:36:41 CEST 2018


There is no generic procedure in Catmandu (or in general) for enriching personal names in MARC bibliographic records.

In the https://github.com/LibreCat/MARC2RDF project there are some examples of how the VIAF dataset can be used to create “personal name” -> VIAF URI mappings.
The algorithm is quite simplistic (with reasonable precision but not optimal recall):

  * Take the author name + birth date from the MARC record
  * Look up this author name + birth date in VIAF; if there is exactly one hit, we assume it is the correct VIAF record for that author
  * Add the VIAF identifier to the 100$0
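
The three steps above could be sketched roughly as follows. This is plain Python rather than Catmandu, the in-memory list only stands in for VIAF, and the names find_unique_match and viaf_id as well as the identifiers are made up for illustration:

```python
def find_unique_match(name, birth_year, candidates):
    """Return the identifier of the only candidate with the same name and
    birth year, or None when there are zero or several such candidates."""
    hits = [
        c for c in candidates
        if c["name"] == name and c["birth"] == birth_year
    ]
    # Accept only an unambiguous result: exactly one hit.
    return hits[0]["viaf_id"] if len(hits) == 1 else None


# Hypothetical stand-in for the VIAF data (identifiers are placeholders).
candidates = [
    {"name": "Smith, John", "birth": 1950, "viaf_id": "id-1"},
    {"name": "Smith, John", "birth": 1950, "viaf_id": "id-2"},
    {"name": "Smith, John", "birth": 1902, "viaf_id": "id-3"},
]

print(find_unique_match("Smith, John", 1902, candidates))  # one hit
print(find_unique_match("Smith, John", 1950, candidates))  # ambiguous
```

Requiring exactly one hit is what keeps precision reasonable, and it is also why recall suffers: ambiguous names and names spelled slightly differently are simply skipped.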

This is a procedure that could also be done in Catmandu, Perl or any other program. What Catmandu does is make it easier to parse MARC data and create databases. It doesn’t help you create better deduplication algorithms.

I understand you have local authority records.

The general procedure with Catmandu in your case would be:

1. Create a search index (e.g. ElasticSearch) of the local authority records which contains the fields authority record id, author name, date of birth, date of death, uri

This can be done with:

   catmandu import MARC to ElasticSearch --index-name 'authority' --fix authority.fix < authority.mrc

Where authority.fix extracts from authority.mrc the data you need for deduplication matching. E.g.

  marc_map(100d,date)
  parse_text(date,"(?<birth>\d{4})-(?<death>\d{4})")
  marc_map(700a,names.$append)
  marc_map(001,_id)
  retain(_id,names,date)

This should give you an ElasticSearch index with records like:

  _id:
  names:
      - Dostoevskij, Fyodor
      - Достоjевски, Ф. М.
      - Dostoyevsky, Fyodor Mikhaylovich
      - …
  date:
     birth: 1821
     death: 1881
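
The birth and death keys come from the named groups in the parse_text pattern. To see what the pattern does in isolation, here is the same expression in Python (where named groups are written (?P<name>...)); the helper name parse_dates is made up for this sketch:

```python
import re

# Same pattern as in authority.fix, in Python's named-group syntax.
DATE_PATTERN = re.compile(r"(?P<birth>\d{4})-(?P<death>\d{4})")


def parse_dates(value):
    """Return {'birth': ..., 'death': ...} or None when the pattern fails."""
    match = DATE_PATTERN.search(value)
    return match.groupdict() if match else None


print(parse_dates("1821-1881"))
print(parse_dates("1950-"))
```

In this Python version an open-ended date such as "1950-" does not match at all, so it is worth checking how values like that occur in your authority data and loosening the pattern if needed.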

2. Then you need to parse your MARC records and try to find a match in the ElasticSearch index, with something along the lines of:

   catmandu convert MARC to MARC --fix dedup.fix < records.mrc

With dedup.fix something like (this is not a working example, just showing some of the logic):

   do marc_each()
      if marc_has(100)
         marc_map(100a,name)
         marc_map(100d,date)
         parse_text(date,"(?<birth>\d{4})-(?<death>\d{4})")
         # literal strings in paste() start with ~ ; the index field is called names
         paste(query,"~names:",name,"~AND date.birth:",date.birth)

         # replaces query with the search result (keys: start, limit, total, hits)
         search_in_store(query,store:ElasticSearch,index_name:authority,limit:10)

         # check if we have exactly one hit
         if all_equal(query.total,1)
            marc_set(1000,query.hits.0._id)
         end
      end
   end


You can add logging with https://metacpan.org/pod/release/NICS/Catmandu-1.10/lib/Catmandu/Fix/log.pm
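
For the logging question the idea is simply to write a message whenever the lookup does not return exactly one hit. Sketched in Python with the standard logging module (the function enrich and the lookup callback are hypothetical and not part of Catmandu; the toy index stands in for the search index):

```python
import logging

log = logging.getLogger("dedup")


def enrich(record_id, name, birth, lookup):
    """Return the authority id for an unambiguous match, else log and return None.

    `lookup` is any callable returning a list of authority ids for
    (name, birth); here it stands in for the search index.
    """
    hits = lookup(name, birth)
    if len(hits) == 1:
        return hits[0]
    # Zero hits and several hits are both logged, with the hit count.
    log.warning("record %s: %d hits for %r (%s)", record_id, len(hits), name, birth)
    return None


# Toy lookup table instead of a real index.
index = {("Dostoevskij, Fyodor", "1821"): ["auth-1"]}


def lookup(name, birth):
    return index.get((name, birth), [])


logging.basicConfig(level=logging.WARNING)
print(enrich("rec-1", "Dostoevskij, Fyodor", "1821", lookup))
print(enrich("rec-2", "Unknown, Author", "1900", lookup))
```

Logging the hit count together with the record id makes it easy to separate "not found" cases from "ambiguous" ones afterwards.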

In real life this deduping doesn’t have an easy solution. Depending on your local data you will very probably need some data cleaning, both when creating the authority search index and before matching the data. Every local catalog has its own quirks.

Patrick


> On 19 Sep 2018, at 10:38, Uldis Bojars <captsolo at gmail.com> wrote:
> 
> Hi,
> 
> We need to enrich references to personal names in MARC bibliographic records (fields 100, etc.) with additional information (system number -> $0, VIAF URI -> $1) from MARC authority records.
> 
> The input is two separate MARC files: bibliographic and authority data. The output should be an enriched version of the bibliographic data.
> 
> What is the best way to do this with Catmandu?
> 
> Also, is it possible to set up logging [to record cases where matching authority records were not found] from the Fixes language?
> 
> Thanks,
> Uldis
> 
> 