[librecat-dev] identify duplicate records with Catmandu

Nicolas Franck Nicolas.Franck at UGent.be
Fri Dec 2 12:10:10 CET 2016


Identifying duplicates depends on what you count as a "duplicate".


I would do the following:


1. At the beginning of the fix, create a new field (for example "identifier") by joining the values of the fields that define a duplicate.

2. Use "lookup_in_store" to check whether that identifier already exists in the store.

3. If it does, use "reject", which stops the fix and rejects the current record.

4. If it does not, store the identifier using "add_to_store".

5. Continue with the rest of your fix.
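
The steps above can be sketched outside Catmandu to show the logic. This is only an illustration in Python, not a Catmandu fix: the record layout (a dict mapping a tag to a list of values), the function names, the in-memory set standing in for the store, and the lowercase/trim normalisation are all assumptions made for the example.

```python
def make_identifier(record, tag_prefixes=("245", "6")):
    """Step 1: join the values of the chosen fields into one key.

    Assumes three-character tags, so the prefix "6" selects all 6** tags.
    The strip/lower normalisation is an assumption, not part of the original advice.
    """
    parts = []
    for tag in sorted(record):
        if tag.startswith(tag_prefixes):
            parts.extend(value.strip().lower() for value in record[tag])
    return "|".join(parts)


def dedup(records):
    """Steps 2-5: keep only the first record seen for each identifier."""
    seen = set()   # stands in for the store used for the lookup
    kept = []
    for record in records:
        identifier = make_identifier(record)
        if identifier in seen:
            continue              # step 3: reject the duplicate
        seen.add(identifier)      # step 4: remember the identifier
        kept.append(record)       # step 5: carry on with the record
    return kept


records = [
    {"245": ["A Title"], "650": ["Cats"]},
    {"245": ["a title "], "650": ["cats"]},   # same key after normalisation
    {"245": ["Other"], "650": ["Cats"]},
]
unique = dedup(records)
```

Note that a joined key only matches when all of the chosen fields agree. For a rule such as "same 245 or same value in any 6** tag", one key per field would have to be stored and checked instead.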


________________________________
From: librecat-dev-bounces at lists.uni-bielefeld.de <librecat-dev-bounces at lists.uni-bielefeld.de> on behalf of Sergio Letuche <code4libuserx at gmail.com>
Sent: Friday, December 2, 2016 10:03 AM
To: librecat-dev at lists.uni-bielefeld.de
Subject: [librecat-dev] identify duplicate records with Catmandu

Hello community,

How do you deduplicate records?

For our use case, we consider records to be duplicates when they share the same content in, for example, the 245 tag and all 6** tags.

In other words, a record is identical to another if its 245 tag has the same value as the other record's 245 tag, or if any of its 6** tags has the same metadata as the other record's.

How would you approach this with a fix?

Best
