[librecat-dev] Catmandu::MARC and potentially a UTF-8 bug

Patrick Hochstenbach Patrick.Hochstenbach at UGent.be
Tue Jan 31 13:51:49 CET 2017


Hi

Catmandu::MARC expects MARC21 input by default. For UNIMARC input we advise using the RAW parser. For example:

# From the command line

$ catmandu convert MARC --type RAW to MARC --type XML < some_records.mrc.txt

Or from a Perl script:

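#!/usr/bin/env perl

use strict;
use warnings;
use Catmandu;

# Read the records with the RAW parser instead of the default MARC21 one
my $importer = Catmandu->importer('MARC', type => 'RAW', file => 'some_records.mrc.txt');
# Write MARCXML (to STDOUT, since no file is given)
my $exporter = Catmandu->exporter('MARC', type => 'XML');

$exporter->add_many( $importer );

# Flush the output and close the XML document
$exporter->commit;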
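
Applied to the command from the original message, the same advice would look roughly like the sketch below. It is untested here and assumes the catmandu command-line tool with Catmandu::MARC is installed and that some_records.mrc is in the current directory; the only change to the original command is the added --type RAW (plus an explicit "to JSON"). The INPUT and FIX variables are just for illustration.

```shell
#!/bin/sh
# Sketch only: assumes the catmandu CLI with Catmandu::MARC is installed and
# that some_records.mrc (the file name from the original message) is in the
# current directory.
INPUT=some_records.mrc
FIX='marc_map("200abfg", title, -join => " ");remove_field(record);'

if command -v catmandu >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    # Same fix as in the original command; only the parser type changes to RAW.
    catmandu convert MARC --type RAW --fix "$FIX" to JSON < "$INPUT"
else
    echo "catmandu or $INPUT not available; nothing converted" >&2
fi
```

Whether the accented characters then come out correctly should be checked against the yaz-marcdump output for the same records.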

Cheers
Patrick

> On 30 Jan 2017, at 22:38, Emmanuel Di Pretoro <edipretoro at gmail.com> wrote:
> 
> Hi,
> 
> I've been working with a bunch of UNIMARC files over the last few days and I've been learning a lot about Catmandu! But I've come across a UTF-8 problem and I can't tell whether it is a bug or a mistake on my side.
> 
> So, here is a way to reproduce the problem:
> 1. I've got 2 UTF-8 UNIMARC records from the BNF via Z39.50 ; you can find the file on GitHub: https://gist.github.com/edipretoro/ecdbd91cbd202022a939477f224aa712
> 2. when I read the file with yaz-marcdump, everything is fine: eg the title: « 200 1  $a Perl moderne $b Texte imprimé $f Sébastien Aperghis-Tramoni, Damien Krotkine, Jérôme Quelin $g avec la contribution de Philippe Bruhat » ;
> 3. when I process the file with Catmandu, eg with this command: « catmandu convert MARC --fix 'marc_map("200abfg", title, -join => " ");remove_field(record);' < some_records.mrc », here is what I get: « [{"_id":"FRBNF423141140000009","title":"Perl moderne Texte imprimé Sébastien Aperghis-Tramoni, Damien Krotkine, Jérôme Quelin avec la contribution de Philippe Bruhat"},{"title":"De l'art de programmer en Perl Texte imprimé Damian Conway traduction de Philippe Bruhat, Jérôme Fenal, Jean Forget","_id":"FRBNF40135550000000X"}] » ; as the value of encoding is set by default to UTF-8, I don't think I'm missing anything here. 
> 
> As a workaround so I could move forward with my project, I converted the ISO2709 file into an XML file with yaz-marcdump using the following command: « yaz-marcdump -o marcxml some_records.mrc > some_records.xml » and retried the previous Catmandu command adapted for XML: « catmandu convert MARC --type XML --fix 'marc_map("200abfg", title, -join => " ");remove_field(record);' < some_records.xml ». And I got a perfect UTF-8 string as a result: « [{"_id":"FRBNF423141140000009","title":"Perl moderne Texte imprimé Sébastien Aperghis-Tramoni, Damien Krotkine, Jérôme Quelin avec la contribution de Philippe Bruhat"},{"title":"De l'art de programmer en Perl Texte imprimé Damian Conway traduction de Philippe Bruhat, Jérôme Fenal, Jean Forget","_id":"FRBNF40135550000000X"}] ». OK, I did receive a warning message: « Use of uninitialized value in concatenation (.) or string at /Users/manu/.plenv/versions/5.24.1/lib/perl5/site_perl/5.24.1/MARC/File/XML.pm line 397, <GEN0> chunk 5. » but it doesn't seem to be Catmandu-related.
> 
> Can you tell me if I've been missing something? 
> 
> Thanks in advance and have a nice day!
> 
> Emmanuel Di Pretoro
> _______________________________________________
> librecat-dev mailing list
> - send list mails to librecat-dev at lists.uni-bielefeld.de
> - to unsubscribe or change options, visit https://lists.uni-bielefeld.de/mailman2/cgi/unibi/listinfo/librecat-dev
> - project website: http://librecat.org/



