[librecat-dev] Catmandu: field not converted to TSV/CSV output if not present in first (how many?) input records

Patrick Hochstenbach Patrick.Hochstenbach at UGent.be
Wed Nov 8 08:41:43 CET 2023


Dear Martina

Hints have been added in the latest version of Catmandu (v1.2021), which is now available on CPAN.

BR
Patrick
________________________________
From: Siebert, Dr. Martina <Martina.Siebert at sbb.spk-berlin.de>
Sent: 03 November 2023 17:09
To: Patrick Hochstenbach <Patrick.Hochstenbach at UGent.be>; librecat-dev at lists.uni-bielefeld.de <librecat-dev at lists.uni-bielefeld.de>
Subject: Re: [librecat-dev] Catmandu: field not converted to TSV/CSV output if not present in first (how many?) input records


Hi Patrick,



That makes 100% sense. Thanks for the heads-up and the ways around the “feature” ;-)

Maybe a note in the “Description” saying that the first record serves as the model for the exported fields would help the uninitiated like myself: https://metacpan.org/pod/Catmandu::Exporter::TSV



Best,

Martina



From: Patrick Hochstenbach <Patrick.Hochstenbach at UGent.be>
Sent: Friday, 3 November 2023 16:58
To: Siebert, Dr. Martina <Martina.Siebert at sbb.spk-berlin.de>; librecat-dev at lists.uni-bielefeld.de
Subject: Re: [librecat-dev] Catmandu: field not converted to TSV/CSV output if not present in first (how many?) input records



Hello Martina,



This is a known feature. Catmandu was created to work with very large files in streaming mode. This means that Catmandu doesn’t read the complete input first to see which fields are available; it has to work with the data at hand.

The JSON format allows every record to have different fields and doesn’t care if ‘late-comer’ records have a different field layout. Formats such as TSV and CSV don’t have this feature: from the first record on, the software needs to know which fields to output. Catmandu uses two procedures:

- If no information is provided, the field layout is guessed from the first record it receives.

- One can tell Catmandu which fields one wants to see in the output.



For the latter Catmandu has the `--fields` option:



$ catmandu convert JSON to CSV --fix my.fix --fields 'id,name,title,author' < data.json
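
The difference between the two procedures can be illustrated with a small Python sketch. This is not Catmandu's code (Catmandu is written in Perl); the function name `export_csv` and the sample records are made up for illustration, and the only point is that a streaming writer must fix its header before the first row goes out.

```python
import csv
import io


def export_csv(records, fields=None):
    """Stream records to CSV text (hypothetical helper, not Catmandu's API).

    The header is fixed when the first record arrives: either from the
    explicit `fields` list or, if none is given, from the first record's keys.
    """
    out = io.StringIO()
    writer = None
    for record in records:
        if writer is None:
            header = list(fields) if fields is not None else list(record.keys())
            writer = csv.DictWriter(
                out,
                fieldnames=header,
                extrasaction="ignore",  # silently drop fields not in the header
                restval="",             # empty cell for fields a record lacks
                lineterminator="\n",
            )
            writer.writeheader()
        writer.writerow(record)
    return out.getvalue()


# Sample data: "title" appears only in the second record.
records = [
    {"id": 1, "name": "first"},
    {"id": 2, "name": "second", "title": "late-comer field"},
]

guessed = export_csv(iter(records))
explicit = export_csv(iter(records), fields=["id", "name", "title"])

print(guessed)   # header guessed from the first record: no "title" column
print(explicit)  # header given up front: "title" column is present
```

With the guessed header the `title` value of the second record is dropped, which matches the behaviour reported in this thread; with an explicit field list it is kept.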

As I am writing this, I also see a `--collect_fields 1` option that does what you want, but it first loads all the data into memory.
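
The trade-off behind collecting the fields first can be sketched in Python as well. Again, this is an illustration under assumptions, not Catmandu's implementation: the function name `export_csv_collecting_fields` is hypothetical, and the column order used here (order of first appearance) is a choice of this sketch, not necessarily what Catmandu does.

```python
import csv
import io


def export_csv_collecting_fields(records):
    """Buffer all records, then write CSV with the union of their fields.

    Hypothetical helper for illustration only. The whole input is held in
    memory before any output is written, which is the cost of this approach.
    """
    buffered = list(records)  # the entire input now lives in memory

    # Collect every field name, in order of first appearance (an assumption).
    header = []
    for record in buffered:
        for key in record:
            if key not in header:
                header.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=header, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(buffered)
    return out.getvalue()


# Sample data: "title" appears only in the second record.
result = export_csv_collecting_fields(
    iter([{"id": 1}, {"id": 2, "title": "x"}])
)
print(result)
```

Because nothing can be written until the last record has been seen, this approach gives up streaming, which is why it is less suited to very large files.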

BR

Patrick

PS: Due to Outlook issues some hyphens and quotes may be lost in the email



From: librecat-dev-bounces at lists.uni-bielefeld.de on behalf of Siebert, Dr. Martina <Martina.Siebert at sbb.spk-berlin.de>
Date: Friday, 3 November 2023 at 15:29
To: librecat-dev at lists.uni-bielefeld.de
Subject: [librecat-dev] Catmandu: field not converted to TSV/CSV output if not present in first (how many?) input records

Hello,



It seems not possible to produce a TSV/CSV output that includes fields that appear for the first time only much later in an input file. In the JSON export all is fine, but when exporting to TSV/CSV the “late-comer” fields are missing. When I fake-add the field to the first record the TSV/CSV export is correct.

Is this a known bug? Can it be fixed?



Best,

Martina

_____________________________________________

Dr. Martina Siebert

Ostasienabteilung | CrossAsia

Staatsbibliothek zu Berlin – Preußischer Kulturbesitz



martina.siebert at sbb.spk-berlin.de

www.staatsbibliothek-berlin.de



Personal data may be processed in the course of e-mail communication.
Our data protection information can be found here: http://sbb.berlin/datenschutz

