[librecat-dev] Catmandu: field not converted to TSV/CSV output if not present in first (how many?) input records

Siebert, Dr. Martina Martina.Siebert at sbb.spk-berlin.de
Fri Nov 3 17:09:04 CET 2023


Hi Patrick,

makes 100% sense. Thanks for the heads-up and the ways around the "feature" ;-)
Maybe a note in the "Description" that the first record serves as the model for the exported fields would help the uninitiated like myself: https://metacpan.org/pod/Catmandu::Exporter::TSV

Best,
Martina

From: Patrick Hochstenbach <Patrick.Hochstenbach at UGent.be>
Sent: Friday, 3 November 2023 16:58
To: Siebert, Dr. Martina <Martina.Siebert at sbb.spk-berlin.de>; librecat-dev at lists.uni-bielefeld.de
Subject: Re: [librecat-dev] Catmandu: field not converted to TSV/CSV output if not present in first (how many?) input records

Hello Martina,

This is a known feature. Catmandu was created to work with very large files in streaming mode. This means that Catmandu doesn't read the complete input data first to see what kind of fields are available. The software needs to work with the data at hand.

The JSON format allows every record to have different fields, and doesn't care if 'late-comer' records have a different field layout. Formats such as TSV and CSV don't have this feature: from the first record onward you need to tell the software which fields need to be made available. Catmandu uses two procedures:

- If no information is provided, the field layout is guessed from the first record it receives.
- One can tell Catmandu which fields one wants to see in the output.

For the latter Catmandu has the `--fields` option:

$ catmandu convert JSON to CSV --fix my.fix --fields 'id,name,title,author' < data.json
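
To illustrate the two procedures, here is a minimal Python sketch of a streaming CSV writer. It is not Catmandu's code: the function `stream_to_csv` and the sample records are made up for this example, and it only mimics the behaviour described above (header guessed from the first record unless a field list is given).

```python
import csv
import io


def stream_to_csv(records, fields=None):
    """Write records to CSV one at a time, never looking ahead (hypothetical helper)."""
    out = io.StringIO()
    writer = None
    for record in records:
        if writer is None:
            # The header has to be written before the first row, so use the
            # explicit field list if given, otherwise guess from the first record.
            header = fields if fields is not None else list(record.keys())
            writer = csv.DictWriter(
                out,
                fieldnames=header,
                extrasaction="ignore",  # keys outside the header are silently dropped
                restval="",             # missing keys become empty cells
                lineterminator="\n",
            )
            writer.writeheader()
        writer.writerow(record)
    return out.getvalue()


# Made-up sample data: "title" appears only in the second record.
records = [
    {"id": 1, "name": "first"},
    {"id": 2, "name": "second", "title": "late-comer"},
]

guessed = stream_to_csv(iter(records))
explicit = stream_to_csv(iter(records), fields=["id", "name", "title"])

print(guessed)   # header is id,name only; the "title" column is missing
print(explicit)  # header is id,name,title; "title" is kept
```

In the first call the late-comer field is lost because the header was already fixed; in the second call the explicit field list plays the role of `--fields`.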

As I am writing this, I also see a `--collect_fields 1` option that does what you want, but it first loads all the data into memory.
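
The trade-off can be sketched in Python as follows. Again this is only an illustration under my own assumptions, not how Catmandu implements `--collect_fields`: the function `collect_then_write` is hypothetical, and the column order (first appearance) is a choice made for this sketch.

```python
import csv
import io


def collect_then_write(records):
    """Buffer every record, build the union of all fields, then write CSV (hypothetical helper)."""
    buffered = list(records)  # the whole input is held in memory here

    # Collect field names in order of first appearance (an assumption of this sketch).
    header = []
    for record in buffered:
        for key in record:
            if key not in header:
                header.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=header, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(buffered)
    return out.getvalue()


# Made-up sample data: "title" appears only in the second record.
sample = [
    {"id": 1, "name": "first"},
    {"id": 2, "name": "second", "title": "late-comer"},
]

collected = collect_then_write(iter(sample))
print(collected)  # header includes "title" although the first record lacks it
```

Nothing can be written until the last record has been read, which is why this approach does not suit very large inputs.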

BR
Patrick

PS: Due to Outlook issues, some hyphens and quotes may be lost in the email.

From: librecat-dev-bounces at lists.uni-bielefeld.de on behalf of Siebert, Dr. Martina <Martina.Siebert at sbb.spk-berlin.de>
Date: Friday, 3 November 2023 at 15:29
To: librecat-dev at lists.uni-bielefeld.de
Subject: [librecat-dev] Catmandu: field not converted to TSV/CSV output if not present in first (how many?) input records

Hello,

It does not seem possible to produce TSV/CSV output that includes fields that first appear only much later in an input file. In the JSON export all is fine, but when exporting to TSV/CSV the "late-comer" fields are missing. When I fake-add the field to the first record, the TSV/CSV export is correct.
Is this a known bug? Can it be fixed?

Best,
Martina
_____________________________________________
Dr. Martina Siebert
Ostasienabteilung | CrossAsia
Staatsbibliothek zu Berlin - Preußischer Kulturbesitz

martina.siebert at sbb.spk-berlin.de
www.staatsbibliothek-berlin.de

Personal data may be processed in the course of e-mail communication.
Our privacy notice can be found here: http://sbb.berlin/datenschutz
