[librecat-dev] Catmandu::XML and Catmandu::XSD
Patrick Hochstenbach
Patrick.Hochstenbach at UGent.be
Thu Oct 13 09:43:43 CEST 2016
Hi
There is a new Catmandu module available to process XML files: Catmandu::XSD. The existing Catmandu::XML by Jakob can be used for XML data where no schema is needed. For example, this input:
test.xml:
<foo>
  <bar>test</bar>
</foo>
will be parsed into YAML like this:
$ catmandu convert XML to YAML < test.xml
---
bar: test
Any syntactically correct XML can be processed with Catmandu::XML. However, the module itself can't guess the structure of the XML files. When you have another XML file like:
test2.xml:
<foo>
  <bar>test</bar>
  <bar>test</bar>
</foo>
it will be parsed into YAML like:
$ catmandu convert XML to YAML < test2.xml
---
bar:
- test
- test
In the first example bar contains a string; in the second example bar contains an array. You need to keep this in mind when writing Fixes for this data. The same is true for XML input that has “mixed” content (text and XML elements mixed).
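The ambiguity is easy to reproduce outside Catmandu. The following is a minimal Python sketch, not the Catmandu::XML implementation: a hypothetical `naive_parse` helper that, lacking a schema, stores a single child as a string and repeated children as a list. The equally hypothetical `as_list` helper shows the kind of defensive normalisation downstream code has to do in that situation.

```python
import xml.etree.ElementTree as ET


def naive_parse(xml_text):
    """Schema-less conversion: the shape of each value depends on the data."""
    record = {}
    for child in ET.fromstring(xml_text):
        if child.tag not in record:
            # First occurrence: store a plain string.
            record[child.tag] = child.text
        elif isinstance(record[child.tag], list):
            # Third and later occurrences: extend the existing list.
            record[child.tag].append(child.text)
        else:
            # Second occurrence: promote the string to a list.
            record[child.tag] = [record[child.tag], child.text]
    return record


def as_list(value):
    """Normalise a value that may be a scalar or a list into a list."""
    return value if isinstance(value, list) else [value]


one = naive_parse("<foo><bar>test</bar></foo>")
two = naive_parse("<foo><bar>test</bar><bar>test</bar></foo>")
print(one)  # {'bar': 'test'}
print(two)  # {'bar': ['test', 'test']}
print(as_list(one["bar"]), as_list(two["bar"]))
```

Without a call like `as_list`, code that loops over `bar` works for the second record but iterates over the characters of the string in the first.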
With the new Catmandu::XSD module, an XSD schema file must be provided that defines exactly how the XML elements should be parsed. When an XSD is available, you get arrays where you need arrays, hashes where you need hashes, etc.:
$ catmandu convert XSD --root '{}foo' --schemas foo.xsd to YAML < test.xml
---
bar:
- test
$ catmandu convert XSD --root '{}foo' --schemas foo.xsd to YAML < test2.xml
---
bar:
- test
- test
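Why a schema removes the ambiguity can be sketched in a few lines of Python. This is an illustration only, not how Catmandu::XSD or XML::Compile work internally: the hypothetical `repeatable` set stands in for what an XSD would declare (for instance, an element allowed to occur more than once), and `parse_with_schema` is an invented helper name.

```python
import xml.etree.ElementTree as ET


def parse_with_schema(xml_text, repeatable):
    """Convert child elements to a dict; tags in `repeatable` always become lists."""
    record = {}
    for child in ET.fromstring(xml_text):
        if child.tag in repeatable:
            # Declared as repeatable: always a list, even for one occurrence.
            record.setdefault(child.tag, []).append(child.text)
        else:
            # Declared as single-valued: always a plain string.
            record[child.tag] = child.text
    return record


# Assumption: the schema says <bar> may repeat inside <foo>.
schema = {"bar"}
print(parse_with_schema("<foo><bar>test</bar></foo>", schema))
# {'bar': ['test']}
print(parse_with_schema("<foo><bar>test</bar><bar>test</bar></foo>", schema))
# {'bar': ['test', 'test']}
```

The shape of the result now depends on the declaration rather than on how many elements a particular file happens to contain.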
Catmandu::XSD uses XML::Compile internally, which is already used in a Belgian project that processes LIDO museum data. EAD, METS, MODS, PNX, etc. can be processed with the same techniques. For example:
$ cat catmandu.yml
---
importer:
  mets:
    package: XSD
    options:
      root: "{http://www.loc.gov/METS/}mets"
      schemas: t/demo/mets/*.xsd
exporter:
  mets:
    package: XSD
    options:
      root: "{http://www.loc.gov/METS/}mets"
      schemas: t/demo/mets/*.xsd
# process one file...
$ catmandu convert mets < mets.file
# process many files
$ catmandu convert mets --files "dir/*.xml"
For more options, see: https://metacpan.org/pod/Catmandu::XSD
Patrick