[librecat-dev] Catmandu and Hadoop/Spark?

Patrick Hochstenbach Patrick.Hochstenbach at UGent.be
Wed Feb 17 16:22:32 CET 2016


I’ve looked into Apache Spark a little and didn’t find anything stopping Catmandu from using these tools. At Ghent University we didn’t investigate this further because it is so much easier to ask our computing center for a bigger server with more cores than to build and maintain a Hadoop cluster. With GNU parallel I can get very far, processing millions of records each night. But indeed, don’t ask me to process tens or hundreds of millions of metadata records (in a night).

The way forward, I think, is always to work in an environment with direct use cases for Hadoop and Spark. I’m a bit nervous about writing this on my laptop and then presenting it to the world as: “hey! we can do Hadoop now too”. That would be working blind and lying a bit. I am certainly very willing to advise projects that want to take this on and have the environment to really deploy it. Or, I need to do data crunching anyhow, and we need to find a way to cooperate in a project to make this much, much bigger.

And “Perl isn’t for XYZ” is indeed more of an opinion, but I understand why computer shops stick to a limited number of languages. I also understand that it would be nice to target the JVM, and there are some experiments here in Ghent to figure that out, but they are at a very, very early stage.

> On 17 Feb 2016, at 10:28, Jakob Voß <jakob.voss at gbv.de> wrote:
> 
> Hi,
> 
> I just got asked whether Catmandu (or Perl in general) can be used with Hadoop or Spark. Has anyone of you tried this before? This is what I found for Spark:
> 
> https://wiki.ufal.ms.mff.cuni.cz/spark:recipes:using-perl-via-pipes
> 
> Although we successfully process large data sets with Catmandu, I guess it has its limitations with "big data" (whatever that means). Maybe it's worth using Catmandu on top of existing big data frameworks such as Hadoop and Spark instead of extending Catmandu with big data features such as massively parallel processing?
> 
> Just a thought,
> Jakob
> 
> -- 
> Jakob Voß <jakob.voss at gbv.de>
> Verbundzentrale des GBV (VZG) / Common Library Network
> Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
> +49 (0)551 39-10242, http://www.gbv.de/

Patrick Hochstenbach - digital architect
University Library Ghent
Rozier 9 - 9000 Ghent - Belgium
patrick.hochstenbach at ugent.be
+32 (0)9 264 7980



