This blog post is a reply to willp-bl's post "Mixing Hadoop and Taverna" and builds on some of the ideas I presented in my earlier post "Big data processing: chaining Hadoop jobs using Taverna".
First of all, it is very interesting to see willp-bl's variants of implementing a large-scale file format migration workflow using Taverna and Hadoop, and to see the implications that different integration patterns can have on the performance and throughput of a workflow run.
However, while the final conclusion that "it’s clear that adding more processing increases execution time" is logically true, I will argue that reading it as saying that using Taverna together with Hadoop necessarily entails a significant performance penalty would be misleading. In the following, I will explain why whether one should actually care about this overhead depends heavily on the system architecture. And I will tell you why I don't.
Hadoop is intended to be used on a cluster of machines, where the MapReduce programming model together with the Hadoop Distributed File System (HDFS) provides a powerful backend for processing large amounts of data.
Even with the help of such a backend, typical preservation tasks, such as file format migration of millions of image files or MIME type identification of billions of objects in a web archive, are long-running processes that can take hours or even days, depending on the type of processing, the size of the input data set, the hardware specification of the cluster, and so on.
In my opinion, Taverna’s strength is not batch processing performance; its list processing will always lag behind direct batch processing in this regard. I therefore see Taverna’s role rather in the orchestration layer, which – just to stay with willp-bl’s words here – “should not be mixed” with the large-scale processing layer.
When using Taverna to start long-running Hadoop jobs, the only overhead is the startup time of the launcher component, such as Taverna’s Tool service. This additional cost can be minimized by using the server version of Taverna (Taverna Server) deployed to a servlet container instead of running Taverna in headless mode.
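To make this concrete, here is a minimal sketch of the kind of Hadoop job driver that such a launcher could start, for example via a single “hadoop jar” invocation configured in a Tool service. It is written against the new MapReduce API (org.apache.hadoop.mapreduce); the class names, job name, and argument layout are my own illustrative assumptions, not code from either blog post.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: args[0] = HDFS input path, args[1] = HDFS output path.
public class MimeDetectJobDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "mime type identification");
        job.setJarByClass(MimeDetectJobDriver.class);
        // Map-only job; the mapper is sketched after the throughput table below.
        job.setMapperClass(MimeDetectMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Block until the Hadoop job has finished, so that from Taverna's point of
        // view this is one long-running tool invocation with a single exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```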
Should I actually care about 30 seconds of additional startup time for initiating a Hadoop job that runs for 24 hours? A Taverna workflow managing a sequence of four Hadoop jobs may add a few minutes of overhead. However, using the wrong integration approach for preservation tools at the record-processing level can have much more serious implications when processing, say, 25 million records. I really do care about the latter.
That impact on processing time and throughput lies in the way the preservation tools are invoked in the iterative map phase: any per-record increase is multiplied by the number of records being processed. Compared to this, Taverna’s startup time is absolutely negligible.
Let us quickly look at another example to make this clearer. Comparing alternatives for MIME type detection on 1 Terabyte of archived web content using Apache Tika and the Unix tool “file”, we observed the following differences in throughput on our experimental cluster of 5 nodes:
| Integration pattern | Throughput |
| --- | --- |
| Tika detector API called from map/reduce | 6.17 GB/min |
| file called as a command-line tool from map/reduce | 1.70 GB/min |
| Tika JAR called as a command-line tool from map/reduce | 0.01 GB/min |
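To illustrate the difference behind the first row of this table, here is a minimal sketch of a map-only step that calls the Tika detector API in-process, so the per-record cost is a method call rather than a process or JVM startup. The class name and the assumption that each input line holds an HDFS path to one object are illustrative; a real web-archive job would typically read (W)ARC records directly.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.tika.Tika;

// Map-only MIME type identification step: one Tika instance per mapper JVM,
// reused for every record, so no external process is forked per record.
// Assumption for this sketch: each input line holds an HDFS path to one object.
public class MimeDetectMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Tika tika = new Tika();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Path path = new Path(value.toString());
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        try (InputStream in = fs.open(path)) {
            // In-process call to the Tika detector API: the per-record cost is a
            // method call, not a process or JVM startup.
            String mimeType = tika.detect(in, path.getName());
            context.write(value, new Text(mimeType));
        }
    }
}
```

Replacing the in-process detect call with an external invocation of the Tika JAR for every single record is exactly the pattern behind the last row of the table.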
In this sense I see willp-bl’s post as a study of alternative integration patterns for combining different preservation tools, including the Taverna workflow engine itself. Regarding system integration, however, especially when it comes to large-scale processing, I prefer using Taverna Server in an orchestration layer separated from the backend, where it takes care of scheduling a sequence of long-running jobs.
Admittedly, scheduling the job sequence alone could also be done with a batch script; no workflow execution engine is needed for that. It should be noted, however, that Taverna’s functionality is being extended in the SCAPE project with the ability to add semantic annotations to the inputs, outputs, and components of a workflow. In the long term this will help developers – and, I hope, not only less technical users – to find and use the right preservation components when designing digital preservation workflows.