Droid file format identification using Hadoop

The DROID software tool is developed by The National Archives (UK) to perform automated batch identification of file formats by assigning PRONOM Unique Identifiers (PUIDs) and MIME types to files. The tool uses so-called signature files, which are derived from the PRONOM technical registry.

In this post I present some considerations for using the tool on the Hadoop platform, together with a performance evaluation of the job execution on a Hadoop cluster using the publicly available Govdocs1 corpus data set.

Preliminary considerations

Processing many small files using Hadoop

Hadoop’s strength is the processing of very large files. If file format identification is applied to a set of files originating from an office or web context, we are usually dealing with many small files, such as HTML files or PNG and JPEG images. There may also be large multimedia files, but large does not necessarily mean well suited for Hadoop processing. The input files of a Hadoop job must also be splittable into independent parts so that they can be processed in parallel during the Map phase of a MapReduce job. For file format identification, however, a file must be available as a complete, undivided unit, which means it is not splittable in this sense.

Therefore, we must determine how small files are going to be processed on the Hadoop platform in such an application scenario. If the small files are made available in the Hadoop Distributed File System (HDFS) and defined as input for a Map function performing the file identification, Hadoop would create one task per file, which – given the additional time required for initiating a task – would result in poor runtime performance.

One approach to overcome this obstacle is to put references to all files that are going to be processed in a text file and then use this text file as input for the Hadoop job. This requires that all worker nodes of the cluster can access the referenced file paths, e.g. by adding mount points to the cluster nodes so that a file path references the same file on each cluster node. This way, the Hadoop framework does not generate one task per file; instead, the size of a task depends only on the split size of the input text file, i.e. each task processes all file paths contained in one 64 Megabyte section (the default split size) of the text file.
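To make this relationship concrete, here is a small sketch (Python for brevity, not Hadoop code; the average path length of 60 bytes is an assumed value) showing that the number of map tasks follows from the size of the file list and the configured split size, not from the number of files:

```python
# Illustration only (not Hadoop code): the number of map tasks for a
# file-list input depends on the size of the list and the split size,
# not on the number of files to be identified.

def estimate_map_tasks(num_paths, avg_path_bytes, split_size=64 * 1024 * 1024):
    """One map task per split of the input text file (ceiling division)."""
    total_bytes = num_paths * avg_path_bytes
    return max(1, -(-total_bytes // split_size))

# With the 64 MB default split, even ~1 million short paths fit in one split:
print(estimate_map_tasks(986277, 60))            # 1
# Shrinking the split so that ~800 paths fall into each split yields roughly
# the 1233 tasks reported later in table 2 (60 bytes per path is assumed):
print(estimate_map_tasks(986277, 60, 800 * 60))  # 1233
```

This is why the split size becomes the main tuning knob for controlling task granularity with this input type.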

A second option is to put all the small files into one large SequenceFile, which is an input format that can be used to aggregate content. The sequence file contains key-value pairs; in this scenario, the file path could be used as key and a byte sequence of the binary content as value (BytesWritable data type). It must be taken into account that it is not possible to append additional content to an existing SequenceFile or to modify individual key-value pairs afterwards. Creating sequence files is therefore recommended in scenarios where data is not subject to frequent changes. If it is used as input for a Hadoop job, data is available as a byte array for immediate processing in the Map phase. However, the current implementation of DROID is based on the assumption that an object is available as a file, and file-related properties are used to access the file on the local file system. Therefore, either a temporary file is created – this will be shown later on – or the DROID code is adapted so that the identification can be performed on byte arrays or input streams, as the British Library did with the nanite software project.
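To illustrate the aggregation idea, the following Python sketch packs small files into one append-once container of length-prefixed key-value records. This is explicitly not the real SequenceFile binary format, just a simplified stand-in for the concept of path-keyed byte values:

```python
import struct

# Simplified stand-in for the SequenceFile idea (path -> file bytes).
# NOT the actual Hadoop SequenceFile format, only an illustration of
# packing many small files into one large, write-once container.

def pack_records(records):
    """Serialize (path, content) pairs as length-prefixed records."""
    out = bytearray()
    for path, content in records:
        key = path.encode("utf-8")
        out += struct.pack(">I", len(key)) + key
        out += struct.pack(">I", len(content)) + content
    return bytes(out)

def unpack_records(blob):
    """Iterate over the packed key-value records, as a map task would."""
    pos = 0
    while pos < len(blob):
        (klen,) = struct.unpack_from(">I", blob, pos); pos += 4
        key = blob[pos:pos + klen].decode("utf-8"); pos += klen
        (vlen,) = struct.unpack_from(">I", blob, pos); pos += 4
        value = blob[pos:pos + vlen]; pos += vlen
        yield key, value

packed = pack_records([("docs/a.pdf", b"%PDF-1.4 ..."), ("docs/b.png", b"\x89PNG...")])
print([k for k, _ in unpack_records(packed)])  # ['docs/a.pdf', 'docs/b.png']
```

The write-once nature of the container is visible here as well: records can only be appended during creation, not modified in place afterwards.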

A performance evaluation of these two alternative approaches will be presented and discussed later on.

Finally, a third option would be to store the files as data blobs – under the condition that they do not exceed a certain size limit, e.g. 50 Megabytes – in HBase, or to consider similar data stores that allow random access and many read operations in a short time period.

Droid's signature file initialisation

As already mentioned, DROID makes use of a signature file for performing the file format identification. Before the actual identification can be done, the XML-based signature file must therefore be initialised. However, this process takes a significant amount of time compared to the actual identification process.

To give a rough idea, and as illustrated in figure 1, initialising version 67 of the DROID signature file, which is about 70 Kilobytes in size, took 6047 milliseconds, while the identification of a single PDF file took 278 milliseconds on a development machine (dual core 2.8 GHz, 2 Gigabytes RAM).

 

Figure 1: Time needed for initialising the signature file compared to the execution time of identifying one single PDF file.

Using the Hadoop framework means that each task runs in its own Java Virtual Machine to isolate it from other running tasks. In consequence, it is not possible to share a Java object that has been initialised with the signature file between different tasks – not even between tasks that run on different cores of the same machine.

Therefore the time needed for parsing the signature file must be taken into account when configuring the number of files to be identified in one task. Depending on the input type, this can be done, for example, by parameterising the split size. This allows control over the number of tasks that are created for a Hadoop job.

In relation to the example shown in figure 1, this means that when identifying 20 files of the same kind, still about half of the time would be spent preparing the Droid identification. In consequence, it makes sense to process at least 200 file identifications per task, which yields the proportion shown in figure 2.
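The effect can be checked with a quick back-of-the-envelope calculation based on the figure 1 timings (a Python sketch; `init_share` is a made-up helper name):

```python
# Amortisation of the signature file initialisation, using the timings
# from figure 1: 6047 ms for initialisation, 278 ms per identification.

INIT_MS = 6047
IDENT_MS = 278

def init_share(files_per_task):
    """Fraction of a task's runtime spent initialising the signature file."""
    total = INIT_MS + files_per_task * IDENT_MS
    return INIT_MS / total

print(round(init_share(20), 2))   # 0.52 - initialisation still takes about half
print(round(init_share(200), 2))  # 0.1  - initialisation is largely amortised
```

This is the arithmetic behind the recommendation of at least 200 identifications per task.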

 
Figure 2: If running the identification of 200 PDF files in one task, initialising the signature file does not carry so much weight in the overall execution time.

Cluster setup

As an example, the file format identification was performed on the Govdocs1 corpus, which was downloaded in March 2013 and contained 986277 file instances. For the sake of simplicity, only Droid's binary signature identifier was used, and no container signature identifier.

Cluster Hardware

The cluster used for processing the data set has one controller machine (Master) and 5 worker machines (Slaves). The master node has two quadcore CPUs (8 physical/16 HyperThreading cores) with a clock rate of 2.40GHz and 24 Gigabytes RAM. The slave nodes have one quadcore CPU (4 physical/8 HyperThreading cores) with a clock rate of 2.53GHz and 16 Gigabytes RAM. The machines are connected by gigabit Ethernet, and the cluster network has a shared file server that stores global persistent data, accessed by the slave nodes as needed.

Hadoop configuration

Cloudera’s Distribution (CDH) version 3 update 4 (CDH3u4) including Apache Hadoop was used as a basis to build the cluster.

Regarding the Hadoop configuration, in the current set-up five processor cores of each machine have been assigned to Map tasks, two cores to Reduce tasks, and one core is reserved for the operating system. This gives a total of 25 processing cores for Map tasks and 10 cores for Reduce tasks, which means that, in principle, depending on the type of data processing, 25 Map tasks and 10 Reduce tasks can run in parallel.

The replication factor of the Hadoop Distributed File System (HDFS) is set to 2 (default configuration: 3) which means that 2 copies of each data block are stored in the cluster.

Single-threaded Java application

As a point of reference for the Hadoop job execution, the identification of the Govdocs1 corpus was executed using a single-threaded Java application on one of the cluster worker nodes:

Files processed CPU time – user+sys (hh:mm:ss) Application runtime (hh:mm:ss)
986277 01:10:05 03:55:16

Table 1: Runtime of the Govdocs1 corpus identification using a single-threaded Java application

MapReduce job with input data from an external file server

A MapReduce job for the Droid identification was implemented to perform the identification in the Map phase and to create a simple distribution of files per PRONOM Unique Identifier (PUID) in the Reduce phase. As table 2 shows, with 800 records configured per task, Hadoop created 1233 Map tasks, which were processed in 58 minutes and 6 seconds.

Map tasks Records per task Hadoop Job runtime (hh:mm:ss)
1233 800 00:58:06

Table 2: Hadoop job runtimes for the Droid identification of the Govdocs 1 corpus
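Leaving the Hadoop framework aside, the logic of the two phases can be sketched in a few lines of Python (a single-process illustration; the file names and PUID assignments below are only example data, not actual identification results):

```python
from collections import Counter

# Single-process sketch of the job's two phases: the Map phase emits a
# (PUID, 1) pair per identified file, the Reduce phase counts files per PUID.

def map_phase(identifications):
    """Emit (puid, 1) pairs, one per identified file."""
    for path, puid in identifications:
        yield puid, 1

def reduce_phase(pairs):
    """Aggregate the counts per PUID, yielding the PUID distribution."""
    counts = Counter()
    for puid, one in pairs:
        counts[puid] += one
    return dict(counts)

sample = [("a.pdf", "fmt/18"), ("b.pdf", "fmt/18"), ("c.png", "fmt/13")]
print(reduce_phase(map_phase(sample)))  # {'fmt/18': 2, 'fmt/13': 1}
```

In the real job, the expensive part is of course the identification call inside the Map phase, not the counting.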

One observation is that the runtime of about one hour, compared to about 4 hours in a single-threaded application, does not fully exploit the potential of a cluster that is able to process 25 tasks in parallel. This is due to the fact that the identification process is highly I/O-bound in the sense that there are many read operations followed by comparatively short computing times.

MapReduce job using input data from SequenceFiles in HDFS

If large SequenceFiles comprising the complete file set are intended to be used as input for the Hadoop job, it is necessary to aggregate the files in a first step. As shown in table 3, this step can take a considerable amount of time. It is therefore recommended to use this approach only if the data is going to be processed many times afterwards.

Map tasks Reduce tasks Hadoop Job runtime (hh:mm:ss)
986 10 03:12:44

Table 3: Runtime of the SequenceFile creation aggregating 986277 files

The runtime of the Droid identification Hadoop job is basically divided into the time required for:

  • setting up the Hadoop job and iterating over all records of the sequence files.
  • creating temporary files from the record’s byte array during map task execution.
  • performing the Droid identification on the temporary files.
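The second step, turning a record's byte array back into something a file-based tool can open, can be sketched as follows (Python for brevity; `bytes_to_temp_file` is a hypothetical helper, not actual DROID code):

```python
import os
import tempfile

# Sketch of step 2: write a record's byte array to a temporary file so that
# a file-based tool like DROID can be pointed at a local path.

def bytes_to_temp_file(record_key, content):
    """Write the record's value to a temporary file and return its path."""
    suffix = os.path.splitext(record_key)[1]  # keep the extension as a hint
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as f:
        f.write(content)
    return path

path = bytes_to_temp_file("docs/report.pdf", b"%PDF-1.4 ...")
try:
    with open(path, "rb") as f:
        print(f.read(4))  # b'%PDF'
finally:
    os.remove(path)  # temp files must be cleaned up after each identification
```

As table 4 shows, this extra write-then-read round trip is a substantial part of the job's runtime, which is why stream-based identification is attractive.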

In order to make evident which portion of the overall processing time belongs to each of these processing phases, the Hadoop job was executed three times: first only iterating over the sequence file records, then additionally creating temporary files, and finally also performing the Droid identification. The runtimes for these jobs are shown in table 4 and illustrated in figure 3.

Map tasks Hadoop Job Type Hadoop Job runtime (hh:mm:ss)
591 Iterating over the sequence file records 00:14:50
591 Creating temporary files for all sequence file records 00:21:28
591 Performing Droid identification 00:23:53

Table 4: Hadoop job runtimes for the Droid identification of the Govdocs 1 corpus

Figure 3: Comparing the runtimes of 1) a Hadoop job that only iterates over all records, 2) a Hadoop job that iterates over all records and creates temporary files, and 3) a Hadoop job that performs the Droid identification on the temporary files.

Figure 4 is another way to look at these results, this time showing the portion of the overall execution time that each processing phase takes.

Figure 4: Portion of the overall execution time that each processing phase takes.

From these results it becomes evident that most of the overall execution time of a Droid identification Hadoop job is spent on the framework iterating over the records of the SequenceFiles and on creating temporary files for the Droid identification.

Comparison with Apache Tika

MapReduce job with input data from an external file server

A comparison with a MapReduce job implementation using Apache Tika version 1.0 is shown in table 5. Considering that Droid and Apache Tika differ in terms of functionality, a direct performance comparison must be interpreted with caution.

Map tasks Records per task Hadoop Job runtime (hh:mm:ss)
1233 800 00:16:56

Table 5: Hadoop job runtimes for the Apache Tika identification of the Govdocs 1 corpus reading files from an external file server.

Leaving the differences in functionality aside, the results show that Apache Tika was roughly 3.5 times faster than the DROID identification (see table 2) in the same environment using the same data set.

MapReduce job with input data from SequenceFiles in HDFS

The following table shows the Apache Tika identification with input data from the SequenceFiles. The Apache Tika API allows the identification to be performed directly on an input stream.

Map tasks Hadoop Job Type Hadoop Job runtime (hh:mm:ss)
591 Performing Apache Tika identification 00:15:10

Table 6: Hadoop job runtimes for the Apache Tika identification of the Govdocs 1 corpus reading files from SequenceFiles in HDFS.

Apache Tika tries to detect byte patterns at the beginning of a file or stream, so it is not necessary to read the complete file content. This explains why the difference between reading the files from an external file server and from the SequenceFiles in HDFS is not significant: the principal factor is the time required for iterating over the records.
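The principle can be illustrated with a minimal magic-byte sniffer (a Python sketch with a hypothetical `sniff` function; real Tika and DROID signature matching is far more sophisticated than a prefix lookup):

```python
import io

# Minimal illustration of magic-byte detection: only the first few bytes
# of a stream are read, never the complete file content.

MAGIC = {
    b"%PDF": "application/pdf",
    b"\x89PNG": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}

def sniff(stream, max_header=4):
    """Read a small header from the stream and match known signatures."""
    header = stream.read(max_header)
    for magic, mime in MAGIC.items():
        if header.startswith(magic):
            return mime
    return "application/octet-stream"  # fallback for unknown formats

print(sniff(io.BytesIO(b"%PDF-1.4 rest of a large file ...")))  # application/pdf
print(sniff(io.BytesIO(b"\x89PNG\r\n\x1a\n...")))               # image/png
```

Because only a short header is consumed, the cost per record is dominated by locating and opening the record, not by reading its content.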

Conclusion

In this blog entry I have discussed different approaches to preparing small files in order to perform a file format identification using Droid on a Hadoop platform.

For the first option, where the data input comes from an external file server, the main drawback is that there is no benefit from redundantly stored data blocks in HDFS. This is especially significant in such an I/O-intensive data processing scenario with a high number of read operations and relatively short computing time.

For the second option, where data is aggregated into large SequenceFiles and stored in HDFS, the Droid identification was nearly 10 times faster than the single-threaded execution and more than twice as fast as the Hadoop job reading data from the external file server. However, the significant amount of time required to aggregate the small files must be taken into consideration. Furthermore, the comparison between Tika and Droid has shown that this advantage only comes to bear where the files must be read completely.

As an outlook on further work, it would make sense to enable stream identification in DROID so that it is not necessary to create temporary files. Furthermore, for storing the small files in HDFS, HBase as a data store on top of this file system is a promising approach.

By shsdev, posted in shsdev's Blog

24th May 2013, 11:44 AM
