Apache Tika File Mime Type Identification and the Importance of Metadata

 

An evaluation was recently carried out to determine how well Apache Tika was able to identify the mime types of a corpus of test files, described in the ‘Data Set’ section. The purpose of the evaluation was to determine:

1. whether the performance* of Tika has changed between version 1.0 and the current version, 1.3, and

2. how the provision of metadata, in the form of the file name, affects the performance of Tika.

To address the first point, the evaluation was carried out four times, once for each of the four available versions of Tika: 1.0, 1.1, 1.2 and 1.3.

The second point was addressed by running the evaluation twice for each version of Tika: the first test passed only a file input stream to the Tika ‘detect’ method; the second test passed both the file input stream and the file name.

In total eight tests were carried out; the results are shown in the Results section below.

* For the purposes of this evaluation, the performance of Tika is measured by the number of file mime types identified correctly when compared against a ground truth, described in the ‘Data Set’ section.
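The measure defined above, applied to the two test modes, can be sketched roughly as follows. This is a minimal illustration in Python, not the evaluator that was actually run. The `detect` callable is a hypothetical stand-in for the Tika ‘detect’ method, and `toy_detect`, the file names and the ground truth values are all invented for the example.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-in for the Tika 'detect' method: takes the file content
# and, optionally, the file name, and returns a mime type string.
Detector = Callable[[bytes, Optional[str]], str]


def evaluate(files: Dict[str, bytes],
             ground_truth: Dict[str, str],
             detect: Detector,
             use_filename: bool) -> Tuple[int, int, float]:
    """Return (correct, incorrect, percent correct) for one test run."""
    correct = 0
    for name, content in files.items():
        # Stream-only tests withhold the name; the other tests supply it.
        detected = detect(content, name if use_filename else None)
        if detected == ground_truth[name]:
            correct += 1
    incorrect = len(files) - correct
    percent = round(100.0 * correct / len(files), 3)
    return correct, incorrect, percent


def toy_detect(content: bytes, name: Optional[str]) -> str:
    """Invented detector, for illustration only (not Tika's logic)."""
    if content.startswith(b"%PDF-"):
        return "application/pdf"
    if name is not None and name.endswith(".csv"):
        return "text/csv"
    return "text/plain"


# Invented three-file corpus and ground truth.
files = {"a.pdf": b"%PDF-1.4 ...", "b.csv": b"x,y\n1,2\n", "c.txt": b"hello"}
truth = {"a.pdf": "application/pdf", "b.csv": "text/csv", "c.txt": "text/plain"}

stream_only = evaluate(files, truth, toy_detect, use_filename=False)
with_name = evaluate(files, truth, toy_detect, use_filename=True)
```

Running the same corpus through `evaluate` twice, once per mode, mirrors the pairing of tests used for each Tika version.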

Data Set

The set of test files consists of the Govdocs corpus of almost 1 million files, freely available from http://digitalcorpora.org/corpora/files. The ground truth for these files has been provided by Forensic Innovations, Inc. and is available from http://digitalcorpora.org/corp/files/govdocs1/groundtruth-fitools.zip.

Platform

The evaluation was run as a Cloudera Hadoop map/reduce process on an HP ProLiant DL 385p Gen8 host with 32 CPUs, 224 GB of RAM and a clock rate of 2.295 GHz, using ESXi to run 32 virtual machines. The Hadoop configuration is a Cloudera (cdh4.2.0) Hadoop 30-node cluster consisting of a manager node, a master node and 28 slave nodes, located at the British Library in the UK. Each node runs on its own virtual machine with 1 core, 500 GB of storage and 6 GB of RAM.

Results

In total the evaluator process was run eight times on the Govdocs corpus.

The table below shows the number of files processed by Apache Tika, the number identified correctly, the number identified incorrectly and the percentage identified correctly. The ‘Filename Used?’ column indicates whether the Tika ‘detect’ method was passed only the file input stream (‘N’), or both the file input stream and the file name (‘Y’).

 

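| Test | Tika Version | Files Processed | Files Identified Correctly | Files Identified Incorrectly | Files Correctly Identified (%) | Filename Used? |
|------|--------------|-----------------|----------------------------|------------------------------|--------------------------------|----------------|
| 1    | 1.0          | 973693          | 757326                     | 216367                       | 77.779                         | N              |
| 2    | 1.1          | 973693          | 757240                     | 216453                       | 77.770                         | N              |
| 3    | 1.2          | 973693          | 758549                     | 215144                       | 77.904                         | N              |
| 4    | 1.3          | 973693          | 758557                     | 215136                       | 77.905                         | N              |
| 5    | 1.0          | 973693          | 945555                     | 28138                        | 97.110                         | Y              |
| 6    | 1.1          | 973693          | 945516                     | 28177                        | 97.106                         | Y              |
| 7    | 1.2          | 973693          | 938138                     | 35555                        | 96.348                         | Y              |
| 8    | 1.3          | 973693          | 938148                     | 35545                        | 96.349                         | Y              |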

Table 1 – File mime types identified correctly/incorrectly by Apache Tika
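As a cross-check, the percentage column can be reproduced from the counts in Table 1, and the effect of supplying the file name can be expressed as a gain in percentage points per version. The sketch below only re-does arithmetic on the published counts; the variable and function names are illustrative choices.

```python
# Counts copied from Table 1: (Tika version, filename used?, files correct).
FILES_PROCESSED = 973693
TABLE_1 = [
    ("1.0", False, 757326),
    ("1.1", False, 757240),
    ("1.2", False, 758549),
    ("1.3", False, 758557),
    ("1.0", True, 945555),
    ("1.1", True, 945516),
    ("1.2", True, 938138),
    ("1.3", True, 938148),
]


def percent_correct(correct: int, total: int = FILES_PROCESSED) -> float:
    """Percentage of files identified correctly, to three decimal places."""
    return round(100.0 * correct / total, 3)


# Percentage for every (version, filename used?) combination.
percent = {(version, used): percent_correct(correct)
           for version, used, correct in TABLE_1}

# Percentage-point gain from supplying the file name, per Tika version.
gain = {version: round(percent[(version, True)] - percent[(version, False)], 3)
        for version in ("1.0", "1.1", "1.2", "1.3")}
```

On these figures the gain works out to about 19.3 percentage points for versions 1.0 and 1.1 and about 18.4 for versions 1.2 and 1.3.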

Observations

The results in Table 1 show that, when used with a file input stream only, the performance of Tika improves slightly between versions 1.0 and 1.3 (from 77.779% to 77.905%). However, when Tika is used with both a file name and a file input stream, the performance degrades between versions 1.0 and 1.3 (from 97.110% to 96.349%).

Further investigation shows that the files that were identified correctly in Tika version 1.0 but identified incorrectly in version 1.3 were of the following types:

 

| Tika v1.3 Mime Type                    | Number of Files |
|----------------------------------------|-----------------|
| application/msword                     | 1               |
| application/octet-stream               | 61              |
| application/rss+xml                    | 4               |
| application/x-tika-msworks-spreadsheet | 2               |
| application/zip                        | 2               |
| message/x-emlx                         | 10              |
| text/plain                             | 6               |
| text/x-log                             | 8107            |
| text/x-matlab                          | 2               |
| text/x-perl                            | 2               |
| text/x-python                          | 4               |
| text/x-sql                             | 295             |

Table 2 – Number of files identified correctly in version 1.0 but incorrectly in version 1.3
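A breakdown like Table 2 can be derived by comparing the per-file results of two runs against the ground truth. The following is a minimal sketch of that comparison, assuming each run is available as a mapping from file name to detected mime type; the function name and the sample data are invented for illustration and are not taken from the evaluation itself.

```python
from collections import Counter
from typing import Dict


def regressions_by_type(truth: Dict[str, str],
                        old: Dict[str, str],
                        new: Dict[str, str]) -> Counter:
    """Count files the old run got right and the new run got wrong,
    keyed by the mime type reported in the new run."""
    counts: Counter = Counter()
    for name, expected in truth.items():
        if old[name] == expected and new[name] != expected:
            counts[new[name]] += 1
    return counts


# Invented sample data: ground truth plus the results of two runs.
truth = {"a.log": "text/plain", "b.log": "text/plain",
         "c.sql": "text/plain", "d.doc": "application/msword"}
v10 = {"a.log": "text/plain", "b.log": "text/plain",
       "c.sql": "text/plain", "d.doc": "application/octet-stream"}
v13 = {"a.log": "text/x-log", "b.log": "text/x-log",
       "c.sql": "text/x-sql", "d.doc": "application/msword"}

regressed = regressions_by_type(truth, v10, v13)
```

Files that were wrong in the old run and right in the new one (`d.doc` in the sample) are deliberately excluded, since the table only lists regressions.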

 

 

Further investigation into the files identified by Tika 1.3 as ‘text/x-log’ shows that these are text files with a file extension of ‘.log’. These files were identified by Tika versions 1.0 and 1.1 as having a mime type of ‘text/plain’, which matches the ground truth mime type. Similarly, Tika versions 1.2 and 1.3, when used with just an input stream, also identified these files as ‘text/plain’, again matching the ground truth.

However, when Tika versions 1.2 and 1.3 were provided with the file name, they identified .log files as having a mime type of ‘text/x-log’. As the ‘text/plain’ group of files encompasses a large and diverse set of file types, including logs, source code, properties/config files, data files, etc., this could be considered an improvement, as it provides greater differentiation between the different file types.
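To gauge how much of the drop is explained by the .log files alone, the counts from Table 1 and Table 2 can be combined. The sketch below rests on two assumptions that are not part of the original scoring: that all 8107 ‘text/x-log’ files in Table 2 come from the test in which the file name was supplied (test 8), and that ‘text/x-log’ is accepted as a correct refinement of ‘text/plain’.

```python
FILES_PROCESSED = 973693
CORRECT_V10_WITH_NAME = 945555  # Table 1, test 5
CORRECT_V13_WITH_NAME = 938148  # Table 1, test 8
X_LOG_REGRESSIONS = 8107        # Table 2, text/x-log row


def percent(correct: int) -> float:
    """Percentage of the corpus identified correctly, to three decimals."""
    return round(100.0 * correct / FILES_PROCESSED, 3)


# Assumption: treat every text/x-log regression as a correct identification.
adjusted_correct = CORRECT_V13_WITH_NAME + X_LOG_REGRESSIONS
adjusted_percent = percent(adjusted_correct)

# Would version 1.3 then match or exceed version 1.0 with the file name?
beats_v10 = adjusted_correct >= CORRECT_V10_WITH_NAME
```

Under those assumptions the adjusted figure for version 1.3 is 946255 files, or about 97.182%, which is slightly above the 97.110% recorded for version 1.0.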

Possible Future Work

The results of the tests show that Apache Tika relies heavily on the file name when carrying out file identification. In the future this work could be extended to investigate how easily Tika can be fooled into identifying a file wrongly when it is given an incorrect or misleading file extension as part of the file name.
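One possible shape for such an experiment is sketched below: each file's content is paired with a set of misleading names, and the cases where detection no longer matches the expected type are recorded. This is only an outline under stated assumptions. The `detect` callable is a hypothetical stand-in for the Tika ‘detect’ method, and `toy_detect` is an invented detector that deliberately trusts the extension first, so it says nothing about how Tika itself behaves.

```python
import os
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical stand-in for the Tika 'detect' method.
Detector = Callable[[bytes, Optional[str]], str]


def misleading_names(name: str, decoy_extensions: Iterable[str]) -> List[str]:
    """Return copies of `name` with its extension swapped for each decoy."""
    stem, ext = os.path.splitext(name)
    return [stem + decoy for decoy in decoy_extensions if decoy != ext]


def fooled_by(content: bytes, name: str, expected: str,
              detect: Detector,
              decoy_extensions: Iterable[str]) -> List[Tuple[str, str]]:
    """List (misleading name, detected type) pairs where detection was wrong."""
    fooled = []
    for fake in misleading_names(name, decoy_extensions):
        detected = detect(content, fake)
        if detected != expected:
            fooled.append((fake, detected))
    return fooled


def toy_detect(content: bytes, name: Optional[str]) -> str:
    """Invented detector that trusts the extension first (not Tika's logic)."""
    if name is not None and name.endswith(".log"):
        return "text/x-log"
    if content.startswith(b"%PDF-"):
        return "application/pdf"
    return "text/plain"


result = fooled_by(b"%PDF-1.4 ...", "report.pdf", "application/pdf",
                   toy_detect, [".log", ".txt", ".pdf"])
```

Keeping the detector as a parameter means the same harness could be pointed at different Tika versions, or at stream-only detection as a control.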

By marwoodls, posted in marwoodls's Blog

20th May 2013, 12:43 PM
