Apache Tika File Mime Type Identification and the Importance of Metadata

An evaluation was recently carried out to determine how well Apache Tika was able to identify the mime types of a corpus of test files, described in the ‘Data Set’ section. The purpose of the evaluation was to determine:

1. whether the performance* of Tika has changed between version 1.0 and the current version, 1.3, and

2. how the provision of metadata, in the form of the file name, affects the performance of Tika.

In order to address the first point, the evaluation was carried out four times, once for each of the four available versions of Tika: 1.0, 1.1, 1.2 and 1.3.

The second point was addressed by running the evaluation twice for each version of Tika: the first test passed only a file input stream to the Tika ‘detect’ method; the second test passed both the file input stream and the file name.
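To make the two configurations concrete, a minimal sketch against the Tika 1.x detector API is shown below. The class name and command-line handling are illustrative only; the actual evaluator ran inside a map/reduce job.

    import java.io.File;
    import java.io.InputStream;

    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;

    public class DetectExample {
        public static void main(String[] args) throws Exception {
            File file = new File(args[0]);
            Detector detector = TikaConfig.getDefaultConfig().getDetector();

            // Test style 1: stream only -- detection can rely on magic bytes alone.
            try (InputStream in = TikaInputStream.get(file)) {
                MediaType streamOnly = detector.detect(in, new Metadata());
                System.out.println("stream only:   " + streamOnly);
            }

            // Test style 2: stream plus file name -- the name hint is supplied
            // through the Metadata object.
            try (InputStream in = TikaInputStream.get(file)) {
                Metadata metadata = new Metadata();
                metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
                MediaType withName = detector.detect(in, metadata);
                System.out.println("with filename: " + withName);
            }
        }
    }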

In total, eight tests were carried out; the results are shown in the Results section below.

* For the purposes of this evaluation the performance of Tika is measured by the number of file mime types identified correctly when compared against a ground truth, described in the ‘Data Set’ section.
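As a rough illustration of this measure, the sketch below tallies exact matches between detected and ground-truth types. The file names and types in it are hypothetical; the real comparison was driven by the Forensic Innovations ground-truth files.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class CorrectnessTally {
        public static void main(String[] args) {
            // Hypothetical results: file name -> {detected type, ground-truth type}.
            Map<String, String[]> results = new LinkedHashMap<>();
            results.put("000001.log", new String[] {"text/x-log", "text/plain"});
            results.put("000002.pdf", new String[] {"application/pdf", "application/pdf"});

            long correct = results.values().stream()
                    .filter(r -> r[0].equals(r[1]))  // exact string match only
                    .count();
            System.out.printf("%d/%d correct (%.3f%%)%n",
                    correct, results.size(), 100.0 * correct / results.size());
        }
    }

Note that under this exact-match rule a more specific answer such as ‘text/x-log’ counts as incorrect against a ‘text/plain’ ground truth, a point picked up in the Observations and the comments below.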

Data Set

The set of test files is the Govdocs1 corpus of almost 1 million files, freely available from http://digitalcorpora.org/corpora/files. The ground truth for these files has been provided by Forensic Innovations, Inc. and is available from http://digitalcorpora.org/corp/files/govdocs1/groundtruth-fitools.zip.

Platform

The evaluation was run as a Cloudera Hadoop map/reduce process on an HP ProLiant DL385p Gen8 host with 32 CPUs, 224 GB of RAM and a clock rate of 2.295 GHz, using ESXi to run 32 virtual machines. The Hadoop configuration is a Cloudera (cdh4.2.0) 30-node Hadoop cluster consisting of a manager node, a master node and 28 slave nodes, located at the British Library in the UK. Each node runs on its own virtual machine with 1 core, 500 GB of storage and 6 GB of RAM.

Results

In total the evaluator process was run eight times on the Govdocs1 corpus.

The table below shows the number of files processed by Apache Tika, the number correctly identified, the number incorrectly identified and the percentage identified correctly. The ‘Filename Used?’ column indicates whether the Tika detect method was passed only the file input stream (‘N’), or both the file input stream and the file name (‘Y’).

 

Test | Tika Version | Files Processed | Files Identified Correctly | Files Identified Incorrectly | Files Correctly Identified (%) | Filename Used?
-----|--------------|-----------------|----------------------------|------------------------------|--------------------------------|---------------
 1   | 1.0          | 973693          | 757326                     | 216367                       | 77.779                         | N
 2   | 1.1          | 973693          | 757240                     | 216453                       | 77.770                         | N
 3   | 1.2          | 973693          | 758549                     | 215144                       | 77.904                         | N
 4   | 1.3          | 973693          | 758557                     | 215136                       | 77.905                         | N
 5   | 1.0          | 973693          | 945555                     | 28138                        | 97.110                         | Y
 6   | 1.1          | 973693          | 945516                     | 28177                        | 97.106                         | Y
 7   | 1.2          | 973693          | 938138                     | 35555                        | 96.348                         | Y
 8   | 1.3          | 973693          | 938148                     | 35545                        | 96.349                         | Y

Table 1 – File mime types identified correctly/incorrectly by Apache Tika

Observations

The results in Table 1 show that, when used with a file input stream only, the performance of Tika improves slightly between versions 1.0 and 1.3. However, when Tika is used with both a file name and a file input stream the performance degrades between versions 1.0 and 1.3.

Further investigation shows that the files that were identified correctly in Tika version 1.0 but identified incorrectly in version 1.3 were of the following types:

 

Tika v1.3 Mime Type                    | Number of Files
---------------------------------------|----------------
application/msword                     | 1
application/octet-stream               | 61
application/rss+xml                    | 4
application/x-tika-msworks-spreadsheet | 2
application/zip                        | 2
message/x-emlx                         | 10
text/plain                             | 6
text/x-log                             | 8107
text/x-matlab                          | 2
text/x-perl                            | 2
text/x-python                          | 4
text/x-sql                             | 295

Table 2 – Number of files identified correctly in version 1.0 but incorrectly in version 1.3

Further investigation into the files identified by Tika 1.3 as ‘text/x-log’ shows that these are text files with a file extension of ‘.log’. These files were identified by Tika versions 1.0 and 1.1 as having a mime type of ‘text/plain’, which matches the ground truth mime type. Similarly, Tika versions 1.2 and 1.3, when used with just an input stream, also identified these files as ‘text/plain’, again matching the ground truth.

However, when Tika versions 1.2 and 1.3 were provided with the filename, they identified .log files as having a mime type of ‘text/x-log’. As the ‘text/plain’ group of files encompasses a large and diverse set of file types, including logs, source code, properties/config files, data files, etc., this could be considered an improvement, as it provides greater differentiation between the different file types.
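This parent/child relationship is in fact encoded in Tika’s own mime-type registry. A minimal sketch (assuming a Tika 1.x-era classpath, using org.apache.tika.mime.MediaTypeRegistry) that checks whether ‘text/x-log’ is registered as a specialisation of ‘text/plain’ might look like:

    import org.apache.tika.mime.MediaType;
    import org.apache.tika.mime.MediaTypeRegistry;

    public class HierarchyCheck {
        public static void main(String[] args) {
            MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry();
            MediaType log = MediaType.parse("text/x-log");
            MediaType plain = MediaType.parse("text/plain");

            // Walk up the supertype chain from text/x-log
            // (expected: text/x-log -> text/plain -> application/octet-stream).
            for (MediaType t = log; t != null; t = registry.getSupertype(t)) {
                System.out.println(t);
            }

            // true if the registry records text/x-log as a specialisation of text/plain
            System.out.println(registry.isSpecializationOf(log, plain));
        }
    }

A check along these lines could support the kind of hierarchical, ‘fuzzy’ ground-truth matching suggested in the comments below.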

Possible Future Work

The results of the tests show that Apache Tika relies heavily on the filename when carrying out file identification. In the future this work could be extended to investigate how easily Tika can be fooled into identifying a file wrongly after being provided with an incorrect or misleading file extension as part of the filename.
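Such a test might start from a sketch like the one below, which hands Tika’s facade a byte prefix carrying one signature and a file name suggesting another. The input bytes and file name here are hypothetical.

    import java.nio.charset.StandardCharsets;

    import org.apache.tika.Tika;

    public class MisleadingExtension {
        public static void main(String[] args) {
            Tika tika = new Tika();

            // Bytes that start with the PDF magic number...
            byte[] pdfPrefix = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);

            // ...offered to Tika under a name that claims to be a log file.
            String detected = tika.detect(pdfPrefix, "report.log");

            // If magic/container matches override the name hint (see the
            // comments below), this should print application/pdf.
            System.out.println(detected);
        }
    }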


5 Comments

  1. andy jackson
    May 23, 2013 @ 10:45 AM Europe/Berlin

    Wonderful. Thank you.

  2. marwoodls
    May 23, 2013 @ 10:23 AM Europe/Berlin

    The raw results can be found at:

    https://github.com/openplanets/cc-benchmark-tests

    These are in the form of CSV files which show, for each of the test files in the Govdocs1 corpus, the mime type as identified by Tika, and the mime type(s) according to the ground truth.

     

  3. andy jackson
    May 23, 2013 @ 10:04 AM Europe/Berlin

     

    It would be really great to have the detailed results, so that the anomalies can be examined and the tools can be improved. Is there any chance of publishing the raw data, if only for the 'disagreements', for each Tika version?

    Note also that, from my understanding of the source code, Tika uses the file extensions but the extension-based result is always overridden by a 'magic number' or container match, i.e. a wrong extension should only confuse it when the extension is all it has to go on. However, as you say, having a corpus-based test for this would be a great idea.

  4. andy jackson
    May 23, 2013 @ 9:57 AM Europe/Berlin

    Yes, agree very much with you Peter. The text/x-log case is an improvement, in my opinion, and I feel a hierarchy is natural. For example, a text/x-log can also be interpreted as a text/plain, which can also be interpreted as application/octet-stream.

    If we can agree on this approach, I think we should look at these anomalies and consider revising the 'ground truth' accordingly.

     

  5. pmay
    May 23, 2013 @ 9:37 AM Europe/Berlin

    My feeling is that this says much more about the ground truths and the methodology we use to measure accuracy.

    Assuming "text/x-log" is a specialism of "text/plain", and as such could be considered "correct", then Tika 1.2 using filenames achieves a correctness of 97.181% ((938138+8107)/973693), a very slight improvement on Tika 1.0's 97.110%. Other specialisms would show further improvement in the performance as measured by "correctness" accuracy.

    Perhaps what is needed is more reliance on a hierarchy of mime-types to enable hierarchical groundtruth matching – fuzzy matching, if you will?
