As part of the evaluation framework I'm developing for OPF and Scape, I've been working on gathering a corpus of files to run experiments against.
Although Govdocs1 would seem like a good place to start, there are a few problems:
1) It's too big; one million files is just showing off.
2) It's full of repeats! There are over 700,000 PDF files.
3) Running experiments on one million files full of repeats generates too much data (yes, there is such a thing).
So I went on a mission to reduce the corpus in size, which I explain here.
In order to reduce the corpus in size I am relying on the ground-truth data, which is the result of running the File Identification Toolset (FI-Tools) over the corpus. Now, the ground-truth data may not be correct, but I am relying on it being consistently wrong, such that the size of the corpus can still be easily reduced. We shall hope to find out later whether it is wrong.
Stage 1 – Eradicate all Free Variables (Mr Bond)
The ground-truth data also pulls out many of the characteristics of each file. Since we are only interested in the identification data, a lot of this can be removed.
Properties to remove:
- Last Saved
- Last Printed
- Title
- Number of Pages
- SHA-1
- Image Size
- File Name (for now)
- File Size (for now)
- other characteristics…
Properties to keep:
- Extension
- Description
- Version (& related information)
- Valid File Extensions
- Accuracy of Identification
- Content
- Creating Program (or library)
- Description Index (serial code)
- Extension Valid (Y/N)
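Stage 1 could be sketched as a simple filter over each ground-truth record. This is an illustrative Python sketch only; the field names below are taken from the lists above, not from the exact FI-Tools output format.

```python
# Keep only the identification-related keys; drop the variable ones
# (file name, file size, dates, hashes, and so on).
KEEP = {
    "Extension", "Description", "Version",
    "Valid File Extensions", "Accuracy of Identification",
    "Content", "Creating Program", "Description Index", "Extension Valid",
}

def strip_record(record):
    """Return a copy of the record with the variable attributes removed."""
    return {k: v for k, v in record.items() if k in KEEP}

# Hypothetical example record, for illustration only.
record = {
    "File Name": "123456.pdf",
    "SHA-1": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
    "Extension": "PDF",
    "Version": "1.4",
    "Creating Program": "Acrobat Distiller 5.0",
}
stripped = strip_record(record)
```

After stripping, two files with the same version and creating program collapse to identical records, which is what makes the deduplication in Stage 2 possible.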
Stage 2 – Sort-id
This is an easy stage, run:
$ sort -u data.txt > limit.txt
This gives us 4653 unique identifications made up of 87 different extensions. Of the 4653 identifications:
PDF  | 3337 |
TEXT |  267 |
XLS  |  194 |
HTML |  169 |
DOC  |  147 |
PPT  |   58 |
PS   |   52 |
LOG  |   52 |
…    |    … |
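The counts above can be reproduced by deduplicating the stripped records and tallying per extension. A minimal Python sketch, using made-up data in place of the real `data.txt`:

```python
from collections import Counter

# Toy stand-in for the stripped ground-truth lines in data.txt.
lines = [
    "PDF|1.4|Acrobat Distiller 5.0",
    "PDF|1.4|Acrobat Distiller 5.0",  # exact duplicate, removed below
    "PDF|1.3|Ghostscript",
    "TEXT|ASCII|",
]

# Equivalent of `sort -u data.txt > limit.txt`: sorted, unique lines.
unique_ids = sorted(set(lines))

# Count distinct identifications per extension (first pipe-separated field).
per_extension = Counter(line.split("|")[0] for line in unique_ids)
```

On the real data this yields the 4653 unique identifications and the per-extension breakdown shown in the table.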
Only 20 extensions have more than 20 different identification types, probably down to the small number of files of those types in the Govdocs selection. However, it is still shocking to see that PDFs can be created in 3337 different ways. Considering some other formats have never changed (e.g. plain text), we have 20 or so versions of PDF (including PDF/A) and loads of creation libraries. By trying to solve the problem, have we actually made it worse?
At this point we could just stop and select 4653 files, one of each type of identification.
Stage 3 – Select some Files
The final stage is to actually select some files of each of the 4653 types of identification.
It was decided to select 10 of each type of identification where possible.
If fewer than 10 were available, all of them were selected.
Where more than 10 were available, the following selection policy applies:
- Select the largest in filesize
- Select the smallest in filesize
- Select 8 random others.
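The selection policy above could be implemented along these lines. This is a sketch under the assumption that each identification type maps to a list of `(file_name, file_size)` pairs; the function name and seeding are my own choices, not from the original code.

```python
import random

def select_files(files, max_files=10, n_random=8, seed=0):
    """Apply the Stage 3 policy to one identification type.

    If there are `max_files` or fewer candidates, take them all.
    Otherwise take the largest, the smallest, and `n_random` random others.
    """
    if len(files) <= max_files:
        return list(files)
    by_size = sorted(files, key=lambda f: f[1])
    smallest, largest = by_size[0], by_size[-1]
    rest = by_size[1:-1]  # everything except the two extremes
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return [largest, smallest] + rng.sample(rest, n_random)

# Hypothetical usage: 50 candidate files of one identification type.
files = [(f"file{i}.pdf", i * 100) for i in range(50)]
chosen = select_files(files)
```

Seeding the random choice is an assumption on my part, but it makes the reduced corpus reproducible from the full one.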
Stage 4 – Publish
Further to this, I'll also push up the code that does all this.
ecochrane
July 29, 2012 @ 8:41 AM Europe/Berlin
Hi Dave,
I wonder if you’ve been a bit hasty in removing the date information.
Date created/saved may help to indicate the possible range of applications that could have created a file, and therefore also the possible variant of a format that the file's structure might match. At the very least it can limit the set to applications released before that date.
Euan
davetaz
July 27, 2012 @ 4:22 PM Europe/Berlin
I cut them based upon removing the variable attributes, such as file size and file name, and keeping the more constant ones like version and creating program. So the sheer number of PDFs is down to the combination of "all the keys": the 10 files in a set will have all those keys (version, creating program, etc.) the same, and none of the included keys can differ for any one of the files in each set.
"Accuracy of Identification" and "Extension Validity" are simply fields that FiTools outputs that I chose to keep. Droid has similar ones, so you should look to FiTools for how "they" measure that.
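The "all the keys" grouping described in this reply could be sketched as follows. This is my own illustrative reconstruction, with hypothetical key names; the real key set includes every property kept in Stage 1.

```python
from collections import defaultdict

# Illustrative subset of the kept identification keys.
KEYS = ("Extension", "Version", "Creating Program")

def group_by_identification(records):
    """Group files so that a set contains only files whose kept keys all match."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(k) for k in KEYS)
        groups[key].append(rec["File Name"])
    return groups

# Hypothetical records: the first two share every key, the third differs.
recs = [
    {"File Name": "a.pdf", "Extension": "PDF", "Version": "1.4", "Creating Program": "Distiller"},
    {"File Name": "b.pdf", "Extension": "PDF", "Version": "1.4", "Creating Program": "Distiller"},
    {"File Name": "c.pdf", "Extension": "PDF", "Version": "1.3", "Creating Program": "Ghostscript"},
]
groups = group_by_identification(recs)
```

Each resulting group corresponds to one of the 4653 identification types, from which Stage 3 then samples.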
davetaz
July 27, 2012 @ 4:21 PM Europe/Berlin
Short answer: identification is the only aim of this set. For characterisation and property extraction, the way this set is chosen is completely unsuitable; there, a different set of unique characteristics should be selected.
cbecker
July 27, 2012 @ 9:32 AM Europe/Berlin
Great to see content profiling coming up again as a topic!
Dave, it would help if you could elaborate a bit on some underlying assumptions here. Frankly, I don’t buy your 3 problems.
(1) For which task is one million too big?
(2) Why should the three letters “pdf” alone be enough to claim a “repeat”?
And (3) is just a conclusion on (1)+(2).
I am with Petar when he says that the format is by far not enough. (Check out http://www.openpreservation.org/blogs/2012-07-27-fits-or-not-fits)
We are *not* just interested in format identification. Not at all! The format is one property of many, even though it is a key property. But what we are aiming for in DP in the end is the ability to render content. This can fail in myriad ways for any format, and the reason for that failure is to be found in the other properties. By discarding them, you are left with a little something that might not be too meaningful in the end.
I doubt we can assume that the identification data is consistently wrong. Do you have a reason for that assumption?
The reduced set might be useful for development, as a training set, as a validation set, but in the end, I’m interested in the full set. I want tools that are capable of analysing the entire set. It’s still much smaller than the real world…
peshkira
July 27, 2012 @ 7:47 AM Europe/Berlin
Hi Dave!
Very interesting blog post. Since I am dealing with content profiling I have a couple of questions about it.
I didn't quite follow how you cut the files. Consider the PDFs: is this one PDF for each version/creating-app combination, or something else? If that is correct, how do you deal with related properties and their combinations?
And secondly, I am very interested in the ‘Accuracy of Identification’ & ‘Extension Validity’. How do you measure that?
Thanks in advance for a short reply.
cheers!