File Identification using Fido and the UDFR Registry

Task:

I primarily wanted to get an understanding of SPARQL queries and how they can be used to query linked data. As a focus for my work, I set myself a challenge to get Fido working using signatures from the UDFR registry.

Solution:

The code (available on GitHub) consists of two Python scripts. The first, UDFR_wrapper.py, provides a wrapper for running SPARQL queries against the UDFR registry. The second, fido_prepare.py (based on Fido’s prepare.py), uses this wrapper to generate a Fido signature XML file, which can be loaded using Fido’s -loadformat argument. To demonstrate that Fido uses only this signature file, I made some minor modifications to Fido, which are listed in the readme.md.
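As a rough illustration of what such a wrapper involves, here is a minimal sketch using only the Python standard library. It is not the code from UDFR_wrapper.py: the endpoint URL is a placeholder, the query uses only the generic rdfs:label predicate rather than the UDFR ontology, and the function names are made up for this example. It assumes the endpoint follows the standard SPARQL protocol and can return results in the SPARQL JSON results format.

```python
import json
import urllib.parse
import urllib.request

# Placeholder endpoint (an assumption, not the real UDFR address).
ENDPOINT = "http://example.org/sparql"

# Generic query: list a few resources together with their labels.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?format ?label
WHERE { ?format rdfs:label ?label }
LIMIT 10
"""


def build_request(endpoint, query):
    """Build a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )


def parse_bindings(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in data["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the query and return the parsed rows (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, query)) as resp:
        return parse_bindings(resp.read().decode("utf-8"))
```

Keeping the parsing separate from the network call means the parsing can be tried out on a saved response without touching the registry.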

This is only a quick demo of the concept. The code is very rough around the edges and hasn’t been tested, and there are probably better SPARQL queries to use, but it does give an indication of how to use the UDFR registry with an identification tool.

What challenges were there?

Besides creating appropriate queries to retrieve format information and the associated signatures (currently done in two steps), the signatures themselves posed a couple of problems. Firstly, the registry stores internal signatures as byte values, so I had to translate these to regular expressions for use within Fido (using Fido’s convert_to_regex translation function in prepare.py). Secondly, some signatures returned by the SPARQL query contained HTML entities (e.g. &#123; for {). These had to be decoded before the regex translation could be made (using some very helpful code found on StackOverflow).
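The two steps can be sketched as follows. This is a simplified stand-in, not Fido’s convert_to_regex: it only understands hex byte pairs, the ?? wildcard and {n}, {n-m} and {n-*} gaps, and the function names and the sample signature are made up for illustration. The entity decoding uses html.unescape from the Python 3 standard library rather than the StackOverflow code mentioned above.

```python
import html
import re


def decode_entities(raw: str) -> str:
    """Turn HTML entities such as &#123; back into literal characters."""
    return html.unescape(raw)


def hex_to_regex(signature: str) -> bytes:
    """Translate a simplified hex byte signature into a bytes regex."""
    parts = []
    i = 0
    while i < len(signature):
        if signature[i] == "{":
            # Gap of arbitrary bytes: {n}, {n-m} or {n-*}.
            end = signature.index("}", i)
            low, _, high = signature[i + 1:end].partition("-")
            if high == "*":
                bounds = low + ","
            elif high:
                bounds = low + "," + high
            else:
                bounds = low
            parts.append(b".{" + bounds.encode("ascii") + b"}")
            i = end + 1
        elif signature[i:i + 2] == "??":
            # Single wildcard byte.
            parts.append(b".")
            i += 2
        else:
            # Literal byte written as two hex digits.
            parts.append(re.escape(bytes.fromhex(signature[i:i + 2])))
            i += 2
    return b"".join(parts)


# Hypothetical signature: "%PDF", a gap of 1 to 2 bytes, then a newline.
raw = "25504446&#123;1-2&#125;0A"
# DOTALL lets "." match any byte, including newline bytes.
pattern = re.compile(hex_to_regex(decode_entities(raw)), re.DOTALL)
print(bool(pattern.match(b"%PDF-1\n")))
```

Decoding has to come first: with the entities still in place, the braces that mark a gap are not visible to the translation step.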

Results:

I ran Fido with the generated signature file, which was built from the 329 results the queries returned for conversion to Fido signatures. I think this is fewer than there should be, suggesting my queries aren’t quite right.

Running it on a test PDF gave results (“Acrobat PDF 1.5”, “application/pdf”) similar to those from standard Fido (i.e. the Fido available from GitHub).

A docx file was identified as “ZIP Format” (“application/zip”). This is to be expected: Fido relies on its container-signature and format_extensions XML files to detect such formats (e.g. “Microsoft Office Open XML – Word”), and I had deleted these files in order to test that Fido uses only the UDFR signature file.
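To see why a byte signature alone stops at “ZIP Format”, here is a small sketch that builds a docx-like archive in memory and identifies it twice, once from its leading bytes and once by looking inside the container. It is an illustration only and does not reproduce Fido’s container-signature logic. The function names and returned labels are made up, and the archive is a bare-bones stand-in rather than a valid Word document.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # local file header at the start of a ZIP archive


def make_fake_docx() -> bytes:
    """Build a minimal ZIP that only mimics the layout of a docx file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


def identify_by_magic(data: bytes) -> str:
    """Look only at the leading bytes, as a plain byte signature would."""
    return "ZIP Format" if data.startswith(ZIP_MAGIC) else "unknown"


def identify_by_container(data: bytes) -> str:
    """Open the archive and check for a path typical of Word documents."""
    if not data.startswith(ZIP_MAGIC):
        return "unknown"
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
    if "word/document.xml" in names:
        return "Office Open XML Word (by container contents)"
    return "ZIP Format"


docx = make_fake_docx()
print(identify_by_magic(docx))
print(identify_by_container(docx))
```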

Standard Fido identified a test JPG image as “JFIF 1.01”, “image/jpeg”, whereas UDFR Fido came back with “Raw JPEG”, “image/jpeg”.

So there are some slight discrepancies to look through, but vaguely the right answers for this quick test. Some adjustments are, more than likely, needed to the SPARQL queries, and adaptations are needed to handle container formats such as Office documents. It would also be a good idea to compare and validate against a known test corpus, so that gaps can be identified and results can easily be conveyed.
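A comparison along those lines could start as simply as the sketch below. The file names are hypothetical and the function is made up for this example. The sample values echo the quick test above, with the PDF treated as an exact match for the sake of illustration. A real run would fill the two dictionaries from the output of each Fido variant over the whole corpus.

```python
def compare_results(baseline, candidate):
    """Return {path: (baseline_result, candidate_result)} where they differ."""
    gaps = {}
    for path, expected in baseline.items():
        actual = candidate.get(path)  # None if the file was not identified
        if actual != expected:
            gaps[path] = (expected, actual)
    return gaps


# (format name, MIME type) per file, loosely based on the quick test above.
standard_fido = {
    "test.pdf": ("Acrobat PDF 1.5", "application/pdf"),
    "test.jpg": ("JFIF 1.01", "image/jpeg"),
}
udfr_fido = {
    "test.pdf": ("Acrobat PDF 1.5", "application/pdf"),
    "test.jpg": ("Raw JPEG", "image/jpeg"),
}

for path, (expected, actual) in compare_results(standard_fido, udfr_fido).items():
    print(path, "standard:", expected, "UDFR:", actual)
```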

1 Comment

  1. andy jackson
    September 10, 2012 @ 12:52 pm CEST

    Great to see this. Just wanted to note a cross-reference to this post on Gary McGath’s File Formats Blog: https://fileformats.wordpress.com/2012/09/06/sparql/
