File Identification using Fido and the UDFR Registry

Task:

I primarily wanted to get an understanding of SPARQL queries and how they can be used to query linked data. As a focus for my work, I set myself a challenge to get Fido working using signatures from the UDFR registry.

Solution:

The code (available on GitHub) consists of two Python scripts. The first, UDFR_wrapper.py, wraps SPARQL queries against the UDFR registry. The second, fido_prepare.py (based on Fido’s prepare.py), uses this wrapper to generate a Fido signature XML file, which can be loaded using Fido’s -loadformat argument. To demonstrate that Fido uses only this signature file, I made some minor modifications to Fido; these are listed in the readme.md.
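To give a flavour of what a thin SPARQL wrapper involves, here is a minimal sketch using only the Python standard library. It is not the code from UDFR_wrapper.py: the endpoint URL is a placeholder, the query uses the generic rdfs:label predicate rather than the UDFR vocabulary, and the function names are my own. It assumes the endpoint accepts GET requests and can return the standard SPARQL JSON results format.

```python
import json
import urllib.parse
import urllib.request

# Placeholder endpoint (assumption): substitute the registry's real SPARQL endpoint.
ENDPOINT = "https://example.org/sparql"

# Illustrative query only: rdfs:label is generic RDFS, not the UDFR vocabulary.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?format ?label WHERE { ?format rdfs:label ?label } LIMIT 10
"""


def build_request(endpoint, query):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def parse_bindings(payload):
    """Flatten a SPARQL JSON result document into a list of {variable: value} dicts."""
    data = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in data["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the query and return the flattened rows (requires network access)."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

Calling run_query(ENDPOINT, QUERY) against a real endpoint would return one dict per result row, which is a convenient shape for a script that then writes out signature XML.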

This is only a quick demo of the concept. The code is very rough around the edges and hasn’t been tested, and there are probably better SPARQL queries to use, but it does give an indication of how to use the UDFR registry with an identification tool.

What challenges were there?

Besides creating appropriate queries to retrieve format information and the associated signatures (currently done in two steps), the signatures themselves posed a couple of problems. Firstly, the registry stores internal signatures as byte values, so I had to translate these to regular expressions for use within Fido (using Fido’s convert_to_regex translation function in prepare.py). Secondly, some signatures returned by the SPARQL query contained HTML entities (e.g. &#123; for {). These had to be decoded before the regex translation could be made (using some very helpful code found on StackOverflow).
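The two clean-up steps can be sketched as follows. This is a deliberately simplified illustration, not Fido’s convert_to_regex: the helper names are my own, and the conversion only handles a plain hex byte string with no wildcards or ranges. In Python 3 the entity decoding is available in the standard library as html.unescape.

```python
import html
import re


def hex_to_regex(hex_signature):
    """Turn a plain hex byte string (no wildcards) into a bytes regex anchored at the start."""
    raw = bytes.fromhex(hex_signature)
    return re.compile(rb"\A" + re.escape(raw))


def prepare_signature(registry_value):
    """Decode any HTML entities first, then convert the cleaned hex string to a regex."""
    cleaned = html.unescape(registry_value).strip()
    return hex_to_regex(cleaned)


# "%PDF" is 25 50 44 46 in hex.
pattern = prepare_signature("25504446")
print(bool(pattern.match(b"%PDF-1.5")))

# Entities such as &#123; and &#125; decode to braces (the pattern text is illustrative).
print(html.unescape("&#123;5-10&#125;"))
```

Decoding before conversion matters because the braces are part of the signature syntax; left as entities, they would be read as literal characters and the translation would fail or produce the wrong pattern.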

Results:

I ran Fido with the generated signature file. The SPARQL queries returned 329 results to convert to Fido signatures. I think this is fewer than there should be, suggesting my queries aren’t quite right.

Running on a test PDF gave me back similar results (“Acrobat PDF 1.5”, “application/pdf”) to standard Fido (i.e. the Fido available from GitHub).

A docx file was identified as “ZIP Format” (“application/zip”). This is to be expected: Fido relies on its container-signature and format_extensions XML files to help detect such formats (e.g. “Microsoft Office Open XML – Word”), and I had deleted both files in order to test that Fido uses the UDFR signatures file.

Standard Fido identified a test JPG image as “JFIF 1.01” (“image/jpeg”), whereas UDFR Fido came back with “Raw JPEG” (“image/jpeg”).

So there are some slight discrepancies to look through, but vaguely the right answers for this quick test. Some adjustments are, more than likely, needed to the SPARQL queries, and adaptations are needed to handle container formats such as Office documents. It would also be a good idea to compare and validate against a known test corpus, so that gaps can be identified and results can easily be conveyed.
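A corpus comparison along those lines could start as simply as the sketch below. The file names and the compare_results helper are hypothetical, and for illustration the PDF result is treated as identical in both tools; the format names are the ones reported above.

```python
def compare_results(baseline, candidate):
    """Split files into those where both tools agree and those where they differ."""
    agree, differ = [], []
    for name in sorted(baseline):
        if baseline[name] == candidate.get(name):
            agree.append(name)
        else:
            differ.append((name, baseline[name], candidate.get(name)))
    return agree, differ


# Hypothetical file names; format names taken from the results above.
standard_fido = {"test.pdf": "Acrobat PDF 1.5", "test.jpg": "JFIF 1.01"}
udfr_fido = {"test.pdf": "Acrobat PDF 1.5", "test.jpg": "Raw JPEG"}

agree, differ = compare_results(standard_fido, udfr_fido)
for name, expected, actual in differ:
    print(f"{name}: standard Fido says {expected!r}, UDFR Fido says {actual!r}")
```

Run over a larger corpus, the list of differences would point directly at formats whose signatures are missing or mistranslated.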
