A closer look at Pronom signatures

Last time, I discussed Pronom and Droid. We had a quick look at the compiled (nearly unreadable) pattern information that the Droid signature file holds and the uncompiled (but still hard to read) representation that is stored in Pronom.

In this post, I’ll run through the Pronom pattern language, show how to get access to the XML documents for all of the Pronom file formats, and look at some of the properties of the full set of existing patterns. Then we’ll be able to ask and answer a key question: Is there a simpler way to represent the Pronom patterns and match them against files?

It turns out that many file formats can be identified by looking at a few bytes at the very beginning, and sometimes a few at the end. For example, a PDF starts with %PDF and ends with %%EOF (almost). The first four bytes of a Zip file are the ASCII codes for the letters P and K, followed by bytes with the values 3 and 4. Some formats can be identified by looking for characters somewhere near the beginning or the end. For example, an Office Open XML file is a Zip file that has particular content; you can recognise one by the characters [Content_Types.xml] starting at the 30th byte. Some types are even harder to recognise reliably. For example, there are probably billions of ‘csv’ files, but it is hard to specify a rule that recognises them reliably. Other types may require looking for special sequences of characters somewhere in the middle of the file. We can think of these patterns as signatures for each file format.
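To make the idea of a signature concrete, here is a minimal Python sketch that checks only the two prefixes mentioned above. The table and the function name are purely illustrative; this is not how Droid or Pronom store or apply their signatures.

```python
# Illustrative table only: the two prefixes mentioned in the text.
MAGIC_PREFIXES = {
    "PDF": b"%PDF",
    "Zip": b"PK\x03\x04",
}


def guess_by_prefix(data):
    """Return the names of the formats whose prefix matches the start of data."""
    return [name for name, prefix in MAGIC_PREFIXES.items()
            if data.startswith(prefix)]


print(guess_by_prefix(b"%PDF-1.4\n"))          # a PDF-like header
print(guess_by_prefix(b"PK\x03\x04\x14\x00"))  # a Zip-like header
print(guess_by_prefix(b"name,age\n"))          # CSV text has no such prefix
```

A prefix check like this cannot tell an Office Open XML file from any other Zip file, which is exactly why richer patterns are needed.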

In order to identify the formats of files, Pronom provides signatures expressed in a special pattern language. I was interested in the set of signatures that Pronom holds and that Droid uses. My problem was that each one is defined on a separate web page on the Pronom site – and there is no easy-to-access list of all of the pages.

Just using the web site didn’t help me very much. Pages for the formats are accessible via search and result in URLs such as http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=613. This is not very satisfactory because the id is the Pronom internal ID – it is not the PUID (persistent unique identifier) that provides a global identifier for the format – and there is no list of the Pronom internal IDs. The visible links to formats have this same form.

When I first used Droid, however, I noticed two things. First, I observed that there was a preference for a PUID URL pattern – http://www.nationalarchives.gov.uk/pronom/%s. Second, in the Pronom report, there was a link from each file format PUID to the web site, such as http://www.nationalarchives.gov.uk/pronom/fmt/11. This is helpful for Droid users, because they can easily find out more about the format. Now I felt that I was closer to my goal. I could use the Droid signature file to extract a list of all of the PUIDS, and use the URL pattern to identify and fetch all of the web pages. Of course, I didn’t really want the web pages, I wanted the XML so that I could analyse and manipulate the information more easily. At this point, I made a lucky guess! I typed ‘.xml’ at the end of the URL and, like magic, the XML file appeared. The magic incantation was: http://www.nationalarchives.gov.uk/pronom/fmt/11.xml. The development team at TNA is really to be praised for providing this style of access to the XML directly.

At this point I was ready to get started, and with a couple of shell script incantations, I had retrieved all 731 XML documents. The following shell script is all that I needed (I’ll explain it below).

    #!/bin/bash
    # Fetch the Pronom XML report for every PUID listed in the Droid signature file.
    base="http://www.nationalarchives.gov.uk/pronom/"
    re='PUID="(fmt|x-fmt)/([0-9]+)"'

    mkdir -p xml    # wget -O does not create the output directory

    while read -r line; do
        if [[ $line =~ $re ]]; then
            puid=${BASH_REMATCH[1]}/${BASH_REMATCH[2]}
            fname=xml/puid.${BASH_REMATCH[1]}.${BASH_REMATCH[2]}.xml
            wget -T5 -O "$fname" "$base$puid.xml"
        fi
    done < droid-signature_v39.xml

Explanation:

  1. ‘base’ is the URL pattern from the Droid options.
  2. ‘re’ is a simple regular expression that will match all of the PUIDs in the signature file. They look like PUID="fmt/11" or PUID="x-fmt/91". Each pair of parentheses saves a portion of the match, so that we can access the saved parts using the special Bash shell variable BASH_REMATCH.
  3. The loop reads each line in turn.
  4. The [[ $line =~ $re ]] matches the line against the regular expression (the spaces around =~ are required). I was really pleased to discover how easy this was to do in the Bash shell.
  5. With the match, we can reconstruct the PUID from its two parts.
  6. And create a filename for the result, for example xml/puid.fmt.11.xml.
  7. The real work is done by ‘wget’. This is an amazing command-line tool for working with web content. It gets the contents of our URL and outputs it to $fname. The ‘-T5’ specifies a timeout of 5 seconds. If there is a problem fetching one URL, we time out and try the next one.
  8. The funny ‘done < droid-signature_v39.xml’ simply sends the contents of the file as input to the loop statement, which ends with ‘done’.
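For readers who prefer Python to Bash, the extraction step can be sketched as follows. It only builds the list of URLs and file names and leaves the actual download to a tool such as wget. The function name and the sample input are made up for illustration; unlike the shell loop, this version picks up every PUID on a line rather than only the first.

```python
import re

BASE = "http://www.nationalarchives.gov.uk/pronom/"
PUID_RE = re.compile(r'PUID="(fmt|x-fmt)/([0-9]+)"')


def puid_downloads(signature_text):
    """Yield (url, filename) pairs for every PUID found in the given text."""
    for kind, number in PUID_RE.findall(signature_text):
        puid = f"{kind}/{number}"
        yield BASE + puid + ".xml", f"xml/puid.{kind}.{number}.xml"


# Hypothetical two-line excerpt; a real run would read droid-signature_v39.xml.
sample = 'PUID="fmt/11"\nPUID="x-fmt/91"\n'
for url, filename in puid_downloads(sample):
    print(url, "->", filename)
```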

After running this script, I had 731 XML files and was ready to take a closer look at the Pronom signatures. I had several questions: (1) How many of the formats had signature patterns? (2) How many signature patterns were there altogether? (3) How many different patterns were used? (4) Which features of the Pronom pattern language were actually used? (5) Would it be possible to automatically convert the Pronom patterns into a more widely used pattern language, such as the regular expression language used in Python, Java, and other common programming languages?

I started out using basic shell script tools like grep, moved on to XML tools like the amazing xmlstarlet, and ended up parsing the XML in Python as I got deeper in my explorations. To summarize the results:

  1. There are 208 Formats with InternalSignatures.
  2. There are 244 InternalSignatures. Some formats have more than one signature – that is, there is more than one approach to identifying instances of the format.
  3. There are 351 ByteSequences. There are a fair number of InternalSignatures with more than one ByteSequence – what Pronom calls a pattern. For example, some will have a pattern to match the beginning of a file, and another to match the end.
  4. We’ll come back to this later, but none of the unusual features of the pattern language were actually used.
  5. From my initial review, it seemed entirely plausible to convert the Pronom patterns into much more widely used regular expression patterns.
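Counts like these can be gathered with a few lines of Python. The sketch below is not the script I used; it simply counts elements by name, ignoring any XML namespace. Only the element names InternalSignature and ByteSequence come from the text above. The nesting in the sample document is an assumption made for illustration and may not match the real Pronom reports.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def count_elements(xml_text, names=("InternalSignature", "ByteSequence")):
    """Count how often each named element occurs anywhere in the document."""
    counts = dict.fromkeys(names, 0)
    for element in ET.fromstring(xml_text).iter():
        name = local_name(element.tag)
        if name in counts:
            counts[name] += 1
    return counts


# Hypothetical, heavily simplified report used only to exercise the function.
sample = """
<Report>
  <InternalSignature>
    <ByteSequence/>
    <ByteSequence/>
  </InternalSignature>
</Report>
"""
print(count_elements(sample))
```

Summing the result over all of the downloaded files would give the totals for the whole collection.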

The following passage assumes basic knowledge of regular expressions. There are many good on-line tutorials. Regular expressions are an important tool in any digital preservation expert’s toolkit. Adrian Brown’s document (http://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf) clearly defines the features of the Pronom pattern language. It turns out that they can all be translated easily into a typical regular expression language (I use the Python regular expression syntax, but the subset used here is equivalent to the Posix standard). In fact, they define a subset of the regular expression language. The following translation table shows the Pronom feature, the regular expression feature, and some comments.

Pronom                  Regex        Comment
------                  -----        -------
BOF                     \A           \A matches the start of a buffer.
EOF                     \Z           \Z matches the end of a buffer.
??                      .            Match one byte. (In Python, use the DOTALL flag so that ‘.’ also matches a newline byte.)
*                       .*           Match zero or more bytes.
{j}                     .{j}         Match exactly j bytes.
{j-k}                   .{j,k}       Match from j up to k bytes.
{j-*}                   .{j,}        Match at least j bytes.
[a:b]                   [a-b]        Match one byte between a and b inclusive. Pronom represents each byte as two hex digits.
(ab|cde)                (?:ab|cde)   Match the byte sequence ab or the sequence cde. The ‘?:’ means not to save the group specially.
Offset, MaxOffset       .{j,k}       A pattern may have an Offset or MaxOffset. One or both may be provided.
Offset (no MaxOffset)   .{j}
MaxOffset (no Offset)   .{0,k}

The table covers all of the Pronom pattern language features that are actually used. There are two other features that are defined but unused: [!a], which could be translated as [^a], and [!abc], which could be implemented as (?!abc). In addition, there is an indirect offset feature that is not used.
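The translation table is mechanical enough to sketch in code. The following Python is a minimal illustration under stated assumptions, not the converter used for this post: the function names, the ‘reference’ parameter and the placement of the offset gap before \Z for EOF patterns are my own choices. ‘??’ is translated to a single ‘.’ because it stands for exactly one byte, and the unused features ([!a] and indirect offsets) are deliberately rejected.

```python
import re

# One token of the Pronom byte-sequence syntax described in the table above.
TOKEN = re.compile(
    r"""
      (?P<hex>[0-9A-Fa-f]{2})                          # literal byte, e.g. 4D
    | (?P<any>\?\?)                                    # ??  any single byte
    | (?P<star>\*)                                     # *   any number of bytes
    | \{(?P<lo>\d+)(?:-(?P<hi>\d+|\*))?\}              # {j}, {j-k}, {j-*}
    | \[(?P<a>[0-9A-Fa-f]{2}):(?P<b>[0-9A-Fa-f]{2})\]  # [a:b] byte range
    | (?P<punct>[(|)])                                 # (ab|cde) alternation
    """,
    re.VERBOSE,
)


def literal(pair):
    """Regex text for one byte given as two hex digits."""
    value = int(pair, 16)
    if 0x21 <= value <= 0x7E:
        return re.escape(chr(value))  # printable ASCII, escaped if special
    return "\\x%02x" % value          # everything else as \xNN


def pronom_to_regex(pattern):
    """Translate a Pronom byte sequence into Python regex text (ASCII only)."""
    pattern = "".join(pattern.split())  # tolerate wrapped or spaced input
    out, pos = [], 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at {pos}: {pattern[pos:pos + 10]!r}")
        if m.group("hex"):
            out.append(literal(m.group("hex")))
        elif m.group("any"):
            out.append(".")
        elif m.group("star"):
            out.append(".*")
        elif m.group("lo") is not None:
            lo, hi = m.group("lo"), m.group("hi")
            if hi is None:
                out.append(".{%s}" % lo)
            elif hi == "*":
                out.append(".{%s,}" % lo)
            else:
                out.append(".{%s,%s}" % (lo, hi))
        elif m.group("a"):
            out.append("[\\x%s-\\x%s]" % (m.group("a").lower(), m.group("b").lower()))
        else:
            out.append("(?:" if m.group("punct") == "(" else m.group("punct"))
        pos = m.end()
    return "".join(out)


def anchored(body, reference="BOF", offset=None, max_offset=None):
    """Add \\A or \\Z plus the Offset/MaxOffset gap (EOF placement is an assumption)."""
    if offset is None and max_offset is None:
        gap = ""
    elif max_offset is None:
        gap = ".{%d}" % offset
    else:
        gap = ".{%d,%d}" % (offset or 0, max_offset)
    if reference == "BOF":
        return r"\A" + gap + body
    if reference == "EOF":
        return body + gap + r"\Z"
    return body


regex = anchored(pronom_to_regex("(43|46)575301"), reference="BOF")
print(regex)
print(bool(re.compile(regex.encode("ascii"), re.DOTALL).match(b"CWS\x01rest")))
```

The output is plain ASCII text, so it can be encoded and compiled as a bytes pattern. The DOTALL flag matters: without it, ‘.’ would refuse to match a newline byte, which is just another byte in a binary file.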

There is one additional wrinkle to consider. Pronom uses byte codes, whereas the regular expression language uses characters – while providing a way to specify byte codes. A byte code can be represented in hexadecimal with a \x prefix. For example, the hex code 20 can be represented as either a space or \x20. There are several characters that have special meaning in the regular expression language. If they are preceded by a backslash, then they lose their special meaning. For example, ‘.’ matches any character, so to match the dot in ‘1.0’, the regular expression is ‘1\.0’. Examples include:

Pronom                                Regex
------                                -----
255044462D312E30                      \%PDF-1\.0
2525454F46(0A|0D|0D0A)                \%\%EOF(?:\n|\r|\r\n)
4D4D002A                              MM\x00\*
(43|46)575301                         (?:C|F)WS\x01
504B0304{26}6D696D657479706561        PK\x03\x04.{26}mimetypeapplication\/vnd\.sun\.xml\.base.*office\:version\=\"1\.0
70706C69636174696F6E2F766E642E
73756E2E786D6C2E62617365*6F66
666963653A76657273696F6E3D22312E30

This table shows a few examples of the Pronom byte codes and regular expression patterns. All of these patterns are fairly hard to read, but I think that most humans will find the regular expressions easier to read than the byte sequences – even with the backslashes.
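A quick way to convince yourself that a hex signature and its regular expression describe the same bytes is to decode the hex and match it. The sketch below does this for the literal-only examples; the helper name is made up, and the check only makes sense for signatures without wildcards.

```python
import re

# (hex signature, regex) pairs copied from the literal-only rows of the table.
PAIRS = [
    ("255044462D312E30", rb"\%PDF-1\.0"),
    ("4D4D002A", rb"MM\x00\*"),
]


def same_bytes(hex_signature, regex):
    """True if the regex matches exactly the bytes spelled out in hex."""
    data = bytes.fromhex(hex_signature)
    return re.fullmatch(regex, data, re.DOTALL) is not None


for hex_signature, regex in PAIRS:
    print(hex_signature, bytes.fromhex(hex_signature), same_bytes(hex_signature, regex))
```

For a signature with alternatives, such as (43|46)575301, each alternative can be spelled out in hex and checked in the same way.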

There are two surprises in this little set of examples. One is too specific – it will fail to match perfectly good instances – and another is too general – it will match non-instances.

I noticed the first when Droid failed to recognise some of my PDFs. Upon inspection, I found that it was because they ended with \n\r, rather than \r\n. Many of the Pronom patterns for PDF variants or related formats actually end with the sequence (0A|0D|0D0A|0A0D) – which covers all of the needed cases.
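The difference is easy to reproduce. In this sketch both patterns are anchored at the end of the buffer with \Z, which is my assumption about how the trailer sequence is applied; the sample trailer is invented.

```python
import re

# Trailer as in the example table: LF, CR or CRLF after %%EOF.
NARROW = re.compile(rb"%%EOF(?:\n|\r|\r\n)\Z")
# Trailer as in the other PDF variants: LF CR is accepted as well.
WIDE = re.compile(rb"%%EOF(?:\n|\r|\r\n|\n\r)\Z")

tail = b"startxref\n1234\n%%EOF\n\r"  # invented trailer ending in LF CR

print(NARROW.search(tail) is not None)  # the narrow pattern misses this file
print(WIDE.search(tail) is not None)    # the wider pattern accepts it
```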

The ODF signature has an obvious flaw. The ‘*’ between ‘base’ and ‘office’ means that there can be any number of intervening bytes. This means that the string ‘office:version="1.0’ could actually be present in any Zip item name or even in the content portion of any Zip item that was stored uncompressed. On top of that, it can make recognising ODF files much less efficient – instead of scanning the first 50 bytes, one may have to look all the way through to the end of the file.
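The effect of the unbounded ‘*’ can be seen with a fabricated buffer. This is only a sketch: the buffer is not a valid Zip file, the anchoring at the start of the buffer is my assumption, and the helper name is made up.

```python
import re

# The regular expression from the table, anchored at the start of the buffer.
LOOSE = re.compile(
    rb'\APK\x03\x04.{26}mimetypeapplication/vnd\.sun\.xml\.base'
    rb'.*office:version="1\.0',
    re.DOTALL,
)


def fake_container(gap):
    """Build a Zip-like buffer with `gap` filler bytes before the version string."""
    return (b"PK\x03\x04" + b"\x00" * 26
            + b"mimetypeapplication/vnd.sun.xml.base"
            + b"x" * gap
            + b'office:version="1.0')


print(bool(LOOSE.match(fake_container(10))))         # version string close by
print(bool(LOOSE.match(fake_container(1_000_000))))  # version string a megabyte away
```

Both buffers are accepted, so the pattern places no limit on where the version string may appear. A buffer that lacks the string altogether is rejected only after the whole of it has been scanned.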

These are not huge problems, but they are associated with rather ordinary formats. How can we be sure that a signature is both accurate and efficient?

I believe that the choice of pattern language is important. Preservation specialists must be able to determine if a pattern is wrong and correct it. I believe that this is nearly impossible with the current Pronom byte signatures. The regular expressions are still hard to read, but there are many more people who can follow and validate them – and many more tools that can be used to work with them.

At this point in my exploration, I was getting very curious. I had learned that it was easy to fetch all of the Pronom XML files. Moreover, I had concluded that it would be very easy to convert the Pronom byte signatures into a well known regular expression language – and that this made the signatures much easier to read and understand. Now I wondered: would it really be easy to use the regular expressions to identify the formats of digital objects – and what would the performance of such an approach be like?

In the next post, I’ll fill you in on some of the missing details, such as the relationships between formats and signatures. I’ll describe a simple implementation of a file format identification tool using regular expressions, and I’ll show some startling performance results.

By Adam Farquhar, posted in Adam Farquhar's Blog

27th Oct 2010  10:03 AM
