Proposal: GitHub to enable a federated approach to distributing and utilizing custom DROID signatures
This blog discusses what we have available in our toolkit for contributing more signatures to PRONOM for the benefit of the digital preservation community. It also discusses the potential issues we need to work around in the short time we have between controlled PRONOM releases. The blog outlines an idea for a temporary, federated approach utilizing GitHub, Siegfried and Roy, and the droid-list Google Group for working with custom DROID signatures, inspired by conversations with colleagues at The National Library of Australia that might deal with those shortcomings.
It remains to be seen what alternate methods will appear for registering file formats and importantly, identifying them for the requirements of digital preservation, but until such times as Tessella’s Linked Data Registry, or the NSLA Digital Preservation Technical Registry broach the public domain and begin to be accepted and used by the wider public, The National Archives: PRONOM, and its sister tool DROID remain de-facto.
One of the issues that the other initiatives have been trying to solve is the centralized nature of PRONOM and a perceived difficulty in influencing the content of the registry.
Until we have a more federated approach, personal experience demonstrates to me that the PRONOM contribution model described here can work in the short-term. Some of the challenge seems to lay in the discovery of specifications, or sample files; another part of the challenge is creating signatures.
We have guidance and mechanisms to aid in creating signatures; I created a PRONOM Signature Development Utility tool here (hosted by The National Archives, UK), and mirrored here (my personal website), with code, open source here (PHP+HTML), which outputs a single traditional DROID signature file that can be uploaded to the tool to be used.
Matt Palmer, former colleague at The National Archives, UK has recently developed a Java tool which does a similar task, here; each may satisfy different users’ requirements or workflows.
For specifications, then Google Search has always been a good first stop.
Format examples are a little more difficult, but, most users approaching the task of format identification should have a fistful of samples in their possession. Permissive institutional approaches to sharing those samples and trusting the confidentiality of institutions such as The National Archives in looking into format signatures helps.
A sample email one could write might look like this:
Please find attached a set of sample files for testing, please note the following caveats:
- These are [record classification here] deposited under [relevant legal act here]
- The access classification put on these files by the depositor is [access classification here]
- However, they are not yet available to the public
- We are yet to ingest them into the digital [repository/archive/etc.]
- They are provided purely for research and testing purposes for identification of the error involved.
- They are provided for access, only by those helping to identify the error involved.
- Once the testing is complete we ask you to delete the files
- If further testing is required in future, these files should be available again through our [archival/library] catalogue, once fully accessioned
A favorite of mine is an extension based search for samples on Google e.g.
Though it can be harder to track down more specific and more esoteric formats in this manner.
A technique shown to me recently by Andy Jackson of the UK Web Archive is to use magic numbers to search for samples stored in their collection:
This example, though an ever-so-slightly long-winded query string simply boils search of the UK Web Archive down to files with the OLE2 magic number 0xd0cf11e and extension .ppp which should help me to discover Serif PagePlus files for testing. Positives, or false positives all have their benefits when we’re testing the accuracy of any signature we’re developing.
Please note: This is very much a prototype service on offer from the UK Web Archive and it may go down at any point.
As mentioned in the introduction to this blog – alternate approaches seem to look to solve a problem with a centralized approach to maintaining a registry for format identification. Perhaps we do need a federated approach to a registry; but as a stopgap to that, a tool like Roy, and his partner tool Siegfried, created by Richard Lehane of State Records NSW, along with GitHub may help us to build a model providing a temporary federated approach to the distribution of file format signatures.
A distributed approach to publishing our (the community’s) own custom signatures may be as close an approach as we can get to a decentralized format registry model right now. Users and institutions with a GitHub can store their signatures on there, perhaps listing them on the droid-list on Google Groups. They can then be accessed by all users of Siegfried, and concatenated using Roy into a single signature file, or sets of signature file as appropriate.
Siegfried provides part of the provenance reporting showing how it reached a match on an object:
siegfried : 1.2.0
scandate : 2015-08-11T18:56:23+12:00
signature : pronom.sig
created : 2015-08-11T18:56:21+12:00
- name : 'pronom'
details : 'DROID_SignatureFile_V82.xml; container-signature-20150327.xml; built without reports; extensions: air-application-1.0-1.5-2.0-signature-file.xml'
filename : 'AdobeAirExample-v1.5.air'
filesize : 1921900
modified : 2015-08-11T18:27:44+12:00
- id : pronom
puid : archives-nz-dev-signature/2
format : 'adobe.air'
version : '1.5'
mime : 'application/vnd.adobe.air-application-installer-package+zip'
basis : 'extension match; container name mimetype with byte match at 0, 59; name META-INF/AIR/application.xml with byte match at 69, 47'
The extension described here is to add a set of identification methods for the Adobe Air format type. The provenance we are interested in is that we’ve used the extension signature file air-application-1.0-1.5-2.0-signature-file.xml’ on top of the two standard DROID signature files.
N.B. As a result of writing this blog, Siegfried has already been updated to include the full set of extension files used in the details field and this blog will be updated as appropriate when I can build and test the tool.
The command for extending a signature file in Roy is:
roy build -noreports -extend air-application-1.0-1.5-2.0-signature-file.xml -extendc air-application-container-1.0-1.5-2.0-signature-20150623.xml
With more information available on the tool’s wiki
Both -extend and -extendc accept a list of comma separated values and so it is possible for me to extend this infinitely given a number of different custom signature files being made available to me.
Imagine if the identifiers list began to look a little something like this?
- name : 'pronom'
details : 'DROID_SignatureFile_V82.xml; container-signature-20150327.xml; built without reports; extensions: archives-nz-air-application-1.0-1.5-2.0-signature-file.xml, archivematica-forensic-disk-formats.xml, avpreserve-audiovisual-formats.xml, open-preserve-pdf-identifiers.xml, hatii-research-data-formats.xml'
And so forth.
DROID too could perhaps be improved to accept multiple custom signature files to be used concurrently with the primary DROID signature file, and including an extra ‘provenance’ column in the output CSV.
A workaround to using custom signature files is to use DROID’s command line CSV output:
droid -Nr “c:\working\air-files” -Ns air-application-1.0-1.5-2.0-signature-file.xml -Nc air-application-container-1.0-1.5-2.0-signature-20150623.xml
Though a more extensible and user-friendly solution from DROID would also benefit the community.
Before we can start making full use of Siegfried, and then maybe DROID, and before we can see the appearance of a suitable mechanism for a federated approach to signature distribution we need to start publishing our signatures files.
Archivematica for example uses custom FIDO signatures
Though up until now, I fell into the trap of listing them on Google Groups only.
The next step-change will be to have these listed centrally and described appropriately using good style guidelines, especially the PUID, for example, adopting Archivematica style:
And in the filename examples above, I might also include the correct institution, e.g.
From there they can be selected, downloaded, remixed and compiled using the Roy tool (N.B. potential for a web service?) and we, the community, can begin make use of a host of different identifications using DROID signature syntax that might otherwise remain in digital filing cabinets until such a time when a more appropriate registry appears lightening the burden of testing, and provenance on The National Archives, UK.
The comments that Mark Pearson of The National Library of Australia made that resonated with me were something that I know a large number of the digital preservation community already take advantage of: GitHub already provides a great infrastructure for distribution and Git itself a tool proven in the Linux community for distributed projects; as a version control system it can provide much of the provenance information we need, though some of this may also need to be accounted for in how we use custom Signature files.
Can it work as a temporary federated approach?
I think so.
My custom files though not many, and sample layout are here: https://github.com/exponential-decay/droid-signature-files
Where are yours?