Sets tool: online code snippets for file format sets

At my workplace, we write a lot of small scripts to encode preservation workflows. These scripts pipeline simple actions like munging metadata, moving files about, and calling other tools such as Tika and ImageMagick. Often these actions are conditional on the format of the file being processed: for example, we only want to run Tika over the formats for which it can extract text. The snippet below executes an ImageMagick command but, because of the particular command-line incantations we’ve used, it only operates on PDFs:
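A minimal sketch of such a conditional step, written in Python purely for illustration. The helper names (`is_pdf`, `thumbnail_command`, `process`), the output file name and the exact `convert` arguments are assumptions made for this example, and the hand-maintained PUID list covers only PDF 1.0 to 1.7.

```python
import subprocess

# Hand-maintained PUIDs for PDF 1.0 to 1.7. Illustrative and incomplete:
# PRONOM has further entries for other PDF flavours.
PDF_PUIDS = {
    "fmt/14", "fmt/15", "fmt/16", "fmt/17",
    "fmt/18", "fmt/19", "fmt/20", "fmt/276",
}


def is_pdf(puid):
    """Return True if the PUID is one of the listed PDF formats."""
    return puid in PDF_PUIDS


def thumbnail_command(path, puid):
    """Return an ImageMagick command for PDFs, or None for other formats."""
    if not is_pdf(puid):
        return None
    # "[0]" asks ImageMagick for the first page only; the output name
    # is a placeholder chosen for this sketch.
    return ["convert", path + "[0]", path + ".png"]


def process(path, puid, run=subprocess.run):
    """Run the command if the format qualifies; report whether it ran."""
    cmd = thumbnail_command(path, puid)
    if cmd is None:
        return False
    run(cmd, check=True)
    return True
```

The `run` parameter exists only so the step can be exercised without ImageMagick installed; a real script would more likely call the tool directly.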


Defining those tests (e.g. IsPDF()) was a manual process and we’d often fall back on the file extension, even though we had a PRONOM ID (PUID), because testing for “.pdf” is a lot simpler than researching all the PDF formats in PRONOM, and then copying those PUIDs into a script, and then maintaining that code through PRONOM updates.

I was working on this problem the other week and realised that I could re-use siegfried’s format sets to automatically create conditional expressions. Format sets are a siegfried feature that lets you build custom signature files scoped to particular groups of formats, e.g. the command roy build -limit @pdf makes a signature file that only knows about PDF.

The sets tool is an online widget that uses siegfried sets to make code snippets for testing whether a particular PUID is one you’re interested in. Enjoy!

Sets Tool

By Richard, posted in Richard's Blog

18th Feb 2016  3:20 AM