New characterisation developments from the SPRUCE hackathon

A day after running our Characterisation Hackathon (and helping out with a lively DPC event on PDF/A-3), I'm still feeling exhausted. This was a developer-only event and not as taxing on my facilitation skills as our usual mashups, but it's still been an action-packed few days. All this moaning is of course somewhat irrelevant, as these events are all about the participants, and it was certainly those guys who did the hard work.

Andy Jackson shows his visualisations of tool sensitivity to bit flips throughout a file

Facing the challenge of taking on digital preservation characterisation and making it better, we began with some scene-setting lightning talks from our hackers. Andy Jackson challenged us to take on some familiar problems, which I paraphrased at the time on Twitter as "Too many characterisation tools, too complex, don't meet users needs." As usual, Andy wasn't pulling any punches. Despite loads of great development work on characterisation tools, we still have much to do. A key aim for the event was to get some of the key developers working together more effectively and taking on some of the problems Andy hinted at.

Our starting point was a scratch space that was chock full of great ideas to pursue. It was collaboratively created by our event participants over the previous weeks. We boiled this wealth of information down to 6 somewhat crude themes and then voted on them. For the rest of the hackathon we worked in small groups to take on these challenges. Periods of development time were interspersed with discussion, reporting back, demos and periodic eating of some fabulous home-baked cakes.

By 1600 on the second and final day of the hackathon we had some great results in four key areas. Individual blogging from our attendees will provide a lot more detail on the work undertaken, but for now, here's a summary from me:

Just solving the PDF problem

Despite there being a range of tools that tell us useful stuff about PDF files, there isn't a simple, focused tool that identifies clearly understood preservation risks in PDFs and presents the results in a form that a layperson can understand and act on. This was the challenge the first group took on. With the presence of some great devs, as well as PDF gurus such as Johan van der Knijff and Sheila Morrissey, I had high hopes. Those guys did not let us down. Starting with Apache Preflight and its ability to validate a PDF against the PDF/A standard, this group worked primarily with the output to meet the use case outlined above. The resulting tool comes with a sensible default configuration that alerts the user to concrete risks, but this can easily be tweaked by more advanced users.

The potential of the results is considerable. As an ePrints plugin for example (and perhaps also used earlier in the deposit lifecycle), this could revolutionise preservation in our institutional repositories.
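To make the "sensible defaults that can be tweaked" idea concrete, here is a minimal sketch in Python. It is not the tool the group built: the finding names and messages are hypothetical, invented for illustration, and are not Apache Preflight's actual error codes. It only shows how a small policy table could turn validator findings into plain-language warnings.

```python
# Hypothetical finding names and messages, for illustration only.
# They are NOT Preflight's real error codes.
DEFAULT_POLICY = {
    "encryption": "The file is encrypted, so future access may depend on a password.",
    "font_not_embedded": "A font is not embedded, so text may render differently elsewhere.",
    "embedded_file": "The file contains attachments that need their own assessment.",
}


def report_risks(findings, policy=DEFAULT_POLICY):
    """Return plain-language warnings for the findings that the policy flags.

    Findings the policy does not mention are ignored, and each warning is
    reported once even if the validator raised it many times.
    """
    warnings = []
    for finding in findings:
        message = policy.get(finding)
        if message is not None and message not in warnings:
            warnings.append(message)
    return warnings


# Made-up validator output for one PDF.
findings = ["font_not_embedded", "colour_space", "font_not_embedded"]
print(report_risks(findings))

# A more advanced user extends the defaults instead of editing the code.
strict_policy = dict(DEFAULT_POLICY, colour_space="A colour space is device dependent.")
print(report_risks(findings, strict_policy))
```

Keeping the policy as plain data is the design choice that matters here: the default stays short enough for a non-specialist, while anyone with stricter requirements can add or remove entries.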

Consolidating file format identification

The second hackathon theme was file format identification, and more specifically, the signature magic that ID tools match with bytes from the headers (and sometimes footers) in target files. DROID, Tika and File all have their own magic, stored in different formats. "Team File Format" looked at mapping DROID and Tika formats together and seeing how we could get value out of amalgamating all this disparate knowledge. The results were fascinating but require some further exploration, as the picture is a complex one. There are significant numbers of formats that have magic in DROID but not in Tika, magic in Tika but not in DROID, and magic in both (while also noting that DROID magic is more specific to the version level than Tika magic). All of these groups offer potential for follow-up, conversion and exploitation. Even where there is magic in both, it is of course not always the same.
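The three groups described above amount to a set comparison. The following Python sketch illustrates that, assuming the formats known to each tool have already been mapped onto a shared key such as a MIME type (which is the hard part, and not shown). The function name and the sample data are made up for illustration and are not the team's actual code or results.

```python
def compare_coverage(droid_formats, tika_formats):
    """Split formats into those with magic in DROID only, Tika only, or both.

    Both arguments are iterables of format keys that are assumed to be
    comparable across the two tools already.
    """
    droid, tika = set(droid_formats), set(tika_formats)
    return {
        "droid_only": sorted(droid - tika),
        "tika_only": sorted(tika - droid),
        "both": sorted(droid & tika),
    }


# Made-up sample keys, not real coverage figures for either tool.
droid_sample = {"application/pdf", "image/tiff", "application/x-example-a"}
tika_sample = {"application/pdf", "image/tiff", "application/x-example-b"}

coverage = compare_coverage(droid_sample, tika_sample)
for group, formats in coverage.items():
    print(group, formats)
```

A comparison like this only says where magic exists. It says nothing about whether the two tools' signatures for a shared format agree, and it flattens the version-level detail that DROID carries.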

This work goes a long way to addressing the "too many tools" argument as levelled at file format ID. More needs to be done, but David Clipsham (he works full time on file format magic, so that makes him a digital preservation magician, right?) now has a great resource for compiling more DROID signatures and quality checking existing signatures. Additional tools for creating and testing new signatures have also progressed, and there will be more about this in a blog post from Peter May.

An interesting discussion for me emerged when David was telling us about the precision possible (or not always possible) when the constant magic bytes in a format are quite short. If they are too short, a signature can produce false positives by matching files that are actually in other formats. Creating good signatures requires some art, not just science. Format really is a fuzzy thing: a fundamental digital preservation concept that's not always easy to get your head round.
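A small Python sketch shows why short magic is risky. It is a deliberately naive matcher, not how DROID, Tika or File work internally. The two-byte `BM` marker that starts BMP files also starts plenty of ordinary text, while the five-byte `%PDF-` marker is far less likely to turn up by accident. The probability helper assumes uniformly random bytes, which real files are not, so treat it as a rough intuition only.

```python
def matches(data, magic, offset=0):
    """Return True if the magic bytes appear in data at the given offset."""
    return data[offset:offset + len(magic)] == magic


def chance_match_probability(magic):
    """Probability that uniformly random bytes equal the magic sequence.

    Assumes every byte value is equally likely, a simplification.
    """
    return 256.0 ** -len(magic)


SHORT_MAGIC = b"BM"      # BMP files begin with these two bytes
LONGER_MAGIC = b"%PDF-"  # PDF files begin with this five-byte marker

# A plain text file that happens to start with the same two bytes as a BMP.
text_file = b"BMW sales report, Q3\n"

print(matches(text_file, SHORT_MAGIC))    # the short signature fires on plain text
print(matches(text_file, LONGER_MAGIC))   # the longer signature does not
print(chance_match_probability(SHORT_MAGIC))
print(chance_match_probability(LONGER_MAGIC))
```

Real signatures reduce this risk by combining several byte sequences, offsets and sometimes footers, which is where the art comes in.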

Adding Apache Tika to FITS and C3PO

Apache Tika has become very popular with digital preservationists at our recent mashup events, and as well as file format identification it offers extraction of a host of properties of potential use in long term preservation. Incorporating it into some of the best meta tools would again help to answer the complexity and "too many tools" arguments Andy mentioned at the start. Our two remaining groups therefore decided to add Tika support to the analysis and visualisation tool C3PO and to the combined characterisation tool FITS. Of course we had the authors of FITS, C3PO and JHOVE on hand to spearhead this work, with support from additional expert hackers. The results for a day and a half of dev were impressive. Petar Petrov (of C3PO fame) was able to demo a C3PO analysis of Tika output. A new release is expected shortly. The FITS group had a considerably harder challenge, but the lion's share of the work has been completed and Spencer McEwen (FITS developer from Harvard) demo'd FITS in action, capturing properties for a small number of formats. The wealth of properties, combined with FITS' role in making sense of those properties and merging them with output from other wrapped tools, made this a big challenge. Spencer is hoping to do a quick release of another minor development (addition of execution timings for each tool in the FITS reporting), and a more complete release with full Tika support will come later.

Much discussion was also had about how these developments and tools could progress in a more community driven way. OPF aims to support and coordinate tool development and much useful work was identified for the next few months, relating in particular to FITS, C3PO, JHOVE and JHOVE2.

Although some of our participants had met previously and some have had frequent exchanges on Twitter, they had never all met up in the same room and coded together. Clearly some strong bonds have been forged, so I'm hoping that the seeds have been sown for lots more collaboration. A couple of SPRUCE Awards should help the sustainability of the event results, although several participants could already see where results would be useful for future activities they have on the horizon. An encouraging sign.

At the end of our events we ask the participants to fill out a quick anonymous feedback form. In our last question, we asked if there was appetite for other events like this, perhaps on an annual basis. Everyone said yes. Some used the words "definitely", "absolutely" and "pretty please". The latter was in capitals. Several used the word "yes" 3 or more times. A third of the respondents suggested that repeating on an annual basis might not be frequent enough. Wow! Clearly there is a lot of appetite for making this happen (and the results speak for themselves), although of course, shrinking travel budgets are going to make this harder. One suggestion at the end of the event was to tag some hack events onto popular conferences in our community. Any takers for iPRES or PASIG?
