Accepting the Format Registry Challenge

Last week I was at home in bed for four days with a bad back. I could not really type, but I could browse, so I did have an opportunity to catch up on the recent OPF blog entries. I was particularly interested in the discussions regarding a new Format Registry. I am an amateur Java programmer at best, but I enjoy getting my hands dirty now and again, so I decided to take up Adam’s challenge: how far could I get with a format registry application in one week? This blog entry is a report on my first two days of the project.

To start with, I agree that simple XML-based persistence should be fine. There are about 750 formats in Pronom today, and if this is increased by a factor of ten or even one hundred it should not cause significant scalability problems. Furthermore, this means that a persisted record is very easy to access and exchange. If we start with the Pronom schema, we already have a well-defined “standard” exchange format.

Thanks to Adam’s work with Fido, I could obtain a zip file of all Pronom XMLs from GitHub. The next step was to figure out how to unmarshal (create objects from XML) and marshal (create XML from objects) these XML files with Java. Given my personal skill set, Java was a predetermined technology choice.
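To make the unmarshal/marshal round trip concrete, here is a minimal sketch. It does not use the binding machinery I actually worked with; instead it uses the plain DOM parser and Transformer that ship with the JDK, so that it runs without any extra libraries. The Format class and the three-element XML layout are hypothetical, heavily simplified stand-ins, not the real Pronom record structure.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class FormatRoundTrip {

    /** Hypothetical, simplified stand-in for a format record. */
    static class Format {
        String id;
        String name;
    }

    /** Unmarshalling: build a Format object from an XML string. */
    static Format unmarshal(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            Format f = new Format();
            f.id = text(root, "FormatID");
            f.name = text(root, "FormatName");
            return f;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read format XML", e);
        }
    }

    /** Marshalling: write a Format object back out as an XML string. */
    static String marshal(Format f) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("FileFormat");
            doc.appendChild(root);
            append(doc, root, "FormatID", f.id);
            append(doc, root, "FormatName", f.name);
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not write format XML", e);
        }
    }

    /** Returns the text of the first descendant element with the given tag. */
    private static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    /** Appends a child element holding a single text value. */
    private static void append(Document doc, Element parent, String tag, String value) {
        Element e = doc.createElement(tag);
        e.setTextContent(value);
        parent.appendChild(e);
    }

    public static void main(String[] args) {
        String xml = "<FileFormat><FormatID>1</FormatID>"
                + "<FormatName>Example Format</FormatName></FileFormat>";
        Format f = unmarshal(xml);
        System.out.println(f.id + " / " + f.name);
        System.out.println(marshal(f));
    }
}
```

With generated binding classes the hand-written field copying above disappears, which is exactly why generating the classes from a schema is attractive.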

Fortunately, Java 6.0 includes all the machinery necessary for this, as is nicely explained in this article. One snag: I had no XML schema from which the Java class could be automatically generated. No matter; I found a neat online service here that will generate an XSD from an XML file. The nice thing is that the Java xjc compiler produces Bean-like classes, that is, classes with get and set methods for every attribute. A drawback that I later recognized was that every attribute was unmarshalled as a String; I would have to go back and modify the generated XSD to take advantage of XML Schema datatypes like Dates.
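To illustrate what a typed date buys over a plain String, here is a small sketch. When a schema declares an element as xs:date, xjc by default generates a property of type XMLGregorianCalendar rather than String. The example below builds such a value by hand with the JDK's DatatypeFactory. The class name, the helper method, and the date string are made up for illustration, and the sketch assumes the date is written in the xs:date form (yyyy-MM-dd), which may not match how the Pronom files actually write their dates.

```java
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.XMLGregorianCalendar;

public class TypedDateExample {

    /**
     * Parses an xs:date lexical value (for example "2010-11-05").
     * Throws IllegalArgumentException if the string is not a valid XML Schema date/time value.
     */
    static XMLGregorianCalendar parseXsDate(String lexical) {
        try {
            return DatatypeFactory.newInstance().newXMLGregorianCalendar(lexical);
        } catch (DatatypeConfigurationException e) {
            throw new IllegalStateException("No DatatypeFactory implementation available", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical value; as a String it could only be compared textually.
        XMLGregorianCalendar date = parseXsDate("2010-11-05");
        // As a typed value, the individual fields are directly accessible.
        System.out.println("year=" + date.getYear()
                + " month=" + date.getMonth()
                + " day=" + date.getDay());
    }
}
```

A typed value like this can be validated on load and sorted or filtered by date, none of which is reliable with raw Strings.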

At this stage I had a single Java class – PRONOMReport – and it was time to start thinking about how to build an application around it. My main criterion here was to find something lightweight and deployable to any application server. I had experience with JSF from Planets, but I was not very satisfied with it, remembering in particular the great difficulties we had in working with complex object lists.

I eventually settled on Apache Tapestry 5.1 because (a) it offered ready-to-use components for displaying and editing Java Beans, (b) it promised easy development (which turned out to be true: I can edit pages and see the effects immediately without redeploying or restarting the application server), and (c) hey, it is an Apache project, so it must be OK, right? After working with it for two days, I have also discovered that it has a really clean way of transferring objects between pages, which avoids the traditional (and a bit bizarre) servlet development approach of storing attributes in the shared HttpSession object.

I am developing in Eclipse, and using Tapestry required all of twelve jar files to be imported into my project. In a matter of hours I was able to develop a DataAccessObject (DAO) class that manages persistence (XML marshalling and unmarshalling) as well as queries, and a few user interface pages that allow one to search, display result lists, and view individual formats. This work was accelerated by the very useful Tapestry components Grid (for displaying customized lists of Beans) and BeanDisplay. Of course, the results are not pretty in any way, although they could almost certainly be improved quickly by adding a simple stylesheet to my single Layout class. I was able to improve things somewhat by making use of the @NonVisual annotation in the PRONOMReport class (for hiding unsightly internal IDs) and by adding a couple of get methods that allow Pronom IDs and file extensions to be displayed in a list (these attributes originally being hidden behind Lists in the generated class).
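The idea behind the DAO queries and the extra get methods can be sketched roughly as follows. This is an illustration under assumptions, not my actual code: FormatRecord, its fields, the getter names, and the sample values are all hypothetical, the record is far simpler than the generated PRONOMReport class, and the DAO here only holds records in memory rather than reading them from XML files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class FormatDao {

    /** Hypothetical simplified record; the generated class nests these values more deeply. */
    static class FormatRecord {
        private final String name;
        private final List<String> pronomIds;
        private final List<String> extensions;

        FormatRecord(String name, List<String> pronomIds, List<String> extensions) {
            this.name = name;
            this.pronomIds = pronomIds;
            this.extensions = extensions;
        }

        public String getName() {
            return name;
        }

        /** Convenience getter: flattens the ID list into one displayable string. */
        public String getPronomIdText() {
            return String.join(", ", pronomIds);
        }

        /** Convenience getter: flattens the extension list into one displayable string. */
        public String getExtensionText() {
            return String.join(", ", extensions);
        }
    }

    private final List<FormatRecord> records = new ArrayList<>();

    void add(FormatRecord record) {
        records.add(record);
    }

    /** Case-insensitive substring search over format names and file extensions. */
    List<FormatRecord> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<FormatRecord> hits = new ArrayList<>();
        for (FormatRecord r : records) {
            boolean nameMatches = r.getName().toLowerCase(Locale.ROOT).contains(needle);
            boolean extMatches = r.getExtensionText().toLowerCase(Locale.ROOT).contains(needle);
            if (nameMatches || extMatches) {
                hits.add(r);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        FormatDao dao = new FormatDao();
        // Made-up sample data, not real Pronom entries.
        dao.add(new FormatRecord("Example Image Format",
                Arrays.asList("fmt/000"), Arrays.asList("exi", "exim")));
        dao.add(new FormatRecord("Example Text Format",
                Arrays.asList("x-fmt/000"), Arrays.asList("ext")));
        for (FormatRecord r : dao.search("image")) {
            System.out.println(r.getName() + " | " + r.getPronomIdText()
                    + " | " + r.getExtensionText());
        }
    }
}
```

Because a results table built from Bean properties can only show simple values per column, a getter that returns a flat String is an easy way to surface data that otherwise sits inside nested lists.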

My next challenge is to actually edit records, a task which I have started but which is hindered by the complexity of the PRONOMReport model, which has a few layers of 1:n relationships between objects. My goal is to allow at least the addition of FileFormatIdentifiers, External and Internal Signatures (including the new Fido signature type based on Regular Expressions), and Related Formats. I will report on my progress next week!
