FITS Blitz

FITS is a classic case of a great digital preservation tool that was developed with an initial injection of resource, but which its creator (Harvard University) has since struggled to maintain. But let me be very clear: Harvard deserves no blame for this situation. They created a tool that many in our community have found particularly useful, but they have been left to maintain it largely on their own.
 
Wouldn't it be great if different individuals and organisations in our community could all chip in to maintain and enhance the tool? Wrap new tools, upgrade outdated versions of existing tools, and so on? Well, many have started to do this, including some injections of effort from my own project, SPRUCE. What a lovely situation to be in, seeing the community come together to drive this tool forward…
 

Unfortunately we were perhaps a little naive about the effort and mechanics needed to make this happen as a genuine open source development. FITS is a complex beast, wrapping a good number of tools that extract a multitude of information about your files which is then normalised by FITS. What happens when you tweak one bit of code? Does the rest of the codebase still work as it should? Obviously you need to have confidence in a tool if it plays a critical role in your preservation infrastructure.
 
From the point of view of the SPRUCE Project, we'd like to see all the latest tweaks and enhancements to FITS brought together so that the practitioners we're supporting get a more effective tool. But we equally want future improvements to find their way into the codebase in a managed and dependable way, so that upgrading to a new FITS version doesn't involve lots of testing for every organisation using it.
 
So in partnership with Harvard and the Open Planets Foundation (with support from Creative Pragmatics), SPRUCE is supporting a two week project to get the technical infrastructure in place to make FITS genuinely maintainable by the community. "FITS Blitz" will merge the existing code branches and establish a comprehensive testing setup so that further code developments only find their way in when there is confidence that other bits of functionality haven't been damaged by the changes.
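
To make the idea concrete, here is a minimal sketch of the kind of check such a testing setup could apply before a change is accepted. It is an illustration only, not the FITS Blitz implementation: the function names, file names and results are all hypothetical, and it assumes the output of each FITS build has already been reduced to one reported MIME type per test file.

```python
def accuracy(ground_truth, results):
    """Fraction of corpus files whose reported format matches the ground truth."""
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for name, expected in ground_truth.items()
        if results.get(name) == expected
    )
    return correct / len(ground_truth)


def find_regressions(ground_truth, baseline, candidate):
    """Files the baseline build got right but the candidate build gets wrong."""
    regressions = []
    for name, expected in sorted(ground_truth.items()):
        was_right = baseline.get(name) == expected
        is_right = candidate.get(name) == expected
        if was_right and not is_right:
            regressions.append(name)
    return regressions


# Hypothetical test corpus: file name -> known-correct MIME type.
ground_truth = {
    "report.odt": "application/vnd.oasis.opendocument.text",
    "scan.tif": "image/tiff",
    "paper.pdf": "application/pdf",
}

# Hypothetical results from the current build and from a proposed change.
baseline = {
    "report.odt": "application/zip",
    "scan.tif": "image/tiff",
    "paper.pdf": "application/pdf",
}
candidate = {
    "report.odt": "application/vnd.oasis.opendocument.text",
    "scan.tif": "image/tiff",
    "paper.pdf": "application/octet-stream",
}

regressions = find_regressions(ground_truth, baseline, candidate)
if regressions:
    print("Change rejected, newly broken files:", regressions)
else:
    print("No regressions, accuracy:", accuracy(ground_truth, candidate))
```

In this made-up example the proposed change fixes the ODF file but breaks the PDF, so overall accuracy is unchanged and a score-only check would let it through. Tracking per-file regressions is what catches the damage.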
 
FITS Blitz commences next Monday. Please get in touch with me, or Carl Wilson from the Open Planets Foundation, if you'd like to find out more.


12 Comments

  1. willp-bl
    November 7, 2013 @ 1:09 pm CET

    I think there is some difference now between OpenOffice and LibreOffice. It might be worth speaking directly to Michael Meeks from LibreOffice (https://people.gnome.org/~michael/) about ODF validation, adherence to the ODF spec, etc.

  2. andy jackson
    November 7, 2013 @ 10:57 am CET

    That OASIS link says

    last edited 2009-08-12

    which I find a bit worrying. Are you sure it's still that simple? Also, did you report your findings to either of the existing projects?

    FWIW, I think we are strongest when we contribute our solid experience and our testing and development resources to higher-profile projects with larger user communities, like the Apache ones. In particular, I think the work Johan and Will have done with Apache PDFBox/Preflight is wonderful. I know there's a higher collaboration overhead, but as well as gaining a larger audience for our work, we also gain a more maintainable infrastructure for the software over time.

  3. lfaria
    November 6, 2013 @ 6:32 pm CET

    I would surely like to be updated with the results of your efforts!

    Also, it may be useful for you to check our FITS working branch updates:

    • Updated Droid to 6.1.3
    • Removed Java 7 lock from Droid tool
    • Added new Droid signature file with some new mimetypes (experimental)
    • Changed the way FITS consolidates results (so tools that only partially identify the file format can still be used to extract metadata)
    • Added FIDO to FITS
    • Created a new ODF validator and added it to FITS

    Also we are planning to do soon (next week):

    • Add corrupted Microsoft Office documents (doc, docx, ppt, pptx, xls, xlsx) to the fits-testing corpora
    • Add Apache POI to FITS to validate Microsoft Office documents (doc, docx, ppt, pptx, xls, xlsx)

  4. paul
    November 6, 2013 @ 6:03 pm CET

    Thanks for the details on your work to date on this, Luis! This gives us a really useful starting point that Carl has already been working with in his preparation for next week. Incidentally, we'll be doing daily Skype calls to review where we're up to, so let me know if you want to join in anytime.

  5. lfaria
    November 6, 2013 @ 5:56 pm CET

    KEEP SOLUTIONS, for a private project that is sponsoring some developments in RODA, is going to develop some new features in FITS. These features include improving the support in FITS for identification, feature extraction and validation of some defined file formats.
     
    But, before we could start, we wanted to assess how well FITS currently handles the file formats we want to deal with. To do so, we created a new open-source tool that takes a FITS installation and a curated corpus with well-defined ground truth, and outputs an XLS report detailing how well FITS behaved.
     
    The tool and some preliminary results are available at:
     
    We compared the harvard-lts official version with the openplanets master, the openplanets gary version (which updated Droid to version 6 but seems to have some configuration problem), and a new KEEPS version that added FIDO and fine-tuned gary's version with a better configuration and some bug fixes.
     
    We are still going to add some new file formats we need to support, like shape files and AutoCAD. Over the coming months, we will be investing in improving FITS for these file formats.
     
    The conclusion for now is that FITS gives pretty bad results for our target corpus, but that in just a couple of weeks we managed to greatly improve the FITS results. I know that the corpus is still very small and too focused on our own problem, but I think that, by improving the test corpus, we could make this tool very proficient at testing new developments in FITS and verifying whether they are actually improvements or not.
