I have been working on some code to ensure accurate and consistent output for any file format analysis based on the DROID CSV export. The tool produces summary information about any DROID export, plus more detailed listings of content of interest, such as files with potentially problematic file names or duplicate content identified by MD5 hash value. I describe some of the rationale and ask for advice on where to go next.
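The duplicate-detection step described above can be sketched in a few lines. This is a minimal illustration, not the actual tool: it assumes DROID's default export column names (TYPE, MD5_HASH, FILE_PATH), which may differ depending on the export profile used.

```python
import csv
from collections import defaultdict

def find_duplicates(csv_path):
    """Group file rows in a DROID CSV export by MD5 hash and
    return only the hashes that appear more than once."""
    by_hash = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            # Only consider file rows that actually carry a hash;
            # folders and unreadable files have no MD5_HASH value.
            if row.get("TYPE") == "File" and row.get("MD5_HASH"):
                by_hash[row["MD5_HASH"]].append(row["FILE_PATH"])
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

The same pass over the export can also collect per-PUID counts for the summary report or flag problematic file names; grouping by hash first keeps the duplicate listing independent of file names and paths.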
Well over a year ago I wrote the "A Year of FITS" (http://www.staging.openpreservation.org/blogs/2013-01-09-year-fits) blog post describing how, over the course of 15 months, we characterised 400 million harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the resulting technical metadata, and concluded that FITS was not a good fit for heterogeneous data in such large amounts. Since that experiment, FITS has improved in several areas, including its code base and the organisation of its development, and it would be interesting to see how far it has evolved for big data. Still, FITS is not what I will be writing about today.
Today I'll present how we characterised more than 250 million web documents, not in 9 months, but during a weekend.
