Digital Preservation Stage Boss One: The Performance of File Format Identification Tools vs. Checksum Generation Tools
At Archives New Zealand we were finding that ‘WAVE’ files had become a bottleneck in one of our ingest processes. That initially looked odd to me, as I had understood that file format identification would not take longer than generating a checksum. My rationale was that to identify a file, DROID, or an equivalent tool, would rarely need to read the whole file; rather, it should be able to read signatures from offsets relative to the beginning or end of the file.
On top of that, I assumed that even if the identification tool was ‘slow’ it would only take as long as checksum generation, never longer. My rationale was that reading from disk is the slowest part of the process: once the file’s contents were in memory, the tool working with the file would continue its computations in memory.
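To make that assumption concrete, here is a minimal Python sketch of the two kinds of work. It is not how DROID or Siegfried are implemented; the function names `looks_like_wave` and `md5_of_file` are hypothetical and exist only for illustration. The header check relies on the RIFF layout, where bytes 0-3 are `RIFF` and bytes 8-11 are `WAVE`, so it only ever reads 12 bytes, whereas the checksum has to read every byte of the file.

```python
import hashlib


def looks_like_wave(path):
    """Anchored check: read only the first 12 bytes of the file."""
    with open(path, "rb") as f:
        header = f.read(12)
    # Bytes 0-3 are 'RIFF', bytes 4-7 are a size field, bytes 8-11 are 'WAVE'.
    return len(header) == 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE"


def md5_of_file(path, chunk_size=65536):
    """Checksum: every byte must be read. Returns (hex digest, bytes read)."""
    digest = hashlib.md5()
    bytes_read = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            bytes_read += len(chunk)
    return digest.hexdigest(), bytes_read
```

If identification always looked like the first function, it could never be slower than the second, which is the intuition the experiments set out to test.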
It seems that the word ‘rarely’ may have been the error in my original assumptions. Finding those assumptions challenged, I was eager to build an empirical picture of the performance of format identification vs. checksum generation, and created a handful of experiments to do so.
The experiments were created to look at performance over the Govdocs Select corpus; a simulant of that corpus; and the WAVE files we were having the trouble with.
The results demonstrated that over a corpus that, perhaps, ‘looks like’ the Govdocs Select one, DROID will outperform the checksum generation tools, and Siegfried will be quicker on average than DROID.
As we start to limit the maximum number of bytes each tool is configured to scan, for example, asking DROID not to read more than 65535 bytes (its default setting), we see an increase in speed. We don’t see a major drop-off in the number of files supposedly identified by either tool, but we do start to see differences in what those results are. This indicates less precise identification results and potential false positives.
For the collection of WAVE files we had at hand, DROID ran 16x slower than the checksum generation tools, and when we reduced DROID’s scan length it was still measurably slower. One possible cause is the number of wildcards in the file format signatures for WAVE-based formats, which could lead DROID to scan through the entire file for each matching signature. Other things could account for this, for example, code that could still be optimized. The report presents solutions which may allow us to improve on the results we see today.
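The wildcard hypothesis can be illustrated with a small Python sketch. This is not DROID’s matching engine; the function `scan_for_marker`, the `bext` marker, and the use of `max_bytes` as a stand-in for a scan limit are all assumptions made for illustration. The idea is that a signature of the form "some marker, anywhere after the header" cannot be answered from a fixed offset, so a tool has to keep reading until it finds the marker or runs out of bytes it is allowed to scan.

```python
import io


def scan_for_marker(stream, marker, max_bytes=None, chunk_size=65536):
    """Search a stream for marker at any offset.

    Returns (found, bytes_examined). max_bytes caps how far we read,
    loosely mimicking a configured maximum scan length.
    """
    examined = 0
    tail = b""  # end of the previous chunk, kept so a marker split across chunks is found
    while True:
        to_read = chunk_size
        if max_bytes is not None:
            remaining = max_bytes - examined
            if remaining <= 0:
                break
            to_read = min(chunk_size, remaining)
        chunk = stream.read(to_read)
        if not chunk:
            break
        examined += len(chunk)
        window = tail + chunk
        if marker in window:
            return True, examined
        tail = window[-(len(marker) - 1):] if len(marker) > 1 else b""
    return False, examined


# A file whose marker sits at the very end: the unlimited scan reads all of it,
# while the capped scan is faster but reports a different result.
data = b"RIFF" + b"\x00" * 200000 + b"bext"
print(scan_for_marker(io.BytesIO(data), b"bext"))
print(scan_for_marker(io.BytesIO(data), b"bext", max_bytes=65535))
```

Repeating a scan like this once per candidate signature would multiply the cost, and capping it trades that cost for the kind of changed results described above.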
By way of control, the ‘simulant’ corpus (26,124 files populated with random data, totalling 31.4GB) was used to demonstrate that neither DROID nor Siegfried needed much time to reach a conclusion of ‘fmt/UNKNOWN’. DROID was noticeably quicker than Siegfried, but neither tool took over three minutes to get through the amount of data it was given.
The results are presented in the experiment’s full report, here: https://github.com/exponential-decay/digital-preservation-stage-boss-one/blob/master/final-report/digital-preservation-stage-boss-one.pdf [PDF]
All of the work involved in this experiment can be found in my GitHub repository Digital Preservation Stage Boss One.
Please take your time and enjoy the results and take a look through the repository. All thoughts and comments are appreciated.
[Edit] 2016-08-22: Post-publication Richard Lehane noticed a discrepancy in my thinking re: DROID prioritization of WAVE identification results. My correction is reflected in the report text and committed to the repository by way of source control/versioning. Please see GitHub for the previous reading.
[Edit] 2016-09-21: Surfacing the wildcard signature listing in PRONOM from the report itself: https://github.com/exponential-decay/digital-preservation-stage-boss-one/tree/master/wildcard-signature-information