TIFF format validation: easy-peasy?
The research question
I have never doubted the JHOVE TIFF module. The JHOVE TIFF module is always right. Everybody says so. That’s why nobody uses the myriad alternatives to it, although it’s so easy to write a TIFF validator, I could almost do it myself.
But while my colleague Michelle and I were drafting a paper for the IDCC this February, it dawned on me: “Everybody” has never written about the infallibility of JHOVE in a paper or blog post, nor run a thorough test that I know of. Besides, the “myriad alternatives” often seem difficult for me to use on my Windows machine, given my limited experience with command-line tools and batch scripting.
Last fall, I compared the validation tools JHOVE and Bad Peggy and how they both deal with JPEG validation (see OPF Blogpost). My goal was to analyse whether the JHOVE JPEG module is reliable, as we base our preservation decisions on it. In theory, my goal is the same with this examination, except that it focuses on TIFF, but my initial intention was admittedly biased: I wanted to prove that the JHOVE TIFF module is indeed infallible and that TIFF validation is, as I have always known, easy-peasy. As the analysis went on, I had to admit that the reality is much more complicated.
The verdict of a validation tool is usually relied on without a second thought, although most validation tools are not free from false negatives and false positives. As JHOVE is widespread in the digital preservation community and integrated in out-of-the-box digital preservation software like Rosetta and Preservica, its reliability is especially interesting for the formats we possibly all have in our archives, like TIFF images.
My research question: Is the JHOVE TIFF module really that good in comparison with other tools?
And, as a side-effect: Is TIFF validation really easy-peasy?
First, there seems to be a plethora of tools to test TIFF validity, analyse TIFF tags and even repair common errors. Some are listed in COPTR (search for “TIFF” and “validation”, though some tools like ExifTool do some validation and are not marked as validation tools). Furthermore, the libtiff library offers many programs that can be integrated in other tools (see e. g. this collection of TIFF-tools). Most tools are not out-of-the-box tools with a nice GUI like JHOVE, which can be used by you and me on a Windows machine.
I selected the following tools for my test:
|Validation Tool||version||How to use||remark|
|1||JHOVE||1.14.6||GUI and Java library|
|2||ImageMagick||7.0.3||Command-line, batch-script||help for the batch-script via Twitter from David Underdown and the ImageMagick people|
|3||ExifTool||10.37||Command-line, batch-script||help for the batch-script from Mario from the German nestor format identification group|
|5||checkit_tiff||0.2.0||runs on Linux only so far||Andreas, checkit_tiff developer from the SLUB Dresden, has run the test suite for me|
|6||LibTIFF||4.0.7||runs on Linux only||Heinz from the German nestor format identification group has run the test suite for me|
In summary, I was able to analyse the test suite with six tools. I had some help with a suitable batch script for ImageMagick and ExifTool, but at least I could easily run the tests on my Windows machine (unlike checkit_tiff, for example, which requires Linux). For checkit_tiff and LibTIFF, two colleagues helped me out and sent me the findings to analyse.
In the following paragraphs, I introduce the tools I have used for this analysis in more detail:
Validation: JHOVE is my go-to validator for TIFF files. The findings are intelligible and the expectations for the file to follow the TIFF specification seem reasonable. There are almost 70 known TIFF error and info messages in the JHOVE module. Most of them carry their meaning within the message, like “TileLength not defined”, and even a passer-by with a minimum of imagination can see why information about the tile length might come in handy for an image. In terms of transparency, the JHOVE website describes which requirements a TIFF file must meet to be well-formed, or well-formed and valid.
Handling: As the GUI output never suited me, I began long ago to use JHOVE as a Java library with my own HTML output, which is very user-friendly. Aside from the GUI output not being handy when dealing with many files, JHOVE is very easy to install and use, as one can just throw (drag & drop) files and folders at it and it will validate them all.
Validation: Just to be fair, ImageMagick is not primarily about file validation, but instead about displaying, migrating and working on images. I have marked as invalid every file about which ImageMagick had at least one error message, even if the error seems to be a minor one, like encountering an unknown TIFF field or incorrect tag contents. As far as I know there is no list of all possible ImageMagick error messages. The corpus used for this Blogpost, the Google Imagetestsuite, however, triggered more than 40 different error messages, listed here. In analysing the ImageMagick output for this Blogpost, “valid” means “error-free”.
Handling: If I had known how bad the output is before I started this test, I would have skipped this tool altogether. ImageMagick is a command-line tool and, as far as I know, batch processing only produces text output, and this output is really a mess. It’s difficult to tell which error information belongs to which file, as some files are not even listed by name. I had to test those one by one, which was time-consuming and boring. But I am pigheaded and had already started to tell everybody I was going to test this tool. Even after I had written some Java to help me with the messy output, some things could not be automated, as I could not find a regular pattern for everything. I am sure that ImageMagick is very useful for batch processing when converting images etc., but obviously nobody has really thought about validating 166 images at once, or about validation in the first place.
Validation: ExifTool is not really meant for validation, either. It’s for metadata extraction. The information about image errors is just a by-product if the tool runs into any problems while trying to extract metadata. So it’s not really fair to treat ExifTool like a validation tool, as it would never complain about an absolutely unreadable TIFF which cannot be opened by any viewer, as long as all the metadata can be extracted. That might be the reason why ExifTool has the highest percentage of presumably valid TIFF files within this test. So “valid” for ExifTool means that there were no warnings or errors in the metadata output.
Handling: It’s a command-line-tool with quite good possibilities to batch whole folders and output human-readable csv (though the csv can have many, many columns, as images can have a myriad of metadata).
Validation: The DPF Manager is for TIFF validation only and was built only for that. Some validation profiles are included (e. g. for “baseline TIFF”, “extended TIFF”, TIFF/EP and the TI/A Draft). Besides, you can specify your own validation profile to check against. For this analysis, only the baseline and extended TIFF profiles were used. For the DPF Manager, “valid” means that a TIFF file does not violate the specification in any way (depending on the profile used).
Handling: The DPF Manager is very easy to install (though you need admin rights to do so; it’s not portable like JHOVE is) and extremely easy to use: you can just drag and drop a file or folder onto the GUI or, alternatively, select a file or folder. The tool is very fast: 80 TIFFs need less than a minute (the size of the TIFFs varied between 9 KB and 4 MB). The HTML output is very nice, and there is also METS and XML (although personally I would like an additional CSV option). Furthermore, the TIFF files are sorted by the number of errors. The worst ones come first and as you scroll down, the TIFFs get less and less invalid, ending with the valid ones. So bad news first! There are also thumbnails of the TIFFs, if the image is renderable. I think in terms of usability this tool has easily won the contest.
As a bonus, each error is referenced to the page and section in the TIFF guide, including the exact quotation that the error refers to.
Having seen this, I am tempted to think of the DPF Manager as the reference tool for whether a TIFF really is valid or not. The question is whether it outperforms my current go-to tool, JHOVE. It certainly aims at being the go-to validator for TIFF files.
Validation: checkit_tiff is too picky for this examination, as it validates against baseline TIFF whereas the TIFFs in the Google Imagetestsuite are not baseline. I still want to present the findings of the tool and include it in this Blogpost, as the reader surely has a different TIFF corpus and might very well want to validate baseline TIFF.
Handling: The tool runs on Linux and is a command-line tool. There is no Windows version yet, though one is about to be released soon.
Validation: There are 695 different error messages from evaluation of the Google Imagetestsuite sample (listed here), mostly very similar ones dealing with unknown TIFF tags. Reading the error messages, LibTIFF seems to check the tags only, and omits some general file-structure validation, such as the check for end-of-file tags. It does check the TIFF header for the magic number, though (see “No TIFF magic number“). As far as I know, there is only a text output (“.log”), which in general is readable, but for bulk-analysis not much better than ImageMagick.
Handling: The tool runs on Linux.
The test was run on the Google Imagetestsuite for TIFF, which has the advantage of being openly available and contains some really bad TIFF files. By that I mean that half of the files are not even renderable or look bogus, as if parts were missing or the image is just grey, white or black, and one does not quite know if that’s intentional.
The files are named after their MD5 checksums (see: About). Unfortunately, there is no “ground truth” indicating whether each TIFF file is valid or not. I have added information about whether the image is renderable in either Windows Photo Preview, Paint or ImageMagick in the Findings spreadsheet, and used this as a basis for comparing tool validation output against. I know that this is not a water-tight solution, as an image can be invalid and still open in a current viewer. Some of the Google images also look damaged in the viewer, either as if something is missing or just black or white, and I cannot figure out whether this is on purpose (= this just is a picture of some black stuff) or the image is broken.
Reading the explanation of the TIFF Google Imagetestsuite, all images without the prefix “m-” are original and were not modified. All images with the prefix “m-“, however, were modified in some way. Although the intention was not necessarily to add errors to the image, the percentage of valid images for these files is much smaller (see table below).
For the chapter “differences in terms of error” below I have used the 3 files of the fixit/checkit_tiff test corpus in addition, as these files were better suited for the examples.
For ImageMagick and ExifTool I have marked images as invalid if they threw errors.
If an image could not be analysed by a tool at all (e. g. because the tool only analyses files with a correct TIFF header and some files don’t provide one), it was marked as “could not be analysed”.
I have tested the images with Windows Photos, Paint, ImageMagick and the thumbnail preview in the DPF Manager. If an image could not be rendered by any of them, I marked it as “invalid” in the column “renderable in a viewer”.
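The marking rules above can be sketched in a few lines of Python. This is only an illustration: the function names, and the idea of passing a tool's messages as a list and the viewer results as a dictionary, are assumptions made for this example and not the output format of any of the tools.

```python
def classify(analysed, messages):
    """Turn one tool's result for one file into the label used in the findings table.

    analysed -- False if the tool refused to handle the file at all
    messages -- list of error/warning strings the tool reported (hypothetical shape)
    """
    if not analysed:
        return "could not be analysed"
    # Any reported error, however minor, counts as invalid.
    return "invalid" if messages else "valid"


def renderable(viewer_results):
    """A file counts as renderable if at least one viewer could display it.

    viewer_results -- mapping of viewer name to True/False (hypothetical shape)
    """
    return any(viewer_results.values())


print(classify(True, ["Unknown field with tag 37680 (0x9330) encountered"]))
print(renderable({"Paint": False, "ImageMagick": True}))
```

Note how strict the first rule is: a single message about an unknown tag is enough to move a file into the “invalid” column.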
|JHOVE||ImageMagick||ExifTool||DPF Manager (Baseline)||DPF Manager (Extended TIFF)||checkit_tiff||LibTiff||Renderable in a viewer|
|all 166 files|
|valid / error free||29||18||56||4||15||0||21||83|
|invalid / errors reported||129||148||109||151||140||131||145||83|
|could not be analysed||8||–||1||11||11||35||–|
|47 original files (not modified)|
|could not be analysed||–||–||–||–||–||–||–|
|119 modified files|
|could not be analysed||8||–||1||11||11||35||–|
|83 non-renderable files|
|could not be analysed||5||–||1||11||11||3||–|
A few observations on the above figures:
checkit_tiff does not analyse the TIFF files if the magic number is missing.
Compared against the ability of a viewer to render the files, all tools seem to generate false positives (= false alarms): more files are marked as invalid than renderability would seem to imply.
The number of invalid “modified files”, and the number of invalid “non-renderable files” is very high. There were some files which could not be opened with Paint but with ImageMagick, though (marked with “ImageMagick can open” in the Spreadsheet).
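To put the figures for all 166 files into proportion, the share of files each tool considers valid can be computed directly from the “valid / error free” row of the table. The numbers are taken from that row; the variable and function names are only illustrative.

```python
TOTAL_FILES = 166

# "valid / error free" counts for all 166 files, copied from the table above
VALID_COUNTS = {
    "JHOVE": 29,
    "ImageMagick": 18,
    "ExifTool": 56,
    "DPF Manager (Baseline)": 4,
    "DPF Manager (Extended TIFF)": 15,
    "checkit_tiff": 0,
    "LibTiff": 21,
}
RENDERABLE = 83  # files that at least one viewer could display


def share(count, total=TOTAL_FILES):
    """Percentage of the corpus, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Print the tools from most to least permissive, then the viewer baseline.
for tool, count in sorted(VALID_COUNTS.items(), key=lambda item: item[1], reverse=True):
    print(f"{tool:28} {share(count):5.1f} % valid")
print(f"{'Renderable in a viewer':28} {share(RENDERABLE):5.1f} %")
```

Every tool stays below the share of renderable files, which is the gap the observation about false alarms refers to.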
I am reluctant to state that JHOVE has two false negatives here, but all the other tools have marked these two files as invalid, except ExifTool, which has marked one of the files as error-free (see Spreadsheet). Furthermore, no viewer can render the files. I tried to analyse the files with my own TIFF Java tools, but they could not process the files at all. Much as I hate it, I have to admit: these are false negatives. What else should I call them? I certainly would not want these two files going unnoticed in my archive, and apart from the single ExifTool exception, every other tool detected that something is wrong with them.
For 5 files of the test corpus JHOVE throws the “Premature End-of-File” error, which usually hints at a fatal error with the file. Often parts of the file are missing; a typical cause is that the file was not completely downloaded or uploaded and the last chunk of the file is not there. JHOVE usually notices missing chunks at the end of a file.
Four of the five files (spreadsheet) do look very suspicious. Two cannot be opened, two are black, one looks as if parts of the text were missing. At least most tools agree that something is wrong with the files. Only the DPF manager considers one of the 5 files to be valid. Looking at the error messages of the DPF manager, there is no hint of a premature file ending.
ImageMagick throws the error “unexpected end-of-file”, but for five other files of the corpus (listed here). JHOVE reports other errors for these files, but at least all validation tools again agree on the invalidity of the files. They certainly look damaged in the viewer and one of them cannot even be opened. The DPF Manager could not even analyse these five files; they were all omitted from its analysis.
If the file only purports to be a TIFF file, e. g. by the file extension, but the magic number cannot be found, all tools agree on the error (spreadsheet). None of the three files reporting the error could be opened and JHOVE, ImageMagick, DPF Manager and checkit_tiff (by not handling the file) agree that the magic number is missing and that it therefore cannot be a TIFF file or rather the TIFF signature is incorrect.
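The magic number check the tools agree on is simple enough to sketch. According to the TIFF 6.0 specification, a file starts with the byte order mark “II” or “MM”, followed by the number 42 in that byte order. The snippet below is a minimal illustration of that rule, not the code of any of the tools tested; the function name is made up, and BigTIFF (which uses 43 instead of 42) is deliberately not covered.

```python
import struct


def tiff_signature(data):
    """Return 'little' or 'big' if data starts with a classic TIFF header, else None."""
    if len(data) < 8:          # header = 2 bytes order + 2 bytes magic + 4 bytes IFD offset
        return None
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:          # neither Intel nor Motorola byte order mark
        return None
    (magic,) = struct.unpack(order + "H", data[2:4])
    if magic != 42:            # the TIFF magic number, read in the declared byte order
        return None
    return "little" if order == "<" else "big"


# First eight bytes of a little-endian TIFF, and the start of a JPEG for contrast
print(tiff_signature(b"II*\x00\x08\x00\x00\x00"))
print(tiff_signature(b"\xff\xd8\xff\xe0\x00\x10JF"))
```

A file that fails this test cannot be parsed any further as a TIFF, which would explain why checkit_tiff simply does not handle such files.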
Sometimes, the tools agree on an error and even use very similar words to describe the error. One example is shown in the table below.
|ImageMagick Error for file 0c84d07e1b22b76f24cccc70d8788e4a||JHOVE TIFF Module Error for file 0c84d07e1b22b76f24cccc70d8788e4a|
|Unknown field with tag 37680 (0x9330) encountered||Unknown TIFF IFD tag: 37680|
|Unknown field with tag 37677 (0x932d) encountered.||Unknown TIFF IFD tag: 37677|
|Unknown field with tag 37678 (0x932e) encountered.||Unknown TIFF IFD tag: 37678|
Obviously, both tools check for unknown TIFF tags and report them if they encounter any. ImageMagick also gives the hex value of the field. It does not matter which of these two tools one uses; both will report unknown tags. At least both tools have done so with the Google Imagetestsuite.
For these two examples I have used files from the fixit/checkit_tiff test corpus.
The JHOVE module correctly reports if the DateTime is not formatted as it is supposed to be and marks the file as “well-formed, but not valid”. It has done so with the “invalid_date.tiff” from the fixit / checkit_tiff test files. ImageMagick, however, completely fails to notice that there is something wrong with the DateTime tag in this file and the error goes unnoticed. (ImageMagick does report an error, but it seems to be unconnected to the DateTime, as it is about the “Photoshop” tag.) The DPF Manager also reports “Incorrect format for DateTime” and quotes the TIFF specification, so this is a false negative for ImageMagick.
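What a DateTime check has to look for can be sketched as follows. The TIFF 6.0 specification defines the DateTime tag as an ASCII value of the form “YYYY:MM:DD HH:MM:SS”, 20 bytes long including the terminating NUL byte. The function below is only an illustration of that rule; the name is made up, and the additional calendar check via strptime is my own assumption, which may be stricter or looser than what JHOVE or the DPF Manager actually test.

```python
import re
from datetime import datetime

# 4-digit year, colons as date separators, one space, 24-hour time
_DATETIME_PATTERN = re.compile(r"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}")


def is_valid_tiff_datetime(raw):
    """Check the raw bytes of a DateTime tag value against the TIFF 6.0 format."""
    # 19 characters plus the terminating NUL byte
    if len(raw) != 20 or raw[-1:] != b"\x00":
        return False
    try:
        text = raw[:-1].decode("ascii")
    except UnicodeDecodeError:
        return False
    if not _DATETIME_PATTERN.fullmatch(text):
        return False
    try:
        # Assumption for this sketch: also reject impossible dates such as month 13
        datetime.strptime(text, "%Y:%m:%d %H:%M:%S")
    except ValueError:
        return False
    return True


print(is_valid_tiff_datetime(b"2016:12:24 18:30:00\x00"))
print(is_valid_tiff_datetime(b"2016-12-24 18:30:00\x00"))
```

A date written with hyphens, as in the second call, is perfectly readable for a human but does not follow the specification.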
The JHOVE module throws an error about an offset that is not word-aligned for the “minimal_valid” TIFF in the checkit_tiff examples and marks the TIFF as “not well-formed”. ImageMagick does not report any errors for this file. The DPF Manager, however, finds three errors in the file: two related to “bad word alignment in offset” (which sounds pretty much like the JHOVE error) and one inconsistency about the PlanarConfiguration tag, which does not sound that fatal (“PlanarConfiguration is irrelevant if SamplesPerPixel is 1, and need not be included.”).
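The “bad word alignment in offset” finding refers to a rule in the TIFF 6.0 specification: a value that does not fit into the four bytes of its directory entry is stored elsewhere in the file, and the offset pointing to it is expected to be an even number. A minimal sketch of such a check, looking only at the first IFD of a classic TIFF, could read as follows. The function name is hypothetical and this is not how any of the tested tools implement it.

```python
import struct

# Size in bytes of one value for each TIFF 6.0 field type (1 = BYTE ... 12 = DOUBLE)
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1, 7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def odd_value_offsets(data):
    """Return (tag, offset) pairs from the first IFD whose value offset is odd."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or len(data) < 8:
        raise ValueError("no TIFF header found")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("no TIFF magic number")
    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    findings = []
    for index in range(entry_count):
        # Each directory entry is 12 bytes: tag, type, count, value-or-offset
        tag, field_type, count, raw = struct.unpack_from(
            order + "HHI4s", data, ifd_offset + 2 + 12 * index
        )
        size = TYPE_SIZES.get(field_type)
        if size is None or size * count <= 4:
            continue  # unknown type, or value stored inline: no offset to check
        (offset,) = struct.unpack(order + "I", raw)
        if offset % 2:
            findings.append((tag, offset))
    return findings


# Hand-built header plus IFD: ImageWidth stored inline, DateTime at the odd offset 39
ifd = (struct.pack("<H", 2)
       + struct.pack("<HHII", 256, 3, 1, 1)
       + struct.pack("<HHII", 306, 2, 20, 39)
       + struct.pack("<I", 0))
print(odd_value_offsets(b"II*\x00" + struct.pack("<I", 8) + ifd))
```

Most viewers will happily read a value from an odd offset, which fits the observation that ImageMagick has no complaints about this file.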
Of the 166 files, there are only four on whose validity all of the tools agree (except checkit_tiff, which considers them all to be invalid) (spreadsheet). If one decided on a file validity policy which only allows files into an archive about which no tool has any complaints, it would be a very empty archive indeed. It might not even be possible to satisfy them all with real-world images from different producers.
Although the tools agree on the “real bad” TIFF files, TIFF validation does not seem to be all that easy-peasy. It was much easier, at least with the corpus analysed, to determine what is a false positive and what is a false negative with the JPEGs in my last OPF Blogpost. The JHOVE TIFF module still seems to be a decent choice and I have not found any real gap like I did with the JHOVE JPEG module the other day, although the two false negatives leave me nervous.
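Such a “no tool complains” policy is easy to express once the verdicts are collected per file. The sketch below uses made-up file names and verdicts purely for illustration (the real findings are in the spreadsheet), and the function name and data layout are assumptions of this example.

```python
def unanimously_valid(verdicts, ignore=()):
    """Return the files that every tool (except the ignored ones) considers valid.

    verdicts -- {file name: {tool name: "valid" | "invalid" | "could not be analysed"}}
    ignore   -- tools to leave out, e.g. one that validates against a stricter profile
    """
    return sorted(
        name
        for name, by_tool in verdicts.items()
        if all(verdict == "valid"
               for tool, verdict in by_tool.items()
               if tool not in ignore)
    )


# Hypothetical verdicts for three files, not taken from the actual findings
example = {
    "file_a.tif": {"JHOVE": "valid", "ImageMagick": "valid", "checkit_tiff": "invalid"},
    "file_b.tif": {"JHOVE": "valid", "ImageMagick": "invalid", "checkit_tiff": "invalid"},
    "file_c.tif": {"JHOVE": "invalid", "ImageMagick": "invalid",
                   "checkit_tiff": "could not be analysed"},
}
print(unanimously_valid(example, ignore={"checkit_tiff"}))
print(unanimously_valid(example))
```

With a single strict tool in the mix, the second call returns an empty list, which is the “very empty archive” in miniature.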
Findings of the DPF manager seem to be trustworthy to me, as the TIFF specification can be referenced for each error found. Please note that some errors lead the DPF manager not to detect TIFF files as such, e. g. if the file ends prematurely or unexpectedly (see this spreadsheet).
Nevertheless, most of the tools, if not all, seem to be too paranoid. Assuming all non-modified TIFFs are valid (they are all renderable in a viewer), ExifTool comes closest by considering 94% of them to be valid (or “error-free”, as in the case of ExifTool). The second-best, JHOVE, still considers almost half of them to be invalid in some way. The DPF Manager considers only 30% of them to be valid (Extended TIFF) and is even able to prove every bit of it.
Back to my research questions:
Is the JHOVE TIFF module really that good in comparison with other tools?
Well. It’s pretty user-friendly, the error messages are intelligible (but most TIFF errors are, with every tool tested), the output can be dealt with, but it’s not as user-friendly as the DPF manager, which also has a nicer output. And, the DPF manager has the reference to the specification all the time, which really feels good when talking to my boss about the quality of our TIFF files. Look, the TIFF bible says it’s ok / not ok. Who would argue?
Nevertheless, it was the only (real validation) tool that produced false negatives for clearly invalid and un-renderable files, which would be worth a second look in one of my next posts.
And, as a side-effect: Is TIFF validation really easy-peasy?
It does not seem so, as the validators agree on very little indeed.
So, how to act?
I might just stick to JHOVE in our production digital preservation environment, but I will at least add the DPF Manager to our pre-ingest workflows, especially in our digitisation centre, to be sure we stick to the TIFF specification at least with TIFF files we generate ourselves. When receiving files from outsiders, I will be more tolerant, as I always am, but might add a preservation planning workflow to repair the TIFFs, if possible. But that will be the topic of another post at another time.