Error detection of JPEG files with JHOVE and Bad Peggy – so who’s the real Sherlock Holmes here?
In this blog post I want to examine the findings of two validation tools: JHOVE (version 1.14.6) and Bad Peggy (version 2.0), which scans image files for damage using the Java Image IO library. The goal is to compare the findings so that readers know what to expect from these validation tools in their daily digital curation work.
This test was done with the publicly available Imagetestsuite from Google. Furthermore, I included pictures from events like Christmas parties and outdoor events in my library from the last 6 years, pictures contributed by friends and colleagues, some of my own pictures, and even some memes from public fun Facebook pages like “useless facts”.
In one case, I even opened a JPEG in an editor to remove some bytes to test if the tools would realise that something was wrong, hoping I would get error messages that were still missing from the list (which, by the way, worked out).
In general, the JHOVE JPEG module knows 13 different error messages, whereas Bad Peggy can distinguish at least 30 (according to the source code of KOST-Val, which uses Bad Peggy to validate JPEG files).
The following table lists the JHOVE errors only. All errors that Bad Peggy found in the sample are listed in the table below the conclusion.
|No.||Message||Examples in the sample?||No. of occurrences in sample||Bad Peggy equivalent|
|1||DTT segment without previous DTI||no||/|
|2||Unexpected end of file||Yes (Example)||26||corrupt data: premature end of data segment. AND corrupt data: Truncated File – Missing EOI marker.|
|3||I/O exception processing Exif metadata:||no||/|
|4||Invalid JPEG header||Yes (Example)||4||The file is not a JPEG (header).|
|5||JFIF APP0 marker not at beginning of file||Yes (see “Expected marker byte 255, got”)||17||Bad Peggy does not recognise this file as invalid|
|6||Marker not valid in context||Yes (Example)||3||invalid file structure: two SOI markers|
|7||Expected marker byte 255, got||Yes (Example)||78||Bad Peggy does not recognise this file as invalid|
|8||SPIFF marker not at beginning of file||no||/|
|9||File does not begin with SPIFF, Exif or JFIF segment||Yes (see “Expected marker byte 255, got”)||105||Bad Peggy does not recognise this file as invalid|
|10||Error creating temporary file. Check your configuration:||no||/|
|11||Unrecognized tiling data||no||/|
|12||Value offset not word-aligned: xxx||Yes (Example)||6||Bad Peggy considers these files to be invalid as well, but throws different error messages|
|13||No TIFF magic number: 4906 (Is officially a TIFF error message, but was thrown for this presumably JPEG file)||Yes (Example)||1||corrupt data: bad Huffman code. (only one file in the sample)|
For five of the error messages no examples could be found, so I won’t look at them in this blog post. Furthermore, for the errors “No TIFF magic number: 4906” and “Value offset not word-aligned: xxx“, the examples in the sample were so scarce that I cannot explain them properly yet. If anybody out there has examples for these errors, I will happily extend this post and include the findings.
Let’s take a closer look at the different errors:
Between the SOI (start of image) and the EOI (end of image), other segments are allowed. Roughly speaking, a JPEG should be structured as follows:
- SOI-segment (SOI: start of image): “FF D8”
- APP0-segment (JFIF-Tag): “FF E0”
- other segments
- SOS-segment (SOS: start of scan): “FF DA”
- data: compressed data
- EOI-segment (EOI: end of image): “FF D9”
If there are, e.g., embedded thumbnails, a single file can contain more than one SOI and EOI segment. In these cases, there has to be an SOI segment between two EOI thumbnail markers. (Bad Peggy checks this, whereas JHOVE does not.)
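The structure described above can be walked with a few lines of Python. This is only a minimal sketch to make the segment layout tangible: the function name `list_segments` is my own, it stops at the first SOS segment instead of decoding the compressed data, and it is not how JHOVE or Bad Peggy work internally.

```python
import struct

# Markers that stand alone and carry no length field:
# SOI (D8), EOI (D9), TEM (01) and the restart markers RST0-RST7 (D0-D7).
STANDALONE = {0xD8, 0xD9, 0x01} | set(range(0xD0, 0xD8))


def list_segments(data: bytes):
    """Return (offset, marker) pairs for all segments up to the first SOS.

    Hypothetical helper for illustration only.
    """
    segments = []
    pos = 0
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            # Every segment has to start with the marker byte 0xFF (255).
            raise ValueError(
                f"expected marker byte 255, got {data[pos]} at offset {pos}"
            )
        marker = data[pos + 1]
        if marker == 0xFF:
            pos += 1  # fill byte before the actual marker, skip it
            continue
        segments.append((pos, marker))
        if marker in STANDALONE:
            pos += 2
            if marker == 0xD9:
                break
            continue
        if pos + 4 > len(data):
            raise ValueError("file ends inside a segment header")
        # The two bytes after the marker hold the big-endian segment length,
        # which includes the length bytes themselves.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker == 0xDA:
            break  # SOS: compressed data follows, stop the walk here
        pos += 2 + length
    return segments
```

Called on the bytes of a file (for example `list_segments(open("some.jpg", "rb").read())`, where the file name is just a placeholder), it returns the offsets and marker bytes in file order, which makes it easy to see whether `0xE0` really follows `0xD8`.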
Interestingly, JHOVE does not always detect if parts of the files are missing. Only for files where Bad Peggy throws two errors: “corrupt data: premature end of data segment” AND “corrupt data: Truncated File – Missing EOI marker“, will the JHOVE JPEG module detect that something is indeed wrong with the file. For several files Bad Peggy only throws “corrupt data: premature end of data segment” and JHOVE considers them to be valid.
Furthermore, the Bad Peggy error “corrupt data: xxxx extraneous bytes before marker 0xd9.” goes unnoticed by JHOVE. Images in this sample with these errors, though, do not look healthy to me at all.
JHOVE detects that there is something wrong with these images (“Unexpected end of file“):
JHOVE does not detect that these images are damaged, though Bad Peggy does (“corrupt data: premature end of data segment“):
There is clearly something missing, as you can see with these three examples, which are considered to be valid by JHOVE:
|File Name||Bad Peggy Error||Impact|
|image195||corrupt data: 83426 extraneous bytes before marker 0xd9.||color problems, picture seems to have two parts that do not belong together|
|image185||corrupt data: premature end of data segment.||color problems, picture seems to have three parts|
|image183||corrupt data: 19846 extraneous bytes before marker 0xd9.||color problems, picture seems to have two parts|
As a digital archivist, I would want to know about these errors while ingesting data. I fully agree with Bad Peggy – this data is indeed corrupt. I consider this a false negative of the JHOVE JPEG module: the JPEG has serious problems, but JHOVE does not detect them. In this case, Bad Peggy is the better Sherlock Holmes.
The last bytes of a JPEG should look like this and always end with an EOI (end of image), which is “FF D9”:
In this example, the JPEG just ends without the necessary EOI:
If I add the “FF D9” manually and save the file, JHOVE does not detect that something is wrong with it any more. It considers the file to be valid & well-formed.
Bad Peggy, however, still considers this file to be invalid:
|File Name||Bad Peggy message 1||Bad Peggy message 2|
|image178.JPG||corrupt data: premature end of data segment||corrupt data: Truncated file – Missing EOI marker|
|image176_added_FFD9.JPG||corrupt data: premature end of data segment||this error message is gone|
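The manual repair described above can be reproduced in a few lines of Python. This is a sketch under my own assumptions: the helper names `ends_with_eoi` and `append_eoi` are made up for illustration, and I am not claiming that JHOVE performs exactly this check. It only shows why a test that looks at the last two bytes alone is satisfied by a file whose image data is still incomplete.

```python
EOI = b"\xff\xd9"  # end-of-image marker


def ends_with_eoi(data: bytes) -> bool:
    """True if the very last two bytes are the EOI marker."""
    return data.endswith(EOI)


def append_eoi(data: bytes) -> bytes:
    """Mimic the manual fix: add FF D9 if it is missing.

    The compressed data in front of it is left untouched, so a truncated
    scan stays truncated.
    """
    return data if ends_with_eoi(data) else data + EOI


# A made-up, truncated byte sequence: SOI, an SOS header and two data bytes.
truncated = b"\xff\xd8\xff\xda\x00\x02\x12\x34"
patched = append_eoi(truncated)
```

Here `ends_with_eoi(truncated)` is false and `ends_with_eoi(patched)` is true, although not a single byte of image data has been restored. A decoder that actually reads the scan, as Bad Peggy does via Java Image IO, still runs out of data.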
Easy and straightforward: Both tools check the JPEG header and throw an error if there is no correct JPEG header. This is extremely useful, as tools usually cannot open files with a missing JPEG header. In most cases the file is unreadable for good – or it’s not even a JPEG in the first place.
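A header check of this kind is tiny. The following sketch only tests for the SOI marker “FF D8” at the very beginning; the function name `has_soi` is hypothetical and the real tools may look at more than these two bytes.

```python
SOI = b"\xff\xd8"  # start-of-image marker


def has_soi(data: bytes) -> bool:
    """True if the data begins with the JPEG start-of-image marker."""
    return data[:2] == SOI
```

For a file on disk it is enough to read the first two bytes, for example `has_soi(open(path, "rb").read(2))`, where `path` is a placeholder for the file to test.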
After the SOI (“FF D8”), an APP0-segment should follow, which always starts with “FF E0” (see: “General information about the JPEG structure”). If this error is thrown, the APP0-segment does not follow directly after the SOI-segment. A JPEG file which throws this error viewed in a Hex editor shows that the SOI marker “FF D8” is not followed by “FF E0”. Instead, in this example the SOI-marker is directly followed by a copyright marker (“FF EE”).
JHOVE flags this file as missing the APP0 marker; Bad Peggy, however, ignores this error completely and apparently does not test for it. The JPEG standard clearly states that JPEG files have to be structured like this, but so far, none of the JPEG files in the sample have caused any problems in commonly used viewers. This cannot be marked as a false positive for the JHOVE JPEG module, but it currently does not seem to pose any practical risk to the affected data. As usual, the viewers are more flexible than the file format standard, at least contemporary viewers (we cannot predict how future viewers will behave, which makes decisions about long-term availability harder).
The examples in the corpus are all visibly corrupted, as parts of the images seem to be missing:
The JHOVE error sounds pretty general. For the three examples in the corpus, Bad Peggy has found different equivalents.
|File Name||JHOVE||Bad Peggy||Explanation|
|marker_1.jpg||marker not valid in context||corrupt data: 16 extraneous bytes before marker 0xe0||Cannot explain the error, as there is no 0xe0 marker to be found|
|Marker_2.jpg||marker not valid in context||corrupt data: premature end of data segment||See “Unexpected end of file”|
|Marker_3.jpg||marker not valid in context||corrupt data: premature end of data segment AND invalid file structure: two SOI markers||Searching for the SOI-segment in an affected file has shown a second SOI-segment later in the file|
This error occurs several times within the sample and gives a plethora of marker bytes which have been used instead of 255. So far, none of the affected JPEGs have shown any problems and Bad Peggy ignores the error altogether.
A JPEG file usually uses the JPEG File Interchange Format (JFIF), but can also use Exif or SPIFF. Either way, it has to start with one of these three segments and no other. Bad Peggy marks these files as invalid as well, but the error message is quite tight-lipped: “ype.” KOST (Switzerland) translated it as something like “This JPEG contains characteristics that are not supported”, which does not really enlighten me.
JHOVE usually detects this error combined with “JFIF APP0 marker not at beginning of file” or “Expected marker byte 255, got XXX“. An example for a file with this error “standalone” is this:
Similarly to the example for the JHOVE error “JFIF APP0 marker not at beginning of file“, the tag directly following the SOI (“FF D8”)-marker is not a JFIF, Exif- or SPIFF-marker, but a copyright tag (“FF EE”).
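What such a check boils down to can be sketched as follows. The marker values are the usual ones (APP0 “FF E0” for JFIF, APP1 “FF E1” for Exif, APP8 “FF E8” for SPIFF); the function name `first_segment` and the returned labels are my own, and a complete check would also have to look at the identifier string inside the segment, which this sketch skips.

```python
# Marker byte of the first segment -> format it normally announces.
FIRST_SEGMENT_NAMES = {
    0xE0: "JFIF (APP0)",
    0xE1: "Exif (APP1)",
    0xE8: "SPIFF (APP8)",
}


def first_segment(data: bytes) -> str:
    """Name the segment that directly follows the SOI marker.

    Hypothetical helper: it inspects the marker byte only.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("no SOI marker at the start of the file")
    if len(data) < 4 or data[2] != 0xFF:
        raise ValueError("no marker directly after SOI")
    marker = data[3]
    return FIRST_SEGMENT_NAMES.get(marker, f"other (0x{marker:02x})")
```

For the example file, where “FF EE” follows the SOI, this returns `other (0xee)`, which is the situation JHOVE reports.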
Bad Peggy also detects an error that the JHOVE JPEG module ignores completely, and it found this error in more than 100 files within the sample. The error is almost self-explanatory once you know what an SOI (start of image) and an EOI (end of image) are. So far, none of these JPEGs look bogus in any way or have caused any problems when displayed.
After a closer look at the affected JPEG data, I would not want these JPEG files to go unnoticed in my archive:
Two of the images cannot even be opened and displayed any more, and the rest have missing parts, mixed-up parts and colour problems. For practical reasons, I would want my tool to detect these errors automatically, and not necessarily any more than these. These are the only JPEGs that obviously have problems; others show errors in JHOVE, Bad Peggy or both, but contemporary JPEG viewers have no problems displaying them. Of course it is impossible to say whether future tools will be able to display these JPEGs properly.
Considering this, Bad Peggy has clearly won: It detects all of the visually corrupt images.
The JHOVE JPEG module misses 7 out of 18: these are the files with the Bad Peggy error “corrupt data: premature end of data segment” without the additional error “corrupt data: Truncated File – Missing EOI marker”, and those with “xxxx extraneous bytes before marker 0xd9.” Maybe JHOVE would be just fine if these two extra tests were included. Whether anything else serious is missing is a question we would probably need a bigger sample to answer.
These seven JPEGs with visible problems are missed by JHOVE:
All in all, 1007 of the 3070 files in the sample had problems, if Bad Peggy is to be believed. As one file can contain more than one error, the findings are as follows:
|Occurrences||Error flavour||Error message|
|15||recognition and BadPeggy||The file is not a JPEG (header).|
|846||invalid file structure||Missing SOI between two EOI thumbnail markers|
|2||invalid file structure||two SOI markers.|
|2||invalid file structure||two SOF markers.|
|2||invalid file structure||Huffman table 0x00 was not defined|
|2||invalid file structure||SOS before SOF.|
|1||invalid file structure||Empty JPEG image (DNL not supported).|
|2||invalid file structure||missing SOS marker.|
|21||corrupt data||premature end of data segment|
|40||corrupt data||16 extraneous bytes before marker 0xe0|
|23||corrupt data||Truncated File – Missing EOI marker|
|1||corrupt data||bad Huffman code.|
|1||corrupt data||found marker 0xf7 instead of RST0.|
|6||other problems||Bogus Huffman table definition|
|6||other problems||Bogus marker length|
|7||other problems||Warning: unknown JFIF revision number 148.195.|
|7||other problems||Image Format Error|
|3||other problems||Invalid progressive parameters Ss=227 Se=63 Ah=1 Al=0|
|3||other problems||Bogus DQT index 10.|
|2||other problems||Quantization table 0x00 was not defined.|
|2||other problems||Unsupported JPEG process: SOF type 0xc3.|
|2||other problems||JFIF not permitted in stream metadata.|
|1||other problems||Unsupported JPEG data precision 9|
|1||other problems||Sampling factors too large for interleaved scan.|
|1||other problems||Bogus sampling factors.|
|1||other problems||Too many color components: 17, max 10.|
|1||other problems||Sorry, there are legal restrictions on arithmetic coding.|