At the KB we have a large collection of offline optical media. Most of these are CD-ROMs, but we also have a sizeable proportion of audio CDs. We’re currently in the process of designing a workflow for stabilising the contents of these materials using disk imaging. For audio CDs this involves ‘ripping’ the tracks to audio files. Since the workflow will be automated to a high degree, basic quality checks on the created audio files are needed. In particular, we want to be sure that the created audio files are complete, as it is possible that some hardware failure during the ripping process could result in truncated or otherwise incomplete files.
To get a better idea of which software tools are best suited to this task, I created a small dataset of audio files which I deliberately damaged. I then ran each of these files through a set of candidate tools, and checked which tools were able to detect the faulty files. The first half of this blog post focuses on the WAVE format; the second half covers the FLAC format (we haven’t decided yet which format to use).
WAVE dataset
For the WAVE dataset I started out with a small, intact WAVE file. Using a hex editor I then made the following derivatives of this file:
- frogs-01-last-byte-missing.wav – one byte is missing at the end of the file
- frogs-01-last-2032-bytes-missing.wav – a chunk of 2032 bytes is missing at the end of the file
- frogs-01-byte-missing-at-offset-811537.wav – one byte is missing at offset 811537
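The files for this post were made with a hex editor. As a rough alternative sketch, similar derivatives could be produced with standard command-line tools. The helper names `drop_tail` and `drop_byte_at` are hypothetical, the offset is assumed to be zero-based (the post does not say), and `head -c` is assumed to be available (it is common but not required by POSIX).

```shell
#!/bin/sh
# Hypothetical helpers for producing truncated or damaged copies of a file.

# drop_tail IN OUT N: copy IN to OUT, leaving out the last N bytes.
drop_tail() {
    size=$(wc -c < "$1")
    head -c "$((size - $3))" "$1" > "$2"
}

# drop_byte_at IN OUT OFFSET: copy IN to OUT, leaving out the single byte
# at zero-based OFFSET.
drop_byte_at() {
    {
        head -c "$3" "$1"
        # tail -c +K starts output at the K-th byte (one-based), so this
        # resumes right after the skipped byte.
        tail -c "+$(($3 + 2))" "$1"
    } > "$2"
}

# Example, using the file names from the list above:
# drop_tail frogs-01.wav frogs-01-last-byte-missing.wav 1
# drop_tail frogs-01.wav frogs-01-last-2032-bytes-missing.wav 2032
# drop_byte_at frogs-01.wav frogs-01-byte-missing-at-offset-811537.wav 811537
```

Neither helper touches the WAVE header, so the sizes recorded in the header no longer match the actual file size afterwards, which is the kind of inconsistency a checking tool can look for.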
Candidate tools, WAVE
The candidate tools I used to analyse the WAVE files are:
- jhove includes a WAVE validation module, which makes it an obvious choice. The tested version is 1.14.6, 2016-05-12.
- shntool is a "multi-purpose WAVE data processing and reporting utility". It was first released in 2000. The tested version is 3.0.7.
- ffmpeg is a popular conversion tool for audio and video formats. The tested version is 3.2.2.
- mediainfo is a widely-used feature extraction tool for audiovisual files. The tested version is v0.7.81.
Note that of the above tools, only Jhove and Shntool are designed to detect problems in WAVE files. Ffmpeg and Mediainfo were primarily designed for other purposes (format conversion and technical metadata extraction, respectively), not for detecting defective files! I included them here mainly because they are widely used, and I was curious whether they would throw up anything interesting in case of defective files [1]. I ran the tools with the following command-line arguments (replacing "foo.wav" with the actual file name):
Jhove
jhove -m WAVE-hul foo.wav
Shntool
shntool info foo.wav
Ffmpeg
ffmpeg -v error -i foo.wav -f null -
Mediainfo
mediainfo foo.wav
I automated this using a simple shell script that runs each tool on all files, and then writes the output to a set of text files.
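A minimal sketch of what such a script could look like is given below. It is not the script used for this post, which may differ; the function names `run_tool` and `run_all` and the one-text-file-per-tool layout are assumptions, while the tool invocations are the ones listed above.

```shell
#!/bin/sh
# Hypothetical batch runner; the script used for the post may differ.

# run_tool NAME OUTDIR CMD...: run CMD and append its standard output and
# standard error to OUTDIR/NAME.txt, preceded by the command line itself.
# A non-zero exit status is recorded in the output file as well.
run_tool() {
    name=$1
    outdir=$2
    shift 2
    {
        printf '### %s\n' "$*"
        "$@" 2>&1 || printf '(exit status %s)\n' "$?"
        printf '\n'
    } >> "$outdir/$name.txt"
}

# run_all OUTDIR FILE...: run the four tools from the post on each FILE.
run_all() {
    outdir=$1
    shift
    mkdir -p "$outdir"
    for f in "$@"; do
        run_tool jhove "$outdir" jhove -m WAVE-hul "$f"
        run_tool shntool "$outdir" shntool info "$f"
        run_tool ffmpeg "$outdir" ffmpeg -v error -i "$f" -f null -
        run_tool mediainfo "$outdir" mediainfo "$f"
    done
}

# Example: run_all out frogs-01*.wav
```

Capturing standard error together with standard output matters here, because some tools report problems only on the error stream.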
Results, WAVE
The full output results of each tool can be found here.
Jhove
The ‘Status’ field in Jhove’s output summarises the validation outcome. Here are the results for each file:
File | Result |
---|---|
frogs-01.wav | Status: Well-Formed and valid |
frogs-01-last-byte-missing.wav | Status: Well-Formed and valid |
frogs-01-last-2032-bytes-missing.wav | Status: Well-Formed and valid |
frogs-01-byte-missing-at-offset-811537.wav | Status: Well-Formed and valid |
So, Jhove was unable to detect any of the damaged files at all!
Shntool
Shntool checks a WAVE on six criteria, which are listed in its output under ‘Possible problems’:
Possible problems:
File contains ID3v2 tag: no
Data chunk block-aligned: yes
Inconsistent header: no
File probably truncated: no
Junk appended to file: no
Odd data size has pad byte: n/a
The thing to watch here is the ‘File probably truncated’ item:
File | Result |
---|---|
frogs-01.wav | File probably truncated: no |
frogs-01-last-byte-missing.wav | File probably truncated: yes (missing 1 byte) |
frogs-01-last-2032-bytes-missing.wav | File probably truncated: yes (missing 2032 bytes) |
frogs-01-byte-missing-at-offset-811537.wav | File probably truncated: yes (missing 1 byte) |
So, Shntool was able to detect all damaged files.
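In an automated workflow this line could be checked by a script rather than by eye. The sketch below assumes the ‘File probably truncated: yes/no’ wording shown in the output above; the function name `is_truncated` is hypothetical.

```shell
#!/bin/sh
# is_truncated: read 'shntool info' output on standard input and succeed
# (exit status 0) if it flags the file as probably truncated.
# Assumes the 'File probably truncated: yes/no' wording shown in the post.
is_truncated() {
    grep -Eq 'File probably truncated:[[:space:]]*yes'
}

# Example:
# if shntool info foo.wav | is_truncated; then
#     echo "foo.wav: probably truncated"
# fi
```

One caveat with this design: if Shntool produces no such line at all (for example because it cannot read the file), the function reports "not truncated", so a more careful check would also treat a missing line as a failure.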
Ffmpeg
For our Ffmpeg call we monitor any errors that are sent to the standard error stream. The results:
File | Result |
---|---|
frogs-01.wav | – |
frogs-01-last-byte-missing.wav | [pcm_s16le @ 0x3545380] Invalid PCM packet, data has size 3 but at least a size of 4 was expected Error while decoding stream #0:0: Invalid data found when processing input |
frogs-01-last-2032-bytes-missing.wav | – |
frogs-01-byte-missing-at-offset-811537.wav | [pcm_s16le @ 0x2768380] Invalid PCM packet, data has size 3 but at least a size of 4 was expected Error while decoding stream #0:0: Invalid data found when processing input |
Interestingly, Ffmpeg reports an error for both files that have 1 byte missing, but not for the file that has 2032 bytes missing. Since it misses one of the three damaged files, Ffmpeg does not appear to be suitable for detecting broken WAVE files.
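One possible explanation, which is not confirmed in this post: the error message suggests that the decoder expects data in units of 4 bytes, presumably one sample frame of 16-bit stereo audio. Removing 1 byte leaves a 3-byte remainder at the end of the data, whereas the larger missing chunk is a whole multiple of 4 bytes, so the file still ends on a frame boundary. The arithmetic can be sketched as follows; the function name `leftover_bytes` and the 4-byte frame size are assumptions taken from the error message.

```shell
#!/bin/sh
# leftover_bytes MISSING FRAME_SIZE: number of bytes left in the final,
# incomplete sample frame after MISSING bytes are removed from data that
# originally was a whole number of frames (0 means no partial frame remains).
leftover_bytes() {
    echo $(( ($2 - $1 % $2) % $2 ))
}

# leftover_bytes 1 4      prints 3 (matches "data has size 3")
# leftover_bytes 2032 4   prints 0 (no partial frame, so no error)
```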
Mediainfo
Mediainfo didn’t report errors or warnings for any of these files. This is not surprising, but it does confirm that Mediainfo cannot be used for detecting broken WAVE files.
FLAC dataset
Analogous to the WAVE dataset, I started out with a small, intact FLAC file, which I then butchered into the following derivative files:
- frogs-01-last-byte-missing.flac – one byte is missing at the end of the file
- frogs-01-last-1000-bytes-missing.flac – a chunk of 1000 bytes is missing at the end of the file
- frogs-01-byte-missing-at-offset-651202.flac – one byte is missing at offset 651202
Candidate tools, FLAC
The set of candidate tools is identical to the one used for the WAVE analysis, with two exceptions:
- flac is the reference implementation of the FLAC format. The tested version is 1.3.0.
- Since Jhove does not include a FLAC module, it was not used.
Flac
The Flac tool is able to encode audio to FLAC, and decode and analyze FLAC files. For these tests I ran it with the -t (or --test) option:
flac -t foo.flac
This decodes a FLAC without writing the decoded data to a file. Any errors during the decoding process are reported to the standard error stream.
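In a script, the outcome of such a test run could be turned into a simple pass/fail result. The sketch below assumes that `flac -t` exits with a non-zero status when decoding fails, which this post does not state explicitly; `test_flac` and the `FLAC_BIN` override are hypothetical names.

```shell
#!/bin/sh
# Assumption: 'flac -t' exits with a non-zero status when decoding fails.
# FLAC_BIN is a hypothetical override, handy for trying the function with a
# stand-in command.
FLAC_BIN=${FLAC_BIN:-flac}

# test_flac FILE: print "OK: FILE" and return 0 if the test run succeeds,
# otherwise print "DAMAGED: FILE" and return 1.
test_flac() {
    if "$FLAC_BIN" -t "$1" > /dev/null 2>&1; then
        echo "OK: $1"
    else
        echo "DAMAGED: $1"
        return 1
    fi
}

# Example: test_flac foo.flac
```

Note that with this design a missing `flac` binary is also reported as DAMAGED, so a real workflow would want to check for the tool first.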
Results, FLAC
The full output results of each tool can be found here.
Shntool
Even though Shntool supports FLAC, it was not able to detect the missing data in any of the files:
File | Result |
---|---|
frogs-01.flac | File probably truncated: no |
frogs-01-last-byte-missing.flac | File probably truncated: no |
frogs-01-last-1000-bytes-missing.flac | File probably truncated: no |
frogs-01-byte-missing-at-offset-651202.flac | File probably truncated: no |
So, Shntool does not provide any meaningful information on whether a FLAC is damaged.
Ffmpeg
Here are the results for Ffmpeg:
File | Result |
---|---|
frogs-01.flac | – |
frogs-01-last-byte-missing.flac | [flac @ 0x294b860] overread: 1 Error while decoding stream #0:0: Invalid data found when processing input |
frogs-01-last-1000-bytes-missing.flac | [flac @ 0x3c5d860] overread: 1 Error while decoding stream #0:0: Invalid data found when processing input |
frogs-01-byte-missing-at-offset-651202.flac | [flac @ 0x279faa0] overread: 1 Error while decoding stream #0:0: Invalid data found when processing input |
So, Ffmpeg was able to identify all damaged FLACs.
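Since the Ffmpeg check relies on whether anything is written to the standard error stream, it could be scripted along the following lines. This is a sketch only; `stderr_of` and `reports_errors` are hypothetical helper names, and the Ffmpeg call in the example is the FLAC counterpart of the one used earlier for WAVE.

```shell
#!/bin/sh
# stderr_of CMD...: run CMD, discard its standard output and print only what
# it wrote to standard error.
stderr_of() {
    # Redirections apply left to right: standard error is first pointed at
    # the current standard output, then standard output goes to /dev/null.
    "$@" 2>&1 > /dev/null
}

# reports_errors CMD...: succeed if CMD wrote anything to standard error.
reports_errors() {
    [ -n "$(stderr_of "$@")" ]
}

# Example:
# if reports_errors ffmpeg -v error -i foo.flac -f null -; then
#     echo "foo.flac: errors reported"
# fi
```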
Mediainfo
Similar to the WAVE results, Mediainfo again didn’t report errors or warnings for any of these files.
Flac
Finally, the results for the Flac tool:
File | Result |
---|---|
frogs-01.flac | – |
frogs-01-last-byte-missing.flac | ERROR while decoding data state = FLAC__STREAM_DECODER_END_OF_STREAM |
frogs-01-last-1000-bytes-missing.flac | ERROR while decoding data state = FLAC__STREAM_DECODER_END_OF_STREAM |
frogs-01-byte-missing-at-offset-651202.flac | ERROR while decoding data state = FLAC__STREAM_DECODER_READ_FRAME |
So, the Flac tool was able to identify all defective files [2].
Conclusion
Out of the candidate tools considered here, only Shntool was able to identify all damaged WAVE files in this experiment. As a result, this (ancient!) tool still appears to be the best choice for detecting damaged WAVE files. Surprisingly, Jhove was unable to detect any of the damaged files at all, and is probably best avoided for this particular purpose. For FLAC, both the Flac tool (FLAC reference implementation) and Ffmpeg were able to detect all damaged files, and both appear to be suitable tools.
Dataset and scripts
All example files, scripts and raw tool output are available here:
https://github.com/KBNLresearch/detectDamagedAudio
Post scriptum: update on MediaInfo and MediaConch
In response to this post the developers of MediaInfo added support for detecting truncated WAVE files. This should cover all of the damaged WAVE files presented here. Moreover, their Twitter account announced that detection of FLAC flaws is planned for the MediaConch tool, but that they are looking for sponsors for this.
[1] Also, this thread on superuser.com recommends Ffmpeg for checking the integrity of video files.
[2] On a side note, I noticed that the error stream of the Flac tool sometimes contained a sequence of 21 non-printable ‘0x08’ (backspace) characters. This is probably a bug.
David Russo
November 21, 2019 @ 2:10 pm CET
As an update on JHOVE’s performance: JHOVE 1.20 correctly reports truncated files as not well-formed, along with the number of bytes missing and the ID of the truncated chunk when available.
Yvonne Tunnat
January 16, 2017 @ 10:15 am CET
Hi Johan,
this is really interesting. I do similar stuff with JPEG & TIFF and reading your post, I have a great idea how to test JHOVE and PDF files.
I am really jealous of your cool script, I really have to extend my script-skills. 🙂
Best, Yvonne
johan
January 4, 2017 @ 3:52 pm CET
Hi Carl, yes of course you can use those WAVs. I just added a license statement to the repo’s readme (CC-BY).
Carl Wilson
January 4, 2017 @ 3:29 pm CET
Hi Johan, this is interesting work. I’m commenting as I’m currently adding some test WAV files to JHOVE. I presume I’m free to borrow the WAV data here? I have other sources but some synthetic, broken files would be useful. FYI this doesn’t mean that the WAV module will be fixed to address these testing issues right away. Using other FOSS tools to test JHOVE seems to be the best way of ensuring that JHOVE is doing its job properly. This blog post is an example of similar work using Bad Peggy to test JHOVE’s JPEG validator: https://staging.openpreservation.org/blog/2016/11/29/jpegvalidation/