An Analysis Engine for the DROID CSV Export
I have been working on some code to ensure the accurate and consistent output of any file format analysis based on the DROID CSV export, example here. One way of looking at it is an executive summary of a DROID analysis, except I don't think executives, as such, will be its primary user-base.
The reason for pushing the code and this blog out now is to seek feedback on what else might be useful for users. Specifically, formatting of output, and seeking other use-cases for how others analyse the same results to understand if I can incorporate these methods into my work to the benefit of the community.
The hope is that the results output by the tool can be used by digital preservation researchers, analysts, coders, archivists and digital archivists alike – where there is such a distinction to be drawn.
The tool is split into two or three components depending on how you break it down.
This places DROID CSV export data into a SQLite database with the same filename as the input CSV file.
The process adds two additional columns to the saved table, URI_SCHEME so we can query to a greater granularity the URI scheme used by the various URIs output by DROID; and DIR_NAME to enable analysis on base directory names, e.g. to help us understand the break-down of directories in a collection.
This combines the functions of droid2sqlite.py by calling droid2sqlite's primary class. Further it provides a query layer on top of the DROID SQLite database, outputting the results of various queries we might ask of the dataset to the command line.
This is a class created to help spot potentially difficult to handle file names from any DROID CSV output. The class is based on a Microsoft Development Network article but also checks for non-ascii characters and a handful of other characters that can prove problematic, such as square brackets.
Database and Analysis Engine
This work mirrors some of that done by DROID already. DROID outputs an Apache Derby database for its Profile file format. Information on connecting to it can be found on the droid-list mailing list. For my purposes I had a desire to learn the database management system SQLite and more practically I found a greater amount of support for it in terms of libraries available in Python or the applications I can use to access it. Instead of attempting to access the DROID Derby database and build on top of that, I decided to map the results to a SQLite database. SQLite also has features that I like that might lend itself better to long-term preservation enabling the long term storage of the database alongside any collection analysis documentation outside of the digital preservation system, if necessary.
DROID also enables filtering and the generation of reports, however I haven’t found the way it collects information to be useful in the past and so needed a different approach; an approach that gives me greater flexibility to create more reports or manipulate output.
The DROID CSV export is as simple as it needs to be and provides a lot of useful information and so provided an adequate platform for this work in its own right.
The database engine doesn’t have a hard coded schema; it simply reads the column headers in the CSV provided to the tool. Given the appearance of particular columns it creates two additional columns on top to provide greater query granularity, as mentioned above.
The analysis output includes summary statistics, along with listings of PUIDS and file paths depending on the query that we’re interested in. On top of the summary statistics, the following information is output:
- Identified PUIDs and format names
- PUID frequency
- Extension only identification in the collection and frequency
- ID method frequency
- Unique extensions identified across the collection
- Multiple identification listing
- MIME type frequency
- Zero-byte object listing
- No identification listing
- Top signature and identified PUIDs list
- Container types in collection
- Duplicate content listing
- Duplicate filename listing
- Listing of potentially difficult filenames
Total files: 500 Total container objects: 14 Total files in containers: 176 Total directories: 85 Total unique directory names: 75 Total identified files (signature and container): 420 Total multiple identifications (signature and container): 1 Total unidentified files (extension and blank): 80 Total extension ID only count: 17 Total extension mismatches: 32 Total signature ID PUID count: 54 Total distinct extensions across collection: 64 Total files with duplicate content (MD5 value): 155 Total files with duplicate filenames: 117 Percentage of collection identified: 84.0 Percentage of collection unidentified: 16.0
One point to note is that DROID can analyse the contents of container files, or not. In the former case it makes it difficult to generate a count of top-level objects (objects not stored within a container). It is, however, useful to understand both counts where possible, but duplication of reports might be undesirable. The creation of a URI_SCHEME column in the database enables this count to be calculated without the need to run DROID twice. The number of top-level objects in the opf-format-corpus can be calculated by subtracting the number of files in containers from the total number of files, so: 324.
Questions that we’re asking…
As we get into the analysis of a number of collections that we hope will be our first born-digital transfers at Archives New Zealand, we find ourselves asking questions of them to ensure they can be ingested with minimal issue into our long-term preservation system. We also want to ensure access is uninhibited for end-users once it arrives there.
Our first attempt at born-digital transfer sees us do this analysis up-front in an attempt to understand the transfers completely, looking at the issues likely to be thrown up on ingest and what pre-conditioning we are likely to have to do before that stage. Some of the questions are also part of a technical appraisal that will help us to understand what to do with examples of files with duplicate content and those that might otherwise be considered non-records or non-evidential e.g. zero-byte files.
The output of the tool represents ALL of the questions that we have considered so far. We do expect there to be more and better questions to be asked as well. Throwing the code and this blog into the public domain can help us build on this work through public input, so:
- What other information might be useful to query the DROID export for?
- What output format might be most useful?
- What formatting of that output format will best lend itself to data visualisation?
- What other comments and questions do readers of this blog have?
There is still some work to do here. I need to incorporate unit tests into the code base so that everything is as robust as I seek. I imagine future releases of DROID might initially break this tool's compatibility with DROID CSV exports and so that will have to be catered for at the time.
An important note about maintenance is that having created this tool for my day-to-day work I do hope to continue to maintain it for my day-to-day job as necessary.
One of the things I like about accessing DROID results via a database is that we don’t need an analysis layer on top of it. If users have a different requirement of the database than I have catered for then they can simply use the database and use their own queries on top, using their preferred flavour of programming language. Other ways of using such a database might include re-mapping the output to be suitable for cataloguing and archival description, if one desires.
I have considered adding a temporal angle to the database by enabling the storage of multiple DROID reports relating to the same transfer. This could be used to monitor the result of pre-conditioning or analysis of a collection using progressively up-to-date DROID Signature Files. This lends itself to reporting and demonstration of progress to management. The realization of this is more difficult as there doesn’t seem to be a single immutable piece of information we can hook into to make this possible with MD5 hashes likely to change on pre-conditioning, and the potential for file paths to change depending on machine being used to complete a scan. Thoughts on this matter are appreciated.
The tool is licensed under the Zlib license and so can be easily re-used and incorporated into other’s work without issue.