Project to identify files with linked dependencies
Many office suites and other applications allow information to be embedded in a document via a link to another file. Linked spreadsheets are common amongst data-intensive agencies, and large documents are often managed by linking multiple office documents to form a single final product.
Currently we have only anecdotal evidence as to the prevalence of linked files in the digital universe. It would be very useful to understand the scale of the issue and to measure how common linked files are in the material that we ingest. Archives New Zealand and Victoria University have recently initiated a project that we hope will go some way towards achieving this.
A student from the School of Engineering and Computer Science at Victoria University recently started work on a new summer project at Archives New Zealand. The student, Niklas Rehfeld, is funded through a summer scholarship jointly provided by Archives New Zealand and Victoria University.
Over the next 10 weeks Niklas will be working on a project to investigate linked files and build a tool to identify them. Specifically, the aim of this project is to develop a prototype tool that identifies when files in the Microsoft Office 1997-2003 formats link to other files, and which files they link to, so that the component files making up a complex digital object can be identified.
The technical work will involve the following:
- Analysis of the Microsoft specifications to determine how document linking, and other metadata that may be of use for preservation purposes, is implemented in Word, Excel and PowerPoint documents for the period 1997-2003.
- Review of existing frameworks and related tools such as the open source “format identification, validation, and characterization” tool JHOVE.
- Writing a specification for a modular tool for identifying linked documents given a root Microsoft Office document. The specification will include an evaluation of the feasibility of extending an existing tool versus creating a standalone implementation from scratch.
- Implementation of a prototype tool for at least one document format. Time permitting, the tool will be extended either to handle a wider range of document formats or a wider range of preservation metadata.
- Testing of the tool against a selection of files supplied by National Archives.
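To give a feel for the shape such a tool might take, here is a minimal Python sketch; it is not the project's design, which has yet to be specified. It relies on one known fact: Office 1997-2003 files are stored in the OLE2 Compound File Binary container, which begins with the eight-byte signature `D0 CF 11 E0 A1 B1 1A E1`. Everything else is an assumption for illustration. In particular, `extract_links` is a hypothetical callback standing in for the format-specific link parser that the project would need to write, and `collect_components` is an invented name for the walk from a root document to its component files.

```python
from collections import deque
from pathlib import Path
from typing import Callable, Iterable, List

# Signature that opens every OLE2 Compound File Binary container,
# the container used by Word, Excel and PowerPoint 97-2003 files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def is_ole2(path: Path) -> bool:
    """Return True if the file starts with the OLE2 compound file signature."""
    with open(path, "rb") as handle:
        return handle.read(len(OLE2_MAGIC)) == OLE2_MAGIC


def collect_components(
    root: Path,
    extract_links: Callable[[Path], Iterable[Path]],
) -> List[Path]:
    """Walk breadth-first from a root document, following links.

    extract_links is a HYPOTHETICAL per-format parser supplied by the
    caller; this sketch does not parse Word, Excel or PowerPoint itself.
    """
    seen = {root}
    queue = deque([root])
    components = []
    while queue:
        current = queue.popleft()
        components.append(current)
        # Record missing or non-OLE2 targets, but do not try to parse them.
        if not current.exists() or not is_ole2(current):
            continue
        for target in extract_links(current):
            if target not in seen:  # guards against circular links
                seen.add(target)
                queue.append(target)
    return components
```

Keeping the link parser behind a callback mirrors the modular approach described above: support for another document format could be added by supplying another parser, without changing the walk itself.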