Much of the inspiration for this blog came from this source here.
According to UNESCO, the authenticity of a record can be jeopardized by:
- Threats to integrity. Changes to the content of the object itself also potentially damage authenticity. Most such changes stem from threats to the object at a data level.
A hyperlink is data. It is content; it can be evidential; and at the very least, a missing link makes us question what was there.
Intersection of the web and Documents
Web preservation is largely focused on crawling the known web. But what about the links that are locked in documents transferred or deposited to a collecting institution?
Studies that look at this include:
- Burnhill et al. (2015) http://hiberlink.org/Insight.htm describe the extent of the problem,
- Zittrain et al. (2014), for the Harvard Law Review, describe the extent of the problem and look at previous studies too: https://harvardlawreview.org/2014/03/perma-scoping-and-addressing-the-problem-of-link-and-reference-rot-in-legal-citations/
- Jones et al. (2016) http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0167475
- Zhou et al. (2014) http://homepages.inf.ed.ac.uk/kzhou2/papers/dl2014-zhou.pdf
The studies each look at the extent of ‘link rot’ in scholarly publishing – largely PDF in each.
Jones et al. show that:
- 81.5% of the URI references in the articles in their corpus suffer from link rot or content drift.
- Only 21.8% that suffer link rot (67,071 out of 308,162) have representative snapshots (Mementos).
- 59.7% that suffer content drift (184,065 out of 308,162) also have representative Mementos.
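As a quick sanity check, the percentages quoted from Jones et al. can be reproduced from the counts given in the bullets. This is only arithmetic on the numbers in this post, not a re-analysis of their data.

```python
# Counts as quoted in this post from Jones et al. (2016).
TOTAL = 308_162
LINK_ROT = 67_071
CONTENT_DRIFT = 184_065


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


rot_pct = pct(LINK_ROT, TOTAL)
drift_pct = pct(CONTENT_DRIFT, TOTAL)
combined_pct = pct(LINK_ROT + CONTENT_DRIFT, TOTAL)

print(rot_pct, drift_pct, combined_pct)
```

The two categories together account for the 81.5% headline figure.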
So link rot and content drift are happening in some domains, and strategies have been proposed for combating them in the long term. Burnhill et al. summarize a few of those that have been discussed in the community:
1. Using robust links during preparation of a paper
2. Archiving links during submission to a publication system
3. Post-publishing, with the user utilizing a Memento service, e.g. via the Time Travel API, to retrieve links when they access an article.
Everything that has been has passed
This blog sets aside strategy 1, using robust links in the preparation of a paper; let’s side-line that as a potential future state for government. Government-hosted robust links for agencies?
What I want to look at is what we do in the position of a receiving institution where records are years (potentially decades) old. This puts us into the domain of strategies 2 and 3.
I suggest a workflow as follows:
- EXTRACT
- AUDIT
- REPORT
- COLLECT
- DESCRIBE
I walk through these below.
Extract
We use a tool to access the content of a record and extract all the links.
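To make the extract step concrete, here is a minimal sketch in Python. It assumes the text of the record has already been pulled out by some other tool (Apache Tika, for instance), and it uses a deliberately simple regular expression of my own; it is not how any particular extraction tool works, and real-world link syntax will defeat it in places.

```python
import re

# Assumption: only http, https and ftp links are of interest here.
# The pattern stops at whitespace and common closing delimiters.
URL_PATTERN = re.compile(r"""(?:https?|ftp)://[^\s<>"')\]]+""", re.IGNORECASE)


def extract_links(text):
    """Return the unique links found in text, in order of first appearance."""
    links = []
    for match in URL_PATTERN.findall(text):
        # Trailing sentence punctuation is rarely part of the link itself.
        link = match.rstrip(".,;:!?")
        if link not in links:
            links.append(link)
    return links
```

For example, `extract_links("See http://example.com/a, and (https://example.org/b).")` yields the two links without the trailing comma or bracket.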
Audit
The collecting institution should have a record of the types of link in the collections that have been transferred or deposited.
- Types of link: HTTP, HTTPS, FTP, internet-based, or intranet-based e.g. link to an enterprise content management system; or link to a Wikipedia article etc.
- Status of the link: 2xx OK, 4xx Client Error, moved, gone, pay-walled, requires authentication, etc.
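The two audit dimensions above could be recorded with something like the following sketch. The category names and the rule for guessing whether a host is on an intranet are my own assumptions for illustration. A pay-wall usually cannot be detected from the status code alone, so it is left out.

```python
from urllib.parse import urlsplit


def link_type(link):
    """Return (scheme, reach) where reach is 'internet', 'intranet' or 'unknown'."""
    parts = urlsplit(link)
    scheme = parts.scheme.lower() or "unknown"
    host = parts.hostname or ""
    if not host:
        return scheme, "unknown"
    # Assumption: a bare hostname, or a .local/.internal suffix, means intranet.
    if "." not in host or host.endswith((".local", ".internal")):
        return scheme, "intranet"
    return scheme, "internet"


def status_category(code):
    """Map an HTTP status code onto a coarse, report-friendly category."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "moved"
    if code in (401, 403):
        return "requires authentication"
    if code == 404:
        return "not found"
    if code == 410:
        return "gone"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"
```

An ECMS link such as `http://ecms/doc/123` would be typed as `("http", "intranet")`, while a Wikipedia link would be typed as `("https", "internet")`.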
Report
This report is used internally, and is distributed to the agency (or depositor) to inform them of the state of things. In our developing line of work, it is evidence of what needs to be done to improve processes in future.
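A report of this kind could start as a simple tally. The sketch below assumes each audited link has been reduced to a hypothetical `(filename, link, status)` row, with status labels chosen by the institution.

```python
from collections import Counter


def summarise(audit_rows):
    """Return plain-text summary lines for (filename, link, status) rows."""
    counts = Counter(status for _, _, status in audit_rows)
    total = sum(counts.values())
    lines = [f"{total} links audited"]
    # Most frequent status first, so the headline problem is at the top.
    for status, n in counts.most_common():
        lines.append(f"{status}: {n} ({100 * n / total:.0f}%)")
    return lines
```

The plain-text lines are meant for readers who do not need the HTTP detail; the rows themselves stay available for anyone who does.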
Collect
Links that are alive still need collecting. This can be as simple as saying ‘save this’ to the Internet Archive or another service like Archive-It. For links that are gone, can we use a Memento (archived web snapshot) service to find a link to the original record? For ECMS links, can we still collect the original document as part of a transfer or split transfer?
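As a rough illustration of the two web-facing questions, the sketch below only builds the request URLs. It assumes the Internet Archive's public ‘Save Page Now’ address (`https://web.archive.org/save/`) and its Wayback availability endpoint (`https://archive.org/wayback/available`) as I understand them; other Memento services will have their own conventions, and no response format is assumed here.

```python
from urllib.parse import urlencode


def save_request_url(link):
    """URL that asks the Internet Archive to capture a live link."""
    return "https://web.archive.org/save/" + link


def memento_lookup_url(link, timestamp=None):
    """URL that asks the Wayback availability endpoint for the closest snapshot.

    timestamp is an optional YYYYMMDD[hhmmss] string, e.g. the date of the record.
    """
    params = {"url": link}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)
```

Passing the date of the record as the timestamp is one way of asking for the snapshot closest to what the author would have seen.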
Describe
You have at least three states of potentially evidential hyperlink in your record:
- Links that are OK (and now saved).
- Links that have gone but are available as a Memento.
- Links that have disappeared entirely.
When a user accesses your catalogue, I suggest we give them the courtesy of knowing what to expect: whether a link has gone or not. We can use our catalogue description to do this. We can look at it as a link out to another record; a mechanism to aid discovery; or it could be listed as information in a preservation status field.
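The three states map naturally onto a short note for a preservation status field. The wording below is hypothetical, and a real catalogue would have its own controlled vocabulary.

```python
def preservation_status(live, memento_url=None):
    """Return a catalogue note for one hyperlink, covering the three states."""
    if live:
        return "Link OK at time of audit; snapshot requested"
    if memento_url:
        return f"Link gone; archived snapshot available at {memento_url}"
    return "Link gone; no archived snapshot found"
```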
Sixth-stage – Fix?
It is entirely possible to take a (digital) scalpel to a record, lift out the broken link, and replace it with something robust – to fix it. This would result in a second preservation master. Should we? – I’ll leave that up to a discussion.
HTTPreserve and this Workflow
I’ve created a suite of small utilities to help with this work. The three main components (out of a handful) are:
HTTPreserve: tikalinkextract
Built on the premise that we can use Apache Tika’s server mode to bulk extract content from the files that Tika can handle. Tikalinkextract creates a CSV containing the filename it extracted a link from, and the link itself. It is an alternative to the proposed method in Zhou et al. (2014). Zhou et al. use the tool pdftohtml to extract links from the XML the tool can produce.
The two methods should be complementary, with Tika better able to handle OLE2 file types, which constitute (at Archives NZ) a large part of our collection.
Edit: September, 2019: I have written an OPF blog post about this tool specifically.
HTTPreserve: HTTPreserve and HTTPreserve.info
HTTPreserve is the main engine of the work.
It is also a web service: http://httpreserve.info
The utility and web-service have two API (application programming interface) methods:
- /httpreserve?url={url}&filename={filename.txt} – provides httpreserve stats about a link
- /save?url={url}&filename={filename.txt} – manages a save transaction with the Internet Archive and returns an httpreserve stat structure to the user
Give it a try!
http://httpreserve.info/httpreserve?url=https://staging.openpreservation.org/&filename=filename.doc
CURL:
curl -i -X GET "http://httpreserve.info/httpreserve?url=https://staging.openpreservation.org/&filename=filename.doc"
The structure is designed to aid reporting. It uses the filename (optional) as a way of tagging it, and it includes a Base64 encoded image snapshot of the website that can be used for detection of content drift.
The command above will work in curl. You can also use curl to glean more information:
curl -s -i -X OPTIONS http://httpreserve.info | less
And the tool is itself a Golang library, which is documented here:
https://godoc.org/github.com/httpreserve/httpreserve
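For scripting against the web service, the two API methods described earlier can be addressed by building the query string programmatically. This is a sketch in Python rather than Go, using only the endpoint and parameter names given in this post; the helper name `api_url` is my own, and nothing is assumed about the fields of the structure that comes back.

```python
from urllib.parse import urlencode

BASE = "http://httpreserve.info"


def api_url(method, link, filename=None):
    """Build a request URL for the /httpreserve or /save method."""
    if method not in ("httpreserve", "save"):
        raise ValueError(f"unknown method: {method}")
    params = {"url": link}
    if filename:
        # The filename is optional and is only used to tag the result.
        params["filename"] = filename
    # urlencode percent-encodes the link so that any '&' or '?' inside it
    # cannot be mistaken for part of the service's own query string.
    return f"{BASE}/{method}?{urlencode(params)}"
```

The resulting URL can be fetched with `urllib.request.urlopen` or passed to curl in quotes.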
HTTPreserve: workbench
Workbench is a first attempt to wrap some of the bulk capability of HTTPreserve, and has a couple of modes; two of note.
The webapp (workbench) shown above is designed to give archivists full control over their process of auditing and ‘collecting’. I hope to add some tagging functionality to output in a report as well. Tags might be ‘content drifted’, ‘content gone’. These could all be part of an aggregate report.
There may also be effective automated processes which exist that I’d like to hear more about.
CSV output:
The CSV may be useful to recreate studies such as those in the introduction – to create consistent reporting for anyone researching link rot and content drift across domains. It’s also a more traditional way of recording information for our organization’s records.
A JSON output can also be created with the same information.
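To illustrate how the same audit information can be written in both forms, here is a small sketch. The column names are hypothetical and are not the actual workbench output.

```python
import csv
import io
import json

# Hypothetical columns; the real output may carry more fields.
FIELDS = ["filename", "link", "status"]


def to_csv(rows):
    """Serialise a list of row dictionaries as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialise the same rows as a JSON array."""
    return json.dumps(rows, indent=2)
```

Keeping one list of rows and two serialisers means the CSV for the organization’s records and the JSON for other tools cannot drift apart.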
What next?
- Exploring our collections at Archives NZ. We’ve ~6000 records to scan and audit for hyperlink quality.
- Creation of an ‘executive’ report summary of our results that is easy for all to understand without deep knowledge of HTTP status codes and web archiving principles.
- Exploring Memento. Using other internet archives in this work will be beneficial to all, but how do we choose the best snapshot (Memento)? Take, for example, these two snapshots of http://bbc.co.uk, one from the University of North Carolina (UNC): http://wayback.archive-it.org/all/19961221203254/http://www0.bbc.co.uk/ and one from the Internet Archive: https://web.archive.org/all/19961221203254/http://www0.bbc.co.uk/
The UNC snapshot doesn’t have images where the Internet Archive one does, so the Internet Archive snapshot is potentially the more complete record of this website.
- We need to determine the use of these tools in an ingest workflow, i.e. as part of characterization. The basic functionality of HTTPreserve and http://httpreserve.info gives us a lot of flexibility to incorporate it into tools.
- Determine the extent to which these tools should be incorporated into archival workflows, e.g. collection strategies, and within archival description.
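On the question of choosing the best Memento, one thing a script can check cheaply is the capture time embedded in a Wayback-style URL. The sketch below assumes the 14-digit `YYYYMMDDhhmmss` path segment seen in the two bbc.co.uk examples; it says nothing about completeness, which is the harder part.

```python
import re
from datetime import datetime


def snapshot_time(memento_url):
    """Return the capture time embedded in a Wayback-style URL, or None."""
    match = re.search(r"/(\d{14})/", memento_url)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


unc = snapshot_time("http://wayback.archive-it.org/all/19961221203254/http://www0.bbc.co.uk/")
ia = snapshot_time("https://web.archive.org/all/19961221203254/http://www0.bbc.co.uk/")
print(unc == ia)
```

Both example snapshots carry the same timestamp, so capture time alone cannot separate them; some measure of completeness, such as whether embedded images were captured, is still needed.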
I also want to hear from folk who are interested in this work, testing it, and/or the proposed methodology and might be able to help me test that.
I also want to hear from folk with tools that might already do some, or all of this better.
If you’ve made it to the end of this post, you may also be interested in Nicola Laurent’s writing about the emotional impact of the 404 error.
This blog is in preparation for being at WADL2017 in Toronto where I’ll be presenting a poster and lightning talk on this work. I would love to hear more from y’all there.