Links all of the way down: HTTPreservation of Documentary Heritage

Much of the inspiration for this blog came from this source here.

According to UNESCO, the authenticity of a record can be jeopardized by:

  • Threats to integrity. Changes to the content of the object itself also potentially damage authenticity. Most such changes stem from threats to the object at a data level.

A hyperlink is data. It is content and it can be evidential; at the very least, a missing link makes us question what was there.

Intersection of the web and Documents

Web preservation is largely focused on crawling the known web. But what about the links that are locked in documents transferred or deposited to a collecting institution?

Studies that look at this include:

The studies each look at the extent of ‘link rot’ in scholarly publishing, largely in PDF documents.

Jones et al. show that:

  • In total, 81.5% of the URI references in their corpus suffer from link rot or content drift.
  • Only 21.8% that suffer link rot (67,071 out of 308,162) have representative snapshots (Mementos).
  • 59.7% that suffer content drift (184,065 out of 308,162) also have representative Mementos.
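The figures in this list can be checked against the counts quoted alongside them. A minimal sketch (the variable and function names are illustrative, not taken from Jones et al.) showing that the two smaller percentages follow from the counts and add up to the 81.5% total:

```python
# Counts as quoted in the list above.
total = 308_162
link_rot = 67_071
content_drift = 184_065


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(pct(link_rot, total))                  # 21.8
print(pct(content_drift, total))             # 59.7
print(pct(link_rot + content_drift, total))  # 81.5
```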

So link rot and content drift are happening in some domains, and strategies have been proposed for combating them in the long term. Burnhill et al. summarize a few of those that have been discussed in the community:

  1. Using robust links during preparation of a paper.
  2. Archiving links during submission to a publication system.
  3. Post-publication, having the user utilize a Memento service, e.g. via the Time Travel API, to retrieve links when they access an article.

Everything that has been has passed

This blog sets itself apart from strategy 1, using robust links in the preparation of a paper; let’s side-line that as a potential future state for government. Government-hosted robust links for agencies?

What I want to look at is what we do in the position of a receiving institution where records are years (potentially decades) old. This puts us into the domains of strategies 2. and 3.

I suggest a workflow as follows:

  1. EXTRACT
  2. AUDIT
  3. REPORT
  4. COLLECT
  5. DESCRIBE

I walk through these below.

Extract

We use a tool to access the content of a record and extract all the links.
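As a rough illustration of the link-finding part of this step only, here is a minimal Python sketch. It assumes the text of a record has already been extracted by some other tool, and the regular expression is a deliberately loose assumption of mine, not the pattern any particular tool uses:

```python
import re

# Loose, illustrative pattern: a scheme followed by non-space characters.
LINK_PATTERN = re.compile(r"(?:https?|ftp)://[^\s<>\"']+", re.IGNORECASE)


def extract_links(text):
    """Return the links found in text, in order, without duplicates."""
    seen = []
    for match in LINK_PATTERN.findall(text):
        link = match.rstrip(".,;:)]")  # drop trailing sentence punctuation
        if link not in seen:
            seen.append(link)
    return seen


sample = "See https://example.org/report.pdf, or ftp://example.org/data (mirror)."
print(extract_links(sample))
```

A pattern this simple will miss links that are stored as document metadata rather than visible text, which is one reason to rely on a dedicated extraction tool in practice.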

Audit

The collecting institution should have a record of the types of link in the collections that have been transferred or deposited.

  • Types of link: HTTP, HTTPS, FTP; internet-based, e.g. a link to a Wikipedia article; or intranet-based, e.g. a link to an enterprise content management system (ECMS).
  • Status of the link: 2xx OK, 4xx Client Error, moved, gone, pay-walled, requires authentication, etc.
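The two audit dimensions above could be recorded with something like the following sketch. The labels and the grouping of status codes are my own assumptions for illustration; states such as ‘pay-walled’ usually cannot be told from a status code alone and would need a human or a content check:

```python
from urllib.parse import urlsplit


def link_type(url):
    """Return the URL scheme in upper case, e.g. 'HTTPS' or 'FTP'."""
    return urlsplit(url).scheme.upper() or "UNKNOWN"


def status_label(code):
    """Map an HTTP status code to a coarse, illustrative audit label."""
    if 200 <= code < 300:
        return "ok"
    if code in (301, 302, 303, 307, 308):
        return "moved"
    if code in (401, 407):
        return "requires authentication"
    if code in (404, 410):
        return "gone"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"


print(link_type("https://en.wikipedia.org/wiki/Link_rot"), status_label(200))
print(link_type("ftp://example.org/data"), status_label(404))
```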

Report

This report is used internally, and is distributed to the agency (and depositor) to inform them of the state of things. In our developing line of work, it is evidence of what needs to be done to improve processes in future.
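At its simplest, the core of such a report is a count of links per status. A minimal sketch, using made-up audit rows (the filenames, links, and labels are hypothetical):

```python
from collections import Counter

# Hypothetical audit rows: (filename, link, status label).
audit = [
    ("minutes.doc", "https://example.org/a", "ok"),
    ("minutes.doc", "https://example.org/b", "gone"),
    ("report.pdf", "http://example.org/c", "gone"),
]


def summarise(rows):
    """Count links per status label across all audited files."""
    return Counter(status for _filename, _link, status in rows)


summary = summarise(audit)
for status, count in summary.most_common():
    print(f"{status}: {count}")
```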

Collect

Links that are alive still need collecting. This can be as simple as saying ‘save this’ to the Internet Archive or another service like Archive-It. For links that are gone, can we use a Memento (archived web snapshot) service to find a link to the original record? For ECMS links, can we still collect the original document as part of a transfer or split transfer?
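To make the two web-facing cases concrete, the sketch below only builds the request URLs; it makes no network calls. The URL patterns for the Internet Archive's save endpoint and the Memento Time Travel JSON API are given as I understand them and should be checked against each service's current documentation before use:

```python
from datetime import datetime

# Assumed URL prefixes; verify against each service's documentation.
SAVE_PREFIX = "https://web.archive.org/save/"
TIMETRAVEL_PREFIX = "http://timetravel.mementoweb.org/api/json/"


def save_request_url(link):
    """URL that asks the Internet Archive to capture a live link."""
    return SAVE_PREFIX + link


def memento_lookup_url(link, when):
    """URL that asks for Mementos of a link closest to a given datetime."""
    return f"{TIMETRAVEL_PREFIX}{when:%Y%m%d%H%M%S}/{link}"


print(save_request_url("https://example.org/page"))
print(memento_lookup_url("https://example.org/page", datetime(2017, 6, 1)))
```

Passing the approximate date of the record as `when` asks for the snapshot closest to what the record's author would have seen, rather than the most recent capture.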

Describe

You have at least three states of potentially evidential hyperlink in your record:

  • Links that are OK (and now saved).
  • Links that have gone but are available as a Memento.
  • Links that have disappeared entirely.

When a user accesses your catalogue, I suggest we give them the courtesy of knowing what to expect: whether a link has gone or not. We can use our catalogue description to do this. We can treat it as a link out to another record, as a mechanism to aid discovery, or as information listed in a preservation status field.
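One way to picture the three states as catalogue text is the sketch below. The function name and the wording of each note are hypothetical; a real catalogue would follow its own descriptive standard:

```python
def describe_link(link, is_live, memento_url=None):
    """Return a one-line note for a catalogue description field."""
    if is_live:
        return f"{link}: live at time of audit, and saved"
    if memento_url:
        return f"{link}: no longer live; archived copy at {memento_url}"
    return f"{link}: no longer live; no archived copy found"


print(describe_link("https://example.org/a", True))
print(describe_link("https://example.org/b", False, "https://archive.example/b"))
print(describe_link("https://example.org/c", False))
```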

Sixth-stage – Fix?

It is entirely possible to take a (digital) scalpel to a record, lift out the broken link, and replace it with something robust – to fix it. This would result in a second preservation master. Should we? – I’ll leave that up to a discussion.

HTTPreserve and this Workflow

I’ve created a suite of small utilities to help with this work. The three main components (out of a handful) are:

HTTPreserve: tikalinkextract

tikalinkextract is built on the premise that we can use Apache Tika’s server mode to bulk extract content from the files that Tika can handle. It creates a CSV containing the filename it extracted a link from, and the link itself. It is an alternative to the method proposed in Zhou et al. (2014), who use the tool pdftohtml to extract links from the XML that tool can produce.

The two methods should be complementary, with Tika better able to handle OLE2 file types, which constitute a large part of our collection at Archives NZ.

Edit: September, 2019: I have written an OPF blog post about this tool specifically.

HTTPreserve: HTTPreserve and HTTPreserve.info

HTTPreserve is the main engine of the work.

It is also a web service: http://httpreserve.info

The utility and web-service have two API (application programming interface) methods:

  • /httpreserve?url={url}&filename={filename.txt} – provides httpreserve stats about a link
  • /save?url={url}&filename={filename.txt} – manages a save transaction with the Internet Archive and returns an httpreserve stat structure to the user

Give it a try!

http://httpreserve.info/httpreserve?url=https://staging.openpreservation.org/&filename=filename.doc

CURL:

curl -i -X GET "http://httpreserve.info/httpreserve?url=https://staging.openpreservation.org/&filename=filename.doc"

The returned structure is designed to aid reporting. The optional filename is used to tag the result, and the structure includes a Base64-encoded image snapshot of the website that can be used to detect content drift.

The command will work in CURL. You can also use CURL to glean more information:

curl -s -i -X OPTIONS http://httpreserve.info | less
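For scripted use, the request URLs for the two API methods described above can be assembled programmatically. This Python sketch only builds the URL and sends nothing; it percent-encodes the parameters, on the assumption that the service decodes query strings in the standard way:

```python
from urllib.parse import urlencode

BASE = "http://httpreserve.info"


def request_url(method, link, filename=None):
    """Build a request URL for the 'httpreserve' or 'save' method."""
    if method not in ("httpreserve", "save"):
        raise ValueError(f"unknown method: {method}")
    params = {"url": link}
    if filename:  # the filename tag is optional
        params["filename"] = filename
    return f"{BASE}/{method}?{urlencode(params)}"


print(request_url("httpreserve", "https://staging.openpreservation.org/", "filename.doc"))
print(request_url("save", "https://staging.openpreservation.org/"))
```

Encoding the link matters when the link being audited has its own query string, since an unencoded `&` would otherwise be read as the start of a new parameter.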

And the tool is itself a Golang library, which is documented here:

https://godoc.org/github.com/httpreserve/httpreserve

HTTPreserve: workbench 

Workbench is a first attempt to wrap some of the bulk capability of HTTPreserve. It has a couple of modes, two of which are of note.

The webapp (workbench) is designed to give archivists full control over their process of auditing and ‘collecting’. I hope to add some tagging functionality to output in a report as well. Tags might be ‘content drifted’, ‘content gone’. These could all be part of an aggregate report.

There may also be effective automated processes already in existence that I’d like to hear more about.

CSV output:

The CSV may be useful to recreate studies such as those in the introduction – to create consistent reporting for anyone researching link rot and content drift across domains. It’s also a more traditional way of recording information for our organization’s records.

A JSON output can also be created with the same information.
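To show how the same audit information can be carried in both forms, here is a minimal sketch using Python's standard library. The column names and rows are hypothetical, and the workbench's real output may well use different fields:

```python
import csv
import io
import json

# Hypothetical column names and rows, for illustration only.
FIELDS = ["filename", "link", "status"]
rows = [
    {"filename": "minutes.doc", "link": "https://example.org/a", "status": "ok"},
    {"filename": "report.pdf", "link": "http://example.org/c", "status": "gone"},
]


def to_csv(records):
    """Serialise audit records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialise the same audit records as JSON text."""
    return json.dumps(records, indent=2)


print(to_csv(rows))
print(to_json(rows))
```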

What next?

The UNC snapshot doesn’t have images where the Internet Archive one does; the Internet Archive snapshot is potentially a more complete record of this website.

  • We need to determine the use of these tools in an ingest workflow, i.e. as part of characterization. The basic functionality of HTTPreserve and http://httpreserve.info gives us a lot of flexibility to incorporate it into tools.
  • Determine the extent to which these tools should be incorporated into archival workflows, e.g. collection strategies, and within archival description.

I also want to hear from folk who are interested in this work, in testing it, and/or in the proposed methodology, and who might be able to help me test that.

I also want to hear from folk with tools that might already do some or all of this better.

If you’ve made it to the end of this post, you may also be interested in Nicola Laurent’s writing about the emotional impact of the 404 error.

This blog is in preparation for being at WADL2017 in Toronto where I’ll be presenting a poster and lightning talk on this work. I would love to hear more from y’all there. 
