Peter May

About Peter May

Posts by Peter May:

August 18 2020

Putting the I in IPS

Peter May Architecture, British Library, IPS, Knowledge Base, Software Repository 0

At iPRES2019, I presented on our Integrated Preservation Suite project. This outlined an internal project we have been undertaking for several years to develop a suite of services, managed around a central web interface, for undertaking preservation planning at scale. The core components include: a Knowledge Base of technical information about file formats and software (implemented as a Neo4J graph database); a Software Repository for preserving software able to render our digital collection items; and a document repository for storing policies, preservation plans, and other preservation related documentation.

Architectural diagram showing functions of the Preservation Workbench (Search, Curation, Watch, Planning, Plugin Management) as well as the other core components of IPS. — **High level IPS architecture**

At the time of the 2019 presentation we were working towards an internal release that supported two main functions: an initial web-form based preservation planning function; and a search page for finding information about software and file formats. Those functions existed in the demonstration I gave at iPRES, but were reliant on an early, unmanaged, import of data into the Knowledge Base.

Since then, and taking into account disruption caused by COVID-19 measures, we have been working hard to finalise our Knowledge Base curation process to improve the end-to-end import of data. Specifically, we want to avoid duplicating entries; imported data about the same file format should link to the same main format node in the Knowledge Base. The diagram below outlines the process we’ve implemented.

A data source adapter parses a data source (e.g. a web page) into our data model and adds data nodes into the staging area database. A person will curate the staging data into the Knowledge Base via the Workbench. They control which staging nodes are added as new nodes, which are merged with existing nodes, and which are discarded; the curation adapter implements the data management logic to make this happen. Once complete, we’ll have an updated, curated Knowledge Base.

**The IPS knowledge base curation process**

To enable this curation process to work we firstly had to extend the capabilities of our Data Management Library (DML) — a Python library used by the curation adapter to communicate with the Knowledge Base, allowing it to locate, add, and update graph nodes and relationships. The DML needed to indicate which nodes have been successfully added/updated. Following this we had to amend the curation adapter to remove successfully curated items from the staging area once they’d been copied to the Knowledge Base. We then had to implement a RESTful curation API to control the curation adapter and provide feedback to the Workbench UI. We’re now just finishing updating the Workbench UI to use the Curation API, and then we’ll move on to testing it!
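The curation rules described above (add as new, merge with an existing node, or discard, and only clear an item from staging once the Knowledge Base write has succeeded) can be sketched roughly as follows. This is an illustrative model only, written in Java to match the other code in these posts: the real DML is a Python library talking to Neo4J, and every class, method and field name here is hypothetical rather than part of IPS.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of one curation pass; none of these names come from IPS. */
final class CurationSketch {
    enum Decision { ADD_NEW, MERGE, DISCARD }

    /** Stand-in for the Knowledge Base: node key -> data attached to that node. */
    final Map<String, List<String>> knowledgeBase = new LinkedHashMap<>();

    /** Stand-in for the staging area: staging id -> imported data. */
    final Map<String, String> staging = new LinkedHashMap<>();

    /**
     * Applies a curator's decision to one staged item.
     * Returns true on success; the staged item is removed only in that case.
     */
    boolean curate(String stagingId, Decision decision, String targetKey) {
        String data = staging.get(stagingId);
        if (data == null) {
            return false; // nothing staged under this id
        }
        boolean ok;
        switch (decision) {
            case ADD_NEW: {
                // Refuse to create a second node for the same key (no duplicates).
                ok = !knowledgeBase.containsKey(targetKey);
                if (ok) {
                    List<String> facts = new ArrayList<>();
                    facts.add(data);
                    knowledgeBase.put(targetKey, facts);
                }
                break;
            }
            case MERGE: {
                // Merging needs an existing node to merge into.
                List<String> existing = knowledgeBase.get(targetKey);
                ok = existing != null;
                if (ok) {
                    existing.add(data);
                }
                break;
            }
            default: {
                // DISCARD: nothing is written to the Knowledge Base.
                ok = true;
            }
        }
        if (ok) {
            staging.remove(stagingId); // clear staging only after success
        }
        return ok;
    }
}
```

In this sketch, two staged descriptions of the same format end up on a single node (one ADD_NEW followed by one MERGE), while a failed MERGE leaves the item in staging so the curator can decide again.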

Another key area we’ve been improving is integration between the Workbench’s software search results and the Software Repository. We want preserved software to be discovered and downloaded via the Workbench (rather than separately through the Software Repository). Developments here required creating a Software Repository API and a Repository Adapter that implements that API. We also enhanced the Knowledge Base data model to capture software included in the Software Repository. We’ve now got this initial capability working, allowing the Workbench to indicate preserved software and to provide download links to it.

**IPS UI showing the clickable software name to download JHOVE**

Once those efforts are complete, our next development phase will look at improving the preservation planning process to make better use of the Knowledge Base. For example, how can we improve generation of preservation plan options based on collection, risk, file format or software information? We’ll also look to develop new data source adapters, improve existing ones, and start to populate our Knowledge Base in a curated fashion. Be on the lookout for a future webinar on our progress!

August 23 2018

Helping W3C help us: help wanted!

Peter May Specifications, Standards 4

The World Wide Web Consortium (W3C) are keen to engage with the digital preservation community for comment on their draft Web Publications standard (https://w3c.github.io/wpub/) – a possible parent specification for future EPUBs.

They would like to ensure that this specification, and future Web Publications based on it, are well prepared for archival. This is an excellent opportunity for this community to make a small contribution now, that could result in considerable preservation benefits further down the line.

The Digital Preservation Coalition and Open Preservation Foundation are inviting our collective members to join in a small working group to review, discuss and provide feedback on this developing standard.

As a broad indication of the work involved, we expect:

To hold an online call with a W3C representative to provide an overview of the standard
For participants to then individually review and provide comment on the standard to the digital preservation working group (via Google Doc)
Further call(s) to review and discuss the comments as necessary
To subsequently submit issues to the Web Publications GitHub project
For participants to follow up on responses from the W3C to posted GitHub issues.

We welcome all those who are interested and committed to getting involved in the above work to get in contact with either Paul Wheatley at the DPC or Peter May at the British Library/OPF.

June 22 2017

Help Wanted! Digital archivists, writers and coders needed to improve format validation

Peter May JHOVE 3

April’s JHOVE hack day was another great success covering a range of development and non-development tasks; issues and pull requests were closed, sample files were found, user documentation was reviewed, and our knowledge on JHOVE errors was expanded upon. We don’t want the great work achieved this day to stop there though! There’s still plenty that needs to be, and can be, done before our next hack day.

But we need your help! So if you’re a digital archivist with digital files to test, a developer in need of a challenge, someone struggling to use JHOVE, or just someone with a keen eye for punctuation, we want to hear from you.

We have several aims we’re working to achieve. Firstly, we want to make sure JHOVE is robust and the validation results accurate through improved testing. We also want to make sure that the validation results are clear and understandable and that we know what errors are important for long term preservation. Finally, we want to make JHOVE friendlier for first time users through enhanced documentation.

So how can you help move things along? Carl’s post alluded to a few ways in which you could contribute:

Help finish writing the JHOVE beginner’s guide, agreeing a structure and proofreading the text before publication.
Contribute real-world use cases that demonstrate how JHOVE can be used, e.g. examples of your validation workflows – you could also use this form to send us your examples.
Identify and send us shareable sample files (PDFs, WAVs, etc.) to test JHOVE development with.
Add explanations on JHOVE error messages or notes on the preservation impact of particular errors.
Work on open issues, not all of which require you to be a developer – for example, looking for external reference information.

Whether you’re a developer or not, if you can spare some time to work on these – perhaps trying to find sample files or reviewing documentation – then we want to hear from you. Don’t worry if you’ve never been involved in the hack days before, just drop Carl or me a line and we’ll get you started.

Finally, if you want to support JHOVE improvements but cannot commit to any of the above, you can always show your support through a financial donation.

Together we can make JHOVE reliable, easy to understand and transparent for everyone. Together we can make JHOVE awesome!

March 10 2017

Testing JHOVE PDF Module: the good, the bad, and the not well-formed

Peter May JHOVE, Validation 0

Organisations that use JHOVE for PDF validation will already be familiar with the number of error messages it reports. The recently released JHOVE v1.16 Release Candidate (RC) includes a couple of my bug fixes for the PDF module which appear to reduce this number significantly. These fixes were the result of investigating “Invalid Page Dictionary Object” errors and then a subsequent “Improperly Constructed Page Tree” error (which seemed to result from having solved the former).

Prior to this RC, Kevin Davies and I performed a short evaluation of the bug fixes against three (BL-internal) PDF samples (2,759 PDFs total) – one eBook set (set 2) and two eJournal sets (sets 1 and 3). We wanted to understand the impact these code changes had on the reported error messages, in particular whether they reduced the error messages they were targeting.

This post summarises that brief analysis, highlights the fixes, and makes suggestions about what this may mean for software used in digital preservation – spoiler: it needs testing and we need to make that happen!

What have the fixes achieved?

Before discussing what modifications were made, let’s first look at what effect they had on the sample sets of PDFs.

The figure below shows a breakdown of JHOVE responses for each of the three sample sets, comparing the patched JHOVE v1.15 code against the v1.15 codebase immediately prior to patching. As you can see there is some variation between the test sets, but overall they show a dramatic reduction in the number of “Not well-formed” results and a significant increase in the number of “Well-formed and valid” responses.

Fig 1: Count of files in each test set by overall JHOVE validation status

Comparing the responses for all files, we see – for all but 1 file (in Set 2) – a positive movement in their classification from “Not well-formed” (NWF) to “Well-formed but not Valid” (WF-NV) to “Well-formed and valid” (WF-V). For example, in the case of Set 3:

404 files upgraded from NWF to WF-V
12 files upgraded from NWF to WF-NV
57 files remained NWF
11 files upgraded from WF-NV to WF-V
16 files remained WF-NV
0 files downgraded

This movement is the same for the other sets, except in Set 2 where 1 file is “downgraded” from WF-NV to NWF; the cause of this is yet to be investigated.

Delving into these results a little more and focusing on the error messages that initiated the patch – Invalid Page Dictionary Objects and Improperly Constructed Page Trees – the results show that the patch completely eliminates occurrences of these error messages:

Count of test files with reports of Invalid Page Dictionary Objects and Improperly Constructed Page Trees

Without knowing anything about the patch or the sample PDFs, this of course opens up questions about whether such results are the correct results, but we’ll get to that…

Assuming our aim is to reduce the incidence of erroneous error messages, then on the whole the JHOVE patch appears to be working as far as the targeted error messages are concerned. Given the number of files that have had their overall status changed (fig 1), however, the results also hint that the patch affects more than just Invalid Page Dictionary Object and Improperly Constructed Page Tree errors.

Expanding to look at all the error messages reported by JHOVE v1.15 and the patched version, we see that the patches have caused similar eliminations across a wider set of messages. Continuing to focus on Set 3 alone, Table 1 shows the extent of the reduction in error counts across all the error messages reported for that set. Of the 27 listed error messages, 15 are no longer reported at all, and 1 has been reduced from 6 incidences to 1 (Invalid Outline Dictionary Item). Similar results are seen for the other 2 sets (see the summary of JHOVE error messages spreadsheet).

Table 1: Error message counts for Set 3, JHOVE v1.15 compared with patched JHOVE v1.15

JHOVE Error Message	#Messages (v1.15)	#Files (v1.15)	#Messages (patched)	#Files (patched)
Expected dictionary for font entry in page resource	360	360	0	0
Invalid Annotation property	118	118	0	0
Annotation object is not a dictionary	83	83	0	0
Improperly constructed page tree	43	35	0	0
Invalid object number in cross-reference stream	40	40	40	40
Invalid page tree node	19	19	0	0
Invalid page dictionary object	15	15	0	0
Invalid object number or object stream	12	11	0	0
Improperly formed date	8	8	8	8
Malformed dictionary	8	8	8	8
Invalid outline dictionary item	6	6	1	1
Invalid dictionary data for page	6	6	0	0
Invalid Resources Entry in document	6	5	0	0
Invalid Annotations	5	5	0	0
Compression method is invalid or unknown to JHOVE	4	4	4	4
edu.harvard.hul.ois.jhove.module.pdf.PdfInvalidException: Invalid destination object	4	3	31	14
Unexpected exception java.lang.ClassCastException	3	3	0	0
Malformed filter	3	3	3	3
Invalid destination object	3	3	5	5
Annotation dictionary missing required type (S) entry	3	3	0	0
Malformed outline dictionary	2	2	0	0
Improperly nested array delimiters	1	1	1	1
Invalid Font entry in Resources	1	1	0	0
Invalid character in hex string	1	1	1	1
Missing expected element in page number dictionary	1	1	1	1
Outlines contain recursive references.	74	74	0	0
Too many fonts to report; some fonts omitted.	1	1	4	4

In general, these results indicate that a large number of files appear to have been erroneously reported as invalid due to bugs in JHOVE. These appear to have been resolved by the code modifications, but this opens up the question of whether this reduction in error counts is the correct behaviour. My colleagues in the Library and in the OPF have reviewed my patches from a coding perspective and have confirmed the modifications are correct. Equally, the PDF experts at Dual Lab (the development team who brought you veraPDF) have evaluated the changes with reflection on the actual validity of test PDFs and have also given them a thumbs up.

Not all error messages were reduced or eliminated, however. A few messages show a small increase in the number of incidents – “Invalid Destination Object”, “Too many fonts to report; some fonts omitted”, and (for another set) “Malformed Filter” errors. These need further investigation to determine their cause; either it’s another bug that’s only become apparent through the correction of the coding bugs (in much the same way that the Improperly Constructed Page Tree error only became apparent after the Invalid Page Dictionary Object error was corrected), or these files do indeed have problems related to these new error messages that were being masked by the patched errors.

What were the bugs? Why have they had such an effect on other error messages?

The bugs were quite low-level in the code and, as I was coming fresh to both the codebase and the PDF spec, took a while to track down – roughly 3 days (including time to read up on parts of the PDF spec). Having identified them and worked them through, they now seem fairly obvious corrections; this goes to show just how difficult it can be to build and maintain robust software.

My investigation started with looking at a file exhibiting the Invalid Page Dictionary Object error.

Invalid Page Dictionary Object

The Invalid Page Dictionary Object error is reported when the PDF Pages Dictionary (the root of the Page Tree) is not found or has somehow been corrupted (tested by it not being a valid PDFDictionary object).

Unfortunately, this error is also reported on occasions when the PDF’s Page Tree is Flate encoded in a stream object and the root page starts beyond one stream buffer’s worth of data. This is caused by a mistake (typo?) in the PdfFlateInputStream.skipIISBytes method, which miscalculates the number of bytes in the stream to skip when the requested skip number is larger than the remaining buffer size.

@@ -314,15 +314,15 @@ public class PdfFlateInputStream extends FilterInputStream {

     /** Skip a specified number of bytes. */
     private long skipIISBytes (long n) throws IOException {
         if (iisBufOff >= iisBufLen && !iisEof) {
             readIIS ();
         }
         if (iisEof) {
             return -1;
         }
         if (iisBufLen - iisBufOff < n) {
-            n = iisBufLen + iisBufOff;
+            n = iisBufLen - iisBufOff;
         }
         iisBufOff += n;
         return n;
     }

Specifically, line 324 of PdfFlateInputStream should set the skip amount to the amount of data left in the buffer (the buffer length minus the current offset). Instead, the amount skipped was being set to the buffer length plus the current offset into that buffer – i.e. the skip jumps beyond the end of the buffer.
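To make the size of that overshoot concrete, here is a minimal sketch of just the clamping arithmetic. It is not JHOVE code: the class and field names are made up for illustration, and the buffer refill and end-of-stream handling from the real method are left out.

```java
/** Simplified stand-in for the buffer bookkeeping in skipIISBytes (not JHOVE code). */
final class SkipSketch {
    private final int bufLen; // bytes currently held in the buffer
    private int bufOff;       // current read offset into that buffer

    SkipSketch(int bufLen, int bufOff) {
        this.bufLen = bufLen;
        this.bufOff = bufOff;
    }

    /** Pre-patch arithmetic: clamps the skip to buffer length PLUS offset. */
    long skipBuggy(long n) {
        if (bufLen - bufOff < n) {
            n = bufLen + bufOff;
        }
        bufOff += n;
        return n;
    }

    /** Patched arithmetic: clamps the skip to the bytes actually remaining. */
    long skipFixed(long n) {
        if (bufLen - bufOff < n) {
            n = bufLen - bufOff;
        }
        bufOff += n;
        return n;
    }

    int offset() {
        return bufOff;
    }
}
```

Assuming, purely for illustration, a 4096-byte buffer with the offset at 4000 and a request to skip 500 bytes: the pre-patch arithmetic reports 8096 bytes skipped and leaves the offset at 12096, well past the end of the buffer, whereas the patched arithmetic reports 96 bytes skipped and leaves the offset at 4096, exactly at the end of the buffer.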

Having corrected for that error (see patch above), JHOVE then typically reports an Improperly Constructed Page Tree.

Improperly Constructed Page Tree

This error gets reported when there’s a problem iterating through JHOVE’s internal model of a PDF’s Page Tree. Specifically, JHOVE records a list of Page Object IDs that it visits whilst iterating through, and if an ID reoccurs, it reports that the Page Tree was not constructed properly.

This sounds sensible. Unfortunately, it was possible for a Page Object to not get its ID set when the object was located in an Object Stream (line 2429 in PdfModule simply returned the Object without setting its ID); in this case its ID reverted to the default (-1). With no checks being made for -1 IDs when iterating the Page Tree model, a second occurrence of a Page Object with a -1 ID causes the Improperly Constructed Page Tree error to be reported.

The fix (which David Russo improved upon from my initial implementation) is to ensure that the Object’s ID is set when trying to retrieve that specific Object (see line 128 in ObjectStream):

@@ -116,21 +116,24 @@ public class ObjectStream {
     /** Extracts an object from the stream. */
     public PdfObject getObject (int objnum)
         throws PdfException
     {
         Integer onum = new Integer (objnum);
         Integer off = (Integer) _index.get (onum);
         try {
             if (off != null) {
                 int offset = off.intValue ();
                 _parser.seek (offset + _firstOffset);
-                return _parser.readObject (false);
+                PdfObject object = _parser.readObject (false);
+                /* Need to ensure the object number is set */
+                object.setObjNumber(objnum);
+                return object;
             }
             return null;
         }
         catch (IOException e) {
             throw new PdfMalformedException
                     ("Offset out of bounds in object stream");
         }
     }
 }
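To see why an unset ID trips the repeat-ID check described above, here is a minimal sketch of that kind of check. It is a simplified model rather than JHOVE's actual Page Tree code: the class and method names are hypothetical, and the Page Tree is flattened to a plain list of object IDs.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of a "have we seen this page object before?" check. */
final class PageTreeSketch {
    /** Default ID carried by an object whose number was never set. */
    static final int UNSET_ID = -1;

    /** Returns true if no object ID is seen twice while walking the pages. */
    static boolean walkIsWellFormed(List<Integer> pageObjectIds) {
        Set<Integer> visited = new HashSet<>();
        for (int id : pageObjectIds) {
            if (!visited.add(id)) {
                // A repeated ID is treated as a badly constructed tree,
                // even when the "repeat" is just two objects left at -1.
                return false;
            }
        }
        return true;
    }
}
```

In this model, pages with IDs 12, 13 and 14 pass the check, but 12 followed by two distinct pages that both still carry `UNSET_ID` fails it, which is the false positive the patch removes by setting the object number on retrieval.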

Why should the bugs have an effect on other error messages then? Ultimately, the bugs themselves are in low-level parts of the code that affect the handling of Flate encoded stream objects. The error in the skipIISBytes method, in particular, will cause issues navigating around such stream objects (e.g. seeking to specific offsets), which impacts the ability of JHOVE to process them. Therefore the other errors reported are likely to be the result of this failure.

What does this mean for JHOVE and more generally for software used in digital preservation?

Foremost, the software we use needs to be tested and re-tested as and when changes are made to it.

As far as JHOVE is concerned, it appears that it has erroneously been reporting many PDFs as invalid – but there was no reason to suspect these outputs were incorrect. The changes to the code described above seem to be a positive step forwards in reducing these occurrences; however, despite these changes having had numerous people review them, it is still hard to be conclusive about an improvement without adequate testing against a suitable, open, ground-truthed corpus of test files. How else will we know if the software has been correctly developed to produce the correct output?

Developing software is difficult, period. It’s harder still to develop robust software that correctly handles every variation in input. And it’s even harder when development effort is being done in people’s spare time, split around day jobs, etc. Without focused effort, coding errors creep in, go easily unnoticed, and can be difficult to find even when you are looking (regardless of how simple they may seem once found).

With no slight on JHOVE’s developers, the fact that bugs appear to have been found in JHOVE is not a surprise; I would expect them in any large and complicated piece of software. And so it is perhaps a good idea for us to assume that bugs exist in any of the software we intend to use. Some software will be better than others, of course, and this will generally be related to the amount of testing that has gone into its development. Does JHOVE need better testing? Sure it does!

So, with this assumption in mind, what can we do to help assure ourselves that tools are working effectively? And what else can we do to mitigate preservation problems relating to software – such as from files erroneously marked as invalid?

Test them! Build up unit tests and testing frameworks for tools; evaluate new versions and updates to software; ensure all tests pass.
Develop open, ground-truthed, test suites so that we can verify the output from software testing.
Perform in-house testing and report summary results to developers and/or the community, particularly where sample files cannot be shared, so as to provide a community confidence level in the software.
Compare tools against one another to see where differences of opinion lie (how does JHOVE validation compare against other tools?), promoting confidence in software and direction for further development.
Provide dedicated in-house developer resource to support the development and testing of software your organisation relies upon; your QA process will only be as good as the software it uses.
Record what software and versions of software are used to process files so that you can identify content that may have been affected by software later found to be defective. Keep in mind that modular software, such as JHOVE, may also require you to record versions of modules used as well as the main software.
Preserve the original files to revert back to; particularly important for workflows involving migrations or modifications to files.
Preserve the software that you use so that you can always refer back to it and replicate historic workflows if necessary.

May 12 2014

Using Kanban at the SCAPE Developer Workshop

Peter May SCAPE 0

The SCAPE project is into its final 6 months and with that came our final developer workshop. The main focus of this event was demonstrations, productisation and sustainability, however with everyone together it provided an opportune time to make progress with other SCAPE related activities. With nearly 30 people there, there was a lot going on and so the agenda needed to be flexible to enable productive working, but managed to ensure the workshop’s overall goals were achieved. This post discusses our use and experience of kanban as a means to manage such workshops.

Kanban is a visual way of managing tasks through a workflow. It is lightweight, applicable to whatever process you have today, and doesn’t require a lot of overhead. It is based on three principles (some say 6): visualise, limit Work in Process (WIP) and manage flow, which serve to ensure tasks and processes are brought into the open for discussion, that lead times are reduced and that the workflow is understood and continually improved through monitoring. For the purposes of this SCAPE workshop the main priority was visibility of the work.

Visualising Tasks

Typically a workflow is represented on a kanban board along with the work items that move through this workflow. Placing such a board on the wall enables everyone to see what’s being worked on, who’s working on it, and what stage in the workflow it’s at. There is a minimal overhead – creating work item cards and moving them through the workflow stages – but this is manageable with short, regular update discussions.

Task Boards

Todo and Doing tasks

Using large sheets of paper stuck to the meeting room walls, we created a primary board with 3 main columns: To Do, Doing, Done. After introductory presentations recapping the purpose of the workshop and the main goals, participants were urged to consider tasks they needed to do and to write them on sticky notes, whilst further presentations were given about specific activities that needed doing (e.g., generating microsite documentation for each tool). This worked really well as it got participants thinking about and writing down what they needed to do whilst discussions were happening.

It is important to note that a fair amount of pre-work was performed in the lead-up to the workshop. For example, gaps in tool documentation and tool installation issues were both identified prior to the workshop. These were then introduced in the initial presentations (and discussed further during the remainder of the workshop) and provided a valuable initial set of tasks that quickly got the workshop moving. The motivation for this pre-work was driven primarily by the desired overall goals of the workshop.

Tasks

One sticky note equated to one task. We made no distinction between different coloured notes, although the colours could have been used to indicate categories of work, e.g., bugs, development, documentation, etc. This does add another level of complexity into visualising the work and given the workshop was only 3 days long, not giving meaning to the colours kept things simple.

Similarly, it is useful to know who is working on a task. We pre-printed everyone’s names onto white stickers which could be affixed to the sticky notes. Being white on coloured notes emphasised the names, making it easier to see which tasks were ownerless. Avatars were considered, however given the short duration of the workshop and the fact that few people pre-selected one, we didn't use them (they were film character based and so didn't relate to any specific individual at the workshop, meaning it probably would have been harder to work out who the task "owner" was anyway).

"Why does this tool have its own board?"

Beyond the main board we also had several smaller boards for individual software tools being worked on. These had the same 3 columns as before, but also included “Waiting for Feedback” (queue) and “Feedback” columns between ‘Doing’ and ‘Done’. These two additional columns were motivated by the work on documentation, where checks were thought necessary to ensure documentation completeness. I didn’t see these columns used much, and they could have equally been replaced by creating “Check documentation for tool X” sticky notes once the “Do documentation” task had been completed.

Specific Tool Kanban boards

We never had any logic or consistency in which tools had their own board and which didn’t, which resulted in some confusion; a few times I overheard questions such as “why does this tool have its own board?” to which there was no obvious answer. In many ways these smaller boards acted as swim lanes for those tools, the lanes just happened to be separated out into their own boards. Separate boards highlight the work on a specific tool/topic, but perhaps unnecessarily isolate that work from the rest. If a clear distinction between work item “topics” is needed, different coloured sticky notes could always be used instead (this wasn’t the case for us though), but caution should be used to avoid making it too complex (e.g. through use of too many colours).

Work in Process – What to look out for

The emphasis for taking a kanban approach was to bring the tasks being worked on out into the open. It can often be hard to manage workshops with 30 people all working on different things and ensure that the workshop’s goals are met. However, having tasks visible on the board with people’s names attached to them means no-one can hide.

No emphasis was placed on limiting the amount worked on by one person, although people naturally tended towards only working on one or two things at a time. Progress was simply monitored through ad-hoc group discussions centred on the boards and going around the table asking each participant how they were getting on. This sometimes resulted in the need to move a sticky note from one column to another, but forgetting to do this can be excused by unfamiliarity with the kanban approach. A number of times, people were working on things that didn’t have a sticky note at all; asking them to create one then and there was the easiest approach to ensuring it happened, and also encouraged a few others to slyly put up other missing notes!

Another thing to look out for is ill-written tasks – task descriptions that are either too vague to be understood (by anyone other than the task owner) or so high-level that they encompass many individual tasks. A big factor in using kanban is to bring the work out into the open so everyone can become familiar with what’s going on. If the actual things being worked on are hidden behind vague descriptions then nothing is communicated and you may as well not have had the note at all.

Lessons Learned

Kanban was a very effective approach to managing the wide variety of tasks going on at the workshop, and I would recommend it for such meetings. The following bullets summarise the discussions above, reflecting our experiences with the technique, and are directed at the use of kanban in short workshops (i.e. the recommendations may be different if applied to a long-running project, for instance).

Preparation:

Understand what the key goals are that need to be achieved at the workshop and prepare accordingly; from this it is likely that an initial set of tasks can be created (or at least hinted at) to help get the workshop up-to-speed quickly.

At the Workshop:

Briefly explain the kanban process and give everyone sticky notes so they can write tasks as soon as they think of them; do this up front before main presentations.
Keep any introductory presentations short and directed towards encouraging task identification around the main workshop goals; aim to swiftly move on to the "doing".

Visualise, but keep things simple:

Keep everything on the same board; 3 columns (To Do, Doing, Done) is often enough.
Don’t give meaning to sticky note colours unless there’s a good reason to; if there is a good reason, create a key to ensure everyone understands each colour’s meaning.
Use stickers with people’s names on. These are more obvious than handwritten names, and it is easier to recognise who they refer to than with avatars (given the short duration of the workshop). If using avatars, consider just using people’s photos (perhaps also with their names).

Tasks:

Ask for note rewrites or breakdowns into multiple notes where task descriptions are vague or too high-level.
Encourage ownership of tasks by getting the person whose task it is to write the note.

Manage the boards:

Hold ad-hoc group discussions centred around the boards.
Ask for status updates by person rather than by task, so that missing task notes can be identified (this favours having all the tasks identified over an exact status for each).
Understand that people (especially those new to the process) forget to create/update tasks; encourage note creation as soon as the gap is recognised; if necessary, move tasks appropriately during the ad-hoc recaps.
Don’t move tasks to “Done” until they truly are done; “I’ve just got to push the code back to the repository” means it’s not finished.

Job Done!

Completed tasks

It was great to see everyone participate in the approach – I was expecting reluctance from people to get involved, but was surprised by the level of enthusiasm; I even noted someone exclaim at completing a task before jumping up to triumphantly move the sticky note to the done pile! With a variety of tasks, progress across the board can often be seen quickly (for us, things were complete by the end of the first day), and at the end of the workshop hopefully you end up with a “Done” column full of sticky notes and (if anything’s left) a “To Do” column with follow-up actions.

March 7 2013

SCAPE Scenario & Developer’s Workshop, 22-24th January

Peter May SCAPE 0

After a year predominantly focussed on external SCAPE events (Open Research Challenges Workshop @ iPres, the First SCAPE Training Event in Guimarães) we finally organised another project-internal scenario and developer’s workshop. As always, the event provided a great opportunity for developers and SCAPE scenario holders to get together and talk face-to-face. It also gave those new to the project an opportunity to meet some of the people they’d only spoken to over Skype (or even only over email!).

Flexibility in the workshop agenda enabled each day’s activities to be adjusted slightly so that everyone had the opportunity to discuss and gain awareness of important topics at this point in the project. The main topics were the scenario status updates, scenario refinement, the policy representation work, and the functional review criteria. Giving team members an appreciation of these subjects will help to improve project-wide understanding and communication, especially as the project steps up integration of the various components we have been working on.

Busy working!

The first day started by reviewing the existing scenarios. Plenty of excellent work has gone on across the project, and much of this is driven and directed by content holders who express their needs and assess solutions through the various Scenarios. From assessing this list, it was clear that we have a good spread of Scenarios underway, and completing these should form the focus for upcoming work. As it stood at the meeting, across the three TestBeds (Large Scale Digital Repositories, Web Content, and Research Datasets), we had 11 active scenarios, 7 just starting to be worked on, and a further 10 not started, postponed or unknown (2 unknowns due to no feedback before the meeting).

Understanding the breadth and status of scenarios helps us understand and prioritise forthcoming work. Combined with the recent gap analysis and the scenario refinement work, which aims to succinctly identify the issue with each scenario and the associated solution’s requirements, it will also help make on-going development work clearer, better directed and easier to measure.

Catherine Jones (STFC) presented an overview of the policy representation work going on within the project. This aims to break down high-level organisational policies into low-level, machine-actionable policies. These can be monitored and reacted to by the automated watch component (for example, reacting to risks and constraints derived from policies). They can also be used by preservation planning to ensure that Preservation Plans meet institutional needs and requirements (for example, using institutional policies to select appropriate plans or components for plans). As a means to understand this better, the group worked on developing example policies for the three defined policy levels (High-level guidance policies, Preservation policies, and Control Level policies) using an existing scenario (TIFF to JPEG 2000 migration).
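To give a rough feel for what "machine-actionable" could mean at the control level, here is a minimal Python sketch. Everything in it is hypothetical: the `ControlPolicy` class, the property names and the example values are assumptions made for illustration, not SCAPE's policy model or the policies the group actually wrote. It only shows the general idea of a watch-style check that compares measured properties against policy constraints.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ControlPolicy:
    """Hypothetical control-level policy: one measurable constraint."""
    name: str
    prop: str                        # measured property the policy refers to
    check: Callable[[object], bool]  # returns True when the value is acceptable


def violations(policies: List[ControlPolicy], measured: Dict[str, object]) -> List[str]:
    """Return the names of policies that the measured values do not satisfy.

    A property that was not measured counts as a violation, because the
    policy cannot be shown to hold.
    """
    failed = []
    for policy in policies:
        if policy.prop not in measured or not policy.check(measured[policy.prop]):
            failed.append(policy.name)
    return failed


# Illustrative policies for a TIFF to JPEG 2000 migration (values are made up).
POLICIES = [
    ControlPolicy("target format is JP2", "format", lambda v: v == "image/jp2"),
    ControlPolicy("compression is lossless", "lossless", lambda v: v is True),
    ControlPolicy("image width is preserved", "width_matches_source", lambda v: v is True),
]

# Made-up measurements for one migrated file; the width check was never run.
measured = {"format": "image/jp2", "lossless": False}
print(violations(POLICIES, measured))
```

In this sketch a high-level statement such as "migrations must not lose information" would sit above these checks, and a planning or watch process would act on whatever list `violations` returns.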

On the final day, a couple of demonstrations were given to the group. One focussed on the development of the Component Plugin functionality in Taverna (to enable the creation and use of SCAPE Preservation Components within the Taverna Workbench) and the associated Component Catalogue APIs for storage, access and discovery of these components within the web-based catalogue. The second demo was of Matchbox, a tool for detecting near-similar images, for example where one image is a rotated or scaled version of another. In particular, this demo focussed on the detection of duplicate images within a collection of content, with excellent results.

To wrap things up, Carl Wilson (OPF) gave a presentation on the Functional Coding Review aspects of the project, which serve to ensure that the software we produce is of good quality and easily maintainable. He covered many related aspects such as documentation, licensing, unit tests, bug tracking and packaging. He also discussed the need for a team of developers who will be responsible for reviewing code against our coding guidelines (as well as reviewing and iterating the guidelines themselves). If you can spare some time to help with this, please get in touch.

On the whole, the workshop was very successful. Flexible arrangements allowed a lot of important work to be covered, and the participants I spoke to all had positive things to say about this approach and the meet-up in general. Ultimately, the benefit of such flexibility probably reflects the fact that constructive participation is what matters in setting the direction and success of any endeavour.

On this note, if you have a scenario with a scalability challenge which SCAPE should be working on, are able to help with functional review, or have feedback on SCAPE outputs (especially from those external to the project) then please get in touch.
