SCAPE
Published Preservation Policies
One of the activities in the European project SCAPE is to create a catalogue of policy elements. At the last iPRES conference we explained our work, and you can read about it. During our activities we started collecting existing, published policies, and we have now put the current set on a wiki: http://wiki.opf-labs.org/display/SP/Published+Preservation+Policies. Looking at the results of your colleagues might help you create or finalize your own preservation policies. As I said during my presentation at iPRES 2013, there are far more organizations dealing with digital preservation than there are published preservation policies on the internet – at least based on what we found!
If your organization has a digital preservation policy and you would like to see it in this list as well, please send an email to Barbara.Sierman@kb.nl and it will be added.
Assessing file format risks: searching for Bigfoot?
Last week someone drew my attention to a recent iPRES paper by Roman Graf and Sergiu Gordea titled "A Risk Analysis of File Formats for Preservation Planning". The authors propose a methodology for assessing preservation risks for file formats using information in publicly available information sources. In short, their approach involves two stages:
- Collect and aggregate information on file formats from data sources such as PRONOM, Freebase and DBPedia
- Use this information to compute scores for a number of pre-defined risk factors (e.g. the number of software applications that support the format, the format's complexity, its popularity, and so on). A weighted average of these individual scores then gives an overall risk score (a minimal sketch of this computation follows below).
This has resulted in the "File Format Metadata Aggregator" (FFMA), which is an expert system aimed at establishing a "well structured knowledge base with defined rules and scored metrics that is intended to provide decision making support for preservation experts".
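To make this scoring step concrete, here is a minimal sketch of such a weighted-average computation. The factor names, weights and scores are invented for illustration; they are not the actual values or weights used by FFMA:

    # Illustrative sketch of a weighted risk score; the factor names,
    # weights and scores are invented and do NOT come from the FFMA paper.
    weights = {"software_count": 0.4, "complexity": 0.3, "popularity": 0.3}
    scores  = {"software_count": 0.9, "complexity": 0.2, "popularity": 0.1}

    # Weighted average: sum of weight * score, divided by the total weight.
    overall = sum(weights[f] * scores[f] for f in weights) / sum(weights.values())
    print(overall)  # roughly 0.45, on a scale of 0 (low risk) to 1 (high risk)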
The paper caught my attention for two reasons: first, a number of years ago some colleagues at the KB developed a method for evaluating file formats that is based on a similar way of looking at preservation risks. Second, just a few weeks ago I found out that the University of North Carolina is also working on a method for assessing "File Format Endangerment" which seems to be following a similar approach. Now let me start by saying that I'm extremely uneasy about assessing preservation risks in this way. To a large extent this is based on experiences with the KB-developed method, which is similar to the assessment method behind FFMA. I will use the remainder of this blog post to explain my reservations.
Criteria are largely theoretical
FFMA implicitly assumes that it is possible to assess format-specific preservation risks by evaluating formats against a list of pre-defined criteria. In this regard it is similar to (and builds on) the logic behind, to name but two examples, Library of Congress' Sustainability Factors and UK National Archives' format selection criteria. However, these criteria are largely based on theoretical considerations, without being backed up by any empirical data. As a result, their predictive value is largely unknown.
Appropriateness of measures
Even if we agree that criteria such as software support and the existence of migration paths to some alternative format are important, how exactly do we measure them? It is pretty straightforward to simply count the number of supporting software products or migration paths, but this says nothing about their quality or suitability for a specific task. For example, PDF is supported by a plethora of software tools, yet it is well known that few of them support every feature of the format (possibly even none, with the exception of Adobe's implementation). Here's another example: quite a few (open-source) software tools support the JP2 format, but for this many of them (including ImageMagick and GraphicsMagick) rely on JasPer, a JPEG 2000 library that is notorious for its poor performance and stability. So even if a format is supported by lots of tools, this will be of little use if the quality of those tools is poor.
Risk model and weighting of scores
Just as the employed criteria are largely theoretical, so are the computation of the risk scores, the weights that are assigned to each risk factor, and the way the individual scores are aggregated into an overall score. The latter is computed as the weighted sum of all individual scores, which means that a poor score on, for example, Software Count can be compensated for by high scores on other factors. This doesn't strike me as very realistic, and it is also at odds with, for example, David Rosenthal's view that formats with open-source renderers are effectively immune to format obsolescence.
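A quick illustration of this compensation effect, using the same invented weights as in the sketch above: a format that no software can read at all still ends up with a seemingly moderate overall score, as long as it does well on the remaining factors:

    # Continuing the invented example above: software_count gets the worst
    # possible score here, yet the weighted sum still looks moderate.
    weights = {"software_count": 0.4, "complexity": 0.3, "popularity": 0.3}
    risky   = {"software_count": 1.0, "complexity": 0.0, "popularity": 0.0}
    overall = sum(weights[f] * risky[f] for f in weights)
    print(overall)  # 0.4: a seemingly moderate risk, even though not a
                    # single tool supports the format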
Accuracy of underlying data
A cursory look at the web service implementation of FFMA revealed some results that make me wonder about the data that are used for the risk assessment. According to FFMA:
- PNG, JPG and GIF are uncompressed formats (they're not, as the little check after this list demonstrates for PNG!);
- PDF is not a compressed format (in reality text in PDF nearly always uses Flate compression, whereas a whole array of compression methods may be used for images);
- JP2 is not supported by any software (Software Count=0!), it doesn't have a MIME type, it is frequently used, and it is supported by web browsers (all wrong, although arguably some browser support exists if you account for external plugins);
- JPX is not a compressed format and it is less complex than JP2 (in reality it is an extension of JP2 with added complexity).
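The first of these claims is easy to test: the PNG specification mandates that all image data lives in zlib (deflate) compressed IDAT chunks, so a few lines of code can demonstrate that any valid PNG is, by definition, compressed. This little check is my own illustration and is not part of FFMA:

    # Illustrative check, not part of FFMA: PNG image data always lives in
    # zlib/deflate-compressed IDAT chunks, so no valid PNG is "uncompressed".
    import struct, zlib

    def png_idat_decompresses(path):
        with open(path, "rb") as f:
            data = f.read()
        assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG file"
        pos, idat = 8, b""
        while pos < len(data):
            length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
            if ctype == b"IDAT":
                idat += data[pos + 8:pos + 8 + length]
            pos += length + 12  # 4 (length) + 4 (type) + data + 4 (CRC)
        zlib.decompress(idat)  # raises an error if not deflate-compressed
        return True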
To some extent these data problems may also explain the peculiar ranking of formats in Figure 6 of the paper, which ranks PDF and MS Word (!) as lower-risk formats than TIFF (with GIF getting the lowest overall score).
What risks?
It is important to note that the concept of 'preservation risk' as addressed by FFMA is closely related to (and has its origins in) the idea of formats becoming obsolete over time. This idea is controversial, and the authors do acknowledge this by defining preservation risks in terms of the "additional effort required to render a file beyond the capability of a regular PC setup in [a] particular institution". However, in its current form FFMA only provides generalized information about formats, without addressing specific risks within formats. A good example of this is PDF, which may contain various features that are problematic for long-term preservation. Also note how PDF is marked as a low-risk format, despite the fact that it can be a container for JP2, which is considered high-risk. So doesn't that imply that a PDF that contains JPEG 2000 compressed images is at a higher risk?
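As an aside, that particular risk is detectable: JPEG 2000 compressed images in a PDF are marked with the /JPXDecode filter. The crude scan below (my own illustration, not part of FFMA) hints at how format risks hiding inside a "low-risk" container could be surfaced; a robust check would parse the PDF properly, since the filter name may sit inside a compressed object stream and never appear literally in the file:

    # Crude, illustrative check for JPEG 2000 streams inside a PDF: the
    # /JPXDecode filter marks JPEG 2000 compressed image data. A robust
    # check would parse the PDF instead of scanning raw bytes.
    def pdf_contains_jpeg2000(path):
        with open(path, "rb") as f:
            return b"/JPXDecode" in f.read()

    print(pdf_contains_jpeg2000("example.pdf"))  # hypothetical input file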
Encyclopedia replacing expertise?
A possible response to the objections above would be to refine FFMA: adjust the criteria, modify the way the individual risk scores are computed, tweak the weights, change the way the overall score is derived from the individual scores, and improve the underlying data. Even though I'm sure this could lead to some improvement, I'm eerily reminded here of a recent blog post by Andy Jackson, in which he shares his concerns about the archival community's preoccupation with format, software and hardware registries. Apart from the question of whether the existing registries are actually helpful in solving real-world problems, Jackson suggests that "maybe we don't know what information we need", and that "maybe we don't even know who or what we are building registries for". He also wonders if we are "trying to replace imagination and expertise with an encyclopedia". I think these comments apply equally well to the recurring attempts at reducing format-specific preservation risks to numerical risk factors, scores and indices. This approach simply doesn't do justice to the subtleties of practical digital preservation. Worse still, I see a real danger of non-experts taking the results of such expert systems at face value, which can easily lead to ill-judged decisions. Here's an example.
KB example
About five years ago some colleagues at the KB developed a "quantifiable file format risk assessment method", which is described in this report. This method was applied to decide which still image format was the best candidate to replace the then-current format for digitisation masters. The outcome was used to justify a change from uncompressed TIFF to JP2. It was only much later that we found out about a host of practical and standards-related problems with the format, some of which are discussed here and here. None of these problems were accounted for by the earlier risk assessment method (and I have a hard time seeing how they ever could be)! The risk-factor approach of FFMA covers similar ground, which adds to my scepticism about addressing preservation risks in this manner.
Final thoughts
Taking into account the problems mentioned in this blog post, I have a hard time seeing how scoring models such as the one used by FFMA would help in solving practical digital preservation issues. It also makes me wonder why this idea keeps being revisited over and over again. Similar to the format registry situation, is this perhaps another manifestation of the "trying to replace imagination and expertise with an encyclopedia" phenomenon? What exactly is the point of classifying or ranking formats according to perceived preservation "risks" if these "risks" are largely based on theoretical considerations, and are so general that they say next to nothing about individual file (format) instances? Isn't this all a bit like searching for Bigfoot? Wouldn't the time and effort involved in these activities be better spent on trying to solve, document and publish concrete format-related problems and their solutions? Some examples can be found here (accessing old PowerPoint 4 files), here (recovering the contents of an old Commodore Amiga hard disk), here (BBC Micro data recovery), or even here (problems with contemporary formats).
That said, I think some of the FFMA-related work could play a valuable role here: the aggregation component of FFMA looks really useful for the automatic discovery of, for example, software applications that are able to read a specific format, and this could be hugely helpful in solving real-world preservation problems.
Let’s benchmark our Hadoop clusters (join in!)
Introduction
For our evaluations within SCAPE it would be useful to be able to quantitatively measure the performance of the Hadoop clusters available to us, so that results from each cluster can be compared.
Fortunately, the standard Hadoop distribution includes some example programs that can be run as tests. Intel has produced a benchmarking suite – HiBench – that uses those included Hadoop examples to produce a set of results.
There are various aspects of performance that can be assessed; the main ones are:
- CPU-loaded workflows (e.g. file format migration), where the workflow speed is limited by the CPU processing power available
- I/O-loaded workflows (e.g. identification/characterisation), where the workflow speed is limited by the I/O bandwidth available
For the testing of our cluster I used HiBench 2.2.1. I made some notes about getting it to run that should be useful (see below). Apart from the one change described below in the notes, there was no need to edit or change the code.
In SCAPE testbeds we are running various workflows on various clusters. However, individual workflows tend to be run on only one cluster. Running a standard benchmark on each Hadoop installation may allow us to better compare and extrapolate results from the different testbed workflows.
Notes – these steps are only required on the node from which HiBench is run (a scripted sketch of a typical run follows after these notes).
- JAVA_HOME is needed by some tests – I set this using “export JAVA_HOME=/usr/lib/jvm/j2sdk1.6-oracle/”.
- For the kmeans test I changed the HADOOP_CLASSPATH line in “kmeans/bin/prepare.sh” to “export HADOOP_CLASSPATH=`mahout classpath | tail -1`”, as the test would not run without that change (mahout was already on the path).
- The nutchindexing and bayes tests required a dictionary to be installed on the node that HiBench was started from – I installed the “wbritish-insane” package.
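For reference, the benchmarks follow a common prepare/run pattern, which makes it easy to script a full run. Below is a minimal, hypothetical Python wrapper; the benchmark names and the <benchmark>/bin/run.sh path are assumptions based on the layout of my HiBench 2.2.1 checkout (the kmeans note above confirms bin/prepare.sh), so verify them against your own copy:

    # Minimal sketch of driving a few HiBench 2.x benchmarks in sequence.
    # ASSUMPTION: each benchmark directory provides bin/prepare.sh and
    # bin/run.sh; prepare.sh is confirmed by the kmeans note above, run.sh
    # is inferred. Run this from the HiBench root on the launch node.
    import subprocess

    def run_benchmark(name):
        for script in ("prepare.sh", "run.sh"):
            subprocess.check_call(["bash", "%s/bin/%s" % (name, script)])

    if __name__ == "__main__":
        for bench in ("wordcount", "sort", "kmeans"):  # example selection
            run_benchmark(bench)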
Caveats
Some tests use fewer map/reduce slots than are available, and are therefore less useful for comparison, since we want to max out the cluster. For example, the kmeans tests only used 5 map slots.
Results
I have created a page on the SCAPE wiki where I have put the results from our cluster: “Benchmarking Hadoop installations”. I invite and encourage you to run the same tests and add your results to the wiki page. Running the tests was much quicker than I expected – it took less than a morning to set up and execute.
To get a better understanding of which benchmarks are more or less appropriate, I propose we first get some metrics from all the HiBench tests across different clusters. In future we may choose to refine or change the tests to be run, but this is just the start of a process to better understand how our Hadoop clusters perform. It’s only through your participation that we will get useful results, so please join in!
SCAPE Software Needs You
One of the Open Planets Foundation’s main roles in the SCAPE project is to provide stewardship for, and ensure the longevity of, the SCAPE software outputs.
The SCAPE project is committed to producing open source software that is available to the wider community on GitHub, with clear licence terms and appropriate documentation, at an early stage in development.
While the above steps are important and helpful in encouraging other developers to download a project's source code, compile it, and try the software, this isn’t an everyday activity for the less geeky members of the digital preservation community. Software in this state is also unlikely to meet with the approval of an institution's IT Operations / Support section.
What’s really required for software longevity is an active community of users who:
- Use the software for real-world activities in their day-to-day work.
- Report bugs and request enhancements on the project's issue tracker.
- Contribute to community software documentation.
So how do we bridge the gap between our current developer-ready software, and software that non-geeks find easy to install and use?
Over October there will be a sustained effort to package, document and publish SCAPE software for download by anybody who wants to try it. If that sounds like you then read on.
Where can I find the SCAPE software?
We have compiled a list of tools that have been developed or extended as part of the SCAPE Project: http://www.scape-project.eu/tools. Currently our software is on the OPF’s GitHub page, though if you’re not comfortable with source code this might not prove very helpful. To help you make sense of what’s on the GitHub page the OPF have created a project health check page, which distills the information a little and provides helpful links to the projects' README and LICENSE files. This page is still a work in progress, so if there’s some information you’d like to see on it you can raise an issue on GitHub.
How do I know that the software builds?
All SCAPE software should have a Continuous Integration build that runs on the Travis-CI site; this means that the software is built every time somebody checks a change into the source code on GitHub. If the build fails the developer is informed, and corrects the problem as soon as possible. Every project listed on the project health check site has a build-status graphic indicating the result of the most recent attempt to build the project on Travis, or informing you that a Travis build couldn’t be found. Click on the image and you’ll be taken to the project’s Travis page if you’re interested in the gory details.
On top of these CI builds, the OPF runs a Jenkins server that performs nightly builds of the software. The aim of these builds is to analyse the code quality using Sonar, an open code QA platform.
So how do I download and use SCAPE software?
Which brings us round to October, when we’ll be fitting the final piece of the puzzle. The real aim of the nightly builds is to build installable packages that can be downloaded by you. These will be Debian apt packages, installable on Debian-based Linux distributions including Ubuntu, Mint, and of course Debian itself.
We’ll be creating stable release packages for download from the OPF's Bintray page, and overnight “snapshot” builds of the current projects at a location still to be decided. Keep an eye on @openplanets and @scapeproject for news and download links over the coming month.
But I use Windows, Mac OS, or another Linux packaging system.
Fear not, all is not lost. We’ve chosen Debian-based Linux distros first because:
- It simplifies licensing issues for build machines and virtual test and demonstration environments.
- Debian-based distros are among the most widely used Linux distributions.
- Hadoop, the engine that runs SCAPE’s scalable platform, has historically not played well with Windows, although this is no longer such a problem.
Some of the software will run on other platforms easily: Jpylyzer, for example, is available for Windows. Others may require a little more work, but if there’s interest and it’s practical we’ll do our best. We’re trying to establish a community of users, not exclude people.
So that’s why SCAPE software needs you, hopefully as much as you need SCAPE software.
Interview with a SCAPEr – Rui Castro
Who are you?
I’m Rui Castro. I have worked at KEEP SOLUTIONS since 2010, where I am Director of Infrastructures as well as a project manager and researcher. Before joining KEEP SOLUTIONS, I was part of the team that developed RODA, the digital preservation repository used by the Portuguese National Archives.
Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?
My role in SCAPE is primarily focused on Preservation Action Components and Repository Integration.
In Action Components, I’ve worked on the identification, evaluation and selection of large-scale action tools & services to be adapted to the SCAPE platform. I’ve contributed to the definition of a preservation tool specification, with the purpose of creating a standard interface for all preservation tools and a simplified mechanism for packaging and redistributing those tools to the wider community of preservation practitioners. I have also contributed to the definition of a preservation component specification, with the purpose of creating standard preservation components that can be automatically searched for, composed into executable preservation plans, and deployed on SCAPE-like execution platforms.
Currently my work is focused on repository integration, where I have the task of implementing the SCAPE repository interfaces in RODA, an open-source digital repository supported and maintained by KEEP SOLUTIONS. These interfaces, when implemented, will enable the repository to use the SCAPE preservation environment to perform preservation planning, watch, and large-scale preservation actions.
Why is your organisation involved in SCAPE?
KEEP SOLUTIONS is a company that provides advanced services for managing and preserving digital information. One of the vectors that drives us is continuous innovation in the area of digital preservation. In the SCAPE project, KEEP SOLUTIONS contributes expertise in digital preservation, especially migration technologies, along with practical knowledge of the development of large-scale digital repository systems. KEEP SOLUTIONS is also acquiring new skills in digital preservation, especially in preservation planning, watch and service parallelisation; we are enhancing the digital preservation products and services we currently support, such as RODA, and strengthening relationships with world-leading digital preservation researchers and institutions. KEEP SOLUTIONS’ participation in the project will enhance our expertise in digital preservation, and that will result in better products and services for our current and future clients.
What are the biggest challenges in SCAPE as you see it?
SCAPE is a big project, from the number of people and institutions involved to the number of digital preservation aspects covered. I think the biggest challenge will be the integration of all parts into a single coherent system. From a technical point of view the integration between content repositories, automated planning & watch and the executable platform is a huge challenge.
What do you think will be the most valuable outcome of SCAPE?
I see two very interesting aspects emerging from SCAPE.
One is the integration of automated planning & watch into digital preservation repositories. Planning is an essential part of digital preservation, and it involves human-level activities (like policy and decision making) and machine activities (like the evaluation of alternative strategies, and the characterisation and migration of content). Being able to bridge these two realms and provide content holders with the tools to make informed decisions about what to do with their data is a great achievement.
The other is the definition of a system architecture for large-scale processing, applied to the specific domain of digital preservation, that is capable of executing preservation actions like characterisation, migration and quality assurance over huge amounts of data in a “short” time.
Contact information:
Email: rcastro@keep.pt
Webpage: http://www.keep.pt/