SCAPE Planning and Watch: Two years and a bit more
What did we set out to do?
To accomplish effective digital preservation, environments with a preservation concern, such as repositories, need scalable, context-aware preservation planning and monitoring capabilities that ensure continued accessibility of content over time. These should form a continuous cycle that lets the system detect opportunities and risks and act accordingly.
So far so good. The problem is that, so far, this lifecycle is not well supported and thus hardly implemented in practice. We identified a number of gaps in the state of the art that we set out to address:
Preservation environments lack the business intelligence mechanisms and tools, and the scalable, feature-rich content profiling, required to really understand what is in a preservation collection and which risks it faces.
Knowledge sharing and discovery are not practised at scale, since current mechanisms do not support them well enough.
Decision-making efficiency needs to be improved (Plato is trustworthy, but requires considerable manual effort).
Policies need to be better understood and modelled, in particular preservation policies in the sense of "business policies", which guide and constrain preservation activities and provide the context for preservation planning, monitoring, and operations.
Lots of challenges! We set out to achieve five key goals:
Provide a scalable mechanism to create and monitor large content profiles
Enable monitoring of operational preservation compliance, risks and opportunities
Improve efficiency of trustworthy preservation planning
Make the systems aware of their context
Design for open, loosely-coupled and robust preservation ecosystems that can grow over time
What did we achieve so far?
Our SCAPE Planning and Watch suite makes preservation planning and monitoring context-aware through a semantic representation of key organizational factors, and it collects and reasons on preservation-relevant information. Integration with repositories and external information sources provide powerful preservation capabilities that can be freely integrated with virtually any repository. Many of you already know the names of the components of that solution:
C3PO provides scalable profiling of feature-rich content collections. It takes the output of FITS or Tika and calculates the statistical distribution of features. It also selects representative sample objects from the collection to enable systematic experiments of more manageable size, which is important for planning, and it exports these statistics and samples into a content profile that is understood by its partner tools, Plato and Scout. Finally, it has a neat, intuitive user interface for visualising properties dynamically. It does not (yet?) support real-time analytics on petabytes of data. (So far. It would be great to make that happen too…)
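To illustrate the kind of work C3PO does, here is a toy sketch of feature profiling and sample selection in Python. This is not C3PO's actual code: the record fields, the single-feature focus, and the sampling strategy are all simplifications made up for this example.

```python
from collections import Counter, defaultdict

def profile(records, feature="format", samples_per_value=1):
    """Compute the distribution of one feature across a collection and
    pick a few sample objects per distinct value.

    `records` is a list of dicts, one per object, as a flattened
    FITS/Tika characterisation run might yield (fields are assumed).
    """
    distribution = Counter(r.get(feature, "unknown") for r in records)
    samples = defaultdict(list)
    for r in records:
        value = r.get(feature, "unknown")
        if len(samples[value]) < samples_per_value:
            samples[value].append(r["id"])
    return dict(distribution), dict(samples)

records = [
    {"id": "obj-1", "format": "PDF/A-1b"},
    {"id": "obj-2", "format": "TIFF"},
    {"id": "obj-3", "format": "TIFF"},
]
dist, samples = profile(records)
# dist    == {"PDF/A-1b": 1, "TIFF": 2}
# samples == {"PDF/A-1b": ["obj-1"], "TIFF": ["obj-2"]}
```

The real tool computes such distributions over many features at once and at scale; the point here is only the shape of the result: a statistical summary plus representative samples, which is exactly what planning needs.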
Scout is the business intelligence component that draws content profiles and many other sources together to monitor what is going on in the repository – and outside! – and check whether the two are a good fit. C3PO and Scout together can already provide very useful insights into collections right now, if you use them properly. (There are still a few spaces left for the tutorial at iPRES.)
Plato, which has been around for a while, has been learning a lot about its context recently. Endowed with an understanding of the C3PO content profile and a semantic model of preservation "control policies", it is increasingly able to support the preservation planning process efficiently. While the big improvements will come out in the next year, some things are already much less work than they used to be – provided you have created a "policy model" beforehand and used C3PO to make a content profile. You can also discover Taverna workflows on myExperiment from inside Plato and run them from there. That discovery function is going to get a lot more powerful in the near future, by the way… The latest release of Plato is 4.2, with more to come soon, and it is of course available online as a service, as it has been since 2007.
The policy model is one of the things we will be presenting in more detail at IPRES this year (together with a demonstration of the tool suite and a tutorial on content profiling and monitoring with C3PO and SCOUT). The model represents an organisation's objectives and key contextual knowledge in a way that both Plato and SCOUT can use to provide better support for preservation planning and monitoring. A set of permanent vocabularies is out at PURL.org to provide the core elements used by the control policy model and others:
http://purl.org/DP/preservation-case contains the basic elements that link a preservation case together. This is in some cases quite closely related to preservation intent: It defines what is being preserved for whom, providing the rationale for checking whether the current state of preservation is fine or whether a plan for actions is needed.
http://purl.org/DP/quality defines the attributes used to describe aspects of preservation quality,
http://purl.org/DP/quality/measures contains the elements used for annotating, describing and discovering measures for quality,
http://purl.org/DP/control-policy, finally, defines the classes of objectives relevant for a preservation case, so that goals and objectives can be defined for each case.
We expect this set of vocabularies to grow over time, naturally, both in terms of classes and their instances. It is used by the tools in different ways, providing a glue that enables them to converse with each other.
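As a rough illustration of how such vocabularies glue tools together, the sketch below stores a handful of statements as plain subject–predicate–object tuples and queries them. The property and class names (`hasUserCommunity`, `hasObjective`, `FormatObjective`) are invented for this example and are not the published vocabulary terms; only the namespace URIs come from the post.

```python
# Namespaces from the post; the local names appended below are
# hypothetical illustrations, not the real vocabulary terms.
PC = "http://purl.org/DP/preservation-case#"
CP = "http://purl.org/DP/control-policy#"

triples = [
    ("case:NewspaperArchive", PC + "hasUserCommunity", "Scholars"),
    ("case:NewspaperArchive", PC + "hasObjective", "obj:KeepTextSearchable"),
    ("obj:KeepTextSearchable", CP + "type", CP + "FormatObjective"),
]

def objectives_of(case, triples):
    """Collect the objectives attached to one preservation case."""
    return [o for s, p, o in triples
            if s == case and p == PC + "hasObjective"]

result = objectives_of("case:NewspaperArchive", triples)
# result == ["obj:KeepTextSearchable"]
```

Because both Plato and Scout read the same statements, a planning tool can derive decision criteria from the objectives while a monitoring tool watches the same objectives for violations – that shared reading is the "glue".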
Where are we now and what is left to do?
Without trying to be complete, here are some of the key things you may want to know:
Prototypes of all tools are out.
The APIs between them are partially published; the rest will follow soon.
We have started to measure how long it takes to create a plan and how much we can improve on that.
SCOUT already knows a number of sources to get information from (such as content profiles, PRONOM, and the policy model). Every additional source that is added makes every other source more valuable, since SCOUT can link between them. We will be developing more adaptors for additional sources, but you are very much encouraged to create adaptors too!
Documentation about the vocabulary will be out soon, and so will be further thoughts on how you can specify your policies more effectively.
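To give an idea of what writing an adaptor involves, here is a minimal Python sketch. Scout's real adaptor API is Java-based and differs from this; the class name, field names, and measurement shape below are invented. The point is only the pattern: an adaptor turns a source-specific document into flat measurements that the monitor can store and cross-link with other sources.

```python
import time

class ContentProfileAdaptor:
    """Toy adaptor in the spirit of a Scout source adaptor: it flattens a
    source-specific document (here, a content profile dict) into
    (entity, property, value, timestamp) measurements. Names are
    illustrative only, not Scout's actual API."""

    def __init__(self, profile):
        self.profile = profile

    def fetch_measurements(self):
        now = time.time()
        for fmt, count in self.profile["format_distribution"].items():
            # One measurement per observed format, keyed by collection.
            yield (self.profile["collection"], "format_count/" + fmt, count, now)

profile = {"collection": "newspapers",
           "format_distribution": {"TIFF": 2, "PDF/A-1b": 1}}
measurements = list(ContentProfileAdaptor(profile).fetch_measurements())
# measurements[0][:3] == ("newspapers", "format_count/TIFF", 2)
```

Once measurements from different adaptors land in one store with a shared naming scheme, the monitor can relate them – say, format counts from a profile against format risk information from a registry – which is why each new source makes the others more valuable.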
Some of the upcoming things that the team in Planning and Watch is working on include
Specifying Service Level Agreements for preservation actions as part of the executable preservation plan, based on the criteria and measures created during planning, so that execution of the plan can be monitored continuously for compliance with expectations. (After all, the choice and configuration of the action was based on experiments and quality measures – we don't want surprises when we run it on lots of content!)
Sophisticated integration of Plato and Scout with myExperiment to discover components according to what they can do and what they measure, provided they are properly annotated.
Tool support for control policy editing, so you don’t need to model your policies in RDF!
A simulation engine that can be used to calculate predictions about the future state of a preservation environment, based on a current state and a set of assumptions. The neat thing here is that the entire set of assumptions is explicitly declared and documented, since this environment is built using model-driven engineering: a model of the simulation – the cause-effect relationships – is built using a domain-specific language, is documented together with each simulation run, and can be shared and extended, so the simulation is fully documented.
… and quite a few other things that you will hear about soon!
Final APIs will be openly published to enable anybody to integrate (with) these tools.
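The SLA idea mentioned above can be sketched as a simple compliance check over the measures fixed during planning. The criterion names, comparison operators, and thresholds below are made up for illustration – an actual SLA would carry the measures from the plan's decision criteria.

```python
def check_sla(criteria, measured):
    """Compare measured values from a plan's execution against the
    thresholds fixed during planning; return the list of violations.
    Criterion names and thresholds are illustrative assumptions."""
    violations = []
    for name, (op, threshold) in criteria.items():
        value = measured.get(name)
        ok = (value is not None and
              (value >= threshold if op == ">=" else value <= threshold))
        if not ok:
            violations.append((name, value, op, threshold))
    return violations

criteria = {"image_similarity":   (">=", 0.99),   # quality floor
            "seconds_per_object": ("<=", 2.0)}    # throughput ceiling
measured = {"image_similarity": 0.995, "seconds_per_object": 3.1}
violations = check_sla(criteria, measured)
# violations == [("seconds_per_object", 3.1, "<=", 2.0)]
```

Run continuously against measurements flowing in from monitoring, such a check is what turns the plan's experimental evidence into an enforceable expectation.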
I am very much looking forward to seeing the outcomes of the final SCAPE year!
By cbecker, posted in cbecker's Blog