Assessing file format risks: searching for Bigfoot?

Last week someone drew my attention to a recent iPRES paper by Roman Graf and Sergiu Gordea titled "A Risk Analysis of File Formats for Preservation Planning". The authors propose a methodology for assessing preservation risks for file formats using publicly available information sources. In short, their approach involves two stages:

  1. Collect and aggregate information on file formats from data sources such as PRONOM, Freebase and DBPedia
  2. Use this information to compute scores for a number of pre-defined risk factors (e.g. the number of software applications that support the format, the format's complexity, its popularity, and so on). A weighted average of these individual scores then gives an overall risk score.
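
To make the second step a bit more concrete, here is a minimal sketch (in Python, with invented factor names, scores and weights; none of these numbers come from the actual FFMA implementation) of what such a weighted aggregation boils down to:

    # Hypothetical factor scores (0 = no risk, 100 = maximum risk) and weights;
    # the numbers are invented for illustration and are NOT FFMA's.
    factor_scores = {"software_count": 30, "complexity": 60, "popularity": 10}
    weights = {"software_count": 0.5, "complexity": 0.25, "popularity": 0.25}

    def overall_risk(scores, weights):
        """Weighted average of the individual risk scores."""
        return sum(scores[f] * weights[f] for f in scores) / sum(weights.values())

    print(overall_risk(factor_scores, weights))  # 32.5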

This has resulted in the "File Format Metadata Aggregator" (FFMA), which is an expert system aimed at establishing a "well structured knowledge base with defined rules and scored metrics that is intended to provide decision making support for preservation experts".

The paper caught my attention for two reasons: first, a number of years ago some colleagues at the KB developed a method for evaluating file formats that is based on a similar way of looking at preservation risks. Second, just a few weeks ago I found out that the University of North Carolina is also working on a method for assessing "File Format Endangerment" which seems to be following a similar approach. Now let me start by saying that I'm extremely uneasy about assessing preservation risks in this way. To a large extent this is based on experiences with the KB-developed method, which is similar to the assessment method behind FFMA. I will use the remainder of this blog post to explain my reservations.

Criteria are largely theoretical

FFMA implicitly assumes that it is possible to assess format-specific preservation risks by evaluating formats against a list of pre-defined criteria. In this regard it is similar to (and builds on) the logic behind, to name but two examples, the Library of Congress' Sustainability Factors and the UK National Archives' format selection criteria. However, these criteria are largely based on theoretical considerations, without being backed up by any empirical data. As a result, their predictive value is largely unknown.

Appropriateness of measures

Even if we agree that criteria such as software support and the existence of migration paths to some alternative format are important, how exactly do we measure this? It is pretty straightforward to simply count the number of supporting software products or migration paths, but this says nothing about their quality or suitability for a specific task. For example, PDF is supported by a plethora of software tools, yet it is well known that few of them support every feature of the format (possibly even none, with the exception of Adobe's implementation). Here's another example: quite a few (open-source) software tools support the JP2 format, but many of them (including ImageMagick and GraphicsMagick) rely on JasPer, a JPEG 2000 library that is notorious for its poor performance and stability. So even if a format is supported by lots of tools, this will be of little use if the quality of those tools is poor.

Risk model and weighting of scores

Just as the employed criteria are largely theoretical, so are the computation of the risk scores, the weights that are assigned to each risk factor, and the way the individual scores are aggregated into an overall score. The latter is computed as the weighted sum of all individual scores, which means that a poor score on, for example, Software Count can be compensated by a high score on other factors. This doesn't strike me as very realistic, and it is also at odds with e.g. David Rosenthal's view of formats with open source renderers being immune from format obsolescence.
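
To see why this bothers me, consider a small hypothetical example (again with invented numbers, not FFMA's actual factors or weights): a format with the worst possible score on Software Count can still come out looking only moderately risky, whereas an aggregation that treats some factors as show-stoppers would flag it immediately.

    # Invented numbers for illustration: worst possible score on software
    # support, good scores elsewhere.
    scores = {"software_count": 100, "complexity": 10, "popularity": 5}
    weights = {"software_count": 0.4, "complexity": 0.3, "popularity": 0.3}

    weighted = sum(scores[f] * weights[f] for f in scores)  # weights sum to 1.0
    print(weighted)              # 44.5, which reads as 'medium risk'
    print(max(scores.values()))  # 100, a worst-case aggregation flags it as high risk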

Accuracy of underlying data

A cursory look at the web service implementation of FFMA revealed some results that make me wonder about the data that are used for the risk assessment. According to FFMA:

  • PNG, JPG and GIF are uncompressed formats (they're not!);
  • PDF is not a compressed format (in reality text in PDF nearly always uses Flate compression, whereas a whole array of compression methods may be used for images);
  • JP2 is not supported by any software (Software Count=0!), it doesn't have a MIME type, it is frequently used, and it is supported by web browsers (all wrong, although arguably some browser support exists if you account for external plugins);
  • JPX is not a compressed format and it is less complex than JP2 (in reality it is an extension of JP2 with added complexity).

To some extent this may also explain the peculiar ranking of formats in Figure 6 of the paper, which marks PDF and MS Word (!) as formats with a lower risk than TIFF (GIF has the lowest overall score).

What risks?

It is important to note that the concept of 'preservation risk' as addressed by FFMA is closely related to (and has its origins in) the idea of formats becoming obsolete over time. This idea is controversial, and the authors do acknowledge this by defining preservation risks in terms of the "additional effort required to render a file beyond the capability of a regular PC setup in [a] particular institution". However, in its current form FFMA only provides generalized information about formats, without addressing specific risks within formats. A good example of this is PDF, which may contain various features that are problematic for long-term preservation. Also note how PDF is marked as a low-risk format, despite the fact that it can be a container for JP2, which is considered high-risk. So doesn't that imply that a PDF that contains JPEG 2000 compressed images is at a higher risk?
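
As a rough illustration of how such format-internal features could be surfaced, the sketch below (my own, not part of FFMA) naively scans a PDF's raw bytes for filter names. It will miss filters declared inside compressed object streams, so it is only a heuristic, but it gives an impression of the compression methods used and flags embedded JPEG 2000 (JPXDecode) images:

    # Naive heuristic: count occurrences of PDF filter names in the raw bytes.
    # Filters inside compressed object streams are not seen, so treat the
    # counts as a lower bound.
    import re
    import sys
    from collections import Counter

    FILTERS = [b"FlateDecode", b"DCTDecode", b"JPXDecode",
               b"LZWDecode", b"CCITTFaxDecode", b"JBIG2Decode"]

    def filter_counts(path):
        data = open(path, "rb").read()
        return Counter({f.decode(): len(re.findall(re.escape(b"/" + f), data))
                        for f in FILTERS})

    if __name__ == "__main__":
        counts = filter_counts(sys.argv[1])
        print(counts)
        if counts["JPXDecode"] > 0:
            print("This PDF contains JPEG 2000 (JPX) compressed images.")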

Encyclopedia replacing expertise?

A possible response to the objections above would be to refine FFMA: adjust the criteria, modify the way the individual risk scores are computed, tweak the weights, change the way the overall score is computed from the individual scores, and improve the underlying data. Even though I'm sure this could lead to some improvement, I'm eerily reminded here of this recent blog post by Andy Jackson, in which he shares his concerns about the archival community's preoccupation with format, software, and hardware registries. Apart from the question of whether the existing registries are actually helpful in solving real-world problems, Jackson suggests that "maybe we don't know what information we need", and that "maybe we don't even know who or what we are building registries for". He also wonders if we are "trying to replace imagination and expertise with an encyclopedia". I think these comments apply equally well to the recurring attempts at reducing format-specific preservation risks to numerical risk factors, scores and indices. This approach simply doesn't do justice to the subtleties of practical digital preservation. Worse still, I see a potential danger of non-experts taking the results from such expert systems at face value, which can easily lead to ill-judged decisions. Here's an example.

KB example

About five years ago some colleagues at the KB developed a "quantifiable file format risk assessment method", which is described in this report. This method was applied to decide which still image format was the best candidate to replace the then-current format for digitisation masters. The outcome of this was used to justify a change from uncompressed TIFF to JP2. It was only much later that we found out about a host of practical and standard-related problems with the format, some of which are discussed here and here. None of these problems were accounted for by the earlier risk assessment method (and I have a hard time seeing how they ever could be)! The risk factor approach of FFMA covers similar ground, and this adds to my scepticism about addressing preservation risks in this manner.

Final thoughts

Taking into account the problems mentioned in this blog post, I have a hard time seeing how scoring models such as the one used by FFMA would help in solving practical digital preservation issues. It also makes me wonder why this idea keeps being revisited over and over again. Similar to the format registry situation, is this perhaps another manifestation of the "trying to replace imagination and expertise with an encyclopedia" phenomenon? What exactly is the point of classifying or ranking formats according to perceived preservation "risks" if these "risks" are largely based on theoretical considerations, and are so general that they say next to nothing about individual file (format) instances? Isn't this all a bit like searching for Bigfoot? Wouldn't the time and effort involved in these activities be better spent on trying to solve, document and publish concrete format-related problems and their solutions? Some examples can be found here (accessing old Powerpoint 4 files), here (recovering the contents of an old Commodore Amiga hard disk), here (BBC Micro Data Recovery), or even here (problems with contemporary formats).

I think there could also be a valuable role for some of the FFMA-related work in all this: the aggregation component of FFMA looks really useful for the automatic discovery of, for example, software applications that are able to read a specific format, and this could be hugely helpful in solving real-world preservation problems.

9 Comments

  1. andy jackson
    October 5, 2013 @ 7:44 pm CEST

    I find these comments heartening because, generally, I'd say we're all in agreement on one central point. We do not need 'expert systems' that make our decisions for us. But we do need aggregation systems that help bring together related information about formats (in whatever form, war stories, metrics, whatever) so that we can be better informed, and to help bring conflicted and bad information to the surface so that we might correct it.

    That's exactly what I would have liked to have used during some recent digging around in the tail of a web archive format distribution, and judging by the comments here, others would find that to be useful too. So maybe we could try to solve that problem first, and then worry about who and how best to contribute additional information and correct existing misinformation.

    Also, by focussing on aggregation, and not having to worry about editing, publishing or two-way synchronisation, it makes it much easier to imagine how to plumb in various forms of experimental data and bulk information from related data sources. I worked on the PLANETS Testbed, and I believe it went too far in that it tried to host the experiment and manage the data and the aggregation all in one system. It's not that it's necessarily impossible to make it work, it's just that it requires a huge amount of central investment and commitment. The web has shown us there are better ways of bringing things together.

  2. andrea_goethals
    October 4, 2013 @ 1:13 pm CEST

    Hi Johan,

    Thanks for the interesting post. I have a couple thoughts about it. One is that I agree that it is problematic to base format risk assessment tools on criteria that are entirely theoretical. There are a lot of theories but they haven't been tested. A few years ago I submitted a grant proposal to a US agency to fund a systematic study of 3000-5000 formats going back to around 1978 that would lead to a predictive model of format obsolescence. Unfortunately it wasn't funded but I still believe strongly that something like this is needed. 

    My second comment is about the Graf/Gordea paper about FFMA. I saw this presented at iPRES and my takeaway was that this is a creative process for automatically collecting online information, and we should think about ways in which we could use it (for example, like you mention, finding tools that claim to support a format, or maybe looking for patents). Forgive me, Graf & Gordea, but I consider the FFMA just an example of what could be done with this process; I don't consider the FFMA itself useful as a risk assessment tool because the underlying information is incomplete and often inaccurate. It is, though, a really good demonstration of what could be done with this technology in areas where we can tolerate fuzziness or where we have good underlying information.

    Andrea

  3. Roman Graf
    October 1, 2013 @ 7:47 am CEST

    Hi Johan,

    Thank you for your blog post, which has triggered a discussion about format risks. Discussion is very helpful for understanding the real digital preservation problems and preparing better solutions. We are really missing feedback and expert input.

    Our intention was to give experts a method that allows them to define their institution's priorities and to automatically retrieve data from the internet. It is also possible to apply this method only for the discovery of software applications.

    We are aware that repositories or experts could provide incomplete or wrong information. The advantage of the presented method is that you don't need to rely on somebody else's risk definitions, but can define them yourself.

    This method does not pretend to be a 100% solution for all problems – it is a helping tool. And as Ross mentioned in his blog, "the work is experimental, values mutable, and there is very much potential for organisational/personal differences."

    Actually, after recognizing that information aggregated from open sources is sometimes insufficient or inaccurate, we started aggregating expert knowledge in order to create a default repository with high confidence that could be extended with open source knowledge (e.g. searching for associated software). We also think about "aggregating the data sources and exposing conflicts and inaccuracies", like Andy stated and like it was suggested after the presentation, but we really miss input from experts.

    In this regard I would ask people interested in qualitative digital preservation to spend 5 minutes to support our work and provide their opinion about format risks.

    In order to improve the quality of the file format risk analysis, I would ask you to describe file format risks that are important for the BL. This could be an extension or update of the attached XML file, or simply in free text format. You can find sample risk factors in the paper (23 risk factors) or in the XML file. It would be especially useful if you could add new risks.

    In addition to the definition of a risk, it is important to have an associated classification. E.g. for the software count found for a file format: from 0 to 2 the risk is 100%, from 3 to 8 the risk is 62%, from 9 to 15 the risk is 30%, and for more than 15 the risk is 0%. The risk score is always between 0 and 100%. Or true/false for a Boolean risk factor (e.g. compressed or not).
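
    Just to illustrate, a small Python sketch of such a classification lookup (purely illustrative, not the FFMA code itself), using the software count example above:

        # Map a measured value (here: the number of software applications
        # found for a format) to a risk score via [min, max] ranges.
        classifications = [
            {"min": 0,  "max": 2,            "risk_score": 100},
            {"min": 3,  "max": 8,            "risk_score": 62},
            {"min": 9,  "max": 15,           "risk_score": 30},
            {"min": 16, "max": float("inf"), "risk_score": 0},
        ]

        def risk_score(value):
            for c in classifications:
                if c["min"] <= value <= c["max"]:
                    return c["risk_score"]
            return None  # value outside all defined ranges

        print(risk_score(5))   # 62
        print(risk_score(40))  # 0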

    It is possibly a nice idea, for a start, to focus only on some popular formats often used in libraries and Web archiving.

    If you describe it in plain text, it is important to define:

    • Risk Factor Name (e.g. "versions count" – the influence of the number of versions aggregated from LOD for a particular format)

    • Weight (severity) between 0.0 and 1.0, where 1.0 is most important

    • Risk classifications [0:n] for this Risk Factor, with min value, max value and an associated risk score from 0 to 100, where 100 means maximal risk.

    e.g. for the versions count risk factor: classification1 {min value=0; max value=2; risk score=0}, classification2 {min value=3; max value=10; risk score=30}, classification3 {min value=11; max value=20; risk score=74}, classification4 {min value=21; max value=200; risk score=100}, which means that if our automatic analysis for e.g. the "PDF" file format detects 17 versions, the risk score=74. It is similar for dates and booleans.

    In XML the same looks like this:

        <tns:property>
          <tns:id>VERSIONS_COUNT_SCORE_PROPERTY_ID</tns:id>
          <tns:riskClassification>
            <tns:weight>1.0</tns:weight>
            <tns:agent>tns:agent</tns:agent>
            <tns:creationDate>2013-02-11</tns:creationDate>
            <tns:riskFactors>
              <tns:riskFactor>
                <tns:minValue>5</tns:minValue>
                <tns:maxValue>100</tns:maxValue>
                <tns:riskScore>100</tns:riskScore>
              </tns:riskFactor>
              <tns:riskFactor>
                <tns:minValue>2</tns:minValue>
                <tns:maxValue>4</tns:maxValue>
                <tns:riskScore>30</tns:riskScore>
              </tns:riskFactor>
              <tns:riskFactor>
                <tns:minValue>0</tns:minValue>
                <tns:maxValue>1</tns:maxValue>
                <tns:riskScore>0</tns:riskScore>
              </tns:riskFactor>
            </tns:riskFactors>
          </tns:riskClassification>
        </tns:property>

    You could post your thoughts here or send us an e-mail: roman.graf@ait.ac.at or sergiu.gordea@ait.ac.at

  4. ross-spencer
    October 1, 2013 @ 6:41 am CEST

    I think there is a trend in digital preservation to follow a single path at a time and look back unfavourably on the previous paths we've followed. 

    I think we need war stories. I think we need registries. I also believe we need metrics. 

    While I agree with some of your criticisms, I did leap to the author's defence on Twitter to highlight that the paper states that the work is experimental, values mutable, and there is very much potential for organisational/personal differences. Confirmation bias works in favour of those using the model, and those criticising it – we need to be wary of that.

    Further, while it looks like we are creating a single number that says a format is risky or not, one must understand that risk analysis and measurement is about the reduction of uncertainty – not the elimination of it.

    I wrote a paper on risk for the Planets Project and ECA2010, the latter less redacted than the one published to Planets but never made more widely available. 

    It uses Douglas Hubbard's 'How To Measure Anything' as a backbone for an attempt to understand the application of metrics and risk analysis techniques to understanding file format obsolescence. 

    Key quotes from Hubbard's book:

    Measurement: "A set of observations that reduce uncertainty where the result is expressed as a quantity"

    Uncertainty: "The lack of complete certainty – that is, the existence of more than one possibility. The 'true' outcome/state/result/value is not known."

    Risk: "A state of uncertainty where some of the possibilities involve a loss, injury, catastrophe, or other undesirable outcome (i.e. something bad could happen)."

    I think we have the ability to measure and the ability to think about what we do with those measurements; but we need the numbers first.

    You highlight counting the number of applications that support a format, or the number of migration paths for a format – I think we can do that. 

    I agree that this says nothing about the quality and suitability of that measurement, however, and somewhat ironically, with tools such as Jpylyzer we have such powerful ways of measuring formats – and more and more should appear over time. That means we can start counting the amount of support or number of pathways available for very specific profiles of JPEG2000. 

    Problems with weighting do exist:

    "Software Count can be compensated by a high score on other factors. This doesn't strike me as very realistic"

    No – it isn't very realistic, but to be pedantic – what other factors do you mean?

    I highlight that specifically because what numbers can do is pull us away from situations where we are telling folk stories. Combined with documented war stories and encyclopaedias we can reduce ambiguity in such statements.

    One of the more interesting statements you make is one that I think we could really bolster with numbers:

    PDF is not a compressed format (in reality text in PDF nearly always uses Flate compression, whereas a whole array of compression methods may be used for images);

    I really like that statement. I believe it! But, let's add numbers to it! Let's collect PDFs, count them, count the versions, then dissect those and pull out the features – the ones using deflate, the ones with the myriad different compression types for images – and count those too. Let's do it for the web, for institutions, for agencies, for departments, and then we can work through those numbers looking for patterns.

    We might be able to develop statistics that say we're likely to collect formats with this feature from location X or these features from location Y or there might be no correlation whatsoever – but it might help us to think about potential (likely) risks in our repositories, give us research angles, and might be able to reduce the fear of the unknown we might sometimes approach this subject with. 

    To take your example literally, we might be able to say, "dissect those numbers any which way you like and we see that 90% of all PDFs use DEFLATE compression for text – PDF, commonly, is a compressed format."
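
    A back-of-an-envelope sketch of that kind of counting (naively scanning raw PDF bytes for the filter name, so the percentages should be treated with caution; the corpus directory is of course hypothetical):

        # What percentage of PDFs in a directory declare FlateDecode anywhere?
        import pathlib

        def uses_flate(path):
            return b"/FlateDecode" in path.read_bytes()

        def flate_percentage(directory):
            pdfs = list(pathlib.Path(directory).glob("*.pdf"))
            if not pdfs:
                return 0.0
            return 100.0 * sum(uses_flate(p) for p in pdfs) / len(pdfs)

        print(flate_percentage("./pdf_corpus"))  # hypothetical corpus location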

    Does this help us to say PDF is "riskier" than another format – say JPEG2000? I don't know – we do need to address that. It might be possible – let's keep collecting figures, looking for patterns, and other features to factor in, and let's find out. This demonstrates an automated approach to that, and there's nothing to stop us building better automated approaches to continue doing that.

    One of the paper's issues which comes out, not through its own fault, is that we simply haven't the data and we haven't a consistent way of recording it or a standard place to put it. We need to collect it. We need to be consistent in our collection. We need to put it somewhere. We need numbers, and once we've got them, subject matter experts and maybe some of those mathematical types with far greater statistics capability than my own might be able to work with us to do something just a little bit clever with them.

    What exactly is the point of classifying or ranking formats according to perceived preservation "risks" if these "risks" are largely based on theoretical considerations, and are so general that they say next to nothing about individual file (format) instances? Isn't this all a bit like searching for Bigfoot? Wouldn't the time and effort involved in these activities be better spent on trying to solve, document and publish concrete format-related problems and their solutions?

    Registries, war stories, statistics. I believe there is room for all of it. We just need to be given a framework to enable us to record it and use it. 

  5. andy jackson
    September 30, 2013 @ 9:37 pm CEST

    Reading through your post, this quote leapt out at me:

    "well structured knowledge base with defined rules and scored metrics that is intended to provide decision making support for preservation experts"

    …but who needs preservation experts if your expert system is able to reduce all these complexities down to a single number? What else could a responsible operator do but pick the format with the lowest percentage risk… (Aside: percentage of what?!)

    There is certainly value in aggregating the data sources and exposing conflicts and inaccuracies, and I'd much rather see more work like that, with the goal of improving the quality of the results. We have known for many years that we need better quality data sources and test suites. Without that, any evaluation framework is pure GIGO.
