Monitoring Disappearing File Formats 1: Introduction

Monitoring Disappearing File Formats 1: Introduction

This article is part of a series on monitoring aging file formats. Can you predict which file formats are likely to become obsolete? This project is part of the Dutch Digital Heritage Network Preservation Watch and Preferred Formats program. We will post the entire (translated) series on the OPF platform in the coming weeks.

© 2022 CC-BY-SA-4.0 Rein van ‘t Veer/Netwerk Digitaal Erfgoed
Afbeelding bovenin: © 2022 CC-BY-SA-4.0 auteur, bewerkt van https://www.thingiverse.com/thing:1614896

Original author: Rein van ‘t Veer

In mid-2022, the Preservation Watch working group issued an assignment regarding the monitoring the life cycle of file formats. The purpose of this was to investigate the predictability of aging file formats. In October 2022 Rein van ‘t Veer started this project as “data scientist” (archaeologist by training). This article explains the intended approach to the project and what questions form the base of the research.

First, a few obvious questions:

  1. What exactly do we mean by file formats?
  2. What does it mean when file formats are “obsolete”?
  3. Why would we want to know which file formats are becoming obsolete?

1. What exactly do we mean by file formats?

The answer to the question of what a file format (or “file type”) is, is less simple than it seems. Many will equate file formats with a filename extension such as “.docx” or “.jpg”, but sometimes this is not correct. Consider, for example, the PDF file format: it has different profiles that offer different functionalities. The PDF/A sub-profiles are recommended for archiving purposes, to ensure that encryption and password protection do not stand in the way of the accessibility of archives.

2. What does it mean when file formats are obsolete?

“Aging” and “disuse” are more difficult to pin down. Is a format obsolete if it is no longer used? Or if it can no longer be opened with common programs? For example, consider WordPerfect documents, once the standard for word processing under MS-DOS. Microsoft Word quickly overtook WordPerfect in the mid-1990s after a flopped Windows version. Despite this, files can still be opened and edited with a recent LibreOffice. The only question is how long this file format will be supported in commonly used word processing programs. After all, if no one produces WordPerfect documents anymore, there will be little incentive for software manufacturers to include it in the list of supported formats.

Support in software libraries will be available for a long time, such as those made available in the Document Liberation Project or the Developer’s Collection of Open Source File Format APIs. However, implementing this in an application is a fairly technical affair. This is the crux of aging file formats. The moment that common user software no longer supports the file format, you want to be well ahead of it.

3. Why would we want to know which file formats are becoming obsolete?

We already touched on it in the previous paragraph. Despite the fact that there will always be old versions of LibreOffice to install. The threshold for opening WordPerfect files and the search for software that does support the format in the future must be kept to a minimum to keep archives sustainable and usable. While more and more software can be brought back to life with emulation, modern computers can run Windows 95, Windows 3.1 and MacOS 8 in a browser without any problem. Installing old software can be quite a challenge, let alone the lack of software license keys for old proprietary software.

By monitoring the prevalence of file formats, we can guarantee the accessibility and usability of digital archives. This can be achieved by converting outdated formats (while preserving the original) to modern, well-supported equivalents: preferred formats.

Approach

We have now established the importance of monitoring file format obsolescence, in the coming blogs we will discuss its measurability. We can measure the popularity of file formats in the number of deposited files of a certain file type, added up over a certain period of time such as a month or a quarter. Rare file formats may even require collection over a period of one year. Increasing use, measured over sufficiently different periods, is good, but declining use may require intervention. Intervention is only necessary if:

  1. Usage falls below a certain threshold (say: 1% of the highest measured usage)
  1. If there are no more common applications available to open the format. Opening the file format should not become a quest or technical undertaking.
  1. If an equivalent (or better) file type is available to which the obsolete format can be converted, while retaining all data and metadata.

The main question for this project is whether we can predict when this threshold value will be reached for decreasingly popular file formats.

Models

Just like for weather forecasts, we use mathematical models to predict the popularity of file formats. In the case of file formats, we currently use much simpler models than for forecasting the weather. As many more measurable variables play a role in weather than file types. We will test the following models:

  1. The Bass model, named after scientist and model creator Frank Bass,
  1. A simple regression model with only one variable: file type prevalence.

These two model types are explained in more detail below.

Baseline models

The reason why we choose two models instead of one is easy to explain. The Bass diffusion model is more complex than the simplest approximation: a regression line. So to make sense, we need to know if, under what circumstances, and how much better the Bass diffusion model is at predicting popularity. We simply do not know this with a single model. The simpler regression model is therefore a “baseline” model. It allows us to compare the reliability of a less simple model with it. The saying goes “All models are wrong, but some are useful.” Each model is no more than an approximation of reality, but some models do better than others.

Metrics

We can measure the accuracy, for example, by the average deviation of the number of deposited files per period compared to the forecast. The lower the deviation, the more accurate the model. Measuring such accuracy is called a metric. In general, the simpler the metric, the easier it is to interpret, which benefits the comparison of the models.

The Bass Diffusion Model

Frank Bass introduced the Bass model in 1969 in Management Science magazine under the title A New Product Growth for Model Consumer Durables. The model describes and predicts how sales of new products will develop – from early adopters (innovators) to followers (imitators). After the early entrants share their enthusiasm or introduce derivative services or products, the bulk of the product decline comes from the “followers” until the product becomes obsolete. Incidentally, the obsolescence of products can be intentional: this is known as planned obsolescence. Anyway: over time there are fewer and fewer buyers until there are no more new buyers.

The Bass model predicts decline (the number of sales by new adopters, in red in the figure) over time. This is based on the above two parameters: the innovators (the dotted line) and the imitators (the dashed-dashed line).

The innovators initially form a large majority that steadily decreases, but after a year or so they are overtaken by the imitators. After a peak, the number of imitators decreases again, after which adoption eventually fizzles out.

Previous research on the Bass model

The question of whether this model applies to file formats has already been investigated in 2017 by Kresimir Duretec and Christoph Becker in their article Format Technology Lifecycle Analysis and their results were positive. The intuition behind the applicability of the Bass model to file formats is as follows. The software used to produce the file formats has a lifecycle that fits Frank Bass’s model. Software is released, picked up by early adopters who spread their enthusiasm for the software (if software is successful) and followed by a larger group of users who see the software’s usefulness. Until a better version of the software comes out, with more features beyond the capabilities of the older associated file formats. So with newer and better software come formats that can make better use of the new functionalities, after which the older formats become obsolete.

The question is whether their research results also apply to the Dutch archival situation and how the accuracy of the model compares to a simpler model: this is the main goal of this project.

Simple linear regression

Most of us have dealt with this at least in high school: a linear regression or trend line. This trend line is a straight line through a number of measured data points. The deviation between this line and the actual measurements is as small as possible. Since we limit our research to declining popularity, we can measure accuracy using measures of declining popularity. Linear trendlines are simple and useful for phenomena that have a steady, continuous decrease or increase, but have some “noise on the line”.

Example of a linear regression line (in blue) going through several data points (orange). Source: https://en.wikipedia.org/wiki/Linear_regression#/media/File:Anscombe’s_quartet_3.svg.

A regression line also has two parameters. One, the multiplier for the steepness of the line. Two, the correction that determines where on the y-axis the line crosses the zero point. Due to its simplicity, such a regression line is a commonly used reference method.

Data

To measure the accuracy of the above models, we first use a reference dataset:

  • Common Crawl data, a reference dataset for comparison with the Dutch digital archive situation.

Data from the following NDE partners:

Rein van ‘t Veer examined which files in which formats were created at which time, or archived when the creation date is not known at the above mentioned institutions. The data is grouped by month, quarter or year and filtered by file formats that are decreasing in use. We then fit the remaining data into the models and assess these models for their accuracy. Finally, we look at the available applications for a selection of file formats. Can the files still be opened in generally available applications – are they still well supported?

Event

Do you want to learn how to make predictions about the life cycle of file formats in an e-depot?

We will host an in-person workshop in Dutch where you get to work hands-on with your own data. Join us on 5 September, 2023, 13:00-16:00 in The Hague, The Netherlands. More information.

© 2022 CC-BY-SA-4.0 Rein van ‘t Veer/Network Digital Heritage

Featured image: © 2022 CC-BY-SA-4.0 author, modified from https://www.thingiverse.com/thing:1614896

The original blog was posted in Dutch on the KIA platform (Knowledge network Information and Archives) by tech watcher Rein van ‘t Veer.

Adaptation on floppy disk

Leave a Reply

Join the conversation