ChatGPT discusses Digital Preservation

DALL·E “A futuristic robot and human meet, synth wave”

There has been a lot of buzz around AI and language tools, so OPF decided to see what all the fuss is about and whether they can be useful for our community.

The field of Artificial Intelligence (AI) has existed almost as long as the field of Computer Science[1][2]. Once machines became capable of doing lots of complex tasks quickly, it was speculated that in the not too distant future they would also become intelligent, and scientists and researchers immediately set about making that a reality. For a long time AI was pretty much only found in research labs and universities, and we, the public, would rarely interact with it. This changed when the web came along and algorithms became more and more important. In the last couple of years AI tools have become accessible to the masses through online services like those from OpenAI, which we are looking at today.

In the early days of AI, we had Expert Systems that could help doctors and engineers with diagnoses and design[3]. Then neural networks became the hot topic and brought facial recognition and object classification to all our favourite social media. Most people’s first direct interaction with an AI would have been the early chatbots of the day, or “Ask Jeeves”, which are descended from the 1960s MIT project ELIZA[4]. The tools slaying it today are the Convolutional Neural Networks famous for generating new images, but also used for other traditional research, and Language Models like GPT.

Tom Scott recently opined on his YouTube channel that we might be at the start of a new technological revolution, similar to the early days of the internet, and that we may have to wait a decade or more to see how this will change our lives and the world around us[5]. Already we are seeing lawsuits around copyright and plagiarism in both the generative art and language model fields. These are not unlike the lawsuits we saw around Napster in the early 2000s, which were the precursor to streaming music, and eventually video too, and which led the way to the death of the business models of the video rental store and the physical music shop. The sale of music and movies on physical media still exists, but not like we remember from the end of the last century! How these new AI tools in the hands of the wider public will challenge existing business models and the status quo is perhaps unknown, but we are probably in for some interesting years ahead. Deepfakes, school work written by machines and fake news are already a problem: like any new development, AI can be used as a tool to assist us or for less than noble purposes.

But I am not here to ponder the future of AI; I want to see what it knows about Digital Preservation! Can this new AI tool be an ally to the Digital Preservation Community? So I sat down with a coffee and had a chat with ChatGPT. Incidentally, the images in this blog were also generated by AI tools.

Hello ChatGPT

OPF: Hello ChatGPT. I am writing a blog post about AI and a conversation with a chatbot. Can you write the first introductory paragraph for me, explaining what you are and what you can and cannot do in layman’s terms

Hello there! I’m ChatGPT, a language model developed by OpenAI. I’m designed to understand and process human language and generate responses that are as human-like as possible. I can do a lot of things, like answering questions, helping with research, generating creative writing prompts, and even engaging in casual conversation. However, I’m not a human, so I don’t have personal experiences or emotions, and my responses are based solely on patterns and information I’ve learned from vast amounts of text data. Nevertheless, I’m always learning and improving, so feel free to ask me anything you’d like!

OPF: The blog is about the future of Digital Preservation, can you explain digital preservation in such a way that a 5 year old could understand?

Of course! Digital preservation is like keeping your favorite toys safe for a long time. When you grow up, you might not play with them anymore, but you still want to keep them because they are special to you. Just like how you put your toys away in a special place so they won’t get lost or broken, people who take care of important digital things like pictures, videos, and documents also want to keep them safe for a long time. They do this by making sure that the digital things are saved in special places where they won’t get lost or damaged, and that they can still be looked at or used in the future.

OPF: Explain Digital Preservation in simple terms

Digital preservation refers to the process of ensuring that digital information, such as documents, images, videos, and other types of digital files, remain accessible and usable over time.

This involves taking steps to protect digital content from the risk of loss or deterioration due to various factors such as hardware or software obsolescence, data corruption, cyber-attacks, natural disasters, and human error.

To achieve digital preservation, various techniques such as backup and replication, migration, emulation, and digital curation are employed to ensure that digital content is maintained and can be accessed, searched, and used by future generations. The ultimate goal of digital preservation is to ensure that valuable digital information is available and usable both today and in the future.

OPF: In one paragraph explain the importance of Digital Preservation

Digital Preservation is crucial for any organization that values its digital assets and wants to ensure their long-term accessibility and usability. In today’s digital world, organizations rely heavily on digital information, such as records, financial data, customer information, and intellectual property. However, digital content is vulnerable to many threats, such as data loss, corruption, and obsolescence, which can lead to significant financial and reputational damage. By implementing a comprehensive digital preservation strategy, organizations can protect their digital assets, reduce the risk of data loss, ensure compliance with regulatory requirements, and maintain their competitive edge. Moreover, digital preservation ensures that valuable information can be accessed and used by future generations, contributing to the continuity of knowledge and the preservation of our cultural heritage. Therefore, taking digital preservation seriously is not only a responsible and prudent approach but also an investment in the future of the organization and society as a whole.

In the style of Picasso draw a machine writing its memoirs

A surrealist painting of a robot learning to use tools

A high resolution photo of a steampunk DNA molecule

A cubist painting of a robot chiselling information onto a stone CD disc

OPF: What tools can the archivist use to preserve digital assets?

Archivists can use various tools to preserve digital assets. Here are some examples:

  1. Digital Repository: A digital repository is a software system that provides secure and long-term storage of digital assets. It can manage the preservation of digital materials and support access to them over time.
  2. Digital Forensics Tools: These are specialized software tools that help in the identification, preservation, and analysis of digital evidence, such as files, metadata, and other digital artifacts.
  3. Digital Asset Management System: A digital asset management system (DAMS) is a software tool used to organize, store, and retrieve digital assets. It provides a centralized repository for digital assets and can help ensure their long-term preservation.
  4. Migration Tools: Migration tools help transfer digital assets from one format to another while maintaining their integrity and authenticity. They can also help to ensure the continued accessibility of digital assets as technology changes.
  5. Emulation Tools: Emulation tools enable the running of outdated hardware or software on modern computers, ensuring the continued accessibility of digital assets that are dependent on obsolete technology.
  6. Digital Preservation Metadata: Digital preservation metadata is information that describes the characteristics and history of digital assets. It provides context and aids in the preservation and long-term management of digital assets.

OPF: What is the best storage media?

There isn’t a one-size-fits-all answer to this question, as the best storage media can vary depending on factors like the type of data you need to store, how long you need to store it, and how frequently you need to access it.

That being said, some commonly used storage media for digital preservation include hard drives, tape drives, optical discs (like CDs and DVDs), and solid-state drives. Each of these storage media have their own pros and cons in terms of factors like capacity, speed, durability, and cost.

Additionally, it’s important to note that storage media is just one aspect of digital preservation, and that other factors like data integrity, metadata, and migration plans are also important considerations for ensuring that digital data remains accessible and usable over time.

OPF: If I wanted to store a digital asset for a million years what storage media would I use?

Storing a digital asset for a million years is an extremely long time, and there is currently no storage medium that can guarantee reliable data storage for such an extended period. Even the most durable storage media can degrade over time due to physical, chemical, or environmental factors.

However, there are some potential storage solutions that could be considered for extremely long-term preservation. One example is a specialized storage medium called a “synthetic DNA archive.” Scientists have developed a method for encoding digital data as synthetic DNA molecules, which are incredibly stable and can last for thousands of years under optimal conditions.

Another approach is the “M-Disc,” a type of optical disc that uses a special layer of rock-like material to physically engrave data onto the disc surface. M-Discs claim to have a lifespan of up to 1,000 years under normal storage conditions.

That being said, these solutions are still relatively new and untested over such long periods of time. In practice, the best strategy for long-term digital preservation is to use a combination of storage media, data redundancy, and regular data migration to ensure that important digital assets are preserved and accessible over time.

ChatGPT writing code

OPF: Can you write me a script using only free and open source tools, that takes a folder as a parameter and finds all the PDF files and converts them to PDF/A format, changing the name so as to preserve the original file

Sure, here is an example script that meets your requirements using only free and open source tools:

#!/bin/bash

# Take the folder as a parameter
folder="$1"

# Find all PDF files in the folder and subdirectories
# (IFS= and -r keep leading spaces and backslashes in file names intact)
find "$folder" -type f -name '*.pdf' | while IFS= read -r file; do

    # Get the filename without the extension
    name=$(basename "$file" .pdf)

    # Convert the PDF to PDF/A format using Ghostscript
    # (PDFA and PDFACompatibilityPolicy are integer options, so they are set with -d)
    gs -dPDFA=1 -dBATCH -dNOPAUSE -sColorConversionStrategy=CMYK -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -sOutputFile="$folder/$name.pdfa.pdf" "$file"

done

 

Here’s what the script does:

  • It takes a folder as a parameter, which is stored in the '$folder' variable.
  • It uses the 'find' command to locate all PDF files in the specified folder and its subdirectories.
  • For each PDF file found, it uses Ghostscript to convert it to PDF/A format and save the resulting file with a new name that preserves the original file name but with the “.pdfa.pdf” extension.

Note that this script assumes that Ghostscript is installed on your system. If it’s not installed, you can typically install it using your system’s package manager (e.g., 'sudo apt-get install ghostscript' on Ubuntu/Debian-based systems).
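One thing to watch in the renaming step: the script writes every output into the top-level folder, so two PDFs with the same name in different subfolders would overwrite each other, and a second run would pick up the “.pdfa.pdf” files it created earlier. The following is only a minimal sketch of how the naming could be handled. The helper names pdfa_output_path and is_converted_copy are hypothetical and were not part of ChatGPT’s answer.

```shell
#!/bin/bash

# Hypothetical helper: build the output path next to the source file,
# keeping the original name and adding the ".pdfa.pdf" suffix.
pdfa_output_path() {
    local src="$1"
    local dir name
    dir=$(dirname "$src")
    name=$(basename "$src" .pdf)
    printf '%s/%s.pdfa.pdf\n' "$dir" "$name"
}

# Hypothetical helper: return 0 if the file is already a converted copy,
# so that a repeated run can skip it.
is_converted_copy() {
    case "$1" in
        *.pdfa.pdf) return 0 ;;
        *)          return 1 ;;
    esac
}
```

Inside the loop, a line such as is_converted_copy "$file" && continue would skip earlier outputs, and pdfa_output_path "$file" could supply the value for -sOutputFile.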

OPF: Can you write me a Powershell script for windows using free and open source tools that accepts a folder as a parameter, and then scans for all the PDF files and adds the following watermark to the file: “© 2023 – Open Preservation Foundation”

Certainly! Here’s a PowerShell script that should accomplish what you’re looking for:

# Define watermark text
$watermark = "© 2023 - Open Preservation Foundation"

# Get folder path from parameter
$folderPath = $args[0]

# Find all PDF files in folder and subfolders
$pdfFiles = Get-ChildItem -Path $folderPath -Recurse -Filter *.pdf

# Loop through PDF files and add watermark
foreach ($pdfFile in $pdfFiles) {
    # Create new output file path with "_watermarked" suffix
    $outputPath = Join-Path $pdfFile.Directory.FullName ($pdfFile.BaseName + "_watermarked" + $pdfFile.Extension)

    # Use Ghostscript to add watermark to PDF
    & "C:\Path\To\gs\bin\gswin64c.exe" -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite `
        -dPDFSETTINGS=/prepress -dCompatibilityLevel=1.4 `
        -sOutputFile=$outputPath `
        -c "mark { $watermark } /Arial-Bold findfont 40 scalefont setfont 1 1 moveto gsave show grestore newpath } pathforall" `
        -f $pdfFile.FullName

    # Replace original file with watermarked file
    Remove-Item $pdfFile.FullName
    Rename-Item $outputPath $pdfFile.Name
}

To use this script, save it as a .ps1 file (e.g. Add-PdfWatermark.ps1) and run it in PowerShell with the folder path as the first argument:

.\Add-PdfWatermark.ps1 "C:\Path\To\Folder"

Note that this script relies on the free and open-source Ghostscript tool to add the watermark to PDF files. You’ll need to download and install Ghostscript, and update the path to the 'gswin64c.exe' executable in the script to match your installation location.

OPF: What educational programs exist in digital preservation?

There are several educational programs and courses available for individuals who are interested in learning about digital preservation. Here are some examples:

  1. Digital Preservation Management (DPM) Online Certificate Program: This is a comprehensive program offered by the University of Michigan that provides training in all aspects of digital preservation management, including selecting content, metadata, storage, access, and sustainability.

  2. Digital Curation and Data Management Certificate Program: This program, offered by the University of Illinois at Urbana-Champaign, provides training in digital curation and data management, including digital preservation, data sharing, and open science.

  3. Digital Preservation Training Program: The Library of Congress offers a digital preservation training program that provides information and resources on the basics of digital preservation, including preservation planning, metadata, and file formats.

  4. Digital Preservation Essentials: The Digital Preservation Coalition offers a series of online courses and webinars that cover the basics of digital preservation, including preservation planning, metadata, and file formats.

  5. Preservation Management of Digital Materials: This course, offered by the Society of American Archivists, covers the fundamentals of digital preservation, including selecting content, metadata, storage, access, and sustainability.

Overall, these programs and courses can provide individuals with the knowledge and skills needed to effectively manage and preserve digital materials. Additionally, many professional organizations and associations offer workshops, conferences, and other training opportunities for individuals who are interested in digital preservation.

OPF: If I had never heard of digital preservation before, what would be a good book, or academic paper to start learning more?

If you are new to digital preservation, there are several books and academic papers that could provide a good introduction to the topic. Here are a few examples:

  1. “Digital Preservation for Libraries, Archives, and Museums” by Edward M. Corrado and Heather Moulaison Sandy: This book is a comprehensive introduction to digital preservation, covering topics such as metadata, file formats, digital preservation planning, and risk management. It provides practical guidance for those new to the field and includes case studies and examples to help illustrate key concepts.
  2. “Preserving Digital Materials” by Ross Harvey: This book provides an overview of the key issues involved in digital preservation, including the challenges of preserving different types of digital materials, the use of standards and best practices, and the role of digital preservation in cultural heritage organizations. It is aimed at archivists, librarians, and other professionals involved in the preservation of digital materials.
  3. “The Digital Preservation Handbook” by the Digital Preservation Coalition: This online resource provides an introduction to digital preservation, covering topics such as selection and appraisal, preservation planning, metadata, and file formats. It includes practical guidance and examples, and is aimed at practitioners and others with a basic understanding of the issues involved in digital preservation.
  4. “Towards a Theory of Digital Preservation” by Seamus Ross: This academic paper provides a conceptual framework for understanding the key issues in digital preservation, including the challenges of technological obsolescence, authenticity, and sustainability. It provides a theoretical perspective on digital preservation that can help guide practice and policy in the field.

 

These are just a few examples of resources that could be useful for someone new to digital preservation. The choice of resource will depend on your specific interests and requirements.

A caveman learns to code from a futuristic robot

An aquatic robot stores all the worlds books in its head

A girl in a spacesuit on the moon watches a formation of Earth Spaceships flying towards the planet earth

A child in a city watches a futuristic star-trek space ship in the sky

OPF: What is the future of Digital Preservation?

The future of digital preservation is likely to involve a continued focus on technological advancements and increased collaboration among organizations and institutions. Here are some potential developments that we may see in the coming years:

  1. Increased use of automation and artificial intelligence: As the volume of digital materials continues to grow, organizations may increasingly turn to automation and artificial intelligence to help manage and preserve these materials. For example, machine learning algorithms could be used to identify and prioritize materials for preservation based on their significance, condition, and other factors.

  2. Adoption of open standards and interoperability: The use of open standards and interoperable technologies can help ensure that digital materials can be accessed and used over time, regardless of changes in hardware and software. Organizations may increasingly adopt these technologies to ensure the long-term sustainability of their digital materials.

  3. Continued development of cloud-based storage and preservation solutions: Cloud-based storage and preservation solutions offer many advantages, including scalability, flexibility, and cost-effectiveness. As cloud technologies continue to evolve, we may see more organizations moving their digital preservation efforts to cloud-based platforms.

  4. Emphasis on data management and stewardship: In addition to preserving individual digital assets, there is a growing recognition of the importance of managing and stewarding digital data as a whole. This includes ensuring the long-term availability and usability of research data, scientific datasets, and other types of digital information.

  5. Greater collaboration and shared infrastructure: Collaboration and shared infrastructure can help reduce costs and increase efficiency in digital preservation efforts. We may see more organizations working together to develop shared digital preservation repositories and platforms.

Overall, the future of digital preservation is likely to involve a combination of technological advancements, increased collaboration, and a continued emphasis on the importance of long-term stewardship and access to digital materials.

OPF: Looking to the future now, what challenges does the field of digital preservation face in the short term, medium term, and long term?

The field of digital preservation faces several challenges in the short term, medium term, and long term. Here are some examples:

 

Short term:

  1. Rapidly changing technology: Digital preservation practitioners must keep up with the pace of technological change to ensure that digital content can be preserved and accessed over time.
  2. Limited resources: Many organizations have limited resources for digital preservation, which can make it difficult to implement and maintain preservation strategies.
  3. Lack of standards and best practices: While there are many standards and best practices for digital preservation, there is still a lack of consensus on many issues, which can make it difficult for practitioners to know which approaches to use.

 

Medium term:

  1. Data volume: The volume of digital content is growing rapidly, which makes it increasingly challenging to manage and preserve this content over time.
  2. Data complexity: Digital content is becoming more complex, with new types of content such as virtual and augmented reality, 3D modeling, and machine learning algorithms, which pose new challenges for preservation.
  3. Digital obsolescence: The risk of digital obsolescence continues to increase as hardware and software become obsolete more quickly.

 

Long term:

  1. Sustainability: Digital preservation requires ongoing funding and resources to be sustainable over the long term. This can be challenging to ensure, particularly for smaller organizations.
  2. Changing user expectations: As user expectations and technologies evolve over time, digital preservation strategies may need to be adapted to ensure that digital content remains accessible and usable.
  3. Legal and ethical issues: There are many legal and ethical issues surrounding digital preservation, such as privacy, copyright, and ownership, which may become more complex and challenging over time.

 

These are just a few examples of the challenges facing the field of digital preservation. Practitioners must continue to innovate and develop new approaches to address these challenges and ensure that digital content can be preserved and accessed over the long term.

OPF: You touched on Digital Obsolescence – can you tell me more about this. The different mitigation strategies and the possible solutions.

Digital obsolescence refers to the risk of digital content becoming inaccessible or unusable due to changes in hardware or software. As technology evolves, older formats and standards may become obsolete, making it difficult or impossible to access or use digital content that has been stored in these formats.

There are several mitigation strategies that can be used to address the risk of digital obsolescence, including:

  1. Migration: Migration involves converting digital content from one format to another to ensure that it remains accessible over time. For example, a digital archive may choose to migrate files from a proprietary format to an open standard to ensure that the files can be accessed using future software.
  2. Emulation: Emulation involves using software to create a virtual environment that mimics the hardware and software needed to access digital content in its original format. This allows digital content to be accessed and used as it was originally intended, even if the original hardware and software are no longer available.
  3. Bit-level preservation: Bit-level preservation involves preserving the digital content in its original format, without making any changes to the file. This can ensure that the content remains accessible and usable even if the original software and hardware become obsolete. However, this approach requires ongoing maintenance to ensure that the digital content can be accessed using future technology.
  4. Diverse format preservation: Diverse format preservation involves storing digital content in multiple formats to ensure that it can be accessed and used using a variety of software and hardware. This approach can reduce the risk of obsolescence and ensure that the digital content remains accessible over time.
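Bit-level preservation, as listed above, is commonly backed by fixity checking: a checksum is recorded for each file and recomputed later to confirm that no bits have changed. The following is a minimal sketch of that idea, assuming GNU coreutils and findutils (sha256sum, sort -z, xargs -r) are available. The function names write_manifest and verify_manifest are hypothetical and do not come from the conversation.

```shell
#!/bin/bash

# Hypothetical helper: record a SHA-256 checksum for every file under a folder.
# The manifest should live outside the folder so it does not list itself.
write_manifest() {
    local folder="$1" manifest="$2"
    # Run inside the folder so the manifest stores relative paths.
    ( cd "$folder" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest"
}

# Hypothetical helper: recompute the checksums and compare them with the manifest.
# Returns 0 when every listed file still matches, non-zero otherwise.
verify_manifest() {
    local folder="$1" manifest="$2"
    # The manifest is fed on standard input so a relative manifest path
    # still works after changing directory.
    ( cd "$folder" && sha256sum --check --quiet ) < "$manifest"
}
```

A typical use would be to call write_manifest once at ingest and verify_manifest on a schedule, treating any failure as a signal to restore the file from another copy.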

 

There are also several possible solutions to the problem of digital obsolescence, including the development of open standards and the use of community-based preservation efforts. By working together to develop open standards and best practices, and by sharing knowledge and resources, digital preservation practitioners can help to ensure that digital content remains accessible and usable over time.

DALL·E “Machine learning and AI helping make a brighter future”

OPF: What role do you think artificial intelligence or machine learning will play in the future of Digital Preservation?

Artificial intelligence (AI) and machine learning (ML) could play a significant role in the future of digital preservation. Here are some potential ways in which AI and ML could be used:

  1. Automatic metadata extraction: AI and ML algorithms can be used to automatically extract metadata from digital content, which can make it easier to manage and preserve this content.
  2. Automatic format identification: AI and ML algorithms can be used to automatically identify the format of digital content, which can help digital preservation practitioners to choose the most appropriate preservation strategy.
  3. Image and audio analysis: AI and ML algorithms can be used to analyze images and audio files to identify content, such as faces, locations, and objects. This can make it easier to identify and manage digital content and ensure that it is preserved in a way that is appropriate for its content.
  4. Automatic quality control: AI and ML algorithms can be used to automatically detect errors and inconsistencies in digital content, which can help to ensure that this content is preserved in a way that is accurate and complete.
  5. Predictive analysis: AI and ML algorithms can be used to predict potential issues or problems with digital content based on historical data. This can help digital preservation practitioners to proactively address these issues and ensure that the content remains accessible and usable over time.
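For comparison with the format identification point in the list above, a common non-AI baseline is to match a file’s leading bytes against known signatures (“magic numbers”). The sketch below shows that rule-based approach, not a machine learning method. The function name identify_by_signature is hypothetical and the signature table is deliberately tiny.

```shell
#!/bin/bash

# Hypothetical helper: guess a file's format from its first four bytes.
identify_by_signature() {
    local file="$1" sig
    # Read the first four bytes and render them as a lowercase hex string.
    sig=$(head -c 4 "$file" | od -An -tx1 | tr -d ' \n')
    case "$sig" in
        25504446) echo "PDF" ;;            # "%PDF"
        89504e47) echo "PNG" ;;            # 0x89 "PNG"
        ffd8ff*)  echo "JPEG" ;;           # JPEG start-of-image marker
        504b0304) echo "ZIP container" ;;  # "PK" 0x03 0x04, also used by DOCX, ODT, EPUB
        *)        echo "unknown" ;;
    esac
}
```

Signature matching only says what a file claims to be at its first bytes; whether a learned model could improve on this for damaged or ambiguous files is the open question the answer above points to.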

 

While AI and ML offer many potential benefits for digital preservation, there are also potential challenges, such as ensuring the accuracy and transparency of these algorithms and addressing issues related to data privacy and security. However, with proper planning and oversight, AI and ML could play an important role in helping to ensure the long-term preservation and access of digital content.

Conclusion 

I found this conversation very interesting. The responses may not have blown me away, and perhaps I can craft my questions better in future, or probe more deeply into the reasoning behind some of the responses. But what surprised me most, and was perhaps the most interesting aspect, was how natural it was to chat with the bot. It was fun to chat with the old-school bots back in the day, but this new generation of bots is eerily close to passing the Turing Test[5]. Thankfully you can still find the limits, but presumably with each new model those limits will get pushed further and further back.

These bots have great potential for assisting in research. During the exercise I found myself going off on tangents, treating the bot like Google and just asking it something rather than opening a new browser tab. With Google itself you need to know the correct names and terminology for the things you search for, but ChatGPT can decipher what you want from a description and lead you to results you might not otherwise have found.

I found the script writing capability incredible. I have recently written some scripts for members as part of our Tech Clinic sessions that probably took 30 minutes to an hour to put together, but the language model did a great job in mere seconds, and I was then able to ask it to rewrite the script for other languages or operating systems. This kind of capability can be really beneficial for anyone who needs a small snippet of code, but I can see how it could be an issue in Academia. Just like with all new tools, we will all need to change and adapt how we learn and use these tools to get the most out of them. 

The latest version of ChatGPT 3.5 is out there now, and people are training it on their own documents so that it can become an ‘expert’ in their own information. The first uses will likely be for company help and support, embedded in the online chat tools on corporate sites. Trained on all of a company’s product manuals and FAQs, such bots should be quite capable of giving basic help and support to customers. If the same were applied to some of the large archival collections out there, it could be incredibly useful for researchers searching for obscure documents or manuscripts not often referenced in any other way. Perhaps these tools can bring new insights and ideas to light.

I had to edit down the Q&A from multiple sessions to get the material for this blog, but if people are interested then perhaps I can somehow make it all available. Since putting this document together more questions and thoughts have come to mind so I may perhaps do a follow up to this piece in the future. If you have any questions you’d like me to ask, please post them below!

Here are some of mine:

  • How do you think that AI will impact the digital preservation community?
  • Can AI replace human intervention in digital preservation processes?
  • How can AI help archivists in managing born-digital content?
  • What is the best file format for long term preservation? Explain your reasoning.

(15/03/2023: This blog was edited for clarity and readability)

1 Comment

  1. Richard
    March 31, 2023 @ 12:34 PM Europe/Berlin

    A really interesting post, Darren!

    I checked ChatGPT’s recommended readings list: most are correct but the article “Towards a Theory of Digital Preservation” is by Reagan Moore not Seamus Ross. Illustrates the “hallucination” issue with ChatGPT: it can confidently give truthy but inaccurate answers.
