Scenario for discussion: Text files.

I would like to pose a scenario for your comment:

Description 

A large set of files, ~5,000.

Created between ~1993 and ~1997

Creation software unknown

Given extension .ASC

PRONOM PUIDs:

x-fmt/22 (7-bit ASCII Text) and x-fmt/283 (8-bit ASCII Text). DROID matches both by extension only, as above.

JHOVE: ASCII-hul (Status: Well-Formed and valid)

Visual inspection confirms that these files are ASCII text documents, with no BOM or other header/footer data. The characters seem to be limited to 7-bit ASCII, but a full check of the whole collection has not been made. This has to be undertaken manually, as there is no tool that will make the distinction in a ‘bulk’ mode.
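In the absence of an off-the-shelf tool, the 7-bit check could probably be scripted. The following is only a minimal sketch: the function name `find_non_7bit` and the directory layout are assumptions for illustration, and it treats any byte of 0x80 or above as evidence that a file is not pure 7-bit ASCII.

```python
from pathlib import Path


def find_non_7bit(root, suffix=".asc"):
    """Map each matching file that is not pure 7-bit to the offset of its first high byte."""
    hits = {}
    # Materialise the listing first so the walk order is stable.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() != suffix:
            continue
        data = path.read_bytes()
        for offset, byte in enumerate(data):
            if byte >= 0x80:  # outside the 7-bit ASCII range
                hits[path] = offset
                break
    return hits
```

An empty result for the whole collection would raise confidence that x-fmt/22 is the right assignment; any hits would identify the files that need a closer look.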

The filenames carry no discernible information other than an arbitrary text string and the extension.

Most of the inspected files have a contextual ‘header’ (a vaguely structured line of text) inside the document of the (approximately common) form:

example 1: ‘NEWZTEL NEWS: RNZ 1ZB  “LARRY WILLIAMS”      MONDAY 25 MARCH 1996’

example 2: ‘NEWZTEL NEWS: CAPITAL TV “NIGHTLY NEWS”          TUESDAY 19 MARCH 1996’

example 3: ‘NEWZTEL LOG:  RNZ 12:00 NOON NEWS            FRIDAY 17 MARCH 1995’

example 4: ‘NEWZTEL NEWS: CAPITAL TV “NIGHTLY NEWS”     WEDNESDAY 14 FEBRUARY 1996’
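Because the header line is only approximately regular, any parser for it has to be tolerant. The sketch below shows one way the four examples above might be broken into fields. The pattern, the field names (`kind`, `source`, `date`) and the assumption that the files use straight double quotes and English month names are mine, not something confirmed across the collection.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
     "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"], start=1)}

# Prefix, free-form source/programme text, then weekday, day, month, year.
HEADER_RE = re.compile(
    r"^NEWZTEL\s+(?P<kind>NEWS|LOG):\s+"
    r"(?P<source>.+?)\s+"
    r"(?P<weekday>[A-Z]+DAY)\s+"
    r"(?P<day>\d{1,2})\s+(?P<month>[A-Z]+)\s+(?P<year>\d{4})\s*$"
)


def parse_header(line):
    """Return a dict of header fields, or None if the line does not fit the pattern."""
    match = HEADER_RE.match(line.strip())
    if match is None or match.group("month") not in MONTHS:
        return None
    try:
        broadcast_date = date(int(match.group("year")),
                              MONTHS[match.group("month")],
                              int(match.group("day")))
    except ValueError:  # e.g. an impossible day such as 31 FEBRUARY
        return None
    return {
        "kind": match.group("kind"),
        # Collapse the runs of spaces used for column alignment.
        "source": " ".join(match.group("source").split()),
        "date": broadcast_date,
    }
```

Lines that return `None` would need manual review rather than being silently dropped, since the headers are described as only vaguely structured.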

Collection description:

Files are transcriptions of news broadcasts of the period.

The set is rich with useful search terms inside the files: Names, dates, places, themes etc.

The library describes the collection by a calendar month grouping (e.g. ‘Transcripts of news broadcasts from May 1995’). The library has not undertaken, and will not undertake, a ‘file-by-file’ description.

Technical details (Files with an .ASC extension)

There are two PUIDs that apply (as above). The only constraint appears to be either 7-bit or 8-bit ASCII character encoding. The files use the standard ASCII CR (carriage return) encoding of {0x0d} and LF (line feed, or new line) encoding of {0x0a}. There is what appears to be an EOF-type character {0x1a} at the end of the encoded text, which is followed by what appears to be zero-byte padding (in the form of {0x00}, assumed to repeat until the total file size reaches a specific multiple).

These encodings are interpreted correctly by every text viewer that was tested (MS Word, Notepad, Notepad++).
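The trailer layout described above (text, then an optional {0x1a}, then a run of {0x00}) can be checked programmatically. This is a rough sketch that assumes the layout holds for every file; the function name `split_asc` is hypothetical.

```python
def split_asc(data: bytes):
    """Split raw file bytes into (text, has_eof_marker, padding_length)."""
    # Drop the trailing run of NUL bytes and remember how long it was.
    stripped = data.rstrip(b"\x00")
    padding_length = len(data) - len(stripped)
    # A single 0x1a directly before the padding is treated as the EOF marker.
    has_eof_marker = stripped.endswith(b"\x1a")
    text = stripped[:-1] if has_eof_marker else stripped
    return text, has_eof_marker, padding_length
```

Running this over the collection and comparing total file sizes would show whether the padding really does bring each file up to a common multiple, which is so far only an assumption.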

Ingest options

1) Ingest as is. Create a rule that will associate all files with the extension .ASC with x-fmt/22, and assume that all files are 7-bit ASCII (confidence in this assertion is as yet unknown).

2) Change the extension of all .ASC files to .txt. Ingest as x-fmt/111.

Justifications 

1) (a) These files came in as .ASC files and should be ingested as such. Any modifications required in the future should be undertaken through the creation of a modified master, with a new representation ‘layer’ added to the IE.

(b) There is a matching and suitable PUID.

(c) The files complete MDE without issue.

2) (a) ASC is not a widely adopted format. It’s a legacy format identifier that simply indicates the file contains ASCII text.

(b) Long term, the value of these objects lies in making them accessible, both to a human reader and to a systematic parser/indexer.

(c) External tools are unlikely to support ASC as a format type (where a text format type is specified). By changing the extension to .txt, this potential bottleneck is removed. Objects are delivered to viewers in a widely accepted format that (generally) will be natively rendered on most platforms. Objects are delivered to agents in a widely accepted format.

(d) It is expected that free-text indexers or other content crawlers will be used at some point to extract context and search terms from this collection. If this is not undertaken, the collection’s true value is unlikely to be realised by the Library, given the limited description that is available. This includes harvesting the title, dates, locations, names and other such useful terms, and making an index of these granular expressions available to researchers.

(e) Accepting the above, it would be more efficient to change the files once at ingest (recording the changes as per policy), negating the need to revisit the objects in the future.
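To make option 2 concrete, the rename and the record of changes could be done in one pass. The sketch below is illustrative only: the function name `rename_asc_to_txt`, the CSV log layout and the choice of SHA-256 are assumptions, not a statement of the Library's policy or tooling. Only the extension changes; the file content is left untouched, and the checksum is logged so that this can be verified later.

```python
import csv
import hashlib
from pathlib import Path


def rename_asc_to_txt(root, log_path):
    """Rename *.ASC files under root to *.txt and log (original, renamed, sha256)."""
    rows = []
    # sorted() materialises the listing before any file is renamed.
    for old in sorted(Path(root).rglob("*")):
        if not old.is_file() or old.suffix.lower() != ".asc":
            continue
        new = old.with_suffix(".txt")
        if new.exists():
            continue  # never overwrite an existing file; leave for manual review
        digest = hashlib.sha256(old.read_bytes()).hexdigest()
        old.rename(new)
        rows.append((str(old), str(new), digest))
    with open(log_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["original", "renamed", "sha256"])
        writer.writerows(rows)
    return rows
```

Working on a copy of the accession rather than the received files would keep option 1 open if the decision were later reversed.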

 

Thoughts? Questions? Comments?

By Jay Gattuso, posted in Jay Gattuso's Blog

16th Aug 2011  10:47 PM
