Welcome back to Nooks & Crannies! After a month off for my wedding, I’ve been digging around for some interesting bits for upcoming columns. This month, I’ll take a look at some open source code libraries that developers can use to handle MARC-formatted records.
A little background for the MARC novice
MARC stands for MAchine Readable Cataloging records. It’s a format first developed in the 1960s for the U.S. Library of Congress in order to facilitate the exchange of bibliographic records among libraries. By the mid-1970s, it was an international standard, used around the world.
There are several variants of the MARC format. MARC21 came out of a merger in the 1990s of USMARC and CANMARC, the US and Canadian variants then in use, and other countries have their own formats. In much of Europe, UNIMARC is the variant most often seen. All of these variants share the same record structure: a leader, a directory that tells which tags are in the record and where they are located, and the tags themselves, which contain the information.
Each tag, in each format, means something specific. For instance, in MARC21 bibliographic format, the 245 tag holds information about the title of the work. Additional information, including the publisher, author, size of the physical book, publication date, and subjects, is contained in other tags.
The format of the record, if you were to just print it out, is kind of hard to read. It was originally designed for serial interchange, via 9-track tape, and that medium was still in use in the early days of my career, in the 1990s. The first five bytes of the record are digits and tell you how long the record is, in bytes, including those five bytes. The clever modern nerd will instantly perceive the limitation of this structure: a record can never be longer than 99,999 bytes. Those five bytes open a fixed 24-byte leader. Following the leader is the directory of tags, telling what tags to look for, how long each one is, and at which byte each tag starts. After that comes the tag data, and the next byte after that is the first byte of the next record. The leader/directory/tag structure is generically defined in ISO 2709; MARC21 and UNIMARC are the formats that define the meanings of the tags.
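To make that layout concrete, here is a minimal sketch in plain Java, with no MARC library involved, that reads the record length, the base address of the data, and the directory out of a raw record. The class name MarcLayout, its method names, and the tiny hand-built sample record are all made up for illustration. The offsets assume the usual MARC21 layout: a 24-byte leader, then 12-byte directory entries made of a 3-character tag, a 4-digit field length, and a 5-digit starting position.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of any MARC library.
public class MarcLayout {
    public static final int LEADER_LENGTH = 24;          // fixed-size leader
    public static final int ENTRY_LENGTH = 12;           // tag(3) + length(4) + start(5)
    public static final char SUBFIELD_DELIMITER = '\u001F';
    public static final char FIELD_TERMINATOR = '\u001E';
    public static final char RECORD_TERMINATOR = '\u001D';

    /** One directory entry: which tag, how long, and where it starts. */
    public static final class Entry {
        public final String tag;
        public final int length;
        public final int start;   // offset from the base address of the data

        Entry(String tag, int length, int start) {
            this.tag = tag;
            this.length = length;
            this.start = start;
        }
    }

    private static int digits(byte[] record, int from, int count) {
        return Integer.parseInt(new String(record, from, count, StandardCharsets.US_ASCII));
    }

    /** Leader bytes 0-4: total record length, including these five bytes. */
    public static int recordLength(byte[] record) {
        return digits(record, 0, 5);
    }

    /** Leader bytes 12-16: the byte at which the tag data begins. */
    public static int baseAddress(byte[] record) {
        return digits(record, 12, 5);
    }

    /** Walks the directory, which runs from the end of the leader to a field terminator. */
    public static List<Entry> directory(byte[] record) {
        List<Entry> entries = new ArrayList<>();
        int pos = LEADER_LENGTH;
        while (record[pos] != FIELD_TERMINATOR) {
            String tag = new String(record, pos, 3, StandardCharsets.US_ASCII);
            entries.add(new Entry(tag, digits(record, pos + 3, 4), digits(record, pos + 7, 5)));
            pos += ENTRY_LENGTH;
        }
        return entries;
    }

    /** Returns a field's raw contents, minus its trailing field terminator. */
    public static String fieldData(byte[] record, Entry entry) {
        int from = baseAddress(record) + entry.start;
        return new String(record, from, entry.length - 1, StandardCharsets.UTF_8);
    }

    /** Builds a made-up two-field record (001 and 245) to experiment with. */
    public static byte[] sampleRecord() {
        String f001 = "12345" + FIELD_TERMINATOR;
        String f245 = "10" + SUBFIELD_DELIMITER + "aJane Eyre" + FIELD_TERMINATOR;
        String directory = String.format("001%04d%05d", f001.length(), 0)
                + String.format("245%04d%05d", f245.length(), f001.length())
                + FIELD_TERMINATOR;
        int base = LEADER_LENGTH + directory.length();
        int total = base + f001.length() + f245.length() + 1; // +1 for record terminator
        String leader = String.format("%05dnam a22%05d   4500", total, base);
        return (leader + directory + f001 + f245 + RECORD_TERMINATOR)
                .getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] record = sampleRecord();
        System.out.println("Record length: " + recordLength(record));
        for (Entry e : directory(record)) {
            System.out.println(e.tag + " length=" + e.length + " start=" + e.start);
        }
    }
}
```

Even this toy version has to juggle fixed offsets, zero-padded numbers, and control characters, and it does no error checking at all. That bookkeeping is exactly what a library takes off your hands.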
Yes, it’s a poorly designed format by modern standards. Yes, it needs updating, in the worst way, but that’s the subject of another article altogether. In this article, I’ll show you three code libraries that you can use to manipulate MARC records without having to know all the nitty-gritty of the arcane tag directory.
MARC4J allows the creation of an iterator to read an input stream such as a file, and do things with the MARC21 or UNIMARC records that it finds in the stream. There are record-writing tools, too, of course, and iterators for examining the records in detail. Here’s a quick example that will read in a file of records, and if the title of the work in field 245, subfield a, starts with the letter J, writes it to another file:
import org.marc4j.MarcReader; import org.marc4j.MarcStreamReader; import