Some International Dimensions of Metadata
Fellow Metaware contributor Carolyn Hansen opened up some interesting areas of discussion in her piece, “Standardizing Metadata for Digital Humanities.” She examines issues of context and ethics relating to Digital Humanities, where standards from the Text Encoding Initiative (TEI) predominate. She notes the need for “more cross-community discussions at ALA and other conferences between librarians specializing in metadata and cataloging standards and experts in digital scholarship,” with some focused attention on non-technical issues in contexts outside of MARC-based cataloging. While my experience has been largely based in MARC rather than TEI, there are sets of issues common to both that are only starting to be adequately addressed. MARC has not yet left us, and initiatives like TEI, the Bibliographic Framework (BIBFRAME), and others are still evolving. Negotiating between the communities these standards serve remains important.
One area that deserves some attention in this space is internationalization, which for folks who specialize in the field is thought of as an architecture, or “essential part of initial software design”, rather than a feature (Deitsch & Czarnecki, 2001). For metadata practitioners interested in international aspects, some of the most relevant pieces of the architecture are those that can be extracted from a MARC record or other formats to feed into a structured language tag following BCP 47: the language value, the country, and any data available on the scripts used in the record or the resource. As we examine these factors in turn, we can also consider a few points about the conditions under which this kind of metadata is created, the context in which it is found, and to what end we can expect a language tag to be applied.
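Those locale-relevant values live in fixed positions of the MARC 008 control field: positions 15-17 hold the country of publication and positions 35-37 the language code. A minimal sketch of the extraction, using a fabricated 008 value for illustration:

```python
# Pull the locale-relevant values out of a MARC bibliographic record's
# 008 fixed field. Positions 15-17 = country of publication,
# positions 35-37 = language code (ISO 639-2/B).

def extract_locale_values(field_008):
    """Return the country and language codes from a 40-character 008 value."""
    return {
        "country": field_008[15:18].strip(),
        "language": field_008[35:38].strip(),
    }

# Fabricated sample 008 for a 2007 US publication in English.
sample_008 = "070101s2007    xxu           000 0 eng d"
print(extract_locale_values(sample_008))
# {'country': 'xxu', 'language': 'eng'}
```

These two codes, plus any script information, are the raw material for assembling a BCP 47 tag downstream.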
The conditions for a library cataloger creating internationalized metadata are constrained at many levels. Catalogers have access to library cataloging software that is usually part of a proprietary Integrated Library System (ILS) such as Sierra, Alma, or Symphony, among others. Decisions regarding expanded support for a character repertoire within those environments are complex and can take several years of consultation with OCLC, the Library of Congress, the Program for Cooperative Cataloging, and other stakeholders. In 2007, a specification was added to MARC to allow for lossless conversion between the MARC-8 and UTF-8 character sets through the use of hexadecimal numeric character references (NCRs). I won’t go into more detail here, but referring to the specification may give you some sense of what the constraints have been for fully implementing Unicode in the MARC environment.
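The NCR technique can be sketched briefly: a character with no equivalent in the target repertoire is written out as a hexadecimal reference like `&#x0159;`, which can later be decoded back to the original character without loss. The repertoire check below is a deliberately simplified stand-in (ASCII only), not the real MARC-8 set:

```python
import re

def to_ncr_fallback(text, in_repertoire=lambda ch: ord(ch) < 128):
    """Replace characters outside the repertoire with hex NCRs (&#xXXXX;)."""
    out = []
    for ch in text:
        if in_repertoire(ch):
            out.append(ch)
        else:
            out.append("&#x%04X;" % ord(ch))
    return "".join(out)

def from_ncr(text):
    """Decode hex NCRs back to their original characters."""
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)

s = "Dvořák"
encoded = to_ncr_fallback(s)
print(encoded)               # Dvo&#x0159;&#x00E1;k
assert from_ncr(encoded) == s  # round trip is lossless
```

The point of the mechanism is exactly this round-trip guarantee: nothing is silently dropped when a record moves between character sets of different sizes.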
Contextually, there are also constraints on what language and country tags have been available to the cataloger. The 008 field in MARC is populated with language values from a different list (ISO 639-2/B) than the lists commonly used on the World Wide Web (ISO 639-1 or ISO 639-3). Country tags for the web are pulled from ISO 3166, but there is a separate MARC list of country codes used by libraries. While the language tags used in HTML might be familiar as part of a locale — en-us for English in the US, fr-fr for French in France, and zh-cn for Chinese in China — the equivalents in MARC coding may look more like this: eng-xxu, fre-fr, and chi-cc. Language tags do not always map cleanly, especially when it comes to more obscure special collections material, translations, or content with more than one language represented.
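A crosswalk between the two code systems can be sketched as below. The tables here are tiny illustrative excerpts, not the full Library of Congress mappings, and the fallback behavior shows one of the uneven cases: where no two-letter ISO 639-1 code exists (as with grc, Ancient Greek), BCP 47 permits the three-letter code to stand as the primary subtag.

```python
# Illustrative excerpts of the MARC-to-web crosswalks; real mapping
# tables are far larger than these stubs.
ISO639_2B_TO_1 = {"eng": "en", "fre": "fr", "chi": "zh", "ger": "de"}
MARC_COUNTRY_TO_ISO3166 = {"xxu": "US", "fr": "FR", "cc": "CN"}

def map_language(marc_code):
    # Fall back to the three-letter code when no ISO 639-1 equivalent
    # exists; BCP 47 allows three-letter primary language subtags.
    return ISO639_2B_TO_1.get(marc_code, marc_code)

def map_locale(marc_lang, marc_country):
    """Build a web-style language-region tag from MARC coded values."""
    lang = map_language(marc_lang)
    region = MARC_COUNTRY_TO_ISO3166.get(marc_country)
    return f"{lang}-{region}" if region else lang

print(map_locale("chi", "cc"))   # zh-CN
print(map_locale("eng", "xxu"))  # en-US
print(map_locale("grc", "gr"))   # grc (no 639-1 code, region not mapped)
```

Translations, multilingual content, and obscure codes all need similar case-by-case handling, which is why a purely mechanical crosswalk is rarely enough on its own.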
The reason for extracting metadata from a record in order to build a structured locale tag is to match search results more evenly with language and locale preferences as detected from the user’s web browser settings. Whether this would also have the effect of making a given user more identifiable to law enforcement is an open question, so appropriate privacy and minimization measures should be considered in light of continued efforts to collect metadata at scale for government surveillance projects. Data on any non-Latin scripts used in a record can be detected in some cases by querying the 066 field for content, but in most cases it would be more useful to rely on the Common Locale Data Repository (CLDR) library of exemplar characters for a language, and determine the script from its usage in the 880 fields.
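The script-detection step can be approximated by inspecting the Unicode code points of the vernacular text found in the 880 fields. This is a rough sketch only: a production tool would draw on fuller data such as the CLDR exemplar sets or the Unicode script property files, whereas the ranges below cover just a handful of scripts.

```python
# Map a few Unicode code point ranges to ISO 15924 script subtags.
# Deliberately incomplete; real script data is much richer.
SCRIPT_RANGES = [
    (0x0400, 0x04FF, "Cyrl"),   # Cyrillic
    (0x0590, 0x05FF, "Hebr"),   # Hebrew
    (0x0600, 0x06FF, "Arab"),   # Arabic
    (0x3040, 0x30FF, "Jpan"),   # Hiragana and Katakana
    (0x4E00, 0x9FFF, "Hani"),   # CJK unified ideographs
]

def detect_script(text):
    """Return the script subtag of the first non-Latin character found."""
    for ch in text:
        cp = ord(ch)
        for lo, hi, tag in SCRIPT_RANGES:
            if lo <= cp <= hi:
                return tag
    return "Latn"  # default when nothing non-Latin is present

print(detect_script("Война и мир"))  # Cyrl
print(detect_script("紅樓夢"))        # Hani
```

The resulting subtag slots between the language and region subtags of a BCP 47 tag, as in zh-Hans-CN.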
In general, MARC offers a fairly high degree of granularity with respect to the requirements for interoperable internationalized metadata; in many cases it offers a richness that deserves to be maintained. Unpacking its intricacies can pay off both for re-use and for delivering desired content to users across platforms.