Enabling Access to Resources in Non-Latin Scripts
Somewhere between the realm of the symbolic and the Lacanian Real, there exists something akin to Michael Binkley’s Anxiety Closet of resurgent Bloom County fame: the characters living there may all have identifiable names, but they tend to remain hidden, emerging to torment us only when we are otherwise most at peace. So it is with romanization. Resources that need to be described are identifiable as distinct from one another—whether in both their original script and transliterated forms, or only through the romanized version of their metadata—but they still have a tendency to keep us awake at night.
Representing non-Latin scripts in the Latin alphabet often requires special characters or diacritics to keep everything well accounted for and, in many cases, reversible to the original script. Many of the romanization tables used by the Library of Congress and the ALA have, as a design goal, a one-to-one correspondence between characters on each side of the table, though this goal is not always achievable. For most scripts, the conversion process readily lends itself to automation, but a few (e.g., Japanese kanji and Malayalam) defy automation and require dictionary lookup or more sophisticated algorithms to process.
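For scripts where conversion really is a character-by-character substitution, the automation described above can be as simple as a lookup table. The sketch below uses a handful of Russian Cyrillic entries loosely modeled on ALA-LC values, with diacritics omitted; the table and function names are illustrative assumptions, not drawn from any actual cataloging implementation.

```python
# Sketch of table-driven transliteration: each source character maps
# to a Latin equivalent. This tiny excerpt is illustrative only and
# omits the diacritics a real table needs for full reversibility.
TABLE = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "и": "i",
    "ж": "zh", "ч": "ch", "ш": "sh", "щ": "shch",
}

def romanize(text: str, table: dict[str, str]) -> str:
    # One-to-one substitution; characters without an entry
    # pass through unchanged.
    return "".join(table.get(ch, ch) for ch in text)

print(romanize("щи", TABLE))  # shchi
```

Note that a single Cyrillic letter can expand to several Latin letters (щ → shch), which is exactly why reversing romanized data without diacritics or markers is hard: "shch" could, in principle, be parsed more than one way.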
From time to time, romanization tables come under review for different reasons. Sometimes user expectations have drifted away from a traditional transliteration method, and sometimes a newer method receives government approval, ISO acceptance, or lends itself to automation—though rarely all three. In the case of the recently approved revision to Tibetan, the motivating factors were meeting user expectations with a method that is also easier to automate. Batch re-conversion of older records is possible in principle, but prohibitively difficult with current technology—a consideration raised in the debate over whether to revise the romanization table.
Including the original script in metadata has been getting easier, especially since about 2006. OCLC has done its part in expanding coverage beyond the traditionally supported JACKPHY scripts (Japanese, Arabic, Chinese, Korean, Persian, Hebrew, and Yiddish). It announced support for Bengali, Devanagari, Tamil, and Thai in 2006 and has since expanded to cover Syriac, Armenian, and Ethiopic. More initiatives are under way that address, for example, Georgian (British Library), Gujarati (D. K. Agencies), and Lao (National Library of Australia and the National Diet Library in Japan). Cases like extended Latin (a.k.a. the International Phonetic Alphabet), extended Arabic (a.k.a. Ajami), Cyrillic, and Greek are also much closer to being ready for wider acceptance in the metadata ecosystem.
Direct support for Tibetan and Mongolian scripts received priority consideration from respondents to a Script Priority Survey carried out in 2013 under the auspices of the ALA’s Committee on Cataloging: Asian and African Materials, which I chair. While Tibetan should be relatively straightforward now that a new romanization table is in place, Mongolian has complex text rendering and layout requirements that may put its direct support on a slower track until features like vertical text layout become more widely supportable.
Even so, there remain anxieties in Binkley’s closet waiting to be provoked; we will leave those for a future discussion.