The Encoding of Avestan - Problems and Solutions

‘Avestan’ is the name of the ritual language of Zoroastrianism, which was the state religion of the Iranian empire in Achaemenid, Arsacid and Sasanid times, covering a time span of more than 1200 years. It is named after the ‘Avesta’, i.e., the collection of holy scriptures that form the basis of the religion, which was allegedly founded by Zarathushtra, also known as Zoroaster, around the beginning of the first millennium B.C. Together with Vedic Sanskrit, Avestan represents one of the most archaic witnesses of the Indo-Iranian branch of the Indo-European languages, which makes it especially interesting for historical-comparative linguistics. This is why the texts of the Avesta were among the first objects of electronic corpus building undertaken in the framework of Indo-European studies, leading to the establishment of the TITUS database (‘Thesaurus indogermanischer Text- und Sprachmaterialien’).² Today, the complete Avestan corpus is available, together with elaborate search functions and an extended version of the subcorpus of the so-called ‘Yasna’, which covers a great deal of the attested variant readings.

Right from the beginning of their computational work on the Avesta, the compilers had to cope with the fact that its texts have been transmitted in a special script written from right to left, which was also used for printing them in the scholarly editions still in use today. It goes without saying that in the middle of the 1980s there was no way to encode the Avestan scriptures exactly as they are found in the manuscripts. Instead, we had to rely upon transcriptional devices dictated by the restrictions of character encoding on the computer systems used. As the problems we had to face in this respect and the solutions we could apply are typical of the development of computational work on ancient languages, it seems worthwhile to sketch them out here.


1 The Avestan script and its transcription

1.1 Early western approaches to the Avestan script and its transcription
The Avestan script has been known to western scholarship since the 17th century, when the first accounts of the religion of the 'Parsees', i.e., Zoroastrians living in India and Iran, were published. The first notable description of the script is found in the travel report by JEAN CHARDIN, who sojourned in Iran in 1673-7; in the 1711 edition of his report,⁷ the author provides an 'alphabet of the ancient Persians', together with a lithographed table contrasting the characters of the Avestan script with their Perso-Arabian equivalents;⁸ cf. the extract illustrated in Fig. 1.⁹

A much more interesting account than that of CHARDIN,¹⁰ who took the Persian letters to be 'small' variants of the 'big' Avestan ones,¹¹ is that of THOMAS HYDE, who in his 'History of the Religion of the Ancient Persians, Parthians and Medes' of 1700 provided the first specimens of words written in Avestan characters along with a Latin transcription. The words are not in Avestan (or 'Zend'), however, but 'in the Pahlavi language, which is verily Persian' (cf. Fig. 2);¹² as a matter of fact, what we have here is a list of words in 'Pazend', i.e. Middle Persian (Pahlavi) written in Avestan characters.¹³ To the (posthumous) second edition (of 1760) of HYDE's work, the editor (G. COSTARD) added a comprehensive table displaying the 'Letters used in the books in Zend and Pazend, according to the copies by DR. HYDE, together with the Zend ligatures and abbreviations' in toto, together with a detailed explanation of their values in Latin script (cf. Fig. 3).¹⁴

An even more detailed account of 'Zend' and 'Pehlvi' characters was published by (ABRAHAM-HYACINTHE) ANQUETIL-DUPERRON in his comprehensive treatise on the 'Zend-Avesta' in 1771 (cf. Fig. 4).¹⁵ ANQUETIL's description, which was derived from two manuscripts of the Bibliothèque Nationale in Paris, is generally regarded as the beginning of modern Avestology. The transcription he used was clearly based upon French orthography. In a similar way, IGNACY PIETRASZEWSKI in 1858 explicitly applied Polish rules to his transcripts (cf. Fig. 5 and Fig. 6).¹⁶

The different approaches to a Romanization of the Avestan script had reached a preliminary end by the beginning of the 20th century, when CHRISTIAN BARTHOLOMAE, first in his account of the Avestan and Old Persian languages in W. GEIGER's and E. KUHN's 'Outline of Iranian Philology' (1895-1901) and then in his 'Old Iranian Dictionary' (1904), proposed a transcription system that was based upon the choice of original characters used in K. F. GELDNER's edition of the Avesta (cf. Fig. 7 to Fig. 9). Due to the importance of the dictionary, which has remained the standard reference work of Avestan lexicography until the present day, BARTHOLOMAE's transcription system was used for many years to come.

The 'Hoffmann system'
On the basis of a thorough reconsideration of the character inventory and its linguistic background, BARTHOLOMAE's system was challenged to a certain extent by KARL HOFFMANN (cf. Fig. 10).¹⁷ It is HOFFMANN's merit to have clarified the function and mutual relationship of the three characters numbered 42-44, all transcribed by plain š in BARTHOLOMAE's works, as well as several other letters. Table 1 illustrates the peculiarities of the system thus achieved, which was the first to be strictly transliterative in the sense that all characters (rather: graphemes) of the original script are rendered by one unique Latin symbol, in contrast to the 'mixed' systems of former authors.¹⁸

2 Encoding Avestan

2.1 A 7-bit rendering
In the middle of the 1980s, when the project of digitizing the Avestan corpus was initiated,²¹ there was no point in trying to encode the texts in the original script, given that the line-based desktop computer available for the project could not be programmed for non-Latin scripts.²² The same holds true for several special characters used in K. HOFFMANN's transliteration system. As a matter of fact, the character inventory usable for the given task consisted of nothing but the items pertaining to the plain ASCII standard,²³ plus a few extra characters necessary for the encoding of German and Scandinavian languages, all stored in the 7-bit range of character encoding (code values of 0 to 127; cf. Table 2, showing the character set of the computer used, with the German non-ASCII characters printed on a shaded background). To maintain the principle of a unique one-to-one rendering of (transliterated) Avestan characters, the existing inventory had to be applied with awkward-seeming but 'natural' equivalences such as $ = š, ö = ə, or Z = ž.

The transliteration system thus arrived at differed greatly from that of comparable digitization projects such as that of the R̥gveda Saṃhitā,²⁴ the most ancient text collection of Vedic Sanskrit, which made ample use of digraphical and trigraphical combinations of ASCII characters (cf. the example in Table 3).²⁵ The advantage of the 'clumsy' one-to-one encoding of (transcriptional) Avestan simply consisted in the fact that it could easily be converted into any other code, without any further consideration of the length of coherent character sequences. In addition, we may state that the inventory necessary for rendering Vedic Sanskrit was much larger than that required for Avestan (because of the great number of accented vowels it has to cover), and a 7-bit-based one-to-one rendering system (which cannot provide code points for more than ca. 120 characters) would not have been applicable to it.²⁶

Another reason to stick to a one-byte representation lay in the fact that the amount of disk space available was extremely limited when the Avesta project was started; there was no hard disk available yet, and the ca. 1.2 million characters of the text collection were just what the two floppy disks manageable by the system could store (in a database application that had to be programmed especially for this task).
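The convertibility argument can be illustrated with a minimal Python sketch. It assumes nothing beyond the three equivalences quoted above ($ = š, ö = ə, Z = ž); the full project table is not reproduced here, the function names `decode_7bit` and `encode_7bit` are hypothetical, and the sample strings are merely illustrative.

```python
# Partial table: only these three pairs are attested in the text above.
# A complete table would need one entry per transliteration character.
SEVEN_BIT_TO_TRANSLIT = {
    "$": "\u0161",  # š
    "\u00f6": "\u0259",  # ö (from the German 7-bit set) stands for ə
    "Z": "\u017e",  # ž
}

# Because every symbol maps to exactly one symbol, both directions are
# plain character-for-character translation tables.
TO_TRANSLIT = str.maketrans(SEVEN_BIT_TO_TRANSLIT)
TO_SEVEN_BIT = str.maketrans({v: k for k, v in SEVEN_BIT_TO_TRANSLIT.items()})


def decode_7bit(text: str) -> str:
    """Convert the 7-bit working encoding into transliteration symbols."""
    return text.translate(TO_TRANSLIT)


def encode_7bit(text: str) -> str:
    """Convert transliteration symbols back into the 7-bit working encoding."""
    return text.translate(TO_SEVEN_BIT)


if __name__ == "__main__":
    for raw in ("ma$iia", "aZi"):  # illustrative strings only
        shown = decode_7bit(raw)
        # Length is preserved and the conversion is reversible.
        print(raw, "->", shown, len(raw) == len(shown), encode_7bit(shown) == raw)
```

A scheme based on digraphs or trigraphs would instead require a tokenizer that decides where one multi-character unit ends and the next begins, which is the extra consideration the one-to-one rendering avoids.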

An 8-bit rendering
After the move to an IBM-DOS-based system in 1986, the transcriptional data could for the first time be visualized both on a printer and on the screen. Equipped with a programmable EGA²⁷ graphics card and a 70 MB hard disk, the IBM-compatible PC used then was much better suited to the task of completing the electronic corpus of Avestan. Software for entering larger specimens of non-conventional text in a structured way was also available by then: even though it was still line-based, WordPerfect 4.1 was an excellent basis for this task, as it enabled the user to check his or her input in a "Reveal Codes" screen and provided an interface for rendering the special transcription characters correctly even on a laser printer. For the encoding of Avestan (and other ancient Indo-European languages) in WP 4.1, a special font could thus be designed for both screen and printer representation; unlike the 7-bit font used before, this was 8-bit based, with all "special" (non-ASCII) characters stored in the "upper" character range (code values extending from 128 to 255, cf. Table 4).

The first 16-bit encoding scheme
In 1988, the project proceeded one step further by applying the first 16-bit character encoding available for PCs. With WordPerfect 5.0, the user had at hand a total of 1632 uniquely encodable characters, among them Greek, Cyrillic, Hebrew, and Japanese (hiragana and katakana) sets, but also a large set of Latin characters with diacritics that were not covered by 7-bit ASCII or its 'western' 8-bit successor, the ANSI standard.²⁸ For the encoding of the 'idiosyncratic' transcription of Avestan, however, even this character 'block' (cf. Table 5) was not sufficient; instead, the project had to rely upon the extra block of 'user definable' entities, which comprised up to 255 additional items, and characters such as ẏ, ą̇ or ə̄ had to be assigned code points in that range (cf. Table 6).

Rendering the original script
The next version of WordPerfect, 5.1, could even be programmed to display and handle the Avestan original script with its right-to-left directionality. The prerequisite for this was the installation of either the Hebrew or the Arabic language package, both of which added the necessary functionality for switching between bidirectional text passages. For Avestan, however, the packages offered no ready-made code space; instead, the Avestan characters had to be mapped onto one of the character sets of either Hebrew (block 9) or Arabic (blocks 13 and 14). As the latter was designed to adapt letters automatically to their left and right environment (a feature not relevant to Avestan), the Hebrew block was much better suited for this purpose. The resulting assignment is illustrated in Table 7; for lack of demand, it was never applied to the rendering of the corpus.

Towards unique encoding: Unicode
With the introduction of the World Wide Web in about 1994, it became necessary to provide a unique encoding scheme for the Avestan texts that was not restricted to proprietary formats. As none of the code pages that were usable in WWW applications then covered the special characters used in the transcription of Avestan, let alone the original Avestan script, the project had to rely upon the emerging 'Unicode' standard right from the beginning, even though there was practically no support for it when the first specimens of the corpus were put online on the server of the TITUS project in 1996. The first sample page, which is still available today (cf. http://titus.uni-frankfurt.de/unicode/samples/homyast.htm), displays in Roman transcription a part of the so-called 'Hōm-Yašt' (i.e. Yasna 9,1-11,8) together with its Middle Persian ('Pahlavī') and Sanskrit translations and liturgical prescriptions in Pāzend (i.e. Middle Persian written in Avestan script). The page clearly indicates what was encodable and retrievable in the early years of Unicode: many characters could not be visualized because they (or their elements, diacritics or basic characters) were not covered by standard fonts, or they had to be left open as there were no code points available for them yet.

Meanwhile, 15 years after these first attempts, Unicode has become the most widely used encoding standard on the Web, and there is no longer any problem in encoding, retrieving and displaying the transcriptional data of the Hōm-Yašt or any other Avestan text (cf. the online edition at http://titus.uni-frankfurt.de/texte/etcs/iran/airan/avesta/avest010.htm#Avest._Y_9). It is true that many of the combinations of characters with diacritics that are used in the transcription (e.g., ą, t̰, or x) cannot be encoded as such, i.e., as 'precomposed characters', because there are no code points available for them; instead they have to be encoded as sequences of basic characters and diacritics,²⁹ and it took quite some time until 'system fonts' and 'rendering engines' were able to display combinations of this kind in an acceptable way.
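The difference between precomposed characters and sequences of a basic character plus diacritics can be checked with Python's standard `unicodedata` module. The following is only a sketch: the helper name `nfc_length` is hypothetical, and the sample characters are taken from the transcription symbols mentioned in this paper. Canonical composition (NFC) merges a sequence into a single code point wherever Unicode provides one, so a result longer than one code point indicates that no precomposed character exists.

```python
import unicodedata

CARON = "\u030C"        # combining caron, as in š
DOT_ABOVE = "\u0307"    # combining dot above, as in ẏ
MACRON = "\u0304"       # combining macron, as in ə̄
TILDE_BELOW = "\u0330"  # combining tilde below, as in t̰


def nfc_length(base: str, *diacritics: str) -> int:
    """Number of code points remaining after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", base + "".join(diacritics)))


if __name__ == "__main__":
    print(nfc_length("s", CARON))        # 1: precomposed š (U+0161) exists
    print(nfc_length("y", DOT_ABOVE))    # 1: precomposed ẏ (U+1E8F) exists
    print(nfc_length("t", TILDE_BELOW))  # 2: stays a sequence
    print(nfc_length("\u0259", MACRON))  # 2: ə + macron stays a sequence
```

Search and comparison routines working on such data would normally normalize both the query and the text to the same form first, so that sequences and precomposed characters are treated alike.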
As to the original script, the development of a corresponding code block within Unicode took even longer, and only with the publication of Unicode version 5.2.0 in October 2009 was this goal achieved.³⁰ The Avestan block, consisting of code points U+10B00 through U+10B3F,³¹ now enables us for the first time to encode the complete text of the Avesta in the original script in a standardized way. However, for lack of standard fonts that comply with this standard, it will take some more time to make this encoding readily available to the public.
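As a brief illustration, the block can be inspected with Python's standard `unicodedata` module, provided the Python build ships Unicode data of version 5.2 or later. This is a sketch only; the helper names `in_avestan_block` and `describe` are hypothetical.

```python
import unicodedata

# Range of the Avestan block as stated in the text.
AVESTAN_FIRST, AVESTAN_LAST = 0x10B00, 0x10B3F


def in_avestan_block(ch: str) -> bool:
    """True if the character lies within U+10B00..U+10B3F."""
    return AVESTAN_FIRST <= ord(ch) <= AVESTAN_LAST


def describe(ch: str) -> str:
    """Code point, character name and bidirectional class of one character."""
    name = unicodedata.name(ch, "<unassigned>")
    bidi = unicodedata.bidirectional(ch) or "-"
    return f"U+{ord(ch):04X} {name} (bidi class {bidi})"


if __name__ == "__main__":
    first = chr(AVESTAN_FIRST)
    print(describe(first))
    # The block lies outside the Basic Multilingual Plane, so each letter
    # needs four bytes in UTF-8 and a surrogate pair (four bytes) in UTF-16.
    print(len(first.encode("utf-8")), len(first.encode("utf-16-le")))
```

The bidirectional class 'R' reported for the letters is what allows rendering engines to lay out Avestan text from right to left without any additional markup.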
Still, we have to admit that the encoding provided by the new Unicode block is not exhaustive, given that there is still a small set of letters that have not been assigned a code point. The reason is that these five letters have been regarded as mere glyph variants of other characters (namely of ą, δ, ń, v, and h), mostly in accordance with traditional usage, which did not provide separate transcriptions for them. However, this decision brings about a dilemma, and not only with respect to displaying them in a scholarly context such as the present paper (as a matter of fact, the five letters in question are represented by images here): if we wanted to challenge the assumption that they are functionally identical with their encodable 'partners', we would have to check whether they only occur in distinct environments and never side by side with them within one and the same manuscript. But to check this thoroughly, we would have to encode the texts of all manuscripts accordingly, which we cannot do, as there are no code points available for these letters. It is true that provisional code points could be provided in the 'Private Use Area' of Unicode;³² however, the use of code points from this area may still lead to problems depending on the systems and software used, and it is therefore not advisable. For the time being, I suggest solving the problem via transliteration, by assigning special (adapted) transliteration symbols to the five letters in question.
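Why Private Use Area code points are problematic for interchange can be made concrete with a small Python check. This is a sketch under stated assumptions: the function name `private_use_chars` is hypothetical, and U+E000 merely stands in for an imagined provisional assignment; no such assignment is proposed in this paper.

```python
import unicodedata


def private_use_chars(text: str) -> list[tuple[int, str]]:
    """Positions and code points of all private-use characters in text.

    Private-use code points carry the general category 'Co'; their
    meaning is defined only by private agreement, not by the standard.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


if __name__ == "__main__":
    sample = "a\uE000i"  # U+E000: hypothetical provisional code point
    print(private_use_chars(sample))  # [(1, 'U+E000')]
```

A check of this kind could serve to flag texts that depend on private agreements before they are exchanged or published, since the standard itself says nothing about which letter such a code point denotes.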

Summary
The different approaches to the encoding of the Avestan script, first in transliteration and later in the original form, are summarized in Table 8 below. The table also includes the five letters for which no Unicode code points are available, together with a proposal for their transliteration. The combinations transliterated as ii and uu are not included, as they have been encoded right from the beginning as sequences of the two characters they contain.

Table 1: Transcription systems for Avestan

Table 2: 7-bit character set applied in 1985

Table 4: 8-bit font used for the encoding of Avestan and other ancient Indo-European languages (1986-1989)

For the screen rendering, which was still line-based, WP 5.0 provided a sophisticated solution to extend the 8-bit-based character set of the graphics cards used in PCs to 512 characters, and this could be programmed to display the extra characters of Avestan transcription.

Table 5: The 'Latin Extended' Block of WP 5.0

Table 6: Assignment of the 'User definable' Block of WP 5.0

Table 7: Avestan characters mapped onto the Hebrew character set of WP 5.1