LISA Newsletter XI/1.3

Intra-textual Inconsistency

Risks of Implementing Orthographies for Less-Prevalent Languages

Marilyn Mason & Jeffrey Allen, MIT2

This article follows up on two earlier articles that appeared in the LISA Newsletter in 2001 ("Closing the Digital Divide: issues in expanding localization efforts to minority languages" in X:2, and "Is there a Universal Creole for localization efforts?" in X:3).

The second of these articles argued that a universal Creole (or pan-Creole) is not feasible, because a number of linguistic and extra-linguistic factors make it necessary to localize for each Creole language. The factors that favor localization and discourage universalization were illustrated with example words from several written varieties of French Creole (St. Lucian French Creole, Dominican French Creole, Martinican Creole, Haitian Creole). The range of linguistic and sociolinguistic differences between these closely related languages demonstrates that establishing a single pan-Caribbean orthography for the French-based Creoles would be a difficult task. These are just a few examples of less-prevalent languages (sometimes also referred to as minority languages, sparse-data languages, low-density languages, etc.) that are the "emerging languages" in the international communication and technology arenas, in contrast to the world's established international "major" languages.



Non-technology standards

A significant obstacle for development efforts on authoring, translation and localization technologies for less-prevalent languages has been the underlying assumption that the written form of such languages is standardized to the same degree as that of the major international languages. The purpose of the present article is to show that standardization efforts for less-prevalent languages are not always fully implemented.

In most discussions about standardization in localization circles today, the issues tend to focus on database and import/export formats, on hardware and software compatibility, and on other scalability concerns. Yet one topic that is hardly discussed is spelling standardization, except perhaps for occasional debates on American versus British spelling variants. Why is spelling standardization hardly considered? Because the major languages seem to have already passed through the stages where spelling issues have considerable impact.

Yet, we are reminded that over 75% of the world's languages struggle with some of the most basic spelling standardization issues in their transition from being non-written oral languages to being languages with a written form. The path through the pre-literacy, early literacy and post-literacy stages indicates that spelling standardization cannot be assumed as an accomplishment for many languages.

Whereas many of the world's major languages underwent the standardization and normalization stages of their written form over a period of several centuries, today's less-prevalent languages are forced to undergo rapid standardization within one or two decades. In addition, standardization must take into account the impact and consequences of computer technologies from the beginning, because every decision on how to spell the language will have a direct effect on the eventual development of spell-checkers for it.

The example of Haitian Creole

The language 'standardization' process of Haitian Creole has taken place over a period of several decades and has mainly focused on 'orthographic standardization' issues. Sound-to-letter spelling (known as phoneme / grapheme) standardization is only the first step in written language standardization. The spelling issue of sound-to-letter correspondence was more or less resolved in the late 1970s and early 1980s with the creation of the Institut Pédagogique National (IPN) orthography (Bernard 1980) for Haitian Creole. Yet, for many decades "in Haiti, there have often been two or more competing orthographies in the same territory" (Baker 1997, p. 120).

Even more importantly, other researchers have shown that there have been 11 known proposed spelling systems for Haitian Creole (Schieffelin & Doucet 1992), not counting another dozen known hybrid spelling systems for this language. Of all the spelling systems that have been devised, whether they resulted from political or religious movements, the IPN orthography has been declared the 'official' orthography by the Haitian government and is consequently the most widely accepted spelling system for Haitian Creole today.

Despite the existence of an official spelling system for this language, there is no guarantee 1) that all known and available texts follow the same orthography, and 2) that the Haitian Creole (HC) written language will naturally and automatically pass through the stage of wider-use normalization whereby the lexicon standardizes itself in written form. Spelling is more than simply combining individual letters into words. Spelling standardization also takes into account the contoured forms of words, through standardization of the lexicon.

It is important to get past the simple letter/sound correspondence and to achieve a standardized written form for each word in the language. Doing so creates a contoured form for each word that native speakers learn to recognize, memorize and reproduce when they write. This is essential when it comes to using the written form of the language across the different information and communication channels (authoring, publishing, translation, Web site information, government administrative information, etc.).

The lack of standardized forms for HC creates a socio-linguistic context in which there is constant negative reinforcement at the cognitive level for literacy learners, who read and use two or more graphic signs to symbolize the same sound. Readers of this newsletter may wish to browse through the discussion archives of the Haitian Creole Forum at Windows on Haiti (www.creoletalk.com/), where they can find several alternative ways of writing the graphemes è and ò of the IPN orthography: 1) with apostrophes before or after the vowel letter (i.e., 'e or e' and 'o or o'); 2) as an uppercase letter with no diacritic (i.e., E and O); and 3) as a lowercase letter with no diacritic (i.e., e and o). The first two sets of alternative forms can be handled easily with conversion algorithms. The last of the three is the most difficult to deal with in developing multilingual text- and speech-based systems, because the non-accented graphemes e and o also exist in written HC.
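
As a rough illustration of why the first two notations are easy to convert and the third is not, here is a minimal Python sketch. It is not the conversion software mentioned elsewhere in this article; the function name to_ipn and both matching rules are assumptions made for this example. In particular, it assumes that an apostrophe attached to e or o always marks the accent, and that an uppercase E or O directly after a lowercase letter is the accent notation rather than ordinary capitalization.

```python
import re

# IPN graphemes that the alternative notations stand for.
GRAVE = {"e": "è", "o": "ò"}

# Assumption: an apostrophe attached to e/o always marks the grave accent
# (real texts also use apostrophes for contractions, which this ignores).
APOSTROPHE_FORM = re.compile(r"'([eo])|([eo])'")

# Assumption: an uppercase E/O directly after a lowercase letter is the
# accent notation, not ordinary capitalization.
UPPERCASE_FORM = re.compile(r"(?<=[a-zèò])([EO])")


def to_ipn(text: str) -> str:
    """Rewrite the apostrophe and uppercase notations as è and ò."""
    text = APOSTROPHE_FORM.sub(lambda m: GRAVE[m.group(1) or m.group(2)], text)
    return UPPERCASE_FORM.sub(lambda m: GRAVE[m.group(1).lower()], text)


print(to_ipn("sem'en"))     # apostrophe before the vowel
print(to_ipn("gouvEnman"))  # uppercase notation
print(to_ipn("semen"))      # plain e: left unchanged
```

The third notation passes through untouched because nothing in the string distinguishes a plain e or o that stands for è or ò from a genuine e or o. Resolving it would require a lexicon of standardized word forms rather than a character-level rule.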

Written lexical variation

The reason for focusing on word-level contour and form is the written lexical variation that has resulted from past literacy efforts in Haiti. Take for example the word frequency counts presented below (Mason, 2000), which show spelling variation for very common HC words found in texts that were collected from a series of independent sources (newspapers, Haitian Creole Bible, Creole language publishers, health organizations, etc.).

(1) The word for "enemy"
 Frequency   Written form
       457   lènmi
         2   lènnmi
         9   lenmi
         5   lennmi
         9   ènmi
         6   enmi
         7   ennmi

(2) The word for "week"
 Frequency   Written form
       295   semèn
        11   semènn
        20   semen
        28   semenn
         2   senmenn

(3) The word for "government"
 Frequency   Written form
        10   gouvèman
         8   gouvèmnan
         7   gouvènmam
       924   gouvènman
         5   gouvènnman
        20   gouvenman

Despite the fact that in many cases there tends to be a dominant spelled form, it is important to note that many of the variants each account for 2-10% of the instances of the word across the different texts.
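
The proportions can be checked directly from the counts. The short Python sketch below uses the figures for "week" from table (2) above; the function name variant_shares is simply a label chosen for this illustration.

```python
# Frequency counts for the word "week", copied from table (2) above.
WEEK = {"semèn": 295, "semènn": 11, "semen": 20, "semenn": 28, "senmenn": 2}


def variant_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each written form's share of all occurrences, in percent."""
    total = sum(counts.values())
    return {form: round(100 * n / total, 1) for form, n in counts.items()}


for form, share in variant_shares(WEEK).items():
    print(f"{form:10} {share:5.1f}%")
```

For this word the dominant form covers roughly 83% of the occurrences, and the remaining share is spread over four competing spellings.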

Another important source of information about variation in Haitian Creole is Allen & Hogan (1998). Examples from their study (provided below), which covers 13 different data sources, show a high degree of variation in Haitian Creole spelling for the very same lexical items. This indicates clearly that spelling variation for the same word occurs not only between the different data sources (inter-textual), which might be expected, but also within the same source texts (intra-textual).

(1) The word for "week"
Text source                               Spelled form   # of occurrences
Educa Vision                              semèn                         3
                                          semenn                        2
Haïti Progrès                             semen                        25
                                          semenn                        1
Jounal Libète                             semenn                       10
                                          semèn                         1
Christian Reformed World Relief Mission   semenn                        7
                                          semèn                         1
                                          semènn                        1
Carnegie Mellon University                semèn                       280
                                          semen                        22
                                          semenn                       13
                                          semènn                       11
                                          senmenn                       2
                                          senmèn                        1
                                          senmen                        1

(2) The word for "government"
Text source                               Spelled form   # of occurrences
Bib-la                                    gouvènman                   122
                                          gouvenman                     3
Educa Vision                              gouvènman                     6
                                          gouvenman                     1
Carnegie Mellon University                gouvènman                   645
                                          gouvenman                    18
                                          gouvènnman                    5
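
The two kinds of inconsistency can be separated mechanically once counts are kept per source. The following Python sketch uses the "week" counts from the table above; the function names are labels chosen for this illustration and are not taken from the Allen & Hogan study.

```python
# Counts for the word "week" per text source, copied from the table above.
WEEK_COUNTS = {
    "Educa Vision": {"semèn": 3, "semenn": 2},
    "Haïti Progrès": {"semen": 25, "semenn": 1},
    "Jounal Libète": {"semenn": 10, "semèn": 1},
    "Christian Reformed World Relief Mission": {
        "semenn": 7, "semèn": 1, "semènn": 1,
    },
    "Carnegie Mellon University": {
        "semèn": 280, "semen": 22, "semenn": 13, "semènn": 11,
        "senmenn": 2, "senmèn": 1, "senmen": 1,
    },
}


def intra_textual(counts_by_source):
    """Return the sources that spell the word in more than one way."""
    return [src for src, forms in counts_by_source.items() if len(forms) > 1]


def dominant_forms(counts_by_source):
    """Map each source to its most frequent spelling of the word."""
    return {src: max(forms, key=forms.get)
            for src, forms in counts_by_source.items()}


def inter_textual(counts_by_source):
    """True when the sources disagree on their dominant spelling."""
    return len(set(dominant_forms(counts_by_source).values())) > 1


print(intra_textual(WEEK_COUNTS))
print(dominant_forms(WEEK_COUNTS))
print(inter_textual(WEEK_COUNTS))
```

On these figures every source is internally inconsistent, and the sources also disagree with one another about which spelling is dominant.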

Wide-scale spelling variation in a less-prevalent language is not unique to Haitian Creole. Ken Decker [1996 12, section 3.2] states that in "B[elize] C[reole] texts, I have often found the same word spelled different ways in the same text, or even the same sentence." For Reunion Creole, "la variation graphique atteint ici 100 % des unités lexicales" (our translation: variation of the graphemic form affects every lexical item in the language), according to Pierre-Louis Mangeard (e-mail communication).

Risks of implementing orthographies for less-prevalent languages

As we have seen from the information provided above, a less-prevalent language like Haitian Creole might have an official orthography, yet there is always a risk that several self-created orthographies will arise in such pre-literate and early literate cultures. In these contexts, spelling variation can permeate the entire written lexicon of the language. One cannot be assured that all existing databases and Web sites containing texts will follow the same orthography, nor that the written language will naturally pass through the stage of wider-use normalization with self-standardization of each specific written lexical form.

Standardization of the lexicon, and not just of the orthography, is therefore a crucial issue for the use of the written language in all potential areas of localization (authoring, publishing, translation, Web site, software, training, etc.). A lack of focus on spelling standardization in less-prevalent language localization can lead to linguistic chaos that will be extremely difficult to unravel afterwards.

Lexical standardization is a key issue for a new generation of localization efforts for less-prevalent languages. Less-prevalent languages will account for more of the translation and localization market in the future due to new legal mandates in many countries. A high wave of immigration to a major city in a powerful economic country can significantly influence the translation job request volume for the health, immigration, communication and judicial sectors.

With tight turnaround schedules, it is not certain that localizers would have adequate time and resources to check texts against one or several co-existing, competing spelling systems, and then determine which one is the most appropriate to use. Given that many spelling systems are affiliated with specific political or religious groups in a given country, there need to be semi-automatic and near-automatic ways of validating localized work. Choosing an obsolete spelling system, or one that has negative connotations for the entire population of immigrants, could lead to the rejection of a software interface or of important information (such as health documents) that has been requested by a client.
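
One possible shape for such a semi-automatic check is sketched below in Python. Everything in it is an assumption made for illustration: the small reference lexicon simply reuses the dominant forms from the frequency counts above, and the folding rule in variant_key (drop diacritics, collapse doubled letters) is a deliberately crude stand-in for a real variant-matching method. The sample sentence is likewise only an illustration.

```python
import re
import unicodedata

# Hypothetical reference lexicon: the dominant forms from the counts above.
STANDARD_FORMS = ["lènmi", "semèn", "gouvènman"]


def variant_key(word: str) -> str:
    """Fold a spelling: drop diacritics, then collapse doubled letters."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"(.)\1+", r"\1", bare)


# Index the reference lexicon by folded key.
KEY_TO_STANDARD = {variant_key(form): form for form in STANDARD_FORMS}


def flag_variants(text: str) -> list[tuple[str, str]]:
    """Return (token, suggested standard form) pairs for a human reviewer."""
    normalized = unicodedata.normalize("NFC", text.lower())
    tokens = re.findall(r"[^\W\d_]+", normalized)
    flagged = []
    for token in tokens:
        standard = KEY_TO_STANDARD.get(variant_key(token))
        if standard is not None and token != standard:
            flagged.append((token, standard))
    return flagged


print(flag_variants("Gouvenman an te anonse sa semenn sa a."))
```

Because plain e and o are also legitimate graphemes in written HC, a folding key of this kind will sometimes merge words that are genuinely different. The output is therefore a list of suggestions for a reviewer, not an automatic correction, and variants such as senmenn that differ by more than an accent or a doubled letter are not caught at all.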

Conclusion

Techniques need to be developed and implemented to provide for lexical standardization of less-prevalent languages (Mason 1999; Mason 2000). If not, these languages run the risk of suffering greatly in efforts to meet needs based on information (authoring, translation, Web site localization, documentation, etc.) and on software development (translation systems, desktop publishing, OCR, spell-checking, information retrieval, question-answer, speech recognition and synthesis, etc.), upon which the modern world is basing its current and future decisions for information processing and communication. Resources and efforts must be found to streamline the spelling standardization process for less-prevalent languages as they continue to take part in localization and translation workflow processes.

References

Allen, Jeffrey & Christopher Hogan. 1998. Evaluating Haitian Creole orthographies from a non-literacy-based perspective. Paper presented at the Society for Pidgin and Creole Linguistics conference, New York City, 9–10 January 1998.

Baker, Philip. 1997. Developing Ways of Writing Vernaculars: Problems and Solutions in a Historical Perspective. In Vernacular Literacy: A Re-evaluation, ed. by Tabouret-Keller et al., 93–141. Oxford: Oxford University Press.

Bernard, Joseph. 1980. Ki Jan Nou Ekri Kreyòl Ayisyen [Reprint of communiqué on Haitian Creole official orthography]. Études Créoles 3.1:101–05.

Decker, Ken. 1996. Orthography Development for Belize Creole. In 1994 Mid-America Linguistics Conference Papers, Volume II, ed. by Frances Ingemann, 351–362. Lawrence, Kansas: The University of Kansas.

Mason, Marilyn. 1999. Orthographic Conversion and Lexical Standardization for Vernacular Languages. ELRA Newsletter, 4.4: 5–7.

———. 2000. Spelling issues for Haitian Creole Authoring and Translation Workflow. International Journal for Language and Documentation 4:28–30.

Schieffelin, Bambi & Rachelle Charlier Doucet. 1992. The 'Real' Haitian Creole: Metalinguistics and Orthographic Choice. Pragmatics 2.3:427–43.


Marilyn Mason has specialized in the field of professional documentation preparation and publication since 1967 in cross-cultural settings in the US and in developing nations. Creator of CreoleConvert(tm) and CreoleScan(tm), she is President and Chief Operating Officer of Mason Integrated Technologies Ltd (MIT2), a provider of natural language processing and language stabilization computer processes and localization software. In addition, Marilyn is a Founding Director of The Alain Rocourt Endowment of the Illinois Great Rivers Conference of the United Methodist Foundation, which sends the earnings of its invested funds totally and directly to Eglise Méthodiste d'Haïti, in support of its schools. Marilyn has 3 grown children and 7 grandchildren.



