Om: One tool for many (Indian) languages


Ganapathiraju et al. / J Zhejiang Univ SCI 2005 6A(11):1348-1353

Om: One tool for many (Indian) languages

GANAPATHIRAJU Madhavi†1, BALAKRISHNAN Mini2, BALAKRISHNAN N.†2, REDDY Raj†1
(1Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA)
(2Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore 560 012, India)
†E-mail: [email protected]; [email protected]; [email protected]
Received Aug. 5, 2005; revision accepted Sept. 10, 2005

Abstract: Many different languages are spoken in India, each language being the mother tongue of tens of millions of people. While the languages and scripts are distinct from each other, the grammar and the alphabet are similar to a large extent. One common feature is that all the Indian languages are phonetic in nature. In this paper we describe the development of a transliteration scheme Om which exploits this phonetic nature of the alphabet. Om uses ASCII characters to represent Indian language alphabets, and thus can be read directly in English, by a large number of users who cannot read script in other Indian languages than their mother tongue. It is also useful in computer applications where local language tools such as email and chat are not yet available. Another significant contribution presented in this paper is the development of a text editor for Indian languages that integrates the Om input for many Indian languages into a word processor such as Microsoft WinWord®. The text editor is also developed on Java® platform that can run on Unix machines as well. We propose this transliteration scheme as a possible standard for Indian language transliteration and keyboard entry.

Key words: Om transliteration, Indian language technologies, Text editor
doi:10.1631/jzus.2005.A1348    Document code: A    CLC number: TP391

INTRODUCTION

India is a nation with pluralistic culture, a large number of cultures, ethnicities, languages and religions coexisting with each other.
While the culture and faith unify the country under one umbrella, either by similarity or by tolerance, language is what separates its people. In the 1951 census, the first census after India attained independence, 845 languages (dialects) were identified, of which 60 were spoken by at least 100,000 people each. The Indian constitution identifies 22 languages, of which six (Hindi, Telugu, Tamil, Bengali, Marathi and Gujarati) are each spoken by at least 50 million people within the boundaries of the country; a large number of their speakers also live outside the country. Although the Indian languages have been identified as belonging to only four language families, namely the Austric, Dravidian, Tibeto-Burman, and Indo-Aryan, the language spoken by one person is rarely understood by a person familiar only with another language. This does not, however, rule out bilingualism among a large number of people, especially those who migrate from one state to another: they speak the mother tongue at home and can usually follow the dominant language of the new state. For example, Telugu speakers are found in good numbers in Karnataka (3,325,062), Maharashtra (1,122,332), Orissa (665,001), and Tamil Nadu (3,975,561); about 10% of Telugu speakers lived outside the Telugu territory according to an old 1901 estimate, and this number would be much larger today. Bilingualism is also found at the borders of two states, where people can usually speak the languages of both states sharing the border. Taking the example of Andhra Pradesh again, where the native language is Telugu, a large number of people speak the languages of its neighbours: Kannada (519,507), Marathi (503,609), Oriya (259,947), and Tamil (753,484).

Journal of Zhejiang University SCIENCE ISSN 1009-3095 http://www.zju.edu.cn/jzus E-mail: [email protected]

Language technologies and PC penetration in India

India is fast becoming a software super-power: the nation has over 3000 computer training institutes; software exports were about 6 billion US dollars in 2003, and are expected to grow very soon to 50 billion US dollars, which is 33% of total exports. However, net-surfers, who were at 0.2% of the total population, are expected to grow only up to 7% by 2006. The PC penetration rate is merely 1.4%. Sixty-eight million homes out of 408 million homes (17%) in the country have a TV, while only 22 million (5%) have a telephone, which is still much higher than the 1.4% penetration of computers. The two most important factors behind this low computer usage by non-software professionals may be low income and illiteracy. The low-income population in the country, which is a third of the total population, prefers to buy a television set rather than a PC because of the entertainment value, ease of use and the current non-utility of a PC in their everyday life. At the time of the birth of independent India, about half a century ago, the Indian middle class was an insignificant minority, although the middle class is upwardly mobile. With the economic reforms brought about in the early 1990s, the Indian middle class is growing at a rapid rate and is expected to reach 50% within a generation, and poverty is expected to diminish to 15%. Complementing the economic growth rate, the new Indian middle class is filled with entrepreneurs who are spreading the power of information technology to the rural areas. Although the PC has not yet penetrated into rural homes, there are countless Internet facilities (called cyber-cafés) that are expanding, similar in scope and impact to the public telephone booths in the rural areas. Low-end computers, costing about $100 to $200, are coming to the market (Simputer, Mobilis, Nova NetPC).
Thus, irrespective of economic status, the power of information technology is expected to be available for the Indian population very soon.

The second limiting factor in PC usage, however, is non-availability of the operational software in native language, and the language barriers between people. While the development of an operating system in the native language is a solution, this is likely to be limited to only a couple of languages; and the development of natural language processing technologies would have to wait until the standardization of the digital representation; the porting of available

