Guaranteeing Multilinguality in the Information Society

Carol Peters

Istituto di Elaborazione della Informazione

Consiglio Nazionale delle Ricerche

Pisa, Italy

Why we need Multilinguality

To challenge the predominance of English in international communications.

To preserve the enormous wealth of knowledge and ideas represented by the diversity of languages and cultures.

To ensure equal opportunities world-wide:

for users who want to access the information available on the Web, whatever the language,

for information providers who want to make ideas available on the Web, whatever the language.

The key to progress is education; the main path to education is access to and diffusion of knowledge.

What do we mean by multilingual access to the Web?

Ideally, users should be able to access document bases available on the Internet in different languages, specifying their information needs in their native language, and achieving a high level of search and retrieval precision.

They should be able to retrieve documents matching their query in whatever language the document is stored.

(We are talking about multilingual access NOT machine translation.)

World-wide Multilingual Access to the Web implies:

(i) Multiple language representation, manipulation, and display;

(ii) Multilingual search and retrieval.

The SAMOS System

SAMOS is an ERCIM-sponsored project that aims at the development of a networked computer science technical report library.

A digital library architecture will be developed which provides Internet access to a distributed, decentralised multi-format collection of documents and includes a multilingual interface.

SAMOS is working in close collaboration with the US-based NCSTRL consortium.

The evolution of SAMOS towards a system able to support a multimedia CS digital library will be studied.

Providing Multilingual Access and Query Functionalities



· implement an interface to the SAMOS system in each European language

· compile large-scale multilingual language reference resources in the Computer Science domain

· design an enriched multilingual classification thesaurus for CS

· study approaches to multilingual information retrieval

Multilingual Interface

A multilingual application must be able to present data in multiple languages meaningfully; it must support the character sets and encodings used to represent the information it is processing.

The ERCIM consortium is a multilingual community of more than a dozen languages, including English, German, the Romance languages (French, Spanish, Portuguese, Italian), most of the Scandinavian languages, Czech, Hungarian and Greek.

Multilingual Character Encoding

SAMOS must identify and adopt a suitable character encoding standard to cover all of the languages it represents.

Two possible approaches:

· use several 8-bit ISO standard Latin character sets and include an indication of the code used in the document metadata;

· use a single 16-bit character encoding, such as the Unicode standard, to represent all the languages.
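The two approaches can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout and the `charset` field name are hypothetical, and `utf-16-be` stands in for a fixed 16-bit encoding.

```python
# Sample texts; the French word is written with an escape so the code point is unambiguous.
greek = "αλφα βητα"
french = "biblioth\u00e8que"  # "bibliothèque"

# Approach 1: each document is stored in an 8-bit ISO 8859 code page,
# and the code page is named in the document metadata (hypothetical field).
docs_8bit = [
    {"charset": "iso-8859-7", "body": greek.encode("iso-8859-7")},
    {"charset": "iso-8859-1", "body": french.encode("iso-8859-1")},
]

def read_8bit(doc):
    """Decode a document using the charset declared in its metadata."""
    return doc["body"].decode(doc["charset"])

# Approach 2: every document uses the same 16-bit encoding, so no
# per-document charset indication is needed.
docs_16bit = [greek.encode("utf-16-be"), french.encode("utf-16-be")]

def read_16bit(body):
    """Decode a document stored in the single shared 16-bit encoding."""
    return body.decode("utf-16-be")

decoded_8bit = [read_8bit(d) for d in docs_8bit]
decoded_16bit = [read_16bit(b) for b in docs_16bit]

# If the metadata is lost or wrong, the 8-bit bytes are misinterpreted:
# the Greek bytes read as ISO 8859-1 produce unrelated Latin characters.
misread = docs_8bit[0]["body"].decode("iso-8859-1")
```

With the first approach, correct display depends entirely on the metadata travelling with the document; with the second, the bytes are self-describing as far as characters are concerned.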


The Unicode Character Standard encodes scripts (collections of symbols) rather than languages.

The Unicode standard currently contains 34,168 coded characters covering the principal written languages of the Americas, Europe, the Middle East, Africa, India, and Asia.

Unicode characters are language neutral; if necessary, a higher-level protocol must be used to specify the language.
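A short Python sketch can illustrate the point that characters carry a script but not a language. The `script_of` helper and the `lang` field are assumptions made for this example; the language tag here plays the role of the higher-level protocol.

```python
import unicodedata

def script_of(ch: str) -> str:
    """Return the first word of the character's Unicode name.

    For the letters used below, that word names the script.
    """
    return unicodedata.name(ch).split()[0]

# Each character identifies its script, never a language.
for ch in ["a", "α", "д"]:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  ->  {script_of(ch)}")

# The same code points serve every language written in a script, so the
# text alone cannot say whether "table" is French or English. The language
# has to be declared outside the character data (hypothetical record layout).
documents = [
    {"lang": "fr", "text": "table"},
    {"lang": "en", "text": "table"},
]
same_characters = documents[0]["text"] == documents[1]["text"]
```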

Primary scripts supported include:

Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Latin, Lao, Malayalam, Oriya, Phonetic, Tamil, Telugu, Thai.

Secondary scripts:

Numbers, General Diacritics, General punctuation, General Symbols, Mathematical Symbols, Technical Symbols, Miscellaneous symbols, ...

Implications of different approaches will be studied:

· effects on storage and transmittal times;

· compatibility with existing document collections, e.g. effects of introducing an extended character encoding into the wider NCSTRL consortium.
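The storage effect can be estimated with a small measurement. The sketch below is illustrative only: the sample strings are invented, and `utf-16-be` is assumed as the 16-bit encoding.

```python
# (text, 8-bit code page able to represent it); sample strings are invented.
samples = {
    "English": ("computer science technical reports", "iso-8859-1"),
    "French": ("biblioth\u00e8que num\u00e9rique", "iso-8859-1"),  # "bibliothèque numérique"
    "Greek": ("αλφα βητα", "iso-8859-7"),
}

def storage_cost(text, charset_8bit):
    """Return (bytes in the 8-bit code page, bytes in the 16-bit encoding)."""
    return len(text.encode(charset_8bit)), len(text.encode("utf-16-be"))

costs = {name: storage_cost(text, cs) for name, (text, cs) in samples.items()}

for name, (small, large) in costs.items():
    print(f"{name}: {small} bytes (8-bit) vs {large} bytes (16-bit)")
```

For text that fits in one 8-bit code page, a fixed 16-bit encoding uses exactly two bytes per character, so storage and transmission volume double before any compression is applied.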

Close contacts will be maintained with the WWW Working Group studying this question. Any decision made by the SAMOS project must be compatible with the character encoding standard for the Web.


We will implement and test the encoding of our extended character set by providing user interfaces to the system in English, French, German, Greek and Italian.

The user will be able to select independently the interface language and the language in which the query will be formulated and submitted.

Once the character encoding has been implemented and tested on the initial five languages, we will provide additional interfaces in the other project languages.

Developing Multilingual Search and Retrieval Tools

In SAMOS we will implement:

· key-word based search tools

· more sophisticated thesaurus-independent query tools

SAMOS provides a platform for the integration of methodologies and tools developed for Natural Language Processing (NLP) with results from the Information Retrieval (IR) field.

Tools for NLP

Recent years have seen development of a series of lexicon and text management tools for all types of NLP applications.

Lexicon-based Systems:

Mono- and bilingual electronic dictionaries and lexical databases

Lexical knowledge bases

Morphological analysers and generators

Procedures to generate taxonomic data


Text and Corpus Systems:

Monolingual Corpus Management Systems

Parallel and comparable text systems

Part-of-speech taggers, syntactic parsers, sense disambiguators


These tools are now being applied to typical Information Retrieval tasks.

Multilingual Language Reference Resources

In the creation and evaluation of our multilingual search tools we will use corpus data.

Recent studies in corpus linguistics show the importance of real language data for the acquisition of reliable statistics on term usage and frequency; this can be supplied by language reference corpora.
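As a rough illustration of the kind of statistics a reference corpus supplies, the following Python sketch counts term frequencies. The two-sentence corpus and the simple word tokenizer are assumptions for the example; a real corpus system would use proper tokenization and morphological analysis.

```python
from collections import Counter
import re

def term_frequencies(texts):
    """Count lower-cased word tokens across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

# Tiny invented sample standing in for a computer science corpus.
corpus = [
    "The parser builds a syntax tree.",
    "A tree automaton accepts the tree language.",
]

freq = term_frequencies(corpus)
total = sum(freq.values())

# Relative frequency of each term in the corpus.
relative = {term: count / total for term, count in freq.items()}

print(freq.most_common(3))
```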

Thus we intend to construct reference corpora for Computer Science:

Main corpus: English sub-language corpus for theoretical computer science.


Set of comparable corpora in the main project languages representative of a CS sub-domain.

Comparable corpora are sets of texts from pairs or multiples of languages that have the same communicative function and can be contrasted and compared because of their common features.

Multilingual classification thesaurus for CS

The basis for the search tools will be an enriched multilingual classification thesaurus.

The core thesaurus will be translated into German, French, and Italian, using electronic dictionaries where possible. The translated terminology will be mapped to the base thesaurus. The results will then be evaluated and correlated with lists of key terms generated directly from corpora in the target languages (French, German, Italian).
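One possible shape for such a thesaurus is a table of concepts, each with one label per language. The Python sketch below is a minimal illustration, not the SAMOS design: the concept identifiers, the terms, and the function names are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical mini-thesaurus: concept id -> language code -> preferred term.
THESAURUS: Dict[str, Dict[str, str]] = {
    "C001": {"en": "database", "de": "Datenbank", "fr": "base de données", "it": "base di dati"},
    "C002": {"en": "network", "de": "Netzwerk", "fr": "réseau", "it": "rete"},
}

def find_concept(term: str, lang: str) -> Optional[str]:
    """Map a term in the given language to its concept id, if any."""
    wanted = term.casefold()
    for concept_id, labels in THESAURUS.items():
        if labels.get(lang, "").casefold() == wanted:
            return concept_id
    return None

def translate_term(term: str, source_lang: str) -> Dict[str, str]:
    """Return the equivalent terms in all other languages via the shared concept."""
    concept_id = find_concept(term, source_lang)
    if concept_id is None:
        return {}
    return {lang: label for lang, label in THESAURUS[concept_id].items() if lang != source_lang}

def unattested(lang: str, corpus_terms: List[str]) -> List[str]:
    """List thesaurus terms for a language that do not occur among the corpus key terms."""
    seen = {t.casefold() for t in corpus_terms}
    return [
        labels[lang]
        for labels in THESAURUS.values()
        if lang in labels and labels[lang].casefold() not in seen
    ]

print(translate_term("Datenbank", "de"))
print(unattested("it", ["rete"]))
```

Here `translate_term` shows how a query term entered in one language could be carried over to the others, and `unattested` shows one simple way to check translated terms against key terms drawn from a target-language corpus.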

Related Work

Of immediate relevance to our work on multilingual thesaurus building and the development of key-word search based tools are two current projects in the Libraries programme: TRANSLIB and CANAL/LS.

Both projects aim at supporting multilingual access to library on-line public access catalogues (OPACs) and both are building multilingual thesauri as tools for this purpose.

SAMOS aims at extending multilingual access to search not only catalogue data but also full-text document data in a specific domain.

The SAMOS multilingual thesaurus will thus differ in that it will refer to a selected sublanguage (computer science) and should be more exhaustive: it will be supplemented using reference corpus data and expanded to include a network of semantically and syntactically related data.


EuroWordNet aims at building a multilingual database with wordnets containing explicit semantic relations for several European languages (English, Dutch, Spanish and Italian).

The wordnets will be stored in a central lexical database system and word meanings linked to meaning in the Princeton WordNet 1.5.

Major concepts and words in the individual wordnets will be merged to form a language-independent ontology (set of semantic relations between concepts).

The aim is a flexible general-language multilingual search tool (not domain-specific terminology).

We hope to establish links between the SAMOS multilingual ontology for CS and the EuroWordNet database.

Multilingual Information Retrieval


I. use the multilingual thesaurus in the development of multilingual search tools

II. study more broad-coverage tools to complement the thesaurus-based tools

The aim is to cover the different requirements of users of our system:

· librarians using rich thesauri to specify precise queries where exact results are required,

· researchers specifying more vague queries where a high level of recall is desired.
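The two user profiles correspond to the standard retrieval measures of precision and recall. The sketch below computes both for two invented result sets; the document identifiers and the relevance judgements are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented relevance judgements for one information need.
relevant = {"d1", "d2", "d3", "d4"}

# A precise thesaurus-based query: few results, all of them relevant.
narrow = ["d1", "d2"]

# A vaguer, broad-coverage query: finds everything relevant, plus noise.
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

narrow_scores = precision_recall(narrow, relevant)
broad_scores = precision_recall(broad, relevant)

print("narrow:", narrow_scores)
print("broad: ", broad_scores)
```

In this toy case the narrow query scores higher on precision and the broad query higher on recall, which is the trade-off the two kinds of tools are meant to cover.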

The methodologies studied should be extendible to other languages.

Pilot Programs

There are three proposed pilot programs within SAMOS:

I. Corpus-based enrichment of multilingual thesauri and lexicon-based query procedures (CNR)

II. Multilingual querying based on a graphical thesaurus browser (FORTH)

III. Automatic query expansion for multilingual information retrieval (ETH)

Useful URLs on the Web:

Multilinguism and the Internet

WINTER: Web Internationalization and Multilinguism

