MLIM: Chapter 2

[This chapter is available as http://www.cs.cmu.edu/~ref/mlim/chapter2.html.]

[Please send any comments to Robert Frederking (ref+@cs.cmu.edu, Web document maintainer) or Ed Hovy or Nancy Ide.]

 

 

 

Chapter 2

Multilingual (or Cross-lingual) Information Retrieval

 

Editors: Judith Klavans and Eduard Hovy

Contributors:

Christian Fluhr

Robert E. Frederking

Doug Oard

Akitoshi Okumura, Kai Ishikawa, and Kenji Satoh

 

Abstract

Multilingual Information Retrieval (MLIR) is the study of systems that accept queries for information in various languages and return objects (text and other media) in various languages, translated into the user's language. The rapid growth and online availability of information in many languages have made this a highly relevant field of research within the broad umbrella of language processing research. We ignore here issues pertaining to Machine Translation (Chapter 4) and Multimedia (Chapter 9), and focus on the extensions required of traditional Information Retrieval (IR) to handle more than one language.

 

2.1 Multilingual Information Retrieval

2.1.1 Definition and Terms

Multilingual Information Retrieval (MLIR) refers to the ability to process a query for information in any language, search a collection of objects, including text, images, sound files, etc., and return the most relevant objects, translated if necessary into the user's language. The explosion in recent years of freely distributed unstructured information in all media, most notably on the World Wide Web, has opened up the traditional field of Information Retrieval (IR) to include image, video, speech, and other media, and has extended it to include access across multiple languages. Being new, MLIR will probably also include the historically excluded access mechanisms typical of libraries involving structured data, such as MARC catalogue records.

The general field of MLIR has expanded in several directions, focusing on different issues; what exactly is within its purview remains open to discussion. It is generally agreed, however, that Machine Translation proper (see Chapter 4) and Multimedia processing (see Chapter 9) are not included. Nonetheless, several new terms have arisen around the new IR, each with a slight variation in emphasis, inclusiveness, or historical association with related fields. For example, recent research in multilingual information retrieval, such as (Fluhr et al., 1998) in (Grefenstette, 1998), includes descriptive catalogue data from libraries as well as unstructured data. Hull and Grefenstette (1996) list five uses of the term MLIR:

  1. Monolingual IR in any language other than English. This was the usage from the TREC conference series (Harman, 1995) in which IR experiments in Spanish and other languages are referred to as the multilingual track.

  2. IR performed on a collection of documents in various languages, the documents parallel (paired across languages) or not, with queries entered in one language only. In this case, typically the query is translated and each language-specific portion of the multilingual collection is treated as a separate monolingual section.

  3. IR on a monolingual document collection that can be queried in multiple languages. The query is entered in more than one language and typically translated into the document language.

  4. IR on a multilingual document collection over which queries in various languages can retrieve documents in various languages. This is an extension of (2) and (3).

  5. IR on individually multilingual documents, where more than one language may be present in a single document. This rather curious case may occur when an original-language quote is embedded within a document in a different language.

In addition to MLIR, four related terms have been used:

1. Multilingual Information Access (MLIA). The broadest possible term to use is Multilingual Information Access, which refers to query, retrieval, and presentation of information in any language. The term MLIA is used in the NSF-EU working groups (Klavans and Schäuble, 1998). In general, the use of information access rather than retrieval implies a more general set of access functions, including those that have been part of the traditional library, as well as other modalities of access to other media. Access could refer to the use of speech input for video output, where the language component could consist of closed-captioned text or text from speech recognition, or catalogue querying to metadata. The term information access came into use recently as a way to broaden the historically narrower use of information retrieval.

2. Multilingual Information Retrieval (MLIR). This term refers to the ability to process a query in any language and return objects, such as text, images, sound files, etc., relevant to the user query in any language. Historically, however, Information Retrieval (IR) as a field involved a group of researchers from the unstructured text data base community who employed statistical methods to match query and document (Salton, 1988). In general, this work was English dominated, given the amount of digital information made available to the research community in the early years in English, and excluded access mechanisms typical of libraries involving structured data, such as MARC catalogue records. Thus MLIR as used in this chapter denotes a significantly wider field of interest than that of traditional IR.

3. Cross-lingual Information Access. The use of the term cross-lingual refers (in this context) to bridging two languages, rather than the ability to access information in any language starting with input in any language. Systems with cross-lingual capability can accept a query in language L1 or L2, for example English and French, and are capable of returning documents in either L1 or L2. (In other meetings, the term cross-lingual (or translingual) has been used to distinguish systems that cross a language barrier, as opposed to multiple monolingual systems as in TREC.) This term logically includes access via catalogue record and other structured indexing, as for MLIA.

4. Cross-lingual Information Retrieval (CLIR). CLIR generally implies a relationship to IR, with all the implications that apply to MLIR. At the 1997 Cross-language Information Retrieval Spring Symposium of the American Association for Artificial Intelligence (Oard et al., 1997), CLIR was defined with the following research challenge: Given a query in any medium and any language, select relevant items from a multilingual multimedia collection which can be in any medium and any language, and present them in the style or order most likely to be useful to the user, with identical or near-identical objects in different media or languages appropriately identified. This definition of the requirements of a system gives full recognition to the query, retrieval, and presentation requirements of a working system from a user perspective, and encapsulates succinctly the full set of capabilities to be included. However, its breadth makes it fit well with a definition of MLIA, the most general term, rather than CLIR, a more precise term.

2.1.2 MLIR: Linking and Hybridizing IR and MT

Multilingual Information Retrieval is a hybrid subject area, interacting with or encompassing several other fields. Section 2.5 discusses related fields.

How MLIR Relates to Information Retrieval

MLIR is an application of information retrieval. In many respects, as discussed above, the two fields share exactly the same goals; as such, well-known IR techniques such as vector space indexing, latent semantic indexing (LSI), similarity functions for matching documents, and query processing procedures are equally useful in MLIR. However, MLIR differs from IR in several significant ways. Most important, IR involves no translation component, since only one language is involved. The related but not identical problems of translating queries and documents are discussed below. Subsidiary problems, such as keeping track of translations across several languages, are also not part of the standard monolingual information retrieval process.

How MLIR Relates to and Uses Machine Translation

The goal in machine translation (MT; see Chapter 4) is to convert a text, written in language L1, into a coherent and accurate translation in language L2. To do so, most MT systems convert the input text, usually sentence by sentence, into a series of progressively more abstract internal representations, in which sentence-internal relationships are determined and the intended meaning of each word is identified. Armed with this information, the system makes the appropriate conversions to support the output language, after which output realization, usually also sentence by sentence, is performed. MT requires that the meaning of each individual word be known (as does accurate IR); without this knowledge, homographs (for example plane, which can refer to an airplane, carpentry tool, geometric surface, the action of skimming over water, and several other meanings) cannot be translated into their intended foreign words. Without word translation, no output is possible.

Can MLIR be Achieved by Coupling IR and MT?

Unfortunately, while at first blush it may seem that MLIR is simply a matter of coupling IR and MT engines, the special nature of MLIR places constraints on the input to MT that make a straightforward coupling infeasible. At one extreme, some recent MLIR research has explored extending IR-based indexing techniques to directly bridge language gaps with no explicit translation step at all; see Sections 2.2.2 and 2.3.1 below. Arguments regarding the special nature of MLIR, contained in the NSF-EU MLIA Working Group White Paper (Klavans and Schäuble, 1998), are summarized here.

Differences between the two types of input submitted by MLIR for translation (queries and documents) necessitate two different types of Machine Translation. In the case of queries, the input to MT is a set of disconnected words, or possibly multi-word phrases. There is no call for MT to parse the input, since no syntactic sentence structure can be found. More seriously, the MT system cannot apply traditional methods of word sense disambiguation, since the input is not a semantically coherent text. It will have to employ other (possibly IR-like) methods to determine the sense of each polysemous word in order to furnish accurate translations. On the other hand, there is no need to produce a linear, coherent output, and in fact multiple (correct) translations of a query term can provide a form of query expansion, which can improve IR performance. Finally, the processes of sentence planning and sentence realization are irrelevant when the input is a string of isolated query words. Without accurate queries, IR accuracy falls dramatically (results of recent studies are given later in this chapter).
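
As a minimal sketch of how retaining multiple translations acts as query expansion, the following Python fragment translates a bag of query terms through a bilingual word list and keeps every candidate. The dictionary contents, the name `TOY_DICTIONARY`, and the function `translate_query` are illustrative assumptions, not part of any system described in this chapter.

```python
from typing import Dict, List

# Hypothetical toy English-to-French word list; the entries are illustrative only.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "plane": ["avion", "rabot", "plan"],  # aircraft, carpentry tool, geometric surface
    "crash": ["accident"],
}


def translate_query(terms: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Translate isolated query terms, keeping every known translation.

    No parsing or sentence generation is attempted: the input is treated
    as a bag of words, and the output is an expanded bag of words.
    """
    translated: List[str] = []
    for term in terms:
        # Terms missing from the dictionary (e.g. proper nouns) pass through unchanged.
        candidates = dictionary.get(term.lower(), [term])
        for candidate in candidates:
            if candidate not in translated:
                translated.append(candidate)
    return translated


expanded_query = translate_query(["Boeing", "plane", "crash"], TOY_DICTIONARY)
print(expanded_query)
```

Keeping all three renderings of "plane" adds noise as well as coverage, which is the filtering trade-off discussed in Section 2.1.3.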

For the stage of IR after retrieval (that is, in the case of retrieved documents), in contrast, documents can be translated back into the user's language using the normal methods of MT. However, for this part of MLIR too, partial translation, or keyword extraction and translation, is often adequate for the user's needs. In particular, given the computational expense of MT, it may be inefficient to translate a full document that the user later determines is not exactly what was desired. In addition, fully general-purpose MT (especially between a wide variety of languages) is a very difficult problem. Translating a few keywords or a summary (see Chapter 3) is often a wise policy.

Several additional differences between monolingual IR and MLIR arise if the user is familiar with more than one language. In particular, the user interface must provide differential display capabilities to reflect differing language proficiency levels of users. When more than one user receives the results, translation into several languages may have to be provided. Furthermore, depending on the user's level of sophistication, translation of different elements at different stages can be provided to users for a range of information access needs, including keyword translation, term translation, title translation, abstract translation, specific paragraph translation, caption translation, full document translation, etc. Finally, monolingual IR users can also take advantage of the results of MLIR. Simply the knowledge that a particular query will access a certain number of documents in other languages could, in itself, be valuable information, even if translations are not required.

Thus for MLIR much of the typical MT machinery is irrelevant, or at best only partially relevant. The differences with traditional MT mean that MLIR cannot simply employ MT engines as front-end query translators and back-end document translators.

Rather, efficient ways of coupling together the internal processes of IR and MT engines are required, allowing each to employ the other's intermediate results. It is inevitable that second-generation MLIR systems will exhibit some more-than-surface integration of MT and IR modules.

2.1.3 Key Technical Issues for MLIR

We discuss three different positions on what the key problems in MLIR are. Grefenstette (1998) focuses on term choice and filtering. Oard (1998) presents user-centered challenges. Finally, Klavans (1999) outlines a two-part view that accommodates system-directed and user-directed research issues.

Grefenstette (1998) outlines three problems involving the processing of query terms for MLIR:

The first problem requires knowing how terms map between languages. Since little or no contextual text is present in the query to help with term disambiguation, this involves knowing the full range of choices of translations, not just one possible translation, coupled with an understanding of how different domains affect translation possibilities.

The second problem deals with determining how to filter, from all possible choices, which ones should be retained in the current application. Unlike MT, a MLIR system can retain a wider set of possibilities that can later be automatically filtered, depending on the kinds of variants that are permitted. Thus the MLIR system has to balance the number of inaccurate translations (noise) that degrade results against the amount of processing performed to disambiguate the terms and ensure accuracy.

Given that it is advisable to retain a set of well-chosen possible terms for the best retrieval performance, a problem new to MLIR arises. The possibility of assigning alternate weights to different translations permits more accurate term choice. For example, in a compound term such as "morphological change", the first word is quite narrow in translation possibilities (e.g., in French, only one translation, la morphologie) while the second is more general ("change" could be changement or monnaie). In such cases, more weight could be given to the first word's translation than to the second. This problem is compounded by the fact that some multi-word terms do not decompose, but should be treated as a collocation. Thus, mechanisms for weighting alternatives must consider individual word translation weights as well as multi-word term translation weights.
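
One simple way to realize such weighting, offered here only as a sketch, is to give each source term one unit of weight and split it evenly across its translations, while looking up the longest multi-word entry first so that collocations are not decomposed. The even split is an assumed heuristic, not a method attributed to Grefenstette, and the lexicon entries and the name `weighted_translations` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon; single words and collocations share one table.
TOY_LEXICON: Dict[str, List[str]] = {
    "morphological": ["morphologie"],
    "change": ["changement", "monnaie"],
    "hot dog": ["hot-dog"],  # a collocation that should not be decomposed
}


def weighted_translations(
    query: str, lexicon: Dict[str, List[str]]
) -> List[Tuple[str, float]]:
    """Return (translation, weight) pairs for a query string.

    Each matched source term contributes a total weight of 1.0, divided
    equally among its translations, so unambiguous terms weigh more.
    """
    words = query.lower().split()
    weighted: List[Tuple[str, float]] = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at position i before shorter ones.
        for span in range(len(words) - i, 0, -1):
            phrase = " ".join(words[i:i + span])
            if phrase in lexicon:
                options = lexicon[phrase]
                weighted.extend((option, 1.0 / len(options)) for option in options)
                i += span
                break
        else:
            # Unknown word: keep it untranslated with full weight.
            weighted.append((words[i], 1.0))
            i += 1
    return weighted


print(weighted_translations("morphological change", TOY_LEXICON))
```

In this sketch the single translation of "morphological" receives weight 1.0 while each rendering of "change" receives 0.5, mirroring the intuition in the example above.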

Grefenstette points out that the first two problems are also found in machine translation, and still require research for fully effective solutions. The third problem is one that clearly distinguishes MLIR from both MT and IR.

Oard (1998), in presentations during the Workshops on MLIR, outlined a historical view of CLIR that is user-centered in nature. He views the overall problem of CLIR as a series of processes, including query formulation and document selection, involving feedback from system to user and from user to system. The system-internal processes of indexing, document processing, and matching are treated as components supporting direct user interaction. He presents three points of historical perspective:

Oard's five challenges for the next five years are given in Section 2.4 below.

Klavans (1999) approaches the central problems in a somewhat different way, focusing on two sets of issues. One set involves three questions relating to the parts of the query-retrieval process, and the other set relates to user needs.

System issues. If the query-retrieval process is considered in sequential terms, the first task is to process a query, the second is to index documents and information in a way that permits access by a query, and the third is to match and rank the similarity between query and document set in order to choose relevant documents. (This model of IR applies to the traditional vector-based approaches to IR. As discussed in Sections 2.2.2 and 2.3.1, it is rather different for Latent Semantic Indexing (LSI) and related techniques.)
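
The three sequential tasks can be sketched for the monolingual vector-space case as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, a smoothed TF-IDF weighting, and cosine similarity are choices made here for brevity, and the function names are hypothetical. In an MLIR setting the query would first pass through a translation step.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def build_index(documents: Dict[str, str]) -> Tuple[Dict[str, Vector], Dict[str, float]]:
    """Task 2: index documents as TF-IDF weighted term vectors."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(documents)
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return vectors, idf


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, vectors: Dict[str, Vector], idf: Dict[str, float]) -> List[Tuple[str, float]]:
    """Task 1 (process the query) followed by task 3 (match and rank)."""
    query_counts = Counter(tokenize(query))
    # Query terms never seen in the collection cannot match and are dropped.
    query_vector = {t: tf * idf[t] for t, tf in query_counts.items() if t in idf}
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {"d1": "machine translation of text", "d2": "retrieval of documents"}
doc_vectors, idf_table = build_index(docs)
print(rank("translation", doc_vectors, idf_table))
```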

Usability issues. IR systems present two main interface challenges: first, how to permit a user to input a query in a natural and intuitive way, and second, how to enable the user to interpret the returned results. A component of the latter encompasses ways to permit a user to comment and provide feedback on results and to iteratively improve and refine results. MLIR brings an added complexity to the standard IR task. Users can have different abilities for different languages, affecting their ability to form queries and interpret results. For example, a user might be proficient in understanding documents in French, but could not produce a query in French. In this case, the user will need to formulate a query in his native language, but will want documents returned only in French, not translated. At the same time, this user may have spotty knowledge of German. In this case, he might request a set of key terms translated to his native language, and not want to view source documents in German at all. Or he may simply want a numerical count, in order to know that for a given query, there are a certain number of relevant documents in German, a certain number in French, a certain number in Vietnamese, and so on. In addition, knowing the specific sources of relevant information may also be very valuable.
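
The per-language count described above requires very little machinery once each retrieved document carries a language tag. The sketch below assumes results arrive as (document id, language code) pairs; that representation and the name `hits_by_language` are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def hits_by_language(results: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count retrieved documents per language code.

    `results` is a list of (document_id, language_code) pairs.
    """
    return dict(Counter(language for _, language in results))


retrieved = [("doc1", "de"), ("doc2", "fr"), ("doc3", "de"), ("doc4", "vi")]
print(hits_by_language(retrieved))
```

Such a summary can be shown to the user before any translation is attempted, so that translation effort is spent only on the languages the user chooses to pursue.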

Since research and applications in MLIR are so new, a full understanding of user needs has yet to be developed and tested. However, these needs differ from simple MT needs, given the user query production and refinement stages.

2.1.4 Summary of Technical Challenges

MLIR involves at least the following four technical challenges:

2.2 Where We Were Five Years Ago

2.2.1 Capabilities Then

The lure of cross-language information retrieval attracted experimentation by the IR community early on. Already in 1971, Salton showed that a transfer dictionary for English and French (a bilingual wordlist with predefined mappings between terms) could be used to translate a query from one language to another (Salton, 1971). This experiment, although ignoring the realistic and challenging problem of ambiguity, nonetheless served the information retrieval community well in providing a model for a viable approach to cross-language IR. However, at the same time, the experiment also illustrated some of the exceedingly difficult problems in the language translation and mapping component of a system, namely one-to-many mappings, gaps in term translations, and ambiguity. Similarly, in a manual test with a small corpus, Pevzner (1972) showed for English and Russian that a controlled thesaurus can be used effectively for query term translation.

For nearly twenty years, the areas of IR and MT remained separate, leaving MLIR somewhat dormant. Apart from a few forays into refining these early techniques, all significant advances in MLIR have been made in the past five years. This is not surprising, given that increased amounts of information are becoming available in electronic format, and the economy is globalizing.

2.2.2 Major Methods, Techniques, and Approaches Five Years Ago

We discuss the problem within the framework outlined above.

System issues include the following.

Usability issues include the following. Early experiments were performed at a small scale, more in the nature of proofs of concept than full-fledged large-scale systems. User feedback and user needs were simply not part of what was tested.

2.2.3 Major Bottlenecks and Problems Five Years Ago

The three major bottlenecks of the early part of this decade still persist. They are: limited resources for building domain and language models; limited new technologies for coping with the size of collections; and limited understanding of the myriad of user needs.

2.3 Where We Are Today

The burgeoning of the MLIR field is clearly in evidence in the bibliography of the first major review article on the topic (Oard and Dorr, 1996). Papers cited include related work on machine translation, including some research translated from Russian. There are 16 citations prior to 1980, 10 from 1980-89, and 52 from 1990 to early 1996. The first major book to be published on the topic (Grefenstette, 1998) reflects the same temporal bias. This work is slanted towards IR rather than toward MT. It contains 11 citations prior to 1980, 25 from 1980-89, and 101 from 1990 to very early 1998.

2.3.1 Major Methods, Techniques, and Approaches Now

Following the format above, we divide the methods into system-centered and user-centered concerns, although each provides feedback to the other.

System issues include the following:

Usability issues include the following. The development of effective MLIR technology will have no impact if the user's needs and operation patterns are not considered. Since MLIR is a growing field, and since applications are just emerging, formative studies of usability are essential. Currently, there are a limited number of systems in early operation which are providing important data (e.g., EuroSpider, the translate function of AltaVista, multilingual catalogue access). The incorporation of users in the relevance feedback loop is particularly important, since user needs vary greatly. A full review of user needs is found in (Klavans and Schäuble, 1998).

2.3.2 Major Bottlenecks and Problems

Since this is a new field, the bottlenecks listed in Section 2.2.3, evident in earlier years, persist.

2.4 Where We Will Be in Five Years

The growing number of multilingual corpora is providing a valuable and as yet untapped resource for MLIR. Such corpora are essential to building successful dynamic term and phrase translation thesauri, which are, in turn, key to effective indexing and matching. One of the key challenges is in devising efficient yet linguistically informed methods of tapping these resources, methods which combine the best of what is known about fast statistical techniques with more knowledge-based symbolic methods. Even promising new techniques, such as translingual LSI (Landauer et al., 1998) and related techniques (Carbonell et al., 1997), will most probably still rely on parallel corpora. Such corpora are often difficult to find, and very expensive to prepare. This has been the motivation for the work on comparable corpora. However, more and more parallel corpora are being created electronically, especially to conform to legal requirements for the European Union. The issues surrounding corpora are extensively discussed in Chapter 1.

An important class of techniques involves machine learning, as applied to the cross-language term mapping problem. Since term translation, loosely defined, is at the core of query processing, document processing, and matching, it is an important process to do thoroughly and accurately. Even if multiple translations are retained in the MLIR process, obtaining a sensible set of domain-linked terms is an important and central task. One way to obtain these term dictionaries is through parallel corpora, but statistical processing is typically difficult to fine-tune. As discussed in Chapter 6, machine learning techniques are a fundamental enhancement of the power of language processing systems and hold particular promise in this area as well.

Finally, it is to be hoped that our understanding of user needs and user interactions with MLIR systems will be significantly better in five years than it is now. As early systems emerge and are tested in the field, a range of flexible and fluid applications that can learn and dynamically adjust to the users' levels of competence, across languages and across domains, should appear. One possible example of this type of flexible application might be human-aided MT systems for producing gisting-quality translations of retrieved documents, which would allow the user to make a personal time/quality tradeoff: the longer the user interacted with the translator, the better the resulting output. Most probably, these systems will incorporate multimedia seamlessly and permit multimodal input and output. Such capabilities will provide maximum usability.

2.4.1 Expected Capabilities

Oard (1998) outlines five challenges for the next five years:

  • User-assisted query disambiguation, which might be limited to the most troublesome terms;

  • Enrichment of dictionary data with unlinked corpora;

  • Tailored title translation techniques;

  • Rapid translation and/or summarization, which involves some research on using queries to focus the translation effort; and

  • Automated global translation brokering, which balances capacity, capability and user needs.

2.4.2 Expected Bottlenecks in Five Years

Four key issues must be overcome in order to achieve effective MLIR. Some of these issues also apply to IR and MT independently.

  1. The tension between systems and users. The balance between understanding user needs and building MLIR systems is delicate. On the one hand, applications need to be built in order to test them with users. On the other, users have to define their desiderata for system builders. However, it is difficult to imagine in advance the full set of capabilities that should be part of a MLIR system. Asking system builders or users in advance requires a level of imagination and inventiveness that is difficult to achieve. Therefore a close coupling between these independent but related activities is especially important for building complex MLIR systems.

  2. The dependence on resource-expensive technologies. The increased need for multilingual corpora in order to build term translation lists and loose translations in a flexible and domain-independent way brings along an attendant problem: Where will these corpora come from? How reliable are they? Ways to collect, validate, and standardize comparable corpora are needed. Ways to infer associations using other resources and metadata promise some solutions for this problem. Imaginative techniques (for example, using datelines in news articles with proper nouns as anchors, or combining bilingual dictionary data with corpora across languages) will have to be invented.

  3. The need for efficiency and accuracy. Different applications require different levels of functionality. In some cases, speed is important and must be prioritized. In others, high precision is a top demand. In others, a wide-ranging glance at the data is all that is needed, so high recall is a more important goal. For each of these priorities, different techniques can be applied. For example, very high precision applications are likely to require more in-depth language analysis, but this type of processing tends to be slow and knowledge intensive. It is important to understand the tradeoffs between shallow statistically motivated techniques and deeper linguistically motivated ones, as discussed in Chapter 6, to achieve processes that are both fast and accurate.

  4. The effective presentation of complex information. How should multilingual results of a search be presented back to the user? What kinds of new summarization and visualization techniques will most help people be able to evaluate, digest, and then use the information that is delivered to them? Because multilingual information retrieval adds complexity to the presentation problem, we have yet to fully understand new presentation challenges.
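
The precision and recall priorities named in item 3 can be made concrete with the standard set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The following sketch computes both; the function name and the document identifiers are illustrative assumptions.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    Either value is reported as 0.0 when its denominator is empty.
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Retaining more translation alternatives for a query tends to retrieve more documents, which can raise recall at the cost of precision; measuring both makes that tradeoff visible for a given application.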

2.5 Juxtaposition of This Area with Other Areas

Two major classes of technical issues must be addressed when dealing with multilingual data:

First, technical issues involving data exchange, with a set of attendant sub-issues. This includes questions such as character encoding, font displays, browser/display issues, etc. Such issues have implications for metadata for the Internet, international sharing of bibliographic records, and transliteration and transcription systems.

Second, natural language questions, also with a set of attendant research issues. This includes natural language processing technologies (e.g., syntactic or semantic analysis), machine translation, information retrieval (or information discovery) in multiple languages, speech processing, and summarization. Also included are questions of multilingual language resources, such as dictionaries and thesauri, corpora, and test collections.

The new application of MLIR draws on achievements and techniques in several related areas. However, the challenges unique to MLIR must be handled independently. Some of the relevant technologies include:

  • Information Access: document indexing (multilingual); retrieving,=20 filtering, clustering; presentation and summarization of information;=20 multilingual metadata; cross-language information retrieval. See = Chapter=20 3 and Chapter=20 9.=20

  • Machine Translation: comparable and parallel text alignment; = language=20 generation. See Chapter=20 4.=20

  • Computational Linguistics: morphological analysis, syntactic = parsing,=20 techniques for disambiguation, document segmentation, corpus analysis, = creation of derivative lexicons, term recognition and term expansion. = See=20 Chapter 6.=20

  • Resources: dictionaries, thesauri, index terms, test collections, speech databases. See Chapter 1.

Several potentially valuable connections have not yet been made. The Database and Computational Linguistics research and development communities, for example, include members with a great deal of relevant expertise. The National Science Foundation PI meeting on Information and Data Management (1998) concluded that closer links between the IR and Database communities would be beneficial to each. Similarly, the human-computer interaction / multimedia community offers important insights into ensuring user-driven design of systems.

To facilitate cross-fertilization, a series of small workshops to define new projects, together with a series of very small seed projects, would help in specifying prototype systems and elucidating complex problem areas. Projects should be interdisciplinary and very limited in scope, with well-defined goals that leave room for exploratory research. The results of such cross-fertilization would depend on the backgrounds of the potential participants. A group assembled from commerce could assist computer scientists in specifying the needs that MLIR systems must address, and focus groups from communities with high information needs, such as journalism and finance, could be used to specify new projects and prototypes and to steer research in beneficial directions.

 

2.6 References

Ballesteros, L. and W.B. Croft. 1998. Statistical Methods for Cross-Language Information Retrieval. In G. Grefenstette (ed), Cross-Language Information Retrieval (23-40). Boston: Kluwer.

Carbonell, J., Y. Yang, R. Frederking, R. Brown, Y. Geng, and D. Lee. 1997. Translingual Information Retrieval: A Comparative Evaluation. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97). Nagoya, Japan. Best paper award.

Fluhr, Ch., D. Schmit, Ph. Ortet, F. Elkateb, K. Gurtner, and Kh. Radwan. 1998. Distributed Cross-Language Information Retrieval. In G. Grefenstette (ed), Cross-Language Information Retrieval (41-50). Boston: Kluwer.

Grefenstette, G. (editor) 1998. Cross-Language Information Retrieval. Boston: Kluwer.

Harman, D. (editor) 1995. Proceedings of the 5th Text Retrieval Conference (TREC).

Hull, D. and G. Grefenstette. 1996. Querying across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval. Proceedings of the 19th Annual ACM Conference on Information Retrieval (SIGIR) (49-57).

Klavans, J. and E. Tzoukermann. 1996. Dictionaries and Corpora: Combining Corpus and Machine-readable Dictionary Data for Building Bilingual Lexicons. Machine Translation 10 (3-4).

Klavans, J. and P. Schäuble. 1998. Report on Multilingual Information Access. Report commissioned jointly by NSF and EU.

Klavans, J. 1999. Work in progress.

Landauer, T.K., P.W. Foltz, and D. Laham. 1998. An Introduction to Latent Semantic Analysis. Discourse Processes 25(2&3) (259-284).

Leacock, C., G. Towell, and E. Voorhees. 1993. Corpus-Based Statistical Sense Resolution. Proceedings of the DARPA Human Language Technology Workshop (260-265). Princeton, NJ.

Oard, D. and B. Dorr. 1996. A Survey of Multilingual Text Retrieval. Technical Report UMIACS-TR-96-19, University of Maryland Institute for Advanced Computer Studies. http://www.clis.umd.edu/dlrg/filter/papers/mlir.ps.

Oard, D. and B. Dorr. 1998. Evaluating Cross-Language Text Filtering Effectiveness. In G. Grefenstette (ed), Cross-Language Information Retrieval (151-162). Boston: Kluwer.

Oard, D., et al. 1997. Proceedings of the AAAI Spring Symposium on Cross-Language Information Retrieval. San Francisco: Morgan Kaufmann AAAI Press.

Pevzner, B.R. 1972. Comparative Evaluation of the Operation of the Russian and English Variants of the "Pusto-Nepusto-2" System. Automatic Documentation and Mathematical Linguistics 6(2) (71-74). English translation from Russian.

Salton, G. 1971. Automatic Processing of Foreign Language Documents. Englewood Cliffs, NJ: Prentice-Hall.

Salton, G. 1988. Automatic Text Processing. Reading, MA: Addison-Wesley.

Sheridan, P., J.P. Ballerini, and P. Schäuble. 1998. Building a Large Multilingual Test Collection from Comparable News Documents. In G. Grefenstette (ed), Cross-Language Information Retrieval (137-150). Boston: Kluwer.