Resources for Chinese-English
Cross-Language IR
Douglas W. Oard
College of Library and Information Services
University of Maryland
College Park, MD 20742
March 23, 1999
Introduction
Specific evaluation criteria are identified in the sections that follow. A stoplight chart (with "l" representing fully suitable, "w" representing suitability limited in some way, and "m" representing possibly unsuitable) has been used to summarize the assessment of each criterion for each resource in each language. Missing data is indicated by leaving the cell blank. The stoplight charts in the last three parts of this document can provide a basis for resource selection and for gap analysis to focus additional search for specific resources that could satisfy unmet requirements.
Encoding, Character Set, and Language Identification
Definition. Resources for recognizing commonly encountered character sets and their encodings and for identifying the dominant language in which a document is written. Automatic segmentation of a single document into regions written in different languages is not generally performed by these tools. There are two commonly used character sets for simplified Chinese characters: GB and Unicode. GB can serve as its own encoding, or it can be encoded as HZ when it must be interleaved with 7-bit ASCII. In either case, it is possible to represent Chinese and English simultaneously since the GB character set does contain Roman characters. There are three commonly used character sets for traditional Chinese characters: Big-5 (which, with some extensions, is also known as "CP-950"), CNS 11643, and Unicode. Big-5 and CNS 11643 are normally encoded by simply interleaving them with 7-bit ASCII, for which there are provisions in the code space. Thus, both can represent Chinese and English simultaneously.
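As a rough illustration of the GB, HZ, and Big-5 encodings described above, the following Python sketch encodes one mixed English-Chinese string three ways. It assumes that Python's standard "gb2312", "hz", and "big5" codecs are acceptable stand-ins for those encodings; the sample string and the function names are invented for this example.

```python
# Sketch: one mixed English-Chinese string under three encodings.
# Assumption: Python's built-in "gb2312", "hz" and "big5" codecs stand in
# for the GB, HZ and Big-5 encodings discussed in the text.

# Both Chinese characters have the same form in simplified and traditional text.
SAMPLE = "IR 中文"


def encode_all(text):
    """Return {codec name: encoded bytes} for the three Chinese encodings."""
    return {name: text.encode(name) for name in ("gb2312", "hz", "big5")}


def is_seven_bit(data):
    """True if every byte has its high bit clear, as in 7-bit ASCII."""
    return all(byte < 0x80 for byte in data)


if __name__ == "__main__":
    for name, data in encode_all(SAMPLE).items():
        print(f"{name:7} {data!r} seven_bit={is_seven_bit(data)}")
```

In this sketch the English prefix survives unchanged in GB and Big-5, while HZ keeps the entire byte stream within the 7-bit range by bracketing the Chinese portion with escape sequences.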
Comments. Encodings that can accommodate multiple character sets (e.g., ISO-2022 and the Extended Unix Code (EUC)) can provide unambiguous indications of the encoded character set, and such encodings are typically designed to be distinguishable from each other. Many character sets routinely serve as their own encoding (e.g., ASCII, GB, and Big-5), however, and such codes are rarely designed to be easily distinguishable from each other. This can be an important issue for systems that must process unrestricted text from a wide variety of sources, since a priori knowledge (perhaps inferred from the source or provided as HTML markup) or some automatic character set detection technique is needed. Kikui (1996) applied algorithms for converting several language-specific encodings to a common representation, applied an automatic language identification technique to each candidate, and selected the candidate that produced the best match between the hypothesized encoding and the identified language. Overall detection accuracy and a confusion matrix depicting the predominance of misclassifications between each possible encoding pair were presented. Reeder & Geisler (1998) are working on character set identification techniques to handle a greater number of languages, but evaluation results have not yet been published. Language identification accuracy depends on the similarity of the language pairs that might potentially be confused and on document length. The most common evaluation methodology is to report confusion matrices for a representative set of passage lengths. Except for certain language pairs, language identification accuracy typically asymptotically approaches 100% for longer documents, so evaluations generally focus on relatively short passages. Developers of commercial software rarely report the effectiveness on short passages, however, so that does not serve as a useful discriminator among commercially available systems.
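The difficulty of telling self-encoding character sets apart can be seen in a small Python sketch. It uses "decodes without error" as a crude viability test, which is an assumption made for illustration only; it is not the statistical method of Kikui (1996), nor the method of any tool reviewed here. The candidate list and function name are hypothetical.

```python
# Sketch: why self-encoding character sets are hard to tell apart.
# Assumption: "decodes without error" is a crude viability test. It is NOT
# the method of Kikui (1996) or of any tool reviewed in this document.
CANDIDATES = ("ascii", "hz", "gb2312", "big5")


def viable_encodings(data, candidates=CANDIDATES):
    """Return the candidate codecs under which `data` decodes cleanly."""
    viable = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot explain the byte stream
        viable.append(name)
    return viable


if __name__ == "__main__":
    print(viable_encodings(b"plain English text"))
    print(viable_encodings("中文".encode("gb2312")))
```

Byte strings that pass more than one test are exactly the ambiguous cases for which a priori knowledge or a statistical detector is needed.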
Evaluation Criteria.
Availability:
  l  Available now
  m  Projected to become available in 1999
Cost:
  l  Free, or available with a multiuser license for $100 or less
  w  Available for sale on negotiated terms or at a fixed price that exceeds $100 for multiple users
Format:
  l  Available in an easily readable digital format
  w  Available in a suitable format, but requires extensive preprocessing
  m  Available only in hardcopy, data entry and validation estimated at over 40 hours
Coverage:
  l  All present languages in ISO-8859-1 and all common character sets for Chinese
  w  At least one common character set for Chinese
Rejection Effectiveness:
  l  Provisions are included to explicitly reject unknown character sets.
  w  A fairly large number of character sets can be recognized, and hence rejected.
  m  No effective provisions are included for rejection of undesired character sets.
| Name | Availability | Cost | Format | Coverage | Rejection Effectiveness |
| codeguess | l | l | l | w | w |
| Intelliscope | l | w | l | l |  |
| Mitre | m | l | l | l | l |
| Que | l | w | l | w | l |
codeguess
A character set guesser written by Erik Peterson and designed specifically for Chinese.
Available by HTTP from http://www.erols.com/eepeter/codeguess.html
The source code contains a statement that the software is free for noncommercial use but that a fee is required for commercial use of the software.
Distributed as Perl 5 source code.
The software is designed to recognize only the character set, rather than the character set and language. The software can recognize GB, HZ, Big-5, and ASCII. CNS 11643 and ISO-8859-1 are not supported, although the author has stated an interest in supporting CNS 11643 in the future.
A confidence threshold is used to reject unknown languages, but the omission of ISO-8859-1 may compromise its effectiveness.
IntelliScope
The Lernout & Hauspie "IntelliScope Language Recognizer" includes character set recognition, character set conversion and language recognition, but the documentation specifies only the languages handled, not the character sets.
Presently available from Lernout & Hauspie N.V., a Belgian company, through their Language Technologies division. Intelliscope was originally developed by Inso Corp. Additional information is available at http://www.lhsl.com/tech/icm/retrieval/toolkit/lr.asp.
IntelliScope can be licensed for a fee, but the price is not stated in their advertising literature.
Apparently distributed as precompiled binaries for Windows 95/NT and Solaris 2.3. Functions are accessed through an API.
The character set coverage is not explicitly stated, but the advertising materials explicitly state that both traditional and simplified characters are recognized, and all present languages are recognized.
Intelliscope can distinguish 36 languages.
Mitre
Flo Reeder of Mitre is developing a character set and language identification tool.
The software is being developed under government contract, and release authority from the government contracting officer is required. Chinese is included within the set of languages to be supported, but a version of the software with Chinese recognition capabilities is not yet ready for distribution.
There is no charge for use of the software on government projects, but any required support may incur costs. Terms for commercial use would need to be discussed with the government contracting officer.
Once authority to use the software is granted, the software and the associated training data can be obtained through FTP.
Very comprehensive character set coverage for each language is planned.
The software will eventually distinguish multiple encodings of more than 30 languages.
Que
The Alis Qué system for the identification of language and character encoding.
Presently available as part of the Flores toolkit from Alis Technologies, Inc., a Canadian company. Described at http://www.alis.com/castil/silc/index.en.html.
Que is offered on a license fee basis that varies based on the estimated value added to the end-user product.
Que is distributed as Windows 95/NT and Solaris 2.x binaries with a C/C++ API designed for use with gcc on Solaris and Visual C++ on Windows 95/NT.
Que can recognize GB, HZ, Big-5, and all present languages in the ISO-8859-1 character set, but not CNS 11643.
Que can recognize 28 languages and 98 language-encoding pairs.
Character Set Conversion
Definition. Resources for mapping other character sets into Unicode.
Comments. The Unicode standard, now in version 2.1, has essentially been stable for the present languages and the languages of interest since it was merged with ISO 10646 in 1993 to produce version 1.1 of the standard. Unicode does, however, define alternative representations (known as "composed" and "decomposed") for some characters. Although standards-compliant Unicode applications are required to handle all representations correctly, normalization to a uniform representation is typically needed in information retrieval applications. A description of Unicode normalization issues can be found in McCallum and Ertel (1993), and the present status of the standard with respect to normalization is described in Davis (1998).
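The composed and decomposed representations mentioned above can be illustrated with a short Python sketch. It relies on the normalization forms offered by Python's standard `unicodedata` module, which may differ in detail from the draft described in Davis (1998); the choice of NFC as the indexing form and the function name are assumptions made for this example.

```python
import unicodedata

# Two representations of the same accented letter.
COMPOSED = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def normalize_for_indexing(text, form="NFC"):
    """Map text to one uniform representation before indexing or matching.

    Assumption: NFC (composed) is used as the uniform form; NFD would
    serve equally well as long as queries and documents agree.
    """
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    print(COMPOSED == DECOMPOSED)
    print(normalize_for_indexing(COMPOSED) == normalize_for_indexing(DECOMPOSED))
```

Without the normalization step the two strings compare as unequal, so a query typed in one form would fail to match a document stored in the other.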
Evaluation Criteria.
Availability:
  l  Available now
  m  Projected to become available in 1999
Cost:
  l  Free, or available with a multiuser license for $100 or less
  w  Available for sale on negotiated terms or at a fixed price that exceeds $100 for multiple users
Format:
  l  Available in an easily readable digital format
  w  Available in a suitable format, but requires extensive preprocessing
  m  Available only in hardcopy, data entry and validation estimated at over 40 hours
Char Set Coverage:
  l  ISO-8859-1 and all common character sets for Chinese
  w  At least one common character set for Chinese
Unicode Normalization:
  l  Includes normalization functions
  w  Generates Unicode, but does not provide explicit control over normalization
  m  Generates a single code other than Unicode that could be converted to Unicode
Traditional and simplified characters occupy different portions of the Unicode code space. Because there is a many-to-one mapping from traditional to simplified characters, an irreversible normalization from traditional to simplified characters is required when a query expressed using one type of characters must be matched with documents written using the other type of characters. The CNS 11643 character set evolved from Big-5 and large portions of the two code sets are identical, so applying a Big-5 converter to CNS 11643 is a reasonable engineering solution when a CNS 11643 converter is not available, provided that occasional errors can be tolerated.
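The many-to-one folding from traditional to simplified characters can be sketched in Python as follows. The four-entry table is purely illustrative and the function name is hypothetical; a usable system would need a complete mapping obtained from an authoritative source.

```python
# Sketch: irreversible folding of traditional characters to simplified ones.
# Assumption: this tiny table is for illustration only; a real converter
# would need a complete mapping.
TRAD_TO_SIMP = {
    "發": "发",  # "to emit, to issue"
    "髮": "发",  # "hair"; folds to the same simplified character
    "檢": "检",
    "後": "后",
}


def fold_to_simplified(text, table=TRAD_TO_SIMP):
    """Replace each traditional character that has a table entry."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(fold_to_simplified("檢索") == fold_to_simplified("检索"))
```

Because two distinct traditional characters fold to one simplified character, the original text cannot be recovered from the folded form, which is why the folding is applied to both queries and documents rather than inverted.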
| Name | Availability | Cost | Format | Char Set Coverage | Unicode Normalization |
| Flores | l | w | l | w | w |
| MUTT | l | l | w | w |  |
| Rosette | l | w | l | w | l |
Flores
The Alis Flores/Bantam "universal character set conversion engine."
Both the Flores toolkit and the Bantam library are offered on a license fee basis that varies based on the estimated value added to the end-user product.
Available in the Flores toolkit as a C or C++ API, either as source code or precompiled for Windows or Solaris. The same capabilities are available as a Windows 95/NT DLL with a C++ API in the Alis Bantam Library.
Converts bidirectionally between GB, HZ or Big-5 and Unicode. CNS 11643 is not handled as a separate character set from Big-5. Source code for a CNS 11643 to Big 5 converter is available in b5cns.tar.gz from
No explicit control over Unicode normalization is apparent in the documentation.
MUTT
The Multilingual Unicode Toolkit (MUTT) is a set of Unicode tools for display and conversion of text that was developed at New Mexico State University.
MUTT is available at
MUTT is distributed as precompiled binaries for Solaris 2.1. Installation of Tcl/Tk is required.
Converts bidirectionally between GB or Big-5 and Unicode. CNS 11643 is not handled as a separate character set from Big-5, and the HZ encoding is not handled. Source code for a CNS 11643 to Big 5 converter is available in b5cns.tar.gz and for a HZ to GB converter is available in HZ-2.0.tar.gz. Both are available from
MUTT provides no control over Unicode normalization.
Rosette
The Rosette C++ library for Unicode was developed by Basis Technology Corp.
Presently available as a commercial product from Basis Technology Corp. Uniconv, a full-featured precompiled standalone demonstration of Rosette for noncommercial use, is available at
Both Rosette and uniconv can be licensed for a fee, but the price is not stated in their advertising materials.
Rosette is available as C++ source code. Uniconv is available in binary form, with versions for Windows 95/NT and for Solaris 2.5.
Converts bidirectionally between GB, HZ, or Big-5 and Unicode. CNS 11643 is not handled as a separate character set from Big-5. Source code for a CNS 11643 to Big 5 converter is available in b5cns.tar.gz from http://www.ifcss.org/ftp-pub/software/unix/convert/.
Unicode normalization functions are included.
Segmentation and Compound Splitting
Definition. Resources for segmenting texts in languages such as Chinese that lack orthographic boundaries between words.
Comments. Segmentation and compound splitting are instances of the more general term selection problem. As an abstract task, term selection is inherently an ill-formed problem because no single level of granularity is universally appropriate. Information retrieval systems designed for English text often cope with this situation by retaining multiple levels of granularity (e.g., both multiword terms and constituent phrases). Compound splitting and segmentation have, however, been predominantly studied at the component level, and thus the commonly used evaluation methodology is to compare the postulated segment boundaries with a single "gold standard" segmentation that is produced by a native speaker of the language. The reported statistics are normally derived from the number of missed segment boundaries and the number of incorrectly postulated segment boundaries.
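The boundary-based comparison against a gold standard can be sketched in Python as follows. The sketch assumes that each segmentation is given as a list of words covering the same text; the function names and the example sentence are invented for illustration, and the sketch reports raw counts only rather than any particular derived statistic.

```python
# Sketch: comparing postulated segment boundaries with a gold standard.
# Assumption: both segmentations are lists of words over the same text.

def boundaries(words):
    """Character offsets at which one word ends and the next begins."""
    offsets, position = set(), 0
    for word in words[:-1]:  # no boundary is counted after the final word
        position += len(word)
        offsets.add(position)
    return offsets


def boundary_errors(gold_words, system_words):
    """Return (missed, spurious) boundary counts for one passage."""
    if "".join(gold_words) != "".join(system_words):
        raise ValueError("segmentations must cover the same text")
    gold, system = boundaries(gold_words), boundaries(system_words)
    missed = len(gold - system)      # gold boundaries the system did not find
    spurious = len(system - gold)    # boundaries the system wrongly postulated
    return missed, spurious


if __name__ == "__main__":
    gold = ["中国", "人民", "银行"]
    system = ["中国人", "民", "银行"]
    print(boundary_errors(gold, system))
```

Rates such as boundary recall and precision could be derived from these two counts together with the number of gold-standard boundaries.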
Evaluation Criteria.
Availability:
  l  Available now
  m  Projected to become available in 1999
Cost:
  l  Free, or available with a multiuser license for $100 or less
  w  Available for sale on negotiated terms or at a fixed price that exceeds $100 for multiple users
Format:
  l  Available in an easily readable digital format
  w  Available in a suitable format, but requires extensive preprocessing
  m  Available only in hardcopy, data entry and validation estimated at over 40 hours
Accuracy:
  l  Appears to be built using the best known techniques
  w  Fails to exploit some known techniques that could improve performance
Unicode Compatibility:
  l  Works directly on Unicode representations
  w  Works on character representations that could be generated from Unicode
| Name | Availability | Cost | Format | Accuracy | Unicode Compatibility |
| Flores | l | w | l | l |  |
| ch_seg | l | l | l | w |  |
| segmenter | l | l | l | w | w |
Flores
The Alis Flores toolkit includes a "word extraction" capability that performs segmentation.
Presently available as part of the Flores toolkit from Alis Technologies, Inc., a Canadian company. Described at
The Flores toolkit is offered on a license fee basis that varies based on the estimated value added to the end-user product.
Available in the Flores toolkit as a C or C++ API, either as source code or precompiled for Windows or Solaris.
Flores is Unicode-based.
ch_seg
The ch_seg Chinese segmentation software was developed by Lei Chen at New Mexico State University.
Presently available by FTP from ftp://crl.nmsu.edu/pub/misc/.
The software is distributed as C source code.
The algorithm was developed for a master's thesis at one of the best computational linguistics laboratories.
The software is designed to work with GB rather than Unicode.
segmenter
A program developed by Erik Peterson to perform segmentation on Chinese text.
Presently available by HTTP from http://www.erols.com/eepeter/segmenter.html
The software is freely available. It contains no statements either granting or restricting rights for commercial use.
The software is distributed as Perl source code.
No accuracy figures are reported, and the description of the algorithm suggests that some useful sources of information are not yet exploited.
The software is designed to work with GB rather than with Unicode.
Proper Name Resources
Definition. Lists of names for individuals, organizations, and geographic features in a language of interest, preferably with categories assigned to those names. Monolingual training corpora in which proper names are tagged in a way that could support machine learning algorithms for proper name recognition such as those described by Gallippi (1996) are also of interest.
Comments. The MUC-6 "named entity" evaluation methodology is described in DARPA (1995) and in Hirschman (1998), and the evolution of the MUC-6 task is described in Grishman & Sundheim (1996). Chinese, Japanese, English and Spanish evaluations have been conducted using the same methodology in what is known as the "Multilingual Entity Task" (MET). MET-1 was described in the TIPSTER Phase II workshop (DARPA 1996), and MET-2 is described at ftp://ftp.muc.saic.com/pub/MET/participation/call-for-participation. The MUC-7 named entity evaluation methodology is essentially the same as that used for MUC-6 and both MET evaluations. It is described at http://www.muc.saic.com/scorer/Manual/manual.html. The task is essentially one of classification, so a single result set is computed. Recall and precision are then computed using a hand-scored evaluation corpus. A version of van Rijsbergen’s F measure (van Rijsbergen 1979) in which recall and precision are weighted equally is typically reported as a single figure of merit for each participating system.
Evaluation Criteria.
Availability:
  l  Available now
  m  Projected to become available in 1999
Cost:
  l  Free, or available with a multiuser license for $100 or less
  w  Available for sale on negotiated terms or at a fixed price that exceeds $100 for multiple users
Format:
  l  Available in an easily readable digital format
  w  Available in a suitable format, but requires extensive preprocessing
  m  Available only in hardcopy, data entry and validation estimated at over 40 hours
Category Coverage:
  l  Handles person names, organization names and location names
  w  Handles at least one of those categories
Domain Coverage:
  l  Broad coverage of names expected to occur in general news and technical texts
  w  Moderate coverage of names expected to occur in general news
  m  Some potentially useful names
Unicode Compatibility:
  l  Encoded in Unicode
  w  Encoded in a character representation that could be converted to Unicode
| Name | Availability | Cost | Format | Category Coverage | Domain Coverage | Unicode Compatibility |
| cweb | l | l | l | w | m | w |
| MET | l | l | l | l | m | w |
cweb
A web-accessible bilingual term list hosted at the National Chiao-Tung University in Taiwan that contains city, country and personal names in Chinese and English.
Presently available through HTTP from
The files are freely available. They contain no statements either granting or restricting rights for commercial use.
Each file is available as text and as HTML.
There is a file for location names (country.txt, country.html) and a file for person names (name.txt, name.html)
The place names file contains the names of 179 countries and their capitals and the person names file contains 459 popular given names in China, the UK, and the US.
The files are encoded in Big-5.
MET
The training collection for the Multilingual Entity Tasks (MET-1 and MET-2) included Chinese training materials in which proper names were marked.
MET training data was distributed to participants by FTP. The procedures are described at
There was no charge for participation in MET, but the available materials do not detail any provision for providing the material to nonparticipants, nor whether there are any restrictions on the use of the material as a basis for derivative works.
The MET training data was distributed using a password-protected FTP site.
Person, organization and location names were hand-tagged in the MET training corpus.
The Chinese materials in MET-2 were drawn from the Xinhua news agency, the People's Daily newswire, and China Radio transcripts.
A sample of the MET-2 evaluation corpus that appears to be in GB code is available at
Resources for Mapping Terms Between Languages
Definition. Resources such as thesauri, ontologies, lexicons, terminology lists, and cognate matching rules that explicitly specify relationships between terms in Chinese and terms in English.
Comments. For resources that lack conceptual structure (as is the case for all Chinese resources listed below), the most salient factor is size. Melamed (1995, 1997) developed a fully automatic methodology for assessing the match between a translation lexicon and an unannotated evaluation corpus of parallel documents. The evaluation corpus is first automatically aligned at the sentence level using dynamic programming techniques. Translations that appear in the lexicon are scored as valid if an occurrence of a word in one language within a source-language sentence is matched in any position of the corresponding target-language sentence by any possible translation of that word. An alternate methodology based on manually annotated ground truth was developed for the MUC-6 and MUC-7 "template element" tasks. That task evaluated the ability of participating systems to recognize alternate forms for person and organization names in English text based on evidence from an individual document. Recall, precision and the F measure are computed over each name and each alias that could be extracted from every document (allowing duplicates if they are in different documents). In MUC, other template elements (e.g., person title and organization location) were also scored in the evaluation corpus, so published results on the template element task are confounded with tasks that are extraneous to term-term matching. Furthermore, the MUC template element evaluation methodology confounds entity name recognition with entity name matching. Source code for the MUC scoring software is available from ftp://ftp.muc.saic.com/pub/MUC/scorer/, MUC-6 evaluation material is available on the ACL/DCI disk, and the ground truth markup may be available from SAIC. The MUC-6 and MUC-7 template element evaluations have been conducted only in English, so similar evaluation resources may not be available in other languages. A simpler methodology was used by Knight & Graehl (1997) to evaluate back-transliteration. In Knight’s back-transliteration experiment the goal was to select the English word from which a Japanese katakana transliteration had been generated. Accuracy figures were reported for 100 personal names selected from a bilingual dictionary. As formulated by Knight & Graehl, back-transliteration is a more challenging task than transliteration matching because a single correct English counterpart must be selected.
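The lexicon scoring rule attributed to Melamed above can be sketched in Python as follows. This is a simplified reading of the prose description, not Melamed's implementation: it assumes that sentence alignment and tokenization have already been performed, that the lexicon is a plain mapping from a source word to a set of translations, and that words absent from the lexicon are simply skipped. All names and the toy data are hypothetical.

```python
# Sketch: scoring a translation lexicon against sentence-aligned text.
# Assumptions: sentences are already aligned and tokenized; the lexicon maps
# each source word to a set of translations; this is a simplified reading of
# the description in the text, not Melamed's actual implementation.

def lexicon_hit_rate(sentence_pairs, lexicon):
    """Fraction of lexicon-word occurrences confirmed by the aligned sentence.

    An occurrence counts as valid if any listed translation of the word
    appears anywhere in the corresponding target-language sentence.
    """
    checked = valid = 0
    for source_tokens, target_tokens in sentence_pairs:
        target = set(target_tokens)
        for word in source_tokens:
            translations = lexicon.get(word)
            if not translations:
                continue  # word not covered by the lexicon
            checked += 1
            if target & set(translations):
                valid += 1
    return valid / checked if checked else 0.0


if __name__ == "__main__":
    pairs = [(["the", "bank"], ["银行"]), (["river", "bank"], ["河", "岸"])]
    toy_lexicon = {"bank": {"银行"}, "river": {"河"}}
    print(lexicon_hit_rate(pairs, toy_lexicon))
```

In the toy data the second occurrence of "bank" is not confirmed because the lexicon lacks the riverbank sense, which is the kind of coverage gap this style of evaluation is meant to expose.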
Evaluation Criteria.
Availability:
  l  Available now
  m  Projected to become available in 1999
Cost:
  l  Free, or available with a multiuser license for $100 or less
  w  Available for sale on negotiated terms or at a fixed price that exceeds $100 for multiple users
Format:
  l  Available in an easily readable digital format
  w  Available in a suitable format, but requires extensive preprocessing
  m  Available only in hardcopy, data entry and validation estimated at over 40 hours
Lexicon Size:
  l  Large lexicon (over 100,000 unique roots or multiword terms in the language of interest)
  w  Moderate-sized lexicon (10,000 to 100,000 unique roots or terms in the language of interest)
  m  Small lexicon (fewer than 10,000 unique word roots or terms in the language of interest)
Domain Coverage:
  l  General news and technical terminology
  w  General news terminology
  m  Potentially useful technical terminology
Morphology:
  l  Complete inflectional morphology
  w  Moderately robust inflectional morphology
  m  Spotty or no coverage of inflectional morphology
Accuracy:
  l  Hand constructed or hand verified
  w  Automatically built from corpora
Translation Preference:
  l  A rich set of domain-specific translation probabilities is provided
  w  Either a translation preference order or a single preferred translation is provided
  m  No translation preference information is provided
Unicode Compatibility:
  l  Encoded in Unicode
  w  Encoded in a character representation that could be converted to Unicode
| Name | Availability | Cost | Format | Lexicon Size | Domain Coverage |
| CEDICT | l | w | l | w | w |
| cweb | l | l | l | m | l |
| ecdict | l |  | w | w | w |
| eng-chi | l | l | l | l | w |
| TwinBridge | l | w | w | w |  |

| Name | Morphology | Accuracy | Translation Preference | Unicode Compatibility |
| CEDICT | m | l | m | w |
| cweb | w | l | w |  |
| ecdict | m | l | l | w |
| eng-chi | m | l | l | w |
| TwinBridge | m | l |  |  |
CEDICT
Paul Denisowski’s CEDICT Chinese-English dictionary.
Freely downloadable from http://www.mindspring.com/~paul_denisowski/cedict.html.
The README file states that noncommercial use is permitted, but that permission is required from the copyright holder for commercial use. No cost is stated.
CEDICT is stored in the same format as Jim Breen’s Japanese-English EDICT.
In September 1998, CEDICT contained 16,830 entries.
It appears from the documentation that CEDICT presently consists mostly of general terminology.
No morphology information is included with CEDICT.
The entries in CEDICT have been manually verified.
It does not appear that translation preference information is encoded in CEDICT.
CEDICT is available in both the GB and Big-5 character sets.
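A lexicon stored in an EDICT-style format could be loaded with a short Python sketch such as the one below. The line layout assumed here (headword, bracketed pronunciation, then slash-delimited English glosses) is modeled on EDICT and should be checked against the actual CEDICT files; the sample entry, the function names, and the file name mentioned afterward are all hypothetical.

```python
import re

# Assumed line layout, modeled on EDICT:
#   HEADWORD [pronunciation] /gloss 1/gloss 2/
ENTRY = re.compile(r"^(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_entry(line):
    """Return (headword, pronunciation, glosses), or None if no match."""
    match = ENTRY.match(line)
    if match is None:
        return None  # header, comment, or malformed line
    headword, pronunciation, glosses = match.groups()
    return headword, pronunciation, [g for g in glosses.split("/") if g]


def load_lexicon(lines):
    """Build {headword: [glosses]} from an iterable of text lines."""
    lexicon = {}
    for line in lines:
        entry = parse_entry(line)
        if entry:
            headword, _, glosses = entry
            lexicon.setdefault(headword, []).extend(glosses)
    return lexicon


if __name__ == "__main__":
    sample = ["中文 [Zhong1 wen2] /Chinese language/Chinese/"]
    print(load_lexicon(sample))
```

Since the files are distributed in GB and Big-5, a real loader would open them with the matching codec, for example `open("cedict.gb", encoding="gb2312")` for a hypothetical GB file of that name.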
cweb
A unidirectional Chinese to English bilingual dictionary.
Freely downloadable using HTTP from http://www.csie.nctu.edu.tw/center/cweb/.
The dictionary is freely available, and there is no statement granting or restricting permission for commercial use.
Each file contains one English word per line, with (possibly) several Chinese translations for each English word.
From the file sizes, it appears that several thousand total words are present in the various lists.
There are a few files for general terminology and 92 short files for technical terminology in a number of fields.
The entries are grouped in a way that may provide useful information about morphology.
The bilingual term lists appear to have been constructed manually.
It is not clear whether alternative translations are presented in preference order.
The Chinese terms are encoded in Big-5.
ecdict
http://www.ok88.com/go/svc/ecdict.html
Version 2.12 of an online English to Chinese dictionary developed by Linda Ng.
The dictionary is provided by OK88 Bilingual Internet Services.
There is no information about commercial availability of the dictionary.
It appears that the dictionary is available only as an online resource. A complete list of English words beginning with any letter can be displayed, so automated retrieval of every English to Chinese translation would be practical.
The reported size of the dictionary is 12,054 entries.
The dictionary contains only general terminology.
Only root forms are present in the dictionary, and no morphology information is provided.
The dictionary appears to be constructed by hand.
Only a single translation is provided for each English word.
The Chinese characters are coded in the Big-5 character set.
eng-chi
A Chinese-English bilingual term list.
Available through HTTP at
The term list is freely available, and there is no statement granting or restricting permission for commercial use.
The dictionary is available as a text file, with one translation equivalent word pair per line.
There are 103,000 word pairs in the term list.
The term list contains general terminology.
The English words are root forms, and no information about morphology is provided.
The term list appears to have been hand constructed or hand validated.
There is a single translation for each English word.
The dictionary is coded in the Big-5 character set. A version that has been automatically converted to GB is available at
TwinBridge
The TwinBridge bidirectional English-Chinese Dictionary.
Presently available from The TwinBridge Software Corp. Additional information is available at
$99 for the end-user version. No price for access to the dictionary through an API is stated.
Available on CDROM as an end-user product with an integrated user interface designed for any language version of Windows 95. The dictionary is clearly designed for human use and it appears that extensive reformatting would be needed to produce a usable cross-language lexicon. It is not clear whether an API is available.
The dictionary contains 70,000 words.
Both general terminology and computer industry terminology dictionaries are included.
It appears that no morphology information is provided.
The dictionary is clearly manually constructed or verified.
It is not clear whether translation preference information is present.
There is no information available about the internal character code that is used.
Machine Translation Resources
Definition. Modular resources for producing translations in both directions between English and Chinese. For languages that presently lack available bidirectional machine translation systems, unidirectional systems are also of interest. Both quick-and-dirty gloss-style translations and best-possible-quality translation systems are of interest.
Comments. Church & Hovy (1993) identified three types of measures for machine translation evaluation: text-based, cost-based, and system-based. Text-based measures are further subdivided into sentence-based measures, comprehensibility measures, and post-editing measures. Sentence-based measures, computed by hand-scoring each translated sentence for attributes such as semantic and stylistic correctness, report the fraction of the sentences assessed as satisfying each quality level. Comprehensibility measures are outcome-based document-level measures that use techniques such as multiple choice tests to determine whether representative end users were able to discern information contained in the original document by examining the translation. Post-editing measures are task-based measures that seek to characterize the difficulty of editing MT output to produce high-quality translations. Because that task is not appropriate to cross-language IR, post-editing measures are not considered further. Sentence-based measures and comprehensibility measures were both used in the DARPA MT evaluations described by White & O’Connell (1994), and a strong correlation between the two types of measures was observed. White & Taylor (1998) report that the relationship between sentence-oriented measures and task performance has not yet been determined. Together, these results suggest that comprehensibility measures may be the better choice at present. Cost-based measures include measurements of the time required for translation and any marginal costs (such as human networked translation resources and human post-editing labor) associated with use of an MT system for the intended task. System-based measures are glass box measures of the type described by Nirenburg, et al. (1996). Oard & Resnik (1999) reported a relatively inexpensive technique for evaluating the use of gloss translations as indicative abstracts.
Evaluation Criteria.
Availability:
  l  Available now
  m  Projected to become available in 1999
Cost:
  l  Free, or available with a multiuser license for $100 or less
  w  Available for sale on negotiated terms or at a fixed price that exceeds $100 for multiple users
Format:
  l  Available in an easily readable digital format
  w  Available in a suitable format, but requires extensive preprocessing
  m  Available only in hardcopy, data entry and validation estimated at over 40 hours
Lexicon Size:
  l  Large lexicon with good general news coverage available for translation in each direction
  w  Moderate-sized lexicon with general news coverage available in at least one direction
Translation Speed:
  l  Nearly real time, at least 100 words per second on a high-end workstation
  w  Suitable for online use, between 10 words per second and 100 words per second
  m  Suitable for offline use, less than 10 words per second
Browser Compatibility:
  l  Works with the standard US version of Netscape or Internet Explorer
| Name | Availability | Cost | Format | Lexicon Size | Translation Speed | Browser Compatibility |
| Auto-Trans | w | l | l | w |  |  |
| Systran | l | w | l | w | w | w |
| Transperfect | l | w | l | w | w |  |
Auto-Trans
A bidirectional English-Chinese machine translation system.
The software is described on a web site maintained by the ComStar Company, a retailer specializing in multilingual software, at
Although a description of the software is available, it appears to have been recently removed from the ComStar price list.
The software runs under Windows 95, and the documentation implies that it is distributed on a set of diskettes.
The lexicon is claimed to contain 2 million general terms and 30 million technical terms.
The translation speed is not specified.
Auto-Trans is configured for offline translation, rather than for use with a web browser.
Systran
A unidirectional Chinese to English machine translation system from Systran Software, Inc.
Presently available from Systran. Additional information is available at
A five-user license for Systran Professional Client/Server sells for $3,250 per bidirectional language pair, with prices increasing to $20,000 for a twenty-user license. A single-user standalone system is available for $1,000 per bidirectional language pair.
Systran Professional is distributed on CDROM for Windows 95/NT.
Systran Professional contains a large lexicon (2.5 million entries in 14 languages) that includes both general and technical terminology.
Systran Professional can translate about 40 words per second when used with a Pentium processor.
Browser support is not presently available for Chinese to English translation.
Transperfect
A unidirectional English to Chinese machine translation system from Otek International, Inc., a Taiwanese company.
Additional information is available in English at
A single-user copy of Transperfect sells for $300 as a "professional edition" and $100 as a "standard edition." It is not clear how these versions differ.
Transperfect runs under Windows 95, and is distributed as a set of diskettes.
The lexicon contains about 100,000 words.
The translation speed is not specified.
Transperfect does not appear to be designed for use with web browsers.
References
Comrie, Bernard. 1987. The World’s Major Languages. (Croom Helm).
DARPA. 1995. Proceedings Sixth Message Understanding Conference. Columbia, MD. November.
DARPA. 1996. Tipster Program Phase II. Vienna, VA. May.
Davis, Mark. 1998. Unicode Normalization Forms. Draft Unicode Technical Report #15, The Unicode Consortium. August. Available at
http://www.unicode.org/unicode/reports/tr15/.
Gallippi, Anthony F. 1996. Learning to Recognize Names Across Languages. In Proceedings of the Sixteenth International Conference on Computational Linguistics. Copenhagen, Denmark. August.
Grishman, R. and B. Sundheim. 1996. Message Understanding Conference-6: A Brief History. In Proceedings of the Sixteenth International Conference on Computational Linguistics. Copenhagen, Denmark. August.
Hirschman, L. 1998. Language Understanding Evaluations: Lessons Learned from MUC and ATIS. In Proceedings of the First International Conference on Language Resources and Evaluation. Granada, Spain. May.
Hovy, E. H. 1998. Creating Useful Evaluation Metrics for Machine Translation. In Proceedings of the First International Conference on Language Resources and Evaluation. Granada, Spain. May.
Kikui, Gen-itiro. 1996. Identifying the Coding Scheme and Language of On-Line Documents on the Internet. In Sixteenth International Conference on Computational Linguistics. Copenhagen. August.
Knight, Kevin and Jonathan Graehl. 1997. Machine Transliteration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid. July. Available at
http://www.isi.edu/natural-language/projects/GAZELLE.html.
McCullum, Sally and Monica Ertel. 1993. Proceedings of the Second International Federation for Library Automation Satellite Meeting: Automated Systems for Access to Multilingual and Multiscript Library Materials. Madrid, Spain. August.
Melamed, I. Dan. 1995. Automatic Evaluation and Uniform Filter Cascades for Inducing N-best Translation Lexicons. In Third Workshop on Very Large Corpora, Boston. Available at
http://www.cis.upenn.edu/~melamed/.
Melamed, I. Dan. 1997. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In 2nd Conference on Empirical Methods in Natural Language Processing, Providence, RI. Available at
http://www.cis.upenn.edu/~melamed/
Oard, Douglas W. and Philip Resnik. 1999. Support for Interactive Document Selection in Cross-Language Information Retrieval. Information Processing and Management. To appear.
Reeder, F. and J. Geisler. 1998. Multi-Byte Issues in Encoding/Language Identification. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas, Langhorne, PA. October.
White, J. S. and T. A. O’Connell. 1994. The DARPA MT Evaluation Methodologies: Evolution, Lessons, and Future Approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas. Columbia, MD.
White, J. S. and K. B. Taylor. 1998. A Task-Oriented Evaluation Metric for Machine Translation. In Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain. May.