Tenjin The Tenjinno Machine Translation Competition
Home
Task
Background
Download data
Status
Oracle
Important Dates
Archive
 
 

Competition Philosophy

In this section of the website you will find background information on why the competition was constructed the way that it was including

If you have difficulting viewing this page in your browser (for instance if you dont see any greek symbols) you can download a pdf version of this page here.

Requirements

The target transducers, training and testing examples were created with the following objectives in mind:

  1. The learning task should be sufficiently difficult. Specifically the task should be outside of the current state of the art, but not so difficult that it is unlikely that a winner will be found.
  2. It should be provable either that the training sentences are sufficient to identify the target transducer, or at least an estimation of the probability of the training sentences being sufficient should be known.

To determine whether or not requirement1 was met, a measure of the complexity of the learning task was derived, similar to that used for the Omphalos Competition. The measure was derived by creating a model of the learning task based upon a brute force search. This model was also used to determine if the training data would be sufficient to enable competitors to identify the target transducer. The conclusions drawn were that it is possible to infer the formal models used in the Tenjinno competition from a sufficiently large enough set of translation pairs. An upper bound on the number of samples required was derived using PAC learning theory, but it was decided to construct significantly smaller test sets than is suggested using the PAC learning equations, based upon the knowledge that the upper bounds indicated from the PAC learning equations are close to lower bounds on the number of training examples required when all hypotheses in the hypothesis space are mutually exclusive.

The Ranking of the Competition Problems

The Tenjinno competition has adopted a linear ordering for the competition problems based upon the formalism used, the complexity of the problems, and whether or not the formal model from which the data is derived is deterministic.

It is well understood that the class of transducers that can be described using FSTs is a proper subset of the class of transducers that can be described using SDTSs. Similarly it is well known that the class of transducers that can be described using deterministic FSTs is a subset of the class of transducers that can be described using non-deterministic FSTs. Similarly the class of transducers that can be described using deterministic SDTSs is a subset of the class of transducers that can be described using non-deterministic SDTSs. Solving the problem generated from a non-deterministic model is considered to be a more difficult problem. Therefore if both problems are solved, the competitor who has solved the non-deterministic problem will be declared the winner.

In addition a complexity measure for each learning task has been constructed to compare the competition problems with problems with well known benchmark corpora. In addition during the life of the competition the organizers reserve the right to add more difficult or easier problems as indicated by the complexity measure.

The following table describes the equations used to measure the complexity of the competition problems. Different measures are used for the different classes of problems. For instance the complexity measure for the FST tasks describes an upper bound on the number of possible FSTs that could be constructed given a finite set of input symbols, output symbols and non-terminals of size s, d and n respectively where there are at most L-1 output symbols on the right hand side of the longest rule. Similarly the complexity measure for the SDTS tasks describes an upper bound on the number of possible SDTSs that could be constructed given a finite set of input symbols, output symbols and non-terminals of size s, d and n respectively where there are at most L-1 output and non-terminal symbols on the right hand side of the longest rule.

Formalism

Complexity Measure

SDTS

FST

Where

n = the number of non-terminals

s = The size of the input lexicon.

d = The size of the output lexicon.

L = the length of the longest rule (in the case of FST and SDTS) or the number of ground terms in the longest term rewriting rule for CFTTs.

C i = the i th Catalan number.

g = L × s + s +1+L × n.

Estimating the Current State of the Art

To ensure that the competition problems where just beyond the state of the art, the proceedings of recent conferences were examined to identify the performance of different inference techniques on some corpora commonly used for the inference of machine translation systems. The complexities of the problems in the Tenjinno competition were then set to be an order of magnitude larger than these problems. In addition the complexities of the problems were set to be large enough that they cannot be solved using a brute force search in the time in which the competition will be open (i.e. there are approximately 1.57 ´ 10 13 microseconds in a six month period).

Table 1 contains a list of some corpora commonly used for the inference of machine translation systems. This data is derived from (Amengual et al. 2000) , (Casacuberta 2000) , (Matusov, Kanthak & Ney 2005) and (Matusov, Kanthak & Ney 2005) . Some of the earlier experiments listed here are described in (Vidal & Casacuberta 2004) as being indicative of the state of the art. Conveniently the documents from which this table has been compiled describe experiments in which n-gram finite state transducers were inferred from the given corpora. In addition sufficient detail of corpora has been given in the papers to allow for easy comparison between the complexity of the learning tasks and the word error rates obtained.

Although there is yet to be a published technique that can achieve a 0% WER rate on any of these corpora, it should be noted that these corpora are natural language corpora, which are unlikely to be defined using finite state transducers. Therefore we would expect that the word error rates obtained using the techniques described in the papers on data that is defined using finite state transducers would be better than that listed in Table 1 .

Therefore it seems appropriate to begin with competition tasks that are of a slightly greater complexity to those listed in this table. When calculating the complexity of the underlying models we have used the somewhat arbitrary assumption that it is possible to learn the structure of the underlying models from the training examples without n-gram smoothing, that is whether or not the probability of each possible transition is either zero or non-zero. Therefore we have estimated   the number of non-terminals to be either one quarter of the number of words in the training examples, or the number of possible states in the given n-gram { ( d +1) n-1 } whichever is smallest.   We have also included a column that describes an estimate of the complexity of an SDTS that might be used to describe the target language. The SDTS measure is estimated using the given vocabularies and assuming that the target SDTS could be described using a similar number of non-terminals to the number of Penn Treebank tags. Specifically we have selected the number of non-terminals to be 36 and have arbitrarily selected the length of the longest rule to be 7.

Most of these corpora in Table 1 have large vocabularies, and some other commonly used corpora such as the Canadian Hansard are used due to the close alignment between word orderings in the source and target languages. However it is well understood that word reordering is of crucial importance to machine translation and the most difficult. For this reason, although the competition problems in the Tenjinno competition have a similar complexity to those listed in Table 1 , the competition problems have smaller vocabularies and derive more of their complexity from non-monotonic alignment.

Using the results of Table 1 the complexities of the Tenjinno competition problems were selected to be just outside of the complexities listed in Table 1.

Table 1 Commonly Used Corpora for the Inference of Machine Translation Systems (Vidal & Casacuberta 2004) & (Matusov, Kanthak & Ney 2005)

Ref#

Corpora

Source Lang.

Target Lang.

Input Lexicon Size

Output Lexicon Size

WER

Log 2 of   Estimated Complexity (as FSM)

Log 2 of Estimated Complexity (as SDTS)

1.

EuTrans-0

Spanish

English

683

514

0.74%

1.89 ´ 10 19 (7 class n-gram)

2.30 ´ 10 37

2.

EuTrans-I

Spanish

English

686

513

9.7%

2.14 ´ 10 22 (4-gram)

3.45 ´ 10 39

2.

EuTrans-II

Italian

English

2,459

1,712

28.3%

3.16 ´ 10 24 (4-gram)

6.87 ´ 10 46

3.

Verbmobil

German

English

7,939

4,672

36.2%

3.89 ´ 10 28 (4-gram)

6.91 ´ 10 48

3.

Basic Travel Corpus

Chinese

English

7,643

6,982

48%

3.99 ´ 10 28 (4-gram)

8.56 ´ 10 50

 

Table 2 References used to compile Table 1 .

Ref #

Reference

1.

(Amengual et al. 2000)

2.

(Casacuberta 2000)

3.

(Matusov, Kanthak & Ney 2005)

 

The name Tenjinno

The ICGI competition committee decided that it would be ideal to have a Japanese inspired name for the ICGI 2006 competition, due to it being held in Japan. Ideally the name should be related to language and learning. Using these requirements, we decided to use the Japanese name "Tenjin" as the basis for the name of the competition. Tenjin is the Shinto kami of scholarship, the deified Sugawara no Michizane. You can find more information on the significance of Tenjin on Wikipedia .

Tenjin is a word that appears frequently on the Internet (645,000 hits on Google) and therefore we have added the Japanese particle "no" to the end of the word, so that it becomes the "Tenjin no" competition, or loosely Tenjin's competition. Although "no" should be a separate word, by conjugating it with Tenjin (i.e. Tenjinno), we have created something more unique. For instance "Tenjin no" returns 97,000 hits, Tenjin-no 1,060 hits, but Tenjinno returns only 12 hits.

References

Amengual, JC, Castaño, A, Castellanos, A, Jiménez, VM, Llorens, D, Marzal, A, Prat, F, Vilar, JM, Benedi, JM, Casacuberta, F, Pastor, M & Vidal, E 2000, 'The EuTrans-I Spoken Language Translation System' , Machine Translation , vol. 15, no. 1, pp. 75-103.

Casacuberta, F 2000, 'Inference of Finite-State Transducers by Using Regular Grammars and Morphisms.' paper presented to International Colloquium, ICGI 2000, Lisbon, Portugal.

Matusov, E, Kanthak, S & Ney, H 2005, 'Efficient Statistical Machine Translation with Constrained Reordering' , paper presented to Proceedings of EAMT 2005 (10th Annual Conference of the European Association for Machine Translation),, Budapest, Hungary, May 2005.

Vidal, E & Casacuberta, F 2004, Learning Finite-State Models for Machine Translation , 3264 edn, Lecture Notes in Computer Science.

For any comments or questions about these pages please contact the Tenjinno organisers .
Brad Starkie
Menno van Zaanen
Dominique Estival

Copyright 2005 ICGI. Last updated: Fri Dec 2 14:57:41 EST 2005