The ACCURAT project aims to research methods and techniques for overcoming one of the central barriers in Machine Translation (MT): the lack of large-scale linguistic resources (i.e., parallel corpora) for under-resourced languages and/or narrow domains. The project will research and evaluate novel methods that exploit comparable corpora to compensate for this shortage and, ultimately, to significantly improve MT quality.
ACCURAT will provide researchers and developers with a methodology and a fully functional model for exploiting comparable corpora in MT, including:
- methods for automatic acquisition of a comparable corpus from the Web and other sources;
- comparability metrics, i.e., criteria to measure the comparability of source and target language documents in comparable corpora;
- methods for alignment and extraction of lexical, terminological and other linguistic data from comparable corpora;
- measurement of the improvements gained from applying the acquired data, against baseline results from statistical (SMT) and rule-based (RBMT) machine translation systems.
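
To make the idea of a comparability metric more concrete, the sketch below scores a source/target document pair by mapping source tokens through a bilingual lexicon and taking the cosine similarity of the resulting bags of words. This is only an illustration under stated assumptions, not the metric defined by ACCURAT: the function names `cosine` and `comparability`, the tiny Latvian-English lexicon, and the choice of cosine similarity are all hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is comparable to nothing
    return dot / (norm_a * norm_b)


def comparability(src_tokens, tgt_tokens, lexicon):
    """Hypothetical comparability score in [0, 1].

    Source tokens are mapped into the target language with a
    bilingual lexicon (assumed to be a plain dict); tokens missing
    from the lexicon are ignored.
    """
    translated = Counter(lexicon[t] for t in src_tokens if t in lexicon)
    return cosine(translated, Counter(tgt_tokens))


# Toy lexicon, for illustration only.
lexicon = {"suns": "dog", "kaķis": "cat", "māja": "house"}

related = comparability(["suns", "kaķis"], ["dog", "cat"], lexicon)
unrelated = comparability(["suns", "kaķis"], ["house"], lexicon)
```

Under these assumptions, `related` is close to 1 and `unrelated` is 0. A realistic metric would also need to handle morphology, lexicon coverage, and document length, which this sketch leaves out.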