The Language and Voice Lab and CADIA present a lunchtime talk by Steinþór Steingrímsson: Extracting more value from parallelizable corpora by re-evaluating data that is usually discarded
Tuesday April 19th at 12:20
Zoom – https://bit.ly/mt-lunch
Parallel corpora are necessary for training machine translation (MT) systems and are useful for a variety of other tasks. When corresponding texts in two or more languages are aligned using automatic methods, a considerable part of the data is usually discarded, either because of misalignments or because automatic metrics judge the aligned segments not to be semantically close enough.
The talk describes ongoing experiments on more accurate sentence alignment and filtering, as well as approaches for re-evaluating the discarded data and extracting from it segment pairs that can still be useful for MT and bilingual lexicon induction.