Last week ACL held its 2020 conference. It was supposed to take place in Seattle, Washington, but due to COVID-19 it was moved online, along with satellite events such as the ACL 2020 Student Research Workshop (SRW).
This means the proceedings have now been published, including a paper by our very own student, Steinþór Steingrímsson: "Effectively Aligning and Filtering Parallel Corpora under Sparse Data Conditions". In it, he writes about his methods for preparing parallel text for machine translation.
Last week ACL held its annual conference. The conference was originally meant to be held in Seattle, but because of COVID-19 all of it was moved online, including workshops such as the ACL 2020 Student Research Workshop (SRW).
This means that all submitted papers have been published, including a paper by our student Steinþór Steingrímsson. The title of the paper is "Effectively Aligning and Filtering Parallel Corpora under Sparse Data Conditions". In it, Steinþór writes about methods for preparing parallel text for machine translation.
Parallel corpora are key to developing good machine translation systems. However, abundant parallel data are hard to come by, especially for languages with a low number of speakers. When rich morphology exacerbates the data sparsity problem, it is imperative to have accurate alignment and filtering methods that can help make the most of what is available by maximising the number of correctly translated segments in a corpus and minimising noise by removing incorrect translations and segments containing extraneous data. This paper sets out a research plan for improving alignment and filtering methods for parallel texts in low-resource settings. We propose an effective unsupervised alignment method to tackle the alignment problem. Moreover, we propose a strategy to supplement state-of-the-art models with automatically extracted information using basic NLP tools to effectively handle rich morphology.
This is a positive but somewhat sad week for LVL. Many LVL members were going to travel to Marseille, France this week to attend the Language Resources & Evaluation Conference (LREC) 2020 and the joint Spoken Language Technologies for Under-Resourced Languages and Collaboration and Computing for Under-Resourced Languages (SLTU-CCURL) 2020 Workshop. There they were going to present their many papers, providing an in-depth look into our TTS, data collection, compound splitting, and general language technology research of recent months. However, due to COVID-19 both conferences were cancelled. Luckily, the organizers have still decided to publish the proceedings this month. The joint SLTU-CCURL proceedings were published on May 8th on the SLTU-CCURL 2020 website under Workshop Proceedings (our paper is on page 316). Head over to the SLTU 2020 website if you want to read more SLTU-CCURL papers. We’re still waiting for the LREC proceedings to be published, but our papers can now be found as PDFs below and on our publications page.
Our TTS paper was accepted at SLTU-CCURL 2020:
Title: Manual Speech Synthesis Data Acquisition – From Script Design to Recording Speech
Authors: Atli Þor Sigurgeirsson, Gunnar Thor Örnólfsson, Jon Gudnason
Summary: In this paper we present the work of collecting a large amount of high quality speech synthesis data for Icelandic. A script design strategy is proposed and three scripts have been generated to maximize diphone coverage, varying in length. The largest reading script contains 14,400 prompts and includes 81% of all Icelandic diphones at least twenty times. As of writing, 58.7 hours of high quality speech data has been collected.
PDF
Our Samrómur, Kvistur, and Language Technology programme papers were accepted at LREC 2020:
Title: Language Technology Programme for Icelandic 2019-2023
Authors: Anna Nikulásdóttir, Jón Guðnason, Anton Karl Ingason, Hrafn Loftsson, Eiríkur Rögnvaldsson, Einar Freyr Sigurðsson and Steinþór Steingrímsson
Summary: In this paper, we describe a national language technology programme for Icelandic. The programme aims at making Icelandic usable in communication and interactions in the digital world, by developing accessible, open-source language resources and software. The research and development work within the programme is carried out by SÍM, a consortium of universities, institutions, and private companies, with a strong emphasis on cooperation between academia and industries. Five core projects will be the main content of the programme: language resources, speech recognition, speech synthesis, machine translation, and spell and grammar checking.
PDF
Title: Kvistur: a BiLSTM Compound Splitter for Icelandic
Authors: Jón Daðason, David Mollberg and Hrafn Loftsson
Summary: In this paper, we present a character-based BiLSTM model for splitting Icelandic compound words, and show how quantity of training data affects model performance. Compounding is highly productive in Icelandic, and new compounds are constantly being created. This results in a large number of out-of-vocabulary (OOV) words, negatively impacting the performance of many NLP tools. Our model is trained on a dataset of 2.9 million unique word forms and their constituent structures from the Database of Icelandic Morphology. The model learns to split compound words into two and can be used to derive a word form’s constituent structure. Knowing the constituent structure of a word form makes it possible to generate the optimal split for a given task. The model outperforms other previously published methods when evaluated on a corpus of manually split word forms. This method has been integrated into Kvistur, an Icelandic compound word analyzer.
PDF
Title: Samrómur: Crowd-sourcing Data Collection for Icelandic Speech Recognition
Authors: David Erik Mollberg, Ólafur Helgi Jónsson, Sunneva Þorsteinsdóttir, Steinþór Steingrímsson, Eydís Huld Magnúsdóttir and Jon Gudnason
Summary: This contribution describes an ongoing speech data collection, using Samrómur, which is built upon Mozilla’s Common Voice. The goal is to build a large-scale speech corpus for Automatic Speech Recognition (ASR) for Icelandic. Upon completion, Samrómur will be the largest open speech corpus for Icelandic. The paper discusses the methods used for crowd-sourcing and illustrates the importance of marketing and good media coverage for a crowd-sourced dataset. Preliminary results exceed our expectations. The paper also reports on the process of validating recordings.
PDF
Our SÍM colleagues also had two papers at LREC 2020: “Facilitating Corpus Usage: Making Icelandic Corpora More Accessible for Researchers and Language Users” and “Parallel Universal Dependencies”. Congratulations!
While it is sad that our LVL members cannot meet with fellow researchers and visit the great city of Marseille, they still look forward to connecting with researchers online through your comments on their papers and links to your related papers.
This week the Language Resources & Evaluation Conference (LREC) was to be held in France, along with the Spoken Language Technologies for Under-resourced Languages workshop, but both events were cancelled due to COVID-19. Many LVL staff members had planned to attend these events, present four papers there, and give insight into the language technology research that has been carried out here in recent months. You can read more about this and access the papers here. (Note that the papers are only available in English.)
Last Tuesday Svanhvít and Ásmundur completed the first stage of the Named Entity Recognition (NER) project for Icelandic. They finished the daunting task of labeling all named entities in a text corpus of 1 million tokens (MIM-GOLD) with the following categories: Person, Location, Organization, Miscellaneous, Money, Percent, Time, and Date.
To assist with the task, they first preprocessed the corpus using regular expressions to catch some cases and then verified and completed the labeling using the brat rapid annotation tool. Their next task will be to create a few baseline NER tagging systems using the labelled dataset.
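The regular-expression pre-processing step described above can be illustrated with a minimal sketch. The post does not say which expressions were actually used, so the patterns, the `pre_annotate` function name, and the choice of categories below are all assumptions made for illustration; the output of such a step would still need manual verification in an annotation tool.

```python
import re

# Hypothetical patterns for three of the categories; the expressions
# actually used on MIM-GOLD are not given in the post.
MONTHS = (
    "janúar|febrúar|mars|apríl|maí|júní|júlí|ágúst|"
    "september|október|nóvember|desember"
)
PATTERNS = [
    ("Percent", re.compile(r"\d+(?:[.,]\d+)?\s?%")),
    ("Money", re.compile(r"\d+(?:[.,]\d+)*\s?(?:krónur|kr|ISK)\b")),
    ("Date", re.compile(r"\b\d{1,2}\.\s(?:" + MONTHS + r")(?:\s\d{4})?")),
]


def pre_annotate(text):
    """Return non-overlapping (start, end, label, surface) spans in text order."""
    spans = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            # Keep a match only if it does not overlap a span accepted earlier.
            if all(match.end() <= s or match.start() >= e for s, e, _, _ in spans):
                spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)


if __name__ == "__main__":
    sentence = "Verðið hækkaði um 3,5% þann 17. júní 2019 og kostar nú 1.500 kr."
    for start, end, label, surface in pre_annotate(sentence):
        print(f"{label}\t{start}-{end}\t{surface}")
```

Rule-based matching of this kind only catches regular surface forms such as amounts, percentages, and dates; categories like Person or Organization depend on context and are left to the human annotators.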
The dataset will be publicly available this spring.
Last week we celebrated achieving the first milestone in the Language Technology for Icelandic project with a cake!
After a lot of hard work over the past few months, we achieved the first milestone in Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and Machine Translation (MT).
In ASR, the focus has mostly been on creating and gathering data. 55,000 utterances have been collected (donated by adults) via the crowd-sourcing platform samromur.is (based on Common Voice), with plans to reach 100,000 utterances for the next milestone. The process is being extended to include younger voices in collaboration with schools and authorities. Today we started working with Öldutúnsskóli in Hafnarfjörður. The goal is to reach 80,000 young voice utterances for the next milestone. Additionally, data has been gathered from RÚV (audio, video and subtitles) and CreditInfo (transcriptions). Along with data gathering, the team is also developing tools to post-process Icelandic ASR text for better readability.
In TTS, we successfully created a voice recording client (LOBE) and three reading scripts in order to collect high quality speech and corresponding text data. The reading scripts were created from Risamálheild and seek to maximize diphone coverage. So far 20 hours have been collected from two speakers, male and female. The aim is to finish collecting 20 hours from each speaker early this year. From the collected data two TTS prototypes have been created in Ossian, which extends the Merlin back-end. The current prototypes are quite naive but we have integrated a grapheme-to-phoneme model for the Icelandic language into the prototypes.
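To give a feel for what "maximizing diphone coverage" means when building a reading script, here is a minimal greedy-selection sketch. It is not the procedure used for the LVL scripts, which the post does not describe; the function names, the toy phoneme symbols, and the simplification of counting each diphone only once are assumptions made for illustration.

```python
def diphones(phonemes):
    """Return the set of adjacent phoneme pairs in a phoneme sequence."""
    return set(zip(phonemes, phonemes[1:]))


def greedy_script(candidates, max_prompts):
    """Greedily pick prompts that add the most not-yet-covered diphones.

    candidates maps each sentence to its phoneme sequence (assumed to come
    from a grapheme-to-phoneme step that is not shown here).
    """
    covered = set()
    remaining = dict(candidates)
    chosen = []
    while remaining and len(chosen) < max_prompts:
        # Pick the sentence with the largest gain in new diphones.
        best = max(remaining, key=lambda s: len(diphones(remaining[s]) - covered))
        gain = diphones(remaining[best]) - covered
        if not gain:
            break  # No remaining sentence adds a new diphone.
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


if __name__ == "__main__":
    # Toy data with made-up phoneme symbols.
    toy = {
        "sentence A": ["h", "a", "l", "o"],
        "sentence B": ["h", "a", "l"],
        "sentence C": ["l", "o", "a"],
    }
    prompts, covered = greedy_script(toy, max_prompts=10)
    print(prompts, len(covered))
```

A real script design would typically also require each diphone to occur several times and would cap sentence length; both are left out here to keep the sketch short.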
In MT, we successfully created a phrase-based statistical machine translation system using the open source tool Moses. Our collaborators at Miðeind created neural machine translation systems based on BiLSTMs and Transformers. The models were trained on the newly available English-Icelandic parallel corpus, ParIce. The systems were then evaluated w.r.t. training time, throughput and BLEU score. The code and systems are freely available but are still under development for milestone two. In milestone two we will continue to develop the systems further and adjust them to specific needs of the Icelandic language.
September 2019 marks the end of LVL's automatic speech recognition (ASR) project with Althingi, the Icelandic parliament. To close, the radio station Rás 1 is airing an interview on September 8 at 9:30 pm (21:30). The interview is conducted by the head of the Althingi speech department, Berglind Steinsdóttir. In the interview, Berglind talks to both Inga Rún, our ASR expert, and Steinunn, an Althingi editor. They discuss both sides of the project: software development and user experience. This broadcast will hopefully give our Icelandic readers and listeners a deeper understanding of the specifics involved in ASR. We invite you to tune in this Sunday at 21:30.
Practical information about the radio program is below:
Date: Sunday, September 8th @ 21:30 (re-airing Saturday, September 14th 20:45)
Location: Rás 1 website or the radio station
Duration: 30 minutes
Title: “Háttvirtur þingmaður tekur til máls”
Topic: Automation. Artificial intelligence. The fourth industrial revolution. What does this have to do with the speeches of members of parliament? A speech recognizer that transcribes the speeches of members of parliament has been put into use, and the program features conversations with Inga Rún Helgadóttir, a physicist who has taken part in developing it, and Steinunn Haraldsdóttir, an Icelandic-language scholar who has used the speech recognizer.
Language of Interview: Icelandic
Interviewees: Inga Rún Helgadóttir, ASR developer
Steinunn Haraldsdóttir, Icelandic specialist who uses the ASR
Interviewer: Berglind Steinsdóttir
Supervisor: Ásdís Emilsdóttir Petersen.
During the broadcast, go to https://ruv.is/ras1, click on Í BEINNI, select Rás 1, and press the play button.
This has been a great summer for LVL. We have had many conference acceptances: eight papers at three conferences. It will also be a busy autumn, as all the conferences are in September.
Our first is Recent Advances in Natural Language Processing, a very competitive NLP conference. This year it is in Varna, Bulgaria. Steinþór, Örvar and Hrafn’s paper, “Augmenting a BiLSTM tagger with a Morphological Lexicon and a Lexical Category Identification Step” and Hrafn’s paper “A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System” will both be presented.
Next is Interspeech in Graz, Austria. We have four papers:
Yu-Ren Chien – “F0 Variability Measures Based on Glottal Closure Instants”
Inga Helgadóttir – “The Althingi ASR System”
Anna Rúnarsdóttir – “Lattice re-scoring during manual editing for automatic error correction of ASR transcripts”
Anna Nikulásdóttir – “Bootstrapping a Text Normalization System for an Inflected Language. Numbers as a Test Case”
Rannis (The Strategic Research and Development Programme for Language Technology) has awarded Hrafn two grants this year. Congratulations! The first project, Automatic Text Summarization (ATS) for Icelandic, will be worked on by a post-doctoral researcher and an Icelandic linguist in collaboration with mbl.is, Morgunblaðið’s news website. The second is Named Entity Recognition (NER) for Icelandic. Svanhvít Lilja Ingólfsdóttir and Ásmundur Guðjónsson, two students from the Language Technology (Máltækni) master's program, will work on the NER project in collaboration with the Icelandic Stock Exchange. Welcome to LVL!
Anna Björk has also been awarded a grant, for her company, Grammatek ehf., in cooperation with the city of Akranes. Congratulations and we wish you all the best with your new endeavor!
The paper describes an Icelandic pronunciation dictionary for use in a text-to-speech system for Icelandic. Procedures were implemented to create a consistent training set for grapheme-to-phoneme (g2p) conversion modeling, needed for automatic extensions of the dictionary. The experiments show a clear benefit of using clean data for training, both in terms of PER and in terms of categories of errors made by the g2p algorithm. The results of the dictionary processing were also used to create an initial version of an open source database for Icelandic speech applications. The scripts used in the experiments are available via our Github repository: https://github.com/cadia-lvl/SLT2018.
Jón and Anna Björk’s poster presentation will be at SLT on Friday, Dec. 21 between 10:00 and 12:00 PM. Here’s a sneak peek.
We hope to see you there. If you see Anna or Jón please stop by and say hello.
The cooperation between LVL and other leading Icelandic organizations is increasing. Tomorrow Reykjavik University and Societas Scientiarum Islandica (Vísindafélag Íslendinga) are holding a seminar and panel discussion on the current progress and the future of implementing language technologies for Icelandic.
It will be held in room M105 at Reykjavik University. Hrafn Loftsson, of LVL, will moderate the seminar, which starts at 13:30. It will consist of talks from a professor at the University of Iceland, the chairman of Almannaromur, Jón Guðnason of LVL, and the director of Miðeind ehf. The panel discussion follows the talks.
We welcome everyone to attend the lively Saturday afternoon discussion!