Last Tuesday Svanhvít and Ásmundur completed the first stage of the Named Entity Recognition project for Icelandic. They finished the daunting task of labeling all named entities in a text corpus of 1 million tokens (MIM-GOLD) with the following categories: Person, Location, Organization, Miscellaneous, Money, Percent, Time, and Date.
To assist with the task, they first preprocessed the corpus using regular expressions to catch some cases and then verified and completed the labeling using the brat rapid annotation tool. Their next task will be to create a few baseline NER tagging systems using the labelled dataset.
The dataset will be publicly available this spring.
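The regular-expression pre-annotation step mentioned above could look roughly like the following Python sketch. The patterns, the choice of categories they target, and the function name `pre_annotate` are assumptions made for illustration; the post does not say which expressions were actually used.

```python
import re

# Hypothetical patterns for illustration only; the project's actual
# expressions are not given in the post.
MONTHS = (
    "janúar|febrúar|mars|apríl|maí|júní|júlí|ágúst|"
    "september|október|nóvember|desember"
)
PATTERNS = [
    ("Percent", re.compile(r"\d+(?:[.,]\d+)?\s?%")),
    ("Money", re.compile(r"\d+(?:[.,]\d+)*\s?(?:kr\.|krónur|ISK|USD|EUR)")),
    ("Date", re.compile(r"\d{1,2}\.\s(?:" + MONTHS + r")(?:\s\d{4})?")),
]


def pre_annotate(text):
    """Return candidate spans as (start, end, label, surface), sorted by start offset."""
    spans = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)


if __name__ == "__main__":
    sample = "Verðið hækkaði um 3,5% þann 21. desember 2018 og kostar nú 1.500 kr."
    for start, end, label, surface in pre_annotate(sample):
        print(start, end, label, surface)
```

Spans found this way are only candidates: as the post describes, they would still be verified and completed by hand in the annotation tool.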
Last week we celebrated achieving the first milestone in the Language Technology for Icelandic project with a cake!
After a lot of hard work over the past few months, we achieved the first milestone in Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and Machine Translation (MT).
In ASR, the focus has mostly been on creating and gathering data. So far, 55,000 utterances have been collected (donated by adults) via the crowd-sourcing platform samromur.is (based on Common Voice), with plans to reach 100,000 utterances for the next milestone. The process is being extended to include younger voices in collaboration with schools and authorities. Today we started working with Öldutúnsskóli in Hafnarfjörður. The goal is to reach 80,000 young-voice utterances for the next milestone. Additionally, data has been gathered from RÚV (audio, video and subtitles) and CreditInfo (transcriptions). Along with data gathering, the team is also developing tools to post-process Icelandic ASR text for better readability.
In TTS, we successfully created a voice recording client (LOBE) and three reading scripts in order to collect high-quality speech and corresponding text data. The reading scripts were created from Risamálheild and seek to maximize diphone coverage. So far 20 hours have been collected from two speakers, male and female. The aim is to finish collecting 20 hours from each speaker early this year. From the collected data, two TTS prototypes have been created in Ossian, which extends the Merlin back-end. The current prototypes are quite naive, but we have integrated a grapheme-to-phoneme model for the Icelandic language into them.
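To illustrate what "maximizing diphone coverage" can mean when building a reading script, here is a minimal Python sketch of a greedy selection: at each step it picks the candidate sentence that adds the most diphones not yet covered. This is an assumed approach for illustration, not necessarily the method used for the LOBE scripts; the function names and the toy phoneme sequences are hypothetical.

```python
def diphones(phonemes):
    """Return the set of adjacent phoneme pairs in a phoneme sequence."""
    return set(zip(phonemes, phonemes[1:]))


def greedy_select(candidates, max_sentences):
    """Greedily pick sentences that add the most unseen diphones.

    candidates: dict mapping a sentence id to its list of phoneme symbols.
    Returns the selected ids (in selection order) and the covered diphone set.
    """
    covered = set()
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < max_sentences:
        # Pick the sentence with the largest number of new diphones.
        best_id = max(
            remaining,
            key=lambda sid: len(diphones(remaining[sid]) - covered),
        )
        gain = diphones(remaining[best_id]) - covered
        if not gain:
            break  # No remaining sentence adds anything new.
        selected.append(best_id)
        covered |= gain
        del remaining[best_id]
    return selected, covered


if __name__ == "__main__":
    # Toy phoneme sequences, made up for the example.
    toy = {
        "a": ["h", "a", "l", "o"],
        "b": ["h", "a"],
        "c": ["l", "o", "a"],
    }
    print(greedy_select(toy, max_sentences=3))
```

A greedy pass like this does not guarantee an optimal script, but it is a common and simple way to approximate good coverage with few sentences.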
In MT, we successfully created a phrase-based statistical machine translation system using the open-source tool Moses. Our collaborators at Miðeind created neural machine translation systems based on BiLSTMs and Transformers. The models were trained on the newly available English-Icelandic parallel corpus, ParIce. The systems were then evaluated with respect to training time, throughput and BLEU score. The code and systems are freely available but are still under development for milestone two. In milestone two we will continue to develop the systems and adapt them to the specific needs of the Icelandic language.
Rannis (The Strategic Research and Development Programme for Language Technology) has awarded Hrafn two grants this year. Congratulations! The first project, Automatic Text Summarization (ATS) for Icelandic, will be worked on by a post-doctoral researcher and an Icelandic linguist in collaboration with mbl.is, Morgunblaðið’s news website. The second one is Named Entity Recognition (NER) for Icelandic. Svanhvít Lilja Ingólfsdóttir and Ásmundur Guðjónsson, two students from the Language Technology (Máltækni) master's program, will work on the NER project in collaboration with the Icelandic Stock Exchange. Welcome to LVL!
Anna Björk has also been awarded a grant, for her company, Grammatek ehf., in cooperation with the city of Akranes. Congratulations and we wish you all the best with your new endeavor!
The paper describes an Icelandic pronunciation dictionary for use in a text-to-speech system for Icelandic. Procedures were implemented to create a consistent training set for grapheme-to-phoneme (g2p) conversion modeling, needed for automatic extensions of the dictionary. The experiments show a clear benefit of using clean data for training, both in terms of phoneme error rate (PER) and in terms of the categories of errors made by the g2p algorithm. The results of the dictionary processing were also used to create an initial version of an open-source database for Icelandic speech applications. The scripts used in the experiments are available via our GitHub repository: https://github.com/cadia-lvl/SLT2018.
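For readers unfamiliar with PER, the following Python sketch shows one common way to compute it: the total edit distance between predicted and reference phoneme sequences, divided by the total number of reference phonemes. It is a generic illustration, not taken from the SLT2018 scripts, and the exact definition used in the paper may differ; the function names and example transcriptions are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(pairs):
    """Compute PER over (reference, hypothesis) phoneme-sequence pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total if total else 0.0


if __name__ == "__main__":
    # Made-up transcriptions, for illustration only.
    pairs = [
        (["h", "a", "l", "o"], ["h", "a", "l", "o"]),
        (["t", "a", "l", "a"], ["t", "a", "l"]),
    ]
    print(phoneme_error_rate(pairs))
```

Because the errors are pooled before dividing, long words weigh more than short ones; averaging per-word rates instead would be a different, also defensible, choice.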
Jón and Anna Björk’s poster presentation will be at SLT on Friday, Dec. 21 between 10:00 and 12:00 PM. Here’s a sneak peek.
We hope to see you there. If you see Anna or Jón, please stop by and say hello.
The cooperation between LVL and other leading Icelandic organizations is increasing. Tomorrow Reykjavik University and Societas Scientiarum Islandica (Vísindafélag Íslendinga) are holding a seminar and panel discussion on the current progress and the future of implementing language technologies for Icelandic.
It will be held in room M105 at Reykjavik University. Hrafn Loftsson, of LVL, will moderate the seminar, which starts at 13:30. It will consist of talks from a professor at the University of Iceland, the chairman of Almannaromur, Jón Guðnason of LVL, and the director of Miðeind ehf. A panel discussion will follow.
We welcome everyone to attend the lively Saturday afternoon discussion!
This Friday is Researchers’ Night (Vísindavaka Rannís 2018). It is an all-ages event on the 28th of September, 2018 from 16:30 – 22:00 at Laugardalshöllin, Reykjavik.
We will be there with Reykjavik University demonstrating what is possible with speech technology: evaluating collected speech data (Eyra), testing the accuracy of an automatic speech recognizer (ASR) – https://tal.ru.is, listening to a text-to-speech synthesizer, and telling your phone to read the news to you. Come try out the state of the art in Icelandic speech technology, and tell us what you think!
For students of Reykjavik University and summer exchange students, we now have a list of student projects available. They are listed at https://lvl.ru.is/student-projects/ and can also be reached from the menu of the LVL website under Student Projects. They range from straightforward to difficult and are suitable for undergraduate final projects, master's students, and PhD students. If you want to work on one, please contact the people listed in the contact column, and they can give you more details to get you started. We look forward to hearing from you!