Current Projects

Language Technology for Icelandic 2018-2022

This project is part of a collaborative effort funded by the Icelandic government to make Icelandic available for use in today’s technological environment. Automatic Speech Recognition, Text-to-Speech and Machine Translation are three of the five core projects defined in the report Language Technology for Icelandic 2018-2022 (Máltækni fyrir íslensku 2018-2022). The collaboration, called Samstarf um íslenska máltækni (SÍM), comprises nine companies and organizations specialized in linguistics and natural language processing: Reykjavik University, the University of Iceland, the Árni Magnússon Institute for Icelandic Studies, Blindrafélagið (BIAVI), Ríkisútvarpið (RÚV, the national broadcaster), Creditinfo (media monitoring), Tiro ehf., Grammatek ehf. and Miðeind ehf.
Work on the project formally started on 1 October 2019, with a total duration of five years approved by the Icelandic parliament.

Funding: Language Technology for Icelandic 2018-2022 (Máltækni fyrir íslensku 2018-2022)

Projects:
“Machine Translation (MT) of Icelandic and English”
“Automatic speech recognition (ASR) for Icelandic”
“Text-to-Speech (TTS) for Icelandic speech synthesis”

Timeline: October 2019 – October 2024.

Paper: Language Technology Programme


Automatic Text Summarization for Icelandic

ATS – extracting sentences for a summary

Automatic Text Summarization (ATS) is the task of creating a concise and fluent summary from a given source text, one that preserves the main content and overall meaning of the original. With the increasing number of articles being published online every day, there is a growing need for robust ATS systems. The aim of this project is to develop the first ATS systems for Icelandic based on machine learning methods, as well as to create the first corpus of human-generated summaries of Icelandic news articles. Three different systems will be developed, and the best performing system selected for deployment and testing at an Icelandic news site, mbl.is.
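
As a rough illustration of the extractive approach, the sketch below scores sentences with TF-IDF and keeps the highest-scoring ones. This is only a minimal baseline under assumed tooling (scikit-learn, naive sentence splitting), not one of the three systems being developed in the project.

    # Minimal extractive summarization sketch (illustrative baseline only).
    import re
    from sklearn.feature_extraction.text import TfidfVectorizer

    def summarize(text: str, num_sentences: int = 3) -> str:
        # Naive sentence splitting; a real system would use a proper tokenizer.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        if len(sentences) <= num_sentences:
            return text
        # Score each sentence by the sum of its TF-IDF term weights.
        tfidf = TfidfVectorizer().fit_transform(sentences)
        scores = tfidf.sum(axis=1).A1
        # Keep the top-scoring sentences, presented in their original order.
        top = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences])
        return " ".join(sentences[i] for i in top)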

Funding: Strategic Research and Development Programme for Language Technology (Markáætlun í tungu og tækni)

Contributors: Jón Friðrik Daðason, Hrafn Loftsson, Salome Lilja Sigurðardóttir, Þorsteinn Björnsson


Text-to-Speech (TTS) for Icelandic speech synthesis

Speech data

The Language and Voice Lab is responsible for developing text-to-speech synthesis for Icelandic in such a way that multiple different voices can be produced. LVL will create an environment and language resources that will be released to enable players in the market to quickly and simply build synthetic voices for end users. LVL will ensure that the speech synthesis solutions developed can be integrated into software where, for example, automatic read-aloud or voice response functionality is needed.
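
As a rough sketch of the kind of integration intended, the snippet below sends text to a hypothetical TTS web service and saves the returned audio. The endpoint, parameters and response format are illustrative assumptions, not the project’s actual API.

    # Hypothetical example of calling a TTS web service from application code.
    # The URL, parameters and response format are assumptions, for illustration only.
    import requests

    def synthesize(text: str, voice: str = "example-voice", out_path: str = "out.wav") -> str:
        response = requests.post(
            "https://tts.example.is/v1/speech",   # hypothetical endpoint
            json={"text": text, "voice": voice, "format": "wav"},
            timeout=30,
        )
        response.raise_for_status()
        with open(out_path, "wb") as f:
            f.write(response.content)             # assumed to be raw WAV audio
        return out_path

    if __name__ == "__main__":
        synthesize("Góðan daginn! Þetta er tilbúin rödd.")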

Funding: Language Technology for Icelandic 2018-2022 (Máltækni fyrir íslensku 2018-2022)

Contributors: Atli Thor Sigurgeirsson, Þorsteinn Daði Gunnarsson

Timeline: First iteration: October 2019 – October 2020.

Research paper: Manual Speech Synthesis Data Acquisition

Code: LOBE


Automatic speech recognition (ASR) for Icelandic

The Language and Voice Lab is responsible for developing automatic speech recognition for Icelandic within the Language Technology for Icelandic project. The aim of developing ASR within the project is to enable people who design and develop voice-based user interfaces to add Icelandic easily. An open environment will be established for the development of speech recognition systems, and recipes for common usage will be made open and accessible.
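
As an illustration of what a reusable recipe step can look like, the sketch below builds Kaldi-style data files (wav.scp, text, utt2spk) from a hypothetical metadata CSV. The CSV layout and file paths are assumptions, not the project’s actual data format.

    # Sketch of a Kaldi-style data preparation step.
    # Assumes a hypothetical metadata CSV with columns: utt_id, speaker_id, wav_path, transcript.
    import csv
    from pathlib import Path

    def prepare_kaldi_dir(metadata_csv: str, out_dir: str) -> None:
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        with open(metadata_csv, newline="", encoding="utf-8") as meta, \
             open(out / "wav.scp", "w", encoding="utf-8") as wav_scp, \
             open(out / "text", "w", encoding="utf-8") as text, \
             open(out / "utt2spk", "w", encoding="utf-8") as utt2spk:
            for row in csv.DictReader(meta):
                wav_scp.write(f"{row['utt_id']} {row['wav_path']}\n")
                text.write(f"{row['utt_id']} {row['transcript']}\n")
                utt2spk.write(f"{row['utt_id']} {row['speaker_id']}\n")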

Funding: Language Technology for Icelandic 2018-2022 (Máltækni fyrir íslensku 2018-2022)

Contributors: Helga Svala Sigurðardóttir, Jón Guðnason, Judy Fong, Inga Rún Helgadóttir, Eydís Huld Magnúsdóttir, David Erik Mollberg, Ólafur Helgi Jónsson, Sunneva Þorsteinsdóttir

Timeline: First iteration: October 2019 – October 2020.

News: Gáfu 1.500 raddsýni fyrir hádegi (Icelandic)
Voice Donation Platform: Samromur.is
Paper: Samromur paper
Code: Broad Data Prep repository, Punctuation models, Speaker Diarization recipes



Named Entity Recognition – dataset and baselines

Named entities in Icelandic

Named entity recognition (NER) is the task of finding and classifying the named entities (names of people, places, organizations, events, etc.) that appear in text. This is a common preprocessing step before conducting various downstream tasks, such as question answering and machine translation. The aim of this research project is to create the first labelled corpus for Icelandic NER and to use machine learning methods for training a named entity recognizer for Icelandic. This involves labelling all named entities in a text corpus of 1 million tokens (MIM-GOLD) with the following categories: Person, Location, Organization, Miscellaneous, Money, Percent, Time, and Date. Using this new data, different machine learning methods (both traditional and deep learning approaches) will be tested, and the best performing models selected and combined into a new named entity recognizer for Icelandic. The project is carried out in collaboration with Nasdaq Iceland.
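
To make the annotation task concrete, the fragment below shows a made-up Icelandic sentence with BIO-style tags over the categories listed above. The sentence and the exact tag format are illustrative assumptions, not actual MIM-GOLD data.

    # Illustrative BIO-style NER annotation (made-up example, not MIM-GOLD data).
    # Each token is paired with a tag: B-/I- plus one of the eight categories, or O.
    example = [
        ("Jón", "B-Person"),
        ("Jónsson", "I-Person"),
        ("vann", "O"),
        ("hjá", "O"),
        ("Nasdaq", "B-Organization"),
        ("Iceland", "I-Organization"),
        ("í", "O"),
        ("Reykjavík", "B-Location"),
        ("árið", "O"),
        ("2019", "B-Date"),
        (".", "O"),
    ]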

Funding: Strategic Research and Development Programme for Language Technology (Markáætlun í tungu og tækni), Nasdaq Iceland

Contributors: Ásmundur Alma Guðjónsson, Svanhvít Lilja Ingólfsdóttir, Hrafn Loftsson

Timeline: May 2019 – May 2020.


Machine Translation (MT) of Icelandic and English

The goal of machine translation is to translate text or speech between two or more natural languages. In this project the goal is to implement a baseline statistical machine translation system between Icelandic and English, in both translation directions. The project is part of the core machine translation project (V3) within the Icelandic National Language Technology Programme, defined in the Language Technology for Icelandic 2018-2022 project plan. We leverage the newly released ParIce corpus, a parallel corpus of 3.5M Icelandic-English translation segments.
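
As a rough sketch of how such a baseline could be queried from application code, the snippet below calls a Moses translation server (mosesserver) over XML-RPC. It assumes a trained Icelandic-English model is already being served at the given host and port.

    # Sketch of querying a Moses SMT model served by mosesserver over XML-RPC.
    # Assumes a trained Icelandic-English model is already running locally.
    import xmlrpc.client

    def translate(sentence: str, url: str = "http://localhost:8080/RPC2") -> str:
        proxy = xmlrpc.client.ServerProxy(url)
        # mosesserver exposes a "translate" method taking a struct with the input text.
        result = proxy.translate({"text": sentence})
        return result["text"]

    if __name__ == "__main__":
        print(translate("þetta er prófun"))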

Funding: Language Technology for Icelandic 2018-2022 (Máltækni fyrir íslensku 2018-2022)

Contributors: Steinþór Steingrímsson, Haukur Páll Jónsson, Hrafn Loftsson.

Timeline: October 2019 – Summer 2020.

Code: Moses SMT, ParIce


Natural Language Understanding model for Airline Reservation System

This is a two-semester, 60 ECTS Master’s project. The goal is to create a dialog system that can understand users’ requests, such as finding a flight or booking it.

One of the core components of a Spoken Dialog System (SDS) is the natural language understanding (NLU) model.
The main functionality of the NLU is to create structured data from the users’ requests, which might, for example, be asking for flight information, airline information, etc.
This information is extracted using an intent classification and slot-filling model trained on a dataset composed of user requests.
The ATIS dataset is widely used as a standard benchmark for this task in English.
This project is split up into the following tasks:

  • Create a Text Annotation Tool.
  • Create an Icelandic translated version of the ATIS dataset (ICE-ATIS), with the use of the Text Annotation Tool.
  • Run ICE-ATIS through various existing models to see how they compare with NLU models trained on the original ATIS dataset.
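
To make the intent classification and slot-filling output concrete, the fragment below shows the kind of structured data an NLU model produces for an ATIS-style request. The utterance and the exact intent and slot names are illustrative assumptions.

    # Illustrative NLU output for an ATIS-style request (labels are assumptions).
    utterance = "sýndu mér flug frá Keflavík til Boston á mánudaginn"

    nlu_output = {
        "intent": "flight_search",
        "slots": {
            "from_city": "Keflavík",
            "to_city": "Boston",
            "depart_date": "mánudaginn",
        },
    }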

Funding: Icelandair

Contributors: Egill Anton Hlöðversson, Jón Guðnason

Timeline: Autumn 2019 – Spring  2020

Text Annotation Tool: https://github.com/egillanton/flask-text-annotation-tool
Live Server: http://egillanton.pythonanywhere.com/ (temporary)
ICE-ATIS: https://github.com/egillanton/ice-atis


Broddi: Voice-controlled Information Delivery

The goal of the project is to design a system that enables voice-driven delivery of web content, such as content from news sites, blogs or radio programs. The idea is to have an environment specifically designed for audio interaction and not tied to the visual layout of a web page. The user chooses content with voice commands, and the content is then presented as audio, e.g. recorded or synthesized speech, a radio episode or a podcast. A pure audio interface is useful in situations where hands and eyes are busy, such as when driving, cooking or running, as well as for people with disabilities.

Funding: Tækniþróunarsjóður (the Technology Development Fund) and menntamálaráðuneytið (the Ministry of Education)

iOS App: Broddi

Contributors: Kristján Rúnarsson, Róbert Kjaran, Stefán Jónsson

Timeline: Autumn 2017 – Winter 2020

News Article: Raddstýrður fréttalesari (Icelandic)


Eyra – speech data acquisition

Eyra is a free and open source project designed to provide tools to gather speech data for languages.

Speech data acquisition is particularly important for under-resourced languages. The data gathering is the most labour-intensive part of developing speech technologies such as automatic speech recognizers and synthesizers.

A screenshot from the Eyra software recording screen where the prompts are read

Eyra aims to make this task cheaper (and better) by providing a free platform that handles the data acquisition process. Eyra analyzes incoming recordings and can give feedback on their quality, helping to collect better data.
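
As a rough illustration of the kind of automatic quality feedback involved, the sketch below checks a recorded WAV file for clipping and overly short duration. The thresholds are assumptions and this is not Eyra’s actual analysis code.

    # Sketch of simple recording quality checks (not Eyra's actual analysis).
    import array
    import wave

    def check_recording(path: str, min_seconds: float = 1.0) -> list:
        issues = []
        with wave.open(path, "rb") as wav:
            sample_width = wav.getsampwidth()
            n_frames = wav.getnframes()
            duration = n_frames / wav.getframerate()
            frames = wav.readframes(n_frames)
        if duration < min_seconds:
            issues.append("recording too short")
        if sample_width == 2:  # 16-bit PCM, the common case
            samples = array.array("h", frames)
            if samples and max(abs(s) for s in samples) >= 32767:
                issues.append("possible clipping")
        return issues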

Designed with flexibility in mind, Eyra is a web app compatible with most browsers, and it can also be run offline by using a laptop as the server. It is open source, so you can contribute, use only parts of it, or modify it to suit your needs (e.g. if you want to use pictures instead of prompts).

Currently, Eyra is being used to collect children’s speech data in collaboration with the University of Akureyri.

Funding: Google

Code: https://github.com/Eyra-is/Eyra
Article: SLTU 2016, Building ASR Corpora Using Eyra

Contributors: Matthías Pétursson, Róbert Kjaran, Simon Klüpfel, Judy Fong, Stefán Jónsson

Timeline: Autumn 2016 – Ongoing