Murre was first introduced in the paper Dialect Text Normalization to Normative Standard Finnish. It is a model trained on the Samples of Spoken Finnish corpus, which is available in the Language Bank of Finland under a CC-BY license. The corpus contains the original dialectal transcriptions together with manually produced normalizations, and we used these sentence pairs to train a normalization model. As a result, Murre can normalize dialectal Finnish texts with high accuracy, a skill it has learned entirely from the examples in the dialect corpus. Murre reaches a word error rate of 5.73%, and the normalized text is significantly easier to analyse than the original dialectal transcription!
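To put that number in context, word error rate is the word-level edit distance (substitutions, insertions and deletions) between the model output and the manually normalized reference, divided by the number of words in the reference. Below is a minimal sketch of that computation, using made-up sentences rather than actual corpus data:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference,
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table: d[i][j] = edits to turn hyp[:j] into ref[:i]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Toy example (not from the corpus): one substituted word out of four.
print(word_error_rate("minä menen nyt kotiin", "minä meen nyt kotiin"))  # 0.25
```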
At times Murre makes funny mistakes; nobody is perfect! Murre is still very useful, though, as it is not possible to have humans normalize every dialectal text we want to analyze. In the error analysis section of our paper we discuss examples such as the model mixing up ratapölkyntervauskone and ratapölkyntervauskoinen, which illustrates how difficult a task the normalization of dialectal word forms is! Even in this case one can see where the model's alternative interpretation comes from, and this is something we want to emphasize in our work: by carefully inspecting the data we use, it is often possible to understand and evaluate even complex machine learning systems.
Murre is available as a Python package, and we continue our experiments to improve the results and to adapt the model to new use cases; a rough usage sketch follows the citation below. If you use Murre, or want to investigate this topic further, please cite the paper as:
Niko Partanen, Mika Hämäläinen, and Khalid Alnajjar. 2019. Dialect Text Normalization to Normative Standard Finnish. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019).
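The sketch below illustrates how normalization could be called from Python. The imported function name normalize_sentences and its list-in, list-out signature are assumptions made for illustration only; please check the package's own documentation for the actual API.

```python
# Hypothetical usage sketch: the imported function name and its signature
# are illustrative assumptions, not the documented API of the murre package.
from murre import normalize_sentences

dialectal = ["mie lähen kottiin", "myö mennään sinne huomenna"]
normalized = normalize_sentences(dialectal)

for original, norm in zip(dialectal, normalized):
    print(original, "->", norm)
```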