Speech-to-text transcription system

Background

The purpose of the project is to build an application that transcribes speech to text for our clients. Demand for this type of software is currently high because of European Union regulations on the accessibility of digital products and services, and the share of broadcasts with subtitles increases every year. We are responsible for the whole project and have built it from scratch.

Our clients

We have multiple clients in this project. Most of them are leading Polish media groups that are obliged to add subtitles to their materials, but there are also companies from outside the media industry, for example a pharmaceutical wholesaler. For each client, we need to adjust the software to meet their requirements and train it with new, specialized vocabulary. We have completed PoCs with our clients and are moving towards production deployments of the results.

System overview

Challenges and our solutions

Testing the accuracy of speech-to-text services

One of the milestones in this project was to test speech-to-text platforms and check which one best suits the specific purposes of a given client.
We tested the accuracy of the following services:
– Google Speech-to-Text,
– Speechmatics,
– Microsoft Azure Speech Studio.

So far, after testing all of the platforms listed above for PoC purposes, we have built the application on Microsoft Azure services because their transcription was the most accurate in our tests. The platform also has numerous other advantages. However, we have not ruled out using the other platforms in the future.
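A common way to put a number on transcription accuracy is the word error rate (WER): the word-level edit distance between a reference transcript and the engine output, divided by the number of words in the reference. The sketch below is only an illustration of that metric. The text above does not state which metric was used in the tests, and the function name and sample sentences are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Illustrative sentences only, not data from the project.
reference = "samolot f16 wystartował"
hypothesis = "samolot e f szesnaście wystartował"
print(word_error_rate(reference, hypothesis))
```

Running the same reference recordings through each service and comparing the resulting WER values gives one comparable number per platform. Lower is better, and the value can exceed 1.0 when the output contains many inserted words.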

Microsoft Speech Studio has a standard speech recognition engine, but it also offers a custom one that the user can train with their own data. The Custom Speech service includes the basic engine as well. Another advantage of Microsoft’s basic engine is that it is trained on a larger number of words than those of the other platforms. What is more, Microsoft provides the best services for the Polish language, and so far our clients in this project have been Polish companies. A further benefit is that Speech Studio is part of Microsoft Azure Cloud, so other system components can also be implemented on Microsoft Azure services. Finally, Microsoft updates Speech Studio every 6 months, and the user can see the difference in the quality of the software after each update.

Improving the accuracy of the engine

Microsoft Speech Studio has numerous benefits, but we still needed to improve the accuracy of the engine. First of all, punctuation needed work: the engine was not adding any punctuation marks, and we solved this issue. We also needed to add or modify terms connected with a given topic because the engine was transcribing some vocabulary letter by letter, e.g. “F16” (a fighter aircraft) was transcribed in Polish as “e f szesnaście” (“f sixteen” in English). Building the application was therefore not only a matter of deciding which speech-to-text provider to use, but also of improving the details that affect the accuracy of transcription in Polish. Linguists specializing in Polish cooperate with us on these improvements.
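One possible way to handle terms such as “F16” is a post-processing step that rewrites known spoken forms into their written forms. The text above does not say whether the fix was made this way or through training the custom engine, so the sketch below is an assumption. The correction table, `build_pattern`, and `normalize` are hypothetical names.

```python
import re

# Hypothetical correction table: spoken form produced by the engine
# mapped to the desired written form.
CORRECTIONS = {
    "e f szesnaście": "F16",
}


def build_pattern(corrections):
    # Longest phrases first, so a longer phrase wins over its own prefix.
    phrases = sorted(corrections, key=len, reverse=True)
    alternation = "|".join(re.escape(phrase) for phrase in phrases)
    # The lookarounds restrict matches to whole words.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def normalize(text, corrections=CORRECTIONS):
    """Replace every known spoken form in text with its written form."""
    if not corrections:
        return text
    pattern = build_pattern(corrections)
    lookup = {spoken.lower(): written for spoken, written in corrections.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


print(normalize("Samolot e f szesnaście wystartował."))
```

The whole-word check matters in Polish: without it, the final “e” of a word such as “kawie” followed by “f szesnaście” would be rewritten by mistake.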

Training the engine

Another challenge is training the custom engine. Training the engine means extending the vocabulary used by the basic version of the engine. Usually, the engine needs to be retrained every time Microsoft updates the software.

We need to prepare the materials used for training. The engine can be trained with text, audio, or audio with the corresponding text. Preparing these materials requires a lot of work. Moreover, when training the engine, one needs to take particular care not to overtrain the model, because overtraining hurts the accuracy of transcription. Additionally, we need to include jargon, for example pharmaceutical jargon, in the training materials.
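Part of preparing text materials can be automated. The sketch below assumes the text is supplied as one utterance per line, which should be checked against the provider’s documentation. It cleans up whitespace, drops empty and duplicate lines, and reports which required jargon terms are not covered yet. The function name and the sample data are hypothetical.

```python
def prepare_training_lines(sentences, required_terms):
    """Return cleaned, de-duplicated lines and the jargon terms still missing."""
    seen = set()
    lines = []
    for sentence in sentences:
        line = " ".join(sentence.split())  # collapse runs of whitespace
        key = line.casefold()              # case-insensitive duplicate check
        if not line or key in seen:
            continue
        seen.add(key)
        lines.append(line)

    corpus = "\n".join(lines).casefold()
    missing = [term for term in required_terms if term.casefold() not in corpus]
    return lines, missing


# Illustrative data only, not taken from the project.
sentences = [
    "Ibuprofen  200 mg, tabletki powlekane",
    "ibuprofen 200 mg, tabletki powlekane",
    "",
    "Paracetamol 500 mg",
]
lines, missing = prepare_training_lines(
    sentences, ["ibuprofen", "paracetamol", "amoksycylina"]
)
print(lines)
print(missing)
```

The list of missing terms tells the person preparing the materials which jargon still needs example sentences before the next training run.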

User Interface

We needed to create a user interface (UI) for our clients so that they can adjust the settings and see how speech is transcribed to text. The UI had to work both on Windows (the system our clients use) and on Linux. It is also used for internal testing: thanks to it, people without a programming background are able to test the engines.

Live and batch mode of transcription

We needed to implement two modes of speech-to-text transcription, batch mode and live mode, because the materials to be transcribed can be live or pre-recorded.

The live mode, as the name suggests, provides instant transcription of input from a microphone or a file. In this mode, the engine receives streams of data and transcribes them as they arrive.
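To illustrate how a file can serve as a live source, the sketch below reads a WAV file in small consecutive chunks, the way a streaming client would before pushing each chunk to the recognition engine. It uses only the Python standard library. The chunk size, `iter_audio_chunks`, and `make_silent_wav` are assumptions for illustration, and the actual call that sends a chunk to the provider is left out.

```python
import io
import wave


def iter_audio_chunks(wav_file, chunk_ms=100):
    """Yield consecutive raw PCM chunks of about chunk_ms milliseconds each."""
    with wave.open(wav_file, "rb") as wav:
        frames_per_chunk = max(1, wav.getframerate() * chunk_ms // 1000)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data


def make_silent_wav(seconds=1, rate=16000):
    """Build an in-memory mono 16-bit WAV file filled with silence (demo input)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
    buffer.seek(0)
    return buffer


for chunk in iter_audio_chunks(make_silent_wav()):
    # A real client would push each chunk to the streaming API here.
    pass
```

With a microphone as the source, the loop stays the same and only the producer of the chunks changes.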

The batch mode uses recordings prepared before transcription, and the transcription is not produced instantly but only after the whole file has been downloaded. Additionally, the batch mode in the Microsoft services allows the user to process recordings twice as fast as the real-time mode. What is more, speaker identification is possible in the batch mode. We have tested this feature with several different speakers in the same recording and have examined this part of the Microsoft services thoroughly.

Outcome

The project is still under development, and we have multiple ideas for tackling future challenges. We are also in the process of deploying software components resulting from this project into our clients’ production systems. Our experience allows us to use the STT technologies and solutions researched in this project and adjust them to our clients’ needs.

Industry

Media industry, pharmaceutical industry

Keywords

Speech-to-text, STT, Speech Recognition, Machine Learning

Technologies

Microsoft Azure Cognitive Services, Microsoft Speech to Text, Microsoft Custom Speech, Microsoft Azure Cloud