Developing voices at ReadSpeaker

5 July 2022

Would you like to introduce yourself and explain what ReadSpeaker does?

My name is Ludmila Menert and I work as a linguist at ReadSpeaker. My work is very diverse. I supervise audio recordings with voice actors. I analyze and record patterns and irregularities in different languages. I check and correct the labels that other colleagues add to sounds in our speech databases. And I often think about how our voices can better pronounce specific words or names. This is necessary because a computer cannot know everything. In the Eindhoven name "Genderpark", our voice will pronounce "gender" as in English, but because the name refers to the river Gender there, it must be pronounced with the Dutch g (as in the Dutch word for "go").

ReadSpeaker develops synthetic voices and also products that use those voices. Thanks to proprietary, industry-leading technology, our voices are among the most natural on the market; lifelike voices, in other words. They can be used with our reading aid to make websites, documents, books or teaching materials more accessible. The content is read aloud, which supports website visitors with reading disabilities. Our solutions can also be used to give a voice to, for example, an app, device or software program. Scribit.Pro uses our voices to add audio description to videos. The products we build offer dozens of languages and more than 100 different voices. They also allow those voices to speak faster or slower, highlight the text being read so it is easy to follow, and offer many more extras. ReadSpeaker is located in 15 countries and serves more than 10,000 customers in 65 countries. With over 20 years of experience, the ReadSpeaker team of experts is at the forefront of text-to-speech.

Can you explain how computer voice technology works?

To begin with, the computer must know WHAT to say, i.e. how to convert the written text into words. Should “112” be pronounced “one hundred and twelve” or “one one two”? Should “file” sound like “fiele” (the Dutch word for a traffic jam) or “fajl” (the English word)? And should the Dutch abbreviation “bv” be read as “bijvoorbeeld” (“for example”) or spelled out letter by letter? Then we need to determine HOW to pronounce the words. For this we use a pronunciation dictionary, which contains a few hundred thousand combinations of spelling and pronunciation. The software looks up the pronunciation there. But these dictionaries are also used to train the software: it learns the most common patterns of spelling and pronunciation. After some training, the software can predict the pronunciation of words that are NOT in the dictionary, such as names or words from other languages. The speech itself should, of course, resemble that of the voice actor used for the voice. That is why, for each voice, a considerable number of hours of speech is recorded with a voice actor. The synthetic voice is then trained on a sound dictionary built up from the recordings of the words and sentences spoken by the voice actor.
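As a rough illustration of the lookup-then-predict step described above, the following Python sketch checks a toy pronunciation dictionary first and falls back to a placeholder predictor for unknown words. Everything in it is an assumption made for illustration: the names `LEXICON`, `read_digit_by_digit`, `predict_pronunciation` and `pronounce`, the transcription notation, and the letter-by-letter fallback, which merely stands in for a trained model. It is not ReadSpeaker's actual software.

```python
# Illustrative sketch only; names, entries and notation are hypothetical.

# Toy pronunciation dictionary: spelling -> rough transcription.
LEXICON = {
    "file": "fiele",      # the Dutch reading mentioned in the interview
    "fauci": "fautsjie",
}

DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def read_digit_by_digit(number_text):
    """Read a digit string one digit at a time, e.g. for an emergency number."""
    return " ".join(DIGIT_NAMES[int(ch)] for ch in number_text)


def predict_pronunciation(word):
    """Placeholder for a trained model: spells the word out letter by letter."""
    return "-".join(word.lower())


def pronounce(word):
    """Return (pronunciation, source): the dictionary entry if present, else a prediction."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "lexicon"
    return predict_pronunciation(word), "predicted"
```

With these toy entries, `pronounce("Lviv")` falls through to the placeholder predictor until an entry for it is added to `LEXICON`, which mirrors why new names keep being added to a pronunciation dictionary.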

We all know the halting computer voice, but that is now a thing of the past. How did this evolve?

Current technology is all about so-called machine learning: the computer is trained on real, natural speech, but ultimately produces a completely artificial voice sound using the so-called vocoder technique. You could say it speaks with an artificial voice. The previous generation of text-to-speech voices was different. There, the speech was built up from very small fragments of speech, which were taken from the database of recorded speech and stitched together. The software looked for the right speech fragments and had to combine them in such a way that the result sounded as smooth as possible. Even though this eventually achieved quite high quality, you could sometimes still hear the less successful joins between the speech fragments.
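The fragment-stitching approach of the previous generation can be sketched as a small search problem: for each target sound there are several recorded candidates, and the software picks the sequence whose joins are least audible. The Python sketch below is a simplified illustration under stated assumptions, not the method ReadSpeaker used. The `Fragment` class, the use of pitch at the fragment edges as the only measure of a join, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A recorded speech fragment (hypothetical, reduced to its edge pitches)."""
    sound: str
    start_pitch: float
    end_pitch: float


def join_cost(left, right):
    """Assumed cost of a join: the pitch jump between two adjacent fragments."""
    return abs(left.end_pitch - right.start_pitch)


def select_fragments(candidates):
    """Pick one fragment per position so that the total join cost is minimal.

    candidates: non-empty list of non-empty lists of Fragment, one list per target sound.
    Returns (total_cost, chosen_fragments). Uses dynamic programming.
    """
    # layers[pos][j] = (best cost ending in candidate j, index of its predecessor)
    layers = [[(0.0, None) for _ in candidates[0]]]
    for pos in range(1, len(candidates)):
        layer = []
        for frag in candidates[pos]:
            cost, prev_idx = min(
                (layers[pos - 1][k][0] + join_cost(prev, frag), k)
                for k, prev in enumerate(candidates[pos - 1])
            )
            layer.append((cost, prev_idx))
        layers.append(layer)

    # Start from the cheapest final candidate and follow predecessors backwards.
    idx = min(range(len(layers[-1])), key=lambda j: layers[-1][j][0])
    total = layers[-1][idx][0]
    path = []
    for pos in range(len(candidates) - 1, -1, -1):
        path.append(candidates[pos][idx])
        idx = layers[pos][idx][1]
    path.reverse()
    return total, path
```

A real system would weigh many more properties than pitch, but even this toy version shows why an imperfect match between neighbouring fragments remains audible: the best available sequence can still contain a join with a non-zero cost.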

Were real people's voices used as a model for the ReadSpeaker voices?

That's exactly right. Today, recordings of real speech serve as a model for the digital voice, whereas in the past they were also its building material, as it were. The more speech is recorded with the voice actor, and the more time and computing power go into the training sessions, the higher the quality of the resulting voice will be. At the highest quality, the computer voice can hardly be distinguished from the real person.

How much time does it take to “develop” a new voice?

A basic quality voice can be ready within a few months. But for higher quality we need more time, both for the recordings and for the processing and training. And especially for optimizing, which involves multiple cycles of improvements and tests.

Are there also children's voices?

They do exist, but the selection is limited. This is because it is very difficult to get children, especially younger children, to record enough speech material of a quality that is high enough and, above all, consistent enough. Often, recorded speech from young adults or a mix of speech databases is used instead.

Will there be more voices? For example, voices with accents or dialects?

Yes, we already offer several variants of different languages. For English these include American, British, Indian, Scottish and South African. We also develop “custom voices”, for which the customer chooses their own voice actor, who can be someone with a regional accent.

If there are new words, do they have to be spoken by a 'real' person first?

No, that is not necessary. Based on the recorded speech, the computer voice can in principle pronounce everything as the original speaker would have done. However, it is sometimes necessary to correct the pronunciation of such a new word because the software predicted it incorrectly. That's why new words, especially proper names, are constantly being added to our pronunciation dictionaries. These are the most recent additions I have made to our Dutch lexicon: Lviv, Charkiv, Mariupol…

How do your voices pronounce names…? Are unknown words and names recorded separately?

The computer learns from statistical patterns, so the pronunciation of a name that is not in our pronunciation dictionary will be predicted much as a Dutch speaker would predict it from the written form. However, an average Dutch speaker knows a lot more about the world than is written in our dictionaries. People listen to the news and know, for example, that “Fauci” is an American whose name is pronounced “fautsjie”. Or that "Angela" is almost always pronounced with the dzj of "jazz", but not when it refers to former Chancellor Angela Merkel, where you hear the hard g of "goal". That is why we have to continuously 'retrain' our system.
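The "Angela" example amounts to a pronunciation that depends on the neighbouring word. The interview describes retraining a statistical system; the Python sketch below swaps in a much simpler technique, a hand-written override table, purely to make the idea concrete. The table names, the function `pronounce_name` and the transcriptions are hypothetical.

```python
# Illustrative sketch only; entries and notation are hypothetical.

# Default pronunciation per name.
DEFAULT_PRONUNCIATIONS = {"angela": "an-dzje-la"}  # dzj as in "jazz"

# Overrides keyed by (name, following word).
CONTEXT_OVERRIDES = {("angela", "merkel"): "an-ge-la"}  # hard g as in "goal"


def pronounce_name(words, index):
    """Pronounce words[index], letting the following word override the default."""
    word = words[index].lower()
    following = words[index + 1].lower() if index + 1 < len(words) else None
    override = CONTEXT_OVERRIDES.get((word, following))
    if override is not None:
        return override
    # Unknown names are returned unchanged in this toy version.
    return DEFAULT_PRONUNCIATIONS.get(word, word)
```

Calling `pronounce_name(["Angela", "Merkel"], 0)` uses the override, while any other surname falls back to the default entry.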

Your portfolio offers multiple male and female voices, all of which have a name. Which voice is used the most?

Our most used voice at the moment is Ilse. It is a very versatile voice with a neutral, clear sound and high quality, because it is very well trained.

Which is used more often: male or female voices?

In general, a female voice is chosen more often in Europe and North America, but the choice also depends on the application. We also notice that customers want more variety and want to be able to choose from multiple voices. It is possible that the relative overrepresentation of female voices in text-to-speech applications is related to the fact that computer voices are currently mostly used in the field of services, care and support. Female voices are more likely to be associated with friendliness and service orientation, so perhaps for that reason people are more likely to choose a female voice.

In how many countries/languages are you active?

ReadSpeaker is located in 15 countries and provides text-to-speech solutions to more than 10,000 customers in 65 countries. More than 110 voices are available in over 35 languages. And we are continuously working on more… Recently I was working on Welsh, and colleagues in the team are working on Catalan, Hindi and Romanian.

Finally, do you have a favorite voice?

Old-fashioned perhaps, but I always enjoy listening to our Alice's friendly, correct, impeccable but not too posh British English.