AI-generated voices in video accessibility: Scribit.Pro tests voice-over for audio description using artificial intelligence

9 September 2024  
Illustration in Scribit.Pro's colours of a man adjusting knobs to turn on a lamp.

Last spring, the supermarket Aldi replaced voice actor Diederik Ebbinge with an AI-generated voice. The actor and television maker had narrated all Aldi commercials for years, but has now been replaced by a voice created using artificial intelligence (AI). This allows the supermarket to save time and money. The customer is said to benefit from this.

Aldi recorded the voices of ten female employees, and a new, artificial voice was created from this collection of audio. Studio recordings with real people are therefore no longer necessary, as this AI voice can speak any desired text directly and automatically.

Listen to the new voice here.

Our experiment with an AI-generated voice

This led Scribit.Pro to try an experiment: can we also develop an AI-generated voice? Our approach is slightly different from Aldi's. Scribit.Pro makes video content accessible. We do this, among other things, by adding audio description: a voice-over that describes what is visible on screen during moments in the video when there is no dialogue or other important sound. This makes these media productions accessible to people with visual impairments. Audio description can also help users with cognitive limitations or challenges (such as brain injury or autism) better understand the content.

To achieve this, Scribit.Pro uses the synthetic voices from ReadSpeaker, the international leader in text-to-speech technology. The visual description can easily be typed in the Scribit.Pro editor, and the result can be listened to immediately. After publication, the end user can have this audio description play as a voice-over with the video. In this way, we can deliver a fast, 24-hour service of high and consistent quality. The organisations that are our customers can also make their own videos accessible with our software, without needing voice recordings, sound studios, or microphones.

Even though each of these voices sounds natural and pleasant (and familiar to many blind or partially sighted users), each remains an artificial voice. With sound professional Ferry Molenaar, who works as a podcast creator and voice actor, Scribit.Pro took on the challenge of creating an audio description voice that is also artificial but still sounds human: in a sense, a clone of his real voice.

We went into the studio and made various audio recordings of Ferry's voice. With this input, we commissioned ElevenLabs, an AI company specialising in generative audio, to produce a voice. Within two days, the result was ready in the ElevenLabs environment.

Can you tell the difference between a human and an artificial voice?

Listen to the result yourself below. This video is a montage of five short clips to which Scribit.Pro has added audio description. Each of the five clips is narrated once by Ferry himself and once by the AI voice made from Ferry's voice. The two versions of each clip can be heard one after the other (in Dutch). Can you tell the difference between a real voice and artificial intelligence?

The result of our experiment with AI

We put it to the test and asked people for their opinions on the audio descriptions in the video above. Strikingly, everyone we interviewed preferred Ferry's real voice. But we also discovered that the voices were not always easy to tell apart: sometimes there was confusion about which was the human voice and which was the AI voice, and in each clip some respondents mistook the AI voice for the real one. In general, respondents perceived the artificially generated voice as pleasant and lifelike, and judged it to be of good quality.

And what do we think of our experiment? Scribit.Pro is currently exploring different ways in which AI can assist us in the process of video accessibility. Can we further improve our product and services through the use of artificial intelligence? Is it possible, in this case, to create an AI voice that sounds pleasant and human, and that can provide an audio description of video content?

The answer is: yes, we succeeded! The voice created in our artificial intelligence test sounds like a real human being, but it is an artificial voice that can render any desired text, just as the current synthetic voices in our software can: in Dutch, English, or German, but also in Finnish or Chinese. The artificial voice sometimes makes an error in pronunciation or intonation, but it sounds surprisingly lifelike and pleasant to the ear.

We continue to research whether artificial intelligence can be used in the production of audio description, transcription, subtitling, and sign language translation. This way, we can optimise the video accessibility process and work towards an inclusive future that is accessible to everyone.

Would you like to read more about our research into AI in video accessibility, in which we investigate whether artificial intelligence can improve, speed up, and refine our services?

Read the blog about AI in audio description.

Read our interview with Ferry Molenaar about AI.
