AI-generated voice-over for audio description: how do end users and image describers rate the audio description voice created by Scribit.Pro using AI?
Scribit.Pro is currently conducting extensive research into artificial intelligence. We are examining and testing multiple ways in which AI could further improve Scribit.Pro’s product and service.
Scribit.Pro makes videos accessible by adding audio description, transcription, subtitles for the deaf and hard of hearing, and/or sign language translations to video content, thereby making its content understandable to the largest possible group of users. This way, people with disabilities also have access to this information, making our society more inclusive.
To continuously optimise our services, we are investigating the potential of artificial intelligence in our work processes and the software we use. Can AI make the production of accessible video easier, faster, and better?
Together with sound professional, podcast maker, and voice actor Ferry Molenaar, we conducted an experiment. We asked AI to create a voice based on Ferry’s actual voice. This artificial voice can be used as an instant voice-over for video audio description, similar to the synthetic voices Scribit.Pro currently uses. The latest voice-over for the advertising campaigns of the Aldi supermarket chain was created in a similar manner.
The end result: what does the AI version of a human voice sound like?
We were pleasantly surprised by the final result. The AI-generated voice of Ferry sounds like Ferry: lifelike. And the voice is certainly usable as a voice-over in our audio descriptions. But how do our image describers and end users like this cloned voice? We asked some of them for their opinions.
In a small test, we used five videos as research material. For each of the five videos, we created two versions: one with an audio description voiced by Ferry, and one with an audio description voiced by the AI-generated version of Ferry’s voice. The respondents’ task was to distinguish the human voice from the artificial one. That proved to be quite tricky. Many respondents could not always tell which voice-over was Ferry's and which was the AI’s. In more than one video, some people mistook the artificial voice for Ferry’s. In one of the videos, we used AI to generate both versions, and none of the respondents noticed that the human voice was missing in this example.
Overall, the results show that people found the voice pleasant and lifelike. Only the intonation, and the occasional mispronunciation, sometimes gave away the computer voice. All participants in our research said they preferred Ferry’s real voice, although some mentioned that the artificial voice sounded somewhat less rushed in certain instances.
This experiment also made it clear that an AI-cloned voice is not always recognised as such and is often perceived as genuinely human. This is, of course, good news for Scribit.Pro: we want to see if AI can improve video accessibility. A synthetic voice that sounds just like the human voice it’s modelled after is undoubtedly a significant step towards an even better product: higher quality and better tailored to the wishes of our clients, business partners, and end users. Together, we make the future accessible.
Read the blog about Scribit.Pro's research into AI in audio description.
Read the extensive interview with Ferry Molenaar about his dream applications of AI.