Advances in generative artificial intelligence have driven the development of high-quality audio synthesis models for text-to-speech and voice conversion applications. These systems can imitate human voices, taking as input either text to be spoken or the voice of another person. The field has advanced significantly since 2016-2017, when Google DeepMind released the WaveNet speech generation model and Google introduced the Tacotron end-to-end system. These approaches have evolved into current methods built on sophisticated generative AI paradigms, such as generative adversarial networks and diffusion-based models. Recently, Microsoft presented VALL-E, which can replicate a person's voice from only a three-second recording.
These improvements have favored the widespread adoption of these technologies, with different companies and service providers offering tools to create synthetic audio. For example, the Speechify web platform generates human-like synthesized speech, while other platforms offer voice cloning that can produce audio deepfakes. These AI voices hold great market potential and can serve many purposes: corporate voices, dialogue systems, streaming platforms, and audiobooks, among others.
However, the generation of ever-better audio deepfakes that cannot be distinguished from authentic voices carries a serious risk: their malicious use to spoof a person's identity. In recent years, scammers have exploited these deepfake tools for different purposes, such as cloning a person's voice in telephone calls to family members or friends and convincing them to send money. Synthetic voices can also be used to fabricate speeches by celebrities or influential figures (e.g., politicians), placing them in compromising scenarios. Moreover, these voices can be employed to fool automatic speaker verification (ASV) systems in applications such as bank account management or call centers, causing economic losses for the affected companies.
Fortunately, deep learning advances can also be used to develop accurate audio deepfake detection tools that identify these deepfakes and prevent inappropriate use. For the last few years, researchers have been investigating algorithms that distinguish machine-generated voices from genuine speech, using deep neural networks that detect the artifacts introduced by these AI generation systems.
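To make the idea of "detecting generation artifacts" concrete, the toy sketch below computes spectral flatness, a classic hand-crafted spectral statistic that separates tonal (harmonic, speech-like) frames from noise-like ones. This is purely a pedagogical stand-in: real deepfake detectors learn far subtler cues with deep neural networks rather than using this feature or threshold, and all function names here are illustrative, not from any detection toolkit.

```python
import cmath
import math
import random

def power_spectrum(frame):
    """Naive DFT power spectrum of a real-valued frame (toy illustration)."""
    n = len(frame)
    spec = []
    for k in range(1, n // 2):  # skip the DC bin
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(s) ** 2 + 1e-12)  # epsilon keeps the log well-defined
    return spec

def spectral_flatness(frame):
    """Geometric mean / arithmetic mean of the power spectrum, in (0, 1].

    Tonal signals concentrate energy in few bins and score low; noise-like
    signals spread energy evenly and score high. Deployed detectors learn
    such discriminative cues automatically instead of hand-crafting them.
    """
    spec = power_spectrum(frame)
    log_mean = sum(math.log(p) for p in spec) / len(spec)
    return math.exp(log_mean) / (sum(spec) / len(spec))

random.seed(0)
N = 64
tone = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]  # harmonic-like
noise = [random.gauss(0.0, 1.0) for _ in range(N)]            # noise-like

print(f"tone flatness:  {spectral_flatness(tone):.3f}")
print(f"noise flatness: {spectral_flatness(noise):.3f}")
```

In a learned detector, statistics like this are replaced by spectrogram inputs to a trained network, but the underlying principle is the same: synthetic speech leaves measurable spectral traces that differ from genuine recordings.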
As an example of this effort, the research community has organized several challenges to develop better algorithms that can cope with the most recent audio-generative models. For example, the ASVspoof series, a set of biennial challenges, focuses on developing anti-spoofing systems as countermeasures for ASV deployments. In addition, the recent Audio Deepfake Detection challenges push the task further, targeting even more challenging conditions such as low-quality audio signals or partially fake audio. Several security companies have started commercializing these anti-spoofing solutions to harden their voice biometric systems, preparing for the potential threat of audio deepfakes.
EITHOS will develop powerful audio deepfake detection algorithms for forensics, supporting law enforcement agents in detecting the potential misuse of synthesized and voice-cloned audio.