Look Who’s Talking: The Risks and Rewards of Voice Cloning

What happens when a computer no longer sounds like a stereotypical robot—monotone, mechanical, synthetic—but human? More specifically: what if it sounded like you? Voice cloning, a type of text-to-speech technology that can replicate an individual’s speech, has existed since at least 1998, but developments in deep learning and neural networks have rapidly improved the quality of its reproductions [1]. In that time, the technology to accurately recreate the voices of others has become widely available and accessible: you can upload audio files to voice cloning services, or follow online tutorials explaining how to do it yourself [2,3]. While this has led to creative and innovative outcomes—from enhanced dubbing to former presidents enjoying a game of Fortnite together [4]—voice cloning has also been criminally exploited. Recently, a chief executive at a UK energy firm was tricked into sending tens of thousands of pounds to scammers by an audio fake of his boss’s voice [5].

Earlier this month, OpenAI—the company best known for ChatGPT—announced that the public release of Voice Engine, their newest text-to-speech artificial intelligence (AI) model, would be delayed [6]. They cited the overwhelming risk of disinformation in an important, and likely bitterly fought, election year for the United States. Voice Engine can reportedly reproduce natural-sounding speech from as little as 15 seconds of sample audio. OpenAI’s concern stems from events like the circulation of an AI-generated audio fake of Joe Biden’s voice, which earlier this year urged voters to boycott January’s Presidential primary in New Hampshire [7].

When assessing whether the benefits of voice cloning are worth the risks, it’s important to be mindful of how other technologies have previously been integrated into human patterns of life. From writing to the internet, the camera obscura to photoshopping, there is no progress-defining technology that has not been weaponised by state and non-state actors. Forged letters, cyber warfare, spying, and malicious image manipulation: these behaviours have caused extensive damage, both material and economic, despite the overwhelming wider value of the technologies they exploit.

This raises an important question: what threshold must voice cloning meet for its rewards to outweigh its risks? This blog post considers the benefits and pitfalls of voice cloning technology—including the space our voices occupy in the collective imagination—and what a trade-off between the two might look like.

Benefits

Conditions such as amyotrophic lateral sclerosis (ALS), apraxia, and traumatic brain injuries can leave people non-verbal, unable to communicate through speech [8]. This was the case for the late Stephen Hawking, who lived with ALS and whose speech synthesiser was famously robotic in tone. Much as we store eggs and sperm for those at risk of infertility, people can now ‘bank’ their voice. Should an accident occur, or a condition advance, that impedes speech, their voice can be recreated and computerised via voice cloning. This retention of identity can be important for post-traumatic recovery.

Losing our voice, however, is not only a physical phenomenon: it’s linguistic too. Research published by the Australian National University in 2022 found that language diversity is under threat. Of the 6,511 languages currently spoken, around 1,500 are likely to become extinct by the end of this century [9]. Language is a social construct and so inevitably lives, changes, and dies according to use. However, many communities globally have had their native tongues forcibly driven to premature endangerment or extinction [10]. Cloning the voices of a language’s remaining speakers could support its revitalisation amongst younger generations seeking stronger ties with their historical communities. Current preservation methods rely on recording language samples from vanishingly few remaining speakers, which can be a logistical challenge. Ever-improving cloning models could be used to automatically generate phonetically accurate examples of endangered languages and dialects from limited, short samples. This could radically expand a language’s learning resources [11].

Outside of health and heritage, voice cloning could be transformative for entertainment. TED talks could be delivered by Bertrand Russell or Amelia Earhart, and the immersiveness of historical biopics could be intensified with dubbed voice clones. And whilst understandably controversial, voice cloning might support narrative continuity when an actor passes away. Lance Reddick, who played Cedric Daniels in The Wire (2002), died in 2023. He voiced a major character in the ongoing Horizon video game series, Sylens, whose role was set up to be substantial in the series’ final instalment [12]. Voice cloning could help maintain the sense of familiarity built up between the character and the player.

These developments would, however, necessitate a broad range of complex conversations about artistic expression, the value of finiteness, and—outlined in the following section—voice ownership. Is there a unique utility to historical figures communicating the work of modern scientists? Is the aspiration of acting to mimic or interpret? And how long should studios be entitled to voice rights, if at all?

Risks

The ability to create synthetic voices that are virtually indistinguishable from real ones using free and open-source models has caused alarm. Once a voice is cloned, using it is as simple as typing text into a box: the synthesised audio will say whatever you instruct, and it will sound convincingly like the original speaker [13]. Voice cloning can already be used to bypass voice authentication systems and gain unauthorised access to personal accounts [14].
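To illustrate just how low the barrier has become, here is a minimal sketch using the open-source Coqui TTS library and its XTTS v2 zero-shot cloning model. The file names are placeholders, and the model identifier and API reflect the library at the time of writing, so treat this as a rough outline of the workflow rather than a definitive recipe.

```python
# Minimal voice-cloning sketch using the open-source Coqui TTS library
# (pip install TTS). "reference_voice.wav" is a placeholder for a short,
# clean recording of the speaker to be cloned.
from TTS.api import TTS

# Load a multilingual zero-shot voice-cloning model (XTTS v2).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Type some text into a box; the output aims to sound like the reference speaker.
tts.tts_to_file(
    text="I never said that, but it certainly sounds as though I did.",
    speaker_wav="reference_voice.wav",  # a few seconds of sample speech
    language="en",
    file_path="cloned_output.wav",
)
```

That a handful of lines and a short sample can reportedly produce a convincing clone is precisely why the authentication and ownership questions below matter.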

From the perspective of intellectual property, there is ambiguity around whether voices can be protected under existing United Kingdom copyright law (no statute explicitly recognises a standalone right of publicity) [15]. This potentially leaves room for infringement through unauthorised cloning. In 2004, the former runner David Bedford brought a successful claim against the information service company 118 118 for using his image—two moustachioed runners in white vests—without his permission [16]. 118 118 had argued that their runners were based on comical representations of generic 70s athletes. But could this ruling be transplanted to voice ownership? Sound is transient whereas images are static and persistent: attributing a voice to a particular person is mechanically harder than attributing an image.

Resolving this issue will require establishing a legal definition of what one’s voice is [17]. Does it include its rhythm? Tone? Word choice? Speech idiosyncrasies? Hesitation phenomena? Is it still ‘my’ voice if I’m repeating someone else’s words? Moreover, our speech isn’t necessarily fixed: moving to a new country and becoming immersed in a new community can lead to significant changes in articulation. Do those changes constitute a new voice? And what would its legal relationship be to the previous one? What we casually refer to as our voice fulfils a broad range of functions, and a single definition may be difficult to inscribe in law. Future legal dramas may therefore centre on common sense: did the defendant who made the simulated voice deliberately intend for it to be mistaken for a specific real person? Even if the defendant concedes that point, however, they might counter that their actions were not driven by malice but intended as a form of artistic expression or parody. This leads to an interesting epistemological question.

What’s so Special about Voices?

Those of us who were undeterred by reason and decided to study English Literature for our bachelor’s degrees will be familiar with Roland Barthes and his essay, The Death of the Author (1967). In it, he describes the “total existence of writing” and explains:

“...A text is made of multiple writings, drawn from many cultures and entering into mutual relations of dialogue, parody, contestation, but there is one place where this multiplicity is focused and that place is the reader, not, as was hitherto said, the author” [18].

Once in the hands of the reader, Barthes observes, ownership of meaning is transferred from the writer. We see this theory reflected in the popular practices of headcanon [19] and fanfiction [20]. Could we ever conceive of an author’s physical ‘voice’ in the same way?

That seems unlikely, at least for the time being. Our sound is distinctive and intrinsic to us in ways other forms of communication aren’t. While speech and writing are both constructed from a nexus of influences, the former is biological and internal, the latter exogenous. We can reflexively identify a friend, colleague, or loved one by their speech, but an unsigned message instils uncertainty. The distinction is reflected in our legal system, which treats plagiarism and identity theft as different offences.

Contemporary shibboleths, however, may prove to be just that: contemporary. Developments in voice cloning have advanced faster than cultural attitudes can adjust, a deficit which will be rebalanced as younger people grow up familiar with the technology. Future generations might develop a more Barthesian attitude to speech.

Future Implications

The integration of voice cloning technology into our personal and professional lives will likely lead to behavioural change. In the same way many of us now self-moderate what we write on LinkedIn or Facebook, and carefully consider which pictures of ourselves we post on Instagram, we will likely become more mindful of how, or whether, we speak in uploaded videos: our voice may, after all, ultimately be our intellectual property.

Similar to how we use VPNs to protect our online identities from criminals, people may have to consider installing software to protect their voices. The good news is that such protections are already available and often open-source. Software such as ‘AntiFake’ aims to prevent the sound of your voice being stolen by embedding subtle distortions in audio files which are undetectable by humans but which inhibit AI analysis, rendering the files unsuitable for training [21].
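As a purely conceptual illustration of that idea, the sketch below adds an amplitude-bounded perturbation to a recording so that it stays hard to hear while still altering the waveform a cloning model would see. This is not AntiFake’s actual algorithm (tools like it optimise targeted adversarial perturbations against speaker-encoding models rather than adding random noise), and the file names and the epsilon value are placeholders.

```python
# Conceptual sketch only: an amplitude-bounded perturbation added to a voice
# recording. Real protective tools such as AntiFake optimise the perturbation
# adversarially against speaker-encoding models; uniform random noise at this
# level is unlikely to defeat cloning on its own.
import numpy as np
import soundfile as sf  # assumed dependency: pip install soundfile

def add_bounded_perturbation(in_path: str, out_path: str, epsilon: float = 0.002) -> None:
    """Write a copy of the audio with a small, hard-to-hear perturbation."""
    audio, sample_rate = sf.read(in_path)           # samples in [-1.0, 1.0]
    noise = np.random.uniform(-epsilon, epsilon, size=audio.shape)
    protected = np.clip(audio + noise, -1.0, 1.0)   # keep samples in valid range
    sf.write(out_path, protected, sample_rate)

# Placeholder usage:
# add_bounded_perturbation("my_voice.wav", "my_voice_protected.wav")
```

The design point is the constraint, not the noise itself: any protective distortion has to stay below the threshold of human perception while still being disruptive to the models that would otherwise learn from the recording.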

As with any transformative technology, the responsible development and deployment of voice cloning will require a continual balance between harnessing its benefits and mitigating its risks. A pragmatic approach will involve implementing robust ethical guidelines, legal frameworks, and technological safeguards. This won’t entirely prevent criminal exploitation, but it will help ensure that the overwhelming majority of voice cloning’s uses, like those of other technologies, are beneficial.