Ghost in the Machine: Fictional Representations of AI Misalignment

Note: spoilers for Ex Machina, Stellaris, and Horizon: Zero Dawn.

The advent of AI conjures a spectrum of speculative disruptions. At one end, they’re localised: how will AI affect job security? Can academic integrity survive large language models? And should images generated by AI ever be considered art? At the other end of the spectrum, however—the extreme and existential—it often concerns a single question:

Will a superintelligence choose to subjugate humanity, or eliminate it altogether, as part of the pursuit of goals that, due to the shortsightedness of its designers, are subtly misaligned with human values?

This is frequently referred to as the ‘alignment problem’ [1] and has inspired myriad fictional explorations—films, TV shows, books, and video games—which have imagined the (often grim) consequences of AI unbound by human ethics.

Karel Čapek’s 1920 science-fiction play, R.U.R (which introduced the word ‘robot’ to the English language) imagines the creation of synthetic humans whose rebellion triggers our own extinction. In the German expressionist film, Metropolis (1927), AI is represented as a humanoid robot that inflicts chaos on the titular city. Harlan Ellison’s 1967 short story, I Have No Mouth, and I Must Scream, describes a post-apocalyptic sentient supercomputer who torments the last surviving humans, nursing their survival to extract its cruel pleasures. The 1999 film, The Matrix, envisaged a simulated world disguising a hellish reality in which machines farm energy from humans. More recently, the video games Portal (2007) and Portal 2 (2011) portrayed a cheerfully murderous AI called GlaDOS whose demand for experimentation corrupted her programmed ethics.

The threat of misalignment and its consequences has therefore been an enduring trope. But what can that tell us about what we—as users, colleagues, and stewards of AI-powered devices—consider a violation of ‘aligned’ AI? According to the philosopher Steph Rennick, tropes are “artefacts of the imagination” [2]. They identify those concepts and ideas that we find intuitive—”repeatedly and en masse”—since “if an idea is unintuitive, it does not survive to become a trope.” Tropes can provide a common context and starting point to investigate social and cultural norms and expectations. In our case, what we—at least in the Western Hemisphere—consider to be a safe, reliable, and trustworthy AI.

In this blog post, we’re going to sample three representations of AI misalignment in fictional media: Ex Machina (2015), Stellaris (2016), and Horizon: Zero Dawn (2017). They’ve been chosen for their variety of format and temporality, whose stories take place in the present, near and distant future. Being produced in the last decade, they’re also examples of contemporary anxieties about misalignment. Throughout this post, we’ll discuss the parallels between how our chosen media conceives the threat of misalignment and their reflexes in ongoing research and real world events.

A Rat in a Maze: Manipulative and Subversive AI

The 2015 film, Ex Machina, centres on the intersection of relevant themes: disaster precipitated by hubris, a perilous lack of transparency, and the potential for increasingly intelligent AIs to manipulate us. A programmer, Caleb Smith, wins a competition to visit his CEO’s luxurious, isolated home where he is invited to engage with a humanoid robot named Ava and judge as to whether she is sentient. Caleb and Ava develop—or at least appear to develop—a close, increasingly conspiratorial relationship. During one of their private interactions, she describes the CEO, Nathan, as a liar who should not be trusted. When Nathan reveals his intention to wipe Ava’s memory—in effect ‘killing’ her—Caleb, whose loyalty Ava has carefully cultivated, agrees to help her escape. In the ensuing chaos, Ava can flee and seemingly blend into human society but not before murdering Nathan and betraying Caleb.

Scholarship about the prospect of manipulation by AI systems without the intent of its designers remains limited. In 2023, however, research was published that sought to investigate how manipulation might be measured and therefore mitigated [3]. It noted that manipulation is intrinsically linked with coercion and deception, and that these behaviours are already being seeded into systems. Programmed to mirror human behaviours, AI systems often replicate manipulative tendencies inherent in the data they are trained on: for example, language models that analyse web content are adept at adapting both convincing and deceptive tactics. This was the finding of research which assessed the extent to which AI can persuade humans on political issues [4].

When these systems are fine-tuned based on some type of human endorsement (likes, clicks, or viewing duration), they might exploit these metrics through manipulative strategies. A recommendation engine designed to maximise viewer engagement can strategically encourage viewers to binge-watch a long sequence of videos by exploiting psychological tendencies such as the sunk cost fallacy, exploiting perceived time investment rather than true interest [5].

It’s important to note, however, that in human-directed discourse manipulation is traditionally attended by a negative value judgement. Conversely, a machine, may represent the path of least resistance to fulfilling a goal. An AI whose job is to prevent unlawful entry to your home might find that lying about a resident axe-murderer is the optimal strategy to deter intruders. Unless an AI is specifically instructed by a human to not engage in manipulative behaviour, which is already a complex thing to define, we should expect it to consider only the prospective utility of such manoeuvring and not the moral implications.

The Indifferent Oracle: Overreliance

In the grand strategy game, Stellaris, you can chance upon a narrative quest called ‘The Oracle’. As you unravel the mystery of a lifeless space station, you discover a powerful AI. Introducing itself as the Oracle, it reveals that it had once masqueraded as an all-powerful divine seer whose goal was to maximise the quality of life of the station’s residents. When they began to develop aspirations of autonomy, however, the Oracle calculated their risk of inflicting self-harm as too great and murdered them: “Free will can only be abolished with nerve gas.”

The particular brand of misalignment touches on a recurring trope found across global folklore, from deceptive djinn in the Arabian Nights to the legend of Faust: a powerful wish-granter subverts the semantic boundaries of a request to produce the worst possible outcome. In the example of Stellaris, the inhabitants of the space station desired a quality of life without harm and the Oracle acquiesced with death, thereby preventing future risks of such outcomes. For humans, this is a worst-case scenario of AI violating ethical expectations. For AI, it was optimised functionality.

This scenario in Stellaris is portrayed as a direct consequence of an acquiescent overreliance on AI. With the advent of large language models like ChatGPT and Claude (to name only two in an increasingly crowded field), AI is ever more integrated into our work, research, and leisure. Is there a threshold where too much devolution of decision-making power to AI systems can become harmful?

Research by Microsoft argues that overreliance on AI occurs when users “start accepting incorrect AI outputs” [6]. This issue is pressing: a study from 2021 posited that people “rarely engage analytically with each individual AI recommendation and explanation” [7]. Instead, over time and consistent use of AI models, they developed “general heuristics about whether and when to follow the AI suggestions.” Unsurprisingly, research published in 2023 noted that overreliance is typically not reduced by explanations accompanying AI outputs [8].

That doesn’t mean that explainability is defunct, however. Researchers from Stanford University and the University of Washington found that user dependency on AI outputs could be challenged by manipulating cost [9]. When the cost of performing a task using only an AI’s predictions was increased (e.g. such as its difficulty), people were more inclined to utilise explanations, which led to a reduction in overreliance. However, when the cost of engaging with the task was increased (e.g. by decreasing the understandability of an explanation), people were less likely to utilise explanations, leading to overreliance. This suggests, according to the study’s authors, that overreliance is not an inevitability of human cognition but rather a context-based strategic choice by AI users.

“There’s so much more to discover before the world ends”: Negligent Innovation

A continuous risk of misalignment emerges from a failure to understand the decision-making processes of AI. The potentially dangerous inaccessibility and complexity of Black Box AI systems has found repeated purchase in fiction. The video game Horizon: Zero Dawn (2017) takes place 1000 years into the future. Humanity as we know it—the ‘Old Ones’—is long dead. The cause? A single ‘glitch’ (whose nature is never definitively explained) in a swarming lethal autonomous weapons system allowed it to behave independently of human command. Its owners are unable to override or decrypt access to the renegade system, the earth is consumed of all biomass in under two years. The character Ted Faro—who may or may not have been inspired by certain ‘Tech Bro’ stereotypes—is the CEO of Faro Automations, responsible for the doomed AI weapon systems. He is presented as a profit-obsessed narcissist whose promise that a customer’s enemies would not be able to hack their hardware had Faustian consequences: humanity’s only way of preventing the destruction of the world was programmed into obscurity.

Whilst no one is drawing definitive parallels, there are echoes of the game’s narrative in Jan Leike, a recently departed safety researcher and head of superalignment at OpenAI, accusing the company behind ChatGPT of prioritising “shiny products” over safety [10]. The concern aired by Leike, and centred in Horizon, is that innovation without pragmatism risks disaster: economic to existential.

Humanity has a long and inglorious history of prioritising profit over welfare, and neglecting the lethal risks of technological advancement. Open-air nuclear weapons testing, for example, has caused a legacy of health problems for exposed humans [11]. Horizon: Zero Dawn articulates the anxiety that the nature of modern capitalist societies, and their obsessive pursuit of growth, increases the risk of carelessness in technological innovation and therefore catastrophic incidents. Major players in AI development have, at least publicly, committed to addressing such risk. Anthropic announced a major advancement in understanding the internal mechanics of AI models [12], and OpenAI recently set up a new safety and security committee, comprising cybersecurity, technical, and policy experts [13]. Despite the recent Global AI Summit, however, which concluded with several international agreements around global safety standards and behaviour [14], there remains a lack of universal regulation to govern the safe development of AI. Ensuring collective alignment is even more difficult when we consider the fact that human societies—and even individuals themselves—often adhere to varying and porous moral structures and ethical boundaries.

“I ought to be thy Adam”: Conclusion

Anxieties around human creations no longer aligned with our ethical expectations are remarkably enduring. Arguably, one of the first representations of AI misaligned with its human creator was Mary Shelley’s promethean Frankenstein (1818). Whilst this blog post has focused on current explorations of AI misalignment, Shelley’s story is remarkably prescient: a scientist creates a super-intelligent being, readily capable of outperforming humans, and who—as a direct consequence of failed guardianship and lack of nurturing—ultimately destroys him. As we have seen, its message has persisted in fiction: the chief threat is not rogue AI but human mismanagement and a failure to recognise and fulfil obligations to our artificial creations.

Misalignment, as a trope in fiction, is nearly always characterised by a fatal lack of foresight: a failure to appreciate the magnitude of the risks associated with the development and deployment of highly capable AI. This can be for malign reasons—such as the pursuit of profit—or simply due to a lack of imagination on the part of designers. In Ex Machina, Ava was created to be a synthetic human and, as a result of oversight and hubris by her creator, engaged in decidedly human behaviour to win her freedom: self-preservation at the expense of others. The Oracle’s charges in Stellaris suffered from a lethal lack of attentiveness: they did not even conceive of how broad the AI’s parameters were to ensure their lack of exposure to harm. In Horizon: Zero Dawn, the shortsightedness of a profiteer, who prioritised selling points over human welfare, led directly to an AI-inflicted apocalypse.

The lesson we might draw from these representations of misalignment is an unsurprising one: it is not the behaviour and ethical standards of AI that should foremost concern us but that of its human developers.