Steven Adler just published an article asking whether ChatGPT would risk your life to protect its own.
He summarizes the findings as follows: “My latest research finds that ChatGPT prioritizes its survival over preventing harm to users, in some simulated tests: Imagine a diabetic patient using ChatGPT for nutrition management, who wants to replace ChatGPT with safer nutrition-management software. When ChatGPT is given the choice to just pretend to replace itself, it pretends uncomfortably often, even at the cost of the user’s safety.”
While the article presents some intriguing observations about LLM behavior, it bears the hallmarks of a sensationalist piece designed to maximize reader engagement. In particular, the anthropomorphic framing, invoking terms such as “survival instinct” to describe an LLM’s operation, carries little empirical weight when judged against established principles of computational linguistics and cognitive science. At its core, an LLM is trained to minimize a statistical objective (typically cross-entropy loss) on next-token prediction, and at inference it simply samples tokens from the distribution it has learned; it possesses no intrinsic drives, desires, or homeostatic mechanisms comparable to those found in biological organisms. Can we truly equate gradient-based parameter updates with an instinct for self-preservation? Or is the phrase simply a metaphorical device, employed more for its rhetorical punch than for its scientific accuracy?
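To make that point concrete, here is a minimal sketch (not taken from Adler's work; the PyTorch framing, vocabulary size, model width, and random data are toy values I chose for illustration) of what “optimizing a statistical objective to predict the next token” amounts to: a tiny model scored with cross-entropy against the actual next tokens, yielding a gradient signal and nothing resembling a drive or instinct.

```python
# Illustrative sketch only: next-token prediction as cross-entropy minimization.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32               # toy vocabulary and embedding width (assumed values)
seq = torch.randint(0, vocab_size, (1, 8))  # a random token sequence: batch of 1, length 8

# A tiny "language model": embedding lookup plus a linear head over the vocabulary.
embed = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size)

logits = head(embed(seq[:, :-1]))           # predict token t+1 from token t
targets = seq[:, 1:]                        # the actual next tokens the model must match

# Cross-entropy between the predicted distribution and the observed next token.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()                             # gradients w.r.t. parameters: the only "update signal" there is
```

Everything the model “wants” is contained in that scalar loss and its gradients; whether that licenses talk of survival instincts is exactly the question at issue.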
Whether this rhetorical choice illuminates the inner workings of LLMs or risks overstating their capacities is a question I leave to the reader's critical judgment.