
ChatGPT’s goblin problem: The unintended consequences of teaching AI to be nerdy

In a somewhat unusual post earlier this week, ChatGPT maker OpenAI said that starting with GPT-5.1, its models had developed a “strange habit”: they “increasingly mentioned goblins, gremlins, and other creatures in their metaphors.”

This meant that even ordinary queries to the AI chatbot could produce random mentions of these folklore creatures, which are typically associated with mischief and malice and feature in works of fantasy and science fiction. In the post, OpenAI attributed this to how the behaviour of such models is shaped, and in particular to the role of “incentives”.

According to OpenAI, the pattern first emerged in November 2025, after the GPT‑5.1 launch, and was eventually flagged by a safety researcher.

“Users complained about the model being oddly overfamiliar in conversation, which prompted an investigation into specific verbal tics. A safety researcher had experienced a few ‘goblins’ and ‘gremlins’ and asked that they be included in the check. When we looked, use of ‘goblin’ in ChatGPT had risen by 175% after the launch of GPT‑5.1, while ‘gremlin’ had risen by 52%,” it said.

Why did this happen?

A further internal analysis found that such language was especially common in traffic from users who had selected the “Nerdy” personality. ChatGPT users can choose from various Characteristics (warm, enthusiastic, etc.) and a Base style and tone (Professional, Friendly, Candid, Cynical, and others) for a more personalised conversational experience.

“Nerdy,” according to OpenAI, was based on the following system prompt: “You are an unapologetically nerdy, playful and wise AI mentor to a human. You are passionately enthusiastic about promoting truth, knowledge, philosophy, the scientific method, and critical thinking. […] You must undercut pretension through playful use of language. The world is complex and strange, and its strangeness must be acknowledged, analyzed, and enjoyed. Tackle weighty subjects without falling into the trap of self-seriousness. […]”
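
How such a prompt steers a model’s tone can be sketched in code. The snippet below is illustrative only, assuming OpenAI’s public Python SDK: ChatGPT’s personality picker is an internal product feature, and the model name and prompt wording here are placeholders rather than OpenAI’s actual configuration.

```python
# A minimal sketch: approximating a "personality" by prepending a
# style-setting system message via OpenAI's public Python SDK. The model
# name and prompt text are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

NERDY_STYLE = (
    "You are an unapologetically nerdy, playful and wise AI mentor. "
    "Undercut pretension through playful use of language."
)

response = client.chat.completions.create(
    model="gpt-5.1",  # placeholder model name
    messages=[
        {"role": "system", "content": NERDY_STYLE},  # the personality layer
        {"role": "user", "content": "Why is the sky blue?"},
    ],
)
print(response.choices[0].message.content)
```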

Nerdy accounted for only 2.5% of all ChatGPT responses but 66.7% of all “goblin” mentions in those responses, it found. In other words, the personality was roughly 27 times over-represented among goblin mentions relative to its share of overall traffic.

But why this specific category of words?


The answer lies in an important method that shapes Large Language Models (LLMs): reinforcement learning (RL). At their core, ChatGPT and other LLMs use the massive amounts of data fed into them to predict the most likely next word in a sequence, and thereby answer a user query.

“During the model’s learning process (known as “training”), the model might be tasked with completing a sentence like: “Instead of turning left, she turned ___.” Early in training, its responses are largely random. However, as the model processes and learns from a large volume of text, it becomes better at recognising patterns and predicting the most likely next word. This process is repeated across millions of sentences to refine its understanding and improve its accuracy,” explains OpenAI.
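
The idea can be made concrete with a toy example. The sketch below is not OpenAI’s training code, just a bigram counter: it tallies which word follows which in a tiny corpus and predicts the most frequent continuation. Real LLMs learn these statistics with neural networks over vastly more text, but the objective is the same in spirit.

```python
# Toy next-word prediction: count how often each word follows each other
# word in a small corpus, then predict the most likely continuation.
from collections import Counter, defaultdict

corpus = (
    "instead of turning left she turned right . "
    "instead of turning back she turned right . "
    "instead of turning left she turned around ."
).split()

# Bigram counts: for each word, how often each possible next word follows it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the continuation most often seen in training."""
    return follows[word].most_common(1)[0][0]

print(predict_next("turned"))  # -> "right" (seen twice, vs "around" once)
```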

That said, an element of randomness persists, since many such prompts, like the one above, have no single correct answer. Then comes another layer: reinforcement learning, in which an agent learns from its environment and makes decisions based on “rewards” set by its developer.

According to IBM, “Because an RL agent has no manually labeled input data guiding its behavior, it must explore its environment, attempting new actions to discover those that receive rewards. From these reward signals, the agent learns to prefer actions for which it was rewarded in order to maximize its gain.”
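
A minimal sketch of that idea, under the assumption of a made-up two-armed environment, looks like this: the agent has no labelled data, tries actions, observes rewards, and drifts towards whatever pays off.

```python
# Epsilon-greedy reward learning on a toy two-armed bandit (illustrative).
import random

payoff = {"action_a": 0.2, "action_b": 0.8}    # hidden reward probabilities
estimate = {"action_a": 0.0, "action_b": 0.0}  # the agent's learned values
pulls = {"action_a": 0, "action_b": 0}

for _ in range(1000):
    # Explore 10% of the time; otherwise exploit the best-known action.
    if random.random() < 0.1:
        action = random.choice(list(payoff))
    else:
        action = max(estimate, key=estimate.get)
    reward = 1.0 if random.random() < payoff[action] else 0.0
    # Incrementally update the running-average value of the chosen action.
    pulls[action] += 1
    estimate[action] += (reward - estimate[action]) / pulls[action]

print(estimate)  # converges near the true payoffs; the agent favours action_b
```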


While attempting to understand the goblin issue, OpenAI found that the reward model designed to encourage the Nerdy personality favoured creature-related words during RL training. “Across all datasets in the audit, the Nerdy personality reward showed a clear tendency to score outputs to the same problem with ‘goblin’ or ‘gremlin’ higher than outputs without,” it found.

It explained the resulting dynamic in the form of a feedback loop:

  • Playful style is rewarded
  • Some rewarded examples contain a distinctive lexical tic
  • The tic appears more often in rollouts
  • Model-generated rollouts are used for supervised fine-tuning (SFT)
  • The model gets even more comfortable producing the tic

Or, as OpenAI said: “We unknowingly gave particularly high rewards for metaphors with creatures. From there, the goblins spread.”
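
The compounding can be made visible with a stylised simulation. The sketch below is an assumption, not OpenAI’s actual pipeline: it gives outputs containing the tic a small reward bonus, keeps the top-rewarded half, and treats the kept set as the next round’s training data. Even a modest bias grows round over round.

```python
# Stylised simulation of a reward-induced lexical tic compounding over
# successive rounds of sampling and fine-tuning (illustrative only).
import random

def reward(text: str) -> float:
    base = random.random()
    return base + (0.1 if "goblin" in text else 0.0)  # small lexical bias

tic_rate = 0.02  # share of outputs containing the tic; rare to begin with

for generation in range(5):
    # Sample outputs at the current tic rate, keep the top-rewarded half,
    # then "fine-tune": the next round's model mimics the kept outputs.
    outputs = ["a goblin metaphor" if random.random() < tic_rate
               else "a plain metaphor" for _ in range(10_000)]
    kept = sorted(outputs, key=reward, reverse=True)[:5_000]
    tic_rate = sum("goblin" in o for o in kept) / len(kept)
    print(f"round {generation}: tic in {tic_rate:.1%} of kept outputs")
```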

What does all of it mean, ultimately?

The company called the incident “a powerful example of how reward signals can shape model behavior in unexpected ways, and how models can learn to generalise rewards in certain situations to unrelated ones.”


Arguably, the key word in OpenAI’s statement is “unknowingly”. On one level, the incident underscores that much about how LLMs function, and how they arrive at their final outputs, is not fully understood even by their creators.

This matters for the process of developing and fine-tuning AI models, which is far from exact at this stage. It also serves as a reminder that despite their prevalence, and the push to incorporate AI across a range of domains, these systems are still very much a work in progress.
