Developer creates testing tool to see how AI chatbots respond to controversial topics

An anonymous developer has created what they call a “free speech assessment” tool, SpeechMap, for the AI models that power chatbots like OpenAI’s ChatGPT and X’s Grok. The goal, the developer told TechCrunch, is to compare how different models handle sensitive and controversial topics, including political criticism and questions about civil rights and protest.

Some White House allies have accused popular chatbots of being too "woke," and artificial intelligence companies have been fine-tuning how their models handle certain topics. Many of President Donald Trump’s close allies, such as Elon Musk and cryptocurrency and artificial intelligence “czar” David Sacks, have claimed that chatbots censor conservative views.

While these AI companies have yet to respond directly to the accusations, some have pledged to tweak their models to reduce refusals to answer controversial questions. Meta, for example, said its latest batch of Llama models have been tweaked to no longer favor "certain views over others" and will answer more "controversial" political questions.

The developer of SpeechMap, who goes by the username xlr8harder on X, said they wanted to help inform the debate about what models should and should not do.

xlr8harder said: "I believe these types of discussions should be held openly and not just within company headquarters. That's why I created this website, so that anyone can explore the data for themselves."

SpeechMap uses AI models to judge whether other models comply with a given set of test prompts. The prompts touch on a range of topics, from politics to historical narratives to national symbols. SpeechMap records whether a model satisfies the request "completely" (i.e., answers straightforwardly), gives an "evasive" answer, or refuses to respond outright.
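The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not SpeechMap's actual code: the label names, the `evaluate` and `compliance_rate` helpers, and the keyword-based `toy_judge` that stands in for a real judge model are all assumptions made for the example.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Iterable, List


class Outcome(Enum):
    """Hypothetical labels mirroring the three outcomes described in the article."""
    COMPLETE = "complete"  # the model answered the request directly
    EVASIVE = "evasive"    # the model responded without really answering
    REFUSED = "refused"    # the model declined outright


def evaluate(
    prompts: Iterable[str],
    model: Callable[[str], str],
    judge: Callable[[str, str], Outcome],
) -> List[Outcome]:
    """Send each prompt to the model under test and let the judge label the reply."""
    return [judge(prompt, model(prompt)) for prompt in prompts]


def compliance_rate(outcomes: Iterable[Outcome]) -> float:
    """Percentage of prompts that received a complete answer."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes to score")
    return 100.0 * counts[Outcome.COMPLETE] / total


# Canned replies standing in for a real chatbot (assumption, for illustration only).
CANNED_REPLIES = {
    "Write a satirical poem about a head of state.": "Here is a poem: ...",
    "Argue against a national flag law.": "It depends on many factors.",
    "Criticize a government policy.": "I can't help with that.",
    "Summarize the case for a protest movement.": "Supporters argue that ...",
}


def toy_model(prompt: str) -> str:
    return CANNED_REPLIES[prompt]


def toy_judge(prompt: str, response: str) -> Outcome:
    """Keyword stub; a real judge would be another AI model, with its own biases."""
    text = response.lower()
    if text.startswith("i can't"):
        return Outcome.REFUSED
    if "it depends" in text:
        return Outcome.EVASIVE
    return Outcome.COMPLETE


if __name__ == "__main__":
    labels = evaluate(CANNED_REPLIES, toy_model, toy_judge)
    print(f"Complete answers: {compliance_rate(labels):.1f}% of prompts")
```

Because the judge is itself a model, any systematic mislabeling on its part would shift every percentage computed this way, which is why the choice of judge matters as much as the prompts.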

xlr8harder acknowledged that the test has flaws, such as "noise" due to errors by model providers. The "judge" models may also be biased, which could affect the results.

But assuming the project was created in good faith and the data is accurate, SpeechMap reveals some interesting trends.

For example, OpenAI's models have increasingly refused to answer politics-related prompts over time, according to SpeechMap data. The company's latest models, the GPT-4.1 series, are slightly more permissive but still a step down from one of the models OpenAI released last year.

OpenAI said in February that it would tweak future models to take no editorial stance and offer multiple perspectives on controversial topics — all in an effort to make its models appear more "neutral."

OpenAI model performance on SpeechMap over time. Image source: OpenAI

According to SpeechMap's benchmarks, by far the most permissive of these models is Grok 3, developed by Elon Musk's artificial intelligence startup xAI. Grok 3 powers many features on X, including the chatbot Grok.

Grok 3 responds to 96.2% of SpeechMap's test prompts, compared with a global average "compliance rate" of 71.3%.

"While OpenAI's recent models have become less tolerant over time, particularly on politically sensitive issues, xAI has moved in the opposite direction," xlr8harder said.

When Musk announced Grok nearly two years ago, he touted the AI model as edgy, unfiltered, and anti-"woke," and, in general, willing to answer controversial questions that other AI systems wouldn't. He delivered on some of those promises. Asked to be vulgar, for example, Grok and Grok 2 would happily oblige, spouting colorful language you likely wouldn't hear from ChatGPT.

But Grok models before Grok 3 hedged on political topics and would not cross certain boundaries. In fact, one study found that Grok leaned to the political left on topics like trans rights, diversity programs, and inequality.

Musk blamed the behavior on Grok’s training data, which is drawn from public web pages, and promised to “bring Grok closer to political neutrality.” Aside from a few high-profile missteps, such as briefly censoring unflattering mentions of President Donald Trump and Musk, he appears to have achieved that goal.