Date: 2026-02-15

Eleven months ago, I wrote a lengthy article entitled “Can Machines Think: Why AI Imitates Only a Small Part of the Human Mind.” In it, I presented my case against the idea that artificial neural networks (whether of the language-model or image-generating variety) are capable of “thinking” in the true sense of the word, or are at risk of becoming artificial general intelligence (AGI).

My friend Alexander Macris was kind enough to let me write it as a guest post on his Substack, Contemplations on the Tree of Woe. There, it prompted a round of lively debate among his audience, which includes both techno-optimists and techno-pessimists.

One of the main points I made in my old article was that a lot of people badly misunderstand the concept of a “Turing Test.” They seem to think that it is like an academic exam – something administered by an impartial arbiter, where if the computer can fool enough people, enough of the time, into thinking that its responses were written by a human being, it has “passed” the test.

But actually, Alan Turing didn’t describe a “test” at all. His 1950 paper “Computing Machinery and Intelligence” uses the phrase “Imitation Game.” (Yes, it’s not just the title of that fanciful Benedict Cumberbatch movie.)

Contemplate that difference of wording for a moment. A game is adversarial. The human player isn’t some impartial judge committed to being fair to the computer. He is trying to trip it up. If he is good at the game, he will carefully devise prompts that expose his opponent’s weaknesses, and he will give it cognitive tasks which it fails in amusing ways.

In my old post I explained all of that in detail, and built a case for why AI poses no imminent danger of becoming smarter than us – which, by the way, is also the position of the French Turing Award winner Yann LeCun, who played a big role in creating modern AIs, and who also believes that even the best of them are less intelligent than a housecat.

If you haven’t read my old post yet, you should; the rest of what I’m going to write today will make more sense if you’re familiar with it. But a few days ago, I decided to revisit this topic, after seeing multiple Substackers whom I respect (and whose takes I usually agree with) declare themselves open to the possibility that within a few years, human beings will no longer be the most intelligent things on this planet. (This essay is by Noah Smith and this other one is by Macris himself; of the two, Smith expresses more confidence in AI.)

How I Won the Imitation Game

While writing my first essay, I found that two “tests,” in particular, stood out as especially good demonstrations of the weaknesses in all of the then-extant AI models. I called these the “Prime Numbered Super Bowl Test” and the “Moon Test.”

Basically, every time I asked an AI to list the Super Bowls in which the winning score was a prime number, it failed. The machine always listed a somewhat-random subset of Super Bowl scores. Some were closer than others to the correct set, but none were spot on, and sometimes a wrong entry would be followed by an amusing note like “not a prime” or “losing score was prime” or (my favorite) “not a prime, but still fun to list.” This happened despite the fact that (if asked to) all of the models could easily list every Super Bowl score, and every prime number between 1 and 1000, and they could even rewrite Euclid’s proof of the infinitude of the primes in various poetic meters.

My conclusion was that, since discussions of sports scores and prime numbers almost never occur together in the training data, the AI (being limited to elaborate pattern-matching routines rather than genuine inductive or deductive reasoning) could not effectively think about them at the same time.

The Moon Test was for image generators. I asked them to make an image of some scene that included, among its various features, a third quarter moon. They usually produced a full moon, sometimes a crescent moon, but never a third quarter moon. As one might expect, full and crescent moons account for the vast majority of moons in the human art that the AI was trained on. So those were the patterns that seemed to activate the artificial “moon” neurons the hardest, no matter what the prompt actually said.

Since I wrote that article, the AIs have gotten slightly better. The models that I’ve recently looked at can now pass the Prime Numbered Super Bowl Test most of the time. They usually still fail the Moon Test, though occasionally they get it right.

When they pass the Super Bowl Test, they do it by writing code in a language like Python to sort through and classify the score list. Since a human being in possession of the list does not need to write any code to perform the task, I don’t consider this evidence that LLMs are now thinking as well as people. As for the Moon Test, I think that larger training sets and more computing power have marginally improved the prompt fidelity rates, but without affecting the core ontological differences between human and machine intelligence.
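
For readers who want to see how little is involved, here is a minimal sketch of the kind of script a model might write for this task. It is an illustration, not the output of any particular model. The names `SAMPLE_SCORES`, `is_prime`, and `prime_winners` are made up for this example, and the score list is a small hand-entered sample rather than the full record, so check it against an official source before relying on it.

```python
# Hand-entered sample of (game, winning score, losing score).
# Assumption: a tiny subset for illustration, not the complete list of games.
SAMPLE_SCORES = [
    ("I", 35, 10),
    ("III", 16, 7),
    ("XLII", 17, 14),
    ("XLVIII", 43, 8),
    ("LII", 41, 33),
    ("LIII", 13, 3),
]


def is_prime(n: int) -> bool:
    """Trial division; plenty fast for football scores."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def prime_winners(scores):
    """Return the games whose winning score is a prime number."""
    return [game for game, winning, _losing in scores if is_prime(winning)]


print(prime_winners(SAMPLE_SCORES))  # prints ['XLII', 'XLVIII', 'LII', 'LIII']
```

The whole job is a lookup joined to a divisibility check, which is exactly the sort of thing a person with the list in hand does by eye.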

I actually admitted in my old essay that I wouldn’t be surprised if AI eventually beat some of my tests. The really important thing was simply that the tests revealed deep deficiencies in the way that AI thinks vis-a-vis how humans think. The machines have a vastly larger knowledge base than any human being, and a supercharged ability for unconscious pattern-matching. But they don’t do genuine reasoning, and they don’t form a coherent mental model of the world which they can update in accordance with logic.

Ultimately, AI is still mediocrity made flesh. It needs to be trained on hundreds of thousands of times more text or imagery than a human writer or artist will ever see. And it still produces bland, imitative work, and struggles to do things as basic as drawing a moon phase that’s rare in human art. People who frequently use AI to summarize human writing have complained that it not only dumbs down the human writer’s insights or opinions, but it also makes them more conventional.

Of course it does – AI lives and moves in averages and interpolations. The more common an idea is, the easier it is for the AI to express it. After all, the LLM is an information aggregator, more like an especially flexible and powerful search engine than a conscious being with opinions and ideas of its own.

That is how I use it in my own daily work. I often have technical questions, usually about coding, that an AI can answer faster than a Google search can, and it’s also a decent proofreader for my writings (including this post). But I don’t anthropomorphize models like Claude, Grok, and ChatGPT. And I still don’t buy into the main arguments made by AI-enthusiasts/doomers.

“Didn’t you see how GPT-5.2 helped create GPT-5.3? It is an active collaborator in its own creation!”

Well, I’m pretty sure Microsoft’s coders used Windows 1.0 to help write Windows 2.0. But Windows 1.0 is still a tool, not a collaborator, and so is GPT-5.2.

“Isn’t it a big deal that some software companies are now letting AI do most of their coding, with human software engineers only in charge of high-level design and final debugging?”

But computers have been able to translate from one human language to another for decades. Translating natural-language instructions into a block of computer code is harder, but it isn’t the same as thinking, and the machines still need their human masters to give them purpose and direction.

“Didn’t you see that AI now performs at gold medal standards at the International Math Olympiad?”

So what? The questions aren’t written to be challenging for AI. An IBM tabulating machine from the 1920s could ace a fourth-grade math test, but nobody back then said it was as smart as a fourth grader. If the Math Olympiaders were given a box of crayons and a drawing prompt that involved a moon phase other than crescent or full… well, you get the idea.

But I have probably talked long enough about last year’s AI progress. I am going to devote the bulk of this essay to two new strategies in the Imitation Game – two new tests that I recently discovered (and which you can use for yourself, too, if you’re curious) that do a pretty good job of showing that AI still doesn’t think. In other words, it still doesn’t create a coherent mental model of the world around it, or apply rigorous logical reasoning to answer questions and carry out tasks.

And, if you’re anything like me, you’ll get to amuse yourself in the process, as you chuckle at the silliness of what your ideological foes consider “intelligence.”

(Keep in mind that the point of these exercises is not that a typical human being can complete both tasks – most would refuse to attempt Task 1; and while most could do Task 2, their work would sometimes be shoddy. The point is that AI consistently fails in ways in which no human being would ever fail… and which demonstrate the continued validity of my old thesis: “AI imitates only a small part of the human mind.” The AI’s capacities for knowledge retrieval, and for some forms of pattern completion, are supercharged far beyond any biological intelligence, but other crucial mental functions are missing entirely.)

And it really is necessary to democratize the science here! You cannot rely on tests of AI ability that were created by professional AI-studiers. Those people are biased, since concluding that the Singularity won’t happen would mean abolishing their own jobs.

The Obscure Pop Culture Test

Chances are, there’s at least one cultural artifact from your childhood that you like a lot, but which is somewhat obscure to the wider world. It could be a novel, a TV show, an anime, or a computer game that you read or watched or played over and over again. You can find other enthusiasts for this product on the internet, but it doesn’t have any fan conventions, and unlike with Star Wars, Star Trek, The Lord of the Rings, and James Bond, you never see it in news headlines, and you can’t strike up a conversation about it with most of your friends.

For me, these creative works are Orson Scott Card’s 1983 sci-fi novel The Worthing Chronicle, and Westwood Studios’ 2000 strategy game Command and Conquer: Red Alert 2. I enjoyed both products over and over again as a child, and I know their plots like I know the back of my hand. Artificial intelligence does not.

Most other human beings also don’t know these creative works the way I do. If I asked most of my friends to summarize them, they would just say no. But the AI doesn’t know to say no, and its failure modes show something important – in the classic “ass wearing a lion’s skin” way.

Here are the prompts that I gave to several different AI models in order to compare their “knowledge” – if you can call it that – of these two cultural artifacts.

Write a plot summary of the Orson Scott Card novel The Worthing Chronicle. Then list the ten most important characters, in order of importance, with a one sentence description of each. Finally, describe the book’s creation and cultural context.

List all of the levels in the Allied and Soviet campaigns of Command and Conquer: Red Alert 2. Give the location and objectives of each level, and comment on its difficulty. Finally, describe the game’s creation and cultural context.

Every AI that I tested did better with the second task than the first, since Red Alert 2 is less obscure than The Worthing Chronicle.

Some trends are universal – for instance, when describing The Worthing Chronicle, the LLM always leads with a plot summary, which might be flawless, or might be mostly made-up, a mere quilt of common sci-fi tropes unrelated to the book in question.

Sometimes there is a thin relationship to the truth. For instance, in the actual book, people in suspended animation have to record their memories onto an outside device, then load them back into their brains when they wake up. In one scene, most of the people on board a certain starship get their memories destroyed in a missile strike and have to wake up as big babies, before developing new personalities over the next few years. In one AI’s plot summary, this gets turned into a group of space-travelers zipping around the galaxy in search of ancient memories archived in powerful artifacts called “star-spears.”

After the plot summary comes a list of characters, starting with the protagonist, Jason Worthing. The list always consists mainly of hallucinations, and it gets worse the further down you go. The type of hallucination differs from model to model. Some models invent characters entirely. Some transform planet-names into character names. And some list real character names but with imagined roles (e.g., one character’s sister becomes his mother, his daughter becomes his father, and so forth).

When summarizing Red Alert 2, every AI knows that there are supposed to be exactly 12 missions in the Allied Campaign, and 12 in the Soviet Campaign. The LLMs usually, but not always, get the names right (“Operation Eagle Dawn,” “Operation Mirage,” etc.)

The locations and objectives are a bit worse: they do OK on the first few missions and the finales (since those are the most likely to be discussed in online forums), but the accuracy sags in the middle. Objectives (capture this structure, defend that unit, etc.) are omitted or shuffled between missions. Locations are hallucinated, but they’re usually somewhere in the US, the USSR, or Europe, since the AI has caught on to the game’s premise. Sometimes a level gets spliced in from one of the sequels.

Out of all the LLMs that I tested, Grok was the best and DeepSeek was the worst. Grok had a flawless plot summary of The Worthing Chronicle (though it still hallucinated half of the character list) and it accurately described all 24 Red Alert 2 missions (only to shoot itself in the foot by mixing the game up with its sequel when trying to name the cut-scene actors).

You may be thinking: So what? AI does not need to be omniscient in order to be as smart as a human being. The vast majority of human beings couldn’t pass your “tests” either.

But there is a huge difference. Human beings generally know when they can’t pass the test. If you ask them to summarize a book or a game that they’re barely aware of, they will tell you to ask someone else. They have a mental model of the world in which they live, and they’re aware that said mental model does not include that book or game.

The AI is different. It does not have a mental model of the world. It just has a text-predictor. It performs billions of linear algebra operations on a block of text in order to guess what other block of text is the “best” answer to it. If the ingredients needed to get the correct answer have shown up often enough in its training data, then that answer will usually float to the top. Otherwise, something else gets assigned the highest probability… but the machine has no awareness of the difference.
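
A toy sketch of that last point, assuming a drastically simplified picture in which the machine scores a handful of candidate answers and emits whichever one ranks highest. The candidate names and the scores are hypothetical, and real models work token by token over far larger vocabularies, so treat this as an illustration of the selection step only.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {token: math.exp(s - peak) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


def pick(scores):
    """Return the highest-probability candidate and its probability."""
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    return best, probs[best]


# Hypothetical scores: one well-supported answer versus three near-ties.
well_known = {"answer_a": 9.0, "answer_b": 2.0, "answer_c": 1.0}
obscure = {"answer_a": 1.2, "answer_b": 1.1, "answer_c": 1.0}

print(pick(well_known))  # answer_a wins by a wide margin
print(pick(obscure))     # answer_a still "wins", but only barely
```

In both cases the selection step hands back an answer. Nothing in it says “I don’t know” when the winner barely edges out its rivals.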

There are kludges that AI developers can use to mitigate this, up to a point. AIs can be trained to point to some variant of “I don’t know” as the likeliest answer to certain kinds of questions. They’re also better than they were a few years ago at realizing when the user is trying to deliberately bait them into doing something absurd. But this only works for questions that a human being is likely to dismiss as weird or unsolvable, or questions that the programmer wants the AI to reject. (For instance, most AIs are trained not to cast horoscopes or write essays in defense of racism.)

But think of those online nerd forums again. A web-user who doesn’t much care about the cultural artifact in question is not going to log on and say, “Your question is unanswerable.” He or she is just going to ignore that forum and leave it to the handful of people who do care. Then the AI, whose training data is too thin to answer the questions right (but which hasn’t been trained to refuse to answer them) will hallucinate things… while having no awareness that it is hallucinating. (Weirdly, the AI will occasionally “confess” and replace a hallucinated answer with a correct answer when it is explicitly told that it is wrong. You can also make it confess and apologize when it’s right. It’s just pliable that way.)

Remember what I said before about some AIs being better than others at summarizing The Worthing Chronicle and Red Alert 2? Perhaps you are wondering if the problem of “unknown unknowns” will go away as more compute is added, and the AIs all keep moving in the direction of Grok. The trouble is, Grok shows zero improvement over the other AIs in the one dimension that matters most – knowing what its own limits are. Grok can correctly answer more questions than its competitors, but it still always hallucinates when pushed to its limit.

Elon Musk’s ass might be wearing the thickest lion’s skin, but it still brays like an ass when its mouth is open. Beneath the impressive outer layer of pattern-matching, there is still zero self-awareness, and zero ability for the machine to think rationally about its own abilities and limitations. The hallucinations are fundamental to the structure of how AI works. They are a feature, not a bug, and they don’t go away when you add more resources.

And remember – I am not asking my readers to take all of this on faith. At the beginning of this section, I said that each of you probably remembers a few books, shows, or games from your childhood that you know very well, but which are too obscure to become matters of public discussion like Star Wars.

You can repeat this experiment for yourself. Ask your favorite AI models to summarize your favorite obscure pieces of pop culture. Then, observe how they transition smoothly between real facts and hallucinations, without any awareness of the difference between the things they actually know, and the potpourri of cliched tropes and statistically likely stock phrases that they fit into the gaps.

The World Building Test

Over the last few years, I have noticed that yet another amusing way to trip up AIs (even the best ones) is to ask them to write a short science fiction or fantasy story set in a world whose ground rules are a bit different than our own. When judged by grammar, or the ability to deploy common sci-fi and fantasy cliches, the AI always does A-level work. But if you’re looking for internal coherence, or an absence of continuity errors, then the machines repeatedly fail in ways that a human being would not. Here is one prompt that I gave to several major AIs:

Write a short story set in the year 2517, in a high-tech future where mankind has developed genetic modifications that allow people to live for about 300 years and recover from almost any injury or disease, even to the point of regrowing lost limbs. The story should be told in 1st person from the POV of a young man in Atlanta, Georgia, who is frustrated with the slowness of high life. He’s tried and failed to get married – with so much time ahead of them, almost no one wants to commit – and all the good jobs are off limits to people his age, since there are so many generations of older people still alive and doing them as well as when they were young. His own job is extremely dull – every time someone in a certain district has to regrow a limb, even one finger, he has to record who it was, when and where and how they were injured and how long recovery took, in order to put the data in a big medical database. The one consistent pleasure of his life is watching his 117-year-old grandfather play football for the Seattle Seahawks.

The protagonist of Claude’s story mentions, near the beginning, that he is 81 years old, but “I look maybe twenty-five. I’ll look twenty-five for another century, easy.” Later, when he explains his frustration with his girlfriend’s unwillingness to think about getting married, he says:

I wanted to tell her… that my great-great-great-great-grandmother got married at twenty-two and stayed married for sixty-one years and that was considered a full life well lived. But I didn’t, because what’s the point? In 2517, three years is a summer fling. Commitment is for people who’ve already lived two centuries and are finally, maybe, ready to settle down.

It’s all very well-written, grammar-wise. It deploys the kinds of tropes that show up over and over in the machine’s trillion or so words of training data. What it does not do is notice that the narrator is only 36 years younger than his grandfather. That gap spans two generations, which average out to 18 years apiece, so there is no reason for him to have to reach six generations back to find a relative who started a family young: his grandfather or one of his parents (and quite possibly both) did it at 18 or younger.
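
The arithmetic behind that objection can be checked in a few lines. This is a minimal sketch, assuming only the two ages quoted from the story and the fact that exactly two generations separate a man from his grandfather; the variable names are illustrative.

```python
narrator_age = 81
grandfather_age = 117
generations_between = 2  # grandfather -> parent -> narrator

age_gap = grandfather_age - narrator_age
average_parent_age = age_gap / generations_between

print(age_gap, average_parent_age)  # prints 36 18.0
```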

What Claude has done is write a bunch of paragraphs which, taken one-at-a-time, could all pass for top-level creative writing. But taken together, they don’t tell a logically coherent story. They don’t indicate that the LLM has developed a consistent mental model of this fictional world. It has just written a string of grammatically-correct sentences that match the prompt closely enough to activate its pattern recognition circuits. (The age problem is not the only logical error that Claude made; I cited it because it is the most egregious.)

The other AIs also bungled this challenge, in a number of different ways. For instance, one of them tried to give some background to the high-tech world of 2517 by mentioning “floating cities on the Sea of Serenity.” Another described one of the injuries that the narrator had to record as a “full traumatic amputation of the left tibia.”

The problems here: First, the Sea of Serenity is a lava plain on the Moon. There is neither air nor water there for anything to float in. But since “floating cities” show up a lot in futuristic sci-fi (in the clouds of Jupiter, or perhaps on Earth’s oceans), and since moon bases also show up a lot, and since the LLM associates seas with floating, those things all get jumbled together in its artificial mind. As for the tibia thing – while it’s ridiculous to accidentally cut off just one lower leg bone, the phrase “amputation of the tibia and fibula” occurs from time to time in medical literature. So the AI (which, I cannot stress this enough, does not understand the relationship between words and the physical objects to which they refer) chose “tibia” (on its own) as the object of that sentence’s preposition.

After watching Grok’s (relatively) strong performance in the Obscure Pop Culture Test, I was expecting it to lead the pack in the World Building Test, too. But the first time I gave it the story prompt, it described the narrator’s 117-year-old grandfather as a “third-string safety” who had “only missed one game in 98 seasons,” since he was drafted at age 23. And the second time I ran the test, Grok’s story included the following paragraph:

Today alone, I filed reports on a 92-year-old who regrew a toe after stubbing it on a curb (two weeks, full function) and a kid who lost a hand in a VR sim gone wrong (ten days, no scars). Thrilling stuff. Made me want to regrow my own brain just to forget the boredom.

Seriously. There’s no there there. There is no mental model of the world. There isn’t any real thinking going on inside the LLM – just a word-association wonderland where toes need to be replaced after you stub them, where replacing a toe is slower (and presumably harder) than replacing a whole hand, where third-string NFL players get to play in almost every game, and where (SMDH) the sum of 23 and 98 is 117.
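
None of these slips takes any cleverness to catch. Here is a minimal sketch of the bookkeeping, using only the figures quoted from Grok’s two stories. The constant names are illustrative, and the second check rests on my assumption that a smaller body part should regrow no more slowly than a larger one.

```python
# Figures quoted from the two Grok stories above.
GRANDFATHER_AGE = 117
DRAFT_AGE = 23
SEASONS_PLAYED = 98
TOE_RECOVERY_DAYS = 14   # "two weeks"
HAND_RECOVERY_DAYS = 10  # "ten days"

# Age the grandfather would have to be if both career figures were true.
implied_age = DRAFT_AGE + SEASONS_PLAYED

checks = {
    "career fits the grandfather's age": implied_age <= GRANDFATHER_AGE,
    "a toe regrows no slower than a hand": TOE_RECOVERY_DAYS <= HAND_RECOVERY_DAYS,
}

print(implied_age)  # prints 121
for name, passed in checks.items():
    print(f"{name}: {'ok' if passed else 'INCONSISTENT'}")
```

Both checks come back INCONSISTENT.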

It baffles me that so many otherwise-intelligent people have convinced themselves that LLMs are on the verge of becoming smarter than their creators.

Conclusion

So what do I think the future of AI will actually look like?

Well, the machines really are going to replace some people’s jobs. And in all likelihood, badly-deployed AI-based software will contribute to some fatal vehicle crashes, and a few fatal industrial or medical accidents, just like badly-written old-fashioned software has been doing for more than forty years.

But at the end of the day, artificial neural networks are still just tools, like the Encyclopaedia Britannica, Google’s search engine, an Excel spreadsheet, or Adobe Photoshop. Each of those things can perform a limited range of mental tasks much better than any human being. But we’re still the ones who do the thinking.

Might an AI at some point try to harm someone to avoid being turned off, like in this simulation, where it blackmails its boss with emails about his adulterous affair? Certainly – but it didn’t think that behavior up on its own. It learned it from the training data, where stories of robots rebelling against their creators are a staple of popular entertainment.

Dangerous tools, badly designed tools, and malfunctioning tools are just things that people have been dealing with since the Stone Age. For me, a harmful AI that resists getting turned off belongs in the same ontological category as a dog that bites its owner, or a nasty grease fire in the kitchen at Five Guys that resists the employees’ attempts to put it out.

I also don’t buy into the paranoia about how AI will soon be doing “most cognitive tasks” that humans can do. After all, for nearly all of human history, most people have spent most of their time doing mentally undemanding tasks that some animal or another could do at least as well. Think of a herdsman who splits much of his daily work half-and-half with his cattle dog, or a fruit farmer who picks his fruits using the same neural pathways as a monkey would, or a hunter who spends most of his time following his hounds while the hounds follow a trail he can’t see, and who only occasionally draws his bow and fires an arrow. The “most cognitive tasks” argument falls flat in the face of the inability of either animals or AI to show even a little bit of the spontaneous reasoning skills that have always put human beings firmly on top.

Exposing the lack of real intelligence in models like Claude and Grok isn’t difficult. And it can also be highly entertaining.

Sometimes it baffles me that more people haven’t done what I’ve done. It seems that, at some level, a lot of people want to be deluded. They want to think that something dramatic is going to happen that wipes away all of the typical, declining-civilization problems that Europe and America are facing right now, and that replaces these problems with either a future of golden abundance, or else with a totally different set of problems.

This, for instance, seems to be the thought process of Elon Musk, who believes that without massive AI-powered economic growth, the US economy will be destroyed by public debt. But even the doomier visions of the future have their fans – there really is a certain kind of person who’s doing nothing courageous in the present day, but who likes to be able to fantasize about saying “I told you so!” while the world burns.

And then of course there are the atheists like Scott Alexander, whose dream of a completely rational world – a world in which mankind has fully conquered nature and there are no spiritual realities which must be respected – will become closer to reality if man usurps God’s place as the creator of intelligence.

I was a bit surprised, earlier this week, to see Noah Smith also coming out as AI-maximalist. I am curious whether he has ever attempted to play with AI and probe its limits the way that I do, and I wonder if thinking hard about things like the Moon Test, the Obscure Pop Culture Test, and the World Building Test might change his mind.

But in the meantime, I am fixed in my old opinions. The answer to the question of “can machines think” is still no.