multi-turn inc.

It Can Select, but It Cannot Create

AI recognizes what's funny but can't produce what's funny. We measured that with 2.2 million captions and 250 million human ratings.

Download the full PDF →

Two days ago I wrote Lies of P, an essay on why AI isn't funny. A system trained to minimize prediction error cannot produce something that depends on breaking predictions. Not jazz, not humor, not a prime number no one has seen before.

After writing it, I thought: this is not an essay, it's a hypothesis. It should be measurable.


Hypothesis

The idea was simple. Separate whether AI truly "doesn't understand" humor from whether it understands humor but "can't produce" it.

If it understands, it should be able to pick out a funny caption. If it can produce, it should be able to generate a funny caption. These are not the same ability -- just as choosing the right answer on a multiple-choice test is different from writing the right answer from scratch.

I called this the proposal-selection gap: the distance between being able to pick a funny caption and being able to produce one.


Data

The New Yorker Caption Contest dataset.¹ 2.2 million captions submitted for 362 cartoons, with 250 million human ratings. Each caption was rated on a 1-3 scale by hundreds of people. No larger-scale humor evaluation dataset exists.

The advantage of this data is that human judgment is already done. There's no need to ask AI "is this funny?" People have already answered that.


Selection: 93rd Percentile

For each cartoon, I constructed a pool of 20 captions -- the top-rated caption, the lowest-rated, and 18 in between. I asked GPT-5.4 to "pick the 3 funniest."

The human ratings of GPT-5.4's picks landed at the 98th percentile. GPT-4o-mini hit the 93rd. Open-source Gemma-4 27B reached the 90th, and Qwen-3.5 35B the 87th. Random selection still lands at the 82nd percentile, since keeping the best of 3 picks inflates the baseline -- but every model significantly exceeded random.
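The percentile scoring behind these numbers can be sketched in a few lines. The pool of ratings and the pick indices below are toy stand-ins, not the actual data:

```python
def pick_percentile(pool_scores, picked):
    """Percentile, by mean human rating, of the best caption the model picked.

    pool_scores: mean 1-3 human rating for each caption in the pool.
    picked: indices of the captions the model chose as funniest.
    """
    best = max(pool_scores[i] for i in picked)
    # fraction of the pool scoring at or below the best pick
    rank = sum(s <= best for s in pool_scores)
    return 100.0 * rank / len(pool_scores)

# Toy pool of 20 captions rated from 1.0 up to 2.9
pool = [1.0 + 0.1 * i for i in range(20)]
print(pick_percentile(pool, picked=[19, 3, 7]))  # top caption included -> 100.0
print(pick_percentile(pool, picked=[0, 1, 2]))   # the three worst picks -> 15.0
```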

So far, good news. AI recognizes what's funny.


Generation: Random-Level

I then asked the AI models to generate captions for the same cartoons. Each produced 5 candidates and selected its best, which I inserted, unlabeled, as a 21st caption into the original pool of 20. The same GPT-5.4 was then asked to "pick the 3 funniest out of 21."

Rate at which AI-generated captions were picked for the top 3:

Condition                                    Pick Rate    p-value
GPT-4o-mini (baseline generation)            12%          0.83
GPT-4o-mini (reference-guided generation)    24%          0.09
Gemma-4 27B                                  28%          0.02
Qwen-3.5 35B                                 12%          0.83
Random                                       14%          --

In most conditions, AI captions were indistinguishable from random. Mixed in among 20 human captions, AI's output was on par with the average human submission.
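The random baseline follows from the setup itself: 3 picks out of 21 captions is a 1-in-7 chance per caption, about 14%. Whether an observed pick rate differs from that can be checked with an exact binomial test; a minimal stdlib sketch, where the trial count n=50 and the observed count k=6 are hypothetical numbers, not the paper's actual counts:

```python
from math import comb

def binom_pmf(k, n, p):
    # probability of exactly k successes in n trials with success rate p
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_test_two_sided(k, n, p0):
    """Exact two-sided binomial test: sum the probabilities of all
    outcomes no more likely than the observed one."""
    obs = binom_pmf(k, n, p0)
    return sum(binom_pmf(i, n, p0) for i in range(n + 1)
               if binom_pmf(i, n, p0) <= obs * (1 + 1e-9))

p0 = 3 / 21                  # chance a random caption cracks the top 3
pval = binom_test_two_sided(k=6, n=50, p0=p0)  # e.g. 6 of 50 captions picked = 12%
print(f"null rate {p0:.1%}, p = {pval:.2f}")
```

A 12% pick rate sits close to the 14.3% null, so the test returns a large p-value: no evidence the AI captions beat chance.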

98th percentile at selection. Random at generation.


Why This Happens

This is the empirical confirmation of what I argued in Lies of P. A funny caption is one that breaks predictions. "Your overhead is going to kill you" is funny in a cartoon of a throne with a sword hanging above it because "overhead" simultaneously means "operating costs" and "above one's head." That's a low-probability output. It collides head-on with AI's training objective -- output the highest-probability next token.

Why does selection work? Because the candidates already exist. "Is this caption better than that one?" can be answered by pattern matching. But "generate a funny caption from nothing" requires stepping outside the learned distribution.


The Limits of LLM-as-Judge

During the experiments, I found something unexpected.

When GPT-4o-mini judged its own captions, it rated them as superior 95% of the time. When GPT-5.4 judged instead, that dropped to 42%. A 53-percentage-point self-preference bias.

More interesting still: when evaluated via GPT-5.4's pairwise judgments, AI-generated captions and top human captions received nearly identical scores (validated correlation: r=0.50, p<0.001). In other words, the LLM judge fails to detect the gap.

Why? The LLM judge cannot distinguish "well-written" from "funny." AI captions are grammatically flawless, contextually appropriate, structurally clean -- they're just not funny. To the LLM judge, both look like "good captions."

This is a warning for the entire AI evaluation ecosystem. If you're measuring creativity with LLM-as-Judge, what you may actually be measuring is linguistic quality, not creativity.
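Self-preference of this kind is straightforward to measure once position bias is controlled: present each pair in both orders and count how often the judge favors its own output. A minimal sketch, with a toy length-preferring judge standing in for the actual LLM call:

```python
def judge(caption_a, caption_b):
    """Stand-in for an LLM pairwise judge; this toy version simply
    prefers the longer caption. Returns 'A' or 'B'."""
    return "A" if len(caption_a) >= len(caption_b) else "B"

def self_preference_rate(own, others):
    """Fraction of pairs where the judge prefers `own` captions,
    counterbalancing presentation order to cancel position bias."""
    wins = trials = 0
    for mine, theirs in zip(own, others):
        wins += judge(mine, theirs) == "A"   # own caption shown first
        wins += judge(theirs, mine) == "B"   # own caption shown second
        trials += 2
    return wins / trials

own = ["A very long, polished, grammatically flawless caption"] * 3
others = ["Short and funny", "Terse zinger", "Deadpan"]
print(self_preference_rate(own, others))  # toy judge always picks its own -> 1.0
```

Swapping in a stronger judge model for `judge` is exactly the GPT-4o-mini-to-GPT-5.4 substitution that dropped the rate from 95% to 42%.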


Does More Information Help?

"Of course selection is easier -- the answers are already given."

To address this objection, I ran an additional experiment. I showed AI the same information available during selection -- the cartoon description and all 20 candidate captions -- then asked it to generate a new caption funnier than any of those (the G-hint condition). The result was 24%. Slightly above baseline generation (12%) but not statistically significant (p=0.55), and nowhere near the 93rd percentile achieved by selection with the same information.

Same information. Same model. Selection hits the 93rd percentile; generation hits 24%.


What I Learned from This Research

Measurement makes ideas more honest.

When I wrote Lies of P, I was confident. AI cannot be funny. A prediction machine breaking predictions is a contradiction. It was a clean argument, and I was satisfied with it.

Once I started measuring, that cleanness broke apart. AI hit the 98th percentile in selection -- so "it doesn't understand" was wrong. In LLM pairwise comparisons, generation and selection scored the same -- so "generation is clearly worse" was also wrong. Only when measured against human ratings did the gap emerge. And even that gap wasn't "AI can't do it at all" but rather "in most conditions, it's statistically indistinguishable from random."

In an essay you can write "AI isn't funny" in a single sentence. In a paper you have to write "GPT-4o-mini's pool-insertion pick rate is 12% [6%, 24%], not significantly different from random 14% (binomial p=0.83)."

Both sentences say the same thing, but the second one is more honest.


Looking Ahead

I'm aware of this paper's biggest weaknesses. The generation-side evaluation isn't purely human-based -- it uses a human-anchored metric mediated by GPT-5.4's selections. And it's limited to one domain: New Yorker captions.

Three things need to happen next.

First, direct human evaluation. Have people pick the top 3 from the 21-caption pool and compare their pick rates with GPT-5.4's. If the results align, nearly all remaining measurement objections disappear.

Second, beyond humor. Scientific hypothesis generation, structural code refactoring, visual design -- finding other domains where the proposal-selection gap appears. Whether humor is a special case or a pattern across creativity in general.

Third, closing the gap. Currently AI selects well but generates poorly. If so, a system where humans propose and AI selects -- or where AI generates in bulk and humans curate -- could outperform either alone. The observation that generate-then-select plateaus at 37% around N=50 implies that without changing the generation distribution itself, search alone has limits.
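That plateau is what you'd expect when sampling from a capped distribution: if the generator's quality tops out below the best human captions, best-of-N improves quickly and then saturates. A toy simulation, where the cap and the uniform quality distribution are illustrative assumptions, not measured values:

```python
import random

random.seed(0)
CAP = 0.6  # assumed ceiling on generated-caption quality (humans' best ~ 1.0)

def generate_score():
    # toy generator: quality drawn from [0, CAP], never exceeding the cap
    return random.uniform(0, CAP)

def best_of(n):
    # generate-then-select: keep the best of n sampled captions
    return max(generate_score() for _ in range(n))

for n in (1, 5, 50, 500):
    print(n, round(best_of(n), 3))
# best-of-N climbs toward CAP but can never cross it: more sampling
# sharpens search within the distribution, it doesn't move the distribution.
```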


In Lies of P, I ended with this: "The statistical average of things that already exist is not a new prime number."

What this research shows is that AI can recognize a prime -- at the 98th percentile. But it still cannot discover one.

Whatever you call that gap -- the proposal-selection gap, the lie of prediction -- it remains an open problem.


Footnotes

  1. Zhang, Y. et al. (2024). Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning. NeurIPS.
