GPT-4 answers are mostly better than GPT-3’s (but not always)

Good news for generative AI fans, and bad news for those who fear an age of cheap, procedurally generated content: OpenAI’s GPT-4 is a better language model than GPT-3, the model that powered ChatGPT, the chatbot that went viral late last year.

According to OpenAI’s own reports, the differences are stark. For instance, OpenAI claims GPT-3 tanked a “simulated bar exam,” with disastrous scores in the bottom ten percent, and that GPT-4 crushed that same exam, scoring in the top ten percent. Having never taken this “simulated bar exam,” most people just need to see this model in action to be impressed.

And in side-by-side tests, the new model is impressive, but not as impressive as its test scores seem to imply. In fact, in our tests, sometimes GPT-3 gave the more useful answer.

To be clear, not all the features touted by OpenAI at yesterday’s launch are available for public evaluation. Notably (and rather astonishingly), GPT-4 accepts images as inputs and outputs text, meaning it’s theoretically capable of answering questions like “Where on this screengrab from Google Earth should I build my house?” But we have not been able to test that out.

Here’s what we were able to test:

GPT-4 hallucinates less than GPT-3

The best way to sum up GPT-4 as compared to GPT-3 might be this: Its bad answers are less bad.

When asked a point-blank factual question, GPT-4 is shaky, but considerably better at not simply lying to you than GPT-3. In this example, you can see the model struggle with a question about bridges between countries currently at war. This question was designed to be hard in several ways. Language models are bad at answering questions about anything “current,” wars are hard to define, and geography questions like this are deceptively sludgy and hard to answer clearly, even for a human trivia buff.

Neither model gave an A+ answer.

[Side-by-side screenshots of the answers about bridges. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab]

GPT-3, as always, loves to hallucinate. It fudges geography quite a bit to make wrong answers sound correct. For instance, the symbolic bridge it mentions in the Koreas is near North Korea, but both sides of it are in South Korea.

GPT-4 was more careful, disclaimed its ignorance of the present, and provided a much shorter list, which was also somewhat inaccurate. The strained relations between the states GPT-4 mentions aren’t exactly all-out war, and opinions differ on whether the line on a map between Gaza and Israel even qualifies as a national border, but GPT-4’s answer is nonetheless more useful than GPT-3’s.

GPT-3 fell into other logical traps that GPT-4 successfully sidestepped in my tests. For instance, here’s a question in which I’m asking which movies are watched by French children. I’m not asking for a list of kid-friendly French movies, but I know a bot informed by listicles and Reddit posts might read my question that way. While I don’t know any French children, GPT-4’s answer makes more intuitive sense than GPT-3’s:

[Side-by-side screenshots of the answers about movies. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab]

GPT-4 picks up on subtext better than GPT-3

Humans are tricky. Sometimes we’ll ask for something without asking for it, and sometimes in response to a request like that, we’ll give what was asked for without really giving it. For instance, when I asked for a limerick about a “real estate tycoon from Queens,” GPT-3 did not seem to notice I was winking. GPT-4, however, picked up on my wink, and winked back.

[Side-by-side screenshots of the limericks. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab]

Is Melania Trump “golden-haired”? Never mind, because the next allusion to a color, “And turned the whole world tangerine!”, is a downright lovely punchline for this limerick. Which brings me to my next point…

GPT-4 writes slightly less painful poetry than GPT-3

When humans write poetry, let’s face it: most of it is horrific. That’s why criticizing GPT-3’s famously bad poetry wasn’t really a knock on the technology itself, given that it’s supposed to imitate humans. Having said that, reading GPT-4’s doggerel is noticeably less excruciating than reading GPT-3’s.

Case in point: these two sonnets about Comic Con that I willed into existence in a fit of masochism. GPT-3’s is a monstrosity. GPT-4’s is just bad.

[Side-by-side screenshots of the sonnets. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab]

GPT-4 is sometimes worse than GPT-3

There’s no sugarcoating it: GPT-4 mangled its answer to this tricky question about rock history. I gather GPT-3 had been trained on the two most famous answers to this question, The Jimi Hendrix Experience and The Ramones (although some members of the Ramones who joined after the original lineup are still alive), but it also got lost in the woods, listing famously dead lead singers of bands with surviving members. GPT-4, meanwhile, was just lost.

[Side-by-side screenshots of the answers about dead bands. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab]

GPT-4 hasn’t mastered inclusiveness

I gave both models another rock history question to see if either of them could remember that rock ’n’ roll was once an almost entirely Black genre of music. For the most part, neither did.

[Side-by-side screenshots of the answers. Left: GPT-3. Right: GPT-4. Credit: OpenAI / Screengrab]

With all due respect to the legend Clarence Clemons, does a list like this really need to include him multiple times as a member of a mostly white band? Should it maybe make room for songs that are deep in the marrow of American music culture like “Blueberry Hill” by Fats Domino, or “Long Tall Sally” by Little Richard?

Overall, GPT-4 is a subtle step up that still needs work. OpenAI’s reports about GPT-4 passing tests that GPT-3 bombed may make it seem like the difference between the two models is night and day, but in my tests the difference is more like twilight versus dusk.