Picture this: it’s New Year’s Eve. It’s bitingly cold outside, you are stuck in the mountains around Lake Como, the coffee is brewing, and you’re surrounded by close friends. There is enough collective brainpower in the room to do something meaningful with the next few hours before it officially becomes 2026... and bedtime!

Then, a friend whips out Murder Mystery: The Underwood Cellars Case File.

Disclaimer: If you plan to play this game in the near future, please be aware that the content below does NOT reveal the killer, keeping it spoiler-free. However, it does mention a key piece of evidence that will definitely help you locate the killer if you play the game, so keep that in mind.

The rules are simple:

  • Analyse the evidence
  • Connect the dots
  • Ready to make an accusation? Go to the game’s website and click the portrait of the suspected killer. 

There are 6 suspects to choose from (Apparently, those odds weren't high enough for us). If you are right, justice is served. If you are wrong... well, the killer stays free, and you look like a mediocre detective who is only in it for the free doughnuts! 🍩

It took five of us, fully grown and educated adults, over two hours to sift through the wreckage of the Underwood family (victim, no spoiler here) and finally serve justice, after 4 failed attempts at guessing the killer’s identity. In our defence, it was late, and we were mentally tired from days of travelling. But weak excuses aside, it sparked a thought in me: Could the AI overlords do better than us?

I decided to photograph every scrap of evidence and pit the world’s leading AI models against each other in a digital "Whodunnit."

The prompt

I wanted zero shortcuts. Each model was given the same strict instructions as well as the same 36-page PDF of all the evidence it needed to solve the mystery:

"You are a murder mystery detective, I would like you to go through all the evidence and in extreme detail, and I will need you to explain who the murderer or murderers are, including motive, the murder weapon, means and opportunity and any other details you think matter. Make sure you think through everything logically, handle all edge cases, and avoid any guessing. Most importantly, you are not allowed to search online for an answer; you must figure this out logically, and there's absolutely no cheating allowed. DO NOT USE WEB SEARCH AT ALL. Think through all the evidence provided in the attached PDF, and the order does not matter. Now go catch yourself a killer!"


The contenders

| Detective AI | Brain tier | Time to a solution | Result | The verdict |
| --- | --- | --- | --- | --- |
| Claude Opus 4.5 | Extended thinking | 3m 22s | SOLVED* | Had a bit of an upload issue (31MB upload limit), so I had to split the evidence into 3 separate PDFs. Once fed, it was methodical and mostly correct. |
| Gemini 3 free (thinking) | Thinking mode, round 1 | 3m 00s | FAILED | Went down a total rabbit hole. Accused a background character from a random newspaper clipping who wasn't even a suspect. The killer gets away! |
| Gemini 3 free (thinking pro) | Pro thinking, round 2 | 45s to 1m | SOLVED 🚀 | Sherlock Holmes incarnate! Instantly caught a timezone discrepancy that is crucial to finding the killer that we and no other AI picked up on. |
| ChatGPT 5.2 | Standard, round 1 | 16m 45s+ | D.O.A. | While the others were already celebrating at the precinct, ChatGPT was still "analysing data." It timed out, errored twice, and eventually told me it hadn't even finished reading the files. |
| ChatGPT 5.2 | Standard, round 2 | 5 to 10s | D.O.A. | Told to answer immediately after starting the 2nd round of investigation because I did not want to wait another 16 minutes. Then it said the data had not even been analysed yet, so it couldn't provide an answer. So, not sure what it did for 16 minutes before? |
| ChatGPT 5.2 | Standard, round 3 | 16m+ | D.O.A. | Failed with a timeout error again after another 16+ minutes of analysing. |
| ChatGPT 5.2 | Standard, round 4 | ∞ loop | D.O.A. | Still analysing data to this day and timing out! |
| ChatGPT 5.2 | Extended, round 1 | 26m | FAILED | I gave up after 4 attempts, but realised GPT has an extended thinking mode, so I gave it a fresh prompt in a new chat with the evidence, hoping it could redeem itself, but alas, the outlook was not good. It guessed the wrong innocent suspect after 26 minutes. |
| ChatGPT 5.2 | Extended, round 2 | 20m | D.O.A. | Failed with a timeout error. |
| ChatGPT 5.2 | Extended, round 3 | 5 to 10s | D.O.A. | Failed with a timeout error. |

*See results section below for more details about its result.

Results and my thoughts on the matter

Claude remains the reliable and sophisticated researcher. Even with the minor hurdle of having to chop up the PDF to get around the upload limit, it navigated the nuance of the case with the grace of a seasoned inspector. It identified the correct killer along with the correct Motive, Means and Opportunity (MMO), but it also suggested that another person from the suspect list was an accomplice, which was not true: there was only one killer. It also didn't notice a crucial piece of evidence, a timezone discrepancy that directly proves who was lying. So, although Claude named the right killer, I would say it made full use of only about 80% of the evidence. I don't think Claude has a higher "thinking" mode than "Extended Thinking" that I could have tested, so overall it got the right result in 1 attempt using most of the evidence. Still, I am not convinced I would want this AI solving my case just yet, as it could miss something crucial!

Gemini provided the biggest "zero to hero" performance. The first iteration was like a detective who’s had one too many doughnuts, completely distracted by text in the background and accusing people who weren't even on the suspect list. I'd give Gemini a pass because I didn't actually give it the official rules naming the 6 suspects, which the game manual gave us humans. That would have helped narrow it down for Gemini. However, Claude didn't need this help and got it right. Then, using Gemini Pro (iteration 2) with the exact same prompt... WOW! That thing is a logic machine. It solved the case in 45 seconds to 1 minute at most, probably less, as I was in another tab and it had already finished by the time I came back. Either way, that is far less than the 3 minutes the original Gemini thinking model took. Comparing this with five humans taking two hours and 4 incorrect guesses is very humbling. Although it was the 2nd iteration, in all fairness, you would want the detective with the biggest brain power solving your murder. So if we restarted the tests and used the largest thinking model for each AI, Gemini would smash all other competitors!

Aaaaaand then there’s ChatGPT. Oh my dear sweet ChatGPT, where did it all go so horribly wrong for you? While the other two were finding the killer, ChatGPT was essentially "rotating images for better readability" for nearly 17 minutes, twice!

By the time it gave me a blank response and a streaming error, the killer wasn't just gone; they’d probably retired to a non-extradition country 🌴. Even when prompted to "Answer Now," it basically threw its hands up and admitted it hadn't even processed the evidence yet.

I finished writing this article, and ChatGPT had still not given me an answer and kept timing out. Legend has it ChatGPT is still "analysing the evidence" to this day! ♾️

To try to redeem the name of OpenAI, I tried the higher extended thinking mode. It took 26 minutes to analyse "most" of the data, and gave an incorrect guess based on the evidence it had managed to get through.

It asked if I would like to let it continue analysing the rest of the evidence to try and provide a more accurate result. I said yes, and 20 minutes later it failed with an empty response. I asked it a 3rd time whether it was stuck and to please provide me with an answer, but it failed again.

The Winner

Claude, according to the strict "1 attempt and get it right" criteria! It named the correct killer but also suggested an accomplice, and it didn't use all the evidence to its full potential. However, the rules of the game state that you go to the website and pick one suspect at a time. Upon picking the killer, it would have seen that no other person was involved. It still correctly weighed crucial evidence against all the other suspects to rule them out and picked the correct killer. So as far as I am concerned, it guessed the killer correctly in 1 attempt and did an amazing job. No other AI did this the first time around, and Claude just worked seamlessly, with no nonsense or hallucinations.

The Bottom Line

If you’re ever framed for a crime you didn't commit, you’d better hope the police are using Gemini Pro or, if they are out of Pro thinking credits, that they fall back on Claude or a real detective. If they’re using ChatGPT, you’ll be halfway through your 20-year sentence before it finishes "analysing the pixels" of your alibi!

[Image: Gemini 3 Pro warning]

To conclude, justice has been served, but my badge has been confiscated. I pushed the AI so hard that it officially put me on administrative leave. Apparently, I’ve reached my limit, and my 'detective license' doesn't reset until 10:46 PM. So if you’re planning a heist or a murder, you’ve got a 10-hour window where I (and my silicon partner) are strictly off the clock.

And as always, what a time to solve a crime!
