StatistaGilfix: Film Critics Hate This One Weird Trick: Finding Movies in the Age of AI

I’m back! After an outpouring of concern from my thoughtful and numerous readers, I realized it was time to return to my true calling: writing random stuff on the internet for no one to read.

What’s kept me busy since the last time you heard from me? Well, I’ve picked up a bit of a cinema addiction, having seen a slightly embarrassing 421 movies since my last post 18 months ago.

And while I’ve been glued to the screen, the world’s been buzzing about LLMs. I’ve always followed AI developments pretty closely and have done some AI-adjacent work as a data scientist, but I’ve yet to find the right way to bring LLMs into my life. As they get better and more useful in personal and career life, it’s time to get stuck in. And the natural starting point seems to be testing LLMs’ movie chops.

So what’s our mission here? Should we choose to accept it, we want to see what it's like to interact with LLMs, and which ones are the most fun and useful to talk to about movies! And ultimately we want to know if they can successfully make personalized recommendations after being given a list of favorite movies.

Meet the Contenders

First impressions: Woah. I am seriously impressed! These guys can really take in and spit out a lot of information. And they’re also fun to talk to! I’ve never felt good about my writing ability, but after working with these LLMs I certainly feel inadequate.

I gave all 3 contenders (Twitter/Elon’s Grok, OpenAI’s ChatGPT, Anthropic’s Claude) a list of 6 of my favorite movies and asked for some recommendations. First, they all came back with a list of common traits.

Grok:

GPT:

Claude:

Nothing too impressive here, but still useful and interesting to see. Kinda weird how some of the phrasing is identical (“whimsical storytelling”, “visually distinctive”) in multiple LLMs. I guess that makes sense if they’re pulling from the same online sources, but it’s still weird. And while this isn’t THAT impressive, it’s also not THAT trivial to make a list like this! If you didn’t know anything about these movies and had to search around to come up with a list, it would probably take at least 15 minutes and I’m sure you wouldn’t come up with such a concise description.

The initial recommendations were both quite solid and quite similar!

Grok:

4 out of 6 of these are movies I have either already seen and liked or am quite confident I will really like. Good job, Grok.

GPT:

A solid list, though I’m less impressed. A couple of these seem Slow Cinema adjacent, which is definitely not my thing. The Third Man is also a wild choice. It's a good movie, but not similar to my favorites, and comparing it to The Best Years of Our Lives doesn’t make any sense.

Claude:

5 of these are in the other lists too! They’re definitely good choices…but perhaps it's pretty easy to pick out these very obvious ones. Most interestingly, AMELIE 2! You’d think I would know if my favorite movie had a sequel…

Amelie 2: Coming to a theater near you (these posters are as real as Amelie 2)

SIDENOTE: I think there are much better AI image generation tools than these LLMs, but these attempts are hilarious. Grok’s initial output to the query “Can you make a movie poster for Amelie 2: Le Fabuleux Destin de Nino?” was literally just a headshot of a guy; fully unrelated to movie posters or Amelie. After I shamed it “that’s terrible! Do you know what a movie poster looks like?”, it came up with the Dora-esque poster in the top right. Claude seems to be most limited in its image generation, and came up with the incredible Microsoft Painty poster in the bottom left. It then had the audacity to claim the poster includes “A subtle Eiffel Tower silhouette in the background” and “Nino's bicycle as a key visual element”, and that “The design still maintains the dreamy, romantic aesthetic of the original Amelie film while providing more recognizable elements”. Google's Gemini in the bottom right actually looks pretty decent!

All 3 LLMs ended up hallucinating once, but it’s pretty insane to eff up the easiest question I ended up asking! But hey, at least it fessed up:

Apology not accepted

Recommender Systems

So all 3 LLMs came up with good lists, but these are popular movies that I already suspected I would enjoy. Should I be impressed? Well, let’s see how these suggestions compare to a Movie Recommendation Engine. In theory, a Recommender System should be a lot better at this than a general purpose LLM. LLMs are set up to be generalists, while Recommender Systems specialize in this one task.

Well…I tried the 2 most popular recommender systems and they sucked harder than Ben Simmons trying to score.

First, MovieLens. I put in 37 ratings from my Letterboxd (including all 6 above) and let it cook:

Very unimpressive! This seems to be putting way too much weight on IMDB ratings. This is the opposite problem from what I observed later on with the LLMs, who focused almost exclusively on the movie’s style and theme and didn't care about how well received it was.

I clicked “suggest less popular” movies, and after doing that a few times I got this list:

I haven’t heard of any of these… but this list stinks! For the most part (exception: Giuseppe Makes a Movie sounds fun), these movies don’t line up with anything I like! And they have mediocre-at-best ratings on IMDB and Letterboxd.

MovieLens gets an F rating, but surely a good recommender should be doing WAYYYY better than this. So I’m just gonna ignore these results.

Next I tried Criticker:

Hot Stinking Smelly Garbage again! To Have and Have Not is a fine movie, but this list is so bizarre.

Well, I guess I’m forced to conclude that online recommender systems truly are as bad as Ben Simmons passing up a dunk in Game 7. I’m sure a competently made system (like Netflix has) would do way better than these, but those don’t seem to exist outside of the streaming platforms.

So even though it seems like an easy task to recommend a handful of popular movies, LLMs are the only method that does it well!

Deeper Cuts

But still…the movies recommended by the LLMs felt so obvious to me. Can it find good movies that I haven’t already heard of? Bear in mind that my watchlist is almost 1,500 movies long, and I’ve probably heard of a couple thousand more movies that aren’t on there. So this is a difficult task.

It took a couple iterations of asking for more obscure movies to get truly esoteric lists. I was specifically looking for movies that fewer than 3 people that I follow on Letterboxd have rated. That means really obscure! That probably rules out something like 40,000 movies (I follow ~ 15 people that have seen at least 5,000 movies).
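That obscurity rule is simple enough to write down. Here's a minimal Python sketch, assuming you somehow had each followed account's rated titles sitting in a dict. Letterboxd doesn't hand you the data in this shape, and `obscure_only`, `followed_ratings`, and the placeholder titles are all made up for illustration.

```python
from collections import Counter


def obscure_only(candidates, followed_ratings, max_raters=2):
    """Keep candidates rated by at most `max_raters` of the accounts I follow."""
    # Count each title once per account, even if an account lists it twice.
    rater_counts = Counter(
        title for rated in followed_ratings.values() for title in set(rated)
    )
    return [title for title in candidates if rater_counts[title] <= max_raters]


# Hypothetical data: three followed accounts and the titles they have rated.
followed_ratings = {
    "account_1": ["Movie A", "Movie B", "Movie C"],
    "account_2": ["Movie A", "Movie C"],
    "account_3": ["Movie A", "Movie D"],
}
candidates = ["Movie A", "Movie B", "Movie E"]

# "Fewer than 3 people" means at most 2 raters, hence the default.
obscure = obscure_only(candidates, followed_ratings)
print(obscure)
```

Here "Movie A" gets dropped because all three accounts rated it, while the other two candidates survive.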

Grok is super cool in that it shows you some (or all?) of the resources it used to answer your question. From what I can tell, it found overlap in lists of “movies similar to X” with lists of obscure movies. That’s pretty neat and would be super tedious for someone to do! But it’s a little less impressive than having a qualitative understanding of the types of movies I like. I could be wrong though, I don’t actually know what it's doing and I’m guessing it’s a bit more impressive than inner joining a few lists.
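To make the "inner joining a few lists" idea concrete, here's a rough Python sketch of what that could look like. To be clear, this is my guess at the simplest possible version, not a description of what Grok actually does. The function name `join_lists` and the placeholder titles are invented for the example.

```python
def join_lists(similar_lists, obscure_titles):
    """Find obscure titles that also show up in "movies similar to X" lists.

    Returns (title, favorites_it_matched) pairs, best-supported titles first.
    """
    obscure = set(obscure_titles)
    matches = {}
    for favorite, similar in similar_lists.items():
        # The "inner join": keep only titles present in both lists.
        for title in set(similar) & obscure:
            matches.setdefault(title, set()).add(favorite)
    return sorted(
        ((title, sorted(favorites)) for title, favorites in matches.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


# Hypothetical inputs: one "similar to" list per favorite, plus one obscure list.
similar_lists = {
    "Favorite 1": ["Movie A", "Movie B", "Movie C"],
    "Favorite 2": ["Movie B", "Movie D"],
    "Favorite 3": ["Movie B", "Movie C", "Movie E"],
}
obscure_titles = ["Movie B", "Movie C", "Movie F"]

ranked = join_lists(similar_lists, obscure_titles)
for title, favorites in ranked:
    print(title, "matches", len(favorites), "favorites")
```

Ranking by how many favorites a title matches is my own addition. A plain join would only tell you which titles overlap, not which overlap the most.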

One interesting note before diving into the recommendations: Unlike the non-obscure lists of recommendations, there was almost no overlap between the obscure lists!

Grok

Grok gave me a couple movies that had pretty low ratings on Letterboxd, and it was so receptive when I gave it feedback not to do that! It started filtering to only include movies with ratings above 3.0 and it showed the rating of every movie it recommended after that. Super helpful.

Though it did incorrectly infer that I liked this poorly rated movie…

But that was easy to correct:

Then it hallucinated when I asked it to give me its best recommendation:

Finally, from the 20 or so obscure recommendations it gave, I picked one to try out:

ChatGPT

ChatGPT gave much shorter descriptions. For better and worse, it didn’t try to tie in elements of my favorite movies to its descriptions (until I asked it to do so after a few lists).

The descriptions were kinda boring, which matches my general experience with ChatGPT vs Grok, which has way more personality and was so much more fun to talk to.

One plus for ChatGPT here: I asked all 3 for a list of obscure recs, and later asked each one to pick the movie it was most confident I would like…and ChatGPT was the only one that picked a movie from its own longer list of recommendations. Grok and Claude both confessed to being morons after I pointed it out, but that’s a pretty notable error.

This ended up being the one I chose:

Claude

Here’s what I went with for Claude:

I also asked for a little more explanation on why it recommended it:

Well it sure sounds up my alley, but does it deliver???

Reviewing the Obscure

As a data scientist, I am contractually obligated to say that you are not allowed to watch just one recommendation from each LLM and declare a winner. So let’s watch one recommendation from each LLM and declare a winner.

The Tune was pretty fun! It was totally ridiculous and had basically no narrative, but it had some fun songs (and some meh songs) and some very creative and unique animation. A pretty solid pick! I gave it a 3.75 out of 5 and told Grok it did well:

Great stuff. Fun tone, mentions Letterboxd, and remembers that it’s not just about style: execution matters

A Cat in Paris was less of a hit. I did enjoy the animation, the premise was fun (double agent cats, cops and robbers) and the not-so-serious tone was nice, but I thought the writing was kind of lame. The characters and dialogue felt pretty cookie cutter, clichéd, and unrealistic. I ended up giving it a 3.1 out of 5.

It more or less gets it! I’d have to watch its other recommendations to be sure though. Not as fun as Grok once again.

Bunny and the Bull was solid. Of the three movies, you could see most obviously the connection to a couple of my favorites. Tons of Amelie and Wes Anderson-esque style in here. Really enjoyed all that! Story-wise, it wasn’t terrible, but it felt like a pretty run-of-the-mill buddy road trip. Had a lot of funny moments, but also had a few too many cliches, and the few times it went for an emotional impact, it fell short for me. A good time though! 3.6 out of 5

Long response! Pretty solid

I also questioned Claude’s initial suggestion that Memento and Bunny and the Bull are similar because they both move back and forward in time. Pretty impressive answer! It puts things into words so much better than I could!

Not only does Grok have the highest rated movie, but it gets bonus points for obscurity. The Tune has only been logged 3,427 times on Letterboxd, compared to 25,824 for A Cat in Paris (which was also Oscar nominated), and 5,125 for Bunny and the Bull.

Congrats to Elon. Now please get back to building cool things.

Head to Head Predictions

Let’s try one more test, this time one with an N of more than 1.

I’ll give each LLM 10 sets of two movies that I’ve seen, and ask them to predict which one I liked more. Here’s Claude:

I love that it says which ones it's most confident in. 8 out of 10 here…not bad…except Grok and ChatGPT both aced this! 400 Blows is a reasonable mistake to make, but I think It’s a Wonderful Life’s positivity clearly fits in better with my favorites. There are no excuses for picking the monstrous Emilia Perez over my beloved Jerry Maguire (featuring my beloveds Tom Cruise and Renee Zellweger). Claude, I am offended and hurt.
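For anyone who wants to keep score at home, here's a small Python sketch of the tally, plus a sanity check on how likely a pure coin flipper would be to score at least that well. The coin-flip baseline is my own assumption for comparison, and the pair labels and function names are placeholders, not anything from the actual chats.

```python
from math import comb


def score_predictions(predicted, actual):
    """Return (number correct, total) for head-to-head picks keyed by pair."""
    correct = sum(1 for pair, pick in actual.items() if predicted.get(pair) == pick)
    return correct, len(actual)


def chance_of_at_least(k, n):
    """Probability that n fair coin flips give k or more correct guesses."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n


# Placeholder data: 10 pairs, where "first" marks the movie I actually preferred.
actual = {f"pair_{i}": "first" for i in range(1, 11)}
predicted = dict(actual)
predicted["pair_3"] = "second"  # two misses, mirroring an 8-out-of-10 result
predicted["pair_7"] = "second"

correct, total = score_predictions(predicted, actual)
print(f"{correct} out of {total}")
print(f"coin-flip chance of {correct}+ right: {chance_of_at_least(correct, total):.3f}")
```

With only 10 pairs, a coin flipper gets 8 or more right about 5% of the time, so the gap between 8 and 10 is suggestive rather than conclusive.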

Spot on Grok!

I gave them all four more pairs of movies. These were “trick” questions, where I preferred the movie that was less in line with my typical style, because it was much better made (as could easily be seen through their Letterboxd and IMDB ratings). And sure enough they all got tripped up! ChatGPT and Claude got 3, and Grok got 2.

Well done ChatGPT

After explaining that they should factor in the consensus rating, I gave them all a couple more sets. Easy A vs The Third Man was the first pairing. All but Grok got that one right. After giving Grok another talking-to, I asked them all Citizen Kane vs Beetlejuice Beetlejuice and they all got it!
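One way to picture that feedback is as a weighted blend of "how much does this fit my style" and "how well received is it". I have no idea whether the LLMs do anything like this internally, so treat the Python below as a toy model only. The `style_fit` scores, the weights, and the ratings are invented numbers, not real data about any movie.

```python
def blended_score(style_fit, consensus_rating, weight_on_consensus, max_rating=5.0):
    """Mix a 0-1 style-fit score with a consensus rating scaled to 0-1."""
    if not 0.0 <= weight_on_consensus <= 1.0:
        raise ValueError("weight_on_consensus must be between 0 and 1")
    quality = consensus_rating / max_rating
    return (1 - weight_on_consensus) * style_fit + weight_on_consensus * quality


def pick(movie_a, movie_b, weight_on_consensus):
    """Return the name of whichever movie has the higher blended score."""
    score_a = blended_score(movie_a["style_fit"], movie_a["rating"], weight_on_consensus)
    score_b = blended_score(movie_b["style_fit"], movie_b["rating"], weight_on_consensus)
    return movie_a["name"] if score_a >= score_b else movie_b["name"]


# Hypothetical "trick" pair: one movie fits my style, the other is better made.
on_style = {"name": "on-style movie", "style_fit": 0.9, "rating": 3.0}
well_made = {"name": "well-made movie", "style_fit": 0.4, "rating": 4.4}

style_only_pick = pick(on_style, well_made, weight_on_consensus=0.0)
blended_pick = pick(on_style, well_made, weight_on_consensus=0.7)
print(style_only_pick, "->", blended_pick)
```

With zero weight on consensus, the on-style movie wins, which is roughly how the LLMs got tripped up. Shift enough weight to the consensus rating and the pick flips to the better-made movie.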

So much potential here! As you explain your reasoning it quickly adapts and self-improves. And it can easily accommodate any changes to what you’re looking for (ex. more obscure, higher rated, more upbeat).

I think it would be fascinating to see if you could give the LLM feedback after each watch and see if it can improve its predictions over time! I imagine that will work better over time as its context window expands. I could have given it all my Letterboxd ratings instead of just naming 6 of my favorites. But sadly I am way too committed to my ongoing movie project (catching up on “the classics”) to continue this exercise.

Though it's not clear they really UNDERSTOOD what they were doing. Here’s Claude’s explanation for Citizen Kane:

“Like Marcel the Shell and Amelie, it finds profound meaning in seemingly small moments”. It may find profound meaning in small moments, but not in a similar way to those two much more playful movies!

I was generally unclear on whether the comparisons it makes are post-hoc justifications or whether they actually reflect its decision-making process (I think it's the former; if it's the latter then it's being pretty dumb).

And The Winner Is…

I found Grok the most enjoyable to talk to, and it seemed to be the most open to feedback, and it gave the best recommendation! There wasn’t a huge difference between any of the 3 contenders though, and with these models changing on practically a weekly basis, I plan to keep using all 3 until one emerges as a clear victor.

You could try a similar process to this for any type of recommendation! Books, music, recipes, etc. What a time to be alive. And obviously there is so much beyond recommendations. It really should be everyone’s immediate go-to anytime there’s a question or something to summarize.

And I should also start using it to help me write faster and betterer :)

See you again in 18 months! JK…hopefully

P.S. Maybe I was too harsh on ChatGPT’s lack of personality. When I couldn’t remember what it’s called when a bull stabs someone, I asked it for help:


Friday, March 7, 2025
