Meta's AI Isn't That Good, Yet

Beat it while you still can.

Dec 23, 2022

Consternation, panic, and pandemonium ensued when Meta researchers announced the release of an AI, Cicero, that could play Diplomacy at a high level. Cicero can not only make excellent game-theoretical decisions, but also communicate with other players through chat, adjusting based on their responses.

The announcement was hurriedly followed by several more, each from a team of researchers kicking itself for not having moved slightly quicker through its final review process. Rival AI company DeepMind announced a new AI for Stratego on December 1st, which they called “a game of hidden information… more complex than chess, Go and poker,” a claim that seems true in a raw computational sense, if not at the conceptual level. Another DeepMind team published a paper a week later titled “AI for the board game Diplomacy,” touting their strategic agents’ abilities not only to coordinate with each other but also to adjust to agents that break their diplomatic commitments.

The kerfuffle burnished Meta’s growing reputation as a malevolent, all-powerful force, committed to plunging you further into a virtual world filled with devilish, chittering agents. It also served to obscure vague language in the paper itself, which noted that Cicero is good but not that good at Diplomacy:

“Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.”

Diplomacy performance, in my experience, has a high standard deviation: as in many other strategy games, a good player is much better than a mediocre player, and sometimes nowhere near the ability of a great player. “More than double the average score” of human players is remarkable, but not devastating. The second statement, that Cicero ranked in the top 10% of participants who played more than one game, elides the kind of games Cicero was deployed in: a blitz Diplomacy tournament, in which players have 5 minutes per move, rather than more traditional, luxurious time spans. Participants in Cicero’s games were typically toggling open the tab while watching sports or pretending to work: the set of participants who played more than one game may be quite limited.

Blitz Diplomacy truncates the verbal diplomacy aspect of the game in much the same way as blitz chess truncates tactical calculation. One long game of Diplomacy I played over October hinged, at the crucial moment, on board-wide inability to stab Italy, despite the barrenness of his defenses. Over the course of several weeks, we’d all begun to shoot the breeze about our personal lives, and when it turned out the power had gone out at Italy’s house after a flood, neither allies nor enemies had it in them to sink the knife into his back. None of this happens in blitz Diplomacy. Rather than flowery paragraphs in which players may trade information, build relationships, and sign off as the Czar or Prime Minister of their nation, you get perhaps a few lines of terse back and forth between moves.

It’s unclear how Cicero would fare in such a game state, but the paper provides some hints that the large language model it relies on may not yet be sturdy enough for recurring back and forth. At least one opponent of Cicero over blitz messaged other players, asking if they also thought Cicero was a bot.

Meta’s researchers claim that Cicero “does not backstab.” Their claim relies on a particularly felicitous rhetorical sleight of hand, which is how you know they play Diplomacy. When Cicero makes promises to other players, it, at the moment of the interaction, plans on keeping those promises. There’s no lying involved, not precisely. Of course, this leaves open a wide range of mechanical duplicity. Cicero will make promises to one player, then to a second player, and abide by its promises to the latter. “You should move from Berlin to Munich,” it says, believing it to be a good move, and then it takes Berlin from you, also believing it to be a good move.

AI researcher Jack Clark noted Cicero’s “relatively modest” language model, trained on a dataset of 40,000 human games on webdiplomacy.net. Unfortunately, you and I can’t engage with it, but the researchers (to their great credit) partnered with the website to make games available with its strategic reasoning module, in “gunboat” Diplomacy. Gunboat refers to versions of Diplomacy without any written or verbal communication allowed, making the game something like Risk without the dice rolling. You and I can play against six different, uncoordinated instantiations of Cicero’s decision-making model, but we can’t talk to them and they can’t talk to us. They have fun names like Skynet and Cortana.

Over the past month I tried to beat the agglomerated agents with each of the seven playable countries, and succeeded. For context, I’m fairly good at Diplomacy. I have few compunctions about lying horridly to friends over the board and even fewer about betraying other anons on a website conveniently called Backstabbr. I’m tactically sound, I’ve read articles about the Sealion and the Key Lepanto, and, most importantly, I’ve played way too much. But I’m not great.

Cicero is fairly good. It punishes overreaching with a brutal efficiency, it changes up its defensive strategies in a way concordant with game theory, and even sans communication it occasionally comes up with neat alliances between its different instances. But it’s also eminently beatable, in some ways more than typical gunboat competitors. It seems to have no ability to smell danger across the board, or to set aside one set of strategic goals to tackle a common enemy. Some of this will be sorted out in Cicero 2.0, no doubt, but it’s incredibly satisfying to spank the AI while playing as Italy or Germany, two of the lesser powers.

Will Cicero ruin Diplomacy? Artificial intelligence conquered humans in my preferred strategic realm of chess ages ago, and yet the game chugs along. Recent commentators on an alleged cheating scandal had to squint to read it as a Man vs. Machine story: perhaps the fear that Hans Niemann was wielding digital help in this or that orifice betrayed a deeper fear, that this or that chess engine has finally made the human obsolete.

With Diplomacy, for a brief moment, it’s possible to be Kasparov in his first Deep Blue match. You can beat the best AI in a given realm, moments before its creators take it back to the shop for the human-defeating upgrades. Give it a ride if you have some time to kill this Christmas.

Reading

Let's restore the sacred groves of our ancestors, calls Aris Roussinos in a lyrical essay on the magic of Britain’s rainforests. Aris is one of my favorite essayists in the game, as at home writing political daggers as he is reporting on conservation.

If you’re interested in sharply reported stories on modern agriculture, you should check out the newsletter of Ambrook Research. It’s really quite good.

Colleague and rock star Caleb Watney wrote the best breakdown you’ll see of the kind of state required to do innovative science policy: But Seriously, How Do We Make An Entrepreneurial State?

Is God good because he’s God? Or is good itself good because it’s like God? Complicated question, ably answered by Alan Jacobs 28 years ago through an inquest into the work of British philosopher/novelist Iris Murdoch.

Here are a couple newsletters I’ve come back to several times this month: