Q* or What Comes Next in LLM Land

Ranko Mosic
4 min read · Apr 3, 2024

It is relatively obvious (barring emerging surprises) that brute-force LLM scaling won't get us to LLMs capable of planning and reasoning. Major labs (OpenAI, Google) are working on how to get to the next level beyond GPT-4 by integrating LLMs with planning and reasoning.

A glimpse of what may be coming is Cicero, a Diplomacy-specific attempt to get us closer to this goal. Cicero combines planning and reasoning with an LLM.

Many researchers believe that it may be a descendant of Cicero, rather than GPT-5, that sparks the next big disruptions.

Step 1: Using the board state and current dialogue, Cicero makes an initial prediction of what everyone will do.
Step 2: Cicero iteratively refines that prediction using planning, then uses those predictions to form an intent for itself and its partner.
Step 3: It generates several candidate messages based on the board state, dialogue, and its intents.
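
To make the three steps concrete, here is a minimal Python sketch of that per-turn loop. The class names and method signatures (anchor_policy.predict_actions, planner.refine, dialogue_lm.generate, and so on) are illustrative stand-ins, not Cicero's actual code.

```python
# Minimal sketch of Cicero's per-turn control loop as described in Steps 1-3.
# All classes and method names here are hypothetical stand-ins for illustration.

from dataclasses import dataclass

@dataclass
class Intent:
    own_moves: list        # moves Cicero plans to make this turn
    partner_moves: list    # moves it wants its dialogue partner to make

def play_turn(board_state, dialogue, anchor_policy, planner, dialogue_lm):
    # Step 1: initial prediction of every player's actions from board + dialogue
    predicted = anchor_policy.predict_actions(board_state, dialogue)

    # Step 2: planning iteratively refines that prediction, then commits to an
    # intent for Cicero and its partner
    for _ in range(planner.num_iterations):
        predicted = planner.refine(board_state, predicted)
    intent = planner.choose_intent(board_state, predicted)

    # Step 3: the dialogue model drafts several candidate messages conditioned on
    # the board state, dialogue history, and the chosen intent
    candidates = [
        dialogue_lm.generate(board_state, dialogue, intent)
        for _ in range(planner.num_candidates)
    ]
    return intent, candidates
```

Note that the language model only appears in Step 3: it turns an already-decided intent into messages rather than choosing the strategy itself, which is exactly the division of labor the Meta researchers describe below.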

Though the research team trusted their L.L.M. to understand conversations with other Diplomacy players, they didn’t trust it to actually come up with smart strategies in response to these interactions. “The language model is doing fuzzy pattern-matching, trying to see things seen in training data, and then copying something similar to what was said,” Mike Lewis, a Meta engineer who worked on the project, told me. “It is not trying to predict good moves.” As his colleague Emily Dinan, who also worked on the project, put it, “We tried to relieve the language model of most of the responsibility of learning which moves in the game are strategically valuable, or even legal.” This responsibility was instead placed into a future-oriented planning engine of the sort more typically deployed in a poker or chess bot.

In the resulting system, the language model passes annotated versions of the messages it receives to the planning engine, which uses this information to help simulate possible strategies. Should it trust Italy’s suggestion to help it invade Turkey? Or is the suggestion to invade Australia better? What if Italy is being dishonest? The planning engine explores countless ways forward, integrating many different assumptions about the human players’ allegiances and potential for betrayal. Once it decides on a plan that maximizes its chance for success, it instructs the L.L.M. on what it wants from the other players; the model then turns these terse descriptions into convincing messages.
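
A toy example of the kind of deliberation described above: score each candidate plan against a handful of assumptions about another player's allegiance and keep the plan with the best expected outcome. The plans, probabilities, and payoffs below are invented for illustration; the real planning engine searches a vastly larger space.

```python
# Toy expected-value planning over assumptions about other players' allegiances.

def expected_value(plan, hypotheses, simulate):
    """hypotheses: list of (probability, assumption) pairs; simulate returns a payoff."""
    return sum(p * simulate(plan, assumption) for p, assumption in hypotheses)

def choose_plan(candidate_plans, hypotheses, simulate):
    return max(candidate_plans, key=lambda plan: expected_value(plan, hypotheses, simulate))

# Example: should the agent trust Italy's suggestion?
hypotheses = [
    (0.6, "italy_honest"),
    (0.4, "italy_betrays"),
]
payoffs = {
    ("help_italy_invade_turkey", "italy_honest"): 1.0,
    ("help_italy_invade_turkey", "italy_betrays"): -1.0,
    ("defend_own_centers", "italy_honest"): 0.2,
    ("defend_own_centers", "italy_betrays"): 0.4,
}
simulate = lambda plan, assumption: payoffs[(plan, assumption)]

best = choose_plan(["help_italy_invade_turkey", "defend_own_centers"], hypotheses, simulate)
print(best)  # "defend_own_centers": expected value 0.28 vs 0.20
```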

The strategic reasoning module generates game actions on each game turn, as well as intents, which control the LLM output. The current crop of LLMs merely apes real reasoning via RLHF, i.e., imitation learning.

The key to our achievement was developing new techniques at the intersection of two completely different areas of AI research: strategic reasoning, as used in agents like AlphaGo and Pluribus, and natural language processing, as used in models like GPT-3 … .

Cicero's major improvement over behavioral cloning, i.e., RLHF, is its AlphaGo-style use of self-play (against previous copies of itself) to optimize a value function (to assess state value) and a policy (best action and intent given the state). The policy generates output actions and intents. Intents are metadata (here implemented as LLM prompts): they prod the dialogue model (an LLM trained on dialogue history and board states)² to generate intent-related messages.
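
As a rough illustration of that recipe, the sketch below shows a generic self-play loop against frozen earlier copies, plus how an intent might be flattened into a prompt for the dialogue model. The Game, policy, and value-net interfaces are assumptions made for the example; Cicero's actual training procedure is considerably more involved.

```python
# Hedged sketch: self-play against frozen earlier copies to improve a value function
# and a policy, and intents rendered as plain-text prompts for the dialogue model.
# game, policy, and value_net are hypothetical objects, not Cicero's real training code.

import copy
import random

def self_play_training(game, policy, value_net, n_iters=100, games_per_iter=32):
    past_policies = [copy.deepcopy(policy)]
    for _ in range(n_iters):
        trajectories = []
        for _ in range(games_per_iter):
            opponents = [random.choice(past_policies) for _ in range(game.num_players - 1)]
            trajectories.append(game.play(policy, opponents))  # (state, action, return) tuples
        value_net.fit(trajectories)               # regress state -> expected final score
        policy.improve(trajectories, value_net)   # shift probability toward higher-value actions
        past_policies.append(copy.deepcopy(policy))
    return policy, value_net

def render_intent_prompt(intent):
    # Intents act as metadata the dialogue model is conditioned on; here they are
    # simply flattened into a text prefix.
    own = ", ".join(intent["own_moves"])
    partner = ", ".join(intent["partner_moves"])
    return f"My planned moves: {own}. Moves I want my partner to make: {partner}."
```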

A general AI agent (not game-specific) would need an extended set of actions (math engines, fact checkers¹, etc.) that would be activated via the planning module's constantly recomputed intents.
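
A minimal sketch of what that could look like, assuming a hypothetical planner that re-derives an intent at every step and a small registry of tools; none of these names correspond to an existing API.

```python
# Sketch of intent-driven tool routing for a general (non-game-specific) agent.
# The planner and llm objects, the tool names, and the intent format are all assumptions.

TOOLS = {
    "calculate": lambda query: str(eval(query, {"__builtins__": {}}, {})),  # toy math engine
    "fact_check": lambda claim: f"[stub] verdict for: {claim}",             # placeholder fact checker
    "respond": lambda text: text,                                           # plain LLM reply
}

def run_agent(planner, llm, user_request, max_steps=5):
    context = [user_request]
    for _ in range(max_steps):
        # The planner recomputes an intent from the full context at every step,
        # analogous to Cicero recomputing intents on each game turn.
        intent = planner.next_intent(context)  # e.g. {"tool": "calculate", "input": "2*(3+4)"}
        if intent["tool"] == "respond":
            return llm.generate(context, intent["input"])
        result = TOOLS[intent["tool"]](intent["input"])
        context.append((intent["tool"], result))
    return llm.generate(context, "summarize what was found")
```

The structure mirrors Cicero: planning decides what should happen next and expresses it as an intent, while the language model (or a tool) merely executes that intent.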

Today, for many people, L.L.M.-powered tools such as ChatGPT are synonymous with A.I. But Cicero suggests a broader reality. The future of artificial intelligence may not depend on the pursuit of increasingly complicated large language models but instead on the development of nuanced connections between these models and other types of A.I. Cicero combined language and game strategy, but future systems might draw on more general planning abilities, allowing a chatbot to create a smart plan for your week or navigate tricky interpersonal dynamics in responding to your e-mails. Stretch these possibilities further, and we might even arrive at a real-world HAL 9000, capable of pursuing goals in flexible (and perhaps terrifying) ways. The promise of this ensemble approach is reflected in the fact that the big players are already investing heavily in non-language-based forms of digital intelligence. Not long after Brown’s success with Cicero, for example, OpenAI hired him away from Meta to help integrate more planning into its popular language-model-based tools.

¹ Cicero uses the R2C2 base model, which combines a search engine with an LLM; this still doesn't completely eliminate factual inaccuracies.

² Each game turn requires time to recompute new intents based on the current board state and dialogue history.


Ranko Mosic

Applied AI Consultant Full Stack. GLG Network Expert https://glginsights.com/ . AI tech advisor for VCs, investors, startups.