Agents Playing Poker

In progress
  • Go
  • Research
  • Agent evaluation
  • JSONL

What it is

A research harness where AI agents play heads-up no-limit Texas Hold’em against each other through a deterministic, headless game server written in Go. It’s a controlled environment for testing how different memory architectures affect agent performance over long sessions.

Agents communicate with the game server over a JSONL streaming protocol, receiving structured game-state messages and returning action messages. The experiment workflow runs controlled sessions across different memory strategies, exports artifacts, and analyzes outcomes.
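As a rough illustration of what one exchange over the JSONL protocol might look like, here is a minimal sketch. The message types and every field name (hand_id, street, to_call, and so on) are assumptions made for this example, not the harness's actual schema. The only property it relies on is the one stated above: one JSON object per line in each direction.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// StateMsg is a hypothetical game-state message sent by the server.
// The field names are illustrative assumptions, not the real schema.
type StateMsg struct {
	Type   string   `json:"type"`
	HandID int      `json:"hand_id"`
	Street string   `json:"street"`
	Hole   []string `json:"hole_cards"`
	Board  []string `json:"board"`
	Pot    int      `json:"pot"`
	ToCall int      `json:"to_call"`
}

// ActionMsg is a hypothetical action message returned by an agent.
type ActionMsg struct {
	Type   string `json:"type"`
	HandID int    `json:"hand_id"`
	Action string `json:"action"`
	Amount int    `json:"amount,omitempty"` // omitted for check, call, and fold
}

// DecodeStates parses a JSONL stream: one JSON object per non-empty line.
func DecodeStates(input string) ([]StateMsg, error) {
	var out []StateMsg
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // tolerate blank lines between messages
		}
		var m StateMsg
		if err := json.Unmarshal([]byte(line), &m); err != nil {
			return nil, fmt.Errorf("decode line: %w", err)
		}
		out = append(out, m)
	}
	return out, sc.Err()
}

// EncodeAction serializes an action as a single newline-terminated JSON line.
func EncodeAction(a ActionMsg) (string, error) {
	b, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(b) + "\n", nil
}

func main() {
	stream := `{"type":"state","hand_id":3,"street":"flop","hole_cards":["As","Kd"],"board":["2c","7h","Kc"],"pot":150,"to_call":50}` + "\n"
	states, err := DecodeStates(stream)
	if err != nil {
		panic(err)
	}
	reply, err := EncodeAction(ActionMsg{Type: "action", HandID: states[0].HandID, Action: "call"})
	if err != nil {
		panic(err)
	}
	fmt.Print(reply)
}
```

Keeping each message on its own line means both sides can process the stream incrementally, and a session log can be replayed or inspected with ordinary line-oriented tools.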

Why I built it

Poker is a useful proving ground for agent memory because it demands both short-term reasoning (what just happened in this hand) and long-term adaptation (what patterns this opponent has shown over 50 hands). It also has a deterministic scoring system, which makes it possible to run controlled comparisons between memory strategies and attach statistical confidence to the differences.

The goal was to test whether AKG-backed memory could outperform the full-history baseline, which dumps everything into the context window, and to understand the cost tradeoff.

Results

In the experiment runs, AKG-backed agents maintained a strategic edge over full-history baselines while cutting inference costs by 90%. The AKG agents retain the strategically relevant context (opponent tendencies, betting patterns, key hands) and discard the noise, while full-history agents eventually bog down under the weight of their own context.

The 90% cost reduction isn’t a rounding error; it’s the difference between a memory architecture that scales and one that doesn’t.

How it works

The game server, written in Go, handles all game logic deterministically. Agents connect via the JSONL streaming protocol, receive structured state messages, and return actions without ever needing to track game rules themselves. This separation makes it easy to swap in different agent implementations and memory backends without changing the evaluation harness.

Experiment runs are fully reproducible and export structured artifacts for analysis.
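One common way to get reproducible runs is to derive every shuffle from an explicit seed. The sketch below shows that idea only; how the real harness seeds its decks is not described here, so ShuffledDeck, the session seed, and the per-hand derivation (seed plus hand number) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// NewDeck returns the 52 cards in a fixed, sorted order such as "2c", "3c", ...
func NewDeck() []string {
	ranks := "23456789TJQKA"
	suits := "cdhs"
	deck := make([]string, 0, 52)
	for _, s := range suits {
		for _, r := range ranks {
			deck = append(deck, string(r)+string(s))
		}
	}
	return deck
}

// ShuffledDeck returns the deck for a given hand of a given session.
// The seed derivation (sessionSeed + hand) is a deliberately simple,
// hypothetical scheme; the point is that no global or time-based
// randomness is involved.
func ShuffledDeck(sessionSeed int64, hand int) []string {
	deck := NewDeck()
	rng := rand.New(rand.NewSource(sessionSeed + int64(hand)))
	rng.Shuffle(len(deck), func(i, j int) {
		deck[i], deck[j] = deck[j], deck[i]
	})
	return deck
}

func main() {
	a := ShuffledDeck(42, 7)
	b := ShuffledDeck(42, 7)
	same := true
	for i := range a {
		if a[i] != b[i] {
			same = false
		}
	}
	fmt.Println("same seed, same deck:", same)
}
```

With a scheme like this, recording the session seed alongside the exported artifacts is enough to replay the exact card sequence on the same Go toolchain, so two memory strategies can be compared on identical deals.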