Game Loop & Environment Design

Game Reasoning Arena follows a multi-agent reinforcement learning paradigm built on top of OpenSpiel and exposes a Gymnasium-like interface for game interactions. This design keeps the environment compatible with common RL tooling while supporting diverse agent types, including LLMs, random agents, and human players.

Reinforcement Learning Paradigm

The framework implements the standard agent-environment interaction loop from reinforcement learning:

┌─────────────┐    observation    ┌─────────────┐
│  LLM Agent  │ ◄──────────────── │ Environment │
│             │                   │             │
│  (Policy)   │ ────────────────► │ (OpenSpiel) │
└─────────────┘      action       └─────────────┘
                                         │
                                         ▼
                                    reward +
                                 next observation

Key Components

Environment (OpenSpielEnv)
  • Wraps OpenSpiel games in a Gymnasium-compatible interface

  • Manages state transitions, legal action validation, and game termination

  • Handles both turn-based and simultaneous-move games

  • Provides rich observations including game state and legal actions

Agents (Policies)
  • Implement the BaseAgent interface with a compute_action() method

  • Receive observations and return actions (integers)

  • Support multiple agent types: LLM, Random, Human

  • Can maintain internal state across game steps

Observations
  • Dictionary format containing game state information

  • Includes legal actions, state strings, and formatted prompts

  • Tailored per agent with player-specific information

Actions
  • Integer values representing legal moves in the game

  • Validated against OpenSpiel’s legal action set

  • Support both turn-based steps (one acting player) and simultaneous steps (all players act together)

Rewards
  • Dictionary mapping player IDs to reward values

  • Computed using OpenSpiel’s built-in reward functions

  • Available at each step (dense) or only at episode termination (sparse)
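The agent contract listed above can be sketched as follows. This is a minimal illustration rather than the framework's actual code: only the names BaseAgent, compute_action, and the "legal_actions" observation key come from this page, while the class bodies, the seed parameter, and the sample observation are assumptions.

```python
import random
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class BaseAgent(ABC):
    """Illustrative stand-in for the agent interface (assumed shape)."""

    @abstractmethod
    def compute_action(self, observation: Dict[str, Any]) -> int:
        """Map an observation dictionary to an integer action."""


class RandomAgent(BaseAgent):
    """Picks uniformly among the legal actions listed in the observation."""

    def __init__(self, seed: Optional[int] = None) -> None:
        # A private generator keeps the agent reproducible
        # without touching the global random state.
        self._rng = random.Random(seed)

    def compute_action(self, observation: Dict[str, Any]) -> int:
        return self._rng.choice(observation["legal_actions"])


# Hypothetical observation for an almost-empty Tic-Tac-Toe board.
observation = {"state_string": "...\n...\n...", "legal_actions": [0, 4, 8]}
agent = RandomAgent(seed=0)
action = agent.compute_action(observation)
```

Because the agent only reads "legal_actions", any observation dictionary that carries that key can drive it, which is what lets the same loop serve LLM, random, and human agents.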

Gymnasium Compatibility

Game Reasoning Arena closely follows the Gymnasium API standard:

Environment Interface

# Standard Gymnasium pattern
observation, info = env.reset(seed=42)
terminated = truncated = False

while not terminated and not truncated:
    # current_player and agent are supplied by the surrounding game loop
    action_dict = {current_player: agent.compute_action(observation)}
    observation, reward, terminated, truncated, info = env.step(action_dict)

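Key Similarities

  • reset(seed=...) returns an (observation, info) pair

  • step(...) returns the five-tuple (observation, reward, terminated, truncated, info)

  • An episode ends when either terminated or truncated becomes true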

Multi-Agent Extensions

Game Reasoning Arena extends Gymnasium for multi-agent scenarios:

# Turn-based games (like Chess, Tic-Tac-Toe)
action_dict = {current_player: action}

# Simultaneous games (like Rock-Paper-Scissors)
action_dict = {0: action_0, 1: action_1}
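To make the dictionary-based step call concrete, here is a toy one-round Rock-Paper-Scissors environment. It is a hypothetical stub written for this example (ToyRPSEnv is not part of the framework and does not wrap OpenSpiel); it only mimics the reset/step signatures shown above.

```python
from typing import Any, Dict, List, Optional, Tuple

ROCK, PAPER, SCISSORS = 0, 1, 2


class ToyRPSEnv:
    """Hypothetical one-round Rock-Paper-Scissors environment (not OpenSpielEnv)."""

    def reset(self, seed: Optional[int] = None) -> Tuple[Dict[int, Dict[str, Any]], Dict[str, Any]]:
        # seed is accepted for interface parity; this toy game has no randomness.
        self.terminated = False
        return self._observations(), {}

    def step(self, action_dict: Dict[int, int]):
        # Simultaneous game: both players supply an action in the same step.
        a0, a1 = action_dict[0], action_dict[1]
        # (a0 - a1) % 3 is 0 for a draw, 1 if player 0 wins, 2 if player 1 wins.
        outcome = (a0 - a1) % 3
        r0 = {0: 0.0, 1: 1.0, 2: -1.0}[outcome]
        rewards = {0: r0, 1: 0.0 - r0}  # zero-sum payoff
        self.terminated = True
        return self._observations(), rewards, True, False, {}

    def _observations(self) -> Dict[int, Dict[str, List[int]]]:
        legal = [] if self.terminated else [ROCK, PAPER, SCISSORS]
        return {player: {"legal_actions": legal} for player in (0, 1)}


env = ToyRPSEnv()
observations, info = env.reset(seed=42)
observations, rewards, terminated, truncated, info = env.step({0: PAPER, 1: ROCK})
```

A turn-based environment would accept the same call with a single-entry dictionary, so callers do not need a separate code path per game type.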

RLlib Multi-Agent Compatibility

The framework is designed with RLlib multi-agent training in mind:

Policy Mapping

# RLlib-style policy mapping
def policy_mapping_fn(agent_id, episode, worker, **kwargs):
    return f"policy_{agent_id}"

# Game Reasoning Arena equivalent
player_to_agent = {
    0: LLMAgent(model="gpt-4"),
    1: RandomAgent()
}

Action Computation

# RLlib-style pattern (schematic)
actions = {agent_id: policy.compute_action(obs)
           for agent_id, obs in observations.items()}

# Game Reasoning Arena implementation
actions = {player: agent(observations[player])
          for player in active_players}

Episode Management

The simulation loop mirrors RLlib’s training workflow:

def simulate_episode(env, agents):
    # reset() returns (observations, info), as in the Gymnasium pattern
    observations, info = env.reset()
    episode_rewards = {agent_id: 0.0 for agent_id in agents}
    done = False

    while not done:
        # Compute actions for active agents
        actions = compute_actions(env, agents, observations)

        # Step environment
        obs, rewards, terminated, truncated, info = env.step(actions)

        # Accumulate rewards
        for agent_id, reward in rewards.items():
            episode_rewards[agent_id] += reward

        # Update state
        observations = obs
        done = terminated or truncated

    return episode_rewards
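The episode loop calls compute_actions(env, agents, observations) without showing it. One plausible shape is sketched below. It assumes that env.state is an OpenSpiel state exposing is_simultaneous_node() and current_player(), and that agents is keyed by player ID. The stub environment and FirstLegalAgent are hypothetical helpers used only so the example runs without OpenSpiel installed.

```python
from types import SimpleNamespace
from typing import Any, Dict


def compute_actions(env: Any, agents: Dict[int, Any], observations: Dict[int, dict]) -> Dict[int, int]:
    """Build the action dictionary for whichever players must act this step."""
    state = env.state
    if state.is_simultaneous_node():
        active_players = sorted(agents)            # everyone acts together
    else:
        active_players = [state.current_player()]  # only the player to move
    return {p: agents[p].compute_action(observations[p]) for p in active_players}


class FirstLegalAgent:
    """Hypothetical agent that always plays the first legal action."""

    def compute_action(self, observation: dict) -> int:
        return observation["legal_actions"][0]


def make_stub_env(simultaneous: bool, current_player: int) -> SimpleNamespace:
    """Stand-in exposing only the two state methods used above."""
    state = SimpleNamespace(
        is_simultaneous_node=lambda: simultaneous,
        current_player=lambda: current_player,
    )
    return SimpleNamespace(state=state)


agents = {0: FirstLegalAgent(), 1: FirstLegalAgent()}
observations = {0: {"legal_actions": [2, 5]}, 1: {"legal_actions": [7, 8]}}

turn_based = compute_actions(make_stub_env(False, 1), agents, observations)
# The current_player value is unused in the simultaneous branch.
simultaneous = compute_actions(make_stub_env(True, 0), agents, observations)
```

Keeping this dispatch in one helper means the episode loop itself stays identical for turn-based and simultaneous games.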

Game Loop Architecture

Turn-Based Games

For sequential games like Chess or Tic-Tac-Toe:

game_over = False
while not game_over:
    # 1. Get current player
    current_player = env.state.current_player()

    # 2. Generate observation
    observation = env._state_to_observation()[current_player]

    # 3. Agent selects action
    action = agents[current_player].compute_action(observation)

    # 4. Validate and apply action
    if action in observation["legal_actions"]:
        obs, rewards, terminated, truncated, info = env.step({current_player: action})
        game_over = terminated or truncated
    else:
        # Handle illegal action (terminate episode)
        break

Simultaneous Games

For simultaneous-move games like Rock-Paper-Scissors:

game_over = False
while not game_over:
    # 1. All players act simultaneously
    observations = env._state_to_observation()

    # 2. Collect actions from all agents
    action_dict = {}
    for player_id, agent in agents.items():
        action_dict[player_id] = agent.compute_action(observations[player_id])

    # 3. Apply all actions together
    obs, rewards, terminated, truncated, info = env.step(action_dict)
    game_over = terminated or truncated

Chance Node Handling

OpenSpiel games often include chance events (card dealing, dice rolls):

import random

def _solve_chance_nodes(self):
    """Automatically resolve probabilistic events."""
    while self.state.is_chance_node():
        # chance_outcomes() yields (action, probability) pairs
        outcomes, probabilities = zip(*self.state.chance_outcomes())
        action = random.choices(outcomes, weights=probabilities)[0]
        self.state.apply_action(action)
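The same sampling logic can be exercised without OpenSpiel by using a stub state. In the sketch below, StubDiceState and resolve_chance_nodes are hypothetical names. The stub only mimics the three state methods used above, and passing an explicit random.Random instance is an assumption made here so that runs are reproducible.

```python
import random
from typing import List, Optional, Tuple


class StubDiceState:
    """Hypothetical state with a single chance node: one fair six-sided die roll."""

    def __init__(self) -> None:
        self.rolled: Optional[int] = None

    def is_chance_node(self) -> bool:
        return self.rolled is None

    def chance_outcomes(self) -> List[Tuple[int, float]]:
        # (action, probability) pairs, one per die face.
        return [(face, 1.0 / 6.0) for face in range(6)]

    def apply_action(self, action: int) -> None:
        self.rolled = action


def resolve_chance_nodes(state, rng: random.Random) -> None:
    """Sample chance outcomes until a player (or terminal) node is reached."""
    while state.is_chance_node():
        outcomes, probabilities = zip(*state.chance_outcomes())
        action = rng.choices(outcomes, weights=probabilities, k=1)[0]
        state.apply_action(action)


first, second = StubDiceState(), StubDiceState()
resolve_chance_nodes(first, random.Random(42))
resolve_chance_nodes(second, random.Random(42))
same_roll = first.rolled == second.rolled  # identical seeds give identical rolls
```

Seeding a dedicated generator, rather than the module-level one, keeps chance resolution repeatable even when agents also draw random numbers.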

Observation Structure

Observations follow a rich dictionary format providing comprehensive game information:

observation = {
    "state_string": "X.O\n.X.\n...",  # Human-readable state
    "legal_actions": [1, 3, 5, 6, 7, 8],  # Indices of the empty cells
    "prompt": "You are playing Tic-Tac-Toe\n..."  # Formatted for LLMs
}

Per-Agent Observations

Each agent receives player-specific information:

  • Partial observability: Hidden information (e.g., opponent cards in Poker) is withheld from the observation

  • Player perspective: Board orientation and symbol assignment

  • Legal actions: Only moves valid for that specific player

  • Context prompts: Tailored natural language descriptions for LLM agents
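As a rough illustration of the points above, the sketch below assembles one observation per player for a hidden-information game. The function build_observation, the prompt wording, and the sample card strings are all assumptions made for this example; only the three dictionary keys come from this page.

```python
from typing import Any, Dict, List


def build_observation(player_id: int, state_string: str,
                      legal_actions: List[int], game_name: str) -> Dict[str, Any]:
    """Assemble a player-specific observation (hypothetical helper)."""
    prompt = (
        f"You are playing {game_name} as player {player_id}.\n"
        f"Current state:\n{state_string}\n"
        f"Legal actions: {legal_actions}\n"
        "Reply with one legal action index."
    )
    return {
        "state_string": state_string,
        "legal_actions": list(legal_actions),
        "prompt": prompt,
    }


# Each player sees only their own card; the opponent's card is left out.
private_views = {0: "Your card: J", 1: "Your card: K"}
observations = {
    player: build_observation(player, view, [0, 1], "Kuhn Poker")
    for player, view in private_views.items()
}
```

The environment, not the agent, decides what goes into each view, so an agent cannot read hidden information by accident.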

Action Space Design

Actions are represented as integer indices corresponding to OpenSpiel’s action encoding:

# Tic-Tac-Toe: positions 0-8
# 0 1 2
# 3 4 5
# 6 7 8

# Connect Four: columns 0-6
# Kuhn Poker: 0=Pass, 1=Bet

Note

All action indices are validated against OpenSpiel’s legal action constraints to ensure game rule compliance.
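For grid games, the integer encoding maps directly to board coordinates. The helpers below are a small sketch that assumes the row-major Tic-Tac-Toe layout shown above; action_to_cell and cell_to_action are illustrative names, not framework functions.

```python
from typing import Tuple


def action_to_cell(action: int, width: int = 3) -> Tuple[int, int]:
    """Convert a row-major action index to (row, column)."""
    return divmod(action, width)


def cell_to_action(row: int, col: int, width: int = 3) -> int:
    """Convert (row, column) back to the row-major action index."""
    return row * width + col


center = action_to_cell(4)            # middle square of the 3x3 board
bottom_left = cell_to_action(2, 0)    # last row, first column
```

With the layout above, action 4 is the centre cell at row 1, column 1, and the bottom-left corner is action 6.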

Action Validation

The framework provides automatic legal action checking:

import logging

logger = logging.getLogger(__name__)

legal_actions = env.state.legal_actions(current_player)

if chosen_action not in legal_actions:
    # Log illegal move and terminate episode
    logger.error(f"Illegal action {chosen_action} by player {current_player}")
    env.truncated = True
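The membership check above can be wrapped in a reusable helper. The sketch below is an assumption about how such a helper might look (is_legal_action is a hypothetical name). It also rejects non-integer values, which can be useful when an action comes from parsed LLM output.

```python
import logging
from typing import Any, Sequence

logger = logging.getLogger(__name__)


def is_legal_action(chosen_action: Any, legal_actions: Sequence[int], player: int) -> bool:
    """Return True only for a plain integer that appears in legal_actions."""
    # bool is a subclass of int, so True would otherwise match action 1.
    if isinstance(chosen_action, bool) or not isinstance(chosen_action, int):
        logger.error("Non-integer action %r by player %s", chosen_action, player)
        return False
    if chosen_action not in legal_actions:
        logger.error("Illegal action %s by player %s", chosen_action, player)
        return False
    return True


legal_actions = [1, 3, 5]
results = [is_legal_action(a, legal_actions, player=0) for a in (3, 4, "3", True)]
```

Only the first candidate passes: 4 is not in the legal set, "3" is a string, and True is rejected even though it compares equal to 1.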

Reward Structure

Rewards follow OpenSpiel’s game-theoretic conventions:

Zero-Sum Games

  • Winner: +1, Loser: -1, Draw: 0

  • Total rewards sum to zero across all players

Cooperative Games

  • Shared objectives with aligned reward signals

  • All players receive the same reward for joint success

Reward Timing

# Sparse rewards (typical)
rewards = {0: 0.0, 1: 0.0}  # During game
rewards = {0: 1.0, 1: -1.0}  # At termination

# Dense rewards (optional)
rewards = {0: step_reward, 1: step_reward}  # Each step
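Whichever timing a game uses, per-step reward dictionaries can be summed into episode returns in the same way. The sketch below uses a hand-written sparse reward trace (an assumption for illustration, not recorded output) and checks the zero-sum property described above.

```python
from typing import Dict, Iterable


def episode_returns(step_rewards: Iterable[Dict[int, float]]) -> Dict[int, float]:
    """Sum per-step reward dictionaries into one total per player."""
    totals: Dict[int, float] = {}
    for rewards in step_rewards:
        for player, reward in rewards.items():
            totals[player] = totals.get(player, 0.0) + reward
    return totals


# Sparse trace: zeros during play, the outcome only on the final step.
sparse_trace = [
    {0: 0.0, 1: 0.0},
    {0: 0.0, 1: 0.0},
    {0: 1.0, 1: -1.0},
]
returns = episode_returns(sparse_trace)
is_zero_sum = abs(sum(returns.values())) < 1e-9
```

The same function works unchanged for dense rewards, since it makes no assumption about which steps carry non-zero values.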

See Also

  • Agents - Detailed agent implementation guide

  • Games - Available game environments

  • API Reference - Complete API documentation

  • Experiments - Advanced multi-agent training setups