Game Loop & Environment Design
==============================

Game Reasoning Arena follows a **multi-agent reinforcement learning paradigm** built on top of OpenSpiel and exposes a Gymnasium-like interface for game interactions. This design lets the environment plug into RL frameworks while supporting several agent types: LLMs, random agents, and human players.

Reinforcement Learning Paradigm
--------------------------------

The framework implements the standard **agent-environment interaction loop** from reinforcement learning:

.. code-block:: text

   ┌─────────────┐    observation    ┌─────────────┐
   │  LLM Agent  │ ◄──────────────── │ Environment │
   │             │                   │             │
   │  (Policy)   │ ────────────────► │ (OpenSpiel) │
   └─────────────┘      action       └─────────────┘
                                            │
                                            ▼
                                reward + next observation

Key Components
~~~~~~~~~~~~~~

**Environment (OpenSpielEnv)**

- Wraps OpenSpiel games in a Gymnasium-compatible interface
- Manages state transitions, legal action validation, and game termination
- Handles both turn-based and simultaneous-move games
- Provides observations that include the game state and the legal actions

**Agents (Policies)**

- Implement the ``BaseAgent`` interface with a ``compute_action()`` method
- Receive observations and return actions (integers)
- Support multiple agent types: LLM, Random, Human
- Can maintain internal state across game steps

**Observations**

- Dictionary format containing game state information
- Include legal actions, state strings, and formatted prompts
- Tailored per agent with player-specific information

**Actions**

- Integer values representing legal moves in the game
- Validated against OpenSpiel's legal action set
- Support both single-agent (turn-based) and multi-agent (simultaneous) scenarios

**Rewards**

- Dictionary mapping player IDs to reward values
- Computed using OpenSpiel's built-in reward functions
- Available at every step (dense) or only at episode termination (sparse)

Gymnasium Compatibility
------------------------

Game Reasoning Arena closely follows
the **Gymnasium API standard**:

Environment Interface
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Standard Gymnasium pattern
   observation, info = env.reset(seed=42)
   terminated = truncated = False

   while not terminated and not truncated:
       action_dict = {current_player: agent.compute_action(observation)}
       observation, reward, terminated, truncated, info = env.step(action_dict)

Key Similarities
~~~~~~~~~~~~~~~~

========================= ========================= ========================================
**Gymnasium**             **Game Reasoning Arena**  **Description**
========================= ========================= ========================================
``env.reset()``           ``env.reset()``           Initialize episode, return observation
``env.step(action)``      ``env.step(actions)``     Apply action(s), return transition
``env.render()``          ``env.render()``          Display current game state
``env.reset(seed=...)``   ``env.set_seed()``        Set random seed for reproducibility
Observation space         Observation dict          Structured state information
Action space              Legal actions list        Valid moves for current state
Reward signal             Reward dictionary         Per-player reward values
========================= ========================= ========================================

Multi-Agent Extensions
~~~~~~~~~~~~~~~~~~~~~~~

Game Reasoning Arena extends Gymnasium for **multi-agent scenarios** by passing actions as a dictionary keyed by player ID:

.. code-block:: python

   # Turn-based games (like Chess, Tic-Tac-Toe)
   action_dict = {current_player: action}

   # Simultaneous games (like Rock-Paper-Scissors)
   action_dict = {0: action_0, 1: action_1}

RLlib Multi-Agent Compatibility
--------------------------------

The framework is designed with **RLlib multi-agent training** in mind:

Policy Mapping
~~~~~~~~~~~~~~

.. code-block:: python

   # RLlib-style policy mapping
   def policy_mapping_fn(agent_id, episode, worker, **kwargs):
       return f"policy_{agent_id}"

   # Game Reasoning Arena equivalent
   player_to_agent = {
       0: LLMAgent(model="gpt-4"),
       1: RandomAgent()
   }

Action Computation
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # RLlib pattern
   actions = {agent_id: policy.compute_action(obs)
              for agent_id, obs in observations.items()}

   # Game Reasoning Arena implementation
   actions = {player: player_to_agent[player](observations[player])
              for player in active_players}

Episode Management
~~~~~~~~~~~~~~~~~~

The simulation loop mirrors RLlib's training workflow:

.. code-block:: python

   def simulate_episode():
       observations, info = env.reset()
       episode_rewards = {agent_id: 0 for agent_id in agents}
       done = False

       while not done:
           # Compute actions for active agents
           actions = compute_actions(env, agents, observations)

           # Step environment
           obs, rewards, terminated, truncated, info = env.step(actions)

           # Accumulate rewards
           for agent_id, reward in rewards.items():
               episode_rewards[agent_id] += reward

           # Update state
           observations = obs
           done = terminated or truncated

       return episode_rewards

Game Loop Architecture
----------------------

Turn-Based Games
~~~~~~~~~~~~~~~~

For sequential games like Chess or Tic-Tac-Toe:

.. code-block:: python

   game_over = False
   while not game_over:
       # 1. Get current player
       current_player = env.state.current_player()

       # 2. Generate observation
       observation = env._state_to_observation()[current_player]

       # 3. Agent selects action
       action = agents[current_player].compute_action(observation)

       # 4. Validate and apply action
       if action in observation["legal_actions"]:
           obs, rewards, terminated, truncated, info = env.step({current_player: action})
           game_over = terminated or truncated
       else:
           # Handle illegal action (terminate episode)
           break

Simultaneous Games
~~~~~~~~~~~~~~~~~~

For concurrent games like Rock-Paper-Scissors:

.. code-block:: python

   game_over = False
   while not game_over:
       # 1. All players act simultaneously
       observations = env._state_to_observation()

       # 2. Collect actions from all agents
       action_dict = {}
       for player_id, agent in agents.items():
           action_dict[player_id] = agent.compute_action(observations[player_id])

       # 3. Apply all actions together
       obs, rewards, terminated, truncated, info = env.step(action_dict)
       game_over = terminated or truncated

Chance Node Handling
~~~~~~~~~~~~~~~~~~~~

OpenSpiel games often include chance events (card dealing, dice rolls). The environment resolves them by sampling an outcome according to its probability:

.. code-block:: python

   import random

   def _solve_chance_nodes(self):
       """Automatically resolve probabilistic events."""
       while self.state.is_chance_node():
           # chance_outcomes() yields (action, probability) pairs
           outcomes, probabilities = zip(*self.state.chance_outcomes())
           action = random.choices(outcomes, weights=probabilities)[0]
           self.state.apply_action(action)

Observation Structure
---------------------

Observations are **dictionaries** that carry the game state, the legal actions, and an LLM-ready prompt:

.. code-block:: python

   observation = {
       "state_string": "X.O\n.X.\n...",               # Human-readable state
       "legal_actions": [1, 3, 5, 6, 7, 8],            # Valid move indices (empty cells)
       "prompt": "You are playing Tic-Tac-Toe\n..."    # Formatted for LLMs
   }

Per-Agent Observations
~~~~~~~~~~~~~~~~~~~~~~

Each agent receives **player-specific information**:

- **Partial observability**: Hidden information (e.g., opponent cards in Poker) is withheld
- **Player perspective**: Board orientation and symbol assignment
- **Legal actions**: Only moves valid for that specific player
- **Context prompts**: Tailored natural language descriptions for LLM agents

Action Space Design
-------------------

Actions are represented as **integer indices** corresponding to OpenSpiel's action encoding:

.. code-block:: python

   # Tic-Tac-Toe: positions 0-8
   # 0 1 2
   # 3 4 5
   # 6 7 8

   # Connect Four: columns 0-6
   # Kuhn Poker: 0=Pass, 1=Bet

.. note::
   All action indices are validated against OpenSpiel's legal action constraints to ensure game rule compliance.

Action Validation
~~~~~~~~~~~~~~~~~

The framework provides **automatic legal action checking**:

.. code-block:: python

   import logging

   logger = logging.getLogger(__name__)

   legal_actions = env.state.legal_actions(current_player)
   if chosen_action not in legal_actions:
       # Log illegal move and terminate episode
       logger.error(f"Illegal action {chosen_action} by player {current_player}")
       env.truncated = True

Reward Structure
----------------

Rewards follow **OpenSpiel's game-theoretic conventions**:

Zero-Sum Games
~~~~~~~~~~~~~~

- Winner: +1, Loser: -1, Draw: 0
- Total rewards sum to zero across all players

Cooperative Games
~~~~~~~~~~~~~~~~~

- Shared objectives with aligned reward signals
- All players receive the same reward for joint success

Reward Timing
~~~~~~~~~~~~~

.. code-block:: python

   # Sparse rewards (typical)
   rewards = {0: 0.0, 1: 0.0}    # During game
   rewards = {0: 1.0, 1: -1.0}   # At termination

   # Dense rewards (optional)
   rewards = {0: step_reward, 1: step_reward}  # Each step

See Also
--------

- :doc:`agents` - Detailed agent implementation guide
- :doc:`games` - Available game environments
- :doc:`api_reference` - Complete API documentation
- :doc:`experiments` - Advanced multi-agent training setups
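
Appendix: Minimal Loop Sketch
-----------------------------

The turn-based loop, legal-action check, and sparse zero-sum rewards described on this page can be tried without OpenSpiel using a toy environment. The sketch below is illustrative only: ``ToyNimEnv``, ``random_agent``, and this ``simulate_episode`` are hypothetical names invented for the example and are not part of Game Reasoning Arena or OpenSpiel. It only assumes the ``reset()`` / ``step(action_dict)`` contract and the observation keys shown earlier.

.. code-block:: python

   import random


   class ToyNimEnv:
       """Hypothetical stand-in environment: players alternate taking 1 or 2
       stones, and whoever takes the last stone wins (+1 / -1)."""

       def __init__(self, stones=5):
           self.initial_stones = stones
           self.stones = stones
           self.current_player = 0

       def _observations(self):
           # Only the player to move receives an observation.
           legal = [a for a in (1, 2) if a <= self.stones]
           return {
               self.current_player: {
                   "state_string": f"stones={self.stones}",
                   "legal_actions": legal,
               }
           }

       def reset(self, seed=None):
           self.stones = self.initial_stones
           self.current_player = 0
           return self._observations(), {}

       def step(self, action_dict):
           player, action = next(iter(action_dict.items()))
           rewards = {0: 0.0, 1: 0.0}
           legal = self._observations()[self.current_player]["legal_actions"]
           if player != self.current_player or action not in legal:
               # Illegal move: truncate the episode without rewards.
               return {}, rewards, False, True, {"illegal_action": action}
           self.stones -= action
           if self.stones == 0:
               # Sparse zero-sum reward, paid only at termination.
               rewards[player] = 1.0
               rewards[1 - player] = -1.0
               return {}, rewards, True, False, {}
           self.current_player = 1 - player
           return self._observations(), rewards, False, False, {}


   def random_agent(observation, rng):
       """Pick uniformly among the legal actions in the observation."""
       return rng.choice(observation["legal_actions"])


   def simulate_episode(env, rng):
       observations, _ = env.reset()
       totals = {0: 0.0, 1: 0.0}
       terminated = truncated = False
       while not (terminated or truncated):
           player = env.current_player
           action = random_agent(observations[player], rng)
           observations, rewards, terminated, truncated, _ = env.step({player: action})
           for player_id, reward in rewards.items():
               totals[player_id] += reward
       return totals


   totals = simulate_episode(ToyNimEnv(), random.Random(0))
   print(totals)

Because the agents only ever choose from ``legal_actions``, the episode always ends by termination rather than truncation, and the two accumulated rewards sum to zero.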