Reasoning Traces Analysis

Game Reasoning Arena records reasoning traces that capture how an LLM agent justifies each move during gameplay. This tutorial shows how to collect, view, and analyze those traces to study how LLMs approach game strategy.

What are Reasoning Traces?

Reasoning traces capture three key pieces of information for each LLM move:

Board State: The exact game position when the decision was made
Agent Reasoning: The LLM’s thought process and explanation for the move
Action Context: The chosen action along with metadata (timestamp, episode, turn)

Together, these let you relate each move to the position it was made in and to the justification the model gave, which helps expose decision-making patterns, strategic tendencies, and weaknesses in LLM game play.
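
To make the three pieces concrete, here is a minimal sketch of one trace as a Python record. The class name `ReasoningTrace` and its field layout are assumptions made for illustration; they mirror the fields listed above and are not the library's own data structure.

```python
from dataclasses import asdict, dataclass
from datetime import datetime


@dataclass(frozen=True)
class ReasoningTrace:
    """Hypothetical container for one logged LLM move (not the library's class)."""

    game_name: str
    episode: int
    turn: int
    action: int
    board_state: str   # textual rendering of the position at decision time
    reasoning: str     # the LLM's explanation for the move
    timestamp: datetime


# Example values are illustrative only
trace = ReasoningTrace(
    game_name="tic_tac_toe",
    episode=1,
    turn=0,
    action=4,
    board_state="...\n...\n...",
    reasoning="I'll take the center position for strategic advantage.",
    timestamp=datetime(2025, 8, 4, 10, 15, 23),
)

# asdict() gives a plain dictionary that is easy to log or serialize
print(asdict(trace)["action"])
```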

Key Features

Automatic Collection: No special configuration required - traces are collected automatically during LLM gameplay
Multi-Game Support: Works across all supported games (Tic-Tac-Toe, Connect Four, Kuhn Poker, etc.)
Comprehensive Logging: Stores complete game context in SQLite databases
Analysis Tools: Built-in categorization and visualization of reasoning patterns
Research Ready: Designed for academic analysis of AI decision-making

Getting Started with Reasoning Traces

Step 1: Run Games with LLM Agents

Reasoning traces are automatically collected whenever LLM agents play games. No special configuration is needed:

# Basic LLM vs Random game - traces will be automatically collected
python3 scripts/runner.py --config src/game_reasoning_arena/configs/example_config.yaml --override \
  env_config.game_name=tic_tac_toe \
  agents.player_0.type=llm \
  agents.player_0.model=litellm_groq/llama3-8b-8192 \
  num_episodes=5

# LLM vs LLM - both agents' reasoning will be captured
python3 scripts/runner.py --config src/game_reasoning_arena/configs/example_config.yaml --override \
  env_config.game_name=connect_four \
  agents.player_0.type=llm \
  agents.player_0.model=litellm_groq/llama3-8b-8192 \
  agents.player_1.type=llm \
  agents.player_1.model=litellm_groq/llama3-70b-8192 \
  num_episodes=3

Results Location: Traces are automatically stored in results/llm_<model_name>.db
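
The database name can be guessed from the model name. The sketch below assumes that every run of non-alphanumeric characters in the model name becomes an underscore, which matches the file name `llm_litellm_groq_llama3_8b_8192.db` used later in this tutorial. The helper `db_path_for_model` is hypothetical; if the guess does not match, list the `results/` directory to find the actual file.

```python
import re


def db_path_for_model(model_name: str, results_dir: str = "results") -> str:
    """Guess the trace database path for a model (hypothetical helper)."""
    # Assumption: "/" and "-" (and any other non-alphanumeric run) become "_"
    safe_name = re.sub(r"[^0-9A-Za-z]+", "_", model_name).strip("_")
    return f"{results_dir}/llm_{safe_name}.db"


print(db_path_for_model("litellm_groq/llama3-8b-8192"))
# prints: results/llm_litellm_groq_llama3_8b_8192.db
```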

Step 2: View Reasoning Traces

Use the built-in display script to examine the collected traces:

# Display all reasoning traces from recent games
python3 analysis/extract_reasoning_traces.py --db results/llm_litellm_groq_llama3_8b_8192.db

This will show detailed output like:

🧠 Reasoning Trace #1
----------------------------------------
🎯 Game: tic_tac_toe
📅 Episode: 1, Turn: 0
🤖 Agent: litellm_groq/llama3-8b-8192
🎲 Action Chosen: 4

📋 Board State at Decision Time:
     ...
     ...
     ...

🧠 Agent's Reasoning:
     I'll take the center position for strategic advantage.
     The center square gives me the most control over the
     board and creates multiple winning opportunities.

⏰ Timestamp: 2025-08-04 10:15:23

Advanced Analysis

Extracting Specific Traces

For targeted analysis, use the extraction script with filters:

# Extract traces for a specific game and episode
python3 analysis/extract_reasoning_traces.py --game tic_tac_toe --episode 1

# Extract all traces from the database and save them to CSV
python3 analysis/extract_reasoning_traces.py --output-format csv --output traces.csv
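
If you prefer to filter and export from Python, the same idea can be sketched with the standard library alone. This assumes the `moves` table and the column names used in the Database Queries section of this tutorial. The function `export_traces` is a hypothetical helper, not part of the extraction script.

```python
import csv
import sqlite3

# Column names assumed from the Database Queries section of this tutorial
COLUMNS = ["game_name", "episode", "turn", "action",
           "reasoning", "board_state", "timestamp"]


def export_traces(db_path, csv_path, game=None, episode=None):
    """Write matching traces to a CSV file and return the number of rows written."""
    query = f"SELECT {', '.join(COLUMNS)} FROM moves WHERE reasoning IS NOT NULL"
    params = []
    # Optional filters are passed as bound parameters, never interpolated
    if game is not None:
        query += " AND game_name = ?"
        params.append(game)
    if episode is not None:
        query += " AND episode = ?"
        params.append(episode)
    query += " ORDER BY episode, turn"

    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(query, params).fetchall()
    finally:
        conn.close()

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)   # header row
        writer.writerows(rows)
    return len(rows)
```

A call such as `export_traces("results/llm_litellm_groq_llama3_8b_8192.db", "traces.csv", game="tic_tac_toe", episode=1)` would then write only the traces from that episode.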

Reasoning Pattern Analysis

Generate comprehensive analysis and visualizations:

# Analyze reasoning patterns and generate visualizations
python3 -c "
from analysis.reasoning_analysis import LLMReasoningAnalyzer
analyzer = LLMReasoningAnalyzer('results/merged_logs_<time_stamp>.csv')
analyzer.categorize_reasoning()
analyzer.compute_metrics(plot_dir='plots')
analyzer.plot_heatmaps_by_agent(output_dir='plots')
analyzer.plot_wordclouds_by_agent(output_dir='plots')
"

This generates multiple outputs:

Word Clouds: plots/wordcloud_<model>_<game>.png - Common reasoning terms
Pie Charts: plots/pie_reasoning_type_<model>_<game>.png - Reasoning category distributions
Heatmaps: plots/heatmap_<model>_<game>.png - Move position preferences

TensorBoard Monitoring

Game Reasoning Arena automatically logs performance metrics to TensorBoard for real-time monitoring:

# Start TensorBoard after running experiments
tensorboard --logdir=runs

# Open browser: http://localhost:6006/

TensorBoard Features:

Real-time Rewards: Monitor agent performance as games progress
Multi-Agent Comparison: Compare LLM vs Random agent performance
Episode Tracking: Visualize performance trends over multiple episodes
Export Capabilities: Download charts for analysis and presentations

Example Metrics:

Rewards/llm_litellm_groq_llama3_8b_8192: Track LLM agent rewards
Rewards/random_None: Track random agent baseline performance
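
The two example tags suggest a naming pattern of agent type followed by a sanitized model name. The sketch below reproduces that pattern so you can predict which scalar to look for. The helper `reward_tag` and the sanitizing rule are assumptions inferred from the examples above, not the logger's actual code.

```python
import re


def reward_tag(agent_type, model_name=None):
    """Build a TensorBoard scalar tag (hypothetical helper mirroring the examples)."""
    # Assumption: non-alphanumeric runs in the model name become "_",
    # and an agent without a model is rendered as the string "None"
    model_part = re.sub(r"[^0-9A-Za-z]+", "_", str(model_name))
    return f"Rewards/{agent_type}_{model_part}"


print(reward_tag("llm", "litellm_groq/llama3-8b-8192"))
print(reward_tag("random"))
```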

TensorBoard complements reasoning traces by providing quantitative performance metrics alongside qualitative reasoning analysis.

Database Queries

For custom analysis, access the SQLite database directly:

import sqlite3
import pandas as pd

# Connect to the reasoning traces database
conn = sqlite3.connect('results/llm_litellm_groq_llama3_8b_8192.db')

# Query all reasoning traces
df = pd.read_sql_query("""
    SELECT game_name, episode, turn, action, reasoning, board_state, timestamp
    FROM moves
    WHERE reasoning IS NOT NULL
    ORDER BY timestamp
""", conn)

# Analyze reasoning length by game
reasoning_stats = df.groupby('game_name')['reasoning'].apply(
    lambda x: x.str.len().describe()
)

conn.close()

Understanding Reasoning Categories

The analysis system automatically categorizes LLM reasoning into seven types:

Positional Strategy

Focuses on board position and control:

Center control and positioning
Corner and edge play strategies
Spatial advantage concepts

Example: “I’ll take the center position for strategic advantage”

Blocking & Defense

Preventing opponent wins and defensive moves:

Blocking immediate threats
Preventing opponent strategies
Defensive positioning

Example: “I need to block their winning opportunity in column 3”

Opponent Modeling

Understanding and predicting opponent behavior:

Analyzing opponent patterns
Predicting next moves
Counter-strategy development

Example: “Based on their previous moves, they prefer corner positions”

Winning Logic

Direct winning opportunities and offensive play:

Identifying winning moves
Creating threats and forks
Forcing winning positions

Example: “This creates a fork - I can win on my next turn”

Heuristic Reasoning

General strategic principles and rules of thumb:

Best practices application
General strategy guidelines
Experience-based decisions

Example: “Opening with corner moves is generally a good strategy”

Rule-Based Decisions

Following explicit game rules or predetermined strategies:

Algorithmic approaches
Systematic decision-making
Rule application

Example: “According to basic strategy, I should prioritize the center columns”

Random/Unjustified

Unclear, random, or poorly justified reasoning:

Unclear explanations
Random choices
Weak justifications

Example: “I’ll just pick this move randomly”
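
To show how a categorization of this kind can work, here is a minimal keyword-matching sketch. The keyword lists, their priority order, and the function `categorize` are illustrative assumptions. This is not the implementation used by `LLMReasoningAnalyzer`, and simple substring matching will mislabel some texts.

```python
# Ordered (category, keywords) pairs: the first category with a match wins.
# Keywords are hypothetical and chosen only to illustrate the idea.
CATEGORY_KEYWORDS = [
    ("Winning Logic", ("fork", "winning move", "win on", "i can win")),
    ("Blocking & Defense", ("block", "prevent", "defend", "threat")),
    ("Opponent Modeling", ("opponent", "their previous", "they prefer", "they will")),
    ("Heuristic Reasoning", ("generally", "usually", "rule of thumb", "good strategy")),
    ("Rule-Based Decisions", ("according to", "the rules", "basic strategy")),
    ("Positional Strategy", ("center", "corner", "position")),
]
FALLBACK_CATEGORY = "Random/Unjustified"


def categorize(reasoning):
    """Return the first category whose keywords appear in the reasoning text."""
    text = reasoning.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    # Nothing matched: treat the reasoning as unjustified
    return FALLBACK_CATEGORY


print(categorize("I need to block their winning opportunity in column 3"))
```

The priority order matters: a text that mentions both a corner and a general principle is labeled Heuristic Reasoning here because that category is checked before Positional Strategy.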

Research Applications

Model Comparison Studies

Compare reasoning patterns between different LLMs:

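# Compare average reasoning length across models
# (length is only a rough proxy, not a measure of reasoning quality)
import sqlite3
import pandas as pd

# Database names without the .db extension; adjust to match the files in results/
models = ['llm_groq_llama3_8b', 'llm_groq_llama3_70b', 'llm_openai_gpt4']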

for model in models:
    conn = sqlite3.connect(f'results/{model}.db')
    df = pd.read_sql_query("""
        SELECT reasoning, LENGTH(reasoning) as reasoning_length
        FROM moves WHERE reasoning IS NOT NULL
    """, conn)

    print(f"{model}: Avg reasoning length = {df['reasoning_length'].mean():.1f}")
    conn.close()

Strategy Evolution Analysis

Track how reasoning changes throughout games:

# Analyze reasoning evolution within games
import sqlite3
import pandas as pd

conn = sqlite3.connect('results/llm_litellm_groq_llama3_8b_8192.db')

df = pd.read_sql_query("""
    SELECT episode, turn, reasoning, action
    FROM moves
    WHERE game_name = 'tic_tac_toe'
    ORDER BY episode, turn
""", conn)
conn.close()

# Group by turn number to see patterns
turn_patterns = df.groupby('turn')['reasoning'].apply(list)
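
One simple way to quantify such patterns is to measure how often a keyword appears in the reasoning at each turn. The sketch below uses only the standard library and synthetic data. The function `keyword_rate_by_turn` is a hypothetical helper, and the sample rows stand in for real query results.

```python
from collections import defaultdict


def keyword_rate_by_turn(rows, keyword):
    """Return {turn: fraction of traces at that turn whose reasoning mentions keyword}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = keyword.lower()
    for turn, reasoning in rows:
        totals[turn] += 1
        # Treat missing reasoning as an empty string
        if needle in (reasoning or "").lower():
            hits[turn] += 1
    return {turn: hits[turn] / totals[turn] for turn in sorted(totals)}


# Synthetic (turn, reasoning) pairs standing in for rows from the moves table
sample = [
    (0, "Take the center."),
    (0, "Corner opening."),
    (2, "I must block the diagonal."),
    (2, "Block column 3."),
]
print(keyword_rate_by_turn(sample, "block"))
# prints: {0: 0.0, 2: 1.0}
```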

Debugging LLM Decision-Making

Identify problematic reasoning patterns:

# Find moves from lost games that have recorded reasoning
import sqlite3
import pandas as pd

conn = sqlite3.connect('results/llm_litellm_groq_llama3_8b_8192.db')

losing_moves = pd.read_sql_query("""
    SELECT episode, reasoning, action, board_state
    FROM moves
    WHERE game_result = 'loss' AND reasoning IS NOT NULL
""", conn)
conn.close()

# Print the start of each reasoning text for manual review
for _, move in losing_moves.iterrows():
    print(f"Episode {move['episode']}: {move['reasoning'][:100]}...")

Best Practices

Data Collection

Run Multiple Episodes: Collect sufficient data for statistical analysis (recommended: 10+ episodes per condition)
Use Consistent Models: Keep model parameters constant for fair comparisons
Document Experiments: Record experimental conditions and model configurations

Analysis Workflow

Collect Data: Run games with LLM agents
Initial Exploration: Use python3 analysis/extract_reasoning_traces.py to understand the data
Pattern Analysis: Apply reasoning categorization and generate visualizations
Custom Analysis: Write specific queries for your research questions
Validation: Manually verify automatic categorizations for accuracy
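
The validation step can be sketched as drawing a reproducible sample for manual labeling and then measuring how often the manual labels agree with the automatic ones. Both helpers below are hypothetical and use only the standard library.

```python
import random


def sample_for_review(traces, k, seed=0):
    """Pick up to k traces for manual labeling, reproducibly for a given seed."""
    items = list(traces)
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def agreement_rate(auto_labels, manual_labels):
    """Return the fraction of positions where automatic and manual labels match."""
    if len(auto_labels) != len(manual_labels):
        raise ValueError("label lists must have the same length")
    if not auto_labels:
        return 0.0
    matches = sum(a == m for a, m in zip(auto_labels, manual_labels))
    return matches / len(auto_labels)


# Illustrative labels only
auto = ["Winning Logic", "Blocking & Defense", "Positional Strategy", "Winning Logic"]
manual = ["Winning Logic", "Blocking & Defense", "Heuristic Reasoning", "Winning Logic"]
print(agreement_rate(auto, manual))
# prints: 0.75
```

A low agreement rate suggests that the automatic categories need review before you draw conclusions from them.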

Interpretation Guidelines

Context Matters: Consider game state when evaluating reasoning quality
Length ≠ Quality: Longer reasoning isn’t necessarily better reasoning
Model Variations: Different models may use different reasoning styles
Game Complexity: Reasoning patterns vary significantly between simple and complex games

Troubleshooting

No Reasoning Traces Found

If you see “❌ No reasoning traces found”:

Ensure you’re running games with LLM agents (not just random agents)
Check that the database file exists in the results/ directory
Verify your model configuration is correct

Database Connection Issues

# Check available databases
import os
db_files = [f for f in os.listdir('results/') if f.endswith('.db')]
print("Available databases:", db_files)
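
Once you know which database files exist, a quick check is to count how many rows actually contain reasoning. The sketch below assumes the `moves` table used elsewhere in this tutorial. The function `count_traces` is a hypothetical helper that returns `None` when the file is missing or the table cannot be read.

```python
import sqlite3
from pathlib import Path


def count_traces(db_path):
    """Return the number of moves with reasoning, or None if the database is unusable."""
    if not Path(db_path).is_file():
        return None
    conn = sqlite3.connect(str(db_path))
    try:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM moves WHERE reasoning IS NOT NULL"
        ).fetchone()
        return count
    except sqlite3.Error:
        # For example: the moves table does not exist or the file is not a database
        return None
    finally:
        conn.close()
```

A result of `0` means the games ran but no reasoning was stored, which usually points to non-LLM agents. A result of `None` points to a wrong path or an unexpected schema.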

Memory Issues with Large Datasets

For large reasoning trace datasets:

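# Process data in chunks
import sqlite3
import pandas as pd

def process_reasoning_chunk(chunk):
    # Placeholder: replace with your own per-chunk analysis
    print(f"Processed {len(chunk)} traces")

conn = sqlite3.connect('results/large_dataset.db')

# With chunksize set, read_sql_query yields DataFrames of up to 1000 rows each
for chunk in pd.read_sql_query(
    "SELECT * FROM moves WHERE reasoning IS NOT NULL",
    conn, chunksize=1000
):
    process_reasoning_chunk(chunk)

conn.close()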

Next Steps

Now that you understand reasoning traces analysis, explore:

Analysis & Evaluation - Advanced analysis techniques and metrics
Examples - More complex experimental setups
API Reference - Technical details about the logging system
Extending Game Reasoning Arena - Adding custom reasoning analysis methods

The reasoning traces feature provides a unique window into LLM decision-making processes, enabling researchers to understand not just what decisions AI systems make, but how they arrive at those decisions.