Reasoning Traces Analysis
Game Reasoning Arena records reasoning traces that capture LLM decision-making during gameplay. This tutorial shows how to obtain, view, and analyze those traces to understand how LLMs reason about game strategy.
What are Reasoning Traces?
Reasoning traces capture three key pieces of information for each LLM move:
Board State: The exact game position when the decision was made
Agent Reasoning: The LLM’s thought process and explanation for the move
Action Context: The chosen action along with metadata (timestamp, episode, turn)
Together, these three pieces show how an LLM arrived at each move, which helps reveal decision-making patterns, strategic thinking, and weaknesses in its game play.
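As a rough mental model, one trace can be pictured as a single record holding these three pieces. The sketch below is illustrative only: the `ReasoningTrace` class and its field names are assumptions chosen to mirror the list above, not the library's actual data model.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReasoningTrace:
    """Hypothetical container for one LLM move (not the library's own class)."""
    game_name: str
    episode: int
    turn: int
    action: int
    board_state: str   # game position when the decision was made
    reasoning: str     # the LLM's explanation for the move
    timestamp: str     # when the move was logged


trace = ReasoningTrace(
    game_name="tic_tac_toe",
    episode=1,
    turn=0,
    action=4,
    board_state="...\n...\n...",
    reasoning="I'll take the center position for strategic advantage.",
    timestamp="2025-08-04 10:15:23",
)

# asdict() gives a plain dict that is easy to log or serialize
print(asdict(trace)["action"])
```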
Key Features
Automatic Collection: No special configuration required - traces are collected automatically during LLM gameplay
Multi-Game Support: Works across all supported games (Tic-Tac-Toe, Connect Four, Kuhn Poker, etc.)
Comprehensive Logging: Stores complete game context in SQLite databases
Analysis Tools: Built-in categorization and visualization of reasoning patterns
Research Ready: Designed for academic analysis of AI decision-making
Getting Started with Reasoning Traces
Step 1: Run Games with LLM Agents
Reasoning traces are automatically collected whenever LLM agents play games. No special configuration is needed:
# Basic LLM vs Random game - traces will be automatically collected
python3 scripts/runner.py --config src/game_reasoning_arena/configs/example_config.yaml --override \
env_config.game_name=tic_tac_toe \
agents.player_0.type=llm \
agents.player_0.model=litellm_groq/llama3-8b-8192 \
num_episodes=5
# LLM vs LLM - both agents' reasoning will be captured
python3 scripts/runner.py --config src/game_reasoning_arena/configs/example_config.yaml --override \
env_config.game_name=connect_four \
agents.player_0.type=llm \
agents.player_0.model=litellm_groq/llama3-8b-8192 \
agents.player_1.type=llm \
agents.player_1.model=litellm_groq/llama3-70b-8192 \
num_episodes=3
Results Location: Traces are automatically stored in results/llm_<model_name>.db
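The database filenames used later in this tutorial (for example results/llm_litellm_groq_llama3_8b_8192.db) suggest that non-alphanumeric characters in the model name become underscores. The helper below is a sketch of that inferred convention; `trace_db_path` is a hypothetical name, and the real naming rule may differ, so check the results/ directory if the guessed path does not exist.

```python
import re
from pathlib import Path


def trace_db_path(model_name: str, results_dir: str = "results") -> Path:
    """Guess the trace database path for a model (inferred convention, unverified)."""
    # Collapse every run of non-alphanumeric characters into one underscore
    safe_name = re.sub(r"[^0-9A-Za-z]+", "_", model_name).strip("_")
    return Path(results_dir) / f"llm_{safe_name}.db"


print(trace_db_path("litellm_groq/llama3-8b-8192"))
```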
Step 2: View Reasoning Traces
Use the built-in display script to examine the collected traces:
# Display all reasoning traces from recent games
python3 analysis/extract_reasoning_traces.py --db results/llm_model.db
This will show detailed output like:
🧠 Reasoning Trace #1
----------------------------------------
🎯 Game: tic_tac_toe
📅 Episode: 1, Turn: 0
🤖 Agent: litellm_groq/llama3-8b-8192
🎲 Action Chosen: 4
📋 Board State at Decision Time:
...
...
...
🧠 Agent's Reasoning:
I'll take the center position for strategic advantage.
The center square gives me the most control over the
board and creates multiple winning opportunities.
⏰ Timestamp: 2025-08-04 10:15:23
Advanced Analysis
Extracting Specific Traces
For targeted analysis, use the extraction script with filters:
# Extract traces for a specific game and episode
python3 analysis/extract_reasoning_traces.py --game tic_tac_toe --episode 1
# Extract all traces from the database and save them to CSV
python3 analysis/extract_reasoning_traces.py --output-format csv --output traces.csv
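If you prefer to filter inside Python, the same kind of selection can be sketched with the standard library. This assumes the moves table and column names used in the Database Queries section; the in-memory demo rows and the `export_traces` helper are made up for illustration and are not part of the extraction script.

```python
import csv
import io
import sqlite3


def export_traces(conn, game_name, episode, out):
    """Write matching traces as CSV to a file-like object; return the row count."""
    cursor = conn.execute(
        "SELECT game_name, episode, turn, action, reasoning "
        "FROM moves WHERE game_name = ? AND episode = ? AND reasoning IS NOT NULL "
        "ORDER BY turn",
        (game_name, episode),
    )
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    rows = cursor.fetchall()
    writer.writerows(rows)
    return len(rows)


# Demo data standing in for a real results/*.db file
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE moves (game_name TEXT, episode INTEGER, turn INTEGER, "
    "action INTEGER, reasoning TEXT)"
)
conn.executemany(
    "INSERT INTO moves VALUES (?, ?, ?, ?, ?)",
    [
        ("tic_tac_toe", 1, 0, 4, "Take the center."),
        ("tic_tac_toe", 1, 2, 0, "Take a corner."),
        ("tic_tac_toe", 2, 0, 4, "Center again."),
        ("connect_four", 1, 0, 3, "Middle column."),
    ],
)

buffer = io.StringIO()
exported = export_traces(conn, "tic_tac_toe", 1, buffer)
conn.close()
print(exported)
print(buffer.getvalue())
```

Passing an open file instead of the `io.StringIO` buffer would write the CSV to disk.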
Reasoning Pattern Analysis
Generate comprehensive analysis and visualizations:
# Analyze reasoning patterns and generate visualizations
python3 -c "
from analysis.reasoning_analysis import LLMReasoningAnalyzer
analyzer = LLMReasoningAnalyzer('results/merged_logs_<time_stamp>.csv')
analyzer.categorize_reasoning()
analyzer.compute_metrics(plot_dir='plots')
analyzer.plot_heatmaps_by_agent(output_dir='plots')
analyzer.plot_wordclouds_by_agent(output_dir='plots')
"
This generates multiple outputs:
Word Clouds: plots/wordcloud_<model>_<game>.png - Common reasoning terms
Pie Charts: plots/pie_reasoning_type_<model>_<game>.png - Reasoning category distributions
Heatmaps: plots/heatmap_<model>_<game>.png - Move position preferences
TensorBoard Monitoring
Game Reasoning Arena automatically logs performance metrics to TensorBoard for real-time monitoring:
# Start TensorBoard after running experiments
tensorboard --logdir=runs
# Open browser: http://localhost:6006/
TensorBoard Features:
Real-time Rewards: Monitor agent performance as games progress
Multi-Agent Comparison: Compare LLM vs Random agent performance
Episode Tracking: Visualize performance trends over multiple episodes
Export Capabilities: Download charts for analysis and presentations
Example Metrics:
Rewards/llm_litellm_groq_llama3_8b_8192: Track LLM agent rewards
Rewards/random_None: Track random agent baseline performance
TensorBoard complements reasoning traces by providing quantitative performance metrics alongside qualitative reasoning analysis.
Database Queries
For custom analysis, access the SQLite database directly:
import sqlite3
import pandas as pd
# Connect to the reasoning traces database
conn = sqlite3.connect('results/llm_litellm_groq_llama3_8b_8192.db')
# Query all reasoning traces
df = pd.read_sql_query("""
SELECT game_name, episode, turn, action, reasoning, board_state, timestamp
FROM moves
WHERE reasoning IS NOT NULL
ORDER BY timestamp
""", conn)
# Analyze reasoning length by game
reasoning_stats = df.groupby('game_name')['reasoning'].apply(
lambda x: x.str.len().describe()
)
conn.close()
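When pandas is not available, a similar per-game summary can be computed directly in SQL. The sketch below uses an in-memory database with made-up rows so it runs on its own; it assumes only the moves table and the game_name and reasoning columns from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE moves (game_name TEXT, reasoning TEXT)")
conn.executemany(
    "INSERT INTO moves VALUES (?, ?)",
    [
        ("tic_tac_toe", "Center."),          # 7 characters
        ("tic_tac_toe", "Block row."),       # 10 characters
        ("connect_four", "Middle column."),  # 14 characters
        ("connect_four", None),              # no reasoning: excluded below
    ],
)

# Count traces and average reasoning length per game
stats = {
    game: (count, avg_len)
    for game, count, avg_len in conn.execute(
        "SELECT game_name, COUNT(*), AVG(LENGTH(reasoning)) "
        "FROM moves WHERE reasoning IS NOT NULL GROUP BY game_name"
    )
}
conn.close()
print(stats)
```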
Understanding Reasoning Categories
The analysis system automatically categorizes LLM reasoning into seven types:
Positional Strategy
Focuses on board position and control:
Center control and positioning
Corner and edge play strategies
Spatial advantage concepts
Example: “I’ll take the center position for strategic advantage”
Blocking & Defense
Preventing opponent wins and defensive moves:
Blocking immediate threats
Preventing opponent strategies
Defensive positioning
Example: “I need to block their winning opportunity in column 3”
Opponent Modeling
Understanding and predicting opponent behavior:
Analyzing opponent patterns
Predicting next moves
Counter-strategy development
Example: “Based on their previous moves, they prefer corner positions”
Winning Logic
Direct winning opportunities and offensive play:
Identifying winning moves
Creating threats and forks
Forcing winning positions
Example: “This creates a fork - I can win on my next turn”
Heuristic Reasoning
General strategic principles and rules of thumb:
Best practices application
General strategy guidelines
Experience-based decisions
Example: “Opening with corner moves is generally a good strategy”
Rule-Based Decisions
Following explicit game rules or predetermined strategies:
Algorithmic approaches
Systematic decision-making
Rule application
Example: “According to basic strategy, I should prioritize the center columns”
Random/Unjustified
Unclear, random, or poorly justified reasoning:
Unclear explanations
Random choices
Weak justifications
Example: “I’ll just pick this move randomly”
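One simple way such a categorizer could work is keyword matching. The sketch below is an assumption for illustration: the keyword lists and the `categorize` function are invented here and are not the rules used by LLMReasoningAnalyzer.

```python
# Hypothetical keyword lists, one per category described above
CATEGORY_KEYWORDS = {
    "Positional Strategy": ["center", "corner", "edge"],
    "Blocking & Defense": ["block", "prevent", "defend"],
    "Opponent Modeling": ["opponent", "they prefer", "previous moves"],
    "Winning Logic": ["win", "fork", "threat"],
    "Heuristic Reasoning": ["generally", "rule of thumb", "usually"],
    "Rule-Based Decisions": ["according to", "basic strategy"],
}


def categorize(reasoning: str) -> str:
    """Return the category whose keywords appear most often in the text."""
    text = reasoning.lower()
    scores = {
        category: sum(keyword in text for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # ties go to the earlier category
    return best if scores[best] > 0 else "Random/Unjustified"


print(categorize("This creates a fork - I can win on my next turn"))
print(categorize("I'll just pick this move randomly"))
```

Substring matching like this is crude (for example, "win" also matches "winning"), which is one reason to verify automatic labels by hand.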
Research Applications
Model Comparison Studies
Compare reasoning patterns between different LLMs:
# Compare average reasoning length between models
import sqlite3
import pandas as pd
models = ['llm_groq_llama3_8b', 'llm_groq_llama3_70b', 'llm_openai_gpt4']
for model in models:
    conn = sqlite3.connect(f'results/{model}.db')
    df = pd.read_sql_query("""
        SELECT reasoning, LENGTH(reasoning) as reasoning_length
        FROM moves WHERE reasoning IS NOT NULL
    """, conn)
    print(f"{model}: Avg reasoning length = {df['reasoning_length'].mean():.1f}")
    conn.close()
Strategy Evolution Analysis
Track how reasoning changes throughout games:
# Analyze reasoning evolution within games
# (assumes pandas is imported as pd and conn is an open sqlite3 connection,
# as in the Database Queries example)
df = pd.read_sql_query("""
SELECT episode, turn, reasoning, action
FROM moves
WHERE game_name = 'tic_tac_toe'
ORDER BY episode, turn
""", conn)
# Group by turn number to see patterns
turn_patterns = df.groupby('turn')['reasoning'].apply(list)
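To compare turns at a glance, one option is to reduce each turn to a single number such as mean reasoning length. A minimal sketch with made-up rows, where plain dicts stand in for the query result:

```python
from collections import defaultdict
from statistics import mean

# Made-up rows shaped like the query result above
rows = [
    {"episode": 1, "turn": 0, "reasoning": "Take the center."},  # 16 chars
    {"episode": 1, "turn": 2, "reasoning": "Block."},            # 6 chars
    {"episode": 2, "turn": 0, "reasoning": "Center first."},     # 13 chars
    {"episode": 2, "turn": 2, "reasoning": "Block the row."},    # 14 chars
]

# Collect reasoning lengths per turn number across all episodes
lengths_by_turn = defaultdict(list)
for row in rows:
    lengths_by_turn[row["turn"]].append(len(row["reasoning"]))

mean_length_by_turn = {
    turn: mean(lengths) for turn, lengths in sorted(lengths_by_turn.items())
}
print(mean_length_by_turn)
```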
Debugging LLM Decision-Making
Identify problematic reasoning patterns:
# Find moves from games the LLM lost
# (assumes conn is an open sqlite3 connection; adjust the game_result filter to match your schema)
losing_games = pd.read_sql_query("""
SELECT episode, reasoning, action, board_state
FROM moves
WHERE game_result = 'loss' AND reasoning IS NOT NULL
""", conn)
# Print the first 100 characters of each reasoning trace
for idx, game in losing_games.iterrows():
    print(f"Episode {game['episode']}: {game['reasoning'][:100]}...")
Best Practices
Data Collection
Run Multiple Episodes: Collect sufficient data for statistical analysis (recommended: 10+ episodes per condition)
Use Consistent Models: Keep model parameters constant for fair comparisons
Document Experiments: Record experimental conditions and model configurations
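One lightweight way to document an experiment is to save its conditions as JSON next to the results. The record below is only a sketch: the keys loosely mirror the override names used earlier in this tutorial, and the exact structure (including the "random" agent type and the notes field) is an assumption, not a format the library reads.

```python
import json

# Hypothetical experiment record; adapt the fields to your own setup
experiment = {
    "game_name": "tic_tac_toe",
    "num_episodes": 10,
    "agents": {
        "player_0": {"type": "llm", "model": "litellm_groq/llama3-8b-8192"},
        "player_1": {"type": "random"},
    },
    "notes": "baseline run",
}

# sort_keys gives stable output that diffs cleanly between runs
manifest = json.dumps(experiment, indent=2, sort_keys=True)
print(manifest)
```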
Analysis Workflow
Collect Data: Run games with LLM agents
Initial Exploration: Use python3 analysis/extract_reasoning_traces.py to understand the data
Pattern Analysis: Apply reasoning categorization and generate visualizations
Custom Analysis: Write specific queries for your research questions
Validation: Manually verify automatic categorizations for accuracy
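The validation step can be made concrete by drawing a random sample of categorized traces, labeling them by hand, and measuring agreement. A minimal sketch, assuming the automatic labels are available as (reasoning, category) pairs; the data and both function names are hypothetical:

```python
import random


def sample_for_review(labeled, k, seed=0):
    """Pick up to k traces at random (reproducibly) for manual checking."""
    rng = random.Random(seed)
    return rng.sample(labeled, min(k, len(labeled)))


def agreement_rate(auto_labels, manual_labels):
    """Fraction of traces where the automatic and manual labels match."""
    if len(auto_labels) != len(manual_labels):
        raise ValueError("label lists must have the same length")
    if not auto_labels:
        return 0.0
    matches = sum(a == m for a, m in zip(auto_labels, manual_labels))
    return matches / len(auto_labels)


labeled = [
    ("Take the center.", "Positional Strategy"),
    ("Block column 3.", "Blocking & Defense"),
    ("This creates a fork.", "Winning Logic"),
    ("Pick randomly.", "Random/Unjustified"),
]
review_batch = sample_for_review(labeled, k=2)

# Made-up labels: the third automatic label disagrees with the manual one
auto = ["Positional Strategy", "Winning Logic", "Blocking & Defense", "Winning Logic"]
manual = ["Positional Strategy", "Winning Logic", "Winning Logic", "Winning Logic"]
rate = agreement_rate(auto, manual)
print(len(review_batch), rate)
```

A fixed seed keeps the review sample reproducible, so a second reviewer can label the same traces.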
Interpretation Guidelines
Context Matters: Consider game state when evaluating reasoning quality
Length ≠ Quality: Longer reasoning isn’t necessarily better reasoning
Model Variations: Different models may use different reasoning styles
Game Complexity: Reasoning patterns vary significantly between simple and complex games
Troubleshooting
No Reasoning Traces Found
If you see “❌ No reasoning traces found”:
Ensure you’re running games with LLM agents (not just random agents)
Check that the database file exists in the results/ directory
Verify your model configuration is correct
Database Connection Issues
# Check available databases
import os
db_files = [f for f in os.listdir('results/') if f.endswith('.db')]
print("Available databases:", db_files)
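If the file exists but queries still fail, it can help to confirm that the database actually contains the expected table. The sketch below uses SQLite's built-in sqlite_master catalog; `has_table` is a hypothetical helper, and moves is the table name used by the queries in this tutorial. An in-memory database stands in for a real results file.

```python
import sqlite3


def has_table(conn, table_name):
    """Return True if the connected database contains the named table."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE moves (reasoning TEXT)")
moves_found = has_table(conn, "moves")
other_found = has_table(conn, "missing_table")
conn.close()
print(moves_found, other_found)
```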
Memory Issues with Large Datasets
For large reasoning trace datasets:
# Process data in chunks
import sqlite3
import pandas as pd

def process_reasoning_chunk(chunk):
    # Placeholder: replace with your own per-chunk analysis
    print(f"Processed {len(chunk)} rows")

conn = sqlite3.connect('results/large_dataset.db')
# Use chunking for large datasets
for chunk in pd.read_sql_query(
    "SELECT * FROM moves WHERE reasoning IS NOT NULL",
    conn, chunksize=1000
):
    # Process each chunk
    process_reasoning_chunk(chunk)
conn.close()
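Without pandas, chunked reading can also be sketched with cursor.fetchmany from the standard sqlite3 module. The `iter_chunks` helper and the demo rows are illustrative; an in-memory database stands in for a large results file.

```python
import sqlite3


def iter_chunks(cursor, chunk_size):
    """Yield lists of rows with at most chunk_size rows each."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE moves (turn INTEGER, reasoning TEXT)")
conn.executemany(
    "INSERT INTO moves VALUES (?, ?)",
    [(i, f"reason {i}") for i in range(5)],
)

cursor = conn.execute(
    "SELECT turn, reasoning FROM moves WHERE reasoning IS NOT NULL"
)
# Only one chunk of rows is held in memory at a time
chunk_sizes = [len(chunk) for chunk in iter_chunks(cursor, 2)]
conn.close()
print(chunk_sizes)
```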
Next Steps
Now that you understand reasoning traces analysis, explore:
Analysis & Evaluation - Advanced analysis techniques and metrics
Examples - More complex experimental setups
API Reference - Technical details about the logging system
Extending Game Reasoning Arena - Adding custom reasoning analysis methods
The reasoning traces feature provides a unique window into LLM decision-making processes, enabling researchers to understand not just what decisions AI systems make, but how they arrive at those decisions.