Experiments
This section covers how to design and run experiments with Game Reasoning Arena, including distributed execution capabilities.
Ray Integration for Parallel Execution
Game Reasoning Arena supports Ray for distributed and parallel execution, allowing you to:
Run multiple games in parallel across different cores/machines
Parallelize episodes within games for faster data collection
Distribute LLM inference for batch processing
Scale experiments on SLURM clusters or multi-GPU setups
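The fan-out pattern behind the capabilities above can be sketched without Ray. The example below is a minimal illustration only: it uses a standard-library thread pool as a stand-in for Ray workers, and `play_episode` and `run_parallel` are hypothetical names, not part of Game Reasoning Arena.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def play_episode(game_name: str, episode_id: int) -> dict:
    """Hypothetical stand-in for one episode; returns a fake outcome."""
    # Seed per (game, episode) so results do not depend on scheduling order.
    rng = random.Random(f"{game_name}-{episode_id}")
    winner = rng.choice(["player_0", "player_1"])
    return {"game": game_name, "episode": episode_id, "winner": winner}


def run_parallel(games: list[str], num_episodes: int, max_workers: int = 4) -> list[dict]:
    """Fan every (game, episode) pair out to a worker pool and gather results."""
    tasks = [(game, ep) for game in games for ep in range(num_episodes)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the tasks were submitted.
        return list(pool.map(lambda task: play_episode(*task), tasks))


results = run_parallel(["tic_tac_toe", "connect_four"], num_episodes=5)
print(len(results), "episodes finished")
```

Because each episode is independent and seeded on its own, the results are the same whether one worker or many run them, which is the property that makes this kind of parallelism safe.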
Configuration Options
Option 1: Combined Configuration File (YAML)
# Combined config with all settings in one file
env_config:
  game_name: tic_tac_toe
num_episodes: 5
agents:
  player_0:
    type: llm
    model: litellm_groq/llama3-8b-8192
  player_1:
    type: random
use_ray: true
parallel_episodes: true
ray_config:
  num_cpus: 8
  include_dashboard: false
Option 2: Separate Ray Configuration (Recommended)
# Use any existing config + separate Ray settings
python3 scripts/runner.py \
  --base-config src/game_reasoning_arena/configs/multi_game_base.yaml \
  --ray-config src/game_reasoning_arena/configs/ray_config.yaml \
  --override num_episodes=10 \
  --override agents.player_0.model=litellm_groq/llama3-70b-8192
Option 3: Command-Line Override
# Enable Ray with any existing configuration
python3 scripts/runner.py --config src/game_reasoning_arena/configs/example_config.yaml \
  --override use_ray=true parallel_episodes=true
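To make the `key=value` override syntax concrete, here is a minimal sketch of how dotted keys such as `agents.player_0.model` could be applied to a nested configuration. It is an assumption about the behaviour, not the actual parser in `scripts/runner.py`; `parse_value` and `apply_override` are hypothetical helpers.

```python
def parse_value(raw: str):
    """Coerce a CLI string into bool, int, or float, else keep it as a string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_override(config: dict, assignment: str) -> None:
    """Apply one 'a.b.c=value' assignment to a nested dict in place."""
    dotted_key, sep, raw_value = assignment.partition("=")
    if not sep or not dotted_key:
        raise ValueError(f"expected key=value, got {assignment!r}")
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        # Create intermediate levels on demand.
        node = node.setdefault(part, {})
    node[leaf] = parse_value(raw_value)


config = {
    "num_episodes": 5,
    "agents": {"player_0": {"type": "llm", "model": "litellm_groq/llama3-8b-8192"}},
}
for item in [
    "use_ray=true",
    "parallel_episodes=true",
    "num_episodes=10",
    "agents.player_0.model=litellm_groq/llama3-70b-8192",
]:
    apply_override(config, item)

print(config)
```

Only the addressed leaf changes: `agents.player_0.type` keeps its original value while `model` is replaced.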
Option 4: Maximum Parallelization (Multi-Model Ray)
# Run multiple models in parallel with full Ray integration
# Parallelizes: Models + Games + Episodes simultaneously
python3 scripts/run_ray_multi_model.py \
  --config src/game_reasoning_arena/configs/ray_multi_model.yaml \
  --override use_ray=true
Ray Configuration Parameters
The ray_config.yaml file contains Ray-specific settings:
| Parameter | Description | Default |
|---|---|---|
| use_ray | Enable/disable Ray | |
| parallel_episodes | Parallelize episodes within games | |
| num_cpus | Number of CPUs for Ray | Auto-detect |
| | Number of GPUs for Ray | Auto-detect |
| include_dashboard | Enable Ray dashboard | |
| | Dashboard port | |
| | Object store memory limit | Auto |
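One way such settings might be handed to Ray is sketched below. `num_cpus`, `num_gpus`, `include_dashboard`, `dashboard_port`, and `object_store_memory` are keyword arguments of `ray.init`; whether `ray_config.yaml` uses exactly these names, and how Game Reasoning Arena forwards them, is an assumption. `build_ray_init_kwargs` is a hypothetical helper, and the example deliberately does not import Ray so it runs anywhere.

```python
# Keyword arguments accepted by ray.init that this sketch forwards.
RAY_INIT_KEYS = (
    "num_cpus",
    "num_gpus",
    "include_dashboard",
    "dashboard_port",
    "object_store_memory",
)


def build_ray_init_kwargs(ray_config: dict) -> dict:
    """Keep recognised keys that are explicitly set; None means 'let Ray decide'."""
    return {
        key: ray_config[key]
        for key in RAY_INIT_KEYS
        if ray_config.get(key) is not None
    }


kwargs = build_ray_init_kwargs(
    {"num_cpus": 8, "num_gpus": None, "include_dashboard": False, "use_ray": True}
)
print(kwargs)
# With Ray installed, the call would then be: ray.init(**kwargs)
```

Dropping unset keys instead of passing `None` keeps the auto-detect behaviour for anything the configuration does not mention, and unrelated keys such as `use_ray` never reach `ray.init`.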
Performance Comparison
| Execution Mode | Parallelization Level | Best For | Expected Speedup |
|---|---|---|---|
| | Episodes only | Single model, single game | ~N_episodes |
| | Games + Episodes | Single model, multiple games | ~N_games × N_episodes |
| | Models + Games + Episodes | Multiple models, multiple games | ~N_models × N_games × N_episodes |
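Recommendation: Use run_ray_multi_model.py for multi-model experiments to achieve maximum speedup. The expected speedups are ideal upper bounds; actual gains are capped by the number of available workers and, for LLM agents, by API rate limits.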
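A rough way to estimate the achievable speedup for a given machine is sketched below. It assumes every task (one episode of one game for one model) takes the same time and ignores scheduling overhead; `ideal_speedup` is a hypothetical helper, not part of the library.

```python
import math


def ideal_speedup(num_tasks: int, num_workers: int) -> float:
    """Best-case speedup when equal-length tasks run in waves of num_workers."""
    if num_tasks < 1 or num_workers < 1:
        raise ValueError("num_tasks and num_workers must be positive")
    # Serial time is num_tasks units; parallel time is one unit per wave.
    waves = math.ceil(num_tasks / num_workers)
    return num_tasks / waves


# 3 models x 2 games x 5 episodes = 30 independent tasks
tasks = 3 * 2 * 5
for workers in (8, 30, 64):
    print(workers, "workers ->", ideal_speedup(tasks, workers))
```

With 8 workers the 30 tasks need four waves, so the bound is 7.5 rather than 30; adding workers beyond the number of tasks gives no further gain.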
Configuration Merging Order
The system merges configurations in this order (later overrides earlier):
Default configuration
Base config (--base-config)
Main config (--config)
Ray config (--ray-config)
CLI overrides (--override)
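The merge order above can be illustrated with a layered dictionary merge. This is a sketch under the assumption that nested sections are merged key by key (a deep merge); the actual merging code may behave differently, and `deep_merge` and the sample layers are hypothetical.

```python
from copy import deepcopy
from functools import reduce


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values in override win, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Layers listed from lowest to highest priority.
defaults = {"num_episodes": 1, "use_ray": False,
            "ray_config": {"num_cpus": None, "include_dashboard": False}}
base_config = {"num_episodes": 5,
               "agents": {"player_0": {"type": "llm",
                                       "model": "litellm_groq/llama3-8b-8192"}}}
main_config = {}
ray_config = {"use_ray": True, "ray_config": {"num_cpus": 8}}
cli_overrides = {"num_episodes": 10,
                 "agents": {"player_0": {"model": "litellm_groq/llama3-70b-8192"}}}

layers = [defaults, base_config, main_config, ray_config, cli_overrides]
final = reduce(deep_merge, layers, {})
print(final)
```

Later layers replace only the keys they mention, so the CLI override changes `num_episodes` and the model name while `include_dashboard` still comes from the defaults.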
SLURM Integration
In cluster environments, Ray automatically detects the SLURM allocation:
# SLURM job with Ray
sbatch --nodes=2 --cpus-per-task=48 --gres=gpu:4 slurm_jobs/run_simulation.sh
The SLURM script (slurm_jobs/run_simulation.sh) handles:
Multi-node Ray cluster setup
Head node and worker initialization
GPU allocation across nodes
Environment variable configuration
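When debugging a cluster job, it can help to print what SLURM actually allocated. The sketch below reads standard SLURM environment variables (`SLURM_JOB_ID`, `SLURM_JOB_NUM_NODES`, `SLURM_CPUS_PER_TASK`, `SLURM_JOB_NODELIST`); it is not taken from `run_simulation.sh` and does not describe how Ray performs its own detection. `slurm_allocation` is a hypothetical helper.

```python
import os
from typing import Mapping


def slurm_allocation(env: Mapping[str, str] = os.environ) -> dict:
    """Summarise the current SLURM allocation, or report that none is present."""
    if "SLURM_JOB_ID" not in env:
        return {"in_slurm": False}
    return {
        "in_slurm": True,
        "job_id": env["SLURM_JOB_ID"],
        "num_nodes": int(env.get("SLURM_JOB_NUM_NODES", "1")),
        # Only set when the job was submitted with --cpus-per-task.
        "cpus_per_task": int(env.get("SLURM_CPUS_PER_TASK", "1")),
        "node_list": env.get("SLURM_JOB_NODELIST", ""),
    }


# Example with a made-up environment matching the sbatch call above.
fake_env = {
    "SLURM_JOB_ID": "12345",
    "SLURM_JOB_NUM_NODES": "2",
    "SLURM_CPUS_PER_TASK": "48",
    "SLURM_JOB_NODELIST": "node[01-02]",
}
print(slurm_allocation(fake_env))
```

Called without arguments inside a job, the function reads the real environment; outside SLURM it simply reports `in_slurm: False`.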
Debug Commands
# Check Ray status
ray status
# Monitor Ray dashboard (if enabled)
# Navigate to: http://localhost:8265
Experiment Design
Configuration Management
Use YAML configuration files to define experiments:
experiment:
  name: "llm_comparison_study"
  description: "Compare different LLM models on strategic games"
games:
  - name: "connect_four"
    num_episodes: 100
  - name: "kuhn_poker"
    num_episodes: 200
agents:
  - type: "llm"
    model: "gpt-4"
    name: "GPT4_Player"
  - type: "llm"
    model: "claude-3-sonnet"
    name: "Claude_Player"
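Before launching a study, it is useful to know how much work a configuration implies. The sketch below mirrors the YAML above as a Python dict and expands it into one job per game and agent pairing. Pairing every two agents against each other is an assumption about the study design, and `build_schedule` is a hypothetical helper.

```python
from itertools import combinations

# Same content as the YAML example, written as a dict to avoid a YAML dependency.
experiment = {
    "experiment": {"name": "llm_comparison_study"},
    "games": [
        {"name": "connect_four", "num_episodes": 100},
        {"name": "kuhn_poker", "num_episodes": 200},
    ],
    "agents": [
        {"type": "llm", "model": "gpt-4", "name": "GPT4_Player"},
        {"type": "llm", "model": "claude-3-sonnet", "name": "Claude_Player"},
    ],
}


def build_schedule(config: dict) -> list[dict]:
    """List one job per (game, agent pairing), carrying the game's episode count."""
    agent_names = [agent["name"] for agent in config["agents"]]
    return [
        {"game": game["name"], "players": pair, "num_episodes": game["num_episodes"]}
        for game in config["games"]
        for pair in combinations(agent_names, 2)
    ]


schedule = build_schedule(experiment)
total_episodes = sum(job["num_episodes"] for job in schedule)
print(len(schedule), "jobs,", total_episodes, "episodes in total")
```

With two agents there is a single pairing, so this configuration expands to two jobs and 300 episodes; adding a third agent would triple the number of pairings.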
Running Experiments
Single Experiments
python scripts/simulate.py --config experiments/my_experiment.yaml
Batch Experiments
For large-scale studies:
# Using SLURM for cluster computing
sbatch slurm_jobs/run_simulation.sh
# Or parallel execution
python scripts/runner.py --parallel --jobs 8
Distributed Computing
Use Ray for distributed execution:
execution:
  backend: "ray"
  num_workers: 8
  resources_per_worker:
    cpu: 2
    memory: "4GB"
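A quick sanity check is to total what this block requests and compare it with the allocation. The sketch below assumes memory strings use `MB`, `GB`, or `TB` suffixes with 1024-based conversion; `memory_in_gb` and `total_request` are hypothetical helpers, not library functions.

```python
import re

# Conversion factors to GB, assuming 1024-based units.
UNIT_TO_GB = {"MB": 1 / 1024, "GB": 1, "TB": 1024}


def memory_in_gb(text: str) -> float:
    """Parse strings such as '4GB' or '512 MB' into a number of gigabytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(MB|GB|TB)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised memory size: {text!r}")
    amount, unit = match.groups()
    return float(amount) * UNIT_TO_GB[unit.upper()]


def total_request(execution: dict) -> dict:
    """Multiply the per-worker resources by the number of workers."""
    workers = execution["num_workers"]
    per_worker = execution["resources_per_worker"]
    return {
        "cpus": workers * per_worker["cpu"],
        "memory_gb": workers * memory_in_gb(per_worker["memory"]),
    }


execution = {
    "backend": "ray",
    "num_workers": 8,
    "resources_per_worker": {"cpu": 2, "memory": "4GB"},
}
print(total_request(execution))
```

For the configuration above this comes to 16 CPUs and 32 GB, which fits comfortably inside a node requested with `--cpus-per-task=48`.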
Statistical Analysis
Significance Testing
from game_reasoning_arena.analysis import statistical_tests

# Compare win rates between agents
p_value = statistical_tests.binomial_test(
    wins_a=75, games_a=100,
    wins_b=65, games_b=100
)
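To show what a comparison of two win rates involves, here is a standard-library sketch of a pooled two-proportion z-test on the same numbers. It is a swapped-in textbook method for illustration; the test actually implemented by `statistical_tests.binomial_test` may differ, and `two_proportion_z_test` is a hypothetical name.

```python
import math


def two_proportion_z_test(wins_a: int, games_a: int, wins_b: int, games_b: int):
    """Return (z, two-sided p-value) for H0: both agents share one win rate."""
    rate_a = wins_a / games_a
    rate_b = wins_b / games_b
    # Under H0 the best estimate of the common win rate pools both samples.
    pooled = (wins_a + wins_b) / (games_a + games_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    if std_err == 0:
        raise ValueError("win rates are degenerate (all wins or all losses)")
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = two_proportion_z_test(wins_a=75, games_a=100, wins_b=65, games_b=100)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Under this test, a 75% versus 65% win rate over 100 games each gives a p-value of roughly 0.12, so the gap would not count as significant at the usual 0.05 level; more episodes would be needed to separate the two agents.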