LLM Backends

Game Reasoning Arena supports multiple LLM inference backends, allowing you to use both API-based and locally-hosted models seamlessly. This flexibility enables mixing different providers in the same experiment and choosing the most appropriate backend for your research needs.

Overview

The framework provides three main backend types:

  • LiteLLM Backend: API-based inference supporting 100+ providers

  • vLLM Backend: Local GPU inference for self-hosted models

  • HuggingFace Backend: Local CPU inference using transformers pipeline

All backends implement the same interface, making them interchangeable in experiments and allowing for easy comparison between different models and deployment strategies.

LiteLLM Backend

The LiteLLM backend provides access to 100+ language models through a unified API interface, supporting major providers including OpenAI, Anthropic, Google, Groq, Together AI, and many others.

Key Features

  • Unified API: Single interface for multiple providers

  • Cost-effective: Access to free and low-cost API endpoints

  • Fast inference: Providers like Groq offer extremely fast response times

  • No local setup: No GPU requirements or model downloads

  • Wide model selection: From small efficient models to large frontier models

Supported Providers

✓ OpenAI (GPT-3.5, GPT-4, GPT-4 Turbo)
✓ Anthropic (Claude 3, Claude 3.5 Sonnet)
✓ Google (Gemini Pro, Gemma)
✓ Groq (Llama 3, Gemma - ultra-fast inference)
✓ Together AI (Llama 3.1, Mixtral, Code Llama)
✓ Fireworks AI (Llama, Mixtral models)
✓ Cohere (Command models)
✓ Hugging Face Inference API
✓ Replicate (Various open-source models)
✓ And 90+ more providers...

Configuration

Models are configured in src/game_reasoning_arena/configs/litellm_models.yaml:

models:
  # OpenAI models
  - litellm_gpt-3.5-turbo
  - litellm_gpt-4
  - litellm_gpt-4-turbo

  # Groq models (fast inference)
  - litellm_groq/llama3-8b-8192
  - litellm_groq/llama3-70b-8192
  - litellm_groq/gemma-7b-it

  # Together AI models
  - litellm_together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
  - litellm_together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1

Model Naming Convention

LiteLLM models use the prefix litellm_ followed by the provider and model name:

Format: litellm_<provider>/<model_name>

Examples:
- litellm_groq/llama3-8b-8192
- litellm_gpt-4-turbo
- litellm_together_ai/meta-llama/Llama-2-7b-chat-hf

API Keys Setup

Create a .env file in the project root with your API keys:

# OpenAI
OPENAI_API_KEY=your_openai_key_here

# Groq (free tier available)
GROQ_API_KEY=your_groq_key_here

# Together AI
TOGETHER_API_KEY=your_together_key_here

# Fireworks AI
FIREWORKS_API_KEY=your_fireworks_key_here

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_key_here

Usage Example

# Use GPT-4 via OpenAI
python scripts/runner.py --config configs/example_config.yaml --override \\
  agents.player_0.model=litellm_gpt-4

# Use Llama 3 via Groq (fast inference)
python scripts/runner.py --config configs/example_config.yaml --override \\
  agents.player_0.model=litellm_groq/llama3-8b-8192

vLLM Backend

The vLLM backend enables local GPU inference for self-hosted models, providing full control over model deployment, privacy, and customization.

Key Features

  • Local deployment: Complete control over model hosting

  • GPU acceleration: Optimized inference on NVIDIA GPUs

  • Privacy: No data leaves your infrastructure

  • Customization: Fine-tuned models and custom configurations

  • Cost control: No per-token API costs for heavy usage

  • Offline capability: Works without internet connectivity

Requirements

✓ NVIDIA GPU with CUDA support
✓ Sufficient GPU memory (varies by model size)
✓ Local model files (Hugging Face format)
✓ vLLM package installation

Model Setup

  1. Download Models: Obtain model files locally

# Example: Download Qwen2-7B-Instruct
git lfs clone https://huggingface.co/Qwen/Qwen2-7B-Instruct /path/to/models/Qwen2-7B-Instruct
  1. Configure Model Paths: Update src/game_reasoning_arena/configs/vllm_models.yaml

models:
  - name: vllm_Qwen2-7B-Instruct
    model_path: /absolute/path/to/models/Qwen/Qwen2-7B-Instruct
    tokenizer_path: /absolute/path/to/models/Qwen/Qwen2-7B-Instruct
    description: Qwen2 7B Instruct model for local inference

  - name: vllm_Llama-2-7b-chat-hf
    model_path: /absolute/path/to/models/meta-llama/Llama-2-7b-chat-hf
    description: Llama2 7B Chat model

Important

All model paths must be absolute paths to the model directories containing the model files and tokenizer.

Model Naming Convention

vLLM models use the prefix vllm_ followed by the model identifier:

Format: vllm_<model_identifier>

Examples:
- vllm_Qwen2-7B-Instruct
- vllm_Llama-2-7b-chat-hf
- vllm_CodeLlama-7b-Instruct-hf

Usage Example

# Use local Qwen2-7B model
python scripts/runner.py --config configs/example_config.yaml --override \\
  agents.player_0.model=vllm_Qwen2-7B-Instruct

# Use local Llama model
python scripts/runner.py --config configs/example_config.yaml --override \\
  agents.player_0.model=vllm_Llama-2-7b-chat-hf

Installation

Install vLLM package for local inference:

# Install vLLM
pip install vllm

# For specific CUDA versions, see vLLM documentation
pip install vllm-nightly  # Latest features

HuggingFace Backend

The HuggingFace backend enables local CPU inference using the transformers library, providing a lightweight option for running smaller models without GPU requirements.

Key Features

  • No GPU required: CPU-only inference for accessibility

  • No API costs: Completely free local inference

Supported Models

The HuggingFace backend comes pre-configured with several popular models:

✓ gpt2 - OpenAI's GPT-2 base model
✓ distilgpt2 - Distilled version of GPT-2 (faster)
✓ google/flan-t5-small - Google's T5 model fine-tuned for instructions
✓ EleutherAI/gpt-neo-125M - EleutherAI's lightweight GPT model

Configuration

HuggingFace models are automatically configured and require no additional setup. The backend uses the transformers pipeline for text generation.

Model Naming Convention

HuggingFace models use the prefix hf_ followed by the model identifier:

Format: hf_<model_name>

Examples:
- hf_gpt2
- hf_distilgpt2
- hf_google/flan-t5-small
- hf_EleutherAI/gpt-neo-125M

Usage Example

# Use GPT-2 with HuggingFace backend
python scripts/runner.py --config configs/example_config.yaml --override \\
  agents.player_0.model=hf_gpt2

# Use DistilGPT-2 for faster inference
python scripts/runner.py --config configs/example_config.yaml --override \\
  agents.player_0.model=hf_distilgpt2

Performance Notes

Note

Small transformer models may produce less coherent responses compared to larger API models. The backend includes intelligent fallback mechanisms to ensure valid game actions.

Mixed Backend Usage

One of the powerful features of Game Reasoning Arena is the ability to mix different backends in the same experiment, enabling direct comparison between API-based and local models.

LiteLLM vs vLLM Comparison

# Compare API model vs local model
python scripts/runner.py --config configs/example_config.yaml --override \\
  mode=llm_vs_llm \\
  agents.player_0.model=litellm_groq/llama3-8b-8192 \\
  agents.player_1.model=vllm_Qwen2-7B-Instruct \\
  num_episodes=10

Cross-Provider Experiments

# Mix different API providers
python scripts/runner.py --config configs/example_config.yaml --override \\
  mode=llm_vs_llm \\
  agents.player_0.model=litellm_gpt-4-turbo \\
  agents.player_1.model=litellm_groq/llama3-70b-8192

# Compare API efficiency vs local control
python scripts/runner.py --config configs/example_config.yaml --override \\
  mode=llm_vs_llm \\
  agents.player_0.model=litellm_together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct \\
  agents.player_1.model=vllm_Llama-2-7b-chat-hf

# Test HuggingFace vs API models
python scripts/runner.py --config configs/example_config.yaml --override \\
  mode=llm_vs_llm \\
  agents.player_0.model=hf_gpt2 \\
  agents.player_1.model=litellm_groq/llama3-8b-8192

# Compare all three backends
python scripts/runner.py --config configs/three_way_comparison.yaml --override \\
  mode=multi_agent \\
  agents.player_0.model=litellm_gpt-4-turbo \\
  agents.player_1.model=vllm_Qwen2-7B-Instruct \\
  agents.player_2.model=hf_distilgpt2

Backend Selection Guide

Choose the appropriate backend based on your research needs:

LiteLLM When:

  • Quick prototyping and experimentation

  • Limited GPU resources or no local hardware

  • Comparing multiple models without setup overhead

  • Cost-effective research with free tiers (e.g., Groq)

  • Access to frontier models (GPT-4, Claude 3.5)

  • Fast iteration on experiments

vLLM When:

  • Privacy requirements for sensitive data

  • High-volume experiments where API costs become prohibitive

  • Custom model fine-tuning and specialized deployments

  • Offline environments without internet access

  • Full control over inference parameters and optimization

  • Research on model behavior requiring deterministic setups

HuggingFace When:

  • CPU-only environments without GPU access

  • Development and testing without external dependencies

  • Experimentation with small models for proof of concept

Performance Considerations

Inference Speed

Backend

Typical Latency

Throughput

Best For

Groq (LiteLLM)

50-200ms

Very High

Fast experimentation

OpenAI (LiteLLM)

500-2000ms

High

Quality baseline

Local vLLM

100-1000ms

Variable

Privacy, control

HuggingFace (CPU)

2000-10000ms

Low

Education, testing

Cost Comparison

Model Type

Setup Cost

Per-Token Cost

Break-Even Point

LiteLLM API

$0

$0.001-0.01

< 1M tokens

Local vLLM

GPU hardware

Electricity only

> 1M tokens

HuggingFace CPU

$0

$0 (CPU time)

Always free

HuggingFace CPU

$0

$0 (CPU time)

Always free

Troubleshooting

Common LiteLLM Issues

Authentication Errors:

# Check API key is set
echo $OPENAI_API_KEY

# Verify .env file exists and is formatted correctly
cat .env

Rate Limiting:

# Use multiple providers or add delays
# Configure rate limits in backend settings

Common vLLM Issues

CUDA Out of Memory:

# Check GPU memory
nvidia-smi

# Use smaller models or reduce batch size
# Consider model quantization

Model Path Errors:

# Verify absolute paths in vllm_models.yaml
ls /absolute/path/to/model/directory

# Ensure model files are present
ls /path/to/model/config.json

Import Errors:

# Install vLLM properly
pip install vllm

# Check CUDA compatibility
python -c "import torch; print(torch.cuda.is_available())"

Adding New Models

LiteLLM Models

  1. Find the model identifier from LiteLLM documentation

  2. Add to configuration:

# In src/game_reasoning_arena/configs/litellm_models.yaml
models:
  - litellm_new_provider/new_model_name
  1. Set up API keys in .env file if needed

  2. Test the model:

python scripts/runner.py --config configs/example_config.yaml --override \\
  agents.player_0.model=litellm_new_provider/new_model_name \\
  num_episodes=1

vLLM Models

  1. Download model files to local directory

  2. Add model configuration:

# In src/game_reasoning_arena/configs/vllm_models.yaml
models:
  - name: vllm_new_model_name
    model_path: /absolute/path/to/model
    description: Description of the new model
  1. Test the model:

python scripts/runner.py --config configs/example_config.yaml --override \\
  agents.player_0.model=vllm_new_model_name \\
  num_episodes=1

HuggingFace Models

HuggingFace models are automatically available without additional configuration. The framework comes pre-configured with several popular models:

  • gpt2 - OpenAI’s GPT-2 base model

  • distilgpt2 - Distilled version of GPT-2 (faster inference)

  • google/flan-t5-small - Google’s T5 model fine-tuned for instructions

  • EleutherAI/gpt-neo-125M - EleutherAI’s lightweight GPT model

To use additional HuggingFace models, simply use the hf_ prefix with any model from the HuggingFace Hub:

# Test with any HuggingFace model
python scripts/runner.py --config configs/example_config.yaml --override \\
  agents.player_0.model=hf_microsoft/DialoGPT-small \\
  num_episodes=1

Note

Models will be automatically downloaded on first use and cached locally. Ensure you have sufficient disk space and internet connectivity for the initial download.

See Also