Model Performance Testing Methodology
This document outlines the methodology used to test various LLMs through Ollama on a GPU-poor setup.
Hardware Specifications
GPU
- Model: AMD Radeon RX 7600 XT 16GB
- Note: At the time of testing, one of the most affordable (i.e. GPU-poorest) graphics cards with 16GB of VRAM on the market, making it an attractive choice for budget-conscious AI enthusiasts
System Specifications
- CPU: AMD Ryzen 7 5700X (8 cores / 16 threads) @ 4.66 GHz
- Motherboard: B550 Pro4
- RAM: 64GB
- OS: Debian 12 Bookworm
- Kernel: Linux 6.8.12-8
- Testing Environment: Ollama with ROCm backend
Testing Methodology
Each model is tested with the same creative writing prompt, designed to evaluate both raw performance and creative capability. The testing process includes the following steps (a sketch of the loop appears after this list):
- Model Loading: Each model is loaded fresh before testing
- Initial Warmup: A small test prompt is run to ensure the model is properly loaded
- Main Test: A comprehensive creative writing prompt is processed
- Performance Metrics Collection: Various metrics are gathered during generation
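A minimal sketch of this flow, assuming Ollama's REST API at its default address (http://localhost:11434); the helper name `run_test` and the timeout values are illustrative, not taken from the actual testing script:

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint (assumption)

def run_test(model: str, prompt: str, options: dict) -> dict:
    """Warm up a freshly loaded model, then run the main test prompt."""
    # Warmup: a tiny prompt forces Ollama to load the model into VRAM
    requests.post(f"{OLLAMA_URL}/api/generate",
                  json={"model": model, "prompt": "Hello", "stream": False},
                  timeout=300).raise_for_status()

    # Main test: single non-streamed request with fixed generation options
    start = time.time()
    resp = requests.post(f"{OLLAMA_URL}/api/generate",
                         json={"model": model, "prompt": prompt,
                               "stream": False, "options": options},
                         timeout=600)
    resp.raise_for_status()
    data = resp.json()
    data["wall_time_s"] = time.time() - start  # total response time as seen by the client
    return data
```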
Test Prompt
The following creative writing prompt is used to test all models:
You are a creative writing assistant. Write a short story about a futuristic city where:
1. The city is powered by a mysterious energy source
2. The inhabitants have developed unique abilities
3. There's a hidden conflict between different factions
4. The protagonist discovers a shocking truth about the city's origins
Make the story engaging and include vivid descriptions of the city's architecture and technology.
This prompt was chosen because it:
- Requires creative thinking and complex reasoning
- Generates substantial output (typically 500-1000 tokens)
- Tests both context understanding and generation capabilities
- Produces outputs of consistent length for fair comparison
Metrics Collected
For each model, we collect and analyze the following (a sketch of how the throughput figures are derived follows these lists):
Performance Metrics:
- Tokens per second (overall)
- Generation tokens per second
- Total response time
- Total tokens generated
Resource Usage:
- VRAM usage
- Model size
- Parameter count
Model Information:
- Quantization level
- Model format
- Model family
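As an illustration, the throughput figures can be derived from the timing fields that Ollama's /api/generate response returns (durations are reported in nanoseconds). The field names below come from that API; the `summarize` helper itself is just a sketch:

```python
def summarize(data: dict) -> dict:
    """Derive throughput metrics from an Ollama /api/generate response."""
    ns = 1e9  # Ollama reports durations in nanoseconds
    prompt_tokens = data.get("prompt_eval_count", 0)
    gen_tokens = data.get("eval_count", 0)
    return {
        "total_tokens": prompt_tokens + gen_tokens,
        "generated_tokens": gen_tokens,
        "total_time_s": data.get("total_duration", 0) / ns,
        # Generation tokens/s: output tokens over pure generation time
        "gen_tokens_per_s": gen_tokens / (data.get("eval_duration", 1) / ns),
        # Overall tokens/s: all processed tokens over the whole request
        "overall_tokens_per_s": (prompt_tokens + gen_tokens) / (data.get("total_duration", 1) / ns),
    }
```

Model size, parameter count, quantization level, format, and family can be read from Ollama's /api/show endpoint (its details object), while VRAM usage is observed with an external tool such as rocm-smi.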
Testing Parameters
All tests are run with consistent generation parameters, mapped to Ollama's option names in the sketch after this list:
- Temperature: 0.7
- Top P: 0.9
- Top K: 40
- Max Tokens: 1000
- Repetition Penalty: 1.0
- Seed: 42 (for reproducibility)
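For reference, these parameters map onto Ollama's generation options roughly as follows; the option names are Ollama's (`num_predict` caps the number of generated tokens, `repeat_penalty` is the repetition penalty), and this dictionary is what would be passed as `options` in the sketch above:

```python
GENERATION_OPTIONS = {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "num_predict": 1000,    # max tokens to generate
    "repeat_penalty": 1.0,  # repetition penalty (1.0 = effectively disabled)
    "seed": 42,             # fixed seed for reproducibility
}
```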
Notes
- Tests are run sequentially to ensure no resource contention
- A 3-second cooldown period is maintained between tests
- Models are unloaded after each test to ensure clean state
- Results are saved both in detailed and summary formats
- The testing script automatically handles model pulling and cleanup (sketched below)
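A sketch of the pull/unload housekeeping around each test, again assuming the default REST endpoint and reusing the hypothetical helpers above; `keep_alive: 0` asks Ollama to evict the model immediately, and the 3-second sleep is the cooldown mentioned in the notes:

```python
def pull_model(model: str) -> None:
    """Download the model if it is not already present locally."""
    requests.post(f"{OLLAMA_URL}/api/pull",
                  json={"model": model, "stream": False},
                  timeout=3600).raise_for_status()

def unload_model(model: str) -> None:
    """Ask Ollama to unload the model from VRAM immediately."""
    requests.post(f"{OLLAMA_URL}/api/generate",
                  json={"model": model, "keep_alive": 0},
                  timeout=60).raise_for_status()

def run_suite(models: list[str], prompt: str, options: dict) -> None:
    """Run all models sequentially with a cooldown and clean unload between tests."""
    for model in models:
        pull_model(model)
        result = run_test(model, prompt, options)  # defined in the sketch above
        print(summarize(result))                   # or persist to detailed/summary files
        unload_model(model)
        time.sleep(3)  # cooldown between tests
```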