Model Performance Testing Methodology
This document outlines the methodology used to test various LLMs through Ollama on a GPU-poor setup.
Hardware Specifications
GPU
- Model: AMD Radeon RX 7600 XT 16GB
- Note: At the time of testing, one of the most affordable (i.e. GPU-poorest) graphics cards with 16GB of VRAM on the market, making it an attractive choice for budget-conscious AI enthusiasts
System Specifications
- CPU: AMD Ryzen 7 5700X (8 cores / 16 threads) @ 4.66 GHz
- Motherboard: B550 Pro4
- RAM: 64GB
- OS: Debian 12 Bookworm
- Kernel: Linux 6.8.12-8
- Testing Environment: Ollama with ROCm backend
Testing Methodology
Each model is tested with the same creative writing prompt, designed to evaluate both raw performance and creative capability. The testing process includes the following steps (a sketch of the loop appears after this list):
- Model Loading: Each model is loaded fresh before testing
- Initial Warmup: A small test prompt is run to ensure the model is properly loaded
- Main Test: A comprehensive creative writing prompt is processed
- Performance Metrics Collection: Various metrics are gathered during generation
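A minimal sketch of this flow, assuming Ollama's REST API at its default address (http://localhost:11434); the helper name `run_test` and the timeout values are illustrative, not taken from the actual testing script:

```python
import time
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint (assumption)

def run_test(model: str, prompt: str, options: dict) -> dict:
    """Warm up a freshly loaded model, then run the main test prompt."""
    # Warmup: a tiny prompt forces Ollama to load the model into VRAM
    requests.post(f"{OLLAMA_URL}/api/generate",
                  json={"model": model, "prompt": "Hello", "stream": False},
                  timeout=300).raise_for_status()

    # Main test: single non-streamed request with fixed generation options
    start = time.time()
    resp = requests.post(f"{OLLAMA_URL}/api/generate",
                         json={"model": model, "prompt": prompt,
                               "stream": False, "options": options},
                         timeout=600)
    resp.raise_for_status()
    data = resp.json()
    data["wall_time_s"] = time.time() - start  # total response time as seen by the client
    return data
```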
Test Prompt
The following creative writing prompt is used to test all models:
You are a creative writing assistant. Write a short story about a futuristic city where:
1. The city is powered by a mysterious energy source
2. The inhabitants have developed unique abilities
3. There's a hidden conflict between different factions
4. The protagonist discovers a shocking truth about the city's origins
Make the story engaging and include vivid descriptions of the city's architecture and technology.
This prompt was chosen because it:
- Requires creative thinking and complex reasoning
- Generates substantial output (typically 500-1000 tokens)
- Tests both context understanding and generation capabilities
- Produces outputs of consistent length for fair comparison
Metrics Collected
For each model, we collect and analyze the following (a sketch of how the throughput figures are derived follows these lists):
Performance Metrics:
- Tokens per second (overall)
- Generation tokens per second
- Total response time
- Total tokens generated
Resource Usage:
- VRAM usage
- Model size
- Parameter count
Model Information:
- Quantization level
- Model format
- Model family
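As an illustration, the throughput figures can be derived from the timing fields that Ollama's /api/generate response returns (durations are reported in nanoseconds). The field names below come from that API; the `summarize` helper itself is just a sketch:

```python
def summarize(data: dict) -> dict:
    """Derive throughput metrics from an Ollama /api/generate response."""
    ns = 1e9  # Ollama reports durations in nanoseconds
    prompt_tokens = data.get("prompt_eval_count", 0)
    gen_tokens = data.get("eval_count", 0)
    return {
        "total_tokens": prompt_tokens + gen_tokens,
        "generated_tokens": gen_tokens,
        "total_time_s": data.get("total_duration", 0) / ns,
        # Generation tokens/s: output tokens over pure generation time
        "gen_tokens_per_s": gen_tokens / (data.get("eval_duration", 1) / ns),
        # Overall tokens/s: all processed tokens over the whole request
        "overall_tokens_per_s": (prompt_tokens + gen_tokens) / (data.get("total_duration", 1) / ns),
    }
```

Model size, parameter count, quantization level, format, and family can be read from Ollama's /api/show endpoint (its details object), while VRAM usage is observed with an external tool such as rocm-smi.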
Testing Parameters
All tests are run with consistent generation parameters, mapped to Ollama's option names in the sketch after this list:
- Temperature: 0.7
- Top P: 0.9
- Top K: 40
- Max Tokens: 1000
- Repetition Penalty: 1.0
- Seed: 42 (for reproducibility)
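For reference, these parameters map onto Ollama's generation options roughly as follows; the option names are Ollama's (`num_predict` caps the number of generated tokens, `repeat_penalty` is the repetition penalty), and this dictionary is what would be passed as `options` in the sketch above:

```python
GENERATION_OPTIONS = {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "num_predict": 1000,    # max tokens to generate
    "repeat_penalty": 1.0,  # repetition penalty (1.0 = effectively disabled)
    "seed": 42,             # fixed seed for reproducibility
}
```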
Notes
- Tests are run sequentially to ensure no resource contention
- A 3-second cooldown period is maintained between tests
- Models are unloaded after each test to ensure clean state
- Results are saved both in detailed and summary formats
- The testing script automatically handles model pulling and cleanup (sketched below)
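A sketch of the pull/unload housekeeping around each test, again assuming the default REST endpoint and reusing the hypothetical helpers above; `keep_alive: 0` asks Ollama to evict the model immediately, and the 3-second sleep is the cooldown mentioned in the notes:

```python
def pull_model(model: str) -> None:
    """Download the model if it is not already present locally."""
    requests.post(f"{OLLAMA_URL}/api/pull",
                  json={"model": model, "stream": False},
                  timeout=3600).raise_for_status()

def unload_model(model: str) -> None:
    """Ask Ollama to unload the model from VRAM immediately."""
    requests.post(f"{OLLAMA_URL}/api/generate",
                  json={"model": model, "keep_alive": 0},
                  timeout=60).raise_for_status()

def run_suite(models: list[str], prompt: str, options: dict) -> None:
    """Run all models sequentially with a cooldown and clean unload between tests."""
    for model in models:
        pull_model(model)
        result = run_test(model, prompt, options)  # defined in the sketch above
        print(summarize(result))                   # or persist to detailed/summary files
        unload_model(model)
        time.sleep(3)  # cooldown between tests
```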