Spaces:
Running
Running
## Setup env | |
```bash | |
uv venv | |
source .venv/bin/activate | |
uv pip install -r requirements.txt | |
# pull wikihop db | |
wget https://huggingface.co/datasets/HuggingFaceTB/simplewiki-pruned-text-350k/blob/main/wikihop.db -o wikihop.db | |
``` | |
## Which models does it support? | |
Under the hood it uses [LiteLLM](https://github.com/BerriAI/litellm) so you can use any major model (dont forget to export appropriate api key), or host any model on huggingface via [vLLM](https://github.com/vllm-project/vllm). | |
## Play the game | |
``` | |
# play the game with cli | |
python game.py --human --start 'Saint Lucia' --end 'Italy' --db wikihop.db | |
# have the agent play the game (gpt-4o) | |
export OPENAI_API_KEY=sk_xxxxx | |
python game.py --agent --start 'Saint Lucia' --end 'Italy' --db wikihop.db --model gpt-4o --max-steps 20 | |
# run an evaluation suite with qwen3 hosted on vLLM, 200 workers | |
python proctor.py --model "hosted_vllm/Qwen/Qwen3-30B-A3B" --api-base "http://localhost:8000/v1" --workers 200 | |
# this will produce a `proctor_tmp/proctor_1-final-results.json` that can be visualized in the space, as well as the individual reasoning traces for each run. This is resumable if it is stopped and is idempotent. | |
``` | |
## JQ command to strip out reasoning traces | |
This output file will be very large because it contains all the reasoning traces. You can shrink it down and still be able to visualize it with | |
```bash | |
jq '{ | |
article_list: .article_list, | |
num_trials: .num_trials, | |
num_workers: .num_workers, | |
max_steps: .max_steps, | |
agent_settings: .agent_settings, | |
runs: [.runs[] | { | |
model: .model, | |
api_base: .api_base, | |
max_links: .max_links, | |
max_tries: .max_tries, result: .result, | |
start_article: .start_article, | |
destination_article: .destination_article, | |
steps: [.steps[] | { | |
type: .type, | |
article: .article, | |
metadata: (if .metadata.conversation then | |
.metadata | del(.conversation) | |
else | |
.metadata | |
end) | |
}] | |
}] | |
}' proctor_tmp/proctor_1-final-results.json > cleaned_data.json | |
``` |