Spaces:

HuggingFaceTB
/

wikiracing-llms

Running

App Files Files Community

wikiracing-llms / parallel_eval /README.md

stillerman

docs minus assets

1e310b7 about 2 months ago

preview code

raw

history blame contribute delete

2.06 kB

	## Setup env

	```bash
	uv venv
	source .venv/bin/activate
	uv pip install -r requirements.txt

	# pull wikihop db
	wget https://huggingface.co/datasets/HuggingFaceTB/simplewiki-pruned-text-350k/blob/main/wikihop.db -o wikihop.db
	```

	## Which models does it support?
	Under the hood it uses [LiteLLM](https://github.com/BerriAI/litellm) so you can use any major model (dont forget to export appropriate api key), or host any model on huggingface via [vLLM](https://github.com/vllm-project/vllm).


	## Play the game
	```
	# play the game with cli
	python game.py --human --start 'Saint Lucia' --end 'Italy' --db wikihop.db

	# have the agent play the game (gpt-4o)
	export OPENAI_API_KEY=sk_xxxxx
	python game.py --agent --start 'Saint Lucia' --end 'Italy' --db wikihop.db --model gpt-4o --max-steps 20

	# run an evaluation suite with qwen3 hosted on vLLM, 200 workers
	python proctor.py --model "hosted_vllm/Qwen/Qwen3-30B-A3B" --api-base "http://localhost:8000/v1" --workers 200

	# this will produce a `proctor_tmp/proctor_1-final-results.json` that can be visualized in the space, as well as the individual reasoning traces for each run. This is resumable if it is stopped and is idempotent.
	```

	## JQ command to strip out reasoning traces
	This output file will be very large because it contains all the reasoning traces. You can shrink it down and still be able to visualize it with

	```bash
	jq '{
	article_list: .article_list,
	num_trials: .num_trials,
	num_workers: .num_workers,
	max_steps: .max_steps,
	agent_settings: .agent_settings,
	runs: [.runs[] \| {
	model: .model,
	api_base: .api_base,
	max_links: .max_links,
	max_tries: .max_tries, result: .result,
	start_article: .start_article,
	destination_article: .destination_article,
	steps: [.steps[] \| {
	type: .type,
	article: .article,
	metadata: (if .metadata.conversation then
	.metadata \| del(.conversation)
	else
	.metadata
	end)
	}]
	}]
	}' proctor_tmp/proctor_1-final-results.json > cleaned_data.json
	```