Reproducibility error
Hi,
I used the command from your FAQ to run the evaluation for myself.
After 13 hours of "Running loglikelihood requests" it ran into this error:
Running generate_until requests: 0%| | 0/1865 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/scratch-scc/users/u12246/environments/openllm_env/bin/lm-eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/evaluator.py", line 296, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/evaluator.py", line 468, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch-scc/users/u12246/lm-evaluation-harness/lm_eval/models/huggingface.py", line 1326, in generate_until
    chunks = re_ords.get_batched(
             ^^^^^^^^^^^^^^^^^^^^
TypeError: Collator.get_batched() got an unexpected keyword argument 'reset_batch_fn'
Also some INFO:
[__init__.py:512] The tag xnli is already registered as a group, this tag will not be registered. This may affect tasks you want to call.
[task.py:337] [Task: leaderboard_musr_team_allocation] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
And WARNING:
[task.py:337] [Task: leaderboard_musr_object_placements] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
[task.py:337] [Task: leaderboard_musr_murder_mysteries] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
[task.py:337] [Task: leaderboard_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
Do you have a fix? What is the downside of using EleutherAI/lm-evaluation-harness?
Hi @cluebbers,
Let me try to help you! Could you please provide the exact command you used, what model you are trying to evaluate, and what hardware you are using?
@alozowski
I am getting the same issue. FWIW, I am using transformers main and running:
accelerate launch --num_processes=2 --multi-gpu -m lm_eval --model hf --model_args "pretrained=/path/to/model,dtype=bfloat16" --tasks=leaderboard --batch_size=auto --output_path=outputs/
The command looks incomplete and has an unusual structure. For instance, the pretrained=/path/to/model placeholder doesn't name an actual model. Also, without knowing what hardware you're using (e.g., multi-GPU, TPU, or CPU-only), it's hard to diagnose compatibility or performance issues.
Could you confirm:
- The exact model you're trying to evaluate (e.g., meta-llama/Llama-3.2-1B)
- The type of hardware (e.g., number of GPUs and their specifications)
- Whether you correctly built lm-evaluation-harness from main
Additionally, the command could look like this, following the example in the lm-evaluation-harness README:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.2-1B \
--tasks leaderboard \
--device cuda:0 \
--batch_size 8
Looking forward to your response!
Hi @alozowski
Thanks for your comment! The issue was happening because the docs were outdated and pointed to the adding_all_changes branch instead of main. Things seem to be working fine now that the docs have been updated and I am using main!
Great! Feel free to open a new discussion in case of any questions!
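For anyone who lands here with the same TypeError: the traceback shows huggingface.py passing reset_batch_fn to a Collator.get_batched that does not accept it, which fits a checkout from an outdated branch. Below is a minimal sketch of a check you could run before starting a long evaluation. The helper names accepts_kwarg and check_harness are made up for illustration, and the import path lm_eval.models.utils is an assumption about where Collator lives in your installed version.

```python
import inspect


def accepts_kwarg(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword as well.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def check_harness():
    """Check the installed harness; returns None if Collator cannot be imported.

    Assumption: Collator is importable from lm_eval.models.utils.
    """
    try:
        from lm_eval.models.utils import Collator
    except ImportError:
        return None
    return accepts_kwarg(Collator.get_batched, "reset_batch_fn")


if __name__ == "__main__":
    result = check_harness()
    if result is None:
        print("Could not import Collator; is lm-evaluation-harness installed?")
    elif result:
        print("Collator.get_batched accepts reset_batch_fn.")
    else:
        print("Collator.get_batched does not accept reset_batch_fn; "
              "the checkout may be from an outdated branch.")
```

If the check reports a mismatch, reinstalling the harness from main, as described above, is the fix that worked in this thread.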