Model evaluation failure: request for additional details

#373
by siddartha-abacus - opened

Hi,

I am looking for some additional info on why this model failed:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/abacusai/Giraffe-beta-13b-32k_eval_request_False_float16_Original.json

I was able to locally run:

python main.py --model=hf-causal --model_args="pretrained=abacusai/Giraffe-beta-13b-32k,revision=main" --tasks=arc_challenge --num_fewshot=5 --batch_size=1 --output_path=/tmp/harness-test

So I don't think there was an issue with loading the model. However, I was unable to use the use_accelerate=True flag. Adding the flag did not cause a model-loading issue, but the run failed with this error (I do believe I have the right branch):

Traceback (most recent call last):
  File "/root/lm-evaluation-harness/main.py", line 93, in <module>
    main()
  File "/root/lm-evaluation-harness/main.py", line 59, in main
    results = evaluator.simple_evaluate(
  File "/root/lm-evaluation-harness/lm_eval/utils.py", line 243, in _wrapper
    return fn(*args, **kwargs)
  File "/root/lm-evaluation-harness/lm_eval/evaluator.py", line 72, in simple_evaluate
    lm = lm_eval.models.get_model(model).create_from_arg_string(
  File "/root/lm-evaluation-harness/lm_eval/base.py", line 115, in create_from_arg_string
    return cls(**args, **args2)
TypeError: HFLM.__init__() got an unexpected keyword argument 'use_accelerate'

If you could give me any additional info, I would like to fix up the model config and try submitting again. The model itself is based on Llama2, so there should not be anything unusual about the code. We would like to follow up with a new version that has customizations to support injected soft-prompt tokens, but before we figure that out it seems we should first get a simpler instruct-tuned model through the harness.

If I switch to --model hf-causal-experimental then I am able to pass the use_accelerate=True flag.
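For reference, the combination that worked for me locally is essentially the command above with the model type and flag swapped in (a sketch of my local invocation, not necessarily the exact command the leaderboard runs):

python main.py --model=hf-causal-experimental --model_args="pretrained=abacusai/Giraffe-beta-13b-32k,revision=main,use_accelerate=True" --tasks=arc_challenge --num_fewshot=5 --batch_size=1 --output_path=/tmp/harness-test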

Open LLM Leaderboard org

Hi!
We have a new system to run our evaluations on the HF cluster, where the leaderboard evaluations get cancelled automatically if a higher priority job needs resources.
The jobs get relaunched automatically in the end, but they are displayed as failed in the meantime. I just checked: your run was cancelled but not relaunched yet; it will be relaunched automatically when we have enough compute available :)

Phew! Thank you. By the way, are the instructions wrong regarding the model type (hf-causal-experimental + use_accelerate)? Perhaps worth updating?

siddartha-abacus changed discussion status to closed
siddartha-abacus changed discussion status to open

This model is marked as failed again:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/abacusai/Giraffe-beta-13b-32k_eval_request_False_float16_Original.json

I was able to run the eval command line on our machines and generate a valid result for arc_challenge. Any additional information on the failure on your cluster?

Open LLM Leaderboard org

Hi!
It's a node failure, I'll restart it!

clefourrier changed discussion status to closed
siddartha-abacus changed discussion status to open

Unfortunately, this seems to have failed again. I will try re-running the full suite locally to confirm, but any info you have on the failure would be very helpful. Other than the context-length change this should be very similar to any other Llama2-based model, so I am surprised by the failure. Should I try resubmitting the model with a shorter max context length?

FYI: we trained on H100s, but I verified execution on an A100 80G machine. Should I test with a 40G memory limit instead?

Open LLM Leaderboard org

Hi, we had an issue while downloading the model. The problem appears to be on our end; your model is back in the queue. If you notice it fails again, don't hesitate to warn us! :)

Open LLM Leaderboard org

Closing, feel free to reopen if needed :)

clefourrier changed discussion status to closed

Seems pretty jinxed :)

I think it is marked as failed again. In the background we have gone ahead and run the whole suite on our machines, and it seems to work fine. We are running on 80G GPUs; not sure if that is the issue. Let me know if you think we should update the model config to indicate a shorter context. The model has been tested to about ~32K context.
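If shortening the advertised context does turn out to be useful, a minimal sketch of what that could look like, assuming the standard Llama2 config field max_position_embeddings (the 4096 value is an illustrative placeholder, not a recommendation):

from transformers import AutoConfig

# Load the current config and override the advertised context length.
cfg = AutoConfig.from_pretrained("abacusai/Giraffe-beta-13b-32k")
cfg.max_position_embeddings = 4096  # hypothetical shorter context, for illustration only
cfg.save_pretrained("./giraffe-short-context")  # would then be re-uploaded alongside the weights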

siddartha-abacus changed discussion status to open
Open LLM Leaderboard org
edited Nov 30, 2023

It's quite weird, we got a download issue again. I'll try it a third time, but if it's still not working we'll have to assume there is an issue with the download of your model.

Logs:

Traceback (most recent call last):
  File "...lib/python3.10/site-packages/requests/models.py", line 822, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='cdn-lfs-us-1.huggingface.co', port=443): Read timed out.
Downloading (…)of-00003.safetensors:  79%|███████▊  | 7.81G/9.95G [00:50<00:13, 156MB/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492624 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492625 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492626 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492627 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492628 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492629 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2492630 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2492623) of binary: python

We are also running experiments on 80G GPUs. I don't think it's a context problem, since the model is not even launched ^^

Thanks for the logs. We have successfully re-downloaded from HF for testing. I wonder if we somehow ended up on a flaky LFS node. If this continues to be an issue, perhaps we will try re-uploading the files in the hope they end up on a new node?
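One straightforward way to reproduce the download end-to-end outside the harness is a plain snapshot pull with huggingface_hub (a sketch; this is not necessarily how the leaderboard cluster fetches the weights):

from huggingface_hub import snapshot_download

# Pull the full repo, including the sharded safetensors files, into the local cache.
# If this completes without a read timeout, the LFS side is probably healthy.
local_path = snapshot_download(
    repo_id="abacusai/Giraffe-beta-13b-32k",
    revision="main",
)
print(f"Model files downloaded to {local_path}")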

Open LLM Leaderboard org

Hi @siddartha-abacus , can I close this issue? Has your model been properly evaluated?

Yes, sorry, I did not realize it was not closed.

siddartha-abacus changed discussion status to closed
Open LLM Leaderboard org

No problem, thank you for closing :)
