rwmasood committed (verified)
Commit feede7d
1 Parent(s): e02397e

Update README.md

Files changed (1)
  1. README.md +50 -13
README.md CHANGED
@@ -40,7 +40,7 @@ In this work, we take a step toward realizing such an approach. Specifically, we
  ```python
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
- model_id="empirischtech/Llama-3.1-10b-instruct"
+ model_id="empirischtech/Llama-3.1-10B-Instruct"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
@@ -64,23 +64,57 @@ output_text = tokenizer.decode(output[0], skip_special_tokens=True)

  ## Evaluation Results

- ### Overview
+ The following two evaluations are performed.
+
+ ### Perplexity as an Evaluation Metric
+
+ Perplexity (PPL) is a metric used to evaluate the performance of language models. It measures how well a probability distribution or a language model predicts a sample. A **lower perplexity** score indicates better performance (i.e., the model is more confident in its predictions).
+
+ ```python
+ from evaluate import load
+ import datasets
+
+ perplexity = load("perplexity", module_type="metric")
+ input_texts = datasets.load_dataset("wikitext",
+                                     "wikitext-2-raw-v1",
+                                     split="test")["text"]
+
+ input_texts = [s for s in input_texts if s != '']
+
+ model_path = 'empirischtech/Llama-3.1-10B-Instruct'
+ results = perplexity.compute(model_id=model_path,
+                              add_start_token=False,
+                              predictions=input_texts)
+
+ print(round(results["mean_perplexity"], 2))
+ ```
+
+ #### Main Results
+
+ | Model | Perplexity Score |
+ |---------------------------------------------|----------|
+ | **Llama-3.1-8B-Instruct** | 842611366.59 |
+ | **Llama-3.1-10B-Instruct** | 2890.31 |
+
+ ### Harness Evaluation
+
  - The performance evaluation is based on the tasks used by the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
  The model is evaluated on three benchmark datasets: `ARC-Challenge`, `HellaSwag` and `MMLU`.
  The library used is the [lm-evaluation-harness repository](https://github.com/EleutherAI/lm-evaluation-harness).


- ### Main Results
- | Model | ARC | HellaSwag | MMLU | TruthfulQA | | MT_Bench |
- |--------------------------------------------------------------------|----------|----------|----------|------|----------|-|-------------|
- | **[Llama-2-70b-instruct-v2](https://huggingface.co/upstage/Llama-2-70b-instruct-v2)**(Ours, Open LLM Leaderboard) | **73** | **71.1** | **87.9** | **70.6** | **62.2** | | **7.44063** |
- | [Llama-2-70b-instruct](https://huggingface.co/upstage/Llama-2-70b-instruct) (Ours, Open LLM Leaderboard) | 72.3 | 70.9 | 87.5 | 69.8 | 61 | | 7.24375 |
- | [llama-65b-instruct](https://huggingface.co/upstage/llama-65b-instruct) (***Ours***, ***Open LLM Leaderboard***) | 69.4 | 67.6 | 86.5 | 64.9 | 58.8 | | |
- | Llama-2-70b-hf | 67.3 | 67.3 | 87.3 | 69.8 | 44.9 | | |
- | [llama-30b-instruct-2048](https://huggingface.co/upstage/llama-30b-instruct-2048) (Ours, Open LLM Leaderboard) | 67.0 | 64.9 | 84.9 | 61.9 | 56.3 | | |
- | [llama-30b-instruct](https://huggingface.co/upstage/llama-30b-instruct) (Ours, Open LLM Leaderboard) | 65.2 | 62.5 | 86.2 | 59.4 | 52.8 | | |
- | llama-65b | 64.2 | 63.5 | 86.1 | 63.9 | 43.4 | | |
- | falcon-40b-instruct | 63.4 | 61.6 | 84.3 | 55.4 | 52.5 | | |
+ #### Main Results
+ | Model | ARC | HellaSwag | MMLU |
+ |------------------------|----------|--------|------|
+ | **Llama-3.1-8B-Instruct** | **73** | **71.1** | **87.9** |


  ### Scripts to generate evaluation results
@@ -94,6 +128,9 @@ from lm_eval import evaluator
  tasks_list = ["arc_challenge", "gpqa", "ifeval", "mmlu_pro", "hellaswag"] # Benchmark dataset

  model_path='rwmasood/llama-3.1-10b-instruct'
+ model_name_or_path = "./output/checkpoint-2800"
+
+ ```

  # Run evaluation
  results = evaluator.simple_evaluate(
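
The last hunk cuts off at the `evaluator.simple_evaluate(` call, so the arguments behind the reported harness scores are not visible in this diff. The snippet below is a minimal sketch of how such a call is typically completed with the lm-evaluation-harness Python API (assuming version 0.4 or later); the backend, task list, dtype, and batch size are illustrative assumptions, not values taken from the README.

```python
# Minimal sketch of a complete lm-evaluation-harness call (assumes lm-eval >= 0.4).
# The task list, dtype, and batch size here are illustrative, not the README's values.
from lm_eval import evaluator

tasks_list = ["arc_challenge", "hellaswag", "mmlu"]  # benchmark tasks to run

results = evaluator.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=empirischtech/Llama-3.1-10B-Instruct,dtype=bfloat16",
    tasks=tasks_list,
    batch_size=8,
)

# Per-task metrics (e.g. acc, acc_norm) are stored under results["results"]
for task, metrics in results["results"].items():
    print(task, metrics)
```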