Pengcheng He committed on
Commit a642358
1 parent: ded16d8

Update deepspeed config

Files changed (2)
  1. README.md +35 -3
  2. ds_config.json +23 -0
README.md CHANGED
@@ -31,16 +31,48 @@ We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.
--------
#### Notes.
- <sup>1</sup> Following RoBERTa, for RTE, MRPC, STS-B, we fine-tune the tasks starting from [DeBERTa-Large-MNLI](https://huggingface.co/microsoft/deberta-large-mnli), [DeBERTa-XLarge-MNLI](https://huggingface.co/microsoft/deberta-xlarge-mnli), [DeBERTa-V2-XLarge-MNLI](https://huggingface.co/microsoft/deberta-v2-xlarge-mnli), and [DeBERTa-V2-XXLarge-MNLI](https://huggingface.co/microsoft/deberta-v2-xxlarge-mnli). The results on SST-2/QQP/QNLI/SQuADv2 also improve slightly when starting from MNLI fine-tuned models; however, we only report the numbers fine-tuned from the pretrained base models for those four tasks.
- - <sup>2</sup> To try the **XXLarge** model with **[HF transformers](https://huggingface.co/transformers/main_classes/trainer.html)**, you need to specify **--sharded_ddp**
+ - <sup>2</sup> To try the **XXLarge** model with **[HF transformers](https://huggingface.co/transformers/main_classes/trainer.html)**, we recommend using **deepspeed**, as it is faster and saves memory.

+ Run with `DeepSpeed`:
+
+ ```bash
+ pip install datasets
+ pip install deepspeed
+
+ # Download the DeepSpeed config file
+ wget https://huggingface.co/microsoft/deberta-v2-xxlarge/resolve/main/ds_config.json -O ds_config.json
+
+ export TASK_NAME=mnli
+ output_dir="ds_results"
+ num_gpus=8
+ batch_size=8
+ python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
+   run_glue.py \
+   --model_name_or_path microsoft/deberta-v2-xxlarge \
+   --task_name $TASK_NAME \
+   --do_train \
+   --do_eval \
+   --max_seq_length 256 \
+   --per_device_train_batch_size ${batch_size} \
+   --learning_rate 3e-6 \
+   --num_train_epochs 3 \
+   --output_dir $output_dir \
+   --overwrite_output_dir \
+   --logging_steps 10 \
+   --logging_dir $output_dir \
+   --deepspeed ds_config.json
+ ```
+
+ You can also run with `--sharded_ddp`:
```bash
cd transformers/examples/text-classification/
- export TASK_NAME=mrpc
+ export TASK_NAME=mnli
python -m torch.distributed.launch --nproc_per_node=8 run_glue.py --model_name_or_path microsoft/deberta-v2-xxlarge \
- --task_name $TASK_NAME --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 4 \
+ --task_name $TASK_NAME --do_train --do_eval --max_seq_length 256 --per_device_train_batch_size 8 \
--learning_rate 3e-6 --num_train_epochs 3 --output_dir /tmp/$TASK_NAME/ --overwrite_output_dir --sharded_ddp --fp16
```

+
### Citation

If you find DeBERTa useful for your work, please cite the following paper:
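
DeepSpeed also provides its own launcher, which can stand in for `torch.distributed.launch`. The sketch below is an untested equivalent of the command added in the diff above, assuming `run_glue.py` and the downloaded `ds_config.json` sit in the current directory; all flag values mirror the diff rather than an official recipe:

```bash
# Sketch: equivalent run using the DeepSpeed launcher instead of
# torch.distributed.launch; assumes run_glue.py and ds_config.json
# are in the current directory.
export TASK_NAME=mnli
deepspeed --num_gpus=8 run_glue.py \
  --model_name_or_path microsoft/deberta-v2-xxlarge \
  --task_name $TASK_NAME \
  --do_train --do_eval \
  --max_seq_length 256 \
  --per_device_train_batch_size 8 \
  --learning_rate 3e-6 \
  --num_train_epochs 3 \
  --output_dir ds_results \
  --overwrite_output_dir \
  --deepspeed ds_config.json
```

With 8 GPUs and a per-device train batch size of 8, the effective batch size is 8 × 8 = 64; the config below sets no gradient accumulation, so no further multiplier applies.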
ds_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "fp16": {
+
+     "enabled": true,
+
+     "initial_scale_power": 12
+
+   },
+   "zero_optimization": {
+
+     "stage": 2,
+
+     "reduce_bucket_size": 5e7,
+
+     "allgather_bucket_size": 1.25e9,
+
+     "overlap_comm": true,
+
+     "contiguous_gradients": true
+
+   },
+   "zero_allow_untested_optimizer": true
+ }
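
In this config, `"initial_scale_power": 12` starts fp16 dynamic loss scaling at 2^12 = 4096; ZeRO stage 2 partitions optimizer states and gradients across GPUs while keeping the weights replicated; the two bucket sizes cap the message sizes DeepSpeed uses for gradient reduction and for allgathering updated weights; and `"zero_allow_untested_optimizer": true` lets ZeRO wrap an optimizer outside DeepSpeed's tested list, such as the AdamW instance the HF Trainer builds. A quick sanity check of the downloaded file, using only the Python standard library:

```bash
# Validate the downloaded config; json.tool exits with a parse
# error if the JSON is malformed.
python -m json.tool ds_config.json

# Confirm the ZeRO stage (expects: 2)
python -c "import json; print(json.load(open('ds_config.json'))['zero_optimization']['stage'])"
```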