Filip committed on
Commit cac0b2a · 1 Parent(s): b9012e9

small update

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -37,7 +37,7 @@ Both models used the same hyperparameters during training.\
  `per_device_train_batch_size=2`:\
  `gradient_accumulation_steps=4`: The number of steps to accumulate gradients before performing a backpropagation update. Higher accumulates gradients over multiple steps, increasing the batch size without requiring additional memory. Can improve training stability and convergence if you have a large model and limited hardware.\
  `learning_rate=2e-4`: Rate at which the model updates its parameters during training. Higher gives faster convergence but risks overshooting optimal parameters and instability. Lower requires more training steps but better performance.\
- `optim="adamw_8bit"`\: Using the Adam optimizer, a gradient descent method with momentum.
+ `optim="adamw_8bit"`\: Using the Adam optimizer, a gradient descent method with momentum.\
  `weight_decay=0.01`: Penalty to add to the weights during training to prevent overfitting. The value is proportional to the magnitude of the weights to the loss function.\
  `lr_scheduler_type="linear"`: We decrease the learning rate linearly.
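For context, a minimal sketch of how the hyperparameters described in this hunk could be assembled in code, assuming the training uses Hugging Face `transformers` (e.g. `TrainingArguments` passed to a `Trainer` or TRL `SFTTrainer`). The `output_dir` value and the surrounding setup are illustrative assumptions, not part of this commit; only the listed hyperparameter values come from the README.

```python
# Sketch only: wiring the README's hyperparameters into TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",            # hypothetical path, not from the README
    per_device_train_batch_size=2,   # samples per device per step
    gradient_accumulation_steps=4,   # effective batch size per device = 2 * 4 = 8
    learning_rate=2e-4,
    optim="adamw_8bit",              # 8-bit AdamW; requires bitsandbytes
    weight_decay=0.01,
    lr_scheduler_type="linear",      # learning rate decays linearly over training
)
```

The `adamw_8bit` option keeps the AdamW optimizer states in 8-bit precision (via bitsandbytes), which reduces optimizer memory compared with standard 32-bit AdamW at a small cost in numerical precision.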