Update README.md
README.md
CHANGED
@@ -759,15 +759,18 @@ Thus far, all completed in fp32 (using nvidia tf32 dtype behind the scenes)
 
 - **Performance Variation**: There's notable variation in model performance across different GLUE tasks. This variation can be attributed to the distinct nature of each task, the complexity of the datasets, and how well the model's architecture and hyperparameters are suited to each task.
 - **Hyperparameters Impact**: Different weight decay settings and batch sizes seem to have nuanced impacts on performance across tasks, indicating the importance of hyperparameter tuning for optimal results.
-- **Technology Features**: The use of `tf32` and `torch_compile` in certain tasks (e.g., SST2, MRPC, CoLA) suggests exploring these features might bring performance benefits, though their impact is mixed and may depend on the specific nature of the task and the model architecture.
 - **Batch Size and Gradient Accumulation Steps**: These hyperparameters vary across tasks, reflecting a balance between computational efficiency and model performance. Larger batch sizes and gradient accumulation steps can help stabilize training but may require adjustments based on the available hardware and the specific task.
 - **Task-specific Challenges**: Tasks like WNLI and RTE have notably lower accuracy scores compared to others, highlighting the challenges inherent in some NLP tasks, possibly due to dataset size, complexity, or the nuances of the task itself.
+- TODO: add more detailed comparison & analysis, but the standard-ctx [google/bert_uncased_L-8_H-768_A-12](https://hf.co/google/bert_uncased_L-8_H-768_A-12) seems to have the same problem. Initial read is that this results from the smaller model size vs. `bert-base-uncased`
 - **Overall Performance**: The model shows strong performance on tasks with numerical scores (e.g., STSB), high accuracy in classification tasks like QQP, SST2, and MNLI, but struggles with more nuanced or smaller datasets like WNLI and RTE, underscoring the importance of tailored approaches for different types of NLP challenges.
 
 ---
 
 ## Training procedure
 
+The below is auto-generated and just applies to the 'finishing touches' run on `goodwiki`.
+
+
 ### Training hyperparameters
 
 The following hyperparameters were used during training: