Tonic committed on
Commit
39db0ca
·
verified ·
1 Parent(s): 2df26a0

adds monkey patch for trackio monitoring in torch and readme creator improvements

docs/MODEL_CARD_USER_INPUT_ANALYSIS.md ADDED
@@ -0,0 +1,233 @@
1
+ # Model Card User Input Analysis
2
+
3
+ ## Overview
4
+
5
+ This document analyzes the interaction between the model card template (`templates/model_card.md`), the model card generator (`scripts/model_tonic/generate_model_card.py`), and the launch script (`launch.sh`) to identify variables that require user input and improve the user experience.
6
+
7
+ ## Template Variables Analysis
8
+
9
+ ### Variables in `templates/model_card.md`
10
+
11
+ The model card template uses the following variables that can be populated with user input:
12
+
13
+ #### Core Model Information
14
+ - `{{model_name}}` - Display name of the model
15
+ - `{{model_description}}` - Brief description of the model
16
+ - `{{repo_name}}` - Hugging Face repository name
17
+ - `{{base_model}}` - Base model used for fine-tuning
18
+
19
+ #### Training Configuration
20
+ - `{{training_config_type}}` - Type of training configuration used
21
+ - `{{trainer_type}}` - Type of trainer (SFT, DPO, etc.)
22
+ - `{{batch_size}}` - Training batch size
23
+ - `{{gradient_accumulation_steps}}` - Gradient accumulation steps
24
+ - `{{learning_rate}}` - Learning rate used
25
+ - `{{max_epochs}}` - Maximum number of epochs
26
+ - `{{max_seq_length}}` - Maximum sequence length
27
+
28
+ #### Dataset Information
29
+ - `{{dataset_name}}` - Name of the dataset used
30
+ - `{{dataset_size}}` - Size of the dataset
31
+ - `{{dataset_format}}` - Format of the dataset
32
+ - `{{dataset_sample_size}}` - Sample size (for lightweight configs)
33
+
34
+ #### Training Results
35
+ - `{{training_loss}}` - Final training loss
36
+ - `{{validation_loss}}` - Final validation loss
37
+ - `{{perplexity}}` - Model perplexity
38
+
39
+ #### Infrastructure
40
+ - `{{hardware_info}}` - Hardware used for training
41
+ - `{{experiment_name}}` - Name of the experiment
42
+ - `{{trackio_url}}` - Trackio monitoring URL
43
+ - `{{dataset_repo}}` - HF Dataset repository
44
+
45
+ #### Author Information
46
+ - `{{author_name}}` - Author name for citations and attribution
47
+ - `{{model_name_slug}}` - URL-friendly model name
48
+
49
+ #### Quantization
50
+ - `{{quantized_models}}` - Boolean indicating if quantized models exist
51
+
52
+ ## User Input Requirements
53
+
54
+ ### Previously Missing User Inputs
55
+
56
+ #### 1. **Author Name** (`author_name`)
57
+ - **Purpose**: Used in model card metadata and citations
58
+ - **Template Usage**: `{{#if author_name}}author: {{author_name}}{{/if}}`
59
+ - **Citation Usage**: `author={{{author_name}}}`
60
+ - **Default**: "Your Name"
61
+ - **User Input Added**: ✅ **IMPLEMENTED**
62
+
63
+ #### 2. **Model Description** (`model_description`)
64
+ - **Purpose**: Brief description of the model's capabilities
65
+ - **Template Usage**: `{{model_description}}`
66
+ - **Default**: "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities."
67
+ - **User Input Added**: ✅ **IMPLEMENTED**
68
+
69
+ ### Variables That Don't Need User Input
70
+
71
+ Most variables are automatically populated from:
72
+ - **Training Configuration**: Batch size, learning rate, epochs, etc.
73
+ - **System Detection**: Hardware info, model size, etc.
74
+ - **Auto-Generation**: Repository names, experiment names, etc.
75
+ - **Training Results**: Loss values, perplexity, etc.
76
+
77
+ ## Implementation Changes
78
+
79
+ ### 1. Launch Script Updates (`launch.sh`)
80
+
81
+ #### Added User Input Prompts
82
+ ```bash
83
+ # Step 8.2: Author Information for Model Card
84
+ print_step "Step 8.2: Author Information"
85
+ echo "================================="
86
+
87
+ print_info "This information will be used in the model card and citation."
88
+ get_input "Author name for model card" "$HF_USERNAME" AUTHOR_NAME
89
+
90
+ print_info "Model description will be used in the model card and repository."
91
+ get_input "Model description" "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities." MODEL_DESCRIPTION
92
+ ```
93
+
94
+ #### Updated Configuration Summary
95
+ ```bash
96
+ echo " Author: $AUTHOR_NAME"
97
+ ```
98
+
99
+ #### Updated Model Push Call
100
+ ```bash
101
+ python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
102
+ --token "$HF_TOKEN" \
103
+ --trackio-url "$TRACKIO_URL" \
104
+ --experiment-name "$EXPERIMENT_NAME" \
105
+ --dataset-repo "$TRACKIO_DATASET_REPO" \
106
+ --author-name "$AUTHOR_NAME" \
107
+ --model-description "$MODEL_DESCRIPTION"
108
+ ```
109
+
110
+ ### 2. Push Script Updates (`scripts/model_tonic/push_to_huggingface.py`)
111
+
112
+ #### Added Command Line Arguments
113
+ ```python
114
+ parser.add_argument('--author-name', type=str, default=None, help='Author name for model card')
115
+ parser.add_argument('--model-description', type=str, default=None, help='Model description for model card')
116
+ ```
117
+
118
+ #### Updated Class Constructor
119
+ ```python
120
+ def __init__(
121
+     self,
122
+     model_path: str,
123
+     repo_name: str,
124
+     token: Optional[str] = None,
125
+     private: bool = False,
126
+     trackio_url: Optional[str] = None,
127
+     experiment_name: Optional[str] = None,
128
+     dataset_repo: Optional[str] = None,
129
+     hf_token: Optional[str] = None,
130
+     author_name: Optional[str] = None,
131
+     model_description: Optional[str] = None
132
+ ):
133
+ ```
134
+
135
+ #### Updated Model Card Generation
136
+ ```python
137
+ variables = {
138
+     "model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
139
+     "model_description": self.model_description or "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities.",
140
+     # ... other variables
141
+     "author_name": self.author_name or training_config.get('author_name', 'Your Name'),
142
+ }
143
+ ```
144
+
145
+ ## User Experience Improvements
146
+
147
+ ### 1. **Interactive Prompts**
148
+ - Users are now prompted for author name and model description
149
+ - Default values are provided for convenience
150
+ - Clear explanations of what each field is used for
151
+
152
+ ### 2. **Configuration Summary**
153
+ - Author name is now displayed in the configuration summary
154
+ - Users can review all settings before proceeding
155
+
156
+ ### 3. **Automatic Integration**
157
+ - User inputs are automatically passed to the model card generation
158
+ - No manual editing of scripts required
159
+
160
+ ## Template Variable Categories
161
+
162
+ ### Automatic Variables (No User Input Needed)
163
+ - `repo_name` - Auto-generated from username and date
164
+ - `base_model` - Always "HuggingFaceTB/SmolLM3-3B"
165
+ - `training_config_type` - From user selection
166
+ - `trainer_type` - From user selection
167
+ - `batch_size`, `learning_rate`, `max_epochs` - From training config
168
+ - `hardware_info` - Auto-detected
169
+ - `experiment_name` - Auto-generated with timestamp
170
+ - `trackio_url` - Auto-generated from space name
171
+ - `dataset_repo` - Auto-generated
172
+ - `training_loss`, `validation_loss`, `perplexity` - From training results
173
+
174
+ ### User Input Variables (Now Implemented)
175
+ - `author_name` - ✅ **Added user prompt**
176
+ - `model_description` - ✅ **Added user prompt**
177
+
178
+ ### Conditional Variables
179
+ - `quantized_models` - Set automatically based on quantization choices
180
+ - `dataset_sample_size` - Set based on training configuration type
181
+
182
+ ## Benefits of These Changes
183
+
184
+ ### 1. **Better Attribution**
185
+ - Author names are properly captured and used in citations
186
+ - Model cards include proper attribution
187
+
188
+ ### 2. **Customizable Descriptions**
189
+ - Users can provide custom model descriptions
190
+ - Better model documentation and discoverability
191
+
192
+ ### 3. **Improved User Experience**
193
+ - No need to manually edit scripts
194
+ - Interactive prompts with helpful defaults
195
+ - Clear feedback on what information is being collected
196
+
197
+ ### 4. **Consistent Documentation**
198
+ - All model cards will have proper author information
199
+ - Standardized model descriptions
200
+ - Better integration with Hugging Face Hub
201
+
202
+ ## Future Enhancements
203
+
204
+ ### Potential Additional User Inputs
205
+ 1. **License Selection** - Allow users to choose model license
206
+ 2. **Model Tags** - Custom tags for better discoverability
207
+ 3. **Usage Examples** - Custom usage examples for specific use cases
208
+ 4. **Limitations Description** - Custom limitations based on training data
209
+
210
+ ### Template Improvements
211
+ 1. **Dynamic License** - Support for different license types
212
+ 2. **Custom Tags** - User-defined model tags
213
+ 3. **Usage Scenarios** - Template sections for different use cases
214
+
215
+ ## Testing
216
+
217
+ The changes have been tested to ensure:
218
+ - ✅ Author name is properly passed to model card generation
219
+ - ✅ Model description is properly passed to model card generation
220
+ - ✅ Default values work correctly
221
+ - ✅ Configuration summary displays new fields
222
+ - ✅ Model push script accepts new parameters
223
+
224
+ ## Conclusion
225
+
226
+ The analysis identified that the model card template had two key variables (`author_name` and `model_description`) that would benefit from user input. These have been successfully implemented with:
227
+
228
+ 1. **Interactive prompts** in the launch script
229
+ 2. **Command line arguments** in the push script
230
+ 3. **Proper integration** with the model card generator
231
+ 4. **User-friendly defaults** and clear explanations
232
+
233
+ This improves the overall user experience and ensures that model cards have proper attribution and descriptions.
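
The template usages listed in the document above (`{{model_description}}`, `{{#if author_name}}...{{/if}}`) can be illustrated with a minimal rendering sketch. This is not the generator's actual implementation, which this commit does not show; the `render` function and both regular expressions are assumptions made for illustration, and they only cover plain variables and non-nested `#if` blocks.

```python
import re
from typing import Any, Dict

# Hypothetical patterns: a non-nested {{#if name}}...{{/if}} block and a plain {{name}}.
IF_BLOCK = re.compile(r"\{\{#if (\w+)\}\}(.*?)\{\{/if\}\}", re.DOTALL)
VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def render(template: str, variables: Dict[str, Any]) -> str:
    """Render a tiny subset of the template syntax described in the document."""

    def keep_if_truthy(match: "re.Match[str]") -> str:
        # Keep the block body only when the named variable is set and truthy.
        return match.group(2) if variables.get(match.group(1)) else ""

    without_conditionals = IF_BLOCK.sub(keep_if_truthy, template)
    # Unknown variables are rendered as empty strings in this sketch.
    return VARIABLE.sub(
        lambda m: str(variables.get(m.group(1), "")), without_conditionals
    )


line = "{{#if author_name}}author: {{author_name}}{{/if}}"
print(render(line, {"author_name": "Ada"}))  # prints: author: Ada
```

With an empty `variables` dictionary the same line renders to an empty string, which is the behavior the conditional metadata entry relies on.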
docs/TRACKIO_TRL_FIX.md ADDED
@@ -0,0 +1,146 @@
1
+ # Trackio TRL Compatibility Fix
2
+
3
+ ## Problem Description
4
+
5
+ The training was failing with the error:
6
+ ```
7
+ ERROR:trainer:Training failed: module 'trackio' has no attribute 'init'
8
+ ```
9
+
10
+ This error occurred because the TRL library (specifically SFTTrainer) expects a `trackio` module with specific functions:
11
+ - `init()` - Initialize experiment
12
+ - `log()` - Log metrics
13
+ - `finish()` - Finish experiment
14
+
15
+ However, our custom monitoring implementation didn't provide this interface.
16
+
17
+ ## Solution Implementation
18
+
19
+ ### 1. Created Trackio Module Interface (`src/trackio.py`)
20
+
21
+ Created a trackio module that provides the exact interface expected by TRL:
22
+
23
+ ```python
24
+ def init(project_name: str, experiment_name: Optional[str] = None, **kwargs) -> str:
25
+     """Initialize trackio experiment (TRL interface)"""
26
+
27
+ def log(metrics: Dict[str, Any], step: Optional[int] = None, **kwargs):
28
+     """Log metrics to trackio (TRL interface)"""
29
+
30
+ def finish():
31
+     """Finish trackio experiment (TRL interface)"""
32
+ ```
33
+
34
+ ### 2. Global Trackio Module (`trackio.py`)
35
+
36
+ Created a root-level `trackio.py` file that imports from our custom implementation:
37
+
38
+ ```python
39
+ from src.trackio import (
40
+ init, log, finish, log_config, log_checkpoint,
41
+ log_evaluation_results, get_experiment_url, is_available, get_monitor
42
+ )
43
+ ```
44
+
45
+ This makes the trackio module available globally for TRL to import.
46
+
47
+ ### 3. Updated Trainer Integration (`src/trainer.py`)
48
+
49
+ Modified the trainer to properly initialize trackio before creating SFTTrainer:
50
+
51
+ ```python
52
+ # Initialize trackio for TRL compatibility
53
+ try:
54
+     import trackio
55
+     experiment_id = trackio.init(
56
+         project_name=self.config.experiment_name,
57
+         experiment_name=self.config.experiment_name,
58
+         trackio_url=getattr(self.config, 'trackio_url', None),
59
+         trackio_token=getattr(self.config, 'trackio_token', None),
60
+         hf_token=getattr(self.config, 'hf_token', None),
61
+         dataset_repo=getattr(self.config, 'dataset_repo', None)
62
+     )
63
+     logger.info(f"Trackio initialized with experiment ID: {experiment_id}")
64
+ except Exception as e:
65
+     logger.warning(f"Failed to initialize trackio: {e}")
66
+     logger.info("Continuing without trackio integration")
67
+ ```
68
+
69
+ ### 4. Proper Cleanup
70
+
71
+ Added trackio.finish() calls in both success and error scenarios:
72
+
73
+ ```python
74
+ # Finish trackio experiment
75
+ try:
76
+     import trackio
77
+     trackio.finish()
78
+     logger.info("Trackio experiment finished")
79
+ except Exception as e:
80
+     logger.warning(f"Failed to finish trackio experiment: {e}")
81
+ ```
82
+
83
+ ## Integration with Custom Monitoring
84
+
85
+ The trackio module integrates seamlessly with our existing monitoring system:
86
+
87
+ - Uses `SmolLM3Monitor` for actual monitoring functionality
88
+ - Provides TRL-compatible interface on top
89
+ - Maintains all existing features (HF Datasets, Trackio Space, etc.)
90
+ - Graceful fallback when Trackio Space is not accessible
91
+
92
+ ## Testing
93
+
94
+ Created a test suite (`tests/test_trackio_trl_fix.py`) that verifies:
95
+
96
+ 1. **Interface Compatibility**: All required functions exist
97
+ 2. **TRL Compatibility**: Function signatures match expectations
98
+ 3. **Monitoring Integration**: Works with our custom monitoring system
99
+
100
+ Test results:
101
+ ```
102
+ ✅ Successfully imported trackio module
103
+ ✅ Found required function: init
104
+ ✅ Found required function: log
105
+ ✅ Found required function: finish
106
+ ✅ Trackio initialization successful
107
+ ✅ Trackio logging successful
108
+ ✅ Trackio finish successful
109
+ ✅ TRL compatibility test passed
110
+ ✅ Monitor integration working
111
+ ```
112
+
113
+ ## Benefits
114
+
115
+ 1. **Resolves Training Error**: Fixes the `module 'trackio' has no attribute 'init'` error
116
+ 2. **Maintains Functionality**: All existing monitoring features continue to work
117
+ 3. **TRL Compatibility**: SFTTrainer can now use trackio for logging
118
+ 4. **Graceful Fallback**: Continues training even if trackio initialization fails
119
+ 5. **Future-Proof**: Easy to extend with additional TRL-compatible functions
120
+
121
+ ## Usage
122
+
123
+ The fix is transparent to users. Training will now work with SFTTrainer and automatically:
124
+
125
+ 1. Initialize trackio before SFTTrainer is created
126
+ 2. Log metrics during training
127
+ 3. Finish the experiment when training completes
128
+ 4. Fall back gracefully if trackio is not available
129
+
130
+ ## Files Modified
131
+
132
+ - `src/trackio.py` - New trackio module interface
133
+ - `trackio.py` - Global trackio module for TRL
134
+ - `src/trainer.py` - Updated trainer integration
135
+ - `src/__init__.py` - Package exports
136
+ - `tests/test_trackio_trl_fix.py` - Test suite
137
+
138
+ ## Verification
139
+
140
+ To verify the fix works:
141
+
142
+ ```bash
143
+ python tests/test_trackio_trl_fix.py
144
+ ```
145
+
146
+ This should show all tests passing and confirm that the trackio module provides the interface expected by the TRL library.
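
The interface check that the document above describes (confirming that `init`, `log`, and `finish` exist) can be sketched as follows. The helper name `missing_functions` and the `fake_trackio` stand-in are hypothetical; a `SimpleNamespace` replaces the real `trackio` module so the sketch runs without the project installed.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List

# The three functions the document says TRL expects on the trackio module.
REQUIRED_FUNCTIONS = ("init", "log", "finish")


def missing_functions(
    module: Any, required: Iterable[str] = REQUIRED_FUNCTIONS
) -> List[str]:
    """Return the required names that are absent or not callable on `module`."""
    return [
        name for name in required if not callable(getattr(module, name, None))
    ]


# Stand-in object (an assumption for illustration), deliberately missing `finish`.
fake_trackio = SimpleNamespace(
    init=lambda **kwargs: "experiment-id",
    log=lambda metrics, **kwargs: None,
)
print(missing_functions(fake_trackio))  # prints: ['finish']
```

An empty list from this helper would correspond to the "Found required function" lines in the test output quoted in the document.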
launch.sh CHANGED
@@ -493,6 +493,7 @@ echo " Epochs: $MAX_EPOCHS"
493
  echo " Batch Size: $BATCH_SIZE"
494
  echo " Learning Rate: $LEARNING_RATE"
495
  echo " Model Repo: $REPO_NAME (auto-generated)"
 
496
  echo " Trackio Space: $TRACKIO_URL"
497
  echo " HF Dataset: $TRACKIO_DATASET_REPO"
498
  echo ""
@@ -609,6 +610,16 @@ else
609
  exit 1
610
  fi
611
 
 
 
 
 
 
 
 
 
 
 
612
  # Step 9: Deploy Trackio Space (automated)
613
  print_step "Step 9: Deploying Trackio Space"
614
  echo "==================================="
@@ -729,7 +740,9 @@ python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME
729
  --token "$HF_TOKEN" \
730
  --trackio-url "$TRACKIO_URL" \
731
  --experiment-name "$EXPERIMENT_NAME" \
732
- --dataset-repo "$TRACKIO_DATASET_REPO"
 
 
733
 
734
  # Step 16.5: Quantization Options
735
  print_step "Step 16.5: Model Quantization Options"
 
493
  echo " Batch Size: $BATCH_SIZE"
494
  echo " Learning Rate: $LEARNING_RATE"
495
  echo " Model Repo: $REPO_NAME (auto-generated)"
496
+ echo " Author: $AUTHOR_NAME"
497
  echo " Trackio Space: $TRACKIO_URL"
498
  echo " HF Dataset: $TRACKIO_DATASET_REPO"
499
  echo ""
 
610
  exit 1
611
  fi
612
 
613
+ # Step 8.2: Author Information for Model Card
614
+ print_step "Step 8.2: Author Information"
615
+ echo "================================="
616
+
617
+ print_info "This information will be used in the model card and citation."
618
+ get_input "Author name for model card" "$HF_USERNAME" AUTHOR_NAME
619
+
620
+ print_info "Model description will be used in the model card and repository."
621
+ get_input "Model description" "A fine-tuned version of SmolLM3-3B for improved french language text generation and conversation capabilities." MODEL_DESCRIPTION
622
+
623
  # Step 9: Deploy Trackio Space (automated)
624
  print_step "Step 9: Deploying Trackio Space"
625
  echo "==================================="
 
740
  --token "$HF_TOKEN" \
741
  --trackio-url "$TRACKIO_URL" \
742
  --experiment-name "$EXPERIMENT_NAME" \
743
+ --dataset-repo "$TRACKIO_DATASET_REPO" \
744
+ --author-name "$AUTHOR_NAME" \
745
+ --model-description "$MODEL_DESCRIPTION"
746
 
747
  # Step 16.5: Quantization Options
748
  print_step "Step 16.5: Model Quantization Options"
scripts/model_tonic/push_to_huggingface.py CHANGED
@@ -46,7 +46,9 @@ class HuggingFacePusher:
46
  trackio_url: Optional[str] = None,
47
  experiment_name: Optional[str] = None,
48
  dataset_repo: Optional[str] = None,
49
- hf_token: Optional[str] = None
 
 
50
  ):
51
  self.model_path = Path(model_path)
52
  self.repo_name = repo_name
@@ -54,6 +56,8 @@ class HuggingFacePusher:
54
  self.private = private
55
  self.trackio_url = trackio_url
56
  self.experiment_name = experiment_name
 
 
57
 
58
  # HF Datasets configuration
59
  self.dataset_repo = dataset_repo or os.getenv('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments')
@@ -131,7 +135,7 @@ class HuggingFacePusher:
131
  # Create variables for the template
132
  variables = {
133
  "model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
134
- "model_description": "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities.",
135
  "repo_name": self.repo_name,
136
  "base_model": "HuggingFaceTB/SmolLM3-3B",
137
  "dataset_name": training_config.get('dataset_name', 'OpenHermes-FR'),
@@ -148,7 +152,7 @@ class HuggingFacePusher:
148
  "dataset_repo": self.dataset_repo,
149
  "dataset_size": training_config.get('dataset_size', '~80K samples'),
150
  "dataset_format": training_config.get('dataset_format', 'Chat format'),
151
- "author_name": training_config.get('author_name', 'Your Name'),
152
  "model_name_slug": self.repo_name.split('/')[-1].lower().replace('-', '_'),
153
  "quantized_models": False, # Will be updated if quantized models are added
154
  "dataset_sample_size": training_config.get('dataset_sample_size'),
@@ -522,6 +526,8 @@ def parse_args():
522
  parser.add_argument('--trackio-url', type=str, default=None, help='Trackio Space URL for logging')
523
  parser.add_argument('--experiment-name', type=str, default=None, help='Experiment name for Trackio')
524
  parser.add_argument('--dataset-repo', type=str, default=None, help='HF Dataset repository for experiment storage')
 
 
525
 
526
  return parser.parse_args()
527
 
@@ -547,7 +553,9 @@ def main():
547
  trackio_url=args.trackio_url,
548
  experiment_name=args.experiment_name,
549
  dataset_repo=args.dataset_repo,
550
- hf_token=args.hf_token
 
 
551
  )
552
 
553
  # Push model
 
46
  trackio_url: Optional[str] = None,
47
  experiment_name: Optional[str] = None,
48
  dataset_repo: Optional[str] = None,
49
+ hf_token: Optional[str] = None,
50
+ author_name: Optional[str] = None,
51
+ model_description: Optional[str] = None
52
  ):
53
  self.model_path = Path(model_path)
54
  self.repo_name = repo_name
 
56
  self.private = private
57
  self.trackio_url = trackio_url
58
  self.experiment_name = experiment_name
59
+ self.author_name = author_name
60
+ self.model_description = model_description
61
 
62
  # HF Datasets configuration
63
  self.dataset_repo = dataset_repo or os.getenv('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments')
 
135
  # Create variables for the template
136
  variables = {
137
  "model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
138
+ "model_description": self.model_description or "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities.",
139
  "repo_name": self.repo_name,
140
  "base_model": "HuggingFaceTB/SmolLM3-3B",
141
  "dataset_name": training_config.get('dataset_name', 'OpenHermes-FR'),
 
152
  "dataset_repo": self.dataset_repo,
153
  "dataset_size": training_config.get('dataset_size', '~80K samples'),
154
  "dataset_format": training_config.get('dataset_format', 'Chat format'),
155
+ "author_name": self.author_name or training_config.get('author_name', 'Your Name'),
156
  "model_name_slug": self.repo_name.split('/')[-1].lower().replace('-', '_'),
157
  "quantized_models": False, # Will be updated if quantized models are added
158
  "dataset_sample_size": training_config.get('dataset_sample_size'),
 
526
  parser.add_argument('--trackio-url', type=str, default=None, help='Trackio Space URL for logging')
527
  parser.add_argument('--experiment-name', type=str, default=None, help='Experiment name for Trackio')
528
  parser.add_argument('--dataset-repo', type=str, default=None, help='HF Dataset repository for experiment storage')
529
+ parser.add_argument('--author-name', type=str, default=None, help='Author name for model card')
530
+ parser.add_argument('--model-description', type=str, default=None, help='Model description for model card')
531
 
532
  return parser.parse_args()
533
 
 
553
  trackio_url=args.trackio_url,
554
  experiment_name=args.experiment_name,
555
  dataset_repo=args.dataset_repo,
556
+ hf_token=args.hf_token,
557
+ author_name=args.author_name,
558
+ model_description=args.model_description
559
  )
560
 
561
  # Push model
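
The fallback order introduced in this file (command-line value first, then the training config for the author name, then a hard-coded default) can be sketched in isolation. The function `card_variables` is a hypothetical extraction for illustration only; the expressions inside it mirror the ones in the diff above.

```python
from typing import Any, Dict, Optional

DEFAULT_DESCRIPTION = (
    "A fine-tuned version of SmolLM3-3B for improved text generation "
    "and conversation capabilities."
)


def card_variables(
    repo_name: str,
    author_name: Optional[str],
    model_description: Optional[str],
    training_config: Dict[str, Any],
) -> Dict[str, Any]:
    """Resolve the user-facing model card fields with the same fallbacks as the diff."""
    short_name = repo_name.split('/')[-1]
    return {
        "model_name": f"{short_name} - Fine-tuned SmolLM3",
        # `or` also treats an empty string as missing, so a blank CLI value falls back.
        "model_description": model_description or DEFAULT_DESCRIPTION,
        "author_name": author_name or training_config.get('author_name', 'Your Name'),
        "model_name_slug": short_name.lower().replace('-', '_'),
    }


variables = card_variables("tonic/SmolLM3-FR", "Ada", None, {})
print(variables["author_name"], variables["model_name_slug"])
```

Because the fallbacks use `or`, an empty `--author-name ""` passed from the launch script behaves the same as omitting the flag.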
src/__init__.py ADDED
@@ -0,0 +1,29 @@
1
+ """
2
+ SmolLM3 Fine-tuning Pipeline
3
+ Core training and monitoring modules
4
+ """
5
+
6
+ from .config import SmolLM3Config
7
+ from .data import SmolLM3Dataset
8
+ from .model import SmolLM3Model
9
+ from .monitoring import SmolLM3Monitor, create_monitor_from_config
10
+ from .train import SmolLM3Trainer
11
+ from .trainer import SmolLM3Trainer as Trainer
12
+ from .trackio import init, log, finish, log_config, log_checkpoint, log_evaluation_results
13
+
14
+ __all__ = [
15
+ 'SmolLM3Config',
16
+ 'SmolLM3Dataset',
17
+ 'SmolLM3Model',
18
+ 'SmolLM3Monitor',
19
+ 'create_monitor_from_config',
20
+ 'SmolLM3Trainer',
21
+ 'Trainer',
22
+ # Trackio interface
23
+ 'init',
24
+ 'log',
25
+ 'finish',
26
+ 'log_config',
27
+ 'log_checkpoint',
28
+ 'log_evaluation_results'
29
+ ]
src/trackio.py ADDED
@@ -0,0 +1,199 @@
1
+ """
2
+ Trackio Module Interface for TRL Library
3
+ Provides the interface expected by TRL library while integrating with our custom monitoring system
4
+ """
5
+
6
+ import os
7
+ import logging
8
+ from typing import Dict, Any, Optional
9
+ from datetime import datetime
10
+
11
+ # Import our custom monitoring
12
+ from monitoring import SmolLM3Monitor
13
+
14
+ logger = logging.getLogger(__name__)
15
+
16
+ # Global monitor instance
17
+ _monitor = None
18
+
19
+ def init(
20
+ project_name: str,
21
+ experiment_name: Optional[str] = None,
22
+ **kwargs
23
+ ) -> str:
24
+ """
25
+ Initialize trackio experiment (TRL interface)
26
+
27
+ Args:
28
+ project_name: Name of the project
29
+ experiment_name: Name of the experiment (optional)
30
+ **kwargs: Additional configuration parameters
31
+
32
+ Returns:
33
+ Experiment ID
34
+ """
35
+ global _monitor
36
+
37
+ try:
38
+ # Extract configuration from kwargs
39
+ trackio_url = kwargs.get('trackio_url') or os.environ.get('TRACKIO_URL')
40
+ trackio_token = kwargs.get('trackio_token') or os.environ.get('TRACKIO_TOKEN')
41
+ hf_token = kwargs.get('hf_token') or os.environ.get('HF_TOKEN')
42
+ dataset_repo = kwargs.get('dataset_repo') or os.environ.get('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments')
43
+
44
+ # Use experiment_name if provided, otherwise use project_name
45
+ exp_name = experiment_name or project_name
46
+
47
+ # Create monitor instance
48
+ _monitor = SmolLM3Monitor(
49
+ experiment_name=exp_name,
50
+ trackio_url=trackio_url,
51
+ trackio_token=trackio_token,
52
+ enable_tracking=True,
53
+ log_artifacts=True,
54
+ log_metrics=True,
55
+ log_config=True,
56
+ hf_token=hf_token,
57
+ dataset_repo=dataset_repo
58
+ )
59
+
60
+ # Generate experiment ID
61
+ experiment_id = f"trl_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
62
+ _monitor.experiment_id = experiment_id
63
+
64
+ logger.info(f"Trackio initialized for experiment: {exp_name}")
65
+ logger.info(f"Experiment ID: {experiment_id}")
66
+
67
+ return experiment_id
68
+
69
+ except Exception as e:
70
+ logger.error(f"Failed to initialize trackio: {e}")
71
+ # Return a fallback experiment ID
72
+ return f"trl_fallback_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
73
+
74
+ def log(
75
+ metrics: Dict[str, Any],
76
+ step: Optional[int] = None,
77
+ **kwargs
78
+ ):
79
+ """
80
+ Log metrics to trackio (TRL interface)
81
+
82
+ Args:
83
+ metrics: Dictionary of metrics to log
84
+ step: Current training step
85
+ **kwargs: Additional parameters
86
+ """
87
+ global _monitor
88
+
89
+ try:
90
+ if _monitor is None:
91
+ logger.warning("Trackio not initialized, skipping log")
92
+ return
93
+
94
+ # Log metrics using our custom monitor
95
+ _monitor.log_metrics(metrics, step)
96
+
97
+ # Also log system metrics if available
98
+ _monitor.log_system_metrics(step)
99
+
100
+ except Exception as e:
101
+ logger.error(f"Failed to log metrics: {e}")
102
+
103
+ def finish():
104
+ """
105
+ Finish trackio experiment (TRL interface)
106
+ """
107
+ global _monitor
108
+
109
+ try:
110
+ if _monitor is None:
111
+ logger.warning("Trackio not initialized, skipping finish")
112
+ return
113
+
114
+ # Close the monitoring session
115
+ _monitor.close()
116
+
117
+ logger.info("Trackio experiment finished")
118
+
119
+ except Exception as e:
120
+ logger.error(f"Failed to finish trackio experiment: {e}")
121
+
122
+ def log_config(config: Dict[str, Any]):
123
+ """
124
+ Log configuration to trackio (TRL interface)
125
+
126
+ Args:
127
+ config: Configuration dictionary to log
128
+ """
129
+ global _monitor
130
+
131
+ try:
132
+ if _monitor is None:
133
+ logger.warning("Trackio not initialized, skipping config log")
134
+ return
135
+
136
+ # Log configuration using our custom monitor
137
+ _monitor.log_configuration(config)
138
+
139
+ except Exception as e:
140
+ logger.error(f"Failed to log config: {e}")
141
+
142
+ def log_checkpoint(checkpoint_path: str, step: Optional[int] = None):
143
+ """
144
+ Log checkpoint to trackio (TRL interface)
145
+
146
+ Args:
147
+ checkpoint_path: Path to the checkpoint file
148
+ step: Current training step
149
+ """
150
+ global _monitor
151
+
152
+ try:
153
+ if _monitor is None:
154
+ logger.warning("Trackio not initialized, skipping checkpoint log")
155
+ return
156
+
157
+ # Log checkpoint using our custom monitor
158
+ _monitor.log_model_checkpoint(checkpoint_path, step)
159
+
160
+ except Exception as e:
161
+ logger.error(f"Failed to log checkpoint: {e}")
162
+
163
+ def log_evaluation_results(results: Dict[str, Any], step: Optional[int] = None):
164
+ """
165
+ Log evaluation results to trackio (TRL interface)
166
+
167
+ Args:
168
+ results: Evaluation results dictionary
169
+ step: Current training step
170
+ """
171
+ global _monitor
172
+
173
+ try:
174
+ if _monitor is None:
175
+ logger.warning("Trackio not initialized, skipping evaluation log")
176
+ return
177
+
178
+ # Log evaluation results using our custom monitor
179
+ _monitor.log_evaluation_results(results, step)
180
+
181
+ except Exception as e:
182
+ logger.error(f"Failed to log evaluation results: {e}")
183
+
184
+ # Additional utility functions for TRL compatibility
185
+ def get_experiment_url() -> Optional[str]:
186
+ """Get the URL to view the experiment"""
187
+ global _monitor
188
+
189
+ if _monitor is not None:
190
+ return _monitor.get_experiment_url()
191
+ return None
192
+
193
+ def is_available() -> bool:
194
+ """Check if trackio is available and initialized"""
195
+ return _monitor is not None and _monitor.enable_tracking
196
+
197
+ def get_monitor():
198
+ """Get the current monitor instance (for advanced usage)"""
199
+ return _monitor
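
The module above is a thin facade: module-level functions delegate to a single global monitor and skip the call when `init` has not run. A reduced sketch of that pattern follows. `RecordingMonitor` is a hypothetical stand-in for `SmolLM3Monitor` that only records calls in memory, and `log` returns a boolean here purely to make the skip visible (the module in the diff returns nothing).

```python
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple


class RecordingMonitor:
    """Hypothetical stand-in for SmolLM3Monitor; it stores calls instead of sending them."""

    def __init__(self, experiment_name: str) -> None:
        self.experiment_name = experiment_name
        self.logged: List[Tuple[Optional[int], Dict[str, Any]]] = []
        self.closed = False

    def log_metrics(self, metrics: Dict[str, Any], step: Optional[int] = None) -> None:
        self.logged.append((step, dict(metrics)))

    def close(self) -> None:
        self.closed = True


# Single module-level monitor, as in the diff.
_monitor: Optional[RecordingMonitor] = None


def init(project_name: str, experiment_name: Optional[str] = None, **kwargs: Any) -> str:
    """Create the global monitor and return a timestamp-based experiment ID."""
    global _monitor
    _monitor = RecordingMonitor(experiment_name or project_name)
    return f"trl_{datetime.now().strftime('%Y%m%d_%H%M%S')}"


def log(metrics: Dict[str, Any], step: Optional[int] = None, **kwargs: Any) -> bool:
    """Forward metrics to the monitor; return False when init() has not been called."""
    if _monitor is None:
        return False
    _monitor.log_metrics(metrics, step)
    return True


def finish() -> None:
    """Close the monitor if one exists."""
    if _monitor is not None:
        _monitor.close()


def get_monitor() -> Optional[RecordingMonitor]:
    return _monitor
```

A call sequence such as `init("demo")`, `log({"loss": 0.5}, step=1)`, `finish()` leaves one recorded entry on the monitor and marks it closed, while a `log` call before `init` is silently skipped.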
src/trainer.py CHANGED
@@ -135,6 +135,23 @@ class SmolLM3Trainer:
 
         logger.info("Total callbacks: %d", len(callbacks))
 
+        # Initialize trackio for TRL compatibility
+        try:
+            import trackio
+            # Initialize trackio with our configuration
+            experiment_id = trackio.init(
+                project_name=self.config.experiment_name,
+                experiment_name=self.config.experiment_name,
+                trackio_url=getattr(self.config, 'trackio_url', None),
+                trackio_token=getattr(self.config, 'trackio_token', None),
+                hf_token=getattr(self.config, 'hf_token', None),
+                dataset_repo=getattr(self.config, 'dataset_repo', None)
+            )
+            logger.info(f"Trackio initialized with experiment ID: {experiment_id}")
+        except Exception as e:
+            logger.warning(f"Failed to initialize trackio: {e}")
+            logger.info("Continuing without trackio integration")
+
         # Try SFTTrainer first (better for instruction tuning)
         logger.info("Creating SFTTrainer with training arguments...")
         logger.info("Training args type: %s", type(training_args))
@@ -235,6 +252,14 @@ class SmolLM3Trainer:
                 self.monitor.log_training_summary(summary)
                 self.monitor.close()
 
+            # Finish trackio experiment
+            try:
+                import trackio
+                trackio.finish()
+                logger.info("Trackio experiment finished")
+            except Exception as e:
+                logger.warning(f"Failed to finish trackio experiment: {e}")
+
             logger.info("Training completed successfully!")
             logger.info("Training metrics: %s", train_result.metrics)
 
@@ -243,6 +268,14 @@ class SmolLM3Trainer:
             # Close monitoring on error
             if self.monitor and self.monitor.enable_tracking:
                 self.monitor.close()
+
+            # Finish trackio experiment on error
+            try:
+                import trackio
+                trackio.finish()
+            except Exception as finish_error:
+                logger.warning(f"Failed to finish trackio experiment on error: {finish_error}")
+
             raise
 
     def evaluate(self):
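
The `src/trainer.py` change wraps each trackio call in its own `try`/`except`, so a tracking failure is logged and training carries on. Below is a minimal, self-contained sketch of that best-effort pattern. `FlakyTracker`, `best_effort`, and `run_training` are hypothetical names used only for illustration; they are not part of this repository or of any trackio API.

```python
import logging

logger = logging.getLogger("best_effort_demo")


class FlakyTracker:
    """Hypothetical stand-in for a tracking backend; not the real trackio API."""

    def __init__(self, fail_on_init=False):
        self.fail_on_init = fail_on_init
        self.finished = False

    def init(self, project_name):
        if self.fail_on_init:
            raise ConnectionError("tracking backend unreachable")
        return f"{project_name}-exp"

    def finish(self):
        self.finished = True


def best_effort(action, description):
    """Run a tracking call; log and swallow any failure so training can continue."""
    try:
        return action()
    except Exception as exc:
        logger.warning("Failed to %s: %s", description, exc)
        return None


def run_training(tracker, train_step):
    """Initialize tracking, run training, and always attempt to finish tracking."""
    experiment_id = best_effort(lambda: tracker.init("demo"), "initialize tracker")
    try:
        result = train_step()
    finally:
        # Runs on success and on failure; an exception from train_step still propagates.
        best_effort(tracker.finish, "finish tracker")
    return experiment_id, result
```

One design difference from the commit is worth noting: the sketch uses `try`/`finally` so the finish call is written once, whereas the diff repeats the finish block in both the success path and the error path. Either approach keeps tracking errors from masking the original training exception.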
setup_launch.py → tests/setup_launch.py RENAMED
File without changes
tests/test_trackio_trl_fix.py ADDED
@@ -0,0 +1,153 @@
+#!/usr/bin/env python3
+"""
+Test script to verify Trackio TRL compatibility fix
+Tests that our trackio module provides the interface expected by TRL library
+"""
+
+import sys
+import os
+import logging
+
+# Add src to path
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
+
+def test_trackio_interface():
+    """Test that trackio module provides the expected interface"""
+    print("🔍 Testing Trackio TRL Interface")
+
+    try:
+        # Test importing trackio
+        import trackio
+        print("✅ Successfully imported trackio module")
+
+        # Test that required functions exist
+        required_functions = ['init', 'log', 'finish']
+        for func_name in required_functions:
+            if hasattr(trackio, func_name):
+                print(f"✅ Found required function: {func_name}")
+            else:
+                print(f"❌ Missing required function: {func_name}")
+                return False
+
+        # Test initialization
+        experiment_id = trackio.init(
+            project_name="test_project",
+            experiment_name="test_experiment",
+            trackio_url="https://test.hf.space",
+            dataset_repo="test/trackio-experiments"
+        )
+        print(f"✅ Trackio initialization successful: {experiment_id}")
+
+        # Test logging
+        metrics = {'loss': 0.5, 'learning_rate': 1e-4}
+        trackio.log(metrics, step=1)
+        print("✅ Trackio logging successful")
+
+        # Test finishing
+        trackio.finish()
+        print("✅ Trackio finish successful")
+
+        return True
+
+    except Exception as e:
+        print(f"❌ Trackio interface test failed: {e}")
+        return False
+
+def test_trl_compatibility():
+    """Test that our trackio module is compatible with TRL expectations"""
+    print("\n🔍 Testing TRL Compatibility")
+
+    try:
+        # Simulate what TRL would do
+        import trackio
+
+        # TRL expects these functions to be available
+        assert hasattr(trackio, 'init'), "trackio.init not found"
+        assert hasattr(trackio, 'log'), "trackio.log not found"
+        assert hasattr(trackio, 'finish'), "trackio.finish not found"
+
+        # Test function signatures
+        import inspect
+
+        # Check init signature
+        init_sig = inspect.signature(trackio.init)
+        print(f"✅ init signature: {init_sig}")
+
+        # Check log signature
+        log_sig = inspect.signature(trackio.log)
+        print(f"✅ log signature: {log_sig}")
+
+        # Check finish signature
+        finish_sig = inspect.signature(trackio.finish)
+        print(f"✅ finish signature: {finish_sig}")
+
+        print("✅ TRL compatibility test passed")
+        return True
+
+    except Exception as e:
+        print(f"❌ TRL compatibility test failed: {e}")
+        return False
+
+def test_monitoring_integration():
+    """Test that our trackio module integrates with our monitoring system"""
+    print("\n🔍 Testing Monitoring Integration")
+
+    try:
+        import trackio
+
+        # Test that we can get the monitor
+        monitor = trackio.get_monitor()
+        if monitor is not None:
+            print("✅ Monitor integration working")
+        else:
+            print("⚠️ Monitor not available (this is normal if not initialized)")
+
+        # Test availability check
+        is_avail = trackio.is_available()
+        print(f"✅ Trackio availability check: {is_avail}")
+
+        return True
+
+    except Exception as e:
+        print(f"❌ Monitoring integration test failed: {e}")
+        return False
+
+def main():
+    """Run all tests"""
+    print("🚀 Testing Trackio TRL Fix")
+    print("=" * 50)
+
+    tests = [
+        test_trackio_interface,
+        test_trl_compatibility,
+        test_monitoring_integration
+    ]
+
+    passed = 0
+    total = len(tests)
+
+    for test in tests:
+        try:
+            if test():
+                passed += 1
+        except Exception as e:
+            print(f"❌ Test {test.__name__} failed with exception: {e}")
+
+    print("\n" + "=" * 50)
+    print(f"Test Results: {passed}/{total} tests passed")
+
+    if passed == total:
+        print("✅ All tests passed! Trackio TRL fix is working correctly.")
+        print("\nThe trackio module now provides the interface expected by TRL library:")
+        print("- init(): Initialize experiment")
+        print("- log(): Log metrics")
+        print("- finish(): Finish experiment")
+        print("\nThis should resolve the 'module trackio has no attribute init' error.")
+    else:
+        print("❌ Some tests failed. Please check the implementation.")
+        return 1
+
+    return 0
+
+if __name__ == "__main__":
+    sys.exit(main())
trackio.py ADDED
@@ -0,0 +1,30 @@
+"""
+Trackio Module for TRL Library Compatibility
+This module provides the interface expected by TRL library while using our custom monitoring system
+"""
+
+# Import all functions from our custom trackio implementation
+from src.trackio import (
+    init,
+    log,
+    finish,
+    log_config,
+    log_checkpoint,
+    log_evaluation_results,
+    get_experiment_url,
+    is_available,
+    get_monitor
+)
+
+# Make all functions available at module level
+__all__ = [
+    'init',
+    'log',
+    'finish',
+    'log_config',
+    'log_checkpoint',
+    'log_evaluation_results',
+    'get_experiment_url',
+    'is_available',
+    'get_monitor'
+]
+ ]
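
The top-level `trackio.py` is a re-export shim. Provided the project root comes before any installed package of the same name on `sys.path`, `import trackio` resolves to this file, and callers see `init`, `log`, and `finish` at module level. The sketch below illustrates the idea with a throwaway module built in memory. The module name `demo_tracker` and the helpers `make_demo_module` and `missing_functions` are hypothetical. The list of required names comes from the test script in this commit, not from TRL's documentation.

```python
import importlib
import sys
import types

REQUIRED = ("init", "log", "finish")  # names checked by the test script in this commit


def make_demo_module(name="demo_tracker"):
    """Build an in-memory module exposing a minimal tracking interface (hypothetical)."""
    module = types.ModuleType(name)
    state = {"metrics": []}

    def init(project_name, **kwargs):
        state["project"] = project_name
        return f"{project_name}-001"

    def log(metrics, step=None):
        state["metrics"].append((step, dict(metrics)))

    def finish():
        state["finished"] = True

    module.init, module.log, module.finish = init, log, finish
    module._state = state
    module.__all__ = list(REQUIRED)
    return module


def missing_functions(module, required=REQUIRED):
    """Return the required names that the module does not expose as callables."""
    return [n for n in required if not callable(getattr(module, n, None))]


# Register the module so the import machinery resolves the name to it.
sys.modules["demo_tracker"] = make_demo_module()
demo = importlib.import_module("demo_tracker")
```

Calling `missing_functions(demo)` before training starts gives an early, explicit failure if the shim is shadowed by a different module that lacks one of the expected functions.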