Tonic committed · Commit 5fe83da · verified · 1 Parent(s): 231fcd0

adds A100 large experiments

A100_LARGE_SCALE_GUIDE.md ADDED
@@ -0,0 +1,195 @@
1
+ # A100 Large Scale Training Guide
2
+
3
+ This guide provides configurations and instructions for running full-scale experiments with multiple passes over the complete OpenHermes-FR dataset (~800k datapoints) on A100 GPUs.
4
+
5
+ ## Available Configurations
6
+
7
+ ### 1. A100 Large Batch Configuration
8
+ **File**: `config/train_smollm3_openhermes_fr_a100_large.py`
9
+
10
+ **Key Features**:
11
+ - **Effective Batch Size**: 128 (8 × 16 gradient accumulation)
12
+ - **Training Duration**: ~1.3 passes (8,000 steps)
13
+ - **Learning Rate**: 5e-6 (optimized for large batches)
14
+ - **Mixed Precision**: bf16 (A100 optimized)
15
+ - **Sequence Length**: 8192 tokens
16
+ - **Gradient Checkpointing**: Disabled; the A100 has enough memory to skip activation recomputation, which speeds up each step
17
+
18
+ **Estimated Training Time**: ~6-8 hours on A100
19
+
20
+ ### 2. Multiple Passes Configuration
21
+ **File**: `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`
22
+
23
+ **Key Features**:
24
+ - **Effective Batch Size**: 120 (6 × 20 gradient accumulation)
25
+ - **Training Duration**: ~4 passes (25,000 steps)
26
+ - **Learning Rate**: 3e-6 (conservative for long training)
27
+ - **Warmup Steps**: 2000 (longer warmup for stability)
28
+ - **Checkpoint Strategy**: Saves every 2000 steps (12 checkpoints over the run)
29
+
30
+ **Estimated Training Time**: ~20-24 hours on A100
31
+
32
+ ## Training Commands
33
+
34
+ ### Quick Start - Large Batch Experiment
35
+ ```bash
36
+ python run_a100_large_experiment.py \
37
+ --config config/train_smollm3_openhermes_fr_a100_large.py \
38
+ --experiment-name "smollm3_openhermes_fr_large_batch" \
39
+ --output-dir ./outputs/large_batch
40
+ ```
41
+
42
+ ### Multiple Passes Experiment
43
+ ```bash
44
+ python run_a100_large_experiment.py \
45
+ --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
46
+ --experiment-name "smollm3_openhermes_fr_multiple_passes" \
47
+ --output-dir ./outputs/multiple_passes
48
+ ```
49
+
50
+ ### Dry Run (Check Configuration)
51
+ ```bash
52
+ python run_a100_large_experiment.py \
53
+ --config config/train_smollm3_openhermes_fr_a100_large.py \
54
+ --dry-run
55
+ ```
56
+
57
+ ### Resume Training
58
+ ```bash
59
+ python run_a100_large_experiment.py \
60
+ --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
61
+ --resume ./outputs/multiple_passes/checkpoint-10000 \
62
+ --output-dir ./outputs/multiple_passes
63
+ ```
64
+
65
+ ## Configuration Details
66
+
67
+ ### Memory Usage Optimization
68
+ - **Gradient Checkpointing**: Disabled for A100 efficiency
69
+ - **Flash Attention**: Enabled for memory efficiency
70
+ - **bf16 Mixed Precision**: Better for A100 than fp16
71
+ - **Gradient Clipping**: 1.0 for stability
72
+ - **Group by Length**: Enabled for better batching
73
+
74
+ ### Data Loading Optimization
75
+ - **Num Workers**: 8 for faster data loading
76
+ - **Pin Memory**: Enabled for GPU transfer efficiency
77
+ - **Prefetch Factor**: 2 for pipeline optimization
78
+
79
+ ### Training Stability
80
+ - **Conservative Learning Rate**: Lower LR for large effective batch sizes
81
+ - **Longer Warmup**: More warmup steps for stability
82
+ - **Higher Beta2**: 0.999 for AdamW stability
83
+ - **Gradient Clipping**: Prevents gradient explosion
84
+
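The settings listed above could be written as keyword arguments for Hugging Face `transformers.TrainingArguments`, roughly as sketched below. The argument names are the ones `TrainingArguments` uses, but whether this repository's config classes forward them under these exact names is an assumption, and `check_precision` is a hypothetical helper. Flash attention is normally selected when the model is loaded, so it does not appear here.

```python
# Hypothetical mapping of the guide's A100 settings to TrainingArguments keyword names.
a100_training_kwargs = {
    "bf16": True,                      # bf16 mixed precision
    "fp16": False,
    "gradient_checkpointing": False,   # disabled for speed on A100
    "max_grad_norm": 1.0,              # gradient clipping
    "group_by_length": True,           # batch samples of similar length together
    "dataloader_num_workers": 8,
    "dataloader_pin_memory": True,
    "dataloader_prefetch_factor": 2,
    "adam_beta2": 0.999,
}


def check_precision(kwargs):
    """Return the active mixed-precision mode, rejecting contradictory settings."""
    if kwargs.get("bf16") and kwargs.get("fp16"):
        raise ValueError("Enable only one of bf16 and fp16")
    if kwargs.get("bf16"):
        return "bf16"
    if kwargs.get("fp16"):
        return "fp16"
    return "fp32"


print(check_precision(a100_training_kwargs))
```

Keeping the values in one dictionary makes it easier to compare the two configurations side by side before launching a long run.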
85
+ ## Expected Results
86
+
87
+ ### Large Batch Configuration (1.3 passes)
88
+ - **Training Steps**: 8,000
89
+ - **Effective Batch Size**: 128
90
+ - **Steps per Epoch**: ~6,250
91
+ - **Epochs**: ~1.3
92
+ - **Expected Loss**: Should converge to ~1.5-2.0
93
+
94
+ ### Multiple Passes Configuration (4 passes)
95
+ - **Training Steps**: 25,000
96
+ - **Effective Batch Size**: 120
97
+ - **Steps per Epoch**: ~6,667
98
+ - **Epochs**: ~3.75
99
+ - **Expected Loss**: Should converge to ~1.2-1.5
100
+
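The step and epoch figures above follow from simple arithmetic, sketched here for checking your own variants. It assumes a dataset size of about 800,000 examples, as stated in this guide; `epochs_covered` is an illustrative helper, not part of the repository.

```python
DATASET_SIZE = 800_000  # approximate dataset size used in this guide


def epochs_covered(max_steps, batch_size, grad_accum, dataset_size=DATASET_SIZE):
    """Return (effective batch size, steps per epoch, epochs covered by max_steps)."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = dataset_size / effective_batch
    return effective_batch, steps_per_epoch, max_steps / steps_per_epoch


for name, steps, bs, accum in [
    ("large batch", 8_000, 8, 16),
    ("multiple passes", 25_000, 6, 20),
]:
    eff, per_epoch, epochs = epochs_covered(steps, bs, accum)
    print(f"{name}: effective batch {eff}, ~{per_epoch:,.0f} steps/epoch, ~{epochs:.2f} epochs")
```

The multiple-passes configuration works out to about 3.75 epochs, which the headings round up to 4 passes.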
101
+ ## Monitoring and Logging
102
+
103
+ ### Trackio Integration
104
+ Both configurations include Trackio monitoring:
105
+ - **Metrics Logging**: Every 25-50 steps
106
+ - **Artifact Logging**: Model checkpoints
107
+ - **Config Logging**: Training configuration
108
+
109
+ ### Checkpoint Strategy
110
+ - **Large Batch**: Save every 1000 steps (8 checkpoints)
111
+ - **Multiple Passes**: Save every 2000 steps (12 checkpoints)
112
+ - **Best Model**: Automatically load best model at end
113
+
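The checkpoint counts above, and a rough disk budget, can be estimated as follows. The 6 GB per checkpoint is an assumption (about 3 billion parameters at 2 bytes each, weights only); saved optimizer state would add to it, and a retention limit would cap the total.

```python
def checkpoint_plan(max_steps, save_every, gb_per_checkpoint):
    """Return (number of checkpoints, total disk in GB if all are kept)."""
    count = max_steps // save_every
    return count, count * gb_per_checkpoint


# Assumed checkpoint size: ~3e9 parameters * 2 bytes (bf16 weights only).
ASSUMED_GB_PER_CHECKPOINT = 6

print(checkpoint_plan(8_000, 1_000, ASSUMED_GB_PER_CHECKPOINT))   # large batch
print(checkpoint_plan(25_000, 2_000, ASSUMED_GB_PER_CHECKPOINT))  # multiple passes
```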
114
+ ## Hardware Requirements
115
+
116
+ ### Minimum Requirements
117
+ - **GPU**: A100 80GB (or multiple A100s)
118
+ - **RAM**: 64GB+ system RAM
119
+ - **Storage**: 100GB+ for checkpoints and logs
120
+ - **Network**: Fast internet for dataset download
121
+
122
+ ### Recommended Setup
123
+ - **GPU**: 2-4x A100 80GB
124
+ - **RAM**: 128GB+ system RAM
125
+ - **Storage**: 500GB+ NVMe SSD
126
+ - **Network**: 10Gbps+ connection
127
+
128
+ ## Troubleshooting
129
+
130
+ ### Out of Memory (OOM)
131
+ If you encounter OOM errors:
132
+ 1. Reduce `batch_size` from 8 to 6 or 4
133
+ 2. Increase `gradient_accumulation_steps` to maintain effective batch size
134
+ 3. Reduce `max_seq_length` from 8192 to 4096
135
+
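Steps 1 and 2 above can be combined into one calculation: pick a smaller per-device batch size, then round the accumulation steps up so the effective batch size does not shrink. A minimal sketch, with `rebalance` as a hypothetical helper name:

```python
import math


def rebalance(target_effective_batch, new_batch_size):
    """Return (gradient_accumulation_steps, resulting effective batch size)."""
    accum = math.ceil(target_effective_batch / new_batch_size)
    return accum, new_batch_size * accum


print(rebalance(128, 4))  # divides evenly
print(rebalance(128, 6))  # rounds up, so the effective batch is slightly larger
```

When the target is not a multiple of the new batch size, the effective batch size changes slightly, so the steps-per-epoch figures shift as well.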
136
+ ### Slow Training
137
+ If training is too slow:
138
+ 1. Increase `dataloader_num_workers` to 12-16
139
+ 2. Ensure you're using bf16 mixed precision
140
+ 3. Check that gradient checkpointing is disabled
141
+ 4. Verify flash attention is enabled
142
+
143
+ ### Convergence Issues
144
+ If loss doesn't converge:
145
+ 1. Reduce learning rate by 2x
146
+ 2. Increase warmup steps
147
+ 3. Check gradient norms in logs
148
+ 4. Verify dataset quality
149
+
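For step 3, one way to scan logged gradient norms is to flag any step whose norm is far above the recent median. This sketch assumes you have already extracted one gradient-norm value per logged step into a list; the window and factor are illustrative thresholds, not values from the repository.

```python
from statistics import median


def find_grad_norm_spikes(grad_norms, window=5, factor=3.0):
    """Return indices whose norm exceeds factor x the median of the previous `window` values."""
    spikes = []
    for i in range(window, len(grad_norms)):
        baseline = median(grad_norms[i - window:i])
        if baseline > 0 and grad_norms[i] > factor * baseline:
            spikes.append(i)
    return spikes


logged = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 1.0, 1.1]
print(find_grad_norm_spikes(logged))
```

Frequent spikes despite clipping at 1.0 usually suggest lowering the learning rate or lengthening the warmup before changing anything else.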
150
+ ## Customization
151
+
152
+ ### For Different Dataset Sizes
153
+ Adjust `max_iters` based on your dataset size:
154
+ ```python
155
+ # For 1M datapoints with effective batch size 120
156
+ steps_per_epoch = 1000000 // 120 # ~8,333 steps
157
+ max_iters = steps_per_epoch * desired_epochs
158
+ ```
159
+
160
+ ### For Different GPU Memory
161
+ Adjust batch size and gradient accumulation:
162
+ ```python
163
+ # For 40GB A100
164
+ batch_size = 4
165
+ gradient_accumulation_steps = 32 # Effective batch size = 128
166
+
167
+ # For 24GB GPU
168
+ batch_size = 2
169
+ gradient_accumulation_steps = 64 # Effective batch size = 128
170
+ ```
171
+
172
+ ## Performance Tips
173
+
174
+ 1. **Use bf16**: Better than fp16 for A100
175
+ 2. **Disable Gradient Checkpointing**: A100 has enough memory
176
+ 3. **Use Flash Attention**: Memory efficient attention
177
+ 4. **Group by Length**: Better batching efficiency
178
+ 5. **Pin Memory**: Faster GPU transfers
179
+ 6. **Multiple Workers**: Faster data loading
180
+
181
+ ## Expected Timeline
182
+
183
+ - **Large Batch**: 6-8 hours for 1.3 passes
184
+ - **Multiple Passes**: 20-24 hours for 4 passes
185
+ - **Full Dataset (5+ passes)**: 30+ hours
186
+
187
+ ## Next Steps
188
+
189
+ After training completes:
190
+ 1. Evaluate on validation set
191
+ 2. Test generation quality
192
+ 3. Push to Hugging Face Hub
193
+ 4. Deploy for inference
194
+
195
+ For deployment instructions, see `DEPLOYMENT_GUIDE.md`.
CLOUD_DEPLOYMENT_GUIDE.md ADDED
@@ -0,0 +1,462 @@
1
+ # Cloud Deployment Guide for SmolLM3 DPO Training
2
+
3
+ This guide provides the exact sequence of commands to deploy and run a 6-epoch SmolLM3 DPO training job on a cloud instance.
4
+
5
+ ## Prerequisites
6
+
7
+ ### Cloud Instance Requirements
8
+
9
+ - **GPU**: NVIDIA A100, H100, or similar (16GB+ VRAM)
10
+ - **RAM**: 64GB+ system memory
11
+ - **Storage**: 100GB+ SSD storage
12
+ - **OS**: Ubuntu 20.04 or 22.04
13
+
14
+ ### Required Information
15
+
16
+ Before starting, gather these details:
17
+ - Your Hugging Face username
18
+ - Your Hugging Face token (with write permissions)
19
+ - Your Trackio Space URL (if using monitoring)
20
+
21
+ ## Step-by-Step Deployment
22
+
23
+ ### Step 1: Launch Cloud Instance
24
+
25
+ Choose your cloud provider and launch an instance:
26
+
27
+ #### AWS (g5.2xlarge or g5.4xlarge)
28
+ ```bash
29
+ # Launch instance with Ubuntu 22.04 and appropriate GPU
30
+ aws ec2 run-instances \
31
+ --image-id ami-0c7217cdde317cfec \
32
+ --instance-type g5.2xlarge \
33
+ --key-name your-key-pair \
34
+ --security-group-ids sg-xxxxxxxxx
35
+ ```
36
+
37
+ #### Google Cloud (n1-standard-8 with T4/V100)
38
+ ```bash
39
+ gcloud compute instances create smollm3-dpo \
40
+ --zone=us-central1-a \
41
+ --machine-type=n1-standard-8 \
42
+ --accelerator="type=nvidia-tesla-t4,count=1" \
43
+ --image-family=ubuntu-2204-lts \
44
+ --image-project=ubuntu-os-cloud
45
+ ```
46
+
47
+ #### Azure (Standard_NC6s_v3)
48
+ ```bash
49
+ az vm create \
50
+ --resource-group your-rg \
51
+ --name smollm3-dpo \
52
+ --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
53
+ --size Standard_NC6s_v3 \
54
+ --admin-username azureuser
55
+ ```
56
+
57
+ ### Step 2: Connect to Instance
58
+
59
+ ```bash
60
+ # SSH to your instance
61
+ ssh -i your-key.pem ubuntu@your-instance-ip
62
+
63
+ # Or for Azure
64
+ ssh azureuser@your-instance-ip
65
+ ```
66
+
67
+ ### Step 3: Update System and Install Dependencies
68
+
69
+ ```bash
70
+ # Update system
71
+ sudo apt-get update
72
+ sudo apt-get upgrade -y
73
+
74
+ # Install system dependencies
75
+ sudo apt-get install -y git curl wget unzip python3 python3-pip python3-venv
76
+
77
+ # Install the NVIDIA Container Toolkit (this does not install GPU drivers; they must already be present)
78
+ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
79
+ curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
80
+ sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
81
+ sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
82
+
83
+ sudo apt-get update
84
+ sudo apt-get install -y nvidia-container-toolkit
85
+ ```
86
+
87
+ ### Step 4: Clone Repository and Setup Environment
88
+
89
+ ```bash
90
+ # Clone your repository
91
+ git clone https://github.com/your-username/flexai-finetune.git
92
+ cd flexai-finetune
93
+
94
+ # Create virtual environment
95
+ python3 -m venv smollm3_env
96
+ source smollm3_env/bin/activate
97
+
98
+ # Install PyTorch with CUDA
99
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
100
+
101
+ # Install project dependencies
102
+ pip install -r requirements.txt
103
+
104
+ # Install additional DPO dependencies
105
+ pip install "trl>=0.7.0"
+ pip install "peft>=0.4.0"
+ pip install "accelerate>=0.20.0"
108
+ ```
109
+
110
+ ### Step 5: Configure Authentication
111
+
112
+ ```bash
113
+ # Set your Hugging Face token
114
+ export HF_TOKEN="your_huggingface_token_here"
115
+
116
+ # Login to Hugging Face
117
+ huggingface-cli login --token "$HF_TOKEN"
118
+ ```
119
+
120
+ ### Step 6: Create Configuration Files
121
+
122
+ Create the DPO configuration file:
123
+
124
+ ```bash
125
+ cat > config/train_smollm3_dpo_6epochs.py << 'EOF'
126
+ """
127
+ SmolLM3 DPO Training Configuration - 6 Epochs
128
+ Optimized for cloud deployment
129
+ """
130
+
131
+ from config.train_smollm3_dpo import SmolLM3DPOConfig
132
+
133
+ config = SmolLM3DPOConfig(
134
+ # Model configuration
135
+ model_name="HuggingFaceTB/SmolLM3-3B",
136
+ max_seq_length=4096,
137
+ use_flash_attention=True,
138
+ use_gradient_checkpointing=True,
139
+
140
+ # Training configuration
141
+ batch_size=2,
142
+ gradient_accumulation_steps=8,
143
+ learning_rate=5e-6,
144
+ weight_decay=0.01,
145
+ warmup_steps=100,
146
+ max_iters=None, # Will be calculated based on epochs
147
+ eval_interval=100,
148
+ log_interval=10,
149
+ save_interval=500,
150
+
151
+ # DPO configuration
152
+ beta=0.1,
153
+ max_prompt_length=2048,
154
+
155
+ # Optimizer configuration
156
+ optimizer="adamw",
157
+ beta1=0.9,
158
+ beta2=0.95,
159
+ eps=1e-8,
160
+
161
+ # Scheduler configuration
162
+ scheduler="cosine",
163
+ min_lr=1e-6,
164
+
165
+ # Mixed precision
166
+ fp16=True,
167
+ bf16=False,
168
+
169
+ # Logging and saving
170
+ save_steps=500,
171
+ eval_steps=100,
172
+ logging_steps=10,
173
+ save_total_limit=3,
174
+
175
+ # Evaluation
176
+ eval_strategy="steps",
177
+ metric_for_best_model="eval_loss",
178
+ greater_is_better=False,
179
+ load_best_model_at_end=True,
180
+
181
+ # Data configuration
182
+ data_dir="smoltalk_dataset",
183
+ train_file="train.json",
184
+ validation_file="validation.json",
185
+
186
+ # Chat template configuration
187
+ use_chat_template=True,
188
+ chat_template_kwargs={
189
+ "enable_thinking": False,
190
+ "add_generation_prompt": True
191
+ },
192
+
193
+ # Trackio monitoring configuration
194
+ enable_tracking=True,
195
+ trackio_url="https://your-trackio-space.hf.space", # Change this
196
+ trackio_token=None,
197
+ log_artifacts=True,
198
+ log_metrics=True,
199
+ log_config=True,
200
+ experiment_name="smollm3_dpo_6epochs"
201
+ )
202
+ EOF
203
+ ```
204
+
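The configuration above combines `warmup_steps=100`, `scheduler="cosine"`, `learning_rate=5e-6` and `min_lr=1e-6`. A common formulation of such a schedule is linear warmup followed by cosine decay down to the minimum rate, sketched below. Whether the repository's scheduler follows exactly this formula is an assumption, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, max_steps, base_lr=5e-6, min_lr=1e-6, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr after max_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, max_steps=1000):.2e}")
```

Printing a few points like this before a long run is a cheap way to confirm that the warmup length and floor are what you intended.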
205
+ ### Step 7: Download and Prepare Dataset
206
+
207
+ ```bash
208
+ # Create dataset preparation script
209
+ cat > prepare_dataset.py << 'EOF'
+ from datasets import load_dataset
+ import json
+ import os
+
+ # Load SmolTalk dataset
+ print('Loading SmolTalk dataset...')
+ dataset = load_dataset('HuggingFaceTB/smoltalk')
+
+ # Create dataset directory
+ os.makedirs('smoltalk_dataset', exist_ok=True)
+
+ # Convert to DPO format (preference pairs)
+ def convert_to_dpo_format(example):
+     # Simplified example: DPO needs 'prompt', 'chosen' and 'rejected' fields.
+     # If your dataset does not provide them, build the preference pairs here.
+     return {
+         'prompt': example.get('prompt', ''),
+         'chosen': example.get('chosen', ''),
+         'rejected': example.get('rejected', '')
+     }
+
+ # Process train split, keeping only complete preference pairs
+ train_data = []
+ for example in dataset['train']:
+     dpo_example = convert_to_dpo_format(example)
+     if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
+         train_data.append(dpo_example)
+
+ # Process validation split, keeping only complete preference pairs
+ val_data = []
+ for example in dataset['validation']:
+     dpo_example = convert_to_dpo_format(example)
+     if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
+         val_data.append(dpo_example)
+
+ # Save to files
+ with open('smoltalk_dataset/train.json', 'w') as f:
+     json.dump(train_data, f, indent=2)
+
+ with open('smoltalk_dataset/validation.json', 'w') as f:
+     json.dump(val_data, f, indent=2)
+
+ print(f'Dataset prepared: {len(train_data)} train samples, {len(val_data)} validation samples')
+ EOF
254
+
255
+ # Run dataset preparation
256
+ python prepare_dataset.py
257
+ ```
258
+
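Because the preparation script silently drops any example without all three preference fields, it is worth checking the records before training. The sketch below validates a list of records in memory; `validate_dpo_records` is a hypothetical helper, and the identical-pair check is an extra sanity test rather than something the script above performs.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_dpo_records(records):
    """Split records into valid preference pairs and (index, reason) problems."""
    valid, problems = [], []
    for i, rec in enumerate(records):
        missing = [
            field for field in REQUIRED_FIELDS
            if not isinstance(rec.get(field), str) or not rec[field].strip()
        ]
        if missing:
            problems.append((i, "missing or empty: " + ", ".join(missing)))
        elif rec["chosen"] == rec["rejected"]:
            problems.append((i, "chosen and rejected are identical"))
        else:
            valid.append(rec)
    return valid, problems


sample = [
    {"prompt": "Hi", "chosen": "Hello!", "rejected": "Go away."},
    {"prompt": "Hi", "chosen": "Hello!"},
    {"prompt": "Hi", "chosen": "Same", "rejected": "Same"},
]
valid, problems = validate_dpo_records(sample)
print(len(valid), problems)
```

If the number of valid records is zero, the step calculation that follows will also produce zero training steps, so it is better to stop here.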
259
+ ### Step 8: Calculate Training Parameters
260
+
261
+ ```bash
262
+ # Calculate training steps based on epochs
263
+ TOTAL_SAMPLES=$(python -c "import json; data=json.load(open('smoltalk_dataset/train.json')); print(len(data))")
264
+ BATCH_SIZE=2
265
+ GRADIENT_ACCUMULATION_STEPS=8
266
+ MAX_EPOCHS=6
267
+ EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
268
+ STEPS_PER_EPOCH=$((TOTAL_SAMPLES / EFFECTIVE_BATCH_SIZE))
269
+ MAX_STEPS=$((STEPS_PER_EPOCH * MAX_EPOCHS))
270
+
271
+ echo "Training Configuration:"
272
+ echo " Total samples: $TOTAL_SAMPLES"
273
+ echo " Effective batch size: $EFFECTIVE_BATCH_SIZE"
274
+ echo " Steps per epoch: $STEPS_PER_EPOCH"
275
+ echo " Total training steps: $MAX_STEPS"
276
+ echo " Training epochs: $MAX_EPOCHS"
277
+ ```
278
+
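The shell arithmetic above uses integer division, which drops the final partial batch of each epoch and yields zero steps if the dataset is smaller than one effective batch. The same calculation in Python, with an option to round up instead, looks like this; `training_steps` is an illustrative helper.

```python
import math


def training_steps(total_samples, batch_size, grad_accum, epochs, drop_last=True):
    """Total optimizer steps for the given number of epochs."""
    effective = batch_size * grad_accum
    if drop_last:
        per_epoch = total_samples // effective            # matches the shell arithmetic
    else:
        per_epoch = math.ceil(total_samples / effective)  # counts the partial batch
    return per_epoch * epochs


print(training_steps(10_000, 2, 8, 6))
print(training_steps(10_010, 2, 8, 6, drop_last=False))
print(training_steps(10, 2, 8, 6))  # fewer samples than one effective batch
```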
279
+ ### Step 9: Start DPO Training
280
+
281
+ ```bash
282
+ # Start training with all parameters
283
+ python train.py config/train_smollm3_dpo_6epochs.py \
284
+ --dataset_dir smoltalk_dataset \
285
+ --out_dir /output-checkpoint \
286
+ --init_from scratch \
287
+ --max_iters $MAX_STEPS \
288
+ --batch_size $BATCH_SIZE \
289
+ --learning_rate 5e-6 \
290
+ --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
291
+ --max_seq_length 4096 \
292
+ --save_steps 500 \
293
+ --eval_steps 100 \
294
+ --logging_steps 10 \
295
+ --enable_tracking \
296
+ --trackio_url "https://your-trackio-space.hf.space" \
297
+ --experiment_name "smollm3_dpo_6epochs"
298
+ ```
299
+
300
+ ### Step 10: Push Model to Hugging Face Hub
301
+
302
+ ```bash
303
+ # Push the trained model
304
+ python push_to_huggingface.py /output-checkpoint "your-username/smollm3-dpo-6epochs" \
305
+ --token "$HF_TOKEN" \
306
+ --trackio-url "https://your-trackio-space.hf.space" \
307
+ --experiment-name "smollm3_dpo_6epochs"
308
+ ```
309
+
310
+ ### Step 11: Test the Uploaded Model
311
+
312
+ ```bash
313
+ # Test the model
314
+ python -c "
315
+ from transformers import AutoModelForCausalLM, AutoTokenizer
316
+ import torch
317
+
318
+ print('Loading uploaded model...')
319
+ model = AutoModelForCausalLM.from_pretrained('your-username/smollm3-dpo-6epochs', torch_dtype=torch.float16, device_map='auto')
320
+ tokenizer = AutoTokenizer.from_pretrained('your-username/smollm3-dpo-6epochs')
321
+
322
+ print('Testing model generation...')
323
+ prompt = 'Hello, how are you?'
324
+ inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
325
+ outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
326
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
327
+ print(f'Prompt: {prompt}')
328
+ print(f'Response: {response}')
329
+ print('✅ Model test completed successfully!')
330
+ "
331
+ ```
332
+
333
+ ## Complete One-Line Deployment
334
+
335
+ If you want to run everything automatically, use the deployment script:
336
+
337
+ ```bash
338
+ # Make script executable
339
+ chmod +x cloud_deployment.sh
340
+
341
+ # Edit configuration in the script first
342
+ nano cloud_deployment.sh
343
+ # Change these variables:
344
+ # - REPO_NAME="your-username/smollm3-dpo-6epochs"
345
+ # - TRACKIO_URL="https://your-trackio-space.hf.space"
346
+ # - HF_TOKEN="your_hf_token_here"
347
+
348
+ # Run the complete deployment
349
+ ./cloud_deployment.sh
350
+ ```
351
+
352
+ ## Monitoring and Debugging
353
+
354
+ ### Check GPU Usage
355
+
356
+ ```bash
357
+ # Monitor GPU usage during training
358
+ watch -n 1 nvidia-smi
359
+ ```
360
+
361
+ ### Check Training Logs
362
+
363
+ ```bash
364
+ # Monitor training progress
365
+ tail -f training.log
366
+
367
+ # Check system resources
368
+ htop
369
+ ```
370
+
371
+ ### Monitor Trackio
372
+
373
+ ```bash
374
+ # Check if Trackio is logging properly
375
+ curl -s "https://your-trackio-space.hf.space" | grep -i "experiment"
376
+ ```
377
+
378
+ ## Expected Timeline
379
+
380
+ - **Setup**: 15-30 minutes
381
+ - **Dataset preparation**: 5-10 minutes
382
+ - **Training (6 epochs)**: 4-8 hours (depending on GPU)
383
+ - **Model upload**: 10-30 minutes
384
+ - **Testing**: 5-10 minutes
385
+
386
+ ## Troubleshooting
387
+
388
+ ### Common Issues
389
+
390
+ #### 1. Out of Memory (OOM)
391
+ ```bash
392
+ # Reduce batch size
393
+ BATCH_SIZE=1
394
+ GRADIENT_ACCUMULATION_STEPS=16
395
+
396
+ # Or use gradient checkpointing
397
+ # Already enabled in config
398
+ ```
399
+
400
+ #### 2. Slow Training
401
+ ```bash
402
+ # Check GPU utilization
403
+ nvidia-smi
404
+
405
+ # Check if mixed precision is working
406
+ # Look for "fp16" in training logs
407
+ ```
408
+
409
+ #### 3. Dataset Issues
410
+ ```bash
411
+ # Check dataset format
412
+ head -n 5 smoltalk_dataset/train.json
413
+
414
+ # Verify dataset size
415
+ python -c "import json; print(len(json.load(open('smoltalk_dataset/train.json'))))"
416
+ ```
417
+
418
+ #### 4. Authentication Issues
419
+ ```bash
420
+ # Test HF token
421
+ python -c "
422
+ from huggingface_hub import HfApi
423
+ api = HfApi(token='$HF_TOKEN')
424
+ print('Token is valid for user:', api.whoami()['name'])
425
+ "
426
+ ```
427
+
428
+ ## Cost Estimation
429
+
430
+ ### AWS (g5.2xlarge)
431
+ - **Instance**: ~$1.212/hour (on-demand, us-east-1; varies by region)
+ - **Training time**: 6 hours
+ - **Total cost**: ~$7.27
434
+
435
+ ### Google Cloud (n1-standard-8 + T4)
436
+ - **Instance**: ~$0.73/hour ($0.38 for the VM plus ~$0.35 for the T4 GPU; varies by region)
+ - **Training time**: 6 hours
+ - **Total cost**: ~$4.38
439
+
440
+ ### Azure (Standard_NC6s_v3)
441
+ - **Instance**: ~$3.06/hour (pay-as-you-go; varies by region)
+ - **Training time**: 6 hours
+ - **Total cost**: ~$18.36
444
+
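Each total above is the hourly rate multiplied by the training time. Billed time also includes setup, dataset preparation and upload, so a small helper that accepts overhead hours gives a more realistic figure. The rates in the example calls are placeholders, not provider quotes.

```python
def estimate_cost(hourly_rate_usd, training_hours, overhead_hours=0.0):
    """Estimated bill in USD for one run, rounded to cents."""
    return round(hourly_rate_usd * (training_hours + overhead_hours), 2)


# Placeholder rates for illustration only; check your provider's current pricing.
print(estimate_cost(1.00, 6))
print(estimate_cost(1.00, 6, overhead_hours=1.0))
```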
445
+ ## Next Steps
446
+
447
+ After successful deployment:
448
+
449
+ 1. **Monitor training** in your Trackio Space
450
+ 2. **Check model repository** on Hugging Face Hub
451
+ 3. **Test the model** with different prompts
452
+ 4. **Share your model** with the community
453
+ 5. **Iterate and improve** based on results
454
+
455
+ ## Support
456
+
457
+ - **Training issues**: Check logs and GPU utilization
458
+ - **Upload issues**: Verify HF token and repository permissions
459
+ - **Monitoring issues**: Check Trackio Space configuration
460
+ - **Performance issues**: Adjust batch size and learning rate
461
+
462
+ Your SmolLM3 DPO model will be ready for use after training completes!
CLOUD_TRAINING_GUIDE.md ADDED
@@ -0,0 +1,440 @@
1
+ # Cloud Training Guide for OpenHermes-FR Dataset
2
+
3
+ This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the [legmlai/openhermes-fr](https://huggingface.co/datasets/legmlai/openhermes-fr) dataset.
4
+
5
+ ## Overview
6
+
7
+ The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, perfect for fine-tuning SmolLM3 models for French language tasks. This guide covers:
8
+
9
+ - ✅ **Cloud Instance Setup** - Complete environment configuration
10
+ - ✅ **Dataset Integration** - Automatic loading and filtering
11
+ - ✅ **Training Configuration** - Optimized for French instruction tuning
12
+ - ✅ **Monitoring Integration** - Trackio experiment tracking
13
+ - ✅ **Model Deployment** - Push to Hugging Face Hub
14
+
15
+ ## Dataset Information
16
+
17
+ ### Schema
18
+ ```json
19
+ {
20
+ "prompt": "Explique la différence entre la photosynthèse C3 et C4.",
21
+ "accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)",
22
+ "bad_prompt_detected": false,
23
+ "bad_response_detected": false,
24
+ "bad_entry": false
25
+ }
26
+ ```
27
+
28
+ ### Key Features
29
+ - **Size**: 799,875 examples (~1.4GB)
30
+ - **Language**: 100% French
31
+ - **Quality**: GPT-4o generated responses with automatic filtering
32
+ - **License**: ODC-BY 1.0
33
+
34
+ ## Cloud Instance Setup
35
+
36
+ ### 1. Choose Your Cloud Provider
37
+
38
+ #### **AWS EC2 (Recommended)**
39
+ ```bash
40
+ # Launch instance with GPU
41
+ # Recommended: g4dn.xlarge or g5.xlarge
42
+ # AMI: Deep Learning AMI (Ubuntu 20.04)
43
+ ```
44
+
45
+ #### **Google Cloud Platform**
46
+ ```bash
47
+ # Launch instance with GPU
48
+ # Recommended: n1-standard-4 with Tesla T4 or V100
49
+ ```
50
+
51
+ #### **Azure**
52
+ ```bash
53
+ # Launch instance with GPU
54
+ # Recommended: Standard_NC6s_v3 or Standard_NC12s_v3
55
+ ```
56
+
57
+ ### 2. Instance Specifications
58
+
59
+ #### **Minimum Requirements**
60
+ - **GPU**: 16GB+ VRAM (Tesla T4, V100, or A100)
61
+ - **RAM**: 32GB+ system memory
62
+ - **Storage**: 100GB+ SSD
63
+ - **CPU**: 8+ cores
64
+
65
+ #### **Recommended Specifications**
66
+ - **GPU**: A100 (40GB) or H100 (80GB)
67
+ - **RAM**: 64GB+ system memory
68
+ - **Storage**: 200GB+ NVMe SSD
69
+ - **CPU**: 16+ cores
70
+
71
+ ### 3. Environment Setup
72
+
73
+ ```bash
74
+ # Update system
75
+ sudo apt update && sudo apt upgrade -y
76
+
77
+ # Install CUDA (if not pre-installed)
78
+ # Follow NVIDIA CUDA installation guide for your GPU
79
+
80
+ # Install Python dependencies
81
+ sudo apt install python3-pip python3-venv git -y
82
+
83
+ # Create virtual environment
84
+ python3 -m venv smollm3_env
85
+ source smollm3_env/bin/activate
86
+
87
+ # Clone repository
88
+ git clone <your-repo-url>
89
+ cd <your-repo-directory>
90
+
91
+ # Install dependencies
92
+ pip install -r requirements.txt
93
+
94
+ # Install additional dependencies for cloud training
95
+ pip install accelerate transformers datasets huggingface_hub
96
+ ```
97
+
98
+ ## Training Configuration
99
+
100
+ ### 1. Use the OpenHermes-FR Config
101
+
102
+ The repository includes a specialized configuration for the OpenHermes-FR dataset:
103
+
104
+ ```bash
105
+ python train.py config/train_smollm3_openhermes_fr.py \
106
+ --enable_tracking \
107
+ --trackio_url "https://your-space.hf.space" \
108
+ --experiment_name "smollm3_fr_openhermes_v1"
109
+ ```
110
+
111
+ ### 2. Configuration Details
112
+
113
+ The `config/train_smollm3_openhermes_fr.py` includes:
114
+
115
+ #### **Dataset Configuration**
116
+ ```python
117
+ dataset_name: str = "legmlai/openhermes-fr"
118
+ dataset_split: str = "train"
119
+ input_field: str = "prompt"
120
+ target_field: str = "accepted_completion"
121
+ filter_bad_entries: bool = True
122
+ bad_entry_field: str = "bad_entry"
123
+ ```
124
+
125
+ #### **Training Optimization**
126
+ ```python
127
+ batch_size: int = 2 # Reduced for French text (longer sequences)
128
+ gradient_accumulation_steps: int = 8 # Maintains effective batch size
129
+ learning_rate: float = 1e-5 # Lower for instruction tuning
130
+ max_iters: int = 2000 # More iterations for large dataset
131
+ ```
132
+
133
+ #### **Monitoring Integration**
134
+ ```python
135
+ enable_tracking: bool = True
136
+ experiment_name: str = "smollm3_openhermes_fr"
137
+ ```
138
+
139
+ ## Training Commands
140
+
141
+ ### Basic Training
142
+ ```bash
143
+ python train.py config/train_smollm3_openhermes_fr.py
144
+ ```
145
+
146
+ ### Training with Monitoring
147
+ ```bash
148
+ python train.py config/train_smollm3_openhermes_fr.py \
149
+ --enable_tracking \
150
+ --trackio_url "https://your-trackio-space.hf.space" \
151
+ --experiment_name "smollm3_fr_openhermes_v1"
152
+ ```
153
+
154
+ ### Training with Custom Parameters
155
+ ```bash
156
+ python train.py config/train_smollm3_openhermes_fr.py \
157
+ --batch_size 4 \
158
+ --learning_rate 2e-5 \
159
+ --max_iters 3000 \
160
+ --enable_tracking \
161
+ --trackio_url "https://your-trackio-space.hf.space" \
162
+ --experiment_name "smollm3_fr_high_lr"
163
+ ```
164
+
165
+ ### Training with Checkpoint Resume
166
+ ```bash
167
+ python train.py config/train_smollm3_openhermes_fr.py \
168
+ --init_from resume \
169
+ --enable_tracking \
170
+ --trackio_url "https://your-trackio-space.hf.space" \
171
+ --experiment_name "smollm3_fr_resume"
172
+ ```
173
+
174
+ ## Dataset Processing
175
+
176
+ ### Automatic Filtering
177
+
178
+ The training script automatically:
179
+ - ✅ **Loads** the OpenHermes-FR dataset from Hugging Face
180
+ - ✅ **Filters** out bad entries (`bad_entry = true`)
181
+ - ✅ **Splits** data into train/validation/test (98/1/1)
182
+ - ✅ **Formats** prompts and completions for instruction tuning
183
+
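The filtering and splitting steps listed above can be sketched on plain dictionaries as follows. This is an illustration, not the repository's implementation: the shuffle, the seed, and the `filter_and_split` name are assumptions, and only the `bad_entry` field and the 98/1/1 proportions come from this guide.

```python
import random


def filter_and_split(rows, seed=42, val_frac=0.01, test_frac=0.01):
    """Drop rows flagged as bad, then split the rest into train/validation/test."""
    clean = [row for row in rows if not row.get("bad_entry", False)]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(clean)
    n_val = round(len(clean) * val_frac)
    n_test = round(len(clean) * test_frac)
    val = clean[:n_val]
    test = clean[n_val:n_val + n_test]
    train = clean[n_val + n_test:]
    return train, val, test


rows = [
    {"prompt": f"q{i}", "accepted_completion": f"a{i}", "bad_entry": i % 10 == 0}
    for i in range(1000)
]
train, val, test = filter_and_split(rows)
print(len(train), len(val), len(test))
```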
184
+ ### Manual Dataset Inspection
185
+
186
+ ```python
187
+ from datasets import load_dataset
188
+
189
+ # Load dataset
190
+ dataset = load_dataset("legmlai/openhermes-fr")
191
+
192
+ # Check dataset info
193
+ print(f"Dataset size: {len(dataset['train'])}")
194
+ print(f"Sample columns: {dataset['train'].column_names}")
195
+
196
+ # Check filtering
197
+ bad_entries = dataset['train'].filter(lambda x: x['bad_entry'])
198
+ print(f"Bad entries: {len(bad_entries)}")
199
+
200
+ # Sample data
201
+ sample = dataset['train'][0]
202
+ print(f"Prompt: {sample['prompt']}")
203
+ print(f"Completion: {sample['accepted_completion']}")
204
+ ```
205
+
206
+ ## Monitoring and Tracking
207
+
208
+ ### Trackio Integration
209
+
210
+ The training automatically logs:
211
+ - **Training metrics**: Loss, accuracy, learning rate
212
+ - **System metrics**: GPU memory, CPU usage
213
+ - **Dataset info**: Size, filtering statistics
214
+ - **Model checkpoints**: Regular saves with metadata
215
+
216
+ ### View Training Progress
217
+
218
+ 1. **Trackio Space**: Visit your Trackio Space URL
219
+ 2. **Experiment Details**: Check the "View Experiments" tab
220
+ 3. **Metrics**: Monitor loss curves and system usage
221
+ 4. **Logs**: Download training logs for analysis
222
+
223
+ ## Model Deployment
224
+
225
+ ### Push to Hugging Face Hub
226
+
227
+ After training, deploy your model:
228
+
229
+ ```bash
230
+ python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \
231
+ --trackio-url "https://your-trackio-space.hf.space" \
232
+ --experiment-name "smollm3_fr_openhermes_v1"
233
+ ```
234
+
235
+ ### Use Your Model
236
+
237
+ ```python
238
+ from transformers import AutoModelForCausalLM, AutoTokenizer
239
+
240
+ # Load your fine-tuned model
241
+ model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes")
242
+ tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes")
243
+
244
+ # Generate French text
245
+ prompt = "Expliquez le concept de l'intelligence artificielle."
246
+ inputs = tokenizer(prompt, return_tensors="pt")
247
+ outputs = model.generate(**inputs, max_new_tokens=200)
248
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
249
+ ```
250
+
251
+ ## Performance Optimization
252
+
253
+ ### GPU Memory Management
254
+
255
+ ```bash
256
+ # Monitor GPU usage
257
+ nvidia-smi -l 1
258
+
259
+ # Optimize for your GPU
260
+ # For 16GB VRAM: batch_size=2, gradient_accumulation_steps=8
261
+ # For 24GB VRAM: batch_size=4, gradient_accumulation_steps=4
262
+ # For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2
263
+ ```
264
+
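The three VRAM tiers in the comments above all keep the effective batch size at 16. A small lookup makes that explicit; `settings_for_vram` is a hypothetical helper, and the thresholds are the ones listed in this guide.

```python
def settings_for_vram(vram_gb):
    """Pick batch size and accumulation steps for a GPU, keeping the effective batch at 16."""
    if vram_gb >= 40:
        return {"batch_size": 8, "gradient_accumulation_steps": 2}
    if vram_gb >= 24:
        return {"batch_size": 4, "gradient_accumulation_steps": 4}
    if vram_gb >= 16:
        return {"batch_size": 2, "gradient_accumulation_steps": 8}
    raise ValueError("This guide assumes at least 16 GB of VRAM")


for vram in (16, 24, 80):
    cfg = settings_for_vram(vram)
    print(vram, cfg, cfg["batch_size"] * cfg["gradient_accumulation_steps"])
```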
265
+ ### Training Speed
266
+
267
+ ```python
+ # Use mixed precision (enabled by default)
+ fp16: bool = True
+
+ # Gradient checkpointing (enabled by default) lowers memory use at the cost of some speed
+ use_gradient_checkpointing: bool = True
+
+ # Use flash attention (enabled by default)
+ use_flash_attention: bool = True
+ ```
277
+
278
+ ## Troubleshooting
279
+
280
+ ### Common Issues
281
+
282
+ #### 1. **Out of Memory (OOM)**
283
+ ```bash
284
+ # Reduce batch size
285
+ python train.py config/train_smollm3_openhermes_fr.py --batch_size 1
286
+
287
+ # Increase gradient accumulation
288
+ # Edit config: gradient_accumulation_steps = 16
289
+ ```
290
+
291
+ #### 2. **Slow Training**
292
+ ```bash
293
+ # Check GPU utilization
294
+ nvidia-smi
295
+
296
+ # Verify data loading
297
+ # Check if dataset is cached locally
298
+ ```
299
+
300
+ #### 3. **Dataset Loading Issues**
301
+ ```bash
302
+ # Clear cache
303
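+ # Remove only the datasets cache: ~/.cache/huggingface/ also holds
+ # cached models and your saved login token
+ rm -rf ~/.cache/huggingface/datasets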
304
+
305
+ # Check internet connection
306
+ # Verify dataset name: "legmlai/openhermes-fr"
307
+ ```
308
+
309
+ #### 4. **Monitoring Connection Issues**
310
+ ```bash
311
+ # Test Trackio connection
312
+ curl -I https://your-trackio-space.hf.space
313
+
314
+ # Check token permissions
315
+ # Verify experiment name format
316
+ ```
317
+
318
+ ### Debug Mode
319
+
320
+ ```bash
321
+ # Enable debug logging
322
+ export LOG_LEVEL=DEBUG
323
+ python train.py config/train_smollm3_openhermes_fr.py
324
+ ```
325
+
326
+ ## Cost Optimization
327
+
328
+ ### Cloud Provider Tips
329
+
330
+ #### **AWS EC2**
331
+ - Use Spot Instances for cost savings
332
+ - Monitor usage with CloudWatch
333
+ - Use appropriate instance types
334
+
335
+ #### **Google Cloud Platform**
336
+ - Use Preemptible VMs for non-critical training
337
+ - Monitor with Cloud Monitoring
338
+ - Use committed use discounts
339
+
340
+ #### **Azure**
341
+ - Use Spot VMs for cost optimization
342
+ - Monitor with Azure Monitor
343
+ - Use reserved instances for long training
344
+
345
+ ### Training Time Estimates
346
+
347
+ | GPU Type | Batch Size | Estimated Time |
348
+ |----------|------------|----------------|
349
+ | Tesla T4 (16GB) | 2 | 8-12 hours |
350
+ | V100 (32GB) | 4 | 4-6 hours |
351
+ | A100 (40GB) | 8 | 2-3 hours |
352
+ | H100 (80GB) | 16 | 1-2 hours |
353
+
354
+ ## Security Best Practices
355
+
356
+ ### Token Management
357
+ ```bash
358
+ # Use environment variables
359
+ export HF_TOKEN="your_token_here"
360
+ export TRACKIO_TOKEN="your_trackio_token"
361
+
362
+ # Don't hardcode in scripts
363
+ # Use IAM roles when possible
364
+ ```
365
+
366
+ ### Data Privacy
367
+ ```bash
368
+ # Use private repositories for sensitive models
369
+ python push_to_huggingface.py model username/private-model --private
370
+
371
+ # Secure your cloud instance
372
+ # Use VPC and security groups
373
+ ```
374
+
375
+ ## Complete Workflow Example
376
+
377
+ ### 1. Setup Cloud Instance
378
+ ```bash
379
+ # Launch GPU instance
380
+ # Install dependencies
381
+ git clone <your-repo>
382
+ cd <your-repo>
383
+ pip install -r requirements.txt
384
+ ```
385
+
386
+ ### 2. Train Model
387
+ ```bash
388
+ python train.py config/train_smollm3_openhermes_fr.py \
389
+ --enable_tracking \
390
+ --trackio_url "https://your-space.hf.space" \
391
+ --experiment_name "smollm3_fr_v1"
392
+ ```
393
+
394
+ ### 3. Deploy Model
395
+ ```bash
396
+ python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \
397
+ --trackio-url "https://your-space.hf.space" \
398
+ --experiment-name "smollm3_fr_v1"
399
+ ```
400
+
401
+ ### 4. Test Model
402
+ ```python
403
+ from transformers import AutoModelForCausalLM, AutoTokenizer
404
+
405
+ model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1")
406
+ tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1")
407
+
408
+ # Test French generation
409
+ prompt = "Qu'est-ce que l'apprentissage automatique?"
410
+ inputs = tokenizer(prompt, return_tensors="pt")
411
+ outputs = model.generate(**inputs, max_new_tokens=100)
412
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
413
+ ```
414
+
415
+ ## Support and Resources
416
+
417
+ ### Documentation
418
+ - [OpenHermes-FR Dataset](https://huggingface.co/datasets/legmlai/openhermes-fr)
419
+ - [SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
420
+ - [Trackio Monitoring](https://github.com/Josephrp/trackio)
421
+
422
+ ### Community
423
+ - [Hugging Face Forums](https://discuss.huggingface.co/)
424
+ - [Transformers Documentation](https://huggingface.co/docs/transformers/)
425
+
426
+ ### Examples
427
+ - [French Language Models](https://huggingface.co/models?search=french)
428
+ - [Instruction Tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
429
+
430
+ ## Conclusion
431
+
432
+ This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud:
433
+
434
+ - ✅ **Complete Setup** - From cloud instance to model deployment
435
+ - ✅ **Optimized Configuration** - Tailored for French instruction tuning
436
+ - ✅ **Monitoring Integration** - Trackio experiment tracking
437
+ - ✅ **Cost Optimization** - Tips for efficient cloud usage
438
+ - ✅ **Troubleshooting** - Solutions for common issues
439
+
440
+ Start training your French language model today!
DEPLOYMENT_GUIDE.md ADDED
@@ -0,0 +1,397 @@
1
+ # Trackio Deployment Guide for Hugging Face Spaces
2
+
3
+ This guide provides step-by-step instructions for deploying Trackio experiment tracking to Hugging Face Spaces and integrating it with your SmolLM3 fine-tuning pipeline.
4
+
5
+ ## Prerequisites
6
+
7
+ - Hugging Face account
8
+ - Hugging Face CLI installed (`pip install huggingface_hub`)
9
+ - Git configured with your Hugging Face credentials
10
+
11
+ ## Method 1: Automated Deployment (Recommended)
12
+
13
+ ### Step 1: Run the Deployment Script
14
+
15
+ ```bash
16
+ python deploy_trackio_space.py
17
+ ```
18
+
19
+ The script will prompt you for:
20
+ - Your Hugging Face username
21
+ - Space name (e.g., `trackio-monitoring`)
22
+ - Hugging Face token (must have write access)
23
+
24
+ ### Step 2: Wait for Build
25
+
26
+ After deployment, wait 2-5 minutes for the Space to build and become available.
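Instead of refreshing the page by hand, the wait can be scripted. The sketch below polls a URL until it responds or a time limit passes. It assumes the Space answers with HTTP 200 once the build has finished; `space_is_up` and `wait_for_space` are hypothetical helpers, not part of `deploy_trackio_space.py`.

```python
import time
import urllib.request
from typing import Callable


def space_is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200 (assumed to mean 'built')."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def wait_for_space(url: str,
                   probe: Callable[[str], bool] = space_is_up,
                   max_wait_s: float = 300.0,
                   interval_s: float = 15.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the probe succeeds or max_wait_s seconds have been waited."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Example call (placeholder URL, not executed here):
# wait_for_space("https://your-space.hf.space")
```

The `probe` and `sleep` parameters are injectable so the loop can be exercised without network access.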
27
+
28
+ ### Step 3: Test the Interface
29
+
30
+ Visit your Space URL to test the interface:
31
+ ```
32
+ https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
33
+ ```
34
+
35
+ ## Method 2: Manual Deployment
36
+
37
+ ### Step 1: Create a New Space
38
+
39
+ 1. Go to https://huggingface.co/spaces
40
+ 2. Click "Create new Space"
41
+ 3. Configure the Space:
42
+ - **Owner**: Your username
43
+ - **Space name**: `trackio-monitoring` (or your preferred name)
44
+ - **SDK**: Gradio
45
+ - **Hardware**: CPU (Basic)
46
+ - **License**: MIT
47
+
48
+ ### Step 2: Upload Files
49
+
50
+ Upload these files to your Space:
51
+
52
+ #### `app.py`
53
+ The main Gradio interface (already created in this repository)
54
+
55
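+ #### `requirements_space.txt`
+ Upload this file to the Space as `requirements.txt`, the file name Spaces installs dependencies from.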
56
+ ```
57
+ gradio>=4.0.0
58
+ gradio-client>=0.10.0
59
+ requests>=2.31.0
60
+ numpy>=1.24.0
61
+ pandas>=2.0.0
62
+ jsonschema>=4.17.0
63
+ plotly>=5.15.0
64
+ matplotlib>=3.7.0
65
+ python-dotenv>=1.0.0
66
+ ```
67
+
68
+ #### `README.md`
69
+ ````markdown
+ # Trackio Experiment Tracking
+
+ A Gradio interface for experiment tracking and monitoring.
+
+ ## Features
+
+ - Create and manage experiments
+ - Log training metrics and parameters
+ - View experiment details and results
+ - Update experiment status
+
+ ## Usage
+
+ 1. Create a new experiment using the "Create Experiment" tab
+ 2. Log metrics during training using the "Log Metrics" tab
+ 3. View experiment details using the "View Experiments" tab
+ 4. Update experiment status using the "Update Status" tab
+
+ ## Integration
+
+ To connect your training script to this Trackio Space:
+
+ ```python
+ from monitoring import SmolLM3Monitor
+
+ monitor = SmolLM3Monitor(
+     experiment_name="my_experiment",
+     trackio_url="https://your-space.hf.space",
+     enable_tracking=True
+ )
+ ```
+ ````
101
+
102
+ ### Step 3: Configure Space Settings
103
+
104
+ In your Space settings, ensure:
105
+ - **App file**: `app.py`
106
+ - **Python version**: 3.9 or higher
107
+ - **Hardware**: CPU (Basic) is sufficient
108
+
109
+ ## Integration with Your Training Script
110
+
111
+ ### Step 1: Update Your Configuration
112
+
113
+ Add Trackio settings to your training configuration:
114
+
115
+ ```python
116
+ # config/train_smollm3.py
+ from dataclasses import dataclass
+ from typing import Optional
+
+ @dataclass
+ class SmolLM3Config:
+     # ... existing settings ...
+
+     # Trackio monitoring configuration
+     enable_tracking: bool = True
+     trackio_url: Optional[str] = None  # Your Space URL
+     trackio_token: Optional[str] = None
+     log_artifacts: bool = True
+     log_metrics: bool = True
+     log_config: bool = True
+     experiment_name: Optional[str] = None
129
+ ```
130
+
131
+ ### Step 2: Run Training with Trackio
132
+
133
+ ```bash
134
+ python train.py config/train_smollm3.py \
135
+ --dataset_dir my_dataset \
136
+ --enable_tracking \
137
+ --trackio_url "https://your-username-trackio-monitoring.hf.space" \
138
+ --experiment_name "smollm3_finetune_v1"
139
+ ```
140
+
141
+ ### Step 3: Monitor Your Experiments
142
+
143
+ 1. **Create Experiment**: Use the "Create Experiment" tab in your Space
144
+ 2. **Log Metrics**: Your training script will automatically log metrics
145
+ 3. **View Results**: Use the "View Experiments" tab to see progress
146
+ 4. **Update Status**: Mark experiments as completed when done
147
+
148
+ ## Advanced Configuration
149
+
150
+ ### Environment Variables
151
+
152
+ You can set Trackio configuration via environment variables:
153
+
154
+ ```bash
155
+ export TRACKIO_URL="https://your-space.hf.space"
156
+ export TRACKIO_TOKEN="your_token_here"
157
+ ```
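A training script could pick these variables up as a fallback when no value is passed explicitly. The sketch below assumes that explicit arguments take precedence over the environment; that precedence, and the helper name `resolve_trackio_settings`, are assumptions rather than documented behavior of `train.py`.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_trackio_settings(
    url: Optional[str] = None,
    token: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Optional[str]]:
    """Prefer explicit values, then fall back to TRACKIO_URL / TRACKIO_TOKEN."""
    resolved_url = url or env.get("TRACKIO_URL")
    resolved_token = token or env.get("TRACKIO_TOKEN")
    return resolved_url, resolved_token


# Example with an explicit mapping instead of the real environment
example_env = {"TRACKIO_URL": "https://your-space.hf.space"}
print(resolve_trackio_settings(env=example_env))
```

Passing `env` explicitly keeps the function easy to test and avoids reading secrets from the real environment in examples.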
158
+
159
+ ### Custom Experiment Names
160
+
161
+ ```bash
162
+ python train.py config/train_smollm3.py \
163
+ --experiment_name "smollm3_high_lr_experiment" \
164
+ --trackio_url "https://your-space.hf.space"
165
+ ```
166
+
167
+ ### Multiple Experiments
168
+
169
+ You can run multiple experiments and track them separately:
170
+
171
+ ```bash
172
+ # Experiment 1
173
+ python train.py config/train_smollm3.py \
174
+ --experiment_name "smollm3_baseline" \
175
+ --learning_rate 2e-5
176
+
177
+ # Experiment 2
178
+ python train.py config/train_smollm3.py \
179
+ --experiment_name "smollm3_high_lr" \
180
+ --learning_rate 5e-5
181
+ ```
182
+
183
+ ## Using the Trackio Interface
184
+
185
+ ### Creating Experiments
186
+
187
+ 1. Go to the "Create Experiment" tab
188
+ 2. Enter experiment name (e.g., "smollm3_finetune_v1")
189
+ 3. Add description (optional)
190
+ 4. Click "Create Experiment"
191
+ 5. Note the experiment ID for logging metrics
192
+
193
+ ### Logging Metrics
194
+
195
+ 1. Go to the "Log Metrics" tab
196
+ 2. Enter your experiment ID
197
+ 3. Add metrics in JSON format:
198
+ ```json
199
+ {
200
+ "loss": 0.5,
201
+ "accuracy": 0.85,
202
+ "learning_rate": 2e-5
203
+ }
204
+ ```
205
+ 4. Add step number (optional)
206
+ 5. Click "Log Metrics"
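Rather than typing the JSON for step 3 by hand, the payload can be generated from a Python dict. This is a minimal sketch; `metrics_to_json` is a hypothetical helper and not part of Trackio. It rejects NaN and infinite values, which the standard `json` module would otherwise write out as non-standard JSON.

```python
import json
from typing import Dict


def metrics_to_json(metrics: Dict[str, float]) -> str:
    """Serialize metrics for pasting into the "Log Metrics" tab.

    allow_nan=False makes json.dumps raise ValueError for NaN/inf values
    (for example a diverged loss) instead of emitting invalid JSON.
    """
    return json.dumps(metrics, indent=2, allow_nan=False)


payload = metrics_to_json({"loss": 0.5, "accuracy": 0.85, "learning_rate": 2e-5})
print(payload)
```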
207
+
208
+ ### Viewing Experiments
209
+
210
+ 1. Go to the "View Experiments" tab
211
+ 2. Enter experiment ID to view specific experiment
212
+ 3. Or click "List All Experiments" to see all experiments
213
+
214
+ ### Updating Status
215
+
216
+ 1. Go to the "Update Status" tab
217
+ 2. Enter experiment ID
218
+ 3. Select new status (running, completed, failed, paused)
219
+ 4. Click "Update Status"
220
+
221
+ ## Troubleshooting
222
+
223
+ ### Common Issues
224
+
225
+ #### 1. Space Not Building
226
+ - Check that all required files are uploaded
227
+ - Verify `app.py` is the main file
228
+ - Check the Space logs for errors
229
+
230
+ #### 2. Connection Errors
231
+ - Verify your Space URL is correct
232
+ - Check that the Space is running (not paused)
233
+ - Ensure your training script can reach the Space URL
234
+
235
+ #### 3. Missing Metrics
236
+ - Check that `enable_tracking=True` in your config
237
+ - Verify the Trackio URL is correct
238
+ - Check training logs for monitoring errors
239
+
240
+ #### 4. Authentication Issues
241
+ - If using tokens, verify they're correct
242
+ - Check Hugging Face account permissions
243
+ - Ensure Space is public or you have access
244
+
245
+ ### Debug Mode
246
+
247
+ Enable debug logging in your training script:
248
+
249
+ ```python
250
+ import logging
251
+ logging.basicConfig(level=logging.DEBUG)
252
+ ```
253
+
254
+ ### Manual Testing
255
+
256
+ Test the Trackio interface manually:
257
+
258
+ 1. Create an experiment
259
+ 2. Log some test metrics
260
+ 3. View the experiment details
261
+ 4. Update the status
262
+
263
+ ## Security Considerations
264
+
265
+ ### Public vs Private Spaces
266
+
267
+ - **Public Spaces**: Anyone can view and use the interface
268
+ - **Private Spaces**: Only you and collaborators can access
269
+
270
+ ### Token Management
271
+
272
+ - Store tokens securely (environment variables)
273
+ - Don't commit tokens to version control
274
+ - Use Hugging Face's token management
275
+
276
+ ### Data Privacy
277
+
278
+ - Trackio stores experiment data in the Space
279
+ - Consider data retention policies
280
+ - Be mindful of sensitive information in experiment names
281
+
282
+ ## Performance Optimization
283
+
284
+ ### Space Configuration
285
+
286
+ - Use CPU (Basic) for the interface (sufficient for tracking)
287
+ - Consider GPU only for actual training
288
+ - Monitor Space usage and limits
289
+
290
+ ### Efficient Logging
291
+
292
+ - Log metrics at reasonable intervals (every 10-100 steps)
293
+ - Avoid logging too frequently to prevent rate limiting
294
+ - Use batch logging when possible
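One way to combine interval logging with batching is a small buffer in front of whatever function actually sends data to the Space. The sketch below is illustrative only: `BufferedMetricLogger` and its `send` callback are hypothetical, and the callback stands in for a real call such as a wrapper around the monitor's metric logging.

```python
from typing import Callable, Dict, List, Tuple

MetricBatch = List[Tuple[int, Dict[str, float]]]


class BufferedMetricLogger:
    """Keep every `log_every`-th step and send them in batches of `flush_every`."""

    def __init__(self, send: Callable[[MetricBatch], None],
                 log_every: int = 50, flush_every: int = 5) -> None:
        self.send = send
        self.log_every = log_every
        self.flush_every = flush_every
        self._buffer: MetricBatch = []

    def log(self, metrics: Dict[str, float], step: int) -> None:
        if step % self.log_every != 0:
            return  # skip steps between logging intervals
        self._buffer.append((step, dict(metrics)))
        if len(self._buffer) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call once more at the end of training."""
        if self._buffer:
            self.send(list(self._buffer))
            self._buffer.clear()


# Usage: collect batches in a list instead of sending them over the network
sent: List[MetricBatch] = []
logger = BufferedMetricLogger(send=sent.append, log_every=50, flush_every=5)
for step in range(1, 501):
    logger.log({"loss": 1.0 / step}, step)
logger.flush()
print(len(sent), "batches sent")
```

With these settings, 500 training steps produce 10 logged points sent in 2 requests instead of 500.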
295
+
296
+ ## Monitoring Best Practices
297
+
298
+ ### Experiment Naming
299
+
300
+ Use descriptive names:
301
+ - `smollm3_baseline_v1`
302
+ - `smollm3_high_lr_experiment`
303
+ - `smollm3_dpo_training`
304
+
305
+ ### Metric Logging
306
+
307
+ Log relevant metrics:
308
+ - Training loss
309
+ - Validation loss
310
+ - Learning rate
311
+ - GPU memory usage
312
+ - Training time
313
+
314
+ ### Status Management
315
+
316
+ - Mark experiments as "running" when starting
317
+ - Update to "completed" when finished
318
+ - Mark as "failed" if errors occur
319
+ - Use "paused" for temporary stops
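The four status values can be kept consistent in a training script with a small enum. This is a sketch under the assumption that the Space accepts exactly the lowercase strings listed in this guide; `ExperimentStatus`, `parse_status`, and `final_status` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class ExperimentStatus(str, Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PAUSED = "paused"


def parse_status(value: str) -> ExperimentStatus:
    """Normalize user input; raises ValueError for unknown statuses."""
    return ExperimentStatus(value.strip().lower())


def final_status(error: Optional[BaseException]) -> ExperimentStatus:
    """Status to report when a run ends, with or without an exception."""
    return ExperimentStatus.FAILED if error is not None else ExperimentStatus.COMPLETED


print(parse_status(" Completed ").value)
```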
320
+
321
+ ## Integration Examples
322
+
323
+ ### Basic Integration
324
+
325
+ ```python
326
+ from monitoring import SmolLM3Monitor
327
+
328
+ # Initialize monitor
329
+ monitor = SmolLM3Monitor(
+     experiment_name="my_experiment",
+     trackio_url="https://your-space.hf.space",
+     enable_tracking=True
+ )
334
+
335
+ # Log configuration
336
+ monitor.log_config(config_dict)
337
+
338
+ # Log metrics during training
339
+ monitor.log_metrics({"loss": 0.5}, step=100)
340
+
341
+ # Log final results
342
+ monitor.log_training_summary(final_results)
343
+ ```
344
+
345
+ ### Advanced Integration
346
+
347
+ ```python
348
+ # Custom monitoring setup
349
+ monitor = SmolLM3Monitor(
+     experiment_name="smollm3_advanced",
+     trackio_url="https://your-space.hf.space",
+     enable_tracking=True,
+     log_artifacts=True,
+     log_metrics=True,
+     log_config=True
+ )
357
+
358
+ # Log system metrics
359
+ monitor.log_system_metrics(step=current_step)
360
+
361
+ # Log model checkpoint
362
+ monitor.log_model_checkpoint("checkpoint-1000", step=1000)
363
+
364
+ # Log evaluation results
365
+ monitor.log_evaluation_results(eval_results, step=1000)
366
+ ```
367
+
368
+ ## Support and Resources
369
+
370
+ ### Documentation
371
+
372
+ - [Hugging Face Spaces Documentation](https://huggingface.co/docs/hub/spaces)
373
+ - [Gradio Documentation](https://gradio.app/docs/)
374
+ - [Trackio GitHub Repository](https://github.com/Josephrp/trackio)
375
+
376
+ ### Community
377
+
378
+ - [Hugging Face Forums](https://discuss.huggingface.co/)
379
+ - [Gradio Discord](https://discord.gg/feTf9z3Z)
380
+
381
+ ### Issues and Feedback
382
+
383
+ - Report issues on the project repository
384
+ - Provide feedback on the Trackio interface
385
+ - Suggest improvements for the monitoring system
386
+
387
+ ## Conclusion
388
+
389
+ You now have a complete Trackio monitoring system deployed on Hugging Face Spaces! This setup provides:
390
+
391
+ - ✅ Easy experiment tracking and monitoring
392
+ - ✅ Real-time metric logging
393
+ - ✅ Web-based interface for experiment management
394
+ - ✅ Integration with your SmolLM3 fine-tuning pipeline
395
+ - ✅ Scalable and accessible monitoring solution
396
+
397
+ Start tracking your experiments and gain insights into your model training process!
PUSH_GUIDE.md ADDED
@@ -0,0 +1,406 @@
1
+ # Push to Hugging Face Hub Guide
2
+
3
+ This guide explains how to use the `push_to_huggingface.py` script to upload your trained SmolLM3 models and results to Hugging Face Hub.
4
+
5
+ ## Features
6
+
7
+ - ✅ **Automatic Repository Creation** - Creates HF repositories automatically
8
+ - ✅ **Model Validation** - Validates required model files before upload
9
+ - ✅ **Comprehensive Model Cards** - Generates detailed model documentation
10
+ - ✅ **Training Results Upload** - Uploads logs, configs, and results
11
+ - ✅ **Trackio Integration** - Logs push actions to your monitoring system
12
+ - ✅ **Private/Public Repositories** - Support for both private and public models
13
+
14
+ ## Prerequisites
15
+
16
+ ### 1. Install Dependencies
17
+
18
+ ```bash
19
+ pip install huggingface_hub
20
+ ```
21
+
22
+ ### 2. Set Up Hugging Face Token
23
+
24
+ ```bash
25
+ # Option 1: Environment variable
26
+ export HF_TOKEN="your_huggingface_token_here"
27
+
28
+ # Option 2: Use --token argument
29
+ python push_to_huggingface.py model_path repo_name --token "your_token"
30
+ ```
31
+
32
+ ### 3. Get Your Hugging Face Token
33
+
34
+ 1. Go to https://huggingface.co/settings/tokens
35
+ 2. Click "New token"
36
+ 3. Give it a name (e.g., "model-upload")
37
+ 4. Select "Write" permissions
38
+ 5. Copy the token
39
+
40
+ ## Basic Usage
41
+
42
+ ### Simple Model Push
43
+
44
+ ```bash
45
+ python push_to_huggingface.py /path/to/model username/model-name
46
+ ```
47
+
48
+ ### Push with Custom Token
49
+
50
+ ```bash
51
+ python push_to_huggingface.py /path/to/model username/model-name \
52
+ --token "hf_your_token_here"
53
+ ```
54
+
55
+ ### Push Private Model
56
+
57
+ ```bash
58
+ python push_to_huggingface.py /path/to/model username/model-name \
59
+ --private
60
+ ```
61
+
62
+ ### Push with Trackio Integration
63
+
64
+ ```bash
65
+ python push_to_huggingface.py /path/to/model username/model-name \
66
+ --trackio-url "https://your-space.hf.space" \
67
+ --experiment-name "my_experiment"
68
+ ```
69
+
70
+ ## Complete Workflow Example
71
+
72
+ ### 1. Train Your Model
73
+
74
+ ```bash
75
+ python train.py config/train_smollm3.py \
76
+ --dataset_dir my_dataset \
77
+ --enable_tracking \
78
+ --trackio_url "https://your-space.hf.space" \
79
+ --experiment_name "smollm3_finetune_v1"
80
+ ```
81
+
82
+ ### 2. Push to Hugging Face Hub
83
+
84
+ ```bash
85
+ python push_to_huggingface.py /output-checkpoint username/smollm3-finetuned \
86
+ --trackio-url "https://your-space.hf.space" \
87
+ --experiment-name "smollm3_finetune_v1"
88
+ ```
89
+
90
+ ### 3. Use Your Model
91
+
92
+ ```python
93
+ from transformers import AutoModelForCausalLM, AutoTokenizer
94
+
95
+ # Load your uploaded model
96
+ model = AutoModelForCausalLM.from_pretrained("username/smollm3-finetuned")
97
+ tokenizer = AutoTokenizer.from_pretrained("username/smollm3-finetuned")
98
+
99
+ # Generate text
100
+ inputs = tokenizer("Hello, how are you?", return_tensors="pt")
101
+ outputs = model.generate(**inputs, max_new_tokens=100)
102
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
103
+ ```
104
+
105
+ ## Repository Structure
106
+
107
+ After pushing, your repository will contain:
108
+
109
+ ```
110
+ username/model-name/
111
+ ├── README.md # Auto-generated model card
112
+ ├── config.json # Model configuration
113
+ ├── pytorch_model.bin # Model weights
114
+ ├── tokenizer.json # Tokenizer configuration
115
+ ├── tokenizer_config.json # Tokenizer settings
116
+ ├── special_tokens_map.json # Special tokens
117
+ ├── training_results/ # Training artifacts
118
+ │ ├── train_results.json
119
+ │ ├── eval_results.json
120
+ │ ├── training_config.json
121
+ │ └── training.log
122
+ └── .gitattributes # Git attributes
123
+ ```
124
+
125
+ ## Model Card Features
126
+
127
+ The script automatically generates comprehensive model cards including:
128
+
129
+ - **Model Details**: Base model, fine-tuning method, size
130
+ - **Training Configuration**: All training parameters
131
+ - **Training Results**: Loss, accuracy, steps, time
132
+ - **Usage Examples**: Code snippets for loading and using
133
+ - **Performance Metrics**: Training and validation metrics
134
+ - **Hardware Information**: GPU/CPU used for training
135
+
136
+ ## Advanced Usage
137
+
138
+ ### Custom Repository Names
139
+
140
+ ```bash
141
+ # Public repository
142
+ python push_to_huggingface.py /model myusername/smollm3-chatbot
143
+
144
+ # Private repository
145
+ python push_to_huggingface.py /model myusername/smollm3-private --private
146
+ ```
147
+
148
+ ### Integration with Training Pipeline
149
+
150
+ ```bash
151
+ #!/bin/bash
152
+ # Complete training and push workflow
153
+
154
+ # 1. Train the model
155
+ python train.py config/train_smollm3.py \
156
+ --dataset_dir my_dataset \
157
+ --enable_tracking \
158
+ --trackio_url "https://your-space.hf.space" \
159
+ --experiment_name "smollm3_v1"
160
+
161
+ # 2. Push to Hugging Face Hub
162
+ python push_to_huggingface.py /output-checkpoint myusername/smollm3-v1 \
163
+ --trackio-url "https://your-space.hf.space" \
164
+ --experiment-name "smollm3_v1"
165
+
166
+ # 3. Test the model
167
+ python -c "
168
+ from transformers import AutoModelForCausalLM, AutoTokenizer
169
+ model = AutoModelForCausalLM.from_pretrained('myusername/smollm3-v1')
170
+ tokenizer = AutoTokenizer.from_pretrained('myusername/smollm3-v1')
171
+ print('Model loaded successfully!')
172
+ "
173
+ ```
174
+
175
+ ### Batch Processing Multiple Models
176
+
177
+ ```bash
178
+ #!/bin/bash
179
+ # Push multiple models
180
+
181
+ models=(
182
+ "smollm3-baseline"
183
+ "smollm3-high-lr"
184
+ "smollm3-dpo"
185
+ )
186
+
187
+ for model in "${models[@]}"; do
188
+ echo "Pushing $model..."
189
+ python push_to_huggingface.py "/models/$model" "username/$model"
190
+ done
191
+ ```
192
+
193
+ ## Error Handling
194
+
195
+ ### Common Issues and Solutions
196
+
197
+ #### 1. Missing Model Files
198
+
199
+ **Error**: `❌ Missing required files: ['config.json', 'pytorch_model.bin']`
200
+
201
+ **Solution**: Ensure your model directory contains all required files:
202
+ - `config.json`
203
+ - `pytorch_model.bin`
204
+ - `tokenizer.json`
205
+ - `tokenizer_config.json`
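The same check can be run locally before pushing. The sketch below is not the validation code from `push_to_huggingface.py`; it only assumes the four file names listed above, and `missing_model_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import List, Sequence

# File names taken from the list above
REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_model_files(model_dir: str,
                        required: Sequence[str] = REQUIRED_FILES) -> List[str]:
    """Return the required files that are absent from model_dir, in order."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


# Example call (placeholder path, not executed here):
# print(missing_model_files("/path/to/model"))
```

An empty list means the directory passes this basic check.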
206
+
207
+ #### 2. Authentication Issues
208
+
209
+ **Error**: `❌ Failed to create repository: 401 Client Error`
210
+
211
+ **Solution**:
212
+ - Check your HF token is valid
213
+ - Ensure token has write permissions
214
+ - Verify username in repository name matches your account
215
+
216
+ #### 3. Repository Already Exists
217
+
218
+ **Error**: `Repository already exists`
219
+
220
+ **Solution**: The script handles this automatically with `exist_ok=True`, but you can:
221
+ - Use a different repository name
222
+ - Delete the existing repository first
223
+ - Use version numbers: `username/model-v2`
224
+
225
+ #### 4. Large File Upload Issues
226
+
227
+ **Error**: `Upload failed for large files`
228
+
229
+ **Solution**:
230
+ - Check your internet connection
231
+ - Use Git LFS for large files
232
+ - Consider splitting large models
233
+
234
+ ## Trackio Integration
235
+
236
+ ### Logging Push Actions
237
+
238
+ When using Trackio integration, the script logs:
239
+
240
+ - **Push Action**: Repository creation and file uploads
241
+ - **Model Metadata**: Size, configuration, results
242
+ - **Repository Info**: Name, privacy settings, URL
243
+ - **Training Results**: Loss, accuracy, steps
244
+
245
+ ### Viewing Push Logs
246
+
247
+ 1. Go to your Trackio Space
248
+ 2. Navigate to the "View Experiments" tab
249
+ 3. Find your experiment
250
+ 4. Check the metrics for push-related actions
251
+
252
+ ## Security Best Practices
253
+
254
+ ### Token Management
255
+
256
+ ```bash
257
+ # Use environment variables (recommended)
258
+ export HF_TOKEN="your_token_here"
259
+ python push_to_huggingface.py model repo
260
+
261
+ # Don't hardcode tokens in scripts
262
+ # ❌ Bad: python push_to_huggingface.py model repo --token "hf_xxx"
263
+ ```
264
+
265
+ ### Private Models
266
+
267
+ ```bash
268
+ # For sensitive models, use private repositories
269
+ python push_to_huggingface.py model username/private-model --private
270
+ ```
271
+
272
+ ### Repository Naming
273
+
274
+ ```bash
275
+ # Use descriptive names
276
+ python push_to_huggingface.py model username/smollm3-chatbot-v1
277
+
278
+ # Include version numbers
279
+ python push_to_huggingface.py model username/smollm3-v2.0
280
+ ```
281
+
282
+ ## Performance Optimization
283
+
284
+ ### Large Models
285
+
286
+ For models > 5GB:
287
+
288
+ ```bash
289
+ # Use Git LFS for large files
290
+ git lfs install
291
+ git lfs track "*.bin"
292
+
293
+ # Consider saving the model as sharded checkpoint files, then push as usual
294
+ python push_to_huggingface.py model username/model-large --private
295
+ ```
296
+
297
+ ### Upload Speed
298
+
299
+ ```bash
300
+ # Use stable internet connection
301
+ # Consider uploading during off-peak hours
302
+ # Repository visibility (public or private) does not affect upload speed
303
+ ```
304
+
305
+ ## Troubleshooting
306
+
307
+ ### Debug Mode
308
+
309
+ ```bash
310
+ # Enable debug logging
311
+ export LOG_LEVEL=DEBUG
312
+ python push_to_huggingface.py model repo
313
+ ```
314
+
315
+ ### Validate Model Files
316
+
317
+ ```bash
318
+ # Check model structure before pushing
319
+ ls -la /path/to/model/
320
+ # Should contain: config.json, pytorch_model.bin, tokenizer.json, etc.
321
+ ```
322
+
323
+ ### Test Repository Access
324
+
325
+ ```bash
326
+ # Test your HF token
327
+ python -c "
328
+ from huggingface_hub import HfApi
329
+ api = HfApi(token='your_token')
330
+ user = api.whoami()  # raises an error if the token is invalid
+ print('Token is valid for user:', user['name'])
331
+ "
332
+ ```
333
+
334
+ ## Integration Examples
335
+
336
+ ### With CI/CD Pipeline
337
+
338
+ ```yaml
339
+ # .github/workflows/train-and-push.yml
+ name: Train and Push Model
+
+ on:
+   push:
+     branches: [main]
+
+ jobs:
+   train-and-push:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Train Model
+         run: |
+           python train.py config/train_smollm3.py
+
+       - name: Push to HF Hub
+         run: |
+           python push_to_huggingface.py /output username/model-${{ github.run_number }}
+         env:
+           HF_TOKEN: ${{ secrets.HF_TOKEN }}
361
+ ```
362
+
363
+ ### With Docker
364
+
365
+ ```dockerfile
366
+ # Dockerfile
367
+ FROM python:3.9
368
+
369
+ WORKDIR /app
370
+ COPY requirements.txt .
371
+ RUN pip install -r requirements.txt
372
+
373
+ COPY . .
374
+
375
+ CMD ["python", "push_to_huggingface.py", "/model", "username/model"]
376
+ ```
377
+
378
+ ## Support and Resources
379
+
380
+ ### Documentation
381
+
382
+ - [Hugging Face Hub Documentation](https://huggingface.co/docs/hub/index)
383
+ - [Transformers Documentation](https://huggingface.co/docs/transformers/index)
384
+ - [Model Cards Guide](https://huggingface.co/docs/hub/model-cards)
385
+
386
+ ### Community
387
+
388
+ - [Hugging Face Forums](https://discuss.huggingface.co/)
389
+ - [GitHub Issues](https://github.com/huggingface/huggingface_hub/issues)
390
+
391
+ ### Examples
392
+
393
+ - [Model Repository Examples](https://huggingface.co/models?search=smollm3)
394
+ - [Fine-tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
395
+
396
+ ## Conclusion
397
+
398
+ The `push_to_huggingface.py` script provides a complete solution for:
399
+
400
+ - ✅ **Easy Model Deployment** - One command to push models
401
+ - ✅ **Professional Documentation** - Auto-generated model cards
402
+ - ✅ **Training Artifacts** - Complete experiment tracking
403
+ - ✅ **Integration Ready** - Works with CI/CD and monitoring
404
+ - ✅ **Security Focused** - Proper token and privacy management
405
+
406
+ Start sharing your fine-tuned SmolLM3 models with the community!
README.md CHANGED
@@ -288,4 +288,17 @@ python -m llama_cpp.convert_model ./output-checkpoint --outfile model.gguf
288
 
289
  ## License
290
 
291
- This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.
288
 
289
  ## License
290
 
291
+ This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.
292
+
293
+
294
+ {
295
+ "id": "exp_20250718_195852",
296
+ "name": "petit-elle-l-aime-3",
297
+ "description": "SmolLM3 fine-tuning experiment",
298
+ "created_at": "2025-07-18T19:58:52.689087",
299
+ "status": "running",
300
+ "metrics": [],
301
+ "parameters": {},
302
+ "artifacts": [],
303
+ "logs": []
304
+ }
TRACKIO_INTEGRATION.md ADDED
@@ -0,0 +1,252 @@
1
+ # Trackio Integration for SmolLM3 Fine-tuning
2
+
3
+ This document provides comprehensive information about the Trackio experiment tracking and monitoring integration for your SmolLM3 fine-tuning pipeline.
4
+
5
+ ## Features
6
+
7
+ - **SmolLM3 Fine-tuning**: Support for supervised fine-tuning and DPO training
8
+ - **Trackio Integration**: Complete experiment tracking and monitoring
9
+ - **Hugging Face Spaces Deployment**: Easy deployment of Trackio monitoring interface
10
+ - **Comprehensive Logging**: Metrics, parameters, artifacts, and system monitoring
11
+ - **Flexible Configuration**: Support for various training configurations
12
+
13
+ ## Quick Start
14
+
15
+ ### 1. Install Dependencies
16
+
17
+ ```bash
18
+ pip install -r requirements.txt
19
+ ```
20
+
21
+ ### 2. Basic Training with Trackio
22
+
23
+ ```bash
24
+ python train.py config/train_smollm3.py \
25
+ --dataset_dir my_dataset \
26
+ --enable_tracking \
27
+ --trackio_url "https://your-trackio-instance.com" \
28
+ --experiment_name "smollm3_finetune_v1"
29
+ ```
30
+
31
+ ### 3. Training with Custom Parameters
32
+
33
+ ```bash
34
+ python train.py config/train_smollm3.py \
35
+ --dataset_dir my_dataset \
36
+ --batch_size 8 \
37
+ --learning_rate 1e-5 \
38
+ --max_iters 2000 \
39
+ --enable_tracking \
40
+ --trackio_url "https://your-trackio-instance.com" \
41
+ --experiment_name "smollm3_lr1e-5_experiment"
42
+ ```
43
+
44
+ ## Trackio Integration
45
+
46
+ ### Configuration
47
+
48
+ Add Trackio settings to your configuration:
49
+
50
+ ```python
51
+ # In your config file
52
+ config = SmolLM3Config(
53
+ # ... other settings ...
54
+
55
+ # Trackio monitoring configuration
56
+ enable_tracking=True,
57
+ trackio_url="https://your-trackio-instance.com",
58
+ trackio_token="your_token_here", # Optional
59
+ log_artifacts=True,
60
+ log_metrics=True,
61
+ log_config=True,
62
+ experiment_name="my_experiment"
63
+ )
64
+ ```
65
+
66
+ ### Environment Variables
67
+
68
+ You can also set Trackio configuration via environment variables:
69
+
70
+ ```bash
71
+ export TRACKIO_URL="https://your-trackio-instance.com"
72
+ export TRACKIO_TOKEN="your_token_here"
73
+ ```
74
+
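The document does not show how the training code combines the config fields with these environment variables. The following is a minimal sketch of one plausible precedence rule, assuming explicit values win and the environment is only a fallback; `resolve_trackio_settings` is a hypothetical helper, not a function from this repository.

```python
import os
from typing import Optional, Tuple


def resolve_trackio_settings(
    trackio_url: Optional[str] = None,
    trackio_token: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (url, token), preferring explicit arguments over the environment.

    Assumption: explicit config values take precedence over TRACKIO_URL and
    TRACKIO_TOKEN. The real pipeline may resolve them differently.
    """
    url = trackio_url or os.environ.get("TRACKIO_URL")
    token = trackio_token or os.environ.get("TRACKIO_TOKEN")
    return url, token


if __name__ == "__main__":
    url, token = resolve_trackio_settings()
    # Avoid printing the token itself; only report whether one was found.
    print(f"Trackio URL: {url!r}, token set: {token is not None}")
```

Keeping the token in the environment rather than in a committed config file avoids leaking it through version control.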
75
+ ### What Gets Tracked
76
+
77
+ - **Configuration**: All training parameters and model settings
78
+ - **Metrics**: Loss, accuracy, learning rate, and custom metrics
79
+ - **System Metrics**: GPU memory, CPU usage, training time
80
+ - **Artifacts**: Model checkpoints, evaluation results
81
+ - **Training Summary**: Final results and experiment duration
82
+
83
+ ## Hugging Face Spaces Deployment
84
+
85
+ ### Deploy Trackio Monitoring Interface
86
+
87
+ 1. **Create a new Space** on Hugging Face:
88
+ - Go to https://huggingface.co/spaces
89
+ - Click "Create new Space"
90
+ - Choose "Gradio" as the SDK
91
+ - Set visibility (Public or Private)
92
+
93
+ 2. **Upload the deployment files**:
94
+ - `app.py` - The Gradio interface
95
+ - `requirements_space.txt` - Dependencies
96
+ - `README.md` - Documentation
97
+
98
+ 3. **Configure the Space**:
99
+ - The Space will automatically install dependencies
100
+ - The Gradio interface will be available at your Space URL
101
+
102
+ ### Using the Trackio Space
103
+
104
+ 1. **Create Experiments**: Use the "Create Experiment" tab to start new experiments
105
+ 2. **Log Metrics**: Use the "Log Metrics" tab to track training progress
106
+ 3. **View Results**: Use the "View Experiments" tab to see experiment details
107
+ 4. **Update Status**: Use the "Update Status" tab to mark experiments as completed
108
+
109
+ ### Integration with Your Training
110
+
111
+ To connect your training script to the Trackio Space:
112
+
113
+ ```python
114
+ # In your training script
115
+ from monitoring import SmolLM3Monitor
116
+
117
+ # Initialize monitor
118
+ monitor = SmolLM3Monitor(
119
+ experiment_name="my_experiment",
120
+ trackio_url="https://your-space.hf.space", # Your Space URL
121
+ enable_tracking=True
122
+ )
123
+
124
+ # Log configuration
125
+ monitor.log_config(config_dict)
126
+
127
+ # Log metrics during training
128
+ monitor.log_metrics({"loss": 0.5, "accuracy": 0.85}, step=100)
129
+
130
+ # Log final results
131
+ monitor.log_training_summary(final_results)
132
+ ```
133
+
134
+ ## Configuration Files
135
+
136
+ ### Main Configuration (`config/train_smollm3.py`)
137
+
138
+ ```python
139
+ @dataclass
140
+ class SmolLM3Config:
141
+ # Model configuration
142
+ model_name: str = "HuggingFaceTB/SmolLM3-3B"
143
+ max_seq_length: int = 4096
144
+
145
+ # Training configuration
146
+ batch_size: int = 4
147
+ learning_rate: float = 2e-5
148
+ max_iters: int = 1000
149
+
150
+ # Trackio monitoring
151
+ enable_tracking: bool = True
152
+ trackio_url: Optional[str] = None
153
+ trackio_token: Optional[str] = None
154
+ experiment_name: Optional[str] = None
155
+ ```
156
+
157
+ ### DPO Configuration (`config/train_smollm3_dpo.py`)
158
+
159
+ ```python
160
+ @dataclass
161
+ class SmolLM3DPOConfig(SmolLM3Config):
162
+ # DPO-specific settings
163
+ beta: float = 0.1
164
+ max_prompt_length: int = 2048
165
+
166
+ # Trackio monitoring (inherited)
167
+ enable_tracking: bool = True
168
+ trackio_url: Optional[str] = None
169
+ ```
170
+
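The config excerpt above sets `beta` but does not show how it is used. As a rough illustration, the standard sigmoid form of the DPO objective scales the difference between the policy's and the reference model's log-probability margins by `beta`. The sketch below computes that loss for a single preference pair in plain Python; the function and argument names are illustrative, and the trainer used by this project may compute the loss differently.

```python
import math


def _softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_sigmoid_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is how much more the
    policy prefers the chosen response than the reference model does.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(x)) equals softplus(-x).
    return _softplus(-beta * margin)


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(dpo_sigmoid_loss(-1.0, -3.0, -2.0, -2.0, beta=0.1))
```

A larger `beta` penalizes the policy more strongly for drifting away from the reference model's preferences, which is why small values such as 0.1 are a common starting point.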
171
+ ## Monitoring Features
172
+
173
+ ### Real-time Metrics
174
+
175
+ - Training loss and evaluation metrics
176
+ - Learning rate scheduling
177
+ - GPU memory and utilization
178
+ - Training time and progress
179
+
180
+ ### Artifact Tracking
181
+
182
+ - Model checkpoints at regular intervals
183
+ - Evaluation results and plots
184
+ - Configuration snapshots
185
+ - Training logs and summaries
186
+
187
+ ### Experiment Management
188
+
189
+ - Experiment naming and organization
190
+ - Status tracking (running, completed, failed)
191
+ - Parameter comparison across experiments
192
+ - Result visualization
193
+
194
+ ## Advanced Usage
195
+
196
+ ### Custom Metrics
197
+
198
+ ```python
199
+ # Log custom metrics
200
+ monitor.log_metrics({
201
+ "custom_metric": value,
202
+ "perplexity": perplexity_score,
203
+ "bleu_score": bleu_score
204
+ }, step=current_step)
205
+ ```
206
+
207
+ ### System Monitoring
208
+
209
+ ```python
210
+ # Log system metrics
211
+ monitor.log_system_metrics(step=current_step)
212
+ ```
213
+
214
+ ### Artifact Logging
215
+
216
+ ```python
217
+ # Log model checkpoint
218
+ monitor.log_model_checkpoint("checkpoint-1000", step=1000)
219
+
220
+ # Log evaluation results
221
+ monitor.log_evaluation_results(eval_results, step=1000)
222
+ ```
223
+
224
+ ## Troubleshooting
225
+
226
+ ### Common Issues
227
+
228
+ 1. **Trackio not available**: Install with `pip install trackio`
229
+ 2. **Connection errors**: Check your Trackio URL and token
230
+ 3. **Missing metrics**: Ensure monitoring is enabled in configuration
231
+ 4. **Space deployment issues**: Check Gradio version compatibility
232
+
233
+ ### Debug Mode
234
+
235
+ Enable debug logging:
236
+
237
+ ```python
238
+ import logging
239
+ logging.basicConfig(level=logging.DEBUG)
240
+ ```
241
+
242
+ ## Contributing
243
+
244
+ 1. Fork the repository
245
+ 2. Create a feature branch
246
+ 3. Make your changes
247
+ 4. Add tests if applicable
248
+ 5. Submit a pull request
249
+
250
+ ## License
251
+
252
+ This project is licensed under the MIT License - see the LICENSE file for details.
app.py ADDED
@@ -0,0 +1,318 @@
1
+ """
2
+ Trackio Deployment on Hugging Face Spaces
3
+ A Gradio interface for experiment tracking and monitoring
4
+ """
5
+
6
+ import gradio as gr
7
+ import os
8
+ import json
9
+ import logging
10
+ from datetime import datetime
11
+ from typing import Dict, Any, Optional
12
+ import requests
13
+
14
+ # Setup logging
15
+ logging.basicConfig(level=logging.INFO)
16
+ logger = logging.getLogger(__name__)
17
+
18
+ class TrackioSpace:
19
+ """Trackio deployment for Hugging Face Spaces"""
20
+
21
+ def __init__(self):
22
+ self.experiments = {}
23
+ self.current_experiment = None
24
+
25
+ def create_experiment(self, name: str, description: str = "") -> Dict[str, Any]:
26
+ """Create a new experiment"""
27
+ experiment_id = f"exp_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
28
+
29
+ experiment = {
30
+ 'id': experiment_id,
31
+ 'name': name,
32
+ 'description': description,
33
+ 'created_at': datetime.now().isoformat(),
34
+ 'status': 'running',
35
+ 'metrics': [],
36
+ 'parameters': {},
37
+ 'artifacts': [],
38
+ 'logs': []
39
+ }
40
+
41
+ self.experiments[experiment_id] = experiment
42
+ self.current_experiment = experiment_id
43
+
44
+ logger.info(f"Created experiment: {experiment_id} - {name}")
45
+ return experiment
46
+
47
+ def log_metrics(self, experiment_id: str, metrics: Dict[str, Any], step: Optional[int] = None):
48
+ """Log metrics for an experiment"""
49
+ if experiment_id not in self.experiments:
50
+ raise ValueError(f"Experiment {experiment_id} not found")
51
+
52
+ metric_entry = {
53
+ 'timestamp': datetime.now().isoformat(),
54
+ 'step': step,
55
+ 'metrics': metrics
56
+ }
57
+
58
+ self.experiments[experiment_id]['metrics'].append(metric_entry)
59
+ logger.info(f"Logged metrics for experiment {experiment_id}: {metrics}")
60
+
61
+ def log_parameters(self, experiment_id: str, parameters: Dict[str, Any]):
62
+ """Log parameters for an experiment"""
63
+ if experiment_id not in self.experiments:
64
+ raise ValueError(f"Experiment {experiment_id} not found")
65
+
66
+ self.experiments[experiment_id]['parameters'].update(parameters)
67
+ logger.info(f"Logged parameters for experiment {experiment_id}: {parameters}")
68
+
69
+ def log_artifact(self, experiment_id: str, artifact_name: str, artifact_data: str):
70
+ """Log an artifact for an experiment"""
71
+ if experiment_id not in self.experiments:
72
+ raise ValueError(f"Experiment {experiment_id} not found")
73
+
74
+ artifact_entry = {
75
+ 'name': artifact_name,
76
+ 'timestamp': datetime.now().isoformat(),
77
+ 'data': artifact_data
78
+ }
79
+
80
+ self.experiments[experiment_id]['artifacts'].append(artifact_entry)
81
+ logger.info(f"Logged artifact for experiment {experiment_id}: {artifact_name}")
82
+
83
+ def get_experiment(self, experiment_id: str) -> Optional[Dict[str, Any]]:
84
+ """Get experiment details"""
85
+ return self.experiments.get(experiment_id)
86
+
87
+ def list_experiments(self) -> Dict[str, Any]:
88
+ """List all experiments"""
89
+ return {
90
+ 'experiments': list(self.experiments.keys()),
91
+ 'current_experiment': self.current_experiment,
92
+ 'total_experiments': len(self.experiments)
93
+ }
94
+
95
+ def update_experiment_status(self, experiment_id: str, status: str):
96
+ """Update experiment status"""
97
+ if experiment_id in self.experiments:
98
+ self.experiments[experiment_id]['status'] = status
99
+ logger.info(f"Updated experiment {experiment_id} status to {status}")
100
+
101
+ # Initialize Trackio space
102
+ trackio_space = TrackioSpace()
103
+
104
+ def create_experiment_interface(name: str, description: str) -> str:
105
+ """Create a new experiment"""
106
+ try:
107
+ experiment = trackio_space.create_experiment(name, description)
108
+ return f"✅ Experiment created successfully!\nID: {experiment['id']}\nName: {experiment['name']}"
109
+ except Exception as e:
110
+ return f"❌ Error creating experiment: {str(e)}"
111
+
112
+ def log_metrics_interface(experiment_id: str, metrics_json: str, step: str) -> str:
113
+ """Log metrics for an experiment"""
114
+ try:
115
+ metrics = json.loads(metrics_json)
116
+ step_int = int(step) if step else None
117
+ trackio_space.log_metrics(experiment_id, metrics, step_int)
118
+ return f"✅ Metrics logged successfully for experiment {experiment_id}"
119
+ except Exception as e:
120
+ return f"❌ Error logging metrics: {str(e)}"
121
+
122
+ def log_parameters_interface(experiment_id: str, parameters_json: str) -> str:
123
+ """Log parameters for an experiment"""
124
+ try:
125
+ parameters = json.loads(parameters_json)
126
+ trackio_space.log_parameters(experiment_id, parameters)
127
+ return f"✅ Parameters logged successfully for experiment {experiment_id}"
128
+ except Exception as e:
129
+ return f"❌ Error logging parameters: {str(e)}"
130
+
131
+ def get_experiment_details(experiment_id: str) -> str:
132
+ """Get experiment details"""
133
+ try:
134
+ experiment = trackio_space.get_experiment(experiment_id)
135
+ if experiment:
136
+ return json.dumps(experiment, indent=2)
137
+ else:
138
+ return f"❌ Experiment {experiment_id} not found"
139
+ except Exception as e:
140
+ return f"❌ Error getting experiment details: {str(e)}"
141
+
142
+ def list_experiments_interface() -> str:
143
+ """List all experiments"""
144
+ try:
145
+ experiments_info = trackio_space.list_experiments()
146
+ return json.dumps(experiments_info, indent=2)
147
+ except Exception as e:
148
+ return f"❌ Error listing experiments: {str(e)}"
149
+
150
+ def update_experiment_status_interface(experiment_id: str, status: str) -> str:
151
+ """Update experiment status"""
152
+ try:
153
+ trackio_space.update_experiment_status(experiment_id, status)
154
+ return f"✅ Experiment {experiment_id} status updated to {status}"
155
+ except Exception as e:
156
+ return f"❌ Error updating experiment status: {str(e)}"
157
+
158
+ # Create Gradio interface
159
+ with gr.Blocks(title="Trackio - Experiment Tracking", theme=gr.themes.Soft()) as demo:
160
+ gr.Markdown("# 🚀 Trackio Experiment Tracking")
161
+ gr.Markdown("Monitor and track your ML experiments with ease!")
162
+
163
+ with gr.Tabs():
164
+ # Create Experiment Tab
165
+ with gr.Tab("Create Experiment"):
166
+ gr.Markdown("### Create a New Experiment")
167
+ with gr.Row():
168
+ with gr.Column():
169
+ experiment_name = gr.Textbox(
170
+ label="Experiment Name",
171
+ placeholder="my_smollm3_finetune",
172
+ value="smollm3_finetune"
173
+ )
174
+ experiment_description = gr.Textbox(
175
+ label="Description",
176
+ placeholder="Fine-tuning SmolLM3 model on custom dataset",
177
+ value="SmolLM3 fine-tuning experiment"
178
+ )
179
+ create_btn = gr.Button("Create Experiment", variant="primary")
180
+
181
+ with gr.Column():
182
+ create_output = gr.Textbox(
183
+ label="Result",
184
+ lines=5,
185
+ interactive=False
186
+ )
187
+
188
+ create_btn.click(
189
+ create_experiment_interface,
190
+ inputs=[experiment_name, experiment_description],
191
+ outputs=create_output
192
+ )
193
+
194
+ # Log Metrics Tab
195
+ with gr.Tab("Log Metrics"):
196
+ gr.Markdown("### Log Training Metrics")
197
+ with gr.Row():
198
+ with gr.Column():
199
+ metrics_exp_id = gr.Textbox(
200
+ label="Experiment ID",
201
+ placeholder="exp_20231201_143022"
202
+ )
203
+ metrics_json = gr.Textbox(
204
+ label="Metrics (JSON)",
205
+ placeholder='{"loss": 0.5, "accuracy": 0.85}',
206
+ value='{"loss": 0.5, "accuracy": 0.85}'
207
+ )
208
+ metrics_step = gr.Textbox(
209
+ label="Step (optional)",
210
+ placeholder="100"
211
+ )
212
+ log_metrics_btn = gr.Button("Log Metrics", variant="primary")
213
+
214
+ with gr.Column():
215
+ metrics_output = gr.Textbox(
216
+ label="Result",
217
+ lines=3,
218
+ interactive=False
219
+ )
220
+
221
+ log_metrics_btn.click(
222
+ log_metrics_interface,
223
+ inputs=[metrics_exp_id, metrics_json, metrics_step],
224
+ outputs=metrics_output
225
+ )
226
+
227
+ # Log Parameters Tab
228
+ with gr.Tab("Log Parameters"):
229
+ gr.Markdown("### Log Experiment Parameters")
230
+ with gr.Row():
231
+ with gr.Column():
232
+ params_exp_id = gr.Textbox(
233
+ label="Experiment ID",
234
+ placeholder="exp_20231201_143022"
235
+ )
236
+ parameters_json = gr.Textbox(
237
+ label="Parameters (JSON)",
238
+ placeholder='{"learning_rate": 2e-5, "batch_size": 4}',
239
+ value='{"learning_rate": 2e-5, "batch_size": 4, "model_name": "HuggingFaceTB/SmolLM3-3B"}'
240
+ )
241
+ log_params_btn = gr.Button("Log Parameters", variant="primary")
242
+
243
+ with gr.Column():
244
+ params_output = gr.Textbox(
245
+ label="Result",
246
+ lines=3,
247
+ interactive=False
248
+ )
249
+
250
+ log_params_btn.click(
251
+ log_parameters_interface,
252
+ inputs=[params_exp_id, parameters_json],
253
+ outputs=params_output
254
+ )
255
+
256
+ # View Experiments Tab
257
+ with gr.Tab("View Experiments"):
258
+ gr.Markdown("### View Experiment Details")
259
+ with gr.Row():
260
+ with gr.Column():
261
+ view_exp_id = gr.Textbox(
262
+ label="Experiment ID",
263
+ placeholder="exp_20231201_143022"
264
+ )
265
+ view_btn = gr.Button("View Experiment", variant="primary")
266
+ list_btn = gr.Button("List All Experiments", variant="secondary")
267
+
268
+ with gr.Column():
269
+ view_output = gr.Textbox(
270
+ label="Experiment Details",
271
+ lines=15,
272
+ interactive=False
273
+ )
274
+
275
+ view_btn.click(
276
+ get_experiment_details,
277
+ inputs=[view_exp_id],
278
+ outputs=view_output
279
+ )
280
+
281
+ list_btn.click(
282
+ list_experiments_interface,
283
+ inputs=[],
284
+ outputs=view_output
285
+ )
286
+
287
+ # Update Status Tab
288
+ with gr.Tab("Update Status"):
289
+ gr.Markdown("### Update Experiment Status")
290
+ with gr.Row():
291
+ with gr.Column():
292
+ status_exp_id = gr.Textbox(
293
+ label="Experiment ID",
294
+ placeholder="exp_20231201_143022"
295
+ )
296
+ status_dropdown = gr.Dropdown(
297
+ label="Status",
298
+ choices=["running", "completed", "failed", "paused"],
299
+ value="running"
300
+ )
301
+ update_status_btn = gr.Button("Update Status", variant="primary")
302
+
303
+ with gr.Column():
304
+ status_output = gr.Textbox(
305
+ label="Result",
306
+ lines=3,
307
+ interactive=False
308
+ )
309
+
310
+ update_status_btn.click(
311
+ update_experiment_status_interface,
312
+ inputs=[status_exp_id, status_dropdown],
313
+ outputs=status_output
314
+ )
315
+
316
+ # Launch the app
317
+ if __name__ == "__main__":
318
+ demo.launch()
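The tracking logic in `app.py` can be exercised without Gradio. The sketch below is a hypothetical stand-in (`MiniTracker`) that mirrors the experiment dictionary layout used by `TrackioSpace` and keeps everything in a plain dict, so data lives only as long as the process. Unlike the class above, this variant raises `ValueError` when a status update targets an unknown experiment ID; that stricter behavior is a design choice of the sketch, not something the original file does.

```python
from datetime import datetime
from typing import Any, Dict, Optional


class MiniTracker:
    """Hypothetical in-memory tracker mirroring TrackioSpace's data layout."""

    def __init__(self) -> None:
        self.experiments: Dict[str, Dict[str, Any]] = {}

    def create_experiment(self, name: str, description: str = "") -> Dict[str, Any]:
        now = datetime.now()
        # Second-resolution IDs, as in app.py; two calls within the same
        # second would produce the same ID.
        experiment_id = f"exp_{now.strftime('%Y%m%d_%H%M%S')}"
        experiment = {
            "id": experiment_id,
            "name": name,
            "description": description,
            "created_at": now.isoformat(),
            "status": "running",
            "metrics": [],
        }
        self.experiments[experiment_id] = experiment
        return experiment

    def _require(self, experiment_id: str) -> Dict[str, Any]:
        if experiment_id not in self.experiments:
            raise ValueError(f"Experiment {experiment_id} not found")
        return self.experiments[experiment_id]

    def log_metrics(
        self, experiment_id: str, metrics: Dict[str, Any], step: Optional[int] = None
    ) -> None:
        entry = {
            "timestamp": datetime.now().isoformat(),
            "step": step,
            "metrics": metrics,
        }
        self._require(experiment_id)["metrics"].append(entry)

    def update_status(self, experiment_id: str, status: str) -> None:
        # Raises instead of silently ignoring unknown IDs.
        self._require(experiment_id)["status"] = status


tracker = MiniTracker()
exp = tracker.create_experiment("smollm3_finetune")
tracker.log_metrics(exp["id"], {"loss": 0.5, "accuracy": 0.85}, step=100)
tracker.update_status(exp["id"], "completed")
```

Raising on unknown IDs lets a caller report a failed update instead of a success message for an experiment that does not exist.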
cloud_deployment.sh ADDED
@@ -0,0 +1,279 @@
1
+ #!/bin/bash
2
+ # Cloud Deployment Script for SmolLM3 DPO Training
3
+ # This script sets up a cloud instance for training and uploading to Hugging Face
4
+
5
+ set -e # Exit on any error
6
+
7
+ echo "🚀 Starting SmolLM3 DPO Cloud Deployment"
8
+ echo "=========================================="
9
+
10
+ # Configuration
11
+ MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
12
+ DATASET_NAME="HuggingFaceTB/smoltalk"
13
+ EXPERIMENT_NAME="smollm3_dpo_6epochs"
14
+ REPO_NAME="your-username/smollm3-dpo-6epochs" # Change this to your username
15
+ TRACKIO_URL="https://your-trackio-space.hf.space" # Change this to your Trackio Space URL
16
+ HF_TOKEN="your_hf_token_here" # Change this to your HF token
17
+
18
+ # Training Configuration
19
+ BATCH_SIZE=2
20
+ GRADIENT_ACCUMULATION_STEPS=8
21
+ LEARNING_RATE=5e-6
22
+ MAX_EPOCHS=6
23
+ MAX_SEQ_LENGTH=4096
24
+ SAVE_STEPS=500
25
+ EVAL_STEPS=100
26
+ LOGGING_STEPS=10
27
+
28
+ echo "📋 Configuration:"
29
+ echo " Model: $MODEL_NAME"
30
+ echo " Dataset: $DATASET_NAME"
31
+ echo " Experiment: $EXPERIMENT_NAME"
32
+ echo " Repository: $REPO_NAME"
33
+ echo " Epochs: $MAX_EPOCHS"
34
+ echo " Batch Size: $BATCH_SIZE"
35
+ echo " Learning Rate: $LEARNING_RATE"
36
+
37
+ # Step 1: Update system and install dependencies
38
+ echo ""
39
+ echo "🔧 Step 1: Installing system dependencies..."
40
+ sudo apt-get update
41
+ sudo apt-get install -y git curl wget unzip
42
+
43
+ # Step 2: Install Python and pip
44
+ echo ""
45
+ echo "🐍 Step 2: Installing Python dependencies..."
46
+ sudo apt-get install -y python3 python3-pip python3-venv
47
+
48
+ # Step 3: Create virtual environment
49
+ echo ""
50
+ echo "📦 Step 3: Setting up Python virtual environment..."
51
+ python3 -m venv smollm3_env
52
+ source smollm3_env/bin/activate
53
+
54
+ # Step 4: Install PyTorch and CUDA
55
+ echo ""
56
+ echo "🔥 Step 4: Installing PyTorch with CUDA support..."
57
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
58
+
59
+ # Step 5: Install project dependencies
60
+ echo ""
61
+ echo "📚 Step 5: Installing project dependencies..."
62
+ pip install -r requirements.txt
63
+
64
+ # Step 6: Install additional dependencies for DPO
65
+ echo ""
66
+ echo "🎯 Step 6: Installing DPO-specific dependencies..."
67
+ pip install "trl>=0.7.0"
68
+ pip install "peft>=0.4.0"
69
+ pip install "accelerate>=0.20.0"
70
+
71
+ # Step 7: Set up Hugging Face token
72
+ echo ""
73
+ echo "🔑 Step 7: Setting up Hugging Face authentication..."
74
+ export HF_TOKEN="$HF_TOKEN"
75
+ huggingface-cli login --token "$HF_TOKEN"
76
+
77
+ # Step 8: Create DPO configuration
78
+ echo ""
79
+ echo "⚙️ Step 8: Creating DPO configuration..."
80
+ cat > config/train_smollm3_dpo_6epochs.py << EOF
81
+ """
82
+ SmolLM3 DPO Training Configuration - 6 Epochs
83
+ Optimized for cloud deployment
84
+ """
85
+
86
+ from config.train_smollm3_dpo import SmolLM3DPOConfig
87
+
88
+ config = SmolLM3DPOConfig(
89
+ # Model configuration
90
+ model_name="$MODEL_NAME",
91
+ max_seq_length=$MAX_SEQ_LENGTH,
92
+ use_flash_attention=True,
93
+ use_gradient_checkpointing=True,
94
+
95
+ # Training configuration
96
+ batch_size=$BATCH_SIZE,
97
+ gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS,
98
+ learning_rate=$LEARNING_RATE,
99
+ weight_decay=0.01,
100
+ warmup_steps=100,
101
+ max_iters=None, # Will be calculated based on epochs
102
+ eval_interval=100,
103
+ log_interval=10,
104
+ save_interval=500,
105
+
106
+ # DPO configuration
107
+ beta=0.1,
108
+ max_prompt_length=$((MAX_SEQ_LENGTH / 2)),
109
+
110
+ # Optimizer configuration
111
+ optimizer="adamw",
112
+ beta1=0.9,
113
+ beta2=0.95,
114
+ eps=1e-8,
115
+
116
+ # Scheduler configuration
117
+ scheduler="cosine",
118
+ min_lr=1e-6,
119
+
120
+ # Mixed precision
121
+ fp16=True,
122
+ bf16=False,
123
+
124
+ # Logging and saving
125
+ save_steps=$SAVE_STEPS,
126
+ eval_steps=$EVAL_STEPS,
127
+ logging_steps=$LOGGING_STEPS,
128
+ save_total_limit=3,
129
+
130
+ # Evaluation
131
+ eval_strategy="steps",
132
+ metric_for_best_model="eval_loss",
133
+ greater_is_better=False,
134
+ load_best_model_at_end=True,
135
+
136
+ # Data configuration
137
+ data_dir="smoltalk_dataset",
138
+ train_file="train.json",
139
+ validation_file="validation.json",
140
+
141
+ # Chat template configuration
142
+ use_chat_template=True,
143
+ chat_template_kwargs={
144
+ "enable_thinking": False,
145
+ "add_generation_prompt": True
146
+ },
147
+
148
+ # Trackio monitoring configuration
149
+ enable_tracking=True,
150
+ trackio_url="$TRACKIO_URL",
151
+ trackio_token=None,
152
+ log_artifacts=True,
153
+ log_metrics=True,
154
+ log_config=True,
155
+ experiment_name="$EXPERIMENT_NAME"
156
+ )
157
+ EOF
158
+
159
+ # Step 9: Download and prepare dataset
160
+ echo ""
161
+ echo "📊 Step 9: Downloading and preparing dataset..."
162
+ python -c "
163
+ from datasets import load_dataset
164
+ import json
165
+ import os
166
+
167
+ # Load SmolTalk dataset
168
+ print('Loading SmolTalk dataset...')
169
+ dataset = load_dataset('$DATASET_NAME')
170
+
171
+ # Create dataset directory
172
+ os.makedirs('smoltalk_dataset', exist_ok=True)
173
+
174
+ # Convert to DPO format (preference pairs)
175
+ def convert_to_dpo_format(example):
176
+ # For SmolTalk, we'll create preference pairs based on response quality
177
+ # This is a simplified example - you may need to adjust based on your needs
178
+ return {
179
+ 'prompt': example.get('prompt', ''),
180
+ 'chosen': example.get('chosen', ''),
181
+ 'rejected': example.get('rejected', '')
182
+ }
183
+
184
+ # Process train split
185
+ train_data = []
186
+ for example in dataset['train']:
187
+ dpo_example = convert_to_dpo_format(example)
188
+ if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
189
+ train_data.append(dpo_example)
190
+
191
+ # Process validation split
192
+ val_data = []
193
+ for example in dataset['validation']:
194
+ dpo_example = convert_to_dpo_format(example)
195
+ if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
196
+ val_data.append(dpo_example)
197
+
198
+ # Save to files
199
+ with open('smoltalk_dataset/train.json', 'w') as f:
200
+ json.dump(train_data, f, indent=2)
201
+
202
+ with open('smoltalk_dataset/validation.json', 'w') as f:
203
+ json.dump(val_data, f, indent=2)
204
+
205
+ print(f'Dataset prepared: {len(train_data)} train samples, {len(val_data)} validation samples')
206
+ "
207
+
208
+ # Step 10: Calculate training steps based on epochs
209
+ echo ""
210
+ echo "📈 Step 10: Calculating training parameters..."
211
+ TOTAL_SAMPLES=$(python -c "import json; data=json.load(open('smoltalk_dataset/train.json')); print(len(data))")
212
+ EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
213
+ STEPS_PER_EPOCH=$((TOTAL_SAMPLES / EFFECTIVE_BATCH_SIZE))
214
+ MAX_STEPS=$((STEPS_PER_EPOCH * MAX_EPOCHS))
215
+
216
+ echo " Total samples: $TOTAL_SAMPLES"
217
+ echo " Effective batch size: $EFFECTIVE_BATCH_SIZE"
218
+ echo " Steps per epoch: $STEPS_PER_EPOCH"
219
+ echo " Total training steps: $MAX_STEPS"
220
+
221
+ # Step 11: Start DPO training
222
+ echo ""
223
+ echo "🎯 Step 11: Starting DPO training..."
224
+ python train.py config/train_smollm3_dpo_6epochs.py \
225
+ --dataset_dir smoltalk_dataset \
226
+ --out_dir /output-checkpoint \
227
+ --init_from scratch \
228
+ --max_iters $MAX_STEPS \
229
+ --batch_size $BATCH_SIZE \
230
+ --learning_rate $LEARNING_RATE \
231
+ --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
232
+ --max_seq_length $MAX_SEQ_LENGTH \
233
+ --save_steps $SAVE_STEPS \
234
+ --eval_steps $EVAL_STEPS \
235
+ --logging_steps $LOGGING_STEPS \
236
+ --enable_tracking \
237
+ --trackio_url "$TRACKIO_URL" \
238
+ --experiment_name "$EXPERIMENT_NAME"
239
+
240
+ # Step 12: Push model to Hugging Face Hub
241
+ echo ""
242
+ echo "📤 Step 12: Pushing model to Hugging Face Hub..."
243
+ python push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
244
+ --token "$HF_TOKEN" \
245
+ --trackio-url "$TRACKIO_URL" \
246
+ --experiment-name "$EXPERIMENT_NAME"
247
+
248
+ # Step 13: Test the uploaded model
249
+ echo ""
250
+ echo "🧪 Step 13: Testing uploaded model..."
251
+ python -c "
252
+ from transformers import AutoModelForCausalLM, AutoTokenizer
253
+ import torch
254
+
255
+ print('Loading uploaded model...')
256
+ model = AutoModelForCausalLM.from_pretrained('$REPO_NAME', torch_dtype=torch.float16, device_map='auto')
257
+ tokenizer = AutoTokenizer.from_pretrained('$REPO_NAME')
258
+
259
+ print('Testing model generation...')
260
+ prompt = 'Hello, how are you?'
261
+ inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
262
+ outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
263
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
264
+ print(f'Prompt: {prompt}')
265
+ print(f'Response: {response}')
266
+ print('✅ Model test completed successfully!')
267
+ "
268
+
269
+ echo ""
270
+ echo "🎉 Deployment completed successfully!"
271
+ echo "====================================="
272
+ echo "📊 Model: https://huggingface.co/$REPO_NAME"
273
+ echo "📈 Trackio: $TRACKIO_URL"
274
+ echo "📋 Experiment: $EXPERIMENT_NAME"
275
+ echo ""
276
+ echo "Next steps:"
277
+ echo "1. Monitor training progress in your Trackio Space"
278
+ echo "2. Check the model repository on Hugging Face Hub"
279
+ echo "3. Use the model in your applications"
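Step 10 of the script derives the total number of optimizer steps from the dataset size, the effective batch size, and the epoch count. The same arithmetic can be sketched in Python as follows. The function name and the `drop_last` flag are illustrative; bash integer division floors, which corresponds to `drop_last=True` here.

```python
import math
from typing import Tuple


def training_steps(
    total_samples: int,
    batch_size: int,
    grad_accum_steps: int,
    epochs: int,
    drop_last: bool = True,
) -> Tuple[int, int, int]:
    """Return (effective_batch_size, steps_per_epoch, total_steps)."""
    effective_batch = batch_size * grad_accum_steps
    if drop_last:
        # Matches the bash script: a partial final batch is not counted.
        steps_per_epoch = total_samples // effective_batch
    else:
        # Count the partial final batch as one extra step.
        steps_per_epoch = math.ceil(total_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


if __name__ == "__main__":
    # Hypothetical 10,000-sample dataset with the script's settings
    # (batch size 2, 8 accumulation steps, 6 epochs).
    print(training_steps(10_000, 2, 8, 6))  # (16, 625, 3750)
```

If the filtered dataset ends up smaller than one effective batch, the floored result is zero steps, so it is worth checking the computed value before launching training.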
config/__init__.py ADDED
@@ -0,0 +1,19 @@
1
+ """
2
+ Configuration package for SmolLM3 training
3
+ """
4
+
5
+ from .train_smollm3 import SmolLM3Config, get_config as get_base_config
6
+ from .train_smollm3_openhermes_fr import SmolLM3ConfigOpenHermesFR, get_config as get_openhermes_fr_config
7
+ from .train_smollm3_openhermes_fr_a100_large import SmolLM3ConfigOpenHermesFRA100Large, get_config as get_a100_large_config
8
+ from .train_smollm3_openhermes_fr_a100_multiple_passes import SmolLM3ConfigOpenHermesFRMultiplePasses, get_config as get_multiple_passes_config
9
+
10
+ __all__ = [
11
+ 'SmolLM3Config',
12
+ 'SmolLM3ConfigOpenHermesFR',
13
+ 'SmolLM3ConfigOpenHermesFRA100Large',
14
+ 'SmolLM3ConfigOpenHermesFRMultiplePasses',
15
+ 'get_base_config',
16
+ 'get_openhermes_fr_config',
17
+ 'get_a100_large_config',
18
+ 'get_multiple_passes_config',
19
+ ]
config/runpod_config.py ADDED
@@ -0,0 +1,47 @@
1
+ """
2
+ RunPod Optimized Configuration for SmolLM3 Fine-tuning
3
+ Optimized for cloud GPU training on RunPod
4
+ """
5
+
6
+ from config.train_smollm3 import SmolLM3Config
7
+
8
+ config = SmolLM3Config(
9
+ # Model configuration
10
+ model_name="HuggingFaceTB/SmolLM3-3B",
11
+ max_seq_length=4096,
12
+ use_flash_attention=True,
13
+ use_gradient_checkpointing=True,
14
+
15
+ # Training configuration - optimized for cloud GPUs
16
+ batch_size=2, # Conservative for cloud stability
17
+ gradient_accumulation_steps=8, # Effective batch size = 16
18
+ learning_rate=2e-5,
19
+ weight_decay=0.01,
20
+ warmup_steps=100,
21
+ max_iters=1500,
22
+
23
+ # Mixed precision for efficiency
24
+ fp16=True,
25
+ bf16=False,
26
+
27
+ # Logging and saving - more frequent for cloud
28
+ save_steps=200,
29
+ eval_steps=100,
30
+ logging_steps=10,
31
+ save_total_limit=5, # Keep more checkpoints
32
+
33
+ # Cloud-specific optimizations
34
+ ddp_backend="nccl",
35
+ ddp_find_unused_parameters=False,
36
+
37
+ # Data loading optimizations
38
+ dataloader_num_workers=4,
39
+ dataloader_pin_memory=True,
40
+
41
+ # Chat template configuration
42
+ use_chat_template=True,
43
+ chat_template_kwargs={
44
+ "enable_thinking": False,
45
+ "add_generation_prompt": True
46
+ }
47
+ )
config/train_smollm3.py CHANGED
@@ -68,6 +68,15 @@ class SmolLM3Config:
68
  use_chat_template: bool = True
69
  chat_template_kwargs: dict = None
70
 
 
 
 
 
 
 
 
 
 
71
  def __post_init__(self):
72
  if self.chat_template_kwargs is None:
73
  self.chat_template_kwargs = {
 
68
  use_chat_template: bool = True
69
  chat_template_kwargs: dict = None
70
 
71
+ # Trackio monitoring configuration
72
+ enable_tracking: bool = True
73
+ trackio_url: Optional[str] = None
74
+ trackio_token: Optional[str] = None
75
+ log_artifacts: bool = True
76
+ log_metrics: bool = True
77
+ log_config: bool = True
78
+ experiment_name: Optional[str] = None
79
+
80
  def __post_init__(self):
81
  if self.chat_template_kwargs is None:
82
  self.chat_template_kwargs = {
config/train_smollm3_dpo.py CHANGED
@@ -1,38 +1,95 @@
1
  """
2
  SmolLM3 DPO Training Configuration
3
- Optimized for Direct Preference Optimization
4
  """
5
 
 
 
 
6
  from config.train_smollm3 import SmolLM3Config
7
 
8
- config = SmolLM3Config(
9
- # Model configuration
10
- model_name="HuggingFaceTB/SmolLM3-3B-Instruct", # Start from instruction-tuned model
11
- max_seq_length=4096,
12
- use_flash_attention=True,
13
- use_gradient_checkpointing=True,
14
 
15
- # Training configuration
16
- batch_size=2, # Smaller batch size for DPO
17
- gradient_accumulation_steps=4,
18
- learning_rate=5e-6, # Very low learning rate for DPO
19
- weight_decay=0.01,
20
- warmup_steps=100,
21
- max_iters=1000,
22
 
23
- # Mixed precision
24
- fp16=True,
25
- bf16=False,
 
26
 
27
- # Logging and saving
28
- save_steps=200,
29
- eval_steps=100,
30
- logging_steps=20,
31
 
32
- # Chat template configuration
33
- use_chat_template=True,
34
- chat_template_kwargs={
35
- "enable_thinking": False, # Disable reasoning for preference learning
36
- "add_generation_prompt": True
37
- }
38
- )
 
1
  """
2
  SmolLM3 DPO Training Configuration
3
+ Based on nanoGPT structure but adapted for SmolLM3 DPO training
4
  """
5
 
6
+ import os
7
+ from dataclasses import dataclass
8
+ from typing import Optional
9
  from config.train_smollm3 import SmolLM3Config
10
 
11
+ @dataclass
12
+ class SmolLM3DPOConfig(SmolLM3Config):
13
+ """Configuration for SmolLM3 DPO fine-tuning"""
 
 
 
14
 
15
+ # DPO-specific configuration
16
+ beta: float = 0.1
17
+ max_prompt_length: int = 2048
18
+ max_length: int = 4096
 
 
 
19
 
20
+ # DPO training configuration
21
+ dpo_beta: float = 0.1 # NOTE: duplicates `beta` above; only `beta` is validated in __post_init__
22
+ dpo_loss_type: str = "sigmoid" # "sigmoid" or "hinge"
23
+ dpo_alpha: float = 0.5
24
 
25
+ # Reference model configuration
26
+ ref_model_name: Optional[str] = None # If None, will use the same as model_name
27
+ ref_model_peft_config: Optional[dict] = None
 
28
 
29
+ # Preference dataset configuration
30
+ preference_dataset_format: str = "dpo" # "dpo", "rlhf", "custom"
31
+ preference_dataset_text_field: str = "text"
32
+ preference_dataset_prompt_field: str = "prompt"
33
+ preference_dataset_chosen_field: str = "chosen"
34
+ preference_dataset_rejected_field: str = "rejected"
35
+
36
+ # DPO training arguments
37
+ dpo_gradient_checkpointing: bool = True
38
+ dpo_gradient_checkpointing_kwargs: Optional[dict] = None
39
+ dpo_precompute_ref_log_probs: bool = False
40
+ dpo_peft_config: Optional[dict] = None
41
+
42
+ def __post_init__(self):
43
+ super().__post_init__()
44
+
45
+ # Set default values for DPO-specific settings
46
+ if self.ref_model_name is None:
47
+ self.ref_model_name = self.model_name
48
+
49
+ if self.dpo_gradient_checkpointing_kwargs is None:
50
+ self.dpo_gradient_checkpointing_kwargs = {
51
+ "use_reentrant": False
52
+ }
53
+
54
+ if self.dpo_peft_config is None:
55
+ self.dpo_peft_config = {
56
+ "r": 16,
57
+ "lora_alpha": 32,
58
+ "lora_dropout": 0.1,
59
+ "bias": "none",
60
+ "task_type": "CAUSAL_LM"
61
+ }
62
+
63
+ # Validate DPO configuration
64
+ if self.beta <= 0:
65
+ raise ValueError("beta must be positive")
66
+
67
+ if self.max_prompt_length > self.max_seq_length:
68
+ raise ValueError("max_prompt_length cannot exceed max_seq_length")
69
+
70
+ if self.max_length > self.max_seq_length:
71
+ raise ValueError("max_length cannot exceed max_seq_length")
72
+
73
+ def get_dpo_config(config_path: str) -> SmolLM3DPOConfig:
74
+ """Load DPO configuration from file or return default"""
75
+ if os.path.exists(config_path):
76
+ # Load from file if it exists
77
+ import importlib.util
78
+ spec = importlib.util.spec_from_file_location("config_module", config_path)
79
+ config_module = importlib.util.module_from_spec(spec)
80
+ spec.loader.exec_module(config_module)
81
+
82
+ if hasattr(config_module, 'config'):
83
+ return config_module.config
84
+ else:
85
+ # Try to find a config class
86
+ for attr_name in dir(config_module):
87
+ attr = getattr(config_module, attr_name)
88
+ if isinstance(attr, SmolLM3DPOConfig):
89
+ return attr
90
+
91
+ # Return default configuration
92
+ return SmolLM3DPOConfig()
93
+
94
+ # Default DPO configuration instance
95
+ config = SmolLM3DPOConfig()
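To make the role of `beta` and `dpo_loss_type` concrete, here is a minimal sketch of the per-example DPO loss, assuming the standard formulation over summed sequence log-probabilities. The helper `dpo_loss` and its argument names are hypothetical and not part of this commit; the hinge variant follows the form used by common DPO implementations and is an assumption, since the training loop that consumes this config is not shown here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, loss_type="sigmoid"):
    """Per-example DPO loss (hypothetical helper, for illustration only)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # How much more strongly the policy prefers the chosen answer over the
    # rejected one, relative to the frozen reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    if loss_type == "sigmoid":
        # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
        return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    if loss_type == "hinge":
        return max(0.0, 1.0 - z)
    raise ValueError(f"unknown loss_type: {loss_type}")

# With no preference margin the sigmoid loss equals log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# A policy that favours the chosen answer more than the reference gets a lower loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

A larger `beta` makes the loss react more sharply to the same log-probability margin, which is why the config rejects non-positive values.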
config/train_smollm3_openhermes_fr.py ADDED
@@ -0,0 +1,129 @@
1
+ """
2
+ SmolLM3 Training Configuration for OpenHermes-FR Dataset
3
+ Optimized for French instruction tuning using legmlai/openhermes-fr
4
+ """
5
+
6
+ import os
7
+ from dataclasses import dataclass
8
+ from typing import Optional
9
+ from config.train_smollm3 import SmolLM3Config
10
+
11
+ @dataclass
12
+ class SmolLM3ConfigOpenHermesFR(SmolLM3Config):
13
+ """Configuration for SmolLM3 fine-tuning on OpenHermes-FR dataset"""
14
+
15
+ # Model configuration
16
+ model_name: str = "HuggingFaceTB/SmolLM3-3B"
17
+ max_seq_length: int = 4096
18
+ use_flash_attention: bool = True
19
+ use_gradient_checkpointing: bool = True
20
+
21
+ # Training configuration - optimized for French instruction tuning
22
+ batch_size: int = 2 # Reduced for French text (longer sequences)
23
+ gradient_accumulation_steps: int = 8 # Increased to maintain effective batch size
24
+ learning_rate: float = 1e-5 # Slightly lower for instruction tuning
25
+ weight_decay: float = 0.01
26
+ warmup_steps: int = 500 # More warmup for instruction tuning
27
+ max_iters: int = 2000 # More iterations for large dataset
28
+ eval_interval: int = 200
29
+ log_interval: int = 10
30
+ save_interval: int = 500
31
+
32
+ # Optimizer configuration
33
+ optimizer: str = "adamw"
34
+ beta1: float = 0.9
35
+ beta2: float = 0.95
36
+ eps: float = 1e-8
37
+
38
+ # Scheduler configuration
39
+ scheduler: str = "cosine"
40
+ min_lr: float = 1e-6
41
+
42
+ # Mixed precision
43
+ fp16: bool = True
44
+ bf16: bool = False
45
+
46
+ # DDP configuration
47
+ ddp_backend: str = "nccl"
48
+ ddp_find_unused_parameters: bool = False
49
+
50
+ # Logging and saving
51
+ save_steps: int = 500
52
+ eval_steps: int = 200
53
+ logging_steps: int = 10
54
+ save_total_limit: Optional[int] = 3
55
+
56
+ # Evaluation
57
+ eval_strategy: str = "steps"
58
+ metric_for_best_model: str = "eval_loss"
59
+ greater_is_better: bool = False
60
+ load_best_model_at_end: bool = True
61
+
62
+ # OpenHermes-FR Dataset configuration
63
+ dataset_name: str = "legmlai/openhermes-fr"
64
+ dataset_split: str = "train"
65
+ input_field: str = "prompt"
66
+ target_field: str = "accepted_completion"
67
+ filter_bad_entries: bool = True
68
+ bad_entry_field: str = "bad_entry"
69
+
70
+ # Data configuration (not used for HF datasets but kept for compatibility)
71
+ data_dir: Optional[str] = None
72
+ train_file: Optional[str] = None
73
+ validation_file: Optional[str] = None
74
+ test_file: Optional[str] = None
75
+
76
+ # Chat template configuration
77
+ use_chat_template: bool = True
78
+ chat_template_kwargs: dict = None
79
+
80
+ # Trackio monitoring configuration
81
+ enable_tracking: bool = True
82
+ trackio_url: Optional[str] = None
83
+ trackio_token: Optional[str] = None
84
+ log_artifacts: bool = True
85
+ log_metrics: bool = True
86
+ log_config: bool = True
87
+ experiment_name: Optional[str] = None
88
+
89
+ def __post_init__(self):
90
+ if self.chat_template_kwargs is None:
91
+ self.chat_template_kwargs = {
92
+ "enable_thinking": False,
93
+ "add_generation_prompt": True
94
+ }
95
+
96
+ # Validate configuration
97
+ if self.fp16 and self.bf16:
98
+ raise ValueError("Cannot use both fp16 and bf16")
99
+
100
+ if self.max_seq_length > 131072: # 128k limit
101
+ raise ValueError("max_seq_length cannot exceed 131072")
102
+
103
+ # Set default experiment name if not provided
104
+ if self.experiment_name is None:
105
+ self.experiment_name = "smollm3_openhermes_fr"
106
+
107
+ def get_config(config_path: str) -> SmolLM3ConfigOpenHermesFR:
108
+ """Load configuration from file or return default"""
109
+ if os.path.exists(config_path):
110
+ # Load from file if it exists
111
+ import importlib.util
112
+ spec = importlib.util.spec_from_file_location("config_module", config_path)
113
+ config_module = importlib.util.module_from_spec(spec)
114
+ spec.loader.exec_module(config_module)
115
+
116
+ if hasattr(config_module, 'config'):
117
+ return config_module.config
118
+ else:
119
+ # Try to find a config class
120
+ for attr_name in dir(config_module):
121
+ attr = getattr(config_module, attr_name)
122
+ if isinstance(attr, SmolLM3ConfigOpenHermesFR):
123
+ return attr
124
+
125
+ # Return default configuration
126
+ return SmolLM3ConfigOpenHermesFR()
127
+
128
+ # Default configuration instance
129
+ config = SmolLM3ConfigOpenHermesFR()
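The `get_config` helper above imports a Python file by path and returns its module-level `config` object. The following is a minimal, self-contained sketch of that loading pattern using only the standard library; `load_config_object` is a hypothetical name, and a plain `dict` stands in for the config dataclass so the example runs without the rest of the repository. Unlike `get_config`, this sketch returns `None` rather than a default instance when nothing is found.

```python
import importlib.util
import os
import tempfile

def load_config_object(config_path, expected_type):
    """Import a Python file by path and return its config object (hypothetical helper)."""
    if not os.path.exists(config_path):
        return None
    spec = importlib.util.spec_from_file_location("config_module", config_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    if hasattr(module, "config"):
        return module.config
    # Fall back to the first module-level instance of the expected type.
    for name in dir(module):
        value = getattr(module, name)
        if not name.startswith("__") and isinstance(value, expected_type):
            return value
    return None

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_config.py")
    with open(path, "w") as f:
        f.write("config = {'max_iters': 2000}\n")
    loaded = load_config_object(path, dict)
    missing = load_config_object(os.path.join(tmp, "missing.py"), dict)
```

Because `exec_module` executes the file, a config loaded this way should come from a trusted location.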
config/train_smollm3_openhermes_fr_a100_large.py ADDED
@@ -0,0 +1,161 @@
1
+ """
2
+ SmolLM3 Training Configuration for OpenHermes-FR Dataset - A100 Large Scale
3
+ Optimized for A100 GPUs with large batch sizes and multiple passes on 800k+ datapoints
4
+ """
5
+
6
+ import os
7
+ from dataclasses import dataclass
8
+ from typing import Optional
9
+ from config.train_smollm3 import SmolLM3Config
10
+
11
+ @dataclass
12
+ class SmolLM3ConfigOpenHermesFRA100Large(SmolLM3Config):
13
+ """Configuration for SmolLM3 fine-tuning on OpenHermes-FR dataset - A100 Large Scale"""
14
+
15
+ # Model configuration - optimized for A100
16
+ model_name: str = "HuggingFaceTB/SmolLM3-3B"
17
+ max_seq_length: int = 8192 # Increased for better context understanding
18
+ use_flash_attention: bool = True
19
+ use_gradient_checkpointing: bool = False # Disabled for A100 efficiency
20
+
21
+ # Training configuration - A100 optimized with large batch sizes
22
+ batch_size: int = 8 # Large batch size for A100 (80GB VRAM)
23
+ gradient_accumulation_steps: int = 16 # Effective batch size = 8 * 16 = 128
24
+ learning_rate: float = 5e-6 # Lower LR for large effective batch size
25
+ weight_decay: float = 0.01
26
+ warmup_steps: int = 1000 # More warmup for large dataset
27
+ max_iters: int = 8000 # ~1.3 passes on the 800k dataset (6,250 steps per pass)
28
+ eval_interval: int = 500 # Less frequent evaluation
29
+ log_interval: int = 25 # Less frequent logging
30
+ save_interval: int = 1000 # Less frequent saving
31
+
32
+ # Optimizer configuration - optimized for large batches
33
+ optimizer: str = "adamw"
34
+ beta1: float = 0.9
35
+ beta2: float = 0.999 # Higher beta2 for stability with large batches
36
+ eps: float = 1e-8
37
+
38
+ # Scheduler configuration - longer training
39
+ scheduler: str = "cosine"
40
+ min_lr: float = 5e-7 # Lower min LR
41
+
42
+ # Mixed precision - A100 optimized
43
+ fp16: bool = False # Use bf16 for A100
44
+ bf16: bool = True # Better for A100
45
+
46
+ # DDP configuration
47
+ ddp_backend: str = "nccl"
48
+ ddp_find_unused_parameters: bool = False
49
+
50
+ # Logging and saving - optimized for long training
51
+ save_steps: int = 1000
52
+ eval_steps: int = 500
53
+ logging_steps: int = 25
54
+ save_total_limit: Optional[int] = 5 # Keep more checkpoints
55
+
56
+ # Evaluation
57
+ eval_strategy: str = "steps"
58
+ metric_for_best_model: str = "eval_loss"
59
+ greater_is_better: bool = False
60
+ load_best_model_at_end: bool = True
61
+
62
+ # OpenHermes-FR Dataset configuration
63
+ dataset_name: str = "legmlai/openhermes-fr"
64
+ dataset_split: str = "train"
65
+ input_field: str = "prompt"
66
+ target_field: str = "accepted_completion"
67
+ filter_bad_entries: bool = True
68
+ bad_entry_field: str = "bad_entry"
69
+
70
+ # Data configuration (not used for HF datasets but kept for compatibility)
71
+ data_dir: Optional[str] = None
72
+ train_file: Optional[str] = None
73
+ validation_file: Optional[str] = None
74
+ test_file: Optional[str] = None
75
+
76
+ # Chat template configuration
77
+ use_chat_template: bool = True
78
+ chat_template_kwargs: dict = None
79
+
80
+ # Trackio monitoring configuration
81
+ enable_tracking: bool = True
82
+ trackio_url: Optional[str] = None
83
+ trackio_token: Optional[str] = None
84
+ log_artifacts: bool = True
85
+ log_metrics: bool = True
86
+ log_config: bool = True
87
+ experiment_name: Optional[str] = None
88
+
89
+ # Additional A100 optimizations
90
+ dataloader_num_workers: int = 8 # More workers for faster data loading
91
+ dataloader_pin_memory: bool = True
92
+ dataloader_prefetch_factor: int = 2
93
+
94
+ # Memory optimizations
95
+ max_grad_norm: float = 1.0 # Gradient clipping
96
+ group_by_length: bool = True # Group similar length sequences
97
+
98
+ # Training duration calculations
99
+ # With 800k datapoints and effective batch size of 128:
100
+ # Steps per epoch = 800,000 / 128 = 6,250 steps
101
+ # For 3 passes: 6,250 * 3 = 18,750 steps
102
+ # For 5 passes: 6,250 * 5 = 31,250 steps
103
+ # Current max_iters = 8,000 (about 1.3 passes)
104
+
105
+ def __post_init__(self):
106
+ if self.chat_template_kwargs is None:
107
+ self.chat_template_kwargs = {
108
+ "enable_thinking": False,
109
+ "add_generation_prompt": True
110
+ }
111
+
112
+ # Validate configuration
113
+ if self.fp16 and self.bf16:
114
+ raise ValueError("Cannot use both fp16 and bf16")
115
+
116
+ if self.max_seq_length > 131072: # 128k limit
117
+ raise ValueError("max_seq_length cannot exceed 131072")
118
+
119
+ # Calculate training statistics
120
+ effective_batch_size = self.batch_size * self.gradient_accumulation_steps
121
+ steps_per_epoch = 800000 // effective_batch_size # Approximate for 800k dataset
122
+ epochs_for_max_iters = self.max_iters / steps_per_epoch
123
+
124
+ print("=== A100 Large Scale Training Configuration ===")
125
+ print(f"Effective batch size: {effective_batch_size}")
126
+ print(f"Steps per epoch: ~{steps_per_epoch}")
127
+ print(f"Training for ~{epochs_for_max_iters:.1f} epochs")
128
+ print(f"Total training steps: {self.max_iters}")
129
+ print(f"Learning rate: {self.learning_rate}")
130
+ print(f"Mixed precision: {'bf16' if self.bf16 else 'fp16'}")
131
+ print(f"Max sequence length: {self.max_seq_length}")
132
+ print(f"Gradient checkpointing: {self.use_gradient_checkpointing}")
133
+ print("=" * 50)
134
+
135
+ # Set default experiment name if not provided
136
+ if self.experiment_name is None:
137
+ self.experiment_name = "smollm3_openhermes_fr_a100_large"
138
+
139
+ def get_config(config_path: str) -> SmolLM3ConfigOpenHermesFRA100Large:
140
+ """Load configuration from file or return default"""
141
+ if os.path.exists(config_path):
142
+ # Load from file if it exists
143
+ import importlib.util
144
+ spec = importlib.util.spec_from_file_location("config_module", config_path)
145
+ config_module = importlib.util.module_from_spec(spec)
146
+ spec.loader.exec_module(config_module)
147
+
148
+ if hasattr(config_module, 'config'):
149
+ return config_module.config
150
+ else:
151
+ # Try to find a config class
152
+ for attr_name in dir(config_module):
153
+ attr = getattr(config_module, attr_name)
154
+ if isinstance(attr, SmolLM3ConfigOpenHermesFRA100Large):
155
+ return attr
156
+
157
+ # Return default configuration
158
+ return SmolLM3ConfigOpenHermesFRA100Large()
159
+
160
+ # Default configuration instance
161
+ config = SmolLM3ConfigOpenHermesFRA100Large()
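The pass counts quoted in the comments and printed by `__post_init__` follow from simple arithmetic. The sketch below reproduces it for both A100 configurations; `training_passes` is a hypothetical helper, and the 800,000 figure is the same approximation the configs use (the real count will be lower after bad-entry filtering and the 98/1/1 split performed in `data.py`).

```python
def training_passes(dataset_size, batch_size, grad_accum_steps, max_iters):
    """Return (effective batch size, optimizer steps per pass, number of passes)."""
    effective_batch = batch_size * grad_accum_steps
    steps_per_epoch = dataset_size // effective_batch
    return effective_batch, steps_per_epoch, max_iters / steps_per_epoch

# Large-batch config: batch 8, accumulation 16, 8,000 steps.
large = training_passes(800_000, 8, 16, 8_000)
# Multiple-passes config: batch 6, accumulation 20, 25,000 steps.
multi = training_passes(800_000, 6, 20, 25_000)

print(f"large: {large[0]} samples/step, {large[1]} steps/pass, {large[2]:.2f} passes")
print(f"multi: {multi[0]} samples/step, {multi[1]} steps/pass, {multi[2]:.2f} passes")
```

To target a whole number of passes, set `max_iters` to the desired pass count times `steps_per_epoch`.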
config/train_smollm3_openhermes_fr_a100_multiple_passes.py ADDED
@@ -0,0 +1,164 @@
1
+ """
2
+ SmolLM3 Training Configuration for OpenHermes-FR Dataset - Multiple Passes
3
+ Optimized for A100 GPUs with multiple passes (3-5 epochs) on 800k+ datapoints
4
+ """
5
+
6
+ import os
7
+ from dataclasses import dataclass
8
+ from typing import Optional
9
+ from config.train_smollm3 import SmolLM3Config
10
+
11
+ @dataclass
12
+ class SmolLM3ConfigOpenHermesFRMultiplePasses(SmolLM3Config):
13
+ """Configuration for SmolLM3 fine-tuning with multiple passes on OpenHermes-FR dataset"""
14
+
15
+ # Model configuration - optimized for A100
16
+ model_name: str = "HuggingFaceTB/SmolLM3-3B"
17
+ max_seq_length: int = 8192 # Increased for better context understanding
18
+ use_flash_attention: bool = True
19
+ use_gradient_checkpointing: bool = False # Disabled for A100 efficiency
20
+
21
+ # Training configuration - Multiple passes optimized
22
+ batch_size: int = 6 # Slightly smaller for stability during long training
23
+ gradient_accumulation_steps: int = 20 # Effective batch size = 6 * 20 = 120
24
+ learning_rate: float = 3e-6 # Conservative LR for multiple passes
25
+ weight_decay: float = 0.01
26
+ warmup_steps: int = 2000 # Longer warmup for multiple passes
27
+ max_iters: int = 25000 # ~3.75 passes on the 800k dataset (25k steps)
28
+ eval_interval: int = 1000 # Less frequent evaluation
29
+ log_interval: int = 50 # Less frequent logging
30
+ save_interval: int = 2000 # Less frequent saving
31
+
32
+ # Optimizer configuration - stability focused
33
+ optimizer: str = "adamw"
34
+ beta1: float = 0.9
35
+ beta2: float = 0.999 # Higher beta2 for stability
36
+ eps: float = 1e-8
37
+
38
+ # Scheduler configuration - longer training with multiple passes
39
+ scheduler: str = "cosine"
40
+ min_lr: float = 3e-7 # Lower min LR
41
+
42
+ # Mixed precision - A100 optimized
43
+ fp16: bool = False # Use bf16 for A100
44
+ bf16: bool = True # Better for A100
45
+
46
+ # DDP configuration
47
+ ddp_backend: str = "nccl"
48
+ ddp_find_unused_parameters: bool = False
49
+
50
+ # Logging and saving - optimized for long training
51
+ save_steps: int = 2000
52
+ eval_steps: int = 1000
53
+ logging_steps: int = 50
54
+ save_total_limit: Optional[int] = 8 # Keep more checkpoints for long training
55
+
56
+ # Evaluation
57
+ eval_strategy: str = "steps"
58
+ metric_for_best_model: str = "eval_loss"
59
+ greater_is_better: bool = False
60
+ load_best_model_at_end: bool = True
61
+
62
+ # OpenHermes-FR Dataset configuration
63
+ dataset_name: str = "legmlai/openhermes-fr"
64
+ dataset_split: str = "train"
65
+ input_field: str = "prompt"
66
+ target_field: str = "accepted_completion"
67
+ filter_bad_entries: bool = True
68
+ bad_entry_field: str = "bad_entry"
69
+
70
+ # Data configuration (not used for HF datasets but kept for compatibility)
71
+ data_dir: Optional[str] = None
72
+ train_file: Optional[str] = None
73
+ validation_file: Optional[str] = None
74
+ test_file: Optional[str] = None
75
+
76
+ # Chat template configuration
77
+ use_chat_template: bool = True
78
+ chat_template_kwargs: dict = None
79
+
80
+ # Trackio monitoring configuration
81
+ enable_tracking: bool = True
82
+ trackio_url: Optional[str] = None
83
+ trackio_token: Optional[str] = None
84
+ log_artifacts: bool = True
85
+ log_metrics: bool = True
86
+ log_config: bool = True
87
+ experiment_name: Optional[str] = None
88
+
89
+ # Additional A100 optimizations
90
+ dataloader_num_workers: int = 8 # More workers for faster data loading
91
+ dataloader_pin_memory: bool = True
92
+ dataloader_prefetch_factor: int = 2
93
+
94
+ # Memory optimizations
95
+ max_grad_norm: float = 1.0 # Gradient clipping
96
+ group_by_length: bool = True # Group similar length sequences
97
+
98
+ # Training duration calculations
99
+ # With 800k datapoints and effective batch size of 120:
100
+ # Steps per epoch = 800,000 / 120 ≈ 6,667 steps
101
+ # For 3 passes: 6,667 * 3 ≈ 20,000 steps
102
+ # For 4 passes: 6,667 * 4 ≈ 26,667 steps
103
+ # For 5 passes: 6,667 * 5 ≈ 33,333 steps
104
+ # Current max_iters = 25,000 (about 3.75 passes)
105
+
106
+ def __post_init__(self):
107
+ if self.chat_template_kwargs is None:
108
+ self.chat_template_kwargs = {
109
+ "enable_thinking": False,
110
+ "add_generation_prompt": True
111
+ }
112
+
113
+ # Validate configuration
114
+ if self.fp16 and self.bf16:
115
+ raise ValueError("Cannot use both fp16 and bf16")
116
+
117
+ if self.max_seq_length > 131072: # 128k limit
118
+ raise ValueError("max_seq_length cannot exceed 131072")
119
+
120
+ # Calculate training statistics
121
+ effective_batch_size = self.batch_size * self.gradient_accumulation_steps
122
+ steps_per_epoch = 800000 // effective_batch_size # Approximate for 800k dataset
123
+ epochs_for_max_iters = self.max_iters / steps_per_epoch
124
+
125
+ print("=== Multiple Passes Training Configuration ===")
126
+ print(f"Effective batch size: {effective_batch_size}")
127
+ print(f"Steps per epoch: ~{steps_per_epoch}")
128
+ print(f"Training for ~{epochs_for_max_iters:.1f} epochs")
129
+ print(f"Total training steps: {self.max_iters}")
130
+ print(f"Learning rate: {self.learning_rate}")
131
+ print(f"Mixed precision: {'bf16' if self.bf16 else 'fp16'}")
132
+ print(f"Max sequence length: {self.max_seq_length}")
133
+ print(f"Gradient checkpointing: {self.use_gradient_checkpointing}")
134
+ print(f"Warmup steps: {self.warmup_steps}")
135
+ print(f"Save interval: {self.save_interval}")
136
+ print("=" * 50)
137
+
138
+ # Set default experiment name if not provided
139
+ if self.experiment_name is None:
140
+ self.experiment_name = "smollm3_openhermes_fr_multiple_passes"
141
+
142
+ def get_config(config_path: str) -> SmolLM3ConfigOpenHermesFRMultiplePasses:
143
+ """Load configuration from file or return default"""
144
+ if os.path.exists(config_path):
145
+ # Load from file if it exists
146
+ import importlib.util
147
+ spec = importlib.util.spec_from_file_location("config_module", config_path)
148
+ config_module = importlib.util.module_from_spec(spec)
149
+ spec.loader.exec_module(config_module)
150
+
151
+ if hasattr(config_module, 'config'):
152
+ return config_module.config
153
+ else:
154
+ # Try to find a config class
155
+ for attr_name in dir(config_module):
156
+ attr = getattr(config_module, attr_name)
157
+ if isinstance(attr, SmolLM3ConfigOpenHermesFRMultiplePasses):
158
+ return attr
159
+
160
+ # Return default configuration
161
+ return SmolLM3ConfigOpenHermesFRMultiplePasses()
162
+
163
+ # Default configuration instance
164
+ config = SmolLM3ConfigOpenHermesFRMultiplePasses()
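The config pairs `scheduler = "cosine"` with `warmup_steps`, `learning_rate` and `min_lr`. How the trainer turns these into a schedule is not shown in this diff, so the following is only a sketch under the common assumption of linear warmup followed by cosine decay down to `min_lr`; `lr_at_step` is a hypothetical helper.

```python
import math

def lr_at_step(step, max_lr, min_lr, warmup_steps, max_iters):
    """Learning rate at a given optimizer step (assumed warmup + cosine decay)."""
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_iters:
        return min_lr
    progress = (step - warmup_steps) / (max_iters - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)

# Values taken from the multiple-passes configuration above.
schedule = [lr_at_step(s, 3e-6, 3e-7, 2000, 25000) for s in (0, 1999, 2000, 13500, 25000)]
```

Under this assumption, the 2,000 warmup steps cover 8% of the 25,000-step run, and the rate sits halfway between `learning_rate` and `min_lr` at step 13,500.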
data.py CHANGED
@@ -22,13 +22,17 @@ class SmolLM3Dataset:
22
  tokenizer: PreTrainedTokenizer,
23
  max_seq_length: int = 4096,
24
  use_chat_template: bool = True,
25
- chat_template_kwargs: Optional[Dict] = None
 
 
26
  ):
27
  self.data_path = data_path
28
  self.tokenizer = tokenizer
29
  self.max_seq_length = max_seq_length
30
  self.use_chat_template = use_chat_template
31
  self.chat_template_kwargs = chat_template_kwargs or {}
 
 
32
 
33
  # Load and process dataset
34
  self.dataset = self._load_dataset()
@@ -74,6 +78,17 @@ class SmolLM3Dataset:
74
  try:
75
  dataset = load_dataset(self.data_path)
76
  logger.info(f"Loaded Hugging Face dataset: {self.data_path}")
 
 
 
 
 
 
 
 
 
 
 
77
  # If only 'train' split exists, create validation and test splits
78
  if ("train" in dataset) and ("validation" not in dataset or "test" not in dataset):
79
  logger.info("Automatically splitting train into train/validation/test (98/1/1)")
@@ -123,6 +138,11 @@ class SmolLM3Dataset:
123
  {"role": "user", "content": example["prompt"]},
124
  {"role": "assistant", "content": example["accepted_completion"]}
125
  ]
 
 
 
 
 
126
  else:
127
  # Fallback: treat as plain text
128
  return {"text": str(example)}
 
22
  tokenizer: PreTrainedTokenizer,
23
  max_seq_length: int = 4096,
24
  use_chat_template: bool = True,
25
+ chat_template_kwargs: Optional[Dict] = None,
26
+ filter_bad_entries: bool = False,
27
+ bad_entry_field: str = "bad_entry"
28
  ):
29
  self.data_path = data_path
30
  self.tokenizer = tokenizer
31
  self.max_seq_length = max_seq_length
32
  self.use_chat_template = use_chat_template
33
  self.chat_template_kwargs = chat_template_kwargs or {}
34
+ self.filter_bad_entries = filter_bad_entries
35
+ self.bad_entry_field = bad_entry_field
36
 
37
  # Load and process dataset
38
  self.dataset = self._load_dataset()
 
78
  try:
79
  dataset = load_dataset(self.data_path)
80
  logger.info(f"Loaded Hugging Face dataset: {self.data_path}")
81
+
82
+ # Filter bad entries if requested
83
+ if self.filter_bad_entries and self.bad_entry_field in dataset["train"].column_names:
84
+ logger.info(f"Filtering out bad entries using field: {self.bad_entry_field}")
85
+ for split in dataset:
86
+ if self.bad_entry_field in dataset[split].column_names:
87
+ original_size = len(dataset[split])
88
+ dataset[split] = dataset[split].filter(lambda x: not x[self.bad_entry_field])
89
+ filtered_size = len(dataset[split])
90
+ logger.info(f"Filtered {split}: {original_size} -> {filtered_size} samples")
91
+
92
  # If only 'train' split exists, create validation and test splits
93
  if ("train" in dataset) and ("validation" not in dataset or "test" not in dataset):
94
  logger.info("Automatically splitting train into train/validation/test (98/1/1)")
 
138
  {"role": "user", "content": example["prompt"]},
139
  {"role": "assistant", "content": example["accepted_completion"]}
140
  ]
141
+ elif "prompt" in example and "completion" in example:
142
+ messages = [
143
+ {"role": "user", "content": example["prompt"]},
144
+ {"role": "assistant", "content": example["completion"]}
145
+ ]
146
  else:
147
  # Fallback: treat as plain text
148
  return {"text": str(example)}
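The two behaviours added to `data.py` in this change (dropping rows flagged by `bad_entry_field`, and accepting `prompt`/`completion` pairs alongside `prompt`/`accepted_completion`) can be sketched on plain dictionaries. This is an illustration only: `filter_bad_entries` and `to_messages` are hypothetical stand-ins that avoid the `datasets` dependency, and `to_messages` returns `None` where the real code falls back to plain text.

```python
def filter_bad_entries(rows, bad_entry_field="bad_entry"):
    """Keep rows whose bad-entry flag is missing or falsy."""
    return [row for row in rows if not row.get(bad_entry_field, False)]

def to_messages(example):
    """Convert a prompt/completion row into a two-turn chat, or None if unsupported."""
    if "prompt" in example and "accepted_completion" in example:
        answer = example["accepted_completion"]
    elif "prompt" in example and "completion" in example:
        answer = example["completion"]
    else:
        return None  # the real code falls back to {"text": str(example)}
    return [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": answer},
    ]

rows = [
    {"prompt": "Bonjour ?", "accepted_completion": "Salut !", "bad_entry": False},
    {"prompt": "Entrée cassée", "accepted_completion": "", "bad_entry": True},
    {"prompt": "Ça va ?", "completion": "Oui."},
]
clean = filter_bad_entries(rows)
conversations = [to_messages(row) for row in clean]
```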
deploy_trackio_space.py ADDED
@@ -0,0 +1,235 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Deployment script for Trackio on Hugging Face Spaces
4
+ Automates the process of creating and configuring a Trackio Space
5
+ """
6
+
7
+ import os
8
+ import json
9
+ import requests
10
+ import subprocess
11
+ import sys
12
+ from pathlib import Path
13
+ from typing import Dict, Any, Optional
14
+
15
+ class TrackioSpaceDeployer:
16
+ """Deployer for Trackio on Hugging Face Spaces"""
17
+
18
+ def __init__(self, space_name: str, username: str, token: str):
19
+ self.space_name = space_name
20
+ self.username = username
21
+ self.token = token
22
+ self.space_url = f"https://huggingface.co/spaces/{username}/{space_name}"
23
+
24
+ def create_space(self) -> bool:
25
+ """Create a new Hugging Face Space"""
26
+ try:
27
+ print(f"Creating Space: {self.space_name}")
28
+
29
+ # Create space using Hugging Face CLI
30
+ cmd = [
31
+ "huggingface-cli", "repo", "create",
32
+ f"{self.username}/{self.space_name}",
33
+ "--type", "space",
34
+ "--space-sdk", "gradio",
35
+ "--space-hardware", "cpu-basic"
36
+ ]
37
+
38
+ result = subprocess.run(cmd, capture_output=True, text=True)
39
+
40
+ if result.returncode == 0:
41
+ print(f"✅ Space created successfully: {self.space_url}")
42
+ return True
43
+ else:
44
+ print(f"❌ Failed to create space: {result.stderr}")
45
+ return False
46
+
47
+ except Exception as e:
48
+ print(f"❌ Error creating space: {e}")
49
+ return False
50
+
51
+ def upload_files(self) -> bool:
52
+ """Upload necessary files to the Space"""
53
+ try:
54
+ print("Uploading files to Space...")
55
+
56
+ # Files to upload
57
+ files_to_upload = [
58
+ "app.py",
59
+ "requirements_space.txt",
60
+ "README.md"
61
+ ]
62
+
63
+ for file_path in files_to_upload:
64
+ if os.path.exists(file_path):
65
+ # Use git to add and push files
66
+ subprocess.run(["git", "add", file_path], check=True)
67
+ subprocess.run(["git", "commit", "-m", f"Add {file_path}"], check=True)
68
+ subprocess.run(["git", "push"], check=True)
69
+ print(f"✅ Uploaded {file_path}")
70
+ else:
71
+ print(f"⚠️ File not found: {file_path}")
72
+
73
+ return True
74
+
75
+ except Exception as e:
76
+ print(f"❌ Error uploading files: {e}")
77
+ return False
78
+
79
+ def configure_space(self) -> bool:
80
+ """Configure the Space settings"""
81
+ try:
82
+ print("Configuring Space settings...")
83
+
84
+ # Create space configuration
85
+ space_config = {
86
+ "title": "Trackio - Experiment Tracking",
87
+ "emoji": "🚀",
88
+ "colorFrom": "blue",
89
+ "colorTo": "purple",
90
+ "sdk": "gradio",
91
+ "sdk_version": "4.0.0",
92
+ "app_file": "app.py",
93
+ "pinned": False
94
+ }
95
+
96
+ # Write README.md for the space
97
+ space_readme = f"""---
98
+ title: Trackio for Petite Elle L'Aime
99
+ emoji: 🐠
100
+ colorFrom: indigo
101
+ colorTo: yellow
102
+ sdk: gradio
103
+ sdk_version: 5.38.0
104
+ app_file: app.py
105
+ pinned: true
106
+ license: mit
107
+ short_description: trackio for training monitoring
108
+ ---
109
+
110
+ # Trackio Experiment Tracking
111
+
112
+ A Gradio interface for experiment tracking and monitoring.
113
+
114
+ ## Features
115
+
116
+ - Create and manage experiments
117
+ - Log training metrics and parameters
118
+ - View experiment details and results
119
+ - Update experiment status
120
+
121
+ ## Usage
122
+
123
+ 1. Create a new experiment using the "Create Experiment" tab
124
+ 2. Log metrics during training using the "Log Metrics" tab
125
+ 3. View experiment details using the "View Experiments" tab
126
+ 4. Update experiment status using the "Update Status" tab
127
+
128
+ ## Integration
129
+
130
+ To connect your training script to this Trackio Space:
131
+
132
+ ```python
133
+ from monitoring import SmolLM3Monitor
134
+
135
+ monitor = SmolLM3Monitor(
136
+ experiment_name="my_experiment",
137
+ trackio_url="{self.space_url}",
138
+ enable_tracking=True
139
+ )
140
+ ```
141
+
142
+ Visit: {self.space_url}
143
+ """
144
+
145
+ with open("README.md", "w") as f:
146
+ f.write(space_readme)
147
+
148
+ return True
149
+
150
+ except Exception as e:
151
+ print(f"❌ Error configuring space: {e}")
152
+ return False
153
+
154
+ def test_space(self) -> bool:
155
+ """Test if the Space is working correctly"""
156
+ try:
157
+ print("Testing Space...")
158
+
159
+ # Wait a bit for the space to build
160
+ import time
161
+ time.sleep(30)
162
+
163
+ # Try to access the space
164
+ response = requests.get(self.space_url, timeout=10)
165
+
166
+ if response.status_code == 200:
167
+ print(f"✅ Space is accessible: {self.space_url}")
168
+ return True
169
+ else:
170
+ print(f"⚠️ Space returned status code: {response.status_code}")
171
+ return False
172
+
173
+ except Exception as e:
174
+ print(f"❌ Error testing space: {e}")
175
+ return False
176
+
177
+ def deploy(self) -> bool:
178
+ """Complete deployment process"""
179
+ print("🚀 Starting Trackio Space deployment...")
180
+
181
+ # Step 1: Create space
182
+ if not self.create_space():
183
+ return False
184
+
185
+ # Step 2: Configure space
186
+ if not self.configure_space():
187
+ return False
188
+
189
+ # Step 3: Upload files
190
+ if not self.upload_files():
191
+ return False
192
+
193
+ # Step 4: Test space
194
+ if not self.test_space():
195
+ print("⚠️ Space created but may need time to build")
196
+
197
+ print("🎉 Deployment completed!")
198
+ print(f"📊 Trackio Space URL: {self.space_url}")
199
+ print(f"🔧 Space configuration: {self.space_url}/settings")
200
+
201
+ return True
202
+
203
+ def main():
204
+ """Main deployment function"""
205
+ print("Trackio Space Deployment Script")
206
+ print("=" * 40)
207
+
208
+ # Get user input
209
+ username = input("Enter your Hugging Face username: ").strip()
210
+ space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
211
+ token = input("Enter your Hugging Face token (optional): ").strip()
212
+
213
+ if not username or not space_name:
214
+ print("❌ Username and Space name are required")
215
+ sys.exit(1)
216
+
217
+ # Create deployer
218
+ deployer = TrackioSpaceDeployer(space_name, username, token)
219
+
220
+ # Run deployment
221
+ success = deployer.deploy()
222
+
223
+ if success:
224
+ print("\n✅ Deployment successful!")
225
+ print(f"🌐 Your Trackio Space: {deployer.space_url}")
226
+ print("\nNext steps:")
227
+ print("1. Wait for the Space to build (usually 2-5 minutes)")
228
+ print("2. Test the interface by visiting the Space URL")
229
+ print("3. Use the Space URL in your training scripts")
230
+ else:
231
+ print("\n❌ Deployment failed!")
232
+ print("Check the error messages above and try again.")
233
+
234
+ if __name__ == "__main__":
235
+ main()
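`test_space` sleeps for a fixed 30 seconds and probes the Space once, while the closing message says a build usually takes 2-5 minutes. One possible alternative is to poll until a deadline. The sketch below is not part of this commit: `wait_until_ready` is a hypothetical helper, and the probe is passed in as a callable so the example stays standard-library only and can be exercised without network access.

```python
import time

def wait_until_ready(probe, timeout_s=300.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s elapses (hypothetical helper)."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)

# Demonstration with a fake probe that succeeds on its third call.
attempts = []

def fake_probe():
    attempts.append(1)
    return len(attempts) >= 3

ready = wait_until_ready(fake_probe, timeout_s=60.0, interval_s=0.0, sleep=lambda s: None)
gave_up = wait_until_ready(lambda: False, timeout_s=0.0, interval_s=0.0, sleep=lambda s: None)
```

In the deployer, the probe could wrap the existing `requests.get(self.space_url, timeout=10)` call and return whether the status code is 200.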
monitoring.py ADDED
@@ -0,0 +1,298 @@
1
+ """
2
+ Trackio Monitoring Integration for SmolLM3 Fine-tuning
3
+ Provides comprehensive experiment tracking and monitoring capabilities
4
+ """
5
+
6
+ import os
7
+ import json
8
+ import logging
9
+ from typing import Dict, Any, Optional, List
10
+ from datetime import datetime
11
+ import torch
12
+ from pathlib import Path
13
+
14
+ try:
15
+ import trackio
16
+ from trackio import TrackioClient
17
+ TRACKIO_AVAILABLE = True
18
+ except ImportError:
19
+ TRACKIO_AVAILABLE = False
20
+ print("Warning: Trackio not available. Install with: pip install trackio")
21
+
22
+ logger = logging.getLogger(__name__)
23
+
24
+ class SmolLM3Monitor:
25
+ """Monitoring and tracking for SmolLM3 fine-tuning experiments"""
26
+
27
+ def __init__(
28
+ self,
29
+ experiment_name: str,
30
+ trackio_url: Optional[str] = None,
31
+ trackio_token: Optional[str] = None,
32
+ enable_tracking: bool = True,
33
+ log_artifacts: bool = True,
34
+ log_metrics: bool = True,
35
+ log_config: bool = True
36
+ ):
37
+ self.experiment_name = experiment_name
38
+ self.enable_tracking = enable_tracking and TRACKIO_AVAILABLE
39
+ self.log_artifacts = log_artifacts
40
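+         self._log_metrics_enabled = log_metrics  # distinct name so the flag does not shadow the log_metrics() method
41
+         self._log_config_enabled = log_config  # distinct name so the flag does not shadow the log_config() method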
42
+
43
+ # Initialize Trackio client
44
+ self.trackio_client = None
45
+ if self.enable_tracking:
46
+ self._setup_trackio(trackio_url, trackio_token)
47
+
48
+ # Experiment metadata
49
+ self.experiment_id = None
50
+ self.start_time = datetime.now()
51
+ self.metrics_history = []
52
+ self.artifacts = []
53
+
54
+ logger.info(f"Initialized monitoring for experiment: {experiment_name}")
55
+
56
+ def _setup_trackio(self, trackio_url: Optional[str], trackio_token: Optional[str]):
57
+ """Setup Trackio client"""
58
+ try:
59
+ # Get Trackio configuration from environment or parameters
60
+ url = trackio_url or os.getenv('TRACKIO_URL')
61
+ token = trackio_token or os.getenv('TRACKIO_TOKEN')
62
+
63
+ if not url:
64
+ logger.warning("Trackio URL not provided. Set TRACKIO_URL environment variable.")
65
+ self.enable_tracking = False
66
+ return
67
+
68
+ self.trackio_client = TrackioClient(
69
+ url=url,
70
+ token=token
71
+ )
72
+
73
+ # Create or get experiment
74
+ self.experiment_id = self.trackio_client.create_experiment(
75
+ name=self.experiment_name,
76
+ description=f"SmolLM3 fine-tuning experiment started at {self.start_time}"
77
+ )
78
+
79
+ logger.info(f"Trackio client initialized. Experiment ID: {self.experiment_id}")
80
+
81
+ except Exception as e:
82
+ logger.error(f"Failed to initialize Trackio: {e}")
83
+ self.enable_tracking = False
84
+
85
+ def log_config(self, config: Dict[str, Any]):
86
+ """Log experiment configuration"""
87
+         if not self.enable_tracking or not getattr(self, "_log_config_enabled", True):
88
+ return
89
+
90
+ try:
91
+ # Log configuration as parameters
92
+ self.trackio_client.log_parameters(
93
+ experiment_id=self.experiment_id,
94
+ parameters=config
95
+ )
96
+
97
+ # Also save config locally
98
+ config_path = f"config_{self.experiment_name}_{self.start_time.strftime('%Y%m%d_%H%M%S')}.json"
99
+ with open(config_path, 'w') as f:
100
+ json.dump(config, f, indent=2, default=str)
101
+
102
+ self.artifacts.append(config_path)
103
+ logger.info(f"Configuration logged to Trackio and saved to {config_path}")
104
+
105
+ except Exception as e:
106
+ logger.error(f"Failed to log configuration: {e}")
107
+
108
+ def log_metrics(self, metrics: Dict[str, Any], step: Optional[int] = None):
109
+ """Log training metrics"""
110
+         if not self.enable_tracking or not getattr(self, "_log_metrics_enabled", True):
111
+ return
112
+
113
+ try:
114
+ # Add timestamp
115
+             metrics = {**metrics, 'timestamp': datetime.now().isoformat()}  # copy so the caller's dict is not mutated
116
+ if step is not None:
117
+ metrics['step'] = step
118
+
119
+ # Log to Trackio
120
+ self.trackio_client.log_metrics(
121
+ experiment_id=self.experiment_id,
122
+ metrics=metrics,
123
+ step=step
124
+ )
125
+
126
+ # Store locally
127
+ self.metrics_history.append(metrics)
128
+
129
+ logger.debug(f"Metrics logged: {metrics}")
130
+
131
+ except Exception as e:
132
+ logger.error(f"Failed to log metrics: {e}")
133
+
134
+ def log_model_checkpoint(self, checkpoint_path: str, step: Optional[int] = None):
135
+ """Log model checkpoint"""
136
+ if not self.enable_tracking or not self.log_artifacts:
137
+ return
138
+
139
+ try:
140
+ # Log checkpoint as artifact
141
+ self.trackio_client.log_artifact(
142
+ experiment_id=self.experiment_id,
143
+ file_path=checkpoint_path,
144
+                 artifact_name=f"checkpoint_step_{step}" if step is not None else "checkpoint"
145
+ )
146
+
147
+ self.artifacts.append(checkpoint_path)
148
+ logger.info(f"Checkpoint logged: {checkpoint_path}")
149
+
150
+ except Exception as e:
151
+ logger.error(f"Failed to log checkpoint: {e}")
152
+
153
+ def log_evaluation_results(self, results: Dict[str, Any], step: Optional[int] = None):
154
+ """Log evaluation results"""
155
+ if not self.enable_tracking:
156
+ return
157
+
158
+ try:
159
+ # Add evaluation prefix to metrics
160
+ eval_metrics = {f"eval_{k}": v for k, v in results.items()}
161
+
162
+ self.log_metrics(eval_metrics, step)
163
+
164
+ # Save evaluation results locally
165
+ eval_path = f"eval_results_step_{step}_{self.start_time.strftime('%Y%m%d_%H%M%S')}.json"
166
+ with open(eval_path, 'w') as f:
167
+ json.dump(results, f, indent=2, default=str)
168
+
169
+ self.artifacts.append(eval_path)
170
+ logger.info(f"Evaluation results logged and saved to {eval_path}")
171
+
172
+ except Exception as e:
173
+ logger.error(f"Failed to log evaluation results: {e}")
174
+
175
+ def log_system_metrics(self, step: Optional[int] = None):
176
+ """Log system metrics (GPU, memory, etc.)"""
177
+ if not self.enable_tracking:
178
+ return
179
+
180
+ try:
181
+ system_metrics = {}
182
+
183
+ # GPU metrics
184
+ if torch.cuda.is_available():
185
+ for i in range(torch.cuda.device_count()):
186
+ system_metrics[f'gpu_{i}_memory_allocated'] = torch.cuda.memory_allocated(i) / 1024**3 # GB
187
+ system_metrics[f'gpu_{i}_memory_reserved'] = torch.cuda.memory_reserved(i) / 1024**3 # GB
188
+ system_metrics[f'gpu_{i}_utilization'] = torch.cuda.utilization(i) if hasattr(torch.cuda, 'utilization') else 0
189
+
190
+ # CPU and memory metrics (basic)
191
+ import psutil
192
+ system_metrics['cpu_percent'] = psutil.cpu_percent()
193
+ system_metrics['memory_percent'] = psutil.virtual_memory().percent
194
+
195
+ self.log_metrics(system_metrics, step)
196
+
197
+ except Exception as e:
198
+ logger.error(f"Failed to log system metrics: {e}")
199
+
200
+ def log_training_summary(self, summary: Dict[str, Any]):
201
+ """Log training summary at the end"""
202
+ if not self.enable_tracking:
203
+ return
204
+
205
+ try:
206
+ # Add experiment duration
207
+ end_time = datetime.now()
208
+ duration = (end_time - self.start_time).total_seconds()
209
+ summary['experiment_duration_seconds'] = duration
210
+ summary['experiment_duration_hours'] = duration / 3600
211
+
212
+ # Log final summary
213
+ self.trackio_client.log_parameters(
214
+ experiment_id=self.experiment_id,
215
+ parameters=summary
216
+ )
217
+
218
+ # Save summary locally
219
+ summary_path = f"training_summary_{self.experiment_name}_{self.start_time.strftime('%Y%m%d_%H%M%S')}.json"
220
+ with open(summary_path, 'w') as f:
221
+ json.dump(summary, f, indent=2, default=str)
222
+
223
+ self.artifacts.append(summary_path)
224
+ logger.info(f"Training summary logged and saved to {summary_path}")
225
+
226
+ except Exception as e:
227
+ logger.error(f"Failed to log training summary: {e}")
228
+
229
+ def create_monitoring_callback(self):
230
+ """Create a callback for integration with Hugging Face Trainer"""
231
+ if not self.enable_tracking:
232
+ return None
233
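+         from transformers import TrainerCallback  # base class supplies the no-op hooks Trainer invokes
234
+         class TrackioCallback(TrainerCallback):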
235
+ def __init__(self, monitor):
236
+ self.monitor = monitor
237
+
238
+ def on_log(self, args, state, control, logs=None, **kwargs):
239
+ """Called when logs are created"""
240
+ if logs:
241
+ self.monitor.log_metrics(logs, state.global_step)
242
+ self.monitor.log_system_metrics(state.global_step)
243
+
244
+ def on_save(self, args, state, control, **kwargs):
245
+ """Called when a checkpoint is saved"""
246
+ checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
247
+ if os.path.exists(checkpoint_path):
248
+ self.monitor.log_model_checkpoint(checkpoint_path, state.global_step)
249
+
250
+ def on_evaluate(self, args, state, control, metrics=None, **kwargs):
251
+ """Called when evaluation is performed"""
252
+ if metrics:
253
+ self.monitor.log_evaluation_results(metrics, state.global_step)
254
+
255
+ return TrackioCallback(self)
256
+
257
+ def get_experiment_url(self) -> Optional[str]:
258
+ """Get the URL to view the experiment in Trackio"""
259
+ if self.trackio_client and self.experiment_id:
260
+ return f"{self.trackio_client.url}/experiments/{self.experiment_id}"
261
+ return None
262
+
263
+ def close(self):
264
+ """Close the monitoring session"""
265
+ if self.enable_tracking and self.trackio_client:
266
+ try:
267
+ # Mark experiment as completed
268
+ self.trackio_client.update_experiment_status(
269
+ experiment_id=self.experiment_id,
270
+ status="completed"
271
+ )
272
+ logger.info("Monitoring session closed")
273
+ except Exception as e:
274
+ logger.error(f"Failed to close monitoring session: {e}")
275
+
276
+ # Utility function to create monitor from config
277
+ def create_monitor_from_config(config, experiment_name: Optional[str] = None) -> SmolLM3Monitor:
278
+ """Create a monitor instance from configuration"""
279
+ if experiment_name is None:
280
+ experiment_name = f"smollm3_finetune_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
281
+
282
+ # Extract monitoring configuration
283
+ trackio_url = getattr(config, 'trackio_url', None)
284
+ trackio_token = getattr(config, 'trackio_token', None)
285
+ enable_tracking = getattr(config, 'enable_tracking', True)
286
+ log_artifacts = getattr(config, 'log_artifacts', True)
287
+ log_metrics = getattr(config, 'log_metrics', True)
288
+ log_config = getattr(config, 'log_config', True)
289
+
290
+ return SmolLM3Monitor(
291
+ experiment_name=experiment_name,
292
+ trackio_url=trackio_url,
293
+ trackio_token=trackio_token,
294
+ enable_tracking=enable_tracking,
295
+ log_artifacts=log_artifacts,
296
+ log_metrics=log_metrics,
297
+ log_config=log_config
298
+ )
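`SmolLM3Monitor` takes boolean flags named `log_metrics` and `log_config` and also exposes methods with those same names, so how the flags are stored on the instance matters. A minimal sketch of the pitfall, using two toy classes (`Shadowed` and `Distinct`, and the attribute `log_metrics_enabled`, are hypothetical names for illustration only):

```python
class Shadowed:
    def __init__(self, log_metrics: bool = True):
        # Instance attribute with the same name as the method below:
        # attribute lookup on the instance now finds the bool, not the method.
        self.log_metrics = log_metrics

    def log_metrics(self, metrics):
        return metrics


class Distinct:
    def __init__(self, log_metrics: bool = True):
        self.log_metrics_enabled = log_metrics  # hypothetical attribute name

    def log_metrics(self, metrics):
        return metrics if self.log_metrics_enabled else None


try:
    Shadowed().log_metrics({"loss": 1.0})
    shadowed_error = None
except TypeError as exc:
    shadowed_error = exc  # calling a bool raises TypeError

result = Distinct().log_metrics({"loss": 1.0})
```

Storing the flag under a different name than the method keeps both usable; the same reasoning applies to any class whose constructor arguments mirror its method names.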
push_to_huggingface.py ADDED
@@ -0,0 +1,486 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Push Trained Model and Results to Hugging Face Hub
4
+ Integrates with Trackio monitoring and provides complete model deployment
5
+ """
6
+
7
+ import os
8
+ import json
9
+ import argparse
10
+ import logging
11
+ from pathlib import Path
12
+ from typing import Dict, Any, Optional, List
13
+ from datetime import datetime
14
+ import subprocess
15
+ import shutil
16
+
17
+ try:
18
+ from huggingface_hub import HfApi, create_repo, upload_file
19
+ from huggingface_hub import snapshot_download, hf_hub_download
20
+ HF_AVAILABLE = True
21
+ except ImportError:
22
+ HF_AVAILABLE = False
23
+ print("Warning: huggingface_hub not available. Install with: pip install huggingface_hub")
24
+
25
+ try:
26
+ from monitoring import SmolLM3Monitor
27
+ MONITORING_AVAILABLE = True
28
+ except ImportError:
29
+ MONITORING_AVAILABLE = False
30
+ print("Warning: monitoring module not available")
31
+
32
+ logger = logging.getLogger(__name__)
33
+
34
+ class HuggingFacePusher:
35
+ """Push trained models and results to Hugging Face Hub"""
36
+
37
+ def __init__(
38
+ self,
39
+ model_path: str,
40
+ repo_name: str,
41
+ token: Optional[str] = None,
42
+ private: bool = False,
43
+ trackio_url: Optional[str] = None,
44
+ experiment_name: Optional[str] = None
45
+ ):
46
+ self.model_path = Path(model_path)
47
+ self.repo_name = repo_name
48
+ self.token = token or os.getenv('HF_TOKEN')
49
+ self.private = private
50
+ self.trackio_url = trackio_url
51
+ self.experiment_name = experiment_name
52
+
53
+ # Initialize HF API
54
+ if HF_AVAILABLE:
55
+ self.api = HfApi(token=self.token)
56
+ else:
57
+ raise ImportError("huggingface_hub is required. Install with: pip install huggingface_hub")
58
+
59
+ # Initialize monitoring if available
60
+ self.monitor = None
61
+ if MONITORING_AVAILABLE and trackio_url:
62
+ self.monitor = SmolLM3Monitor(
63
+ experiment_name=experiment_name or "model_push",
64
+ trackio_url=trackio_url,
65
+ enable_tracking=True
66
+ )
67
+
68
+ logger.info(f"Initialized HuggingFacePusher for {repo_name}")
69
+
70
+ def create_repository(self) -> bool:
71
+ """Create the Hugging Face repository"""
72
+ try:
73
+ logger.info(f"Creating repository: {self.repo_name}")
74
+
75
+ # Create repository
76
+ create_repo(
77
+ repo_id=self.repo_name,
78
+ token=self.token,
79
+ private=self.private,
80
+ exist_ok=True
81
+ )
82
+
83
+ logger.info(f"✅ Repository created: https://huggingface.co/{self.repo_name}")
84
+ return True
85
+
86
+ except Exception as e:
87
+ logger.error(f"❌ Failed to create repository: {e}")
88
+ return False
89
+
90
+ def validate_model_path(self) -> bool:
91
+ """Validate that the model path contains required files"""
92
+ required_files = [
93
+ "config.json",
94
+ "pytorch_model.bin",
95
+ "tokenizer.json",
96
+ "tokenizer_config.json"
97
+ ]
98
+
99
+ missing_files = []
100
+ for file in required_files:
101
+ if not (self.model_path / file).exists():
102
+ missing_files.append(file)
103
+
104
+ if missing_files:
105
+ logger.error(f"❌ Missing required files: {missing_files}")
106
+ return False
107
+
108
+ logger.info("✅ Model files validated")
109
+ return True
110
+
111
+ def create_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
112
+ """Create a comprehensive model card"""
113
+ model_card = f"""---
114
+ language:
115
+ - en
116
+ license: mit
117
+ tags:
118
+ - smollm3
119
+ - fine-tuned
120
+ - text-generation
121
+ - transformers
122
+ ---
123
+
124
+ # {self.repo_name.split('/')[-1]}
125
+
126
+ This is a fine-tuned SmolLM3 model based on the HuggingFaceTB/SmolLM3-3B architecture.
127
+
128
+ ## Model Details
129
+
130
+ - **Base Model**: HuggingFaceTB/SmolLM3-3B
131
+ - **Fine-tuning Method**: Supervised Fine-tuning
132
+ - **Training Date**: {datetime.now().strftime('%Y-%m-%d')}
133
+ - **Model Size**: {self._get_model_size():.1f} GB
134
+
135
+ ## Training Configuration
136
+
137
+ ```json
138
+ {json.dumps(training_config, indent=2, default=str)}
139
+ ```
140
+
141
+ ## Training Results
142
+
143
+ ```json
144
+ {json.dumps(results, indent=2, default=str)}
145
+ ```
146
+
147
+ ## Usage
148
+
149
+ ```python
150
+ from transformers import AutoModelForCausalLM, AutoTokenizer
151
+
152
+ # Load model and tokenizer
153
+ model = AutoModelForCausalLM.from_pretrained("{self.repo_name}")
154
+ tokenizer = AutoTokenizer.from_pretrained("{self.repo_name}")
155
+
156
+ # Generate text
157
+ inputs = tokenizer("Hello, how are you?", return_tensors="pt")
158
+ outputs = model.generate(**inputs, max_new_tokens=100)
159
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
160
+ ```
161
+
162
+ ## Training Information
163
+
164
+ - **Framework**: Transformers
165
+ - **Hardware**: {self._get_hardware_info()}
166
+ - **Training Time**: {results.get('training_time_hours', 'Unknown')} hours
167
+ - **Final Loss**: {results.get('final_loss', 'Unknown')}
168
+ - **Final Accuracy**: {results.get('final_accuracy', 'Unknown')}
169
+
170
+ ## Model Performance
171
+
172
+ - **Training Loss**: {results.get('train_loss', 'Unknown')}
173
+ - **Validation Loss**: {results.get('eval_loss', 'Unknown')}
174
+ - **Training Steps**: {results.get('total_steps', 'Unknown')}
175
+
176
+ ## Limitations and Biases
177
+
178
+ This model is fine-tuned for specific tasks and may not generalize well to all use cases. Please evaluate the model's performance on your specific task before deployment.
179
+
180
+ ## License
181
+
182
+ This model is licensed under the MIT License.
183
+ """
184
+ return model_card
185
+
186
+ def _get_model_size(self) -> float:
187
+ """Get model size in GB"""
188
+ try:
189
+ total_size = 0
190
+ for file in self.model_path.rglob("*"):
191
+ if file.is_file():
192
+ total_size += file.stat().st_size
193
+ return total_size / (1024**3) # Convert to GB
194
+         except Exception:
195
+ return 0.0
196
+
197
+ def _get_hardware_info(self) -> str:
198
+ """Get hardware information"""
199
+ try:
200
+ import torch
201
+ if torch.cuda.is_available():
202
+ gpu_name = torch.cuda.get_device_name(0)
203
+ return f"GPU: {gpu_name}"
204
+ else:
205
+ return "CPU"
206
+         except Exception:
207
+ return "Unknown"
208
+
209
+ def upload_model_files(self) -> bool:
210
+ """Upload model files to Hugging Face Hub"""
211
+ try:
212
+ logger.info("Uploading model files...")
213
+
214
+ # Upload all files in the model directory
215
+ for file_path in self.model_path.rglob("*"):
216
+ if file_path.is_file():
217
+ relative_path = file_path.relative_to(self.model_path)
218
+ remote_path = str(relative_path)
219
+
220
+ logger.info(f"Uploading {relative_path}")
221
+ upload_file(
222
+ path_or_fileobj=str(file_path),
223
+ path_in_repo=remote_path,
224
+ repo_id=self.repo_name,
225
+ token=self.token
226
+ )
227
+
228
+ logger.info("✅ Model files uploaded successfully")
229
+ return True
230
+
231
+ except Exception as e:
232
+ logger.error(f"❌ Failed to upload model files: {e}")
233
+ return False
234
+
235
+ def upload_training_results(self, results_path: str) -> bool:
236
+ """Upload training results and logs"""
237
+ try:
238
+ logger.info("Uploading training results...")
239
+
240
+ results_files = [
241
+ "train_results.json",
242
+ "eval_results.json",
243
+ "training_config.json",
244
+ "training.log"
245
+ ]
246
+
247
+ for file_name in results_files:
248
+ file_path = Path(results_path) / file_name
249
+ if file_path.exists():
250
+ logger.info(f"Uploading {file_name}")
251
+ upload_file(
252
+ path_or_fileobj=str(file_path),
253
+ path_in_repo=f"training_results/{file_name}",
254
+ repo_id=self.repo_name,
255
+ token=self.token
256
+ )
257
+
258
+ logger.info("✅ Training results uploaded successfully")
259
+ return True
260
+
261
+ except Exception as e:
262
+ logger.error(f"❌ Failed to upload training results: {e}")
263
+ return False
264
+
265
+ def create_readme(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> bool:
266
+ """Create and upload README.md"""
267
+ try:
268
+ logger.info("Creating README.md...")
269
+
270
+ readme_content = f"""# {self.repo_name.split('/')[-1]}
271
+
272
+ A fine-tuned SmolLM3 model for text generation tasks.
273
+
274
+ ## Quick Start
275
+
276
+ ```python
277
+ from transformers import AutoModelForCausalLM, AutoTokenizer
278
+
279
+ model = AutoModelForCausalLM.from_pretrained("{self.repo_name}")
280
+ tokenizer = AutoTokenizer.from_pretrained("{self.repo_name}")
281
+
282
+ # Generate text
283
+ text = "Hello, how are you?"
284
+ inputs = tokenizer(text, return_tensors="pt")
285
+ outputs = model.generate(**inputs, max_new_tokens=100)
286
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
287
+ ```
288
+
289
+ ## Model Information
290
+
291
+ - **Base Model**: HuggingFaceTB/SmolLM3-3B
292
+ - **Fine-tuning Date**: {datetime.now().strftime('%Y-%m-%d')}
293
+ - **Model Size**: {self._get_model_size():.1f} GB
294
+ - **Training Steps**: {results.get('total_steps', 'Unknown')}
295
+ - **Final Loss**: {results.get('final_loss', 'Unknown')}
296
+
297
+ ## Training Configuration
298
+
299
+ ```json
300
+ {json.dumps(training_config, indent=2, default=str)}
301
+ ```
302
+
303
+ ## Performance Metrics
304
+
305
+ ```json
306
+ {json.dumps(results, indent=2, default=str)}
307
+ ```
308
+
309
+ ## Files
310
+
311
+ - `pytorch_model.bin`: Model weights
312
+ - `config.json`: Model configuration
313
+ - `tokenizer.json`: Tokenizer configuration
314
+ - `training_results/`: Training logs and results
315
+
316
+ ## License
317
+
318
+ MIT License
319
+ """
320
+
321
+ # Write README to temporary file
322
+ readme_path = Path("temp_readme.md")
323
+             with open(readme_path, "w", encoding="utf-8") as f:
324
+ f.write(readme_content)
325
+
326
+ # Upload README
327
+ upload_file(
328
+ path_or_fileobj=str(readme_path),
329
+ path_in_repo="README.md",
330
+ repo_id=self.repo_name,
331
+ token=self.token
332
+ )
333
+
334
+ # Clean up
335
+ readme_path.unlink()
336
+
337
+ logger.info("✅ README.md uploaded successfully")
338
+ return True
339
+
340
+ except Exception as e:
341
+ logger.error(f"❌ Failed to create README: {e}")
342
+ return False
343
+
344
+ def log_to_trackio(self, action: str, details: Dict[str, Any]):
345
+ """Log push action to Trackio"""
346
+ if self.monitor:
347
+ try:
348
+ self.monitor.log_metrics({
349
+ "push_action": action,
350
+ "repo_name": self.repo_name,
351
+ "model_size_gb": self._get_model_size(),
352
+ **details
353
+ })
354
+ logger.info(f"✅ Logged {action} to Trackio")
355
+ except Exception as e:
356
+ logger.error(f"❌ Failed to log to Trackio: {e}")
357
+
358
+ def push_model(self, training_config: Optional[Dict[str, Any]] = None,
359
+ results: Optional[Dict[str, Any]] = None) -> bool:
360
+ """Complete model push process"""
361
+ logger.info(f"🚀 Starting model push to {self.repo_name}")
362
+
363
+ # Validate model path
364
+ if not self.validate_model_path():
365
+ return False
366
+
367
+ # Create repository
368
+ if not self.create_repository():
369
+ return False
370
+
371
+ # Load training config and results if not provided
372
+ if training_config is None:
373
+ training_config = self._load_training_config()
374
+
375
+ if results is None:
376
+ results = self._load_training_results()
377
+
378
+ # Create and upload model card
379
+ model_card = self.create_model_card(training_config, results)
380
+ model_card_path = Path("temp_model_card.md")
381
+         with open(model_card_path, "w", encoding="utf-8") as f:
382
+ f.write(model_card)
383
+
384
+ try:
385
+ upload_file(
386
+ path_or_fileobj=str(model_card_path),
387
+ path_in_repo="README.md",
388
+ repo_id=self.repo_name,
389
+ token=self.token
390
+ )
391
+ finally:
392
+ model_card_path.unlink()
393
+
394
+ # Upload model files
395
+ if not self.upload_model_files():
396
+ return False
397
+
398
+ # Upload training results
399
+ if results:
400
+ self.upload_training_results(str(self.model_path))
401
+
402
+ # Log to Trackio
403
+ self.log_to_trackio("model_push", {
404
+ "model_path": str(self.model_path),
405
+ "repo_name": self.repo_name,
406
+ "private": self.private,
407
+ "training_config": training_config,
408
+ "results": results
409
+ })
410
+
411
+ logger.info(f"🎉 Model successfully pushed to: https://huggingface.co/{self.repo_name}")
412
+ return True
413
+
414
+ def _load_training_config(self) -> Dict[str, Any]:
415
+ """Load training configuration"""
416
+ config_path = self.model_path / "training_config.json"
417
+ if config_path.exists():
418
+ with open(config_path, "r") as f:
419
+ return json.load(f)
420
+ return {"model_name": "HuggingFaceTB/SmolLM3-3B"}
421
+
422
+ def _load_training_results(self) -> Dict[str, Any]:
423
+ """Load training results"""
424
+ results_path = self.model_path / "train_results.json"
425
+ if results_path.exists():
426
+ with open(results_path, "r") as f:
427
+ return json.load(f)
428
+ return {"final_loss": "Unknown", "total_steps": "Unknown"}
429
+
430
+ def parse_args():
431
+ """Parse command line arguments"""
432
+ parser = argparse.ArgumentParser(description='Push trained model to Hugging Face Hub')
433
+
434
+ # Required arguments
435
+ parser.add_argument('model_path', type=str, help='Path to trained model directory')
436
+ parser.add_argument('repo_name', type=str, help='Hugging Face repository name (username/repo-name)')
437
+
438
+ # Optional arguments
439
+ parser.add_argument('--token', type=str, default=None, help='Hugging Face token')
440
+ parser.add_argument('--private', action='store_true', help='Make repository private')
441
+ parser.add_argument('--trackio-url', type=str, default=None, help='Trackio Space URL for logging')
442
+ parser.add_argument('--experiment-name', type=str, default=None, help='Experiment name for Trackio')
443
+
444
+ return parser.parse_args()
445
+
446
+ def main():
447
+ """Main function"""
448
+ args = parse_args()
449
+
450
+ # Setup logging
451
+ logging.basicConfig(
452
+ level=logging.INFO,
453
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
454
+ )
455
+
456
+ logger.info("Starting model push to Hugging Face Hub")
457
+
458
+ # Initialize pusher
459
+ try:
460
+ pusher = HuggingFacePusher(
461
+ model_path=args.model_path,
462
+ repo_name=args.repo_name,
463
+ token=args.token,
464
+ private=args.private,
465
+ trackio_url=args.trackio_url,
466
+ experiment_name=args.experiment_name
467
+ )
468
+
469
+ # Push model
470
+ success = pusher.push_model()
471
+
472
+ if success:
473
+ logger.info("✅ Model push completed successfully!")
474
+ logger.info(f"🌐 View your model at: https://huggingface.co/{args.repo_name}")
475
+ else:
476
+ logger.error("❌ Model push failed!")
477
+ return 1
478
+
479
+ except Exception as e:
480
+ logger.error(f"❌ Error during model push: {e}")
481
+ return 1
482
+
483
+ return 0
484
+
485
+ if __name__ == "__main__":
486
+     raise SystemExit(main())
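`validate_model_path` above requires `pytorch_model.bin`. To my understanding, recent `transformers` releases save weights as `model.safetensors` by default (sharded checkpoints add an index file), so a directory written by a newer version could fail that check. A hedged sketch of a more tolerant check; `find_missing_files`, `WEIGHT_CANDIDATES`, and `REQUIRED_FILES` are hypothetical names, and the candidate list is an assumption about common file layouts:

```python
from pathlib import Path
from typing import List

# Assumption: the presence of any one of these means weights were saved.
WEIGHT_CANDIDATES = (
    "model.safetensors",
    "model.safetensors.index.json",  # sharded safetensors checkpoint
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",  # sharded .bin checkpoint
)
REQUIRED_FILES = ("config.json", "tokenizer.json", "tokenizer_config.json")


def find_missing_files(model_dir: Path) -> List[str]:
    """Return the names of required files that are absent from model_dir."""
    missing = [name for name in REQUIRED_FILES if not (model_dir / name).exists()]
    if not any((model_dir / name).exists() for name in WEIGHT_CANDIDATES):
        missing.append("model weights (" + " | ".join(WEIGHT_CANDIDATES) + ")")
    return missing
```

An empty list means the directory looks complete; the pusher could log the returned names in place of its current `missing_files` list.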
requirements.txt CHANGED
@@ -32,4 +32,11 @@ sentencepiece>=0.1.99
32
  # Development
33
  pytest>=7.0.0
34
  black>=23.0.0
35
- isort>=5.12.0
35
+ isort>=5.12.0
36
+
37
+ # Experiment tracking and monitoring
38
+ trackio>=0.1.0
39
+ psutil>=5.9.0
40
+
41
+ # Hugging Face Hub integration
42
+ huggingface_hub>=0.16.0
requirements_space.txt ADDED
@@ -0,0 +1,18 @@
1
+ # Gradio and web interface
2
+ gradio>=4.0.0
3
+ gradio-client>=0.10.0
4
+
5
+ # Core dependencies for Trackio Space
6
+ requests>=2.31.0
7
+ numpy>=1.24.0
8
+ pandas>=2.0.0
9
+
10
+ # JSON and data handling
11
+ jsonschema>=4.17.0
12
+
13
+ # Optional: for better UI
14
+ plotly>=5.15.0
15
+ matplotlib>=3.7.0
16
+
17
+ # Development and debugging
18
+ python-dotenv>=1.0.0
run_a100_large_experiment.py ADDED
@@ -0,0 +1,134 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Script to run A100 large-scale experiments on OpenHermes-FR dataset
4
+ Supports multiple configurations for different training scenarios
5
+ """
6
+
7
+ import argparse
8
+ import os
9
+ import sys
10
+ from pathlib import Path
11
+
12
+ def main():
13
+ parser = argparse.ArgumentParser(description="Run A100 large-scale experiments")
14
+ parser.add_argument(
15
+ "--config",
16
+ type=str,
17
+ default="config/train_smollm3_openhermes_fr_a100_large.py",
18
+ help="Configuration file to use"
19
+ )
20
+ parser.add_argument(
21
+ "--experiment-name",
22
+ type=str,
23
+ help="Custom experiment name for tracking"
24
+ )
25
+ parser.add_argument(
26
+ "--output-dir",
27
+ type=str,
28
+ default="./outputs",
29
+ help="Output directory for checkpoints and logs"
30
+ )
31
+ parser.add_argument(
32
+ "--resume",
33
+ type=str,
34
+ help="Resume training from checkpoint"
35
+ )
36
+ parser.add_argument(
37
+ "--dry-run",
38
+ action="store_true",
39
+ help="Print configuration without starting training"
40
+ )
41
+
42
+ args = parser.parse_args()
43
+
44
+ # Add the current directory to Python path
45
+ sys.path.insert(0, str(Path(__file__).parent))
46
+
47
+ # Import the configuration
48
+ try:
49
+ from config.train_smollm3_openhermes_fr_a100_large import get_config as get_large_config
50
+ from config.train_smollm3_openhermes_fr_a100_multiple_passes import get_config as get_multiple_passes_config
51
+
52
+ # Map config files to their respective functions
53
+ config_map = {
54
+ "config/train_smollm3_openhermes_fr_a100_large.py": get_large_config,
55
+ "config/train_smollm3_openhermes_fr_a100_multiple_passes.py": get_multiple_passes_config,
56
+ }
57
+
58
+ if args.config in config_map:
59
+ config = config_map[args.config](args.config)
60
+ else:
61
+ # Try to load from the specified config file
62
+ config = get_large_config(args.config)
63
+
64
+ except ImportError as e:
65
+ print(f"Error importing configuration: {e}")
66
+ print("Available configurations:")
67
+ print(" - config/train_smollm3_openhermes_fr_a100_large.py (Large batch, 1.3 passes)")
68
+ print(" - config/train_smollm3_openhermes_fr_a100_multiple_passes.py (Multiple passes, 4 epochs)")
69
+ return 1
70
+
71
+ # Override experiment name if provided
72
+ if args.experiment_name:
73
+ config.experiment_name = args.experiment_name
74
+
75
+ # Create output directory
76
+ os.makedirs(args.output_dir, exist_ok=True)
77
+
78
+ # Print configuration summary
79
+ print(f"\n{'='*60}")
80
+     print("EXPERIMENT CONFIGURATION")
81
+ print(f"{'='*60}")
82
+ print(f"Config file: {args.config}")
83
+ print(f"Experiment name: {config.experiment_name}")
84
+ print(f"Output directory: {args.output_dir}")
85
+ print(f"Model: {config.model_name}")
86
+ print(f"Batch size: {config.batch_size}")
87
+ print(f"Gradient accumulation: {config.gradient_accumulation_steps}")
88
+ print(f"Effective batch size: {config.batch_size * config.gradient_accumulation_steps}")
89
+ print(f"Learning rate: {config.learning_rate}")
90
+ print(f"Max iterations: {config.max_iters}")
91
+ print(f"Max sequence length: {config.max_seq_length}")
92
+ print(f"Mixed precision: {'bf16' if config.bf16 else 'fp16'}")
93
+ print(f"Dataset: {config.dataset_name}")
94
+ print(f"{'='*60}\n")
95
+
96
+ if args.dry_run:
97
+ print("DRY RUN - Configuration printed above. Use without --dry-run to start training.")
98
+ return 0
99
+
100
+ # Import and run training
101
+ try:
102
+ from train import main as train_main
103
+
104
+ # Set up training arguments
105
+ train_args = [
106
+ "--config", args.config,
107
+ "--output-dir", args.output_dir,
108
+ ]
109
+
110
+ if args.resume:
111
+ train_args.extend(["--resume", args.resume])
112
+
113
+ # Override sys.argv for the training script
114
+ original_argv = sys.argv
115
+ sys.argv = ["train.py"] + train_args
116
+
117
+         # Run training; restore argv even if training raises
118
+         try:
119
+             train_main()
120
+         finally:
121
+             sys.argv = original_argv
122
+
123
+ except ImportError as e:
124
+ print(f"Error importing training module: {e}")
125
+ print("Make sure train.py is available in the current directory.")
126
+ return 1
127
+ except Exception as e:
128
+ print(f"Error during training: {e}")
129
+ return 1
130
+
131
+ return 0
132
+
133
+ if __name__ == "__main__":
134
+ sys.exit(main())
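The launcher temporarily replaces `sys.argv` before calling `train_main()`. A minimal sketch of the same idea as a reusable context manager follows; `patched_argv` and `fake_train_main` are hypothetical names used only for illustration, and the real `train.main()` is not called here.

```python
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv, restoring it even if the body raises."""
    original = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = original


def fake_train_main():
    # Hypothetical stand-in for train.main(); it only reads sys.argv.
    return sys.argv[1:]


with patched_argv(["train.py", "--config", "cfg.py"]):
    seen = fake_train_main()
```

Wrapping the swap in a context manager keeps the restore step in one place, so an exception raised during training cannot leave the process with a modified `sys.argv`.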
test_monitoring.py ADDED
@@ -0,0 +1,181 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Quick Start Script for Trackio Integration
4
+ Tests the monitoring functionality without full training
5
+ """
6
+
7
+ import os
8
+ import json
9
+ import logging
10
+ from datetime import datetime
11
+ from monitoring import SmolLM3Monitor
12
+
13
+ def setup_logging():
14
+ """Setup logging configuration"""
15
+ logging.basicConfig(
16
+ level=logging.INFO,
17
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
18
+ )
19
+ return logging.getLogger(__name__)
20
+
21
+ def test_trackio_integration():
22
+ """Test Trackio integration with sample data"""
23
+ logger = setup_logging()
24
+
25
+ print("🚀 Testing Trackio Integration")
26
+ print("=" * 40)
27
+
28
+ # Get Trackio URL from user or environment
29
+ trackio_url = os.getenv('TRACKIO_URL')
30
+ if not trackio_url:
31
+ trackio_url = input("Enter your Trackio Space URL (or press Enter to skip): ").strip()
32
+ if not trackio_url:
33
+ print("⚠️ No Trackio URL provided. Running in local mode only.")
34
+ trackio_url = None
35
+
36
+ # Initialize monitor
37
+ experiment_name = f"test_experiment_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
38
+
39
+ monitor = SmolLM3Monitor(
40
+ experiment_name=experiment_name,
41
+ trackio_url=trackio_url,
42
+ enable_tracking=trackio_url is not None,
43
+ log_artifacts=True,
44
+ log_metrics=True,
45
+ log_config=True
46
+ )
47
+
48
+ print(f"✅ Monitor initialized for experiment: {experiment_name}")
49
+
50
+ # Test configuration logging
51
+ sample_config = {
52
+ "model_name": "HuggingFaceTB/SmolLM3-3B",
53
+ "batch_size": 4,
54
+ "learning_rate": 2e-5,
55
+ "max_iters": 1000,
56
+ "max_seq_length": 4096,
57
+ "test_mode": True
58
+ }
59
+
60
+ print("📝 Logging configuration...")
61
+ monitor.log_config(sample_config)
62
+
63
+ # Test metrics logging
64
+ print("📊 Logging sample metrics...")
65
+ for step in range(0, 100, 10):
66
+ metrics = {
67
+ "loss": 2.0 - (step * 0.015), # Simulate decreasing loss
68
+ "accuracy": 0.5 + (step * 0.004), # Simulate increasing accuracy
69
+ "learning_rate": 2e-5,
70
+ "step": step
71
+ }
72
+ monitor.log_metrics(metrics, step=step)
73
+ print(f" Step {step}: loss={metrics['loss']:.3f}, accuracy={metrics['accuracy']:.3f}")
74
+
75
+ # Test system metrics
76
+ print("💻 Logging system metrics...")
77
+ monitor.log_system_metrics(step=50)
78
+
79
+ # Test evaluation results
80
+ print("📈 Logging evaluation results...")
81
+ eval_results = {
82
+ "eval_loss": 1.2,
83
+ "eval_accuracy": 0.85,
84
+ "perplexity": 3.3,
85
+ "bleu_score": 0.72
86
+ }
87
+ monitor.log_evaluation_results(eval_results, step=100)
88
+
89
+ # Test training summary
90
+ print("📋 Logging training summary...")
91
+ summary = {
92
+ "final_loss": 0.5,
93
+ "final_accuracy": 0.89,
94
+ "total_steps": 100,
95
+ "training_time_hours": 2.5,
96
+ "model_size_gb": 6.2,
97
+ "test_mode": True
98
+ }
99
+ monitor.log_training_summary(summary)
100
+
101
+ # Close monitoring
102
+ monitor.close()
103
+
104
+ print("✅ Trackio integration test completed!")
105
+
106
+ if trackio_url:
107
+ experiment_url = monitor.get_experiment_url()
108
+ if experiment_url:
109
+ print(f"🌐 View your experiment at: {experiment_url}")
110
+
111
+ return True
112
+
113
+ def test_local_monitoring():
114
+ """Test local monitoring without Trackio"""
115
+ logger = setup_logging()
116
+
117
+ print("🔧 Testing Local Monitoring")
118
+ print("=" * 30)
119
+
120
+ # Initialize monitor without Trackio
121
+ experiment_name = f"local_test_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
122
+
123
+ monitor = SmolLM3Monitor(
124
+ experiment_name=experiment_name,
125
+ enable_tracking=False, # Disable Trackio
126
+ log_artifacts=True,
127
+ log_metrics=True,
128
+ log_config=True
129
+ )
130
+
131
+ print(f"✅ Local monitor initialized for experiment: {experiment_name}")
132
+
133
+ # Test local logging
134
+ sample_config = {
135
+ "model_name": "HuggingFaceTB/SmolLM3-3B",
136
+ "batch_size": 4,
137
+ "learning_rate": 2e-5,
138
+ "local_test": True
139
+ }
140
+
141
+ print("📝 Logging configuration locally...")
142
+ monitor.log_config(sample_config)
143
+
144
+ # Test local metrics
145
+ print("📊 Logging sample metrics locally...")
146
+ for step in range(0, 50, 10):
147
+ metrics = {
148
+ "loss": 1.8 - (step * 0.02),
149
+ "accuracy": 0.6 + (step * 0.005),
150
+ "step": step
151
+ }
152
+ monitor.log_metrics(metrics, step=step)
153
+ print(f" Step {step}: loss={metrics['loss']:.3f}, accuracy={metrics['accuracy']:.3f}")
154
+
155
+ print("✅ Local monitoring test completed!")
156
+ return True
157
+
158
+ def main():
159
+ """Main function"""
160
+ print("Trackio Integration Quick Start")
161
+ print("=" * 40)
162
+
163
+ # Test local monitoring first
164
+ test_local_monitoring()
165
+ print()
166
+
167
+ # Test Trackio integration if available
168
+ try:
169
+ test_trackio_integration()
170
+ except Exception as e:
171
+ print(f"❌ Trackio integration test failed: {e}")
172
+ print("💡 Make sure you have a valid Trackio Space URL")
173
+
174
+ print("\n🎉 Quick start completed!")
175
+ print("\nNext steps:")
176
+ print("1. Deploy Trackio to Hugging Face Spaces (see DEPLOYMENT_GUIDE.md)")
177
+ print("2. Update your training script with Trackio integration")
178
+ print("3. Run your first monitored training session")
179
+
180
+ if __name__ == "__main__":
181
+ main()
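The script above exercises only a handful of monitor methods. To try the same call sequence without the `monitoring` module, a rough in-memory stand-in can be sketched as below. `InMemoryMonitor` is hypothetical and is not the real `SmolLM3Monitor`; its method names and arguments are inferred only from the calls made in the test script.

```python
from typing import Any, Dict, List, Optional


class InMemoryMonitor:
    """Hypothetical stand-in that records calls instead of sending them anywhere."""

    def __init__(self, experiment_name: str, enable_tracking: bool = False):
        self.experiment_name = experiment_name
        self.enable_tracking = enable_tracking
        self.config: Dict[str, Any] = {}
        self.metrics: List[Dict[str, Any]] = []
        self.closed = False

    def log_config(self, config: Dict[str, Any]) -> None:
        # Keep a copy so later changes to the caller's dict are not reflected.
        self.config = dict(config)

    def log_metrics(self, metrics: Dict[str, Any], step: Optional[int] = None) -> None:
        self.metrics.append({**metrics, "step": step})

    def close(self) -> None:
        self.closed = True


monitor = InMemoryMonitor("local_test")
monitor.log_config({"batch_size": 4, "learning_rate": 2e-5})
for step in range(0, 50, 10):
    # Same simulated schedule as the local monitoring test.
    monitor.log_metrics({"loss": 1.8 - step * 0.02, "accuracy": 0.6 + step * 0.005}, step=step)
monitor.close()
```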
train.py CHANGED
@@ -76,6 +76,16 @@ def parse_args():
76
  parser.add_argument('--logging_steps', type=int, default=10,
77
  help='Log every N steps')
78
 
79
  return parser.parse_args()
80
 
81
  def main():
@@ -99,14 +109,22 @@ def main():
99
  if args.gradient_accumulation_steps is not None:
100
  config.gradient_accumulation_steps = args.gradient_accumulation_steps
101
 
102
  # Setup paths
103
- dataset_path = os.path.join('/input', args.dataset_dir)
104
  output_path = args.out_dir
105
 
106
  # Ensure output directory exists
107
  os.makedirs(output_path, exist_ok=True)
108
 
109
- logger.info(f"Dataset path: {dataset_path}")
110
  logger.info(f"Output path: {output_path}")
111
 
112
  # Initialize model
@@ -116,11 +134,23 @@ def main():
116
  config=config
117
  )
118
 
119
- # Load dataset
120
  dataset = SmolLM3Dataset(
121
  data_path=dataset_path,
122
  tokenizer=model.tokenizer,
123
- max_seq_length=args.max_seq_length
 
 
124
  )
125
 
126
  # Initialize trainer
 
76
  parser.add_argument('--logging_steps', type=int, default=10,
77
  help='Log every N steps')
78
 
79
+ # Trackio monitoring arguments
80
+ parser.add_argument('--enable_tracking', action=argparse.BooleanOptionalAction, default=None,
81
+ help='Enable or disable Trackio experiment tracking (use --no-enable_tracking to disable)')
82
+ parser.add_argument('--trackio_url', type=str, default=None,
83
+ help='Trackio server URL')
84
+ parser.add_argument('--trackio_token', type=str, default=None,
85
+ help='Trackio authentication token')
86
+ parser.add_argument('--experiment_name', type=str, default=None,
87
+ help='Custom experiment name for tracking')
88
+
89
  return parser.parse_args()
90
 
91
  def main():
 
109
  if args.gradient_accumulation_steps is not None:
110
  config.gradient_accumulation_steps = args.gradient_accumulation_steps
111
 
112
+ # Override Trackio configuration
113
+ if args.enable_tracking is not None:
114
+ config.enable_tracking = args.enable_tracking
115
+ if args.trackio_url is not None:
116
+ config.trackio_url = args.trackio_url
117
+ if args.trackio_token is not None:
118
+ config.trackio_token = args.trackio_token
119
+ if args.experiment_name is not None:
120
+ config.experiment_name = args.experiment_name
121
+
122
  # Setup paths
 
123
  output_path = args.out_dir
124
 
125
  # Ensure output directory exists
126
  os.makedirs(output_path, exist_ok=True)
127
 
 
128
  logger.info(f"Output path: {output_path}")
129
 
130
  # Initialize model
 
134
  config=config
135
  )
136
 
137
+ # Determine dataset path
138
+ if hasattr(config, 'dataset_name') and config.dataset_name:
139
+ # Use Hugging Face dataset
140
+ dataset_path = config.dataset_name
141
+ logger.info(f"Using Hugging Face dataset: {dataset_path}")
142
+ else:
143
+ # Use local dataset
144
+ dataset_path = os.path.join('/input', args.dataset_dir)
145
+ logger.info(f"Using local dataset: {dataset_path}")
146
+
147
+ # Load dataset with filtering options
148
  dataset = SmolLM3Dataset(
149
  data_path=dataset_path,
150
  tokenizer=model.tokenizer,
151
+ max_seq_length=args.max_seq_length,
152
+ filter_bad_entries=getattr(config, 'filter_bad_entries', False),
153
+ bad_entry_field=getattr(config, 'bad_entry_field', 'bad_entry')
154
  )
155
 
156
  # Initialize trainer
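The override block in `train.py` copies a command-line value into the config only when it is not `None`. For a boolean flag, that check is only meaningful if the flag has three states (on, off, unset). A minimal sketch of one way to get that, assuming Python 3.9+ for `argparse.BooleanOptionalAction`; `build_parser`, `apply_overrides`, and the `SimpleNamespace` config are illustrative stand-ins, not the repository's own code.

```python
import argparse
from types import SimpleNamespace


def build_parser():
    parser = argparse.ArgumentParser()
    # Generates both --enable_tracking and --no-enable_tracking; None means "not given".
    parser.add_argument("--enable_tracking", action=argparse.BooleanOptionalAction, default=None)
    parser.add_argument("--trackio_url", type=str, default=None)
    return parser


def apply_overrides(config, args):
    """Copy only the options the user actually passed onto the config."""
    for name in ("enable_tracking", "trackio_url"):
        value = getattr(args, name)
        if value is not None:
            setattr(config, name, value)
    return config


config = SimpleNamespace(enable_tracking=True, trackio_url=None)
args = build_parser().parse_args(["--no-enable_tracking"])
apply_overrides(config, args)
```

With this shape, omitting the flag leaves the config file's value untouched, while `--no-enable_tracking` turns tracking off explicitly.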
trainer.py CHANGED
@@ -11,6 +11,9 @@ from transformers import Trainer, TrainingArguments
11
  from trl import SFTTrainer
12
  import json
13
 
 
 
 
14
  logger = logging.getLogger(__name__)
15
 
16
  class SmolLM3Trainer:
@@ -32,6 +35,9 @@ class SmolLM3Trainer:
32
  self.init_from = init_from
33
  self.use_sft_trainer = use_sft_trainer
34
 
 
 
 
35
  # Setup trainer
36
  self.trainer = self._setup_trainer()
37
 
@@ -55,6 +61,13 @@ class SmolLM3Trainer:
55
  # Get data collator
56
  data_collator = self.dataset.get_data_collator()
57
 
 
 
 
 
 
 
 
58
  if self.use_sft_trainer:
59
  # Use SFTTrainer for supervised fine-tuning
60
  trainer = SFTTrainer(
@@ -67,6 +80,7 @@ class SmolLM3Trainer:
67
  dataset_text_field="text",
68
  max_seq_length=self.config.max_seq_length,
69
  packing=False, # Disable packing for better control
 
70
  )
71
  else:
72
  # Use standard Trainer
@@ -77,6 +91,7 @@ class SmolLM3Trainer:
77
  train_dataset=train_dataset,
78
  eval_dataset=eval_dataset,
79
  data_collator=data_collator,
 
80
  )
81
 
82
  return trainer
@@ -103,6 +118,17 @@ class SmolLM3Trainer:
103
  """Start training"""
104
  logger.info("Starting training")
105
 
106
  # Load checkpoint if resuming
107
  if self.init_from == "resume":
108
  checkpoint_path = "/input-checkpoint"
@@ -122,11 +148,26 @@ class SmolLM3Trainer:
122
  with open(os.path.join(self.output_dir, "train_results.json"), "w") as f:
123
  json.dump(train_result.metrics, f, indent=2)
124
 
125
  logger.info("Training completed successfully!")
126
  logger.info(f"Training metrics: {train_result.metrics}")
127
 
128
  except Exception as e:
129
  logger.error(f"Training failed: {e}")
 
 
 
130
  raise
131
 
132
  def evaluate(self):
 
11
  from trl import SFTTrainer
12
  import json
13
 
14
+ # Import monitoring
15
+ from monitoring import create_monitor_from_config
16
+
17
  logger = logging.getLogger(__name__)
18
 
19
  class SmolLM3Trainer:
 
35
  self.init_from = init_from
36
  self.use_sft_trainer = use_sft_trainer
37
 
38
+ # Initialize monitoring
39
+ self.monitor = create_monitor_from_config(config)
40
+
41
  # Setup trainer
42
  self.trainer = self._setup_trainer()
43
 
 
61
  # Get data collator
62
  data_collator = self.dataset.get_data_collator()
63
 
64
+ # Add monitoring callback
65
+ callbacks = []
66
+ if self.monitor and self.monitor.enable_tracking:
67
+ trackio_callback = self.monitor.create_monitoring_callback()
68
+ if trackio_callback:
69
+ callbacks.append(trackio_callback)
70
+
71
  if self.use_sft_trainer:
72
  # Use SFTTrainer for supervised fine-tuning
73
  trainer = SFTTrainer(
 
80
  dataset_text_field="text",
81
  max_seq_length=self.config.max_seq_length,
82
  packing=False, # Disable packing for better control
83
+ callbacks=callbacks,
84
  )
85
  else:
86
  # Use standard Trainer
 
91
  train_dataset=train_dataset,
92
  eval_dataset=eval_dataset,
93
  data_collator=data_collator,
94
+ callbacks=callbacks,
95
  )
96
 
97
  return trainer
 
118
  """Start training"""
119
  logger.info("Starting training")
120
 
121
+ # Log configuration to Trackio
122
+ if self.monitor and self.monitor.enable_tracking:
123
+ config_dict = {k: v for k, v in self.config.__dict__.items()
124
+ if not k.startswith('_')}
125
+ self.monitor.log_config(config_dict)
126
+
127
+ # Log experiment URL
128
+ experiment_url = self.monitor.get_experiment_url()
129
+ if experiment_url:
130
+ logger.info(f"Trackio experiment URL: {experiment_url}")
131
+
132
  # Load checkpoint if resuming
133
  if self.init_from == "resume":
134
  checkpoint_path = "/input-checkpoint"
 
148
  with open(os.path.join(self.output_dir, "train_results.json"), "w") as f:
149
  json.dump(train_result.metrics, f, indent=2)
150
 
151
+ # Log training summary to Trackio
152
+ if self.monitor and self.monitor.enable_tracking:
153
+ summary = {
154
+ 'final_loss': train_result.metrics.get('train_loss', 0),
155
+ 'total_steps': train_result.global_step,
156
+ 'training_time': train_result.metrics.get('train_runtime', 0),
157
+ 'output_dir': self.output_dir,
158
+ 'model_name': getattr(self.config, 'model_name', 'unknown'),
159
+ }
160
+ self.monitor.log_training_summary(summary)
161
+ self.monitor.close()
162
+
163
  logger.info("Training completed successfully!")
164
  logger.info(f"Training metrics: {train_result.metrics}")
165
 
166
  except Exception as e:
167
  logger.error(f"Training failed: {e}")
168
+ # Close monitoring on error
169
+ if self.monitor and self.monitor.enable_tracking:
170
+ self.monitor.close()
171
  raise
172
 
173
  def evaluate(self):