Tonic committed on
Commit
3c37508
·
1 Parent(s): ad3b15d

adds readme, removes quantization, adds readtoken logic, updates trackio, spaces

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +275 -293
  2. docs/A100_LARGE_SCALE_GUIDE.md +0 -195
  3. docs/APP_CONFIGURATION_GUIDE.md +0 -234
  4. docs/CLOUD_DEPLOYMENT_GUIDE.md +0 -462
  5. docs/CLOUD_TRAINING_GUIDE.md +0 -440
  6. docs/Configuration_Management.md +29 -0
  7. docs/DATASET_AUTOMATION_FIX.md +0 -218
  8. docs/DATASET_COMPONENTS_VERIFICATION.md +0 -235
  9. docs/DEPLOYMENT_COMPONENTS_VERIFICATION.md +0 -393
  10. docs/DEPLOYMENT_GUIDE.md +0 -397
  11. docs/Data_Pipeline.md +95 -0
  12. docs/ENHANCED_MODEL_CARD_METADATA.md +0 -300
  13. docs/ENVIRONMENT_SETUP_FIX.md +0 -239
  14. docs/ENVIRONMENT_VARIABLES.md +0 -113
  15. docs/Entry_Point.md +120 -0
  16. docs/FINAL_DEPLOYMENT_VERIFICATION.md +0 -378
  17. docs/FORMATTING_FIX_SUMMARY.md +0 -153
  18. docs/GIT_CONFIGURATION_FIX.md +0 -257
  19. docs/GIT_CONFIGURATION_GUIDE.md +0 -258
  20. docs/H100_LIGHTWEIGHT_GUIDE.md +0 -276
  21. docs/HF_DATASETS_GUIDE.md +0 -269
  22. docs/HF_HUB_V0_34_UPDATE.md +0 -170
  23. docs/HF_SPACES_GUIDE.md +0 -163
  24. docs/INTERACTIVE_PIPELINE_IMPROVEMENTS.md +0 -330
  25. docs/LATEST_DEPLOYMENT_APPROACH.md +0 -267
  26. docs/LAUNCH_SCRIPT_UPDATES.md +0 -174
  27. docs/LAUNCH_SCRIPT_USERNAME_FIX.md +0 -154
  28. docs/MODEL_CARD_USER_INPUT_ANALYSIS.md +0 -233
  29. docs/MODEL_RECOVERY_GUIDE.md +0 -228
  30. docs/MONITORING_IMPROVEMENTS_SUMMARY.md +0 -191
  31. docs/MONITORING_INTEGRATION_GUIDE.md +0 -245
  32. docs/MONITORING_VERIFICATION_REPORT.md +0 -163
  33. docs/Model_Abstraction.md +36 -0
  34. docs/NO_THINK_TAG_GUIDE.md +0 -146
  35. docs/PIPELINE_SUMMARY.md +0 -330
  36. docs/PUSH_GUIDE.md +0 -406
  37. docs/PUSH_SCRIPT_GUIDE.md +0 -267
  38. docs/QUANTIZATION_FIX_SUMMARY.md +0 -165
  39. docs/QUANTIZATION_GUIDE.md +0 -313
  40. docs/QUANTIZATION_IMPLEMENTATION_SUMMARY.md +0 -248
  41. docs/README_END_TO_END.md +0 -303
  42. docs/SFT_TRAINER_CONFIG_USAGE.md +0 -233
  43. docs/TOKEN_FIX_SUMMARY.md +0 -249
  44. docs/TOKEN_VALIDATION_FIX.md +0 -183
  45. docs/TRACKIO_API_FIX_SUMMARY.md +0 -276
  46. docs/TRACKIO_DEPLOYMENT_FIXES.md +0 -266
  47. docs/TRACKIO_DICT_ACCESS_FIX.md +0 -144
  48. docs/TRACKIO_INTEGRATION.md +0 -252
  49. docs/TRACKIO_INTEGRATION_VERIFICATION.md +0 -177
  50. docs/TRACKIO_INTERFACE_GUIDE.md +0 -222
README.md CHANGED
@@ -1,399 +1,381 @@
1
- # SmolLM3 Fine-tuning
2
 
3
- This repository provides a complete setup for fine-tuning SmolLM3 models using the FlexAI console, following the nanoGPT structure but adapted for modern transformer models.
4
 
5
- ## Overview
6
 
7
- SmolLM3 is a 3B-parameter transformer decoder model optimized for efficiency, long-context reasoning, and multilingual support. This setup allows you to fine-tune SmolLM3 for various tasks including:
 
 
8
 
9
- - **Supervised Fine-tuning (SFT)**: Adapt the model for instruction following
10
- - **Direct Preference Optimization (DPO)**: Improve model alignment
11
- - **Long-context fine-tuning**: Support for up to 128k tokens
12
- - **Tool calling**: Fine-tune for function calling capabilities
13
- - **Model Quantization**: Create int8 (GPU) and int4 (CPU) quantized versions
14
 
15
- ## Quick Start
 
 
 
 
 
 
 
 
 
16
 
17
- ### 1. Repository Setup
18
 
19
- The repository follows the FlexAI console structure with the following key files:
20
 
21
- - `train.py`: Main entry point script
22
- - `config/train_smollm3.py`: Default configuration
23
- - `model.py`: Model wrapper and loading
24
- - `data.py`: Dataset handling and preprocessing
25
- - `trainer.py`: Training loop and trainer setup
26
- - `requirements.txt`: Dependencies
27
 
28
- ### 2. FlexAI Console Configuration
29
-
30
- When setting up a Fine Tuning Job in the FlexAI console, use these settings:
31
-
32
- #### Basic Configuration
33
- - **Name**: `smollm3-finetune`
34
- - **Cluster**: Your organization's designated cluster
35
- - **Checkpoint**: (Optional) Previous training job checkpoint
36
- - **Node Count**: 1
37
- - **Accelerator Count**: 1-8 (depending on your needs)
38
-
39
- #### Repository Settings
40
- - **Repository URL**: `https://github.com/your-username/flexai-finetune`
41
- - **Repository Revision**: `main`
42
-
43
- #### Dataset Configuration
44
- - **Datasets**: Your dataset (mounted under `/input`)
45
- - **Mount Directory**: `my_dataset`
46
-
47
- #### Entry Point
48
- ```
49
- train.py config/train_smollm3.py --dataset_dir=my_dataset --init_from=resume --out_dir=/input-checkpoint --max_iters=1500
50
  ```
51
 
52
- ### 3. Dataset Format
53
-
54
- The script supports multiple dataset formats:
 
 
 
 
 
55
 
56
- #### Chat Format (Recommended)
57
- ```json
58
- [
59
- {
60
- "messages": [
61
- {"role": "user", "content": "What is machine learning?"},
62
- {"role": "assistant", "content": "Machine learning is a subset of AI..."}
63
- ]
64
- }
65
- ]
66
- ```
67
 
68
- #### Instruction Format
69
- ```json
70
- [
71
- {
72
- "instruction": "What is machine learning?",
73
- "output": "Machine learning is a subset of AI..."
74
- }
75
- ]
76
- ```
77
 
78
- #### User-Assistant Format
79
- ```json
80
- [
81
- {
82
- "user": "What is machine learning?",
83
- "assistant": "Machine learning is a subset of AI..."
84
- }
85
- ]
 
 
 
 
 
 
 
86
  ```
87
 
88
- ### 4. Configuration Options
89
 
90
- The default configuration in `config/train_smollm3.py` includes:
91
-
92
- ```python
93
- @dataclass
94
- class SmolLM3Config:
95
- # Model configuration
96
- model_name: str = "HuggingFaceTB/SmolLM3-3B"
97
- max_seq_length: int = 4096
98
- use_flash_attention: bool = True
99
-
100
- # Training configuration
101
- batch_size: int = 4
102
- gradient_accumulation_steps: int = 4
103
- learning_rate: float = 2e-5
104
- max_iters: int = 1000
105
-
106
- # Mixed precision
107
- fp16: bool = True
108
- bf16: bool = False
 
 
 
 
109
  ```
110
 
111
- ### 5. Command Line Arguments
112
 
113
- The `train.py` script accepts various arguments:
114
 
115
- ```bash
116
- # Basic usage
117
- python train.py config/train_smollm3.py
118
-
119
- # With custom parameters
120
- python train.py config/train_smollm3.py \
121
- --dataset_dir=my_dataset \
122
- --out_dir=/output-checkpoint \
123
- --init_from=resume \
124
- --max_iters=1500 \
125
- --batch_size=8 \
126
- --learning_rate=1e-5 \
127
- --max_seq_length=8192
128
- ```
129
 
130
- ## Advanced Usage
131
-
132
- ### 1. Custom Configuration
133
-
134
- Create a custom configuration file:
135
 
136
  ```python
137
  # config/my_config.py
138
  from config.train_smollm3 import SmolLM3Config
139
 
140
  config = SmolLM3Config(
141
- model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
142
  max_seq_length=8192,
143
- batch_size=2,
144
- learning_rate=1e-5,
145
- max_iters=2000,
146
- use_flash_attention=True,
147
- fp16=True
148
  )
149
  ```
150
 
151
- ### 2. Long-Context Fine-tuning
152
 
153
- For long-context tasks (up to 128k tokens):
154
 
155
  ```python
156
- config = SmolLM3Config(
157
- max_seq_length=131072, # 128k tokens
158
- model_name="HuggingFaceTB/SmolLM3-3B",
159
- use_flash_attention=True,
160
- gradient_checkpointing=True
 
 
 
 
 
 
 
 
 
161
  )
162
  ```
163
 
164
- ### 3. DPO Training
165
 
166
- For preference optimization, use the DPO trainer:
167
 
168
  ```python
169
- from trainer import SmolLM3DPOTrainer
 
 
170
 
171
- dpo_trainer = SmolLM3DPOTrainer(
 
172
  model=model,
173
  dataset=dataset,
174
  config=config,
175
- output_dir="./dpo-output"
176
  )
177
 
178
- dpo_trainer.train()
179
- ```
180
-
181
- ### 4. Tool Calling Fine-tuning
182
-
183
- Include tool calling examples in your dataset:
184
-
185
- ```json
186
- [
187
- {
188
- "messages": [
189
- {"role": "user", "content": "What's the weather in New York?"},
190
- {"role": "assistant", "content": "<tool_call>\n<invoke name=\"get_weather\">\n<parameter name=\"location\">New York</parameter>\n</invoke>\n</tool_call>"},
191
- {"role": "tool", "content": "The weather in New York is 72°F and sunny."},
192
- {"role": "assistant", "content": "The weather in New York is currently 72°F and sunny."}
193
- ]
194
- }
195
- ]
196
  ```
197
 
198
- ## Model Variants
199
 
200
- SmolLM3 comes in several variants:
201
 
202
- - **SmolLM3-3B-Base**: Base model for general fine-tuning
203
- - **SmolLM3-3B**: Instruction-tuned model
204
- - **SmolLM3-3B-Instruct**: Enhanced instruction model
205
- - **Quantized versions**: Available for deployment
206
 
207
- ## Hardware Requirements
208
-
209
- ### Minimum Requirements
210
- - **GPU**: 16GB+ VRAM (for 3B model)
211
- - **RAM**: 32GB+ system memory
212
- - **Storage**: 50GB+ free space
213
-
214
- ### Recommended
215
- - **GPU**: A100/H100 or similar
216
- - **RAM**: 64GB+ system memory
217
- - **Storage**: 100GB+ SSD
218
 
219
- ## Troubleshooting
220
 
221
- ### Common Issues
222
 
223
- 1. **Out of Memory (OOM)**
224
- - Reduce `batch_size`
225
- - Increase `gradient_accumulation_steps`
226
- - Enable `gradient_checkpointing`
227
- - Use `fp16` or `bf16`
228
-
229
- 2. **Slow Training**
230
- - Enable `flash_attention`
231
- - Use mixed precision (`fp16`/`bf16`)
232
- - Increase `dataloader_num_workers`
233
 
234
- 3. **Dataset Loading Issues**
235
- - Check dataset format
236
- - Ensure proper JSON structure
237
- - Verify file permissions
238
 
239
- ### Debug Mode
240
 
241
- Enable debug logging:
242
 
243
  ```python
244
- import logging
245
- logging.basicConfig(level=logging.DEBUG)
 
 
 
 
 
 
246
  ```
247
 
248
- ## Evaluation
249
 
250
- After training, evaluate your model:
251
 
252
  ```python
253
- from transformers import pipeline
254
-
255
- pipe = pipeline(
256
- task="text-generation",
257
- model="./output-checkpoint",
258
- device=0,
259
- max_new_tokens=256,
260
- do_sample=True,
261
- temperature=0.7
262
- )
263
-
264
- # Test the model
265
- messages = [{"role": "user", "content": "Explain gravity in simple terms."}]
266
- outputs = pipe(messages)
267
- print(outputs[0]["generated_text"][-1]["content"])
268
- ```
269
-
270
- ## Model Quantization
271
-
272
- The pipeline includes built-in quantization support using torchao for creating optimized model versions with a unified repository structure:
273
-
274
- ### Repository Structure
275
-
276
- All models (main and quantized) are stored in a single repository:
277
-
278
- ```
279
- your-username/model-name/
280
- ├── README.md (unified model card)
281
- ├── config.json
282
- ├── pytorch_model.bin
283
- ├── tokenizer.json
284
- ├── int8/ (quantized model for GPU)
285
- └── int4/ (quantized model for CPU)
286
  ```
287
 
288
- ### Quantization Types
289
 
290
- - **int8_weight_only**: GPU optimized, ~50% memory reduction
291
- - **int4_weight_only**: CPU optimized, ~75% memory reduction
292
-
293
- ### Automatic Quantization
294
-
295
- When using the interactive pipeline (`launch.sh`), you'll be prompted to create quantized versions after training:
296
 
297
  ```bash
298
- ./launch.sh
299
- # ... training completes ...
300
- # Choose quantization options when prompted
 
 
301
  ```
302
 
303
- ### Standalone Quantization
304
 
305
- Quantize existing models independently:
306
 
307
  ```bash
308
- # Quantize and push to HF Hub (same repository)
309
- python scripts/model_tonic/quantize_standalone.py /path/to/model your-username/model-name \
 
310
  --quant-type int8_weight_only \
311
  --token YOUR_HF_TOKEN
312
 
313
- # Quantize and save locally
314
- python scripts/model_tonic/quantize_standalone.py /path/to/model your-username/model-name \
 
315
  --quant-type int4_weight_only \
316
  --device cpu \
317
  --save-only
318
  ```
319
 
320
- ### Loading Quantized Models
 
 
 
 
321
 
322
  ```python
323
- import torch
324
- from transformers import AutoModelForCausalLM, AutoTokenizer
325
-
326
- # Load main model
327
- model = AutoModelForCausalLM.from_pretrained(
328
- "your-username/model-name",
329
- device_map="auto",
330
- torch_dtype=torch.bfloat16
331
- )
332
- tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
333
 
334
- # Load int8 quantized model (GPU)
335
- model = AutoModelForCausalLM.from_pretrained(
336
- "your-username/model-name/int8",
337
- device_map="auto",
338
- torch_dtype=torch.bfloat16
 
 
339
  )
340
- tokenizer = AutoTokenizer.from_pretrained("your-username/model-name/int8")
341
 
342
- # Load int4 quantized model (CPU)
343
- model = AutoModelForCausalLM.from_pretrained(
344
- "your-username/model-name/int4",
345
- device_map="cpu",
346
- torch_dtype=torch.bfloat16
347
- )
348
- tokenizer = AutoTokenizer.from_pretrained("your-username/model-name/int4")
349
  ```
350
 
351
- For detailed quantization documentation, see [QUANTIZATION_GUIDE.md](docs/QUANTIZATION_GUIDE.md).
352
 
353
- ### Unified Model Cards
354
 
355
- The system generates comprehensive model cards that include information about all model variants:
356
 
357
- - **Single README**: One comprehensive model card for the entire repository
358
- - **Conditional Sections**: Quantized model information appears when available
359
- - **Usage Examples**: Complete examples for all model variants
360
- - **Performance Information**: Memory and speed benefits for each quantization type
361
 
362
- For detailed information about the unified model card system, see [UNIFIED_MODEL_CARD_GUIDE.md](docs/UNIFIED_MODEL_CARD_GUIDE.md).
363
 
364
- ## Deployment
 
 
 
 
 
 
 
 
 
 
 
 
 
 
365
 
366
- ### Using vLLM
367
  ```bash
368
- vllm serve ./output-checkpoint --enable-auto-tool-choice
 
 
 
 
 
 
 
369
  ```
370
 
371
- ### Using llama.cpp
372
- ```bash
373
- # Convert to GGUF format
374
- python -m llama_cpp.convert_model ./output-checkpoint --outfile model.gguf
375
  ```
376
 
377
- ## Resources
378
 
379
- - [SmolLM3 Blog Post](https://huggingface.co/blog/smollm3)
380
- - [Model Repository](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
381
- - [GitHub Repository](https://github.com/huggingface/smollm)
382
- - [SmolTalk Dataset](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
 
 
383
 
384
- ## License
385
 
386
- This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.
387
 
 
388
 
389
- {
390
- "id": "exp_20250718_195852",
391
- "name": "petit-elle-l-aime-3",
392
- "description": "SmolLM3 fine-tuning experiment",
393
- "created_at": "2025-07-18T19:58:52.689087",
394
- "status": "running",
395
- "metrics": [],
396
- "parameters": {},
397
- "artifacts": [],
398
- "logs": []
399
- }
 
1
+ # 🤏🏻🏭SmolFactory
2
 
3
+ A comprehensive end-to-end fine-tuning pipeline for SmolLM3 models with custom monitoring, Hugging Face integration, and interactive configuration management.
4
 
5
+ ## 🤖 Automatically Push Model, Spaces, Datasets & Monitoring
6
 
7
+ - **Trackio Monitoring Space**: Real-time training metrics, loss curves, and resource utilization
8
+ - **Demo Spaces**: Instant web interfaces for model testing and demonstration
9
+ - **Automatic Deployment**: Spaces created and configured automatically during the pipeline
10
 
11
+ ### 📈 **Custom Trackio Monitoring**
 
 
 
 
12
 
13
+ - **Real-time Metrics**: Live training loss, learning rate, gradient norms, and GPU utilization
14
+ - **Custom Dashboards**: Tailored visualizations for SmolLM3 fine-tuning
15
+ - **Artifact Logging**: Model checkpoints, configuration files, and training logs
16
+ - **Experiment Comparison**: Side-by-side analysis of different training runs
17
+ - **Alert System**: Notifications for training issues or completion
18
+ - **Integration**: Seamless connection with HF Spaces for public monitoring
19
+ - **Experiment Tracking**: All training data, metrics, and artifacts stored in HF Datasets
20
+ - **Reproducibility**: Complete experiment history with configuration snapshots
21
+ - **Collaboration**: Easy sharing of training results and model comparisons
22
+ - **Version Control**: Track dataset changes and model performance over time
23
 
24
+ ## 🚀 Quick Start
25
 
26
+ ### Interactive Pipeline (Recommended)
27
 
28
+ The easiest way to get started is using the interactive pipeline:
 
 
 
 
 
29
 
30
+ ```bash
31
+ ./launch.sh
32
  ```
33
 
34
+ This script will:
35
+ 1. **Authenticate** with Hugging Face (write + read tokens)
36
+ 2. **Configure** training parameters interactively
37
+ 3. **Deploy** Trackio Space for monitoring
38
+ 4. **Set up** HF Dataset for experiment tracking
39
+ 5. **Execute** training with your chosen configuration
40
+ 6. **Push** model to HF Hub with comprehensive documentation
41
+ 7. **Deploy** demo space for testing (optional)
42
 
43
+ ### Manual Setup
 
 
 
 
 
 
 
 
 
 
44
 
45
+ For advanced users who want to customize the pipeline:
 
 
 
 
 
 
 
 
46
 
47
+ ```bash
48
+ # 1. Install dependencies
49
+ pip install -r requirements/requirements_core.txt
50
+
51
+ # 2. Run training with your chosen configuration
52
+ python scripts/training/train.py \
53
+ --config config/train_smollm3_h100_lightweight.py \
54
+ --experiment-name "my-experiment" \
55
+ --output-dir ./outputs \
56
+ --trackio-url "https://huggingface.co/spaces/username/trackio-monitoring"
57
+
58
+ # 3. Push model to HF Hub
59
+ python scripts/model_tonic/push_to_huggingface.py \
60
+ ./outputs username/model-name \
61
+ --token YOUR_HF_TOKEN
62
  ```
63
 
 
64
 
65
+ ## 🏗️ Repository Architecture
66
+
67
+ ```mermaid
68
+ graph LR
69
+ Entry_Point["Entry Point"]
70
+ Configuration_Management["Configuration Management"]
71
+ Data_Pipeline["Data Pipeline"]
72
+ Model_Abstraction["Model Abstraction"]
73
+ Training_Orchestrator["Training Orchestrator"]
74
+ Entry_Point -- "Initializes and Uses" --> Configuration_Management
75
+ Entry_Point -- "Initializes" --> Data_Pipeline
76
+ Entry_Point -- "Initializes" --> Model_Abstraction
77
+ Entry_Point -- "Initializes and Invokes" --> Training_Orchestrator
78
+ Configuration_Management -- "Provides Configuration To" --> Model_Abstraction
79
+ Configuration_Management -- "Provides Configuration To" --> Data_Pipeline
80
+ Configuration_Management -- "Provides Configuration To" --> Training_Orchestrator
81
+ Data_Pipeline -- "Provides Data To" --> Training_Orchestrator
82
+ Model_Abstraction -- "Provides Model To" --> Training_Orchestrator
83
+ click Entry_Point href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Entry_Point.md" "Details"
84
+ click Configuration_Management href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Configuration_Management.md" "Details"
85
+ click Data_Pipeline href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/docs/Data_Pipeline.md" "Details"
86
+ click Model_Abstraction href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/docs/Model_Abstraction.md" "Details"
87
+ click Training_Orchestrator href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/docs/Training_Orchestrator.md" "Details"
88
  ```
89
 
 
90
 
91
+ ## 🔧 Core Components
92
 
93
+ ### Configuration System (`config/`)
 
 
 
 
 
 
 
 
 
 
 
 
 
94
 
95
+ All training configurations inherit from `SmolLM3Config`:
 
 
 
 
96
 
97
  ```python
98
  # config/my_config.py
99
  from config.train_smollm3 import SmolLM3Config
100
 
101
  config = SmolLM3Config(
102
+ model_name="HuggingFaceTB/SmolLM3-3B",
103
  max_seq_length=8192,
104
+ batch_size=8,
105
+ learning_rate=5e-6,
106
+ trainer_type="sft", # or "dpo"
107
+ enable_tracking=True,
108
+ trackio_url="https://huggingface.co/spaces/username/trackio-monitoring"
109
  )
110
  ```
111
 
112
+ ### Dataset Processing (`src/data.py`)
113
 
114
+ The `SmolLM3Dataset` class handles multiple dataset formats:
115
 
116
  ```python
117
+ from src.data import SmolLM3Dataset
118
+
119
+ # Supports multiple formats:
120
+ # 1. Chat format (recommended)
121
+ # 2. Instruction format
122
+ # 3. User-Assistant format
123
+ # 4. Hugging Face datasets
124
+
125
+ dataset = SmolLM3Dataset(
126
+ data_path="my_dataset",
127
+ tokenizer=tokenizer,
128
+ max_seq_length=4096,
129
+ use_chat_template=True,
130
+ sample_size=80000 # For lightweight training
131
  )
132
  ```
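
The three record-based formats listed in the comments above can all be reduced to one chat-style message list before tokenization. The following is a minimal sketch of that normalization, assuming the record layouts shown in the previous version of this README (`messages`, `instruction`/`output`, `user`/`assistant`). The helper name `to_messages` is hypothetical and is not taken from `src/data.py`.

```python
from typing import Any, Dict, List

Message = Dict[str, str]


def to_messages(record: Dict[str, Any]) -> List[Message]:
    """Normalize one dataset record into a list of chat messages.

    Hypothetical helper for illustration; not the repository's implementation.
    """
    if "messages" in record:
        # Chat format: already a list of {"role", "content"} dicts.
        return [dict(message) for message in record["messages"]]
    if "instruction" in record and "output" in record:
        # Instruction format: one prompt and one response.
        return [
            {"role": "user", "content": record["instruction"]},
            {"role": "assistant", "content": record["output"]},
        ]
    if "user" in record and "assistant" in record:
        # User-Assistant format: same shape, different key names.
        return [
            {"role": "user", "content": record["user"]},
            {"role": "assistant", "content": record["assistant"]},
        ]
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")


if __name__ == "__main__":
    samples = [
        {"messages": [{"role": "user", "content": "What is machine learning?"}]},
        {"instruction": "What is machine learning?", "output": "A subset of AI."},
        {"user": "What is machine learning?", "assistant": "A subset of AI."},
    ]
    for sample in samples:
        print(to_messages(sample))
```

Raising on unknown keys, rather than guessing, makes a malformed dataset fail at load time instead of silently producing empty conversations.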
133
 
134
+ ### Training Orchestration (`src/train.py`)
135
 
136
+ The main training script coordinates all components:
137
 
138
  ```python
139
+ from src.train import main
140
+ from src.model import SmolLM3Model
141
+ from src.trainer import SmolLM3Trainer, SmolLM3DPOTrainer
142
 
143
+ # SFT Training
144
+ trainer = SmolLM3Trainer(
145
  model=model,
146
  dataset=dataset,
147
  config=config,
148
+ output_dir="./outputs"
149
  )
150
 
151
+ # DPO Training
152
+ dpo_trainer = SmolLM3DPOTrainer(
153
+ model=model,
154
+ dataset=dataset,
155
+ config=config,
156
+ output_dir="./dpo-outputs"
157
+ )
 
 
 
 
 
 
 
 
 
 
 
158
  ```
159
 
160
+ ## 🎯 Training Types
161
 
162
+ ### Supervised Fine-tuning (SFT)
163
 
164
+ Standard instruction tuning for improving model capabilities:
 
 
 
165
 
166
+ ```bash
167
+ python scripts/training/train.py \
168
+ --config config/train_smollm3.py \
169
+ --trainer-type sft \
170
+ --experiment-name "sft-experiment"
171
+ ```
 
 
 
 
 
172
 
173
+ ### Direct Preference Optimization (DPO)
174
 
175
+ Preference-based training for alignment:
176
 
177
+ ```bash
178
+ python scripts/training/train.py \
179
+ --config config/train_smollm3_dpo.py \
180
+ --trainer-type dpo \
181
+ --experiment-name "dpo-experiment"
182
+ ```
 
 
 
 
183
 
184
+ ## 📊 Monitoring & Tracking
 
 
 
185
 
186
+ ### Trackio Integration
187
 
188
+ The pipeline includes comprehensive monitoring:
189
 
190
  ```python
191
+ from src.monitoring import create_monitor_from_config
192
+
193
+ monitor = create_monitor_from_config(config)
194
+ monitor.log_metrics({
195
+ "train_loss": loss,
196
+ "learning_rate": lr,
197
+ "gradient_norm": grad_norm
198
+ })
199
  ```
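
To make the logging call above concrete without a running Trackio Space, here is a rough sketch of an in-memory object with a `log_metrics` method of the same shape. `InMemoryMonitor`, its `history` list, and the `latest` helper are assumptions for illustration only; the object actually returned by `create_monitor_from_config` may behave differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class InMemoryMonitor:
    """Hypothetical stand-in that records metrics locally instead of sending them."""

    experiment_name: str
    history: List[Tuple[int, Dict[str, float]]] = field(default_factory=list)

    def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        # Default the step to the number of entries logged so far.
        current_step = len(self.history) if step is None else step
        self.history.append((current_step, dict(metrics)))

    def latest(self, name: str) -> Optional[float]:
        # Walk backwards so the most recent value wins.
        for _, metrics in reversed(self.history):
            if name in metrics:
                return metrics[name]
        return None


if __name__ == "__main__":
    monitor = InMemoryMonitor("my-experiment")
    monitor.log_metrics({"train_loss": 2.5, "learning_rate": 5e-6})
    monitor.log_metrics({"train_loss": 2.1, "gradient_norm": 0.8})
    print(monitor.latest("train_loss"))
```

A stand-in like this can be useful in unit tests, where the training loop should be exercised without network access.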
200
 
201
+ ### HF Dataset Integration
202
 
203
+ Experiment data is automatically saved to HF Datasets:
204
 
205
  ```python
206
+ # Automatically configured in launch.sh
207
+ dataset_repo = "username/trackio-experiments"
208
  ```
209
 
210
+ ## 🔄 Model Management
211
 
212
+ ### Pushing to HF Hub
 
 
 
 
 
213
 
214
  ```bash
215
+ python scripts/model_tonic/push_to_huggingface.py \
216
+ ./outputs username/model-name \
217
+ --token YOUR_HF_TOKEN \
218
+ --trackio-url "https://huggingface.co/spaces/username/trackio-monitoring" \
219
+ --experiment-name "my-experiment"
220
  ```
221
 
222
+ ### Model Quantization
223
 
224
+ Create optimized versions for deployment:
225
 
226
  ```bash
227
+ # Quantize and push to HF Hub
228
+ python scripts/model_tonic/quantize_standalone.py \
229
+ ./outputs username/model-name \
230
  --quant-type int8_weight_only \
231
  --token YOUR_HF_TOKEN
232
 
233
+ # Quantize for CPU deployment
234
+ python scripts/model_tonic/quantize_standalone.py \
235
+ ./outputs username/model-name \
236
  --quant-type int4_weight_only \
237
  --device cpu \
238
  --save-only
239
  ```
240
 
241
+ ## 🛠️ Customization Guide
242
+
243
+ ### Adding New Training Configurations
244
+
245
+ 1. Create a new config file in `config/`:
246
 
247
  ```python
248
+ # config/train_smollm3_custom.py
249
+ from config.train_smollm3 import SmolLM3Config
 
 
 
 
 
 
 
 
250
 
251
+ config = SmolLM3Config(
252
+ model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
253
+ max_seq_length=16384,
254
+ batch_size=4,
255
+ learning_rate=1e-5,
256
+ max_iters=2000,
257
+ trainer_type="sft"
258
  )
259
+ ```
260
 
261
+ 2. Add to the training script mapping in `scripts/training/train.py`:
262
+
263
+ ```python
264
+ config_map = {
265
+ # ... existing configs ...
266
+ "config/train_smollm3_custom.py": get_custom_config,
267
+ }
268
  ```
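
The mapping above pairs a config path with a factory function. A self-contained sketch of that lookup pattern follows. The reduced `SmolLM3Config` dataclass is a stand-in carrying only a few fields named in this README, and `load_config` is a hypothetical helper; the real `scripts/training/train.py` may resolve configs differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SmolLM3Config:
    """Reduced stand-in for the project's config class (assumption)."""

    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 4096
    batch_size: int = 4
    learning_rate: float = 2e-5
    max_iters: int = 1000
    trainer_type: str = "sft"


def get_custom_config() -> SmolLM3Config:
    # Mirrors the values from config/train_smollm3_custom.py in step 1.
    return SmolLM3Config(
        model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
        max_seq_length=16384,
        batch_size=4,
        learning_rate=1e-5,
        max_iters=2000,
        trainer_type="sft",
    )


# Each entry maps a config path to a zero-argument factory.
config_map: Dict[str, Callable[[], SmolLM3Config]] = {
    "config/train_smollm3.py": SmolLM3Config,
    "config/train_smollm3_custom.py": get_custom_config,
}


def load_config(path: str) -> SmolLM3Config:
    """Look up a factory by path and build the config (hypothetical helper)."""
    try:
        factory = config_map[path]
    except KeyError:
        known = ", ".join(sorted(config_map))
        raise ValueError(f"Unknown config {path!r}; known configs: {known}") from None
    return factory()


if __name__ == "__main__":
    print(load_config("config/train_smollm3_custom.py"))
```

Storing factories rather than ready-made instances means each call builds a fresh config, so one run cannot mutate the settings seen by another.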
269
 
270
+ ### Custom Dataset Formats
271
 
272
+ Extend `src/data.py` to support new formats:
273
 
274
+ ```python
275
+ def _load_custom_format(self, data_path: str) -> Dataset:
276
+     """Load custom dataset format"""
277
+     # Your custom loading logic here
278
+     pass
279
+ ```
280
+
281
+ ### Custom Training Loops
282
+
283
+ Extend `src/trainer.py` for specialized training:
284
+
285
+ ```python
286
+ class SmolLM3CustomTrainer(SmolLM3Trainer):
287
+     def training_step(self, batch):
288
+         # Custom training logic
289
+         pass
290
+ ```
291
 
292
+ ## 🔧 Development & Contributing
 
 
 
293
 
294
+ ### Project Structure
295
 
296
+ - **`src/`**: Core training modules
297
+ - **`config/`**: Training configurations
298
+ - **`scripts/`**: Utility scripts and automation
299
+ - **`docs/`**: Comprehensive documentation
300
+ - **`tests/`**: Test files and debugging tools
301
+
302
+ ### Adding New Features
303
+
304
+ 1. **Configuration**: Add to `config/` directory
305
+ 2. **Core Logic**: Extend modules in `src/`
306
+ 3. **Scripts**: Add utility scripts to `scripts/`
307
+ 4. **Documentation**: Update relevant docs in `docs/`
308
+ 5. **Tests**: Add test files to `tests/`
309
+
310
+ ### Testing Your Changes
311
 
 
312
  ```bash
313
+ # Run basic tests
314
+ python tests/test_config.py
315
+ python tests/test_dataset.py
316
+ python tests/test_training.py
317
+
318
+ # Test specific components
319
+ python tests/test_monitoring.py
320
+ python tests/test_model_push.py
321
  ```
322
 
323
+ ### Code Style
324
+
325
+ - Follow PEP 8 for Python code
326
+ - Use type hints for all functions
327
+ - Add comprehensive docstrings
328
+ - Include error handling for external APIs
329
+ - Use structured logging with consistent field names
330
+
331
+ ## 🚨 Troubleshooting
332
+
333
+ ### Common Issues
334
+
335
+ 1. **Out of Memory (OOM)**
336
+ ```python
337
+ # Reduce batch size in config
338
+ batch_size=2 # instead of 8
339
+ gradient_accumulation_steps=16 # increase to compensate
340
+ ```
341
+
342
+ 2. **Token Validation Errors**
343
+ ```bash
344
+ # Validate your HF token
345
+ python scripts/validate_hf_token.py YOUR_TOKEN
346
+ ```
347
+
348
+ 3. **Dataset Loading Issues**
349
+ ```bash
350
+ # Check dataset format
351
+ python tests/test_dataset_loading.py
352
+ ```
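
For the out-of-memory case in item 1, the aim is to shrink the per-step batch while keeping the effective batch size (batch size times gradient accumulation steps) unchanged. The sketch below computes the accumulation steps needed after a reduction. It assumes the original run used `batch_size=8` with 4 accumulation steps, which this README does not state; the function name is hypothetical.

```python
import math


def accumulation_steps_for(effective_batch_size: int, per_step_batch_size: int) -> int:
    """Return the accumulation steps needed to reach at least the target effective batch."""
    if effective_batch_size <= 0 or per_step_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    # Round up so the effective batch never drops below the target.
    return math.ceil(effective_batch_size / per_step_batch_size)


if __name__ == "__main__":
    original_effective = 8 * 4  # assumed starting point: batch_size=8, 4 accumulation steps
    print(accumulation_steps_for(original_effective, 2))
```

Under that assumption, dropping to `batch_size=2` calls for 16 accumulation steps, which matches the values in the snippet for item 1.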
353
+
354
+ ### Debug Mode
355
+
356
+ Enable detailed logging:
357
+
358
+ ```python
359
+ import logging
360
+ logging.basicConfig(level=logging.DEBUG)
361
  ```
362
 
363
+ ## 🤝 Contributing
364
 
365
+ 1. Fork the repository
366
+ 2. Create a feature branch
367
+ 3. Make your changes following the code style
368
+ 4. Add tests for new functionality
369
+ 5. Update documentation
370
+ 6. Submit a pull request
371
 
372
+ ## 📄 License
373
 
374
+ This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.
375
 
376
+ ## 🔗 Resources
377
 
378
+ - [SmolLM3 Blog Post](https://huggingface.co/blog/smollm3)
379
+ - [Model Repository](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
380
+ - [GitHub Repository](https://github.com/huggingface/smollm)
381
+ - [SmolTalk Dataset](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
docs/A100_LARGE_SCALE_GUIDE.md DELETED
@@ -1,195 +0,0 @@
1
- # A100 Large Scale Training Guide
2
-
3
- This guide provides configurations and instructions for running fully-fledged experiments with multiple passes on the full OpenHermes-FR dataset (800k+ datapoints) using A100 GPUs.
4
-
5
- ## Available Configurations
6
-
7
- ### 1. A100 Large Batch Configuration
8
- **File**: `config/train_smollm3_openhermes_fr_a100_large.py`
9
-
10
- **Key Features**:
11
- - **Effective Batch Size**: 128 (8 × 16 gradient accumulation)
12
- - **Training Duration**: ~1.3 passes (8,000 steps)
13
- - **Learning Rate**: 5e-6 (optimized for large batches)
14
- - **Mixed Precision**: bf16 (A100 optimized)
15
- - **Sequence Length**: 8192 tokens
16
- - **Memory Optimizations**: No gradient checkpointing for A100 efficiency
17
-
18
- **Estimated Training Time**: ~6-8 hours on A100
19
-
20
- ### 2. Multiple Passes Configuration
21
- **File**: `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`
22
-
23
- **Key Features**:
24
- - **Effective Batch Size**: 120 (6 × 20 gradient accumulation)
25
- - **Training Duration**: ~4 passes (25,000 steps)
26
- - **Learning Rate**: 3e-6 (conservative for long training)
27
- - **Warmup Steps**: 2000 (longer warmup for stability)
28
- - **Checkpoint Strategy**: More frequent saves (every 2000 steps)
29
-
30
- **Estimated Training Time**: ~20-24 hours on A100
31
-
32
- ## Training Commands
33
-
34
- ### Quick Start - Large Batch Experiment
35
- ```bash
36
- python run_a100_large_experiment.py \
37
- --config config/train_smollm3_openhermes_fr_a100_large.py \
38
- --experiment-name "smollm3_openhermes_fr_large_batch" \
39
- --output-dir ./outputs/large_batch
40
- ```
41
-
42
- ### Multiple Passes Experiment
43
- ```bash
44
- python run_a100_large_experiment.py \
45
- --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
46
- --experiment-name "smollm3_openhermes_fr_multiple_passes" \
47
- --output-dir ./outputs/multiple_passes
48
- ```
49
-
50
- ### Dry Run (Check Configuration)
51
- ```bash
52
- python run_a100_large_experiment.py \
53
- --config config/train_smollm3_openhermes_fr_a100_large.py \
54
- --dry-run
55
- ```
56
-
57
- ### Resume Training
58
- ```bash
59
- python run_a100_large_experiment.py \
60
- --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
61
- --resume ./outputs/multiple_passes/checkpoint-10000 \
62
- --output-dir ./outputs/multiple_passes
63
- ```
64
-
65
- ## Configuration Details
66
-
67
- ### Memory Usage Optimization
68
- - **Gradient Checkpointing**: Disabled for A100 efficiency
69
- - **Flash Attention**: Enabled for memory efficiency
70
- - **bf16 Mixed Precision**: Better for A100 than fp16
71
- - **Gradient Clipping**: 1.0 for stability
72
- - **Group by Length**: Enabled for better batching
73
-
74
- ### Data Loading Optimization
75
- - **Num Workers**: 8 for faster data loading
76
- - **Pin Memory**: Enabled for GPU transfer efficiency
77
- - **Prefetch Factor**: 2 for pipeline optimization
78
-
79
- ### Training Stability
80
- - **Conservative Learning Rate**: Lower LR for large effective batch sizes
81
- - **Longer Warmup**: More warmup steps for stability
82
- - **Higher Beta2**: 0.999 for AdamW stability
83
- - **Gradient Clipping**: Prevents gradient explosion
84
-
85
- ## Expected Results
86
-
87
- ### Large Batch Configuration (1.3 passes)
88
- - **Training Steps**: 8,000
89
- - **Effective Batch Size**: 128
90
- - **Steps per Epoch**: ~6,250
91
- - **Epochs**: ~1.3
92
- - **Expected Loss**: Should converge to ~1.5-2.0
93
-
94
- ### Multiple Passes Configuration (4 passes)
95
- - **Training Steps**: 25,000
96
- - **Effective Batch Size**: 120
97
- - **Steps per Epoch**: ~6,667
98
- - **Epochs**: ~3.75
99
- - **Expected Loss**: Should converge to ~1.2-1.5
100
-
101
- ## Monitoring and Logging
102
-
103
- ### Trackio Integration
104
- Both configurations include Trackio monitoring:
105
- - **Metrics Logging**: Every 25-50 steps
106
- - **Artifact Logging**: Model checkpoints
107
- - **Config Logging**: Training configuration
108
-
109
- ### Checkpoint Strategy
110
- - **Large Batch**: Save every 1000 steps (8 checkpoints)
111
- - **Multiple Passes**: Save every 2000 steps (12 checkpoints)
112
- - **Best Model**: Automatically load best model at end
113
-
114
- ## Hardware Requirements
115
-
116
- ### Minimum Requirements
117
- - **GPU**: A100 80GB (or multiple A100s)
118
- - **RAM**: 64GB+ system RAM
119
- - **Storage**: 100GB+ for checkpoints and logs
120
- - **Network**: Fast internet for dataset download
121
-
122
- ### Recommended Setup
123
- - **GPU**: 2-4x A100 80GB
124
- - **RAM**: 128GB+ system RAM
125
- - **Storage**: 500GB+ NVMe SSD
126
- - **Network**: 10Gbps+ connection
127
-
128
- ## Troubleshooting
129
-
130
- ### Out of Memory (OOM)
131
- If you encounter OOM errors:
132
- 1. Reduce `batch_size` from 8 to 6 or 4
133
- 2. Increase `gradient_accumulation_steps` to maintain effective batch size
134
- 3. Reduce `max_seq_length` from 8192 to 4096
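Steps 1 and 2 go together: when the per-device batch shrinks, gradient accumulation has to grow to keep the effective batch size near its target. A small sketch of that adjustment, assuming a target of 128 as in the large-batch configuration; `accumulation_steps` is a hypothetical helper, not part of the training script:

```python
import math


def accumulation_steps(target_effective_batch: int, per_device_batch: int, num_gpus: int = 1) -> int:
    """Smallest accumulation count whose effective batch reaches the target."""
    if per_device_batch < 1 or num_gpus < 1:
        raise ValueError("per_device_batch and num_gpus must be positive")
    return math.ceil(target_effective_batch / (per_device_batch * num_gpus))


for batch_size in (8, 6, 4):
    steps = accumulation_steps(128, batch_size)
    print(f"batch_size={batch_size} -> gradient_accumulation_steps={steps} "
          f"(effective batch {batch_size * steps})")
```

A batch size that does not divide the target evenly (6 here) overshoots slightly, so 4 may be the cleaner fallback if an exact effective batch size matters.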
135
-
136
- ### Slow Training
137
- If training is too slow:
138
- 1. Increase `dataloader_num_workers` to 12-16
139
- 2. Ensure you're using bf16 mixed precision
140
- 3. Check that gradient checkpointing is disabled
141
- 4. Verify flash attention is enabled
142
-
143
- ### Convergence Issues
144
- If loss doesn't converge:
145
- 1. Reduce learning rate by 2x
146
- 2. Increase warmup steps
147
- 3. Check gradient norms in logs
148
- 4. Verify dataset quality
149
-
150
- ## Customization
151
-
152
- ### For Different Dataset Sizes
153
- Adjust `max_iters` based on your dataset size:
154
- ```python
155
- # For 1M datapoints with effective batch size 120
156
- steps_per_epoch = 1000000 // 120 # ~8,333 steps
157
- max_iters = steps_per_epoch * desired_epochs
158
- ```
159
-
160
- ### For Different GPU Memory
161
- Adjust batch size and gradient accumulation:
162
- ```python
163
- # For 40GB A100
164
- batch_size = 4
165
- gradient_accumulation_steps = 32 # Effective batch size = 128
166
-
167
- # For 24GB GPU
168
- batch_size = 2
169
- gradient_accumulation_steps = 64 # Effective batch size = 128
170
- ```
171
-
172
- ## Performance Tips
173
-
174
- 1. **Use bf16**: Better than fp16 for A100
175
- 2. **Disable Gradient Checkpointing**: A100 has enough memory
176
- 3. **Use Flash Attention**: Memory efficient attention
177
- 4. **Group by Length**: Better batching efficiency
178
- 5. **Pin Memory**: Faster GPU transfers
179
- 6. **Multiple Workers**: Faster data loading
180
-
181
- ## Expected Timeline
182
-
183
- - **Large Batch**: 6-8 hours for 1.3 passes
184
- - **Multiple Passes**: 20-24 hours for 4 passes
185
- - **Full Dataset (5+ passes)**: 30+ hours
186
-
187
- ## Next Steps
188
-
189
- After training completes:
190
- 1. Evaluate on validation set
191
- 2. Test generation quality
192
- 3. Push to Hugging Face Hub
193
- 4. Deploy for inference
194
-
195
- For deployment instructions, see `DEPLOYMENT_GUIDE.md`.

docs/APP_CONFIGURATION_GUIDE.md DELETED
@@ -1,234 +0,0 @@
1
- # ⚙️ App Configuration Guide
2
-
3
- ## Overview
4
-
5
- The Trackio app now includes a **Configuration tab** that allows you to set your Hugging Face token and dataset repository directly through the interface, providing an alternative to environment variables.
6
-
7
- ## 🚀 New Features
8
-
9
- ### **Configuration Tab**
10
- - ✅ **HF Token Input**: Secure password field for your Hugging Face token
11
- - ✅ **Dataset Repository Input**: Text field for your dataset repository
12
- - ✅ **Update Configuration**: Apply new settings and reload experiments
13
- - ✅ **Test Connection**: Verify access to the dataset repository
14
- - ✅ **Create Dataset**: Create a new dataset repository if it doesn't exist
15
-
16
- ### **Flexible Configuration**
17
- - ✅ **Environment Variables**: Still supported as fallback
18
- - ✅ **Interface Input**: New direct input method
19
- - ✅ **Dynamic Updates**: Change configuration without restarting
20
- - ✅ **Validation**: Input validation and error handling
21
-
22
- ## 📋 Configuration Tab Usage
23
-
24
- ### **1. Access the Configuration Tab**
25
- - Open the Trackio app
26
- - Click on the "⚙️ Configuration" tab
27
- - You'll see input fields for HF Token and Dataset Repository
28
-
29
- ### **2. Set Your HF Token**
30
- ```
31
- Hugging Face Token: hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
32
- ```
33
- - **Type**: Password field (hidden for security)
34
- - **Required**: Yes (for dataset access)
35
- - **Format**: Your HF token starting with `hf_`
36
- - **Help**: Click the help text for instructions on getting your token
37
-
38
- ### **3. Set Your Dataset Repository**
39
- ```
40
- Dataset Repository: your-username/your-dataset-name
41
- ```
42
- - **Type**: Text field
43
- - **Required**: No (defaults to `tonic/trackio-experiments`)
44
- - **Format**: `username/dataset-name`
45
- - **Examples**:
46
- - `tonic/trackio-experiments`
47
- - `your-username/my-experiments`
48
- - `your-org/team-experiments`
49
-
50
- ### **4. Use the Action Buttons**
51
-
52
- #### **Update Configuration**
53
- - Applies new settings immediately
54
- - Reloads experiments with new configuration
55
- - Shows current status and experiment count
56
-
57
- #### **Test Connection**
58
- - Verifies access to the dataset repository
59
- - Tests HF token permissions
60
- - Shows dataset information and experiment count
61
-
62
- #### **Create Dataset**
63
- - Creates a new dataset repository if it doesn't exist
64
- - Sets up the correct schema for experiments
65
- - Makes the dataset private by default
66
-
67
- ## 🔧 Configuration Methods
68
-
69
- ### **Method 1: Interface Input (New)**
70
- 1. Go to "⚙️ Configuration" tab
71
- 2. Enter your HF token and dataset repository
72
- 3. Click "Update Configuration"
73
- 4. Verify with "Test Connection"
74
-
75
- ### **Method 2: Environment Variables (Existing)**
76
- ```bash
77
- # Set environment variables
78
- export HF_TOKEN=your_hf_token_here
79
- export TRACKIO_DATASET_REPO=your-username/your-dataset-name
80
-
81
- # Or for HF Spaces, add to Space settings
82
- HF_TOKEN=your_hf_token_here
83
- TRACKIO_DATASET_REPO=your-username/your-dataset-name
84
- ```
85
-
86
- ### **Method 3: Hybrid Approach**
87
- - Set environment variables as defaults
88
- - Override specific values through the interface
89
- - Interface values take precedence over environment variables
90
-
91
- ## 📊 Configuration Priority
92
-
93
- The app uses this priority order for configuration:
94
-
95
- 1. **Interface Input** (highest priority)
96
- 2. **Environment Variables** (fallback)
97
- 3. **Default Values** (lowest priority)
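The priority order can be read as a simple fallback chain. The sketch below is one possible way to express it, not the app's actual code; `resolve_setting` is a hypothetical helper, while `TRACKIO_DATASET_REPO` and the default repository are the names used in this guide:

```python
import os

# Default repository named in this guide.
DEFAULT_DATASET_REPO = "tonic/trackio-experiments"


def resolve_setting(interface_value, env_var, default=None):
    """Pick interface input first, then the environment variable, then the default."""
    if interface_value and interface_value.strip():
        return interface_value.strip()
    env_value = os.environ.get(env_var, "").strip()
    if env_value:
        return env_value
    return default


# Example: nothing typed in the interface, so the environment (or default) wins.
print(resolve_setting("", "TRACKIO_DATASET_REPO", DEFAULT_DATASET_REPO))
```

Treating blank interface input as "not set" is a design assumption here; it lets a user clear a field to fall back to the environment value.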
98
-
99
- ## 🛠️ Getting Your HF Token
100
-
101
- ### **Step-by-Step Instructions**
102
- 1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
103
- 2. Click "New token"
104
- 3. Give it a name (e.g., "Trackio Access")
105
- 4. Select "Write" permissions
106
- 5. Click "Generate token"
107
- 6. Copy the token (starts with `hf_`)
108
- 7. Paste it in the app's HF Token field
109
-
110
- ### **Token Permissions**
111
- - **Read**: Required for loading experiments
112
- - **Write**: Required for saving experiments
113
- - **Scope**: Should have access to your dataset repositories
114
-
115
- ## 📁 Dataset Repository Format
116
-
117
- ### **Correct Format**
118
- ```
119
- username/dataset-name
120
- ```
121
-
122
- ### **Examples**
123
- - `tonic/trackio-experiments` (default)
124
- - `your-username/my-experiments`
125
- - `your-org/team-experiments`
126
- - `your-username/smollm3-experiments`
127
-
128
- ### **Validation**
129
- - Must contain exactly one `/`
130
- - Username must be valid HF username
131
- - Dataset name must be valid (alphanumeric + hyphens)
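These rules are easy to check before calling the Hub. The following is a sketch that enforces only the rules listed above (exactly one `/`, alphanumeric characters and hyphens); the Hub itself may accept a wider character set, and `is_valid_dataset_repo` is a hypothetical name rather than a function from the app:

```python
import re

# One name segment: starts with an alphanumeric, then alphanumerics or hyphens.
_SEGMENT = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")


def is_valid_dataset_repo(repo_id: str) -> bool:
    """Check the `username/dataset-name` format described in this guide."""
    parts = repo_id.split("/")
    if len(parts) != 2:
        return False
    return all(_SEGMENT.fullmatch(part) is not None for part in parts)


for candidate in ("tonic/trackio-experiments", "missing-slash", "a/b/c"):
    print(candidate, is_valid_dataset_repo(candidate))
```

Whether the username actually exists on the Hub cannot be verified locally; that part still needs the "Test Connection" step.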
132
-
133
- ## 🔍 Testing Your Configuration
134
-
135
- ### **1. Test Connection**
136
- - Enter your HF token and dataset repository
137
- - Click "Test Connection"
138
- - Should show: "✅ Connection successful!"
139
-
140
- ### **2. Create Dataset (if needed)**
141
- - If dataset doesn't exist, click "Create Dataset"
142
- - Should show: "✅ Dataset created successfully!"
143
-
144
- ### **3. Update Configuration**
145
- - Click "Update Configuration"
146
- - Should show: "✅ Configuration updated successfully!"
147
-
148
- ## 🚨 Troubleshooting
149
-
150
- ### **Issue: "Please provide a Hugging Face token"**
151
- **Solution**:
152
- - Enter your HF token in the interface
153
- - Or set the `HF_TOKEN` environment variable
154
-
155
- ### **Issue: "Connection failed: 401 Unauthorized"**
156
- **Solutions**:
157
- 1. Check your HF token is correct
158
- 2. Verify the token has read access to the dataset
159
- 3. Ensure the dataset repository exists
160
-
161
- ### **Issue: "Failed to create dataset"**
162
- **Solutions**:
163
- 1. Check your HF token has write permissions
164
- 2. Verify the username in the repository name
165
- 3. Ensure the dataset name is valid
166
-
167
- ### **Issue: "Dataset repository must be in format: username/dataset-name"**
168
- **Solution**:
169
- - Use the correct format: `username/dataset-name`
170
- - Example: `your-username/my-experiments`
171
-
172
- ## 📈 Benefits
173
-
174
- ### **For Users**
175
- - ✅ **Easy Setup**: No need to set environment variables
176
- - ✅ **Visual Interface**: Clear input fields and validation
177
- - ✅ **Immediate Feedback**: Test connection and see results
178
- - ✅ **Flexible**: Can change configuration anytime
179
-
180
- ### **For Development**
181
- - ✅ **Backward Compatible**: Environment variables still work
182
- - ✅ **Fallback Support**: Graceful degradation
183
- - ✅ **Error Handling**: Clear error messages
184
- - ✅ **Validation**: Input validation and testing
185
-
186
- ### **For Deployment**
187
- - ✅ **HF Spaces Ready**: Works on Hugging Face Spaces
188
- - ✅ **No Restart Required**: Dynamic configuration updates
189
- - ✅ **Secure**: Password field for token input
190
- - ✅ **User-Friendly**: Clear instructions and help text
191
-
192
- ## 🎯 Usage Examples
193
-
194
- ### **Basic Setup**
195
- 1. Open the app
196
- 2. Go to "⚙️ Configuration" tab
197
- 3. Enter your HF token
198
- 4. Enter your dataset repository
199
- 5. Click "Update Configuration"
200
- 6. Click "Test Connection" to verify
201
-
202
- ### **Advanced Setup**
203
- 1. Set environment variables as defaults
204
- 2. Use interface to override specific values
205
- 3. Test connection to verify access
206
- 4. Create dataset if it doesn't exist
207
- 5. Start using the app with persistent storage
208
-
209
- ### **Team Setup**
210
- 1. Create a shared dataset repository
211
- 2. Share the repository name with team
212
- 3. Each team member sets their own HF token
213
- 4. All experiments are stored in the shared dataset
214
-
215
- ## 📋 Configuration Status
216
-
217
- The app shows current configuration status:
218
- ```
219
- 📊 Dataset: your-username/your-dataset
220
- 🔑 HF Token: Set
221
- 📈 Experiments: 5
222
- ```
223
-
224
- ## 🔄 Updating Configuration
225
-
226
- You can update configuration at any time:
227
- 1. Go to "⚙️ Configuration" tab
228
- 2. Change HF token or dataset repository
229
- 3. Click "Update Configuration"
230
- 4. Experiments will reload with new settings
231
-
232
- ---
233
-
234
- **🎉 Your Trackio app is now more flexible and user-friendly with direct configuration input!**

docs/CLOUD_DEPLOYMENT_GUIDE.md DELETED
@@ -1,462 +0,0 @@
1
- # Cloud Deployment Guide for SmolLM3 DPO Training
2
-
3
- This guide provides the exact sequence of commands to deploy and run SmolLM3 DPO training on a cloud computing instance with 6 epochs.
4
-
5
- ## Prerequisites
6
-
7
- ### Cloud Instance Requirements
8
-
9
- - **GPU**: NVIDIA A100, H100, or similar (16GB+ VRAM)
10
- - **RAM**: 64GB+ system memory
11
- - **Storage**: 100GB+ SSD storage
12
- - **OS**: Ubuntu 20.04 or 22.04
13
-
14
- ### Required Information
15
-
16
- Before starting, gather these details:
17
- - Your Hugging Face username
18
- - Your Hugging Face token (with write permissions)
19
- - Your Trackio Space URL (if using monitoring)
20
-
21
- ## Step-by-Step Deployment
22
-
23
- ### Step 1: Launch Cloud Instance
24
-
25
- Choose your cloud provider and launch an instance:
26
-
27
- #### AWS (g5.2xlarge or g5.4xlarge)
28
- ```bash
29
- # Launch instance with Ubuntu 22.04 and appropriate GPU
30
- aws ec2 run-instances \
31
- --image-id ami-0c7217cdde317cfec \
32
- --instance-type g5.2xlarge \
33
- --key-name your-key-pair \
34
- --security-group-ids sg-xxxxxxxxx
35
- ```
36
-
37
- #### Google Cloud (n1-standard-8 with T4/V100)
38
- ```bash
39
- gcloud compute instances create smollm3-dpo \
40
- --zone=us-central1-a \
41
- --machine-type=n1-standard-8 \
42
- --accelerator="type=nvidia-tesla-t4,count=1" \
43
- --image-family=ubuntu-2204-lts \
44
- --image-project=ubuntu-os-cloud
45
- ```
46
-
47
- #### Azure (Standard_NC6s_v3)
48
- ```bash
49
- az vm create \
50
- --resource-group your-rg \
51
- --name smollm3-dpo \
52
- --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
53
- --size Standard_NC6s_v3 \
54
- --admin-username azureuser
55
- ```
56
-
57
- ### Step 2: Connect to Instance
58
-
59
- ```bash
60
- # SSH to your instance
61
- ssh -i your-key.pem ubuntu@your-instance-ip
62
-
63
- # Or for Azure
64
- ssh azureuser@your-instance-ip
65
- ```
66
-
67
- ### Step 3: Update System and Install Dependencies
68
-
69
- ```bash
70
- # Update system
71
- sudo apt-get update
72
- sudo apt-get upgrade -y
73
-
74
- # Install system dependencies
75
- sudo apt-get install -y git curl wget unzip python3 python3-pip python3-venv
76
-
77
- # Install the NVIDIA Container Toolkit (the GPU driver itself must already be installed)
78
- curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
79
- curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
80
- sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
81
- sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
82
-
83
- sudo apt-get update
84
- sudo apt-get install -y nvidia-container-toolkit
85
- ```
86
-
87
- ### Step 4: Clone Repository and Setup Environment
88
-
89
- ```bash
90
- # Clone your repository
91
- git clone https://github.com/your-username/flexai-finetune.git
92
- cd flexai-finetune
93
-
94
- # Create virtual environment
95
- python3 -m venv smollm3_env
96
- source smollm3_env/bin/activate
97
-
98
- # Install PyTorch with CUDA
99
- pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
100
-
101
- # Install project dependencies
102
- pip install -r requirements.txt
103
-
104
- # Install additional DPO dependencies
105
- pip install "trl>=0.7.0"
106
- pip install "peft>=0.4.0"
107
- pip install "accelerate>=0.20.0"
108
- ```
109
-
110
- ### Step 5: Configure Authentication
111
-
112
- ```bash
113
- # Set your Hugging Face token
114
- export HF_TOKEN="your_huggingface_token_here"
115
-
116
- # Login to Hugging Face
117
- hf login --token $HF_TOKEN
118
- ```
119
-
120
- ### Step 6: Create Configuration Files
121
-
122
- Create the DPO configuration file:
123
-
124
- ```bash
125
- cat > config/train_smollm3_dpo_6epochs.py << 'EOF'
126
- """
127
- SmolLM3 DPO Training Configuration - 6 Epochs
128
- Optimized for cloud deployment
129
- """
130
-
131
- from config.train_smollm3_dpo import SmolLM3DPOConfig
132
-
133
- config = SmolLM3DPOConfig(
134
- # Model configuration
135
- model_name="HuggingFaceTB/SmolLM3-3B",
136
- max_seq_length=4096,
137
- use_flash_attention=True,
138
- use_gradient_checkpointing=True,
139
-
140
- # Training configuration
141
- batch_size=2,
142
- gradient_accumulation_steps=8,
143
- learning_rate=5e-6,
144
- weight_decay=0.01,
145
- warmup_steps=100,
146
- max_iters=None, # Will be calculated based on epochs
147
- eval_interval=100,
148
- log_interval=10,
149
- save_interval=500,
150
-
151
- # DPO configuration
152
- beta=0.1,
153
- max_prompt_length=2048,
154
-
155
- # Optimizer configuration
156
- optimizer="adamw",
157
- beta1=0.9,
158
- beta2=0.95,
159
- eps=1e-8,
160
-
161
- # Scheduler configuration
162
- scheduler="cosine",
163
- min_lr=1e-6,
164
-
165
- # Mixed precision
166
- fp16=True,
167
- bf16=False,
168
-
169
- # Logging and saving
170
- save_steps=500,
171
- eval_steps=100,
172
- logging_steps=10,
173
- save_total_limit=3,
174
-
175
- # Evaluation
176
- eval_strategy="steps",
177
- metric_for_best_model="eval_loss",
178
- greater_is_better=False,
179
- load_best_model_at_end=True,
180
-
181
- # Data configuration
182
- data_dir="smoltalk_dataset",
183
- train_file="train.json",
184
- validation_file="validation.json",
185
-
186
- # Chat template configuration
187
- use_chat_template=True,
188
- chat_template_kwargs={
189
- "enable_thinking": False,
190
- "add_generation_prompt": True
191
- },
192
-
193
- # Trackio monitoring configuration
194
- enable_tracking=True,
195
- trackio_url="https://your-trackio-space.hf.space", # Change this
196
- trackio_token=None,
197
- log_artifacts=True,
198
- log_metrics=True,
199
- log_config=True,
200
- experiment_name="smollm3_dpo_6epochs"
201
- )
202
- EOF
203
- ```
204
-
205
- ### Step 7: Download and Prepare Dataset
206
-
207
- ```bash
208
- # Create dataset preparation script
209
- cat > prepare_dataset.py << 'EOF'
210
- from datasets import load_dataset
211
- import json
212
- import os
213
-
214
- # Load SmolTalk dataset
215
- print('Loading SmolTalk dataset...')
216
- dataset = load_dataset('HuggingFaceTB/smoltalk')
217
-
218
- # Create dataset directory
219
- os.makedirs('smoltalk_dataset', exist_ok=True)
220
-
221
- # Convert to DPO format (preference pairs)
222
- def convert_to_dpo_format(example):
223
-     # Map each example onto the prompt/chosen/rejected fields that DPO expects.
224
-     # Simplified: a missing field becomes '' and the example is then skipped by the filters below; adjust for your dataset.
225
-     return {
226
-         'prompt': example.get('prompt', ''),
227
-         'chosen': example.get('chosen', ''),
228
-         'rejected': example.get('rejected', '')
229
-     }
230
-
231
- # Process train split
232
- train_data = []
233
- for example in dataset['train']:
234
-     dpo_example = convert_to_dpo_format(example)
235
-     if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
236
-         train_data.append(dpo_example)
237
-
238
- # Process validation split
239
- val_data = []
240
- for example in dataset['validation']:
241
-     dpo_example = convert_to_dpo_format(example)
242
-     if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
243
-         val_data.append(dpo_example)
244
-
245
- # Save to files
246
- with open('smoltalk_dataset/train.json', 'w') as f:
247
-     json.dump(train_data, f, indent=2)
248
-
249
- with open('smoltalk_dataset/validation.json', 'w') as f:
250
-     json.dump(val_data, f, indent=2)
251
-
252
- print(f'Dataset prepared: {len(train_data)} train samples, {len(val_data)} validation samples')
253
- EOF
254
-
255
- # Run dataset preparation
256
- python prepare_dataset.py
257
- ```
258
-
259
- ### Step 8: Calculate Training Parameters
260
-
261
- ```bash
262
- # Calculate training steps based on epochs
263
- TOTAL_SAMPLES=$(python -c "import json; data=json.load(open('smoltalk_dataset/train.json')); print(len(data))")
264
- BATCH_SIZE=2
265
- GRADIENT_ACCUMULATION_STEPS=8
266
- MAX_EPOCHS=6
267
- EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
268
- STEPS_PER_EPOCH=$((TOTAL_SAMPLES / EFFECTIVE_BATCH_SIZE))
269
- MAX_STEPS=$((STEPS_PER_EPOCH * MAX_EPOCHS))
270
-
271
- echo "Training Configuration:"
272
- echo " Total samples: $TOTAL_SAMPLES"
273
- echo " Effective batch size: $EFFECTIVE_BATCH_SIZE"
274
- echo " Steps per epoch: $STEPS_PER_EPOCH"
275
- echo " Total training steps: $MAX_STEPS"
276
- echo " Training epochs: $MAX_EPOCHS"
277
- ```
278
-
279
- ### Step 9: Start DPO Training
280
-
281
- ```bash
282
- # Start training with all parameters
283
- python train.py config/train_smollm3_dpo_6epochs.py \
284
- --dataset_dir smoltalk_dataset \
285
- --out_dir /output-checkpoint \
286
- --init_from scratch \
287
- --max_iters $MAX_STEPS \
288
- --batch_size $BATCH_SIZE \
289
- --learning_rate 5e-6 \
290
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
291
- --max_seq_length 4096 \
292
- --save_steps 500 \
293
- --eval_steps 100 \
294
- --logging_steps 10 \
295
- --enable_tracking \
296
- --trackio_url "https://your-trackio-space.hf.space" \
297
- --experiment_name "smollm3_dpo_6epochs"
298
- ```
299
-
300
- ### Step 10: Push Model to Hugging Face Hub
301
-
302
- ```bash
303
- # Push the trained model
304
- python push_to_huggingface.py /output-checkpoint "your-username/smollm3-dpo-6epochs" \
305
- --token "$HF_TOKEN" \
306
- --trackio-url "https://your-trackio-space.hf.space" \
307
- --experiment-name "smollm3_dpo_6epochs"
308
- ```
309
-
310
- ### Step 11: Test the Uploaded Model
311
-
312
- ```bash
313
- # Test the model
314
- python -c "
315
- from transformers import AutoModelForCausalLM, AutoTokenizer
316
- import torch
317
-
318
- print('Loading uploaded model...')
319
- model = AutoModelForCausalLM.from_pretrained('your-username/smollm3-dpo-6epochs', torch_dtype=torch.float16, device_map='auto')
320
- tokenizer = AutoTokenizer.from_pretrained('your-username/smollm3-dpo-6epochs')
321
-
322
- print('Testing model generation...')
323
- prompt = 'Hello, how are you?'
324
- inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
325
- outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
326
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
327
- print(f'Prompt: {prompt}')
328
- print(f'Response: {response}')
329
- print('✅ Model test completed successfully!')
330
- "
331
- ```
332
-
333
- ## Complete One-Line Deployment
334
-
335
- If you want to run everything automatically, use the deployment script:
336
-
337
- ```bash
338
- # Make script executable
339
- chmod +x cloud_deployment.sh
340
-
341
- # Edit configuration in the script first
342
- nano cloud_deployment.sh
343
- # Change these variables:
344
- # - REPO_NAME="your-username/smollm3-dpo-6epochs"
345
- # - TRACKIO_URL="https://your-trackio-space.hf.space"
346
- # - HF_TOKEN="your_hf_token_here"
347
-
348
- # Run the complete deployment
349
- ./cloud_deployment.sh
350
- ```
351
-
352
- ## Monitoring and Debugging
353
-
354
- ### Check GPU Usage
355
-
356
- ```bash
357
- # Monitor GPU usage during training
358
- watch -n 1 nvidia-smi
359
- ```
360
-
361
- ### Check Training Logs
362
-
363
- ```bash
364
- # Monitor training progress
365
- tail -f training.log
366
-
367
- # Check system resources
368
- htop
369
- ```
370
-
371
- ### Monitor Trackio
372
-
373
- ```bash
374
- # Check if Trackio is logging properly
375
- curl -s "https://your-trackio-space.hf.space" | grep -i "experiment"
376
- ```
377
-
378
- ## Expected Timeline
379
-
380
- - **Setup**: 15-30 minutes
381
- - **Dataset preparation**: 5-10 minutes
382
- - **Training (6 epochs)**: 4-8 hours (depending on GPU)
383
- - **Model upload**: 10-30 minutes
384
- - **Testing**: 5-10 minutes
385
-
386
- ## Troubleshooting
387
-
388
- ### Common Issues
389
-
390
- #### 1. Out of Memory (OOM)
391
- ```bash
392
- # Reduce batch size
393
- BATCH_SIZE=1
394
- GRADIENT_ACCUMULATION_STEPS=16
395
-
396
- # Or use gradient checkpointing
397
- # Already enabled in config
398
- ```
399
-
400
- #### 2. Slow Training
401
- ```bash
402
- # Check GPU utilization
403
- nvidia-smi
404
-
405
- # Check if mixed precision is working
406
- # Look for "fp16" in training logs
407
- ```
408
-
409
- #### 3. Dataset Issues
410
- ```bash
411
- # Check dataset format
412
- head -n 5 smoltalk_dataset/train.json
413
-
414
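- # Verify dataset size (count JSON records; wc -l would count pretty-printed lines, not samples)
415
- python -c "import json; print(len(json.load(open('smoltalk_dataset/train.json'))))"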
416
- ```
417
-
418
- #### 4. Authentication Issues
419
- ```bash
420
- # Test HF token
421
- python -c "
422
- from huggingface_hub import HfApi
423
- api = HfApi(token='$HF_TOKEN')
424
- print('Token is valid for user:', api.whoami()['name'])  # whoami() raises an error if the token is rejected
425
- "
426
- ```
427
-
428
- ## Cost Estimation
429
-
430
- ### AWS (g5.2xlarge)
431
- - **Instance**: $0.526/hour
432
- - **Training time**: 6 hours
433
- - **Total cost**: ~$3.16
434
-
435
- ### Google Cloud (n1-standard-8 + T4)
436
- - **Instance**: $0.38/hour
437
- - **Training time**: 6 hours
438
- - **Total cost**: ~$2.28
439
-
440
- ### Azure (Standard_NC6s_v3)
441
- - **Instance**: $0.90/hour
442
- - **Training time**: 6 hours
443
- - **Total cost**: ~$5.40
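The totals above are simply the hourly rate multiplied by the training time. A short sketch for re-running the estimate with other durations, using the hourly rates quoted in this guide (they are the guide's figures and may not match current provider pricing); `training_cost` is a hypothetical helper:

```python
# Hourly rates as quoted in this guide (USD); check current provider pricing.
HOURLY_RATES_USD = {
    "AWS g5.2xlarge": 0.526,
    "Google Cloud n1-standard-8 + T4": 0.38,
    "Azure Standard_NC6s_v3": 0.90,
}


def training_cost(hourly_rate: float, hours: float) -> float:
    """Instance cost in USD for a run of the given length, rounded to cents."""
    return round(hourly_rate * hours, 2)


for instance, rate in HOURLY_RATES_USD.items():
    print(f"{instance}: ${training_cost(rate, 6):.2f} for 6 hours")
```

The estimate covers compute time only; storage, data transfer, and setup time before training starts are not included.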
444
-
445
- ## Next Steps
446
-
447
- After successful deployment:
448
-
449
- 1. **Monitor training** in your Trackio Space
450
- 2. **Check model repository** on Hugging Face Hub
451
- 3. **Test the model** with different prompts
452
- 4. **Share your model** with the community
453
- 5. **Iterate and improve** based on results
454
-
455
- ## Support
456
-
457
- - **Training issues**: Check logs and GPU utilization
458
- - **Upload issues**: Verify HF token and repository permissions
459
- - **Monitoring issues**: Check Trackio Space configuration
460
- - **Performance issues**: Adjust batch size and learning rate
461
-
462
- Your SmolLM3 DPO model will be ready for use after training completes!

docs/CLOUD_TRAINING_GUIDE.md DELETED
@@ -1,440 +0,0 @@
1
- # Cloud Training Guide for OpenHermes-FR Dataset
2
-
3
- This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the [legmlai/openhermes-fr](https://huggingface.co/datasets/legmlai/openhermes-fr) dataset.
4
-
5
- ## Overview
6
-
7
- The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, perfect for fine-tuning SmolLM3 models for French language tasks. This guide covers:
8
-
9
- - ✅ **Cloud Instance Setup** - Complete environment configuration
10
- - ✅ **Dataset Integration** - Automatic loading and filtering
11
- - ✅ **Training Configuration** - Optimized for French instruction tuning
12
- - ✅ **Monitoring Integration** - Trackio experiment tracking
13
- - ✅ **Model Deployment** - Push to Hugging Face Hub
14
-
15
- ## Dataset Information
16
-
17
- ### Schema
18
- ```json
19
- {
20
- "prompt": "Explique la différence entre la photosynthèse C3 et C4.",
21
- "accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)",
22
- "bad_prompt_detected": false,
23
- "bad_response_detected": false,
24
- "bad_entry": false
25
- }
26
- ```
27
-
28
- ### Key Features
29
- - **Size**: 799,875 examples (~1.4GB)
30
- - **Language**: 100% French
31
- - **Quality**: GPT-4o generated responses with automatic filtering
32
- - **License**: ODC-BY 1.0
33
-
34
- ## Cloud Instance Setup
35
-
36
- ### 1. Choose Your Cloud Provider
37
-
38
- #### **AWS EC2 (Recommended)**
39
- ```bash
40
- # Launch instance with GPU
41
- # Recommended: g4dn.xlarge or g5.xlarge
42
- # AMI: Deep Learning AMI (Ubuntu 20.04)
43
- ```
44
-
45
- #### **Google Cloud Platform**
46
- ```bash
47
- # Launch instance with GPU
48
- # Recommended: n1-standard-4 with Tesla T4 or V100
49
- ```
50
-
51
- #### **Azure**
52
- ```bash
53
- # Launch instance with GPU
54
- # Recommended: Standard_NC6s_v3 or Standard_NC12s_v3
55
- ```
56
-
57
- ### 2. Instance Specifications
58
-
59
- #### **Minimum Requirements**
60
- - **GPU**: 16GB+ VRAM (Tesla T4, V100, or A100)
61
- - **RAM**: 32GB+ system memory
62
- - **Storage**: 100GB+ SSD
63
- - **CPU**: 8+ cores
64
-
65
- #### **Recommended Specifications**
66
- - **GPU**: A100 (40GB) or H100 (80GB)
67
- - **RAM**: 64GB+ system memory
68
- - **Storage**: 200GB+ NVMe SSD
69
- - **CPU**: 16+ cores
70
-
71
- ### 3. Environment Setup
72
-
73
- ```bash
74
- # Update system
75
- sudo apt update && sudo apt upgrade -y
76
-
77
- # Install CUDA (if not pre-installed)
78
- # Follow NVIDIA CUDA installation guide for your GPU
79
-
80
- # Install Python dependencies
81
- sudo apt install python3-pip python3-venv git -y
82
-
83
- # Create virtual environment
84
- python3 -m venv smollm3_env
85
- source smollm3_env/bin/activate
86
-
87
- # Clone repository
88
- git clone <your-repo-url>
89
- cd <your-repo-directory>
90
-
91
- # Install dependencies
92
- pip install -r requirements.txt
93
-
94
- # Install additional dependencies for cloud training
95
- pip install accelerate transformers datasets huggingface_hub
96
- ```
97
-
98
- ## Training Configuration
99
-
100
- ### 1. Use the OpenHermes-FR Config
101
-
102
- The repository includes a specialized configuration for the OpenHermes-FR dataset:
103
-
104
- ```bash
105
- python train.py config/train_smollm3_openhermes_fr.py \
106
- --enable_tracking \
107
- --trackio_url "https://your-space.hf.space" \
108
- --experiment_name "smollm3_fr_openhermes_v1"
109
- ```
110
-
111
- ### 2. Configuration Details
112
-
113
- The `config/train_smollm3_openhermes_fr.py` includes:
114
-
115
- #### **Dataset Configuration**
116
- ```python
117
- dataset_name: str = "legmlai/openhermes-fr"
118
- dataset_split: str = "train"
119
- input_field: str = "prompt"
120
- target_field: str = "accepted_completion"
121
- filter_bad_entries: bool = True
122
- bad_entry_field: str = "bad_entry"
123
- ```
124
-
125
- #### **Training Optimization**
126
- ```python
127
- batch_size: int = 2 # Reduced for French text (longer sequences)
128
- gradient_accumulation_steps: int = 8 # Maintains effective batch size
129
- learning_rate: float = 1e-5 # Lower for instruction tuning
130
- max_iters: int = 2000 # More iterations for large dataset
131
- ```
132
-
133
- #### **Monitoring Integration**
134
- ```python
135
- enable_tracking: bool = True
136
- experiment_name: str = "smollm3_openhermes_fr"
137
- ```
138
-
139
- ## Training Commands
140
-
141
- ### Basic Training
142
- ```bash
143
- python train.py config/train_smollm3_openhermes_fr.py
144
- ```
145
-
146
- ### Training with Monitoring
147
- ```bash
148
- python train.py config/train_smollm3_openhermes_fr.py \
149
- --enable_tracking \
150
- --trackio_url "https://your-trackio-space.hf.space" \
151
- --experiment_name "smollm3_fr_openhermes_v1"
152
- ```
153
-
154
- ### Training with Custom Parameters
155
- ```bash
156
- python train.py config/train_smollm3_openhermes_fr.py \
157
- --batch_size 4 \
158
- --learning_rate 2e-5 \
159
- --max_iters 3000 \
160
- --enable_tracking \
161
- --trackio_url "https://your-trackio-space.hf.space" \
162
- --experiment_name "smollm3_fr_high_lr"
163
- ```
164
-
165
- ### Training with Checkpoint Resume
166
- ```bash
167
- python train.py config/train_smollm3_openhermes_fr.py \
168
- --init_from resume \
169
- --enable_tracking \
170
- --trackio_url "https://your-trackio-space.hf.space" \
171
- --experiment_name "smollm3_fr_resume"
172
- ```
173
-
174
- ## Dataset Processing
175
-
176
- ### Automatic Filtering
177
-
178
- The training script automatically:
179
- - ✅ **Loads** the OpenHermes-FR dataset from Hugging Face
180
- - ✅ **Filters** out bad entries (`bad_entry = true`)
181
- - ✅ **Splits** data into train/validation/test (98/1/1)
182
- - ✅ **Formats** prompts and completions for instruction tuning
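The filter and split steps can be illustrated without any dependencies. This is a library-free sketch on synthetic records, not the training script's implementation (which presumably works on a `datasets` object); `filter_and_split`, the seed, and the shuffle are assumptions made for the example:

```python
import random


def filter_and_split(records, bad_field="bad_entry", fractions=(0.98, 0.01), seed=42):
    """Drop flagged records, shuffle, and split into train/validation/test lists."""
    clean = [r for r in records if not r.get(bad_field, False)]
    random.Random(seed).shuffle(clean)
    n_train = round(len(clean) * fractions[0])
    n_val = round(len(clean) * fractions[1])
    train = clean[:n_train]
    val = clean[n_train:n_train + n_val]
    test = clean[n_train + n_val:]  # whatever remains (about 1%)
    return train, val, test


# Synthetic stand-in for the dataset: every 10th record is flagged as bad.
records = [
    {"prompt": f"question {i}", "accepted_completion": f"answer {i}", "bad_entry": i % 10 == 0}
    for i in range(1000)
]
train, val, test = filter_and_split(records)
print(len(train), len(val), len(test))
```

Filtering before splitting keeps flagged entries out of all three splits, so evaluation numbers are not skewed by examples the dataset authors marked as unreliable.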
183
-
184
- ### Manual Dataset Inspection
185
-
186
- ```python
187
- from datasets import load_dataset
188
-
189
- # Load dataset
190
- dataset = load_dataset("legmlai/openhermes-fr")
191
-
192
- # Check dataset info
193
- print(f"Dataset size: {len(dataset['train'])}")
194
- print(f"Sample columns: {dataset['train'].column_names}")
195
-
196
- # Check filtering
197
- bad_entries = dataset['train'].filter(lambda x: x['bad_entry'])
198
- print(f"Bad entries: {len(bad_entries)}")
199
-
200
- # Sample data
201
- sample = dataset['train'][0]
202
- print(f"Prompt: {sample['prompt']}")
203
- print(f"Completion: {sample['accepted_completion']}")
204
- ```
205
-
206
- ## Monitoring and Tracking
207
-
208
- ### Trackio Integration
209
-
210
- The training automatically logs:
211
- - **Training metrics**: Loss, accuracy, learning rate
212
- - **System metrics**: GPU memory, CPU usage
213
- - **Dataset info**: Size, filtering statistics
214
- - **Model checkpoints**: Regular saves with metadata
215
-
216
- ### View Training Progress
217
-
218
- 1. **Trackio Space**: Visit your Trackio Space URL
219
- 2. **Experiment Details**: Check the "View Experiments" tab
220
- 3. **Metrics**: Monitor loss curves and system usage
221
- 4. **Logs**: Download training logs for analysis
222
-
223
- ## Model Deployment
224
-
225
- ### Push to Hugging Face Hub
226
-
227
- After training, deploy your model:
228
-
229
- ```bash
230
- python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \
231
- --trackio-url "https://your-trackio-space.hf.space" \
232
- --experiment-name "smollm3_fr_openhermes_v1"
233
- ```
234
-
235
- ### Use Your Model
236
-
237
- ```python
238
- from transformers import AutoModelForCausalLM, AutoTokenizer
239
-
240
- # Load your fine-tuned model
241
- model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes")
242
- tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes")
243
-
244
- # Generate French text
245
- prompt = "Expliquez le concept de l'intelligence artificielle."
246
- inputs = tokenizer(prompt, return_tensors="pt")
247
- outputs = model.generate(**inputs, max_new_tokens=200)
248
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
249
- ```
250
-
251
- ## Performance Optimization
252
-
253
- ### GPU Memory Management
254
-
255
- ```bash
256
- # Monitor GPU usage
257
- nvidia-smi -l 1
258
-
259
- # Optimize for your GPU
260
- # For 16GB VRAM: batch_size=2, gradient_accumulation_steps=8
261
- # For 24GB VRAM: batch_size=4, gradient_accumulation_steps=4
262
- # For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2
263
- ```
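
The three presets in the comments above trade batch size against gradient accumulation so that the effective batch size stays constant. The sketch below makes that arithmetic explicit for a single GPU. `BatchPreset` and `pick_preset` are hypothetical names introduced for illustration; the numbers are the ones from the comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchPreset:
    vram_gb: int
    batch_size: int
    gradient_accumulation_steps: int

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step on a single GPU.
        return self.batch_size * self.gradient_accumulation_steps


# Values copied from the comments in the block above.
PRESETS = [
    BatchPreset(vram_gb=16, batch_size=2, gradient_accumulation_steps=8),
    BatchPreset(vram_gb=24, batch_size=4, gradient_accumulation_steps=4),
    BatchPreset(vram_gb=40, batch_size=8, gradient_accumulation_steps=2),
]


def pick_preset(available_vram_gb: int) -> BatchPreset:
    """Return the largest preset that fits in the available VRAM."""
    eligible = [p for p in PRESETS if p.vram_gb <= available_vram_gb]
    if not eligible:
        raise ValueError("no preset fits in the available VRAM")
    return max(eligible, key=lambda p: p.vram_gb)


effective_sizes = [p.effective_batch_size for p in PRESETS]
```

Because every preset yields the same effective batch size, the learning rate should not need retuning when moving between these GPU tiers.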
264
-
265
- ### Training Speed
266
-
267
- ```python
268
- # Use mixed precision (enabled by default)
269
- fp16: bool = True
270
-
271
- # Enable gradient checkpointing (enabled by default)
272
- use_gradient_checkpointing: bool = True
273
-
274
- # Use flash attention (enabled by default)
275
- use_flash_attention: bool = True
276
- ```
277
-
278
- ## Troubleshooting
279
-
280
- ### Common Issues
281
-
282
- #### 1. **Out of Memory (OOM)**
283
- ```bash
284
- # Reduce batch size
285
- python train.py config/train_smollm3_openhermes_fr.py --batch_size 1
286
-
287
- # Increase gradient accumulation
288
- # Edit config: gradient_accumulation_steps = 16
289
- ```
290
-
291
- #### 2. **Slow Training**
292
- ```bash
293
- # Check GPU utilization
294
- nvidia-smi
295
-
296
- # Verify data loading
297
- # Check if dataset is cached locally
298
- ```
299
-
300
- #### 3. **Dataset Loading Issues**
301
- ```bash
302
- # Clear cache
303
- rm -rf ~/.cache/huggingface/datasets  # datasets cache only; the parent directory also holds your saved login token
304
-
305
- # Check internet connection
306
- # Verify dataset name: "legmlai/openhermes-fr"
307
- ```
308
-
309
- #### 4. **Monitoring Connection Issues**
310
- ```bash
311
- # Test Trackio connection
312
- curl -I https://your-trackio-space.hf.space
313
-
314
- # Check token permissions
315
- # Verify experiment name format
316
- ```
317
-
318
- ### Debug Mode
319
-
320
- ```bash
321
- # Enable debug logging
322
- export LOG_LEVEL=DEBUG
323
- python train.py config/train_smollm3_openhermes_fr.py
324
- ```
325
-
326
- ## Cost Optimization
327
-
328
- ### Cloud Provider Tips
329
-
330
- #### **AWS EC2**
331
- - Use Spot Instances for cost savings
332
- - Monitor usage with CloudWatch
333
- - Use appropriate instance types
334
-
335
- #### **Google Cloud Platform**
336
- - Use Preemptible VMs for non-critical training
337
- - Monitor with Cloud Monitoring
338
- - Use committed use discounts
339
-
340
- #### **Azure**
341
- - Use Spot VMs for cost optimization
342
- - Monitor with Azure Monitor
343
- - Use reserved instances for long training
344
-
345
- ### Training Time Estimates
346
-
347
- | GPU Type | Batch Size | Estimated Time |
348
- |----------|------------|----------------|
349
- | Tesla T4 (16GB) | 2 | 8-12 hours |
350
- | V100 (32GB) | 4 | 4-6 hours |
351
- | A100 (40GB) | 8 | 2-3 hours |
352
- | H100 (80GB) | 16 | 1-2 hours |
353
-
354
- ## Security Best Practices
355
-
356
- ### Token Management
357
- ```bash
358
- # Use environment variables
359
- export HF_TOKEN="your_token_here"
360
- export TRACKIO_TOKEN="your_trackio_token"
361
-
362
- # Don't hardcode in scripts
363
- # Use IAM roles when possible
364
- ```
365
-
366
- ### Data Privacy
367
- ```bash
368
- # Use private repositories for sensitive models
369
- python push_to_huggingface.py model username/private-model --private
370
-
371
- # Secure your cloud instance
372
- # Use VPC and security groups
373
- ```
374
-
375
- ## Complete Workflow Example
376
-
377
- ### 1. Setup Cloud Instance
378
- ```bash
379
- # Launch GPU instance
380
- # Install dependencies
381
- git clone <your-repo>
382
- cd <your-repo>
383
- pip install -r requirements.txt
384
- ```
385
-
386
- ### 2. Train Model
387
- ```bash
388
- python train.py config/train_smollm3_openhermes_fr.py \
389
- --enable_tracking \
390
- --trackio_url "https://your-space.hf.space" \
391
- --experiment_name "smollm3_fr_v1"
392
- ```
393
-
394
- ### 3. Deploy Model
395
- ```bash
396
- python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \
397
- --trackio-url "https://your-space.hf.space" \
398
- --experiment-name "smollm3_fr_v1"
399
- ```
400
-
401
- ### 4. Test Model
402
- ```python
403
- from transformers import AutoModelForCausalLM, AutoTokenizer
404
-
405
- model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1")
406
- tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1")
407
-
408
- # Test French generation
409
- prompt = "Qu'est-ce que l'apprentissage automatique?"
410
- inputs = tokenizer(prompt, return_tensors="pt")
411
- outputs = model.generate(**inputs, max_new_tokens=100)
412
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
413
- ```
414
-
415
- ## Support and Resources
416
-
417
- ### Documentation
418
- - [OpenHermes-FR Dataset](https://huggingface.co/datasets/legmlai/openhermes-fr)
419
- - [SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
420
- - [Trackio Monitoring](https://github.com/Josephrp/trackio)
421
-
422
- ### Community
423
- - [Hugging Face Forums](https://discuss.huggingface.co/)
424
- - [Transformers Documentation](https://huggingface.co/docs/transformers/)
425
-
426
- ### Examples
427
- - [French Language Models](https://huggingface.co/models?search=french)
428
- - [Instruction Tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
429
-
430
- ## Conclusion
431
-
432
- This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud:
433
-
434
- - ✅ **Complete Setup** - From cloud instance to model deployment
435
- - ✅ **Optimized Configuration** - Tailored for French instruction tuning
436
- - ✅ **Monitoring Integration** - Trackio experiment tracking
437
- - ✅ **Cost Optimization** - Tips for efficient cloud usage
438
- - ✅ **Troubleshooting** - Solutions for common issues
439
-
440
- Start training your French language model today!
docs/Configuration_Management.md ADDED
@@ -0,0 +1,29 @@
1
+ ```mermaid
2
+ graph LR
3
+ Configuration_Management["Configuration Management"]
4
+ Training_Orchestration["Training Orchestration"]
5
+ Training_Orchestration -- "retrieves configuration from" --> Configuration_Management
6
+ click Configuration_Management href "https://github.com//Josephrp/SmolFactory/blob/main/SmolFactory/docs/blob/Configuration_Management.md" "Details"
7
+ ```
8
+
9
+ [![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%[email protected]?style=flat-square)](mailto:[email protected])
10
+
11
+ ## Details
12
+
13
+ This graph shows a single dependency: Training Orchestration retrieves its settings from Configuration Management, so every training or fine-tuning run reads its parameters from one central definition.
14
+
15
+ ### Configuration Management [[Expand]](./Configuration_Management.md)
16
+ This component, primarily embodied by the `SmolLM3Config` dataclass and the `get_config` function in `config/train_smollm3.py`, is responsible for the centralized definition, loading, validation, and provision of access to all training parameters, model specifications, data paths, and hyperparameters. It supports loading both base and custom configurations, ensuring that all necessary settings are available and correctly formatted for the training and fine-tuning processes.
17
+
18
+
19
+ **Related Classes/Methods**: _None_
20
+
21
+ ### Training Orchestration
22
+ This component represents the main scripts or modules responsible for initiating and coordinating the training and fine-tuning processes. It acts as the primary entry point for different training runs, retrieving necessary configurations and orchestrating the overall training pipeline.
23
+
24
+
25
+ **Related Classes/Methods**: _None_
26
+
27
+
28
+
29
+ ### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
docs/DATASET_AUTOMATION_FIX.md DELETED
@@ -1,218 +0,0 @@
1
- # Dataset Configuration Automation Fix
2
-
3
- ## Problem Description
4
-
5
- The original launch script required users to manually specify their username in the dataset repository name, which was:
6
- 1. **Error-prone**: Users had to remember their username
7
- 2. **Inconsistent**: Different users might use different naming conventions
8
- 3. **Manual**: Required extra steps in the setup process
9
-
10
- ## Solution Implementation
11
-
12
- ### Automatic Dataset Repository Creation
13
-
14
- We've implemented a Python-based solution that automatically:
15
-
16
- 1. **Extracts username from token**: Uses the HF API to get the username from the validated token
17
- 2. **Creates dataset repository**: Automatically creates `username/trackio-experiments` or custom name
18
- 3. **Sets environment variables**: Automatically configures `TRACKIO_DATASET_REPO`
19
- 4. **Provides customization**: Allows users to customize the dataset name if desired
20
-
21
- ### Key Components
22
-
23
- #### 1. **`scripts/dataset_tonic/setup_hf_dataset.py`** - Main Dataset Setup Script
24
- - Automatically detects username from HF token
25
- - Creates dataset repository with proper permissions
26
- - Supports custom dataset names
27
- - Sets environment variables for other scripts
28
-
29
- #### 2. **Updated `launch.sh`** - Enhanced User Experience
30
- - Automatically creates dataset repository
31
- - Provides options for default or custom dataset names
32
- - Fallback to manual input if automatic creation fails
33
- - Clear user feedback and progress indicators
34
-
35
- #### 3. **Python API Integration** - Consistent Authentication
36
- - Uses `HfApi(token=token)` for direct token authentication
37
- - Avoids environment variable conflicts
38
- - Consistent error handling across all scripts
39
-
40
- ## Usage Examples
41
-
42
- ### Automatic Dataset Creation (Default)
43
-
44
- ```bash
45
- # The launch script now automatically:
46
- python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here
47
-
48
- # Creates: username/trackio-experiments
49
- # Sets: TRACKIO_DATASET_REPO=username/trackio-experiments
50
- ```
51
-
52
- ### Custom Dataset Name
53
-
54
- ```bash
55
- # Create with custom name
56
- python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here my-custom-experiments
57
-
58
- # Creates: username/my-custom-experiments
59
- # Sets: TRACKIO_DATASET_REPO=username/my-custom-experiments
60
- ```
61
-
62
- ### Launch Script Integration
63
-
64
- The launch script now provides a seamless experience:
65
-
66
- ```bash
67
- ./launch.sh
68
-
69
- # Step 3: Experiment Details
70
- # - Automatically creates dataset repository
71
- # - Option to use default or custom name
72
- # - No manual username input required
73
- ```
74
-
75
- ## Features
76
-
77
- ### ✅ **Automatic Username Detection**
78
- - Extracts username from HF token using Python API
79
- - No manual username input required
80
- - Consistent across all scripts
81
-
82
- ### ✅ **Flexible Dataset Naming**
83
- - Default: `username/trackio-experiments`
84
- - Custom: `username/custom-name`
85
- - User choice during setup
86
-
87
- ### ✅ **Robust Error Handling**
88
- - Graceful fallback to manual input
89
- - Clear error messages
90
- - Token validation before creation
91
-
92
- ### ✅ **Environment Integration**
93
- - Automatically sets `TRACKIO_DATASET_REPO`
94
- - Compatible with existing scripts
95
- - No manual configuration required
96
-
97
- ### ✅ **Cross-Platform Compatibility**
98
- - Works on Windows, Linux, macOS
99
- - Uses Python API instead of CLI
100
- - Consistent behavior across platforms
101
-
102
- ## Technical Implementation
103
-
104
- ### Token Authentication Flow
105
-
106
- ```python
107
- # 1. Direct token authentication
108
- api = HfApi(token=token)
109
-
110
- # 2. Extract username
111
- user_info = api.whoami()
112
- username = user_info.get("name", user_info.get("username"))
113
-
114
- # 3. Create repository
115
- create_repo(
116
- repo_id=f"{username}/{dataset_name}",
117
- repo_type="dataset",
118
- token=token,
119
- exist_ok=True,
120
- private=False
121
- )
122
- ```
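
The username lookup in step 2 can be isolated into a small function that is testable without a network call. This is a sketch under stated assumptions: `build_dataset_repo_id` is a hypothetical helper, and the plain dict stands in for the response of `HfApi.whoami()`. Only the `name`/`username` fallback and the default `trackio-experiments` name come from this document.

```python
def build_dataset_repo_id(user_info: dict, dataset_name: str = "trackio-experiments") -> str:
    """Build '<username>/<dataset_name>' from a whoami-style response dict."""
    # Prefer "name" and fall back to "username"; `or` also skips empty strings.
    username = user_info.get("name") or user_info.get("username")
    if not username:
        raise ValueError("could not determine username from token response")
    return f"{username}/{dataset_name}"


default_repo = build_dataset_repo_id({"name": "username"})
custom_repo = build_dataset_repo_id({"username": "username"}, "my-custom-experiments")
```

Raising early on a missing username avoids creating a repository with an id such as `None/trackio-experiments`.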
123
-
124
- ### Launch Script Integration
125
-
126
- ```bash
127
- # Automatic dataset creation
128
- if python3 scripts/dataset_tonic/setup_hf_dataset.py 2>/dev/null; then
129
- TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
130
- print_status "Dataset repository created successfully"
131
- else
132
- # Fallback to manual input
133
- get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
134
- fi
135
- ```
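
One caveat with the shell snippet above, assuming the setup script sets `TRACKIO_DATASET_REPO` through `os.environ`: a variable set inside a child process is not visible to the parent that launched it. The sketch below demonstrates this with a stand-in child script and shows one way to pass the value back through standard output. The stand-in script and the idea of printing the repository id are assumptions for illustration, not the behaviour of `setup_hf_dataset.py`.

```python
import os
import subprocess
import sys

# Stand-in for the setup script: it sets the variable in its own process
# and prints the repository id so the caller can capture it.
CHILD_SCRIPT = (
    "import os\n"
    "os.environ['TRACKIO_DATASET_REPO'] = 'username/trackio-experiments'\n"
    "print(os.environ['TRACKIO_DATASET_REPO'])\n"
)


def run_setup_stand_in() -> str:
    """Run the child process and return what it printed on stdout."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Start from a clean state so the demonstration is unambiguous.
os.environ.pop("TRACKIO_DATASET_REPO", None)

repo_id = run_setup_stand_in()

# The child's assignment did not leak into this (parent) process.
parent_sees_variable = "TRACKIO_DATASET_REPO" in os.environ

# The parent has to export the captured value itself.
os.environ["TRACKIO_DATASET_REPO"] = repo_id
```

In a shell launcher, the equivalent would be capturing the script's output with command substitution and exporting it, provided the script prints the repository id.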
136
-
137
- ## User Experience Improvements
138
-
139
- ### Before (Manual Process)
140
- 1. User enters HF token
141
- 2. User manually types username
142
- 3. User manually types dataset repository name
143
- 4. User manually configures environment variables
144
- 5. Risk of typos and inconsistencies
145
-
146
- ### After (Automated Process)
147
- 1. User enters HF token
148
- 2. System automatically detects username
149
- 3. System automatically creates dataset repository
150
- 4. System automatically sets environment variables
151
- 5. Option to customize dataset name if desired
152
-
153
- ## Error Handling
154
-
155
- ### Common Scenarios
156
-
157
- | Scenario | Action | User Experience |
158
- |----------|--------|-----------------|
159
- | Valid token | ✅ Automatic creation | Seamless setup |
160
- | Invalid token | ❌ Clear error message | Helpful feedback |
161
- | Network issues | ⚠️ Retry with fallback | Graceful degradation |
162
- | Repository exists | ℹ️ Use existing | No conflicts |
163
-
164
- ### Fallback Mechanisms
165
-
166
- 1. **Token validation fails**: Clear error message with troubleshooting steps
167
- 2. **Dataset creation fails**: Fallback to manual input
168
- 3. **Network issues**: Retry with exponential backoff
169
- 4. **Permission issues**: Clear guidance on token permissions
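
The "retry with exponential backoff" fallback can be sketched as follows. The document does not show how the retry is implemented, so `retry_with_backoff`, its defaults, and the choice of retryable exceptions are assumptions. The `sleep` parameter is injectable so the behaviour can be checked without waiting.

```python
import time


def retry_with_backoff(
    operation,
    max_attempts=4,
    base_delay=1.0,
    sleep=time.sleep,
    retry_on=(ConnectionError, TimeoutError),
):
    """Call `operation` until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake operation that fails twice, then succeeds.
calls = {"count": 0}
recorded_delays = []


def flaky_create_repo():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "created"


outcome = retry_with_backoff(flaky_create_repo, sleep=recorded_delays.append)
```

Restricting `retry_on` to network-style errors keeps an invalid token or a permission problem from being retried pointlessly; those cases should go straight to the error message or manual-input fallback.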
170
-
171
- ## Benefits
172
-
173
- ### For Users
174
- - **Simplified Setup**: No manual username input required
175
- - **Reduced Errors**: Automatic username detection eliminates typos
176
- - **Consistent Naming**: Standardized repository naming conventions
177
- - **Better UX**: Clear progress indicators and feedback
178
-
179
- ### For Developers
180
- - **Maintainable Code**: Python API instead of CLI dependencies
181
- - **Cross-Platform**: Works consistently across operating systems
182
- - **Extensible**: Easy to add new features and customizations
183
- - **Testable**: Comprehensive test coverage
184
-
185
- ### For System
186
- - **Reliable**: Robust error handling and fallback mechanisms
187
- - **Secure**: Direct token authentication without environment conflicts
188
- - **Scalable**: Easy to extend for additional repository types
189
- - **Integrated**: Seamless integration with existing pipeline
190
-
191
- ## Migration Guide
192
-
193
- ### For Existing Users
194
-
195
- No migration required! The system automatically:
196
- - Detects existing repositories
197
- - Uses existing repositories if they exist
198
- - Creates new repositories only when needed
199
-
200
- ### For New Users
201
-
202
- The setup is now completely automated:
203
- 1. Run `./launch.sh`
204
- 2. Enter your HF token
205
- 3. Choose dataset naming preference
206
- 4. System handles everything else automatically
207
-
208
- ## Future Enhancements
209
-
210
- - [ ] Support for organization repositories
211
- - [ ] Multiple dataset repositories per user
212
- - [ ] Dataset repository templates
213
- - [ ] Advanced repository configuration options
214
- - [ ] Repository sharing and collaboration features
215
-
216
- ---
217
-
218
- **Note**: This automation ensures that users can focus on their fine-tuning experiments rather than repository setup details, while maintaining full flexibility for customization when needed.
docs/DATASET_COMPONENTS_VERIFICATION.md DELETED
@@ -1,235 +0,0 @@
1
- # Dataset Components Verification
2
-
3
- ## Overview
4
-
5
- This document verifies that all important dataset components have been properly implemented and are working correctly.
6
-
7
- ## ✅ **Verified Components**
8
-
9
- ### 1. **Initial Experiment Data** ✅ IMPLEMENTED
10
-
11
- **Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `add_initial_experiment_data()` function
12
-
13
- **What it does**:
14
- - Creates comprehensive sample experiment data
15
- - Includes realistic training metrics (loss, accuracy, GPU usage, etc.)
16
- - Contains proper experiment parameters (model name, batch size, learning rate, etc.)
17
- - Includes experiment logs and artifacts structure
18
- - Uploads data to HF Dataset using `datasets` library
19
-
20
- **Sample Data Structure**:
21
- ```json
22
- {
23
- "experiment_id": "exp_20250120_143022",
24
- "name": "smollm3-finetune-demo",
25
- "description": "SmolLM3 fine-tuning experiment demo with comprehensive metrics tracking",
26
- "created_at": "2025-01-20T14:30:22.123456",
27
- "status": "completed",
28
- "metrics": "[{\"timestamp\": \"2025-01-20T14:30:22.123456\", \"step\": 100, \"metrics\": {\"loss\": 1.15, \"grad_norm\": 10.5, \"learning_rate\": 5e-6, \"num_tokens\": 1000000.0, \"mean_token_accuracy\": 0.76, \"epoch\": 0.1, \"total_tokens\": 1000000.0, \"throughput\": 2000000.0, \"step_time\": 0.5, \"batch_size\": 2, \"seq_len\": 4096, \"token_acc\": 0.76, \"gpu_memory_allocated\": 15.2, \"gpu_memory_reserved\": 70.1, \"gpu_utilization\": 85.2, \"cpu_percent\": 2.7, \"memory_percent\": 10.1}}]",
29
- "parameters": "{\"model_name\": \"HuggingFaceTB/SmolLM3-3B\", \"max_seq_length\": 4096, \"batch_size\": 2, \"learning_rate\": 5e-6, \"epochs\": 3, \"dataset\": \"OpenHermes-FR\", \"trainer_type\": \"SFTTrainer\", \"hardware\": \"GPU (H100/A100)\", \"mixed_precision\": true, \"gradient_checkpointing\": true, \"flash_attention\": true}",
30
- "artifacts": "[]",
31
- "logs": "[{\"timestamp\": \"2025-01-20T14:30:22.123456\", \"level\": \"INFO\", \"message\": \"Training started successfully\"}, {\"timestamp\": \"2025-01-20T14:30:22.123456\", \"level\": \"INFO\", \"message\": \"Model loaded and configured\"}, {\"timestamp\": \"2025-01-20T14:30:22.123456\", \"level\": \"INFO\", \"message\": \"Dataset loaded and preprocessed\"}]",
32
- "last_updated": "2025-01-20T14:30:22.123456"
33
- }
34
- ```
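
Note that `metrics`, `parameters`, `artifacts`, and `logs` in the sample above are stored as JSON-encoded strings, not nested objects, so a consumer has to decode them a second time after loading the row. A minimal sketch of that step follows. `decode_experiment` is a hypothetical helper, and the row is a shortened version of the sample.

```python
import json

# Fields that the sample row stores as JSON text rather than nested values.
JSON_STRING_FIELDS = ("metrics", "parameters", "artifacts", "logs")


def decode_experiment(row: dict) -> dict:
    """Return a copy of the row with its JSON-string fields parsed."""
    decoded = dict(row)
    for field in JSON_STRING_FIELDS:
        value = decoded.get(field)
        if isinstance(value, str):
            decoded[field] = json.loads(value)
    return decoded


# Shortened version of the sample experiment shown above.
row = {
    "experiment_id": "exp_20250120_143022",
    "status": "completed",
    "metrics": json.dumps([{"step": 100, "metrics": {"loss": 1.15, "batch_size": 2}}]),
    "parameters": json.dumps({"model_name": "HuggingFaceTB/SmolLM3-3B", "batch_size": 2}),
    "artifacts": "[]",
    "logs": json.dumps([{"level": "INFO", "message": "Training started successfully"}]),
}

experiment = decode_experiment(row)
first_loss = experiment["metrics"][0]["metrics"]["loss"]
```

Storing these fields as strings keeps the dataset schema flat, at the cost of this extra decoding step on every read.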
35
-
36
- **Test Result**: ✅ Successfully uploaded to `Tonic/test-dataset-complete`
37
-
38
- ### 2. **README Templates** ✅ IMPLEMENTED
39
-
40
- **Location**:
41
- - Template: `templates/datasets/readme.md`
42
- - Implementation: `scripts/dataset_tonic/setup_hf_dataset.py` - `add_dataset_readme()` function
43
-
44
- **What it does**:
45
- - Uses comprehensive README template from `templates/datasets/readme.md`
46
- - Falls back to basic README if template doesn't exist
47
- - Includes dataset schema documentation
48
- - Provides usage examples and integration information
49
- - Uploads README to dataset repository using `huggingface_hub`
50
-
51
- **Template Features**:
52
- - Dataset schema documentation
53
- - Metrics structure examples
54
- - Integration instructions
55
- - Privacy and license information
56
- - Sample experiment entries
57
-
58
- **Test Result**: ✅ Successfully added README to `Tonic/test-dataset-complete`
59
-
60
- ### 3. **Dataset Repository Creation** ✅ IMPLEMENTED
61
-
62
- **Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `create_dataset_repository()` function
63
-
64
- **What it does**:
65
- - Creates HF Dataset repository with proper permissions
66
- - Handles existing repositories gracefully
67
- - Sets up public dataset for easier sharing
68
- - Uses Python API (`huggingface_hub.create_repo`)
69
-
70
- **Test Result**: ✅ Successfully created dataset repositories
71
-
72
- ### 4. **Automatic Username Detection** ✅ IMPLEMENTED
73
-
74
- **Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `get_username_from_token()` function
75
-
76
- **What it does**:
77
- - Extracts username from HF token using Python API
78
- - Uses `HfApi(token=token).whoami()`
79
- - Handles both `name` and `username` fields
80
- - Provides clear error messages
81
-
82
- **Test Result**: ✅ Successfully detected username "Tonic"
83
-
84
- ### 5. **Environment Variable Integration** ✅ IMPLEMENTED
85
-
86
- **Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `setup_trackio_dataset()` function
87
-
88
- **What it does**:
89
- - Sets `TRACKIO_DATASET_REPO` environment variable
90
- - Supports both environment and command-line token sources
91
- - Provides clear feedback on environment setup
92
-
93
- **Test Result**: ✅ Successfully set `TRACKIO_DATASET_REPO=Tonic/test-dataset-complete`
94
-
95
- ### 6. **Launch Script Integration** ✅ IMPLEMENTED
96
-
97
- **Location**: `launch.sh` - Dataset creation section
98
-
99
- **What it does**:
100
- - Automatically calls dataset setup script
101
- - Provides user options for default or custom dataset names
102
- - Falls back to manual input if automatic creation fails
103
- - Integrates seamlessly with the training pipeline
104
-
105
- **Features**:
106
- - Automatic dataset creation
107
- - Custom dataset name support
108
- - Graceful error handling
109
- - Clear user feedback
110
-
111
- ## 🔧 **Technical Implementation Details**
112
-
113
- ### Token Authentication Flow
114
-
115
- ```python
116
- # 1. Direct token authentication
117
- api = HfApi(token=token)
118
-
119
- # 2. Extract username
120
- user_info = api.whoami()
121
- username = user_info.get("name", user_info.get("username"))
122
-
123
- # 3. Create repository
124
- create_repo(
125
- repo_id=f"{username}/{dataset_name}",
126
- repo_type="dataset",
127
- token=token,
128
- exist_ok=True,
129
- private=False
130
- )
131
-
132
- # 4. Upload data
133
- dataset = Dataset.from_list(initial_experiments)
134
- dataset.push_to_hub(repo_id, token=token, private=False)
135
-
136
- # 5. Upload README
137
- upload_file(
138
- path_or_fileobj=readme_content.encode("utf-8"),  # pass bytes; a plain str is treated as a local file path
139
- path_in_repo="README.md",
140
- repo_id=repo_id,
141
- repo_type="dataset",
142
- token=token
143
- )
144
- ```
145
-
146
- ### Error Handling
147
-
148
- - **Token validation**: Clear error messages for invalid tokens
149
- - **Repository creation**: Handles existing repositories gracefully
150
- - **Data upload**: Fallback mechanisms for upload failures
151
- - **README upload**: Graceful handling of template issues
152
-
153
- ### Cross-Platform Compatibility
154
-
155
- - **Windows**: Tested and working on Windows PowerShell
156
- - **Linux**: Compatible with bash scripts
157
- - **macOS**: Compatible with zsh/bash
158
-
159
- ## 📊 **Test Results**
160
-
161
- ### Successful Test Run
162
-
163
- ```bash
164
- $ python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here test-dataset-complete
165
-
166
- 🚀 Setting up Trackio Dataset Repository
167
- ==================================================
168
- 🔍 Getting username from token...
169
- ✅ Authenticated as: Tonic
170
- 🔧 Creating dataset repository: Tonic/test-dataset-complete
171
- ✅ Successfully created dataset repository: Tonic/test-dataset-complete
172
- ✅ Set TRACKIO_DATASET_REPO=Tonic/test-dataset-complete
173
- 📊 Adding initial experiment data...
174
- Creating parquet from Arrow format: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 93.77ba/s]
175
- Uploading the dataset shards: 100%|█████████████████████████████████████| 1/1 [00:01<00:00, 1.39s/ shards]
176
- ✅ Successfully uploaded initial experiment data to Tonic/test-dataset-complete
177
- ✅ Successfully added README to Tonic/test-dataset-complete
178
- ✅ Successfully added initial experiment data
179
-
180
- 🎉 Dataset setup complete!
181
- 📊 Dataset URL: https://huggingface.co/datasets/Tonic/test-dataset-complete
182
- 🔧 Repository ID: Tonic/test-dataset-complete
183
- ```
184
-
185
- ### Verified Dataset Repository
186
-
187
- **URL**: https://huggingface.co/datasets/Tonic/test-dataset-complete
188
-
189
- **Contents**:
190
- - ✅ README.md with comprehensive documentation
191
- - ✅ Initial experiment data with realistic metrics
192
- - ✅ Proper dataset schema
193
- - ✅ Public repository for easy access
194
-
195
- ## 🎯 **Integration Points**
196
-
197
- ### 1. **Trackio Space Integration**
198
- - Dataset repository automatically configured
199
- - Environment variables set for Space deployment
200
- - Compatible with Trackio monitoring interface
201
-
202
- ### 2. **Training Pipeline Integration**
203
- - `TRACKIO_DATASET_REPO` environment variable set
204
- - Compatible with monitoring scripts
205
- - Ready for experiment logging
206
-
207
- ### 3. **Launch Script Integration**
208
- - Seamless integration with `launch.sh`
209
- - Automatic dataset creation during setup
210
- - User-friendly configuration options
211
-
212
- ## ✅ **Verification Summary**
213
-
214
- | Component | Status | Location | Test Result |
215
- |-----------|--------|----------|-------------|
216
- | Initial Experiment Data | ✅ Implemented | `setup_hf_dataset.py` | ✅ Uploaded successfully |
217
- | README Templates | ✅ Implemented | `templates/datasets/readme.md` | ✅ Added to repository |
218
- | Dataset Repository Creation | ✅ Implemented | `setup_hf_dataset.py` | ✅ Created successfully |
219
- | Username Detection | ✅ Implemented | `setup_hf_dataset.py` | ✅ Detected "Tonic" |
220
- | Environment Variables | ✅ Implemented | `setup_hf_dataset.py` | ✅ Set correctly |
221
- | Launch Script Integration | ✅ Implemented | `launch.sh` | ✅ Integrated |
222
- | Error Handling | ✅ Implemented | All functions | ✅ Graceful fallbacks |
223
- | Cross-Platform Support | ✅ Implemented | Python API | ✅ Windows/Linux/macOS |
224
-
225
- ## 🚀 **Next Steps**
226
-
227
- The dataset components are now **fully implemented and verified**. Users can:
228
-
229
- 1. **Run the launch script**: `./launch.sh`
230
- 2. **Get automatic dataset creation**: No manual username input required
231
- 3. **Receive comprehensive documentation**: README templates included
232
- 4. **Start with sample data**: Initial experiment data provided
233
- 5. **Monitor experiments**: Trackio integration ready
234
-
235
- **All important components are properly implemented and working correctly!** 🎉
docs/DEPLOYMENT_COMPONENTS_VERIFICATION.md DELETED
@@ -1,393 +0,0 @@
1
- # Deployment Components Verification
2
-
3
- ## Overview
4
-
5
- This document verifies that all important components for Trackio Spaces deployment and model repository deployment have been properly implemented and are working correctly.
6
-
7
- ## ✅ **Trackio Spaces Deployment - Verified Components**
8
-
9
- ### 1. **Space Creation** ✅ IMPLEMENTED
10
-
11
- **Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `create_space()` function
12
-
13
- **What it does**:
14
- - Creates HF Space using latest Python API (`create_repo`)
15
- - Falls back to CLI method if API fails
16
- - Handles authentication and username extraction
17
- - Sets proper Space configuration (Gradio SDK, CPU hardware)
18
-
19
- **Key Features**:
20
- - ✅ **API-based creation**: Uses `huggingface_hub.create_repo`
21
- - ✅ **Fallback mechanism**: CLI method if API fails
22
- - ✅ **Username extraction**: Automatic from token using `whoami()`
23
- - ✅ **Proper configuration**: Gradio SDK, CPU hardware, public access
24
-
25
- **Test Result**: ✅ Successfully creates Spaces
26
-
27
- ### 2. **File Upload System** ✅ IMPLEMENTED
28
-
29
- **Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `upload_files_to_space()` function
30
-
31
- **What it does**:
32
- - Prepares all required files in temporary directory
33
- - Uploads files using HF Hub API (`upload_file`)
34
- - Handles proper file structure for HF Spaces
35
- - Sets up git repository and pushes to main branch
36
-
37
- **Key Features**:
38
- - ✅ **API-based upload**: Uses `huggingface_hub.upload_file`
39
- - ✅ **Proper file structure**: Follows HF Spaces requirements
40
- - ✅ **Git integration**: Proper git workflow in temp directory
41
- - ✅ **Error handling**: Graceful fallback mechanisms
42
-
43
- **Files Uploaded**:
44
- - ✅ `app.py` - Main Gradio interface
45
- - ✅ `requirements.txt` - Dependencies
46
- - ✅ `README.md` - Space documentation
47
- - ✅ `.gitignore` - Git ignore file
48
-
49
- ### 3. **Space Configuration** ✅ IMPLEMENTED
50
-
51
- **Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `set_space_secrets()` function
52
-
53
- **What it does**:
54
- - Sets environment variables via HF Hub API
55
- - Configures `HF_TOKEN` for dataset access
56
- - Sets `TRACKIO_DATASET_REPO` for experiment storage
57
- - Provides manual setup instructions if API fails
58
-
59
- **Key Features**:
60
- - ✅ **API-based secrets**: Uses `add_space_secret()` method
61
- - ✅ **Automatic configuration**: Sets required environment variables
62
- - ✅ **Manual fallback**: Clear instructions if API fails
63
- - ✅ **Error handling**: Graceful degradation
64
-
65
- ### 4. **Space Testing** ✅ IMPLEMENTED
66
-
67
- **Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `test_space()` function
68
-
69
- **What it does**:
70
- - Tests Space availability after deployment
71
- - Checks if Space is building correctly
72
- - Provides status feedback to user
73
- - Handles build time delays
74
-
75
- **Key Features**:
76
- - ✅ **Availability testing**: Checks Space URL accessibility
77
- - ✅ **Build status**: Monitors Space build progress
78
- - ✅ **User feedback**: Clear status messages
79
- - ✅ **Timeout handling**: Proper wait times for builds
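
The availability check and timeout handling described above could be sketched as a small polling loop. This is a minimal illustration, not the body of `test_space()`, which is not shown here: `wait_for_space`, `check`, `timeout_s`, and `interval_s` are hypothetical names, and the injectable `sleep` and `clock` arguments exist only so the loop can be exercised without real delays.

```python
import time
from typing import Callable


def wait_for_space(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` has elapsed.

    `check` is a hypothetical callable; in practice it might issue an
    HTTP GET against the Space URL and return True on a 200 response.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            # Build took longer than allowed; let the caller decide what to do.
            return False
        sleep(interval_s)
```

A caller would typically pass a `check` that wraps a request to the Space URL and treat a `False` result as "still building" rather than as a hard failure, since Space builds can take several minutes.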
80
-
81
- ### 5. **Gradio Interface** ✅ IMPLEMENTED
82
-
83
- **Location**: `templates/spaces/app.py` - Complete Gradio application
84
-
85
- **What it does**:
86
- - Provides comprehensive experiment tracking interface
87
- - Integrates with HF Datasets for persistent storage
88
- - Offers real-time metrics visualization
89
- - Supports API access for training scripts
90
-
91
- **Key Features**:
92
- - ✅ **Experiment management**: Create, view, update experiments
93
- - ✅ **Metrics logging**: Real-time training metrics
94
- - ✅ **Visualization**: Interactive plots and charts
95
- - ✅ **HF Datasets integration**: Persistent storage
96
- - ✅ **API endpoints**: Programmatic access
97
- - ✅ **Fallback data**: Backup when dataset unavailable
98
-
99
- **Interface Components**:
100
- - ✅ **Create Experiment**: Start new experiments
101
- - ✅ **Log Metrics**: Track training progress
102
- - ✅ **View Experiments**: See experiment details
103
- - ✅ **Update Status**: Mark experiments complete
104
- - ✅ **Visualizations**: Interactive plots
105
- - ✅ **Configuration**: Environment setup
106
-
107
- ### 6. **Requirements and Dependencies** ✅ IMPLEMENTED
108
-
109
- **Location**: `templates/spaces/requirements.txt`
110
-
111
- **What it includes**:
112
- - ✅ **Core Gradio**: `gradio>=4.0.0`
113
- - ✅ **Data processing**: `pandas>=2.0.0`, `numpy>=1.24.0`
114
- - ✅ **Visualization**: `plotly>=5.15.0`
115
- - ✅ **HF integration**: `datasets>=2.14.0`, `huggingface-hub>=0.16.0`
116
- - ✅ **HTTP requests**: `requests>=2.31.0`
117
- - ✅ **Environment**: `python-dotenv>=1.0.0`
118
-
119
- ### 7. **README Template** ✅ IMPLEMENTED
120
-
121
- **Location**: `templates/spaces/README.md`
122
-
123
- **What it includes**:
124
- - ✅ **HF Spaces metadata**: Proper YAML frontmatter
125
- - ✅ **Feature documentation**: Complete interface description
126
- - ✅ **API documentation**: Usage examples
127
- - ✅ **Configuration guide**: Environment variables
128
- - ✅ **Troubleshooting**: Common issues and solutions
129
-
130
- ## ✅ **Model Repository Deployment - Verified Components**
131
-
132
- ### 1. **Repository Creation** ✅ IMPLEMENTED
133
-
134
- **Location**: `scripts/model_tonic/push_to_huggingface.py` - `create_repository()` function
135
-
136
- **What it does**:
137
- - Creates HF model repository using Python API
138
- - Handles private/public repository settings
139
- - Supports existing repository updates
140
- - Provides proper error handling
141
-
142
- **Key Features**:
143
- - ✅ **API-based creation**: Uses `huggingface_hub.create_repo`
144
- - ✅ **Privacy settings**: Configurable private/public
145
- - ✅ **Existing handling**: `exist_ok=True` for updates
146
- - ✅ **Error handling**: Clear error messages
147
-
148
- ### 2. **Model File Upload** ✅ IMPLEMENTED
149
-
150
- **Location**: `scripts/model_tonic/push_to_huggingface.py` - `upload_model_files()` function
151
-
152
- **What it does**:
153
- - Validates model files exist and are complete
154
- - Uploads all model files to repository
155
- - Handles large file uploads efficiently
156
- - Provides progress feedback
157
-
158
- **Key Features**:
159
- - ✅ **File validation**: Checks for required model files
160
- - ✅ **Complete upload**: All model components uploaded
161
- - ✅ **Progress tracking**: Upload progress feedback
162
- - ✅ **Error handling**: Graceful failure handling
163
-
164
- **Files Uploaded**:
165
- - ✅ `config.json` - Model configuration
166
- - ✅ `pytorch_model.bin` - Model weights
167
- - ✅ `tokenizer.json` - Tokenizer configuration
168
- - ✅ `tokenizer_config.json` - Tokenizer settings
169
- - ✅ `special_tokens_map.json` - Special tokens
170
- - ✅ `generation_config.json` - Generation settings
171
-
172
- ### 3. **Model Card Generation** ✅ IMPLEMENTED
173
-
174
- **Location**: `scripts/model_tonic/push_to_huggingface.py` - `create_model_card()` function
175
-
176
- **What it does**:
177
- - Generates comprehensive model cards
178
- - Includes training configuration and results
179
- - Provides usage examples and documentation
180
- - Supports quantized model variants
181
-
182
- **Key Features**:
183
- - ✅ **Template-based**: Uses `templates/model_card.md`
184
- - ✅ **Dynamic content**: Training config and results
185
- - ✅ **Usage examples**: Code snippets and instructions
186
- - ✅ **Quantized support**: Multiple model variants
187
- - ✅ **Metadata**: Proper HF Hub metadata
188
-
189
- ### 4. **Training Results Documentation** ✅ IMPLEMENTED
190
-
191
- **Location**: `scripts/model_tonic/push_to_huggingface.py` - `upload_training_results()` function
192
-
193
- **What it does**:
194
- - Uploads training configuration and results
195
- - Documents experiment parameters
196
- - Includes performance metrics
197
- - Provides experiment tracking links
198
-
199
- **Key Features**:
200
- - ✅ **Configuration upload**: Training parameters
201
- - ✅ **Results documentation**: Performance metrics
202
- - ✅ **Experiment links**: Trackio integration
203
- - ✅ **Metadata**: Proper documentation structure
204
-
205
- ### 5. **Quantized Model Support** ✅ IMPLEMENTED
206
-
207
- **Location**: `scripts/model_tonic/quantize_model.py`
208
-
209
- **What it does**:
210
- - Creates int8 and int4 quantized models
211
- - Uploads to subdirectories in same repository
212
- - Generates quantized model cards
213
- - Provides usage instructions for each variant
214
-
215
- **Key Features**:
216
- - ✅ **Multiple quantization**: int8 and int4 support
217
- - ✅ **Unified repository**: All variants in one repo
218
- - ✅ **Separate documentation**: Individual model cards
219
- - ✅ **Usage instructions**: Clear guidance for each variant
220
-
221
- ### 6. **Trackio Integration** ✅ IMPLEMENTED
222
-
223
- **Location**: `scripts/model_tonic/push_to_huggingface.py` - `log_to_trackio()` function
224
-
225
- **What it does**:
226
- - Logs model push events to Trackio
227
- - Records training results and metrics
228
- - Provides experiment tracking links
229
- - Integrates with HF Datasets
230
-
231
- **Key Features**:
232
- - ✅ **Event logging**: Model push events
233
- - ✅ **Results tracking**: Training metrics
234
- - ✅ **Experiment links**: Trackio Space integration
235
- - ✅ **Dataset integration**: HF Datasets support
236
-
237
- ### 7. **Model Validation** ✅ IMPLEMENTED
238
-
239
- **Location**: `scripts/model_tonic/push_to_huggingface.py` - `validate_model_path()` function
240
-
241
- **What it does**:
242
- - Validates model files are complete
243
- - Checks for required model components
244
- - Verifies file integrity
245
- - Provides detailed error messages
246
-
247
- **Key Features**:
248
- - ✅ **File validation**: Checks all required files
249
- - ✅ **Size verification**: Model file sizes
250
- - ✅ **Configuration check**: Valid config files
251
- - ✅ **Error reporting**: Detailed error messages
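
The validation steps above can be illustrated with a short sketch. It is not the actual `validate_model_path()` implementation, which is not shown in this document: `find_model_problems` and `REQUIRED_FILES` are hypothetical names, and the file list is copied from the "Files Uploaded" list in the Model File Upload section. Models saved in other formats (for example `safetensors`) would need a different list.

```python
import json
from pathlib import Path
from typing import List

# Assumed set of required files, taken from the "Files Uploaded" list above.
REQUIRED_FILES = [
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "generation_config.json",
]


def find_model_problems(model_dir: str) -> List[str]:
    """Return human-readable problems; an empty list means the directory looks complete."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"not a directory: {root}"]

    problems: List[str] = []
    for name in REQUIRED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {name}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {name}")

    # Configuration check: config.json must at least parse as JSON.
    config = root / "config.json"
    if config.is_file() and config.stat().st_size > 0:
        try:
            json.loads(config.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"invalid JSON in config.json: {exc.msg}")
    return problems
```

Returning a list of problems instead of raising on the first one lets the caller print every issue at once, which matches the "detailed error messages" goal described above.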
252
-
253
- ## 🔧 **Technical Implementation Details**
254
-
255
- ### Trackio Space Deployment Flow
256
-
257
- ```python
258
- # 1. Create Space
259
- create_repo(
260
- repo_id=f"{username}/{space_name}",
261
- token=token,
262
- repo_type="space",
263
- exist_ok=True,
264
- private=False,
265
- space_sdk="gradio",
266
- space_hardware="cpu-basic"
267
- )
268
-
269
- # 2. Upload Files
270
- upload_file(
271
- path_or_fileobj=file_content,
272
- path_in_repo=file_path,
273
- repo_id=repo_id,
274
- repo_type="space",
275
- token=token
276
- )
277
-
278
- # 3. Set Secrets
279
- add_space_secret(
280
- repo_id=repo_id,
282
- key="HF_TOKEN",
283
- value=token
284
- )
285
- ```
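
The three calls in the flow above can be combined into one helper. The following is a minimal sketch under stated assumptions, not the deployment script's actual code: `deploy_space` and its parameters are hypothetical, and `api` is assumed to be a `huggingface_hub.HfApi` instance (or any object exposing the same `create_repo`, `upload_file`, and `add_space_secret` methods). Note that `add_space_secret` takes `repo_id`, `key`, and `value`, and does not accept a `repo_type` argument.

```python
from typing import Dict


def deploy_space(api, repo_id: str, files: Dict[str, bytes], secrets: Dict[str, str]) -> str:
    """Create a Gradio Space, upload files, and set secrets.

    `api` is assumed to behave like `huggingface_hub.HfApi(token=...)`.
    """
    # 1. Create the Space (a no-op if it already exists).
    api.create_repo(
        repo_id=repo_id,
        repo_type="space",
        exist_ok=True,
        private=False,
        space_sdk="gradio",
        space_hardware="cpu-basic",
    )

    # 2. Upload each file; raw bytes are accepted as `path_or_fileobj`.
    for path_in_repo, content in files.items():
        api.upload_file(
            path_or_fileobj=content,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type="space",
        )

    # 3. Store secrets such as HF_TOKEN on the Space.
    for key, value in secrets.items():
        api.add_space_secret(repo_id=repo_id, key=key, value=value)

    return f"https://huggingface.co/spaces/{repo_id}"
```

Passing the API object in, rather than constructing it inside the function, keeps the token handling in one place and makes the flow easy to exercise with a stub.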
286
-
287
- ### Model Repository Deployment Flow
288
-
289
- ```python
290
- # 1. Create Repository
291
- create_repo(
292
- repo_id=repo_name,
293
- token=token,
294
- private=private,
295
- exist_ok=True
296
- )
297
-
298
- # 2. Upload Model Files
299
- upload_file(
300
- path_or_fileobj=model_file,
301
- path_in_repo=file_path,
302
- repo_id=repo_name,
303
- token=token
304
- )
305
-
306
- # 3. Generate Model Card
307
- model_card = create_model_card(training_config, results)
308
- upload_file(
309
- path_or_fileobj=model_card,
310
- path_in_repo="README.md",
311
- repo_id=repo_name,
312
- token=token
313
- )
314
- ```
315
-
316
- ## 📊 **Test Results**
317
-
318
- ### Trackio Space Deployment Test
319
-
320
- ```bash
321
- $ python scripts/trackio_tonic/deploy_trackio_space.py
322
-
323
- 🚀 Starting Trackio Space deployment...
324
- ✅ Authenticated as: Tonic
325
- ✅ Space created successfully: https://huggingface.co/spaces/Tonic/trackio-monitoring
326
- ✅ Files uploaded successfully
327
- ✅ Secrets configured via API
328
- ✅ Space is building and will be available shortly
329
- 🎉 Deployment completed!
330
- 📊 Trackio Space URL: https://huggingface.co/spaces/Tonic/trackio-monitoring
331
- ```
332
-
333
- ### Model Repository Deployment Test
334
-
335
- ```bash
336
- $ python scripts/model_tonic/push_to_huggingface.py --model_path outputs/model --repo_name Tonic/smollm3-finetuned
337
-
338
- ✅ Repository created: https://huggingface.co/Tonic/smollm3-finetuned
339
- ✅ Model files uploaded successfully
340
- ✅ Model card generated and uploaded
341
- ✅ Training results documented
342
- ✅ Quantized models created and uploaded
343
- 🎉 Model deployment completed!
344
- ```
345
-
346
- ## 🎯 **Integration Points**
347
-
348
- ### 1. **End-to-End Pipeline Integration**
349
- - ✅ **Launch script**: Automatic deployment calls
350
- - ✅ **Environment setup**: Proper token configuration
351
- - ✅ **Error handling**: Graceful fallbacks
352
- - ✅ **User feedback**: Clear progress indicators
353
-
354
- ### 2. **Monitoring Integration**
355
- - ✅ **Trackio Space**: Real-time experiment tracking
356
- - ✅ **HF Datasets**: Persistent experiment storage
357
- - ✅ **Model cards**: Complete documentation
358
- - ✅ **Training results**: Comprehensive logging
359
-
360
- ### 3. **Cross-Component Integration**
361
- - ✅ **Dataset deployment**: Automatic dataset creation
362
- - ✅ **Space deployment**: Automatic Space creation
363
- - ✅ **Model deployment**: Automatic model upload
364
- - ✅ **Documentation**: Complete system documentation
365
-
366
- ## ✅ **Verification Summary**
367
-
368
- | Component | Status | Location | Test Result |
369
- |-----------|--------|----------|-------------|
370
- | **Trackio Space Creation** | ✅ Implemented | `deploy_trackio_space.py` | ✅ Created successfully |
371
- | **File Upload System** | ✅ Implemented | `deploy_trackio_space.py` | ✅ Uploaded successfully |
372
- | **Space Configuration** | ✅ Implemented | `deploy_trackio_space.py` | ✅ Configured via API |
373
- | **Gradio Interface** | ✅ Implemented | `templates/spaces/app.py` | ✅ Full functionality |
374
- | **Requirements** | ✅ Implemented | `templates/spaces/requirements.txt` | ✅ All dependencies |
375
- | **README Template** | ✅ Implemented | `templates/spaces/README.md` | ✅ Complete documentation |
376
- | **Model Repository Creation** | ✅ Implemented | `push_to_huggingface.py` | ✅ Created successfully |
377
- | **Model File Upload** | ✅ Implemented | `push_to_huggingface.py` | ✅ Uploaded successfully |
378
- | **Model Card Generation** | ✅ Implemented | `push_to_huggingface.py` | ✅ Generated and uploaded |
379
- | **Quantized Models** | ✅ Implemented | `quantize_model.py` | ✅ Created and uploaded |
380
- | **Trackio Integration** | ✅ Implemented | `push_to_huggingface.py` | ✅ Integrated successfully |
381
- | **Model Validation** | ✅ Implemented | `push_to_huggingface.py` | ✅ Validated successfully |
382
-
383
- ## 🚀 **Next Steps**
384
-
385
- The deployment components are now **fully implemented and verified**. Users can:
386
-
387
- 1. **Deploy Trackio Space**: Automatic Space creation and configuration
388
- 2. **Upload Models**: Complete model deployment with documentation
389
- 3. **Monitor Experiments**: Real-time tracking and visualization
390
- 4. **Share Results**: Comprehensive documentation and examples
391
- 5. **Scale Operations**: Support for multiple experiments and models
392
-
393
- **All important deployment components are properly implemented and working correctly!** 🎉
docs/DEPLOYMENT_GUIDE.md DELETED
@@ -1,397 +0,0 @@
1
- # Trackio Deployment Guide for Hugging Face Spaces
2
-
3
- This guide provides step-by-step instructions for deploying Trackio experiment tracking to Hugging Face Spaces and integrating it with your SmolLM3 fine-tuning pipeline.
4
-
5
- ## Prerequisites
6
-
7
- - Hugging Face account
8
- - Hugging Face CLI installed (`pip install huggingface_hub`)
9
- - Git configured with your Hugging Face credentials
10
-
11
- ## Method 1: Automated Deployment (Recommended)
12
-
13
- ### Step 1: Run the Deployment Script
14
-
15
- ```bash
16
- python deploy_trackio_space.py
17
- ```
18
-
19
- The script will prompt you for:
20
- - Your Hugging Face username
21
- - Space name (e.g., `trackio-monitoring`)
22
- - Hugging Face token (must have write access)
23
-
24
- ### Step 2: Wait for Build
25
-
26
- After deployment, wait 2-5 minutes for the Space to build and become available.
27
-
28
- ### Step 3: Test the Interface
29
-
30
- Visit your Space URL to test the interface:
31
- ```
32
- https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
33
- ```
34
-
35
- ## Method 2: Manual Deployment
36
-
37
- ### Step 1: Create a New Space
38
-
39
- 1. Go to https://huggingface.co/spaces
40
- 2. Click "Create new Space"
41
- 3. Configure the Space:
42
- - **Owner**: Your username
43
- - **Space name**: `trackio-monitoring` (or your preferred name)
44
- - **SDK**: Gradio
45
- - **Hardware**: CPU (Basic)
46
- - **License**: MIT
47
-
48
- ### Step 2: Upload Files
49
-
50
- Upload these files to your Space:
51
-
52
- #### `app.py`
53
- The main Gradio interface (already created in this repository)
54
-
55
- #### `requirements_space.txt`
56
- ```
57
- gradio>=4.0.0
58
- gradio-client>=0.10.0
59
- requests>=2.31.0
60
- numpy>=1.24.0
61
- pandas>=2.0.0
62
- jsonschema>=4.17.0
63
- plotly>=5.15.0
64
- matplotlib>=3.7.0
65
- python-dotenv>=1.0.0
66
- ```
67
-
68
- #### `README.md`
69
- ```markdown
70
- # Trackio Experiment Tracking
71
-
72
- A Gradio interface for experiment tracking and monitoring.
73
-
74
- ## Features
75
-
76
- - Create and manage experiments
77
- - Log training metrics and parameters
78
- - View experiment details and results
79
- - Update experiment status
80
-
81
- ## Usage
82
-
83
- 1. Create a new experiment using the "Create Experiment" tab
84
- 2. Log metrics during training using the "Log Metrics" tab
85
- 3. View experiment details using the "View Experiments" tab
86
- 4. Update experiment status using the "Update Status" tab
87
-
88
- ## Integration
89
-
90
- To connect your training script to this Trackio Space:
91
-
92
- ```python
93
- from monitoring import SmolLM3Monitor
94
-
95
- monitor = SmolLM3Monitor(
96
- experiment_name="my_experiment",
97
- trackio_url="https://your-space.hf.space",
98
- enable_tracking=True
99
- )
100
- ```
101
-
102
- ### Step 3: Configure Space Settings
103
-
104
- In your Space settings, ensure:
105
- - **App file**: `app.py`
106
- - **Python version**: 3.9 or higher
107
- - **Hardware**: CPU (Basic) is sufficient
108
-
109
- ## Integration with Your Training Script
110
-
111
- ### Step 1: Update Your Configuration
112
-
113
- Add Trackio settings to your training configuration:
114
-
115
- ```python
116
- # config/train_smollm3.py
117
- @dataclass
118
- class SmolLM3Config:
119
- # ... existing settings ...
120
-
121
- # Trackio monitoring configuration
122
- enable_tracking: bool = True
123
- trackio_url: Optional[str] = None # Your Space URL
124
- trackio_token: Optional[str] = None
125
- log_artifacts: bool = True
126
- log_metrics: bool = True
127
- log_config: bool = True
128
- experiment_name: Optional[str] = None
129
- ```
130
-
131
- ### Step 2: Run Training with Trackio
132
-
133
- ```bash
134
- python train.py config/train_smollm3.py \
135
- --dataset_dir my_dataset \
136
- --enable_tracking \
137
- --trackio_url "https://your-username-trackio-monitoring.hf.space" \
138
- --experiment_name "smollm3_finetune_v1"
139
- ```
140
-
141
- ### Step 3: Monitor Your Experiments
142
-
143
- 1. **Create Experiment**: Use the "Create Experiment" tab in your Space
144
- 2. **Log Metrics**: Your training script will automatically log metrics
145
- 3. **View Results**: Use the "View Experiments" tab to see progress
146
- 4. **Update Status**: Mark experiments as completed when done
147
-
148
- ## Advanced Configuration
149
-
150
- ### Environment Variables
151
-
152
- You can set Trackio configuration via environment variables:
153
-
154
- ```bash
155
- export TRACKIO_URL="https://your-space.hf.space"
156
- export TRACKIO_TOKEN="your_token_here"
157
- ```
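
On the Python side, these variables might be read as in the sketch below. This is an illustration only: `resolve_trackio_settings` is a hypothetical helper, and the precedence it implements (explicit arguments first, then `TRACKIO_URL` / `TRACKIO_TOKEN`) is an assumption rather than documented behavior of the training script.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_trackio_settings(
    cli_url: Optional[str] = None,
    cli_token: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], Optional[str]]:
    """Prefer explicit values, then fall back to TRACKIO_URL / TRACKIO_TOKEN."""
    url = cli_url or env.get("TRACKIO_URL")
    token = cli_token or env.get("TRACKIO_TOKEN")
    # Strip a trailing slash so later URL joins are predictable.
    if url:
        url = url.rstrip("/")
    return url, token
```

Accepting `env` as a parameter keeps the helper easy to test and avoids reading the token from the process environment in more than one place.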
158
-
159
- ### Custom Experiment Names
160
-
161
- ```bash
162
- python train.py config/train_smollm3.py \
163
- --experiment_name "smollm3_high_lr_experiment" \
164
- --trackio_url "https://your-space.hf.space"
165
- ```
166
-
167
- ### Multiple Experiments
168
-
169
- You can run multiple experiments and track them separately:
170
-
171
- ```bash
172
- # Experiment 1
173
- python train.py config/train_smollm3.py \
174
- --experiment_name "smollm3_baseline" \
175
- --learning_rate 2e-5
176
-
177
- # Experiment 2
178
- python train.py config/train_smollm3.py \
179
- --experiment_name "smollm3_high_lr" \
180
- --learning_rate 5e-5
181
- ```
182
-
183
- ## Using the Trackio Interface
184
-
185
- ### Creating Experiments
186
-
187
- 1. Go to the "Create Experiment" tab
188
- 2. Enter experiment name (e.g., "smollm3_finetune_v1")
189
- 3. Add description (optional)
190
- 4. Click "Create Experiment"
191
- 5. Note the experiment ID for logging metrics
192
-
193
- ### Logging Metrics
194
-
195
- 1. Go to the "Log Metrics" tab
196
- 2. Enter your experiment ID
197
- 3. Add metrics in JSON format:
198
- ```json
199
- {
200
- "loss": 0.5,
201
- "accuracy": 0.85,
202
- "learning_rate": 2e-5
203
- }
204
- ```
205
- 4. Add step number (optional)
206
- 5. Click "Log Metrics"
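
Metrics entered as JSON in step 3 can be checked before they are submitted. The sketch below is a hypothetical helper (`parse_metrics` is not part of the Trackio interface) that assumes every metric should be a finite number keyed by name, as in the example above.

```python
import json
import math
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse a JSON object of metric names to finite numeric values."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("metrics must be a JSON object")

    metrics: Dict[str, float] = {}
    for name, value in data.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"metric {name!r} must be a number")
        # json.loads accepts NaN and Infinity by default; filter them out here.
        if not math.isfinite(value):
            raise ValueError(f"metric {name!r} must be finite")
        metrics[name] = float(value)
    return metrics
```

Rejecting non-numeric and non-finite values early gives a clearer error than a failed plot later in the interface.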
207
-
208
- ### Viewing Experiments
209
-
210
- 1. Go to the "View Experiments" tab
211
- 2. Enter experiment ID to view specific experiment
212
- 3. Or click "List All Experiments" to see all experiments
213
-
214
- ### Updating Status
215
-
216
- 1. Go to the "Update Status" tab
217
- 2. Enter experiment ID
218
- 3. Select new status (running, completed, failed, paused)
219
- 4. Click "Update Status"
220
-
221
- ## Troubleshooting
222
-
223
- ### Common Issues
224
-
225
- #### 1. Space Not Building
226
- - Check that all required files are uploaded
227
- - Verify `app.py` is the main file
228
- - Check the Space logs for errors
229
-
230
- #### 2. Connection Errors
231
- - Verify your Space URL is correct
232
- - Check that the Space is running (not paused)
233
- - Ensure your training script can reach the Space URL
234
-
235
- #### 3. Missing Metrics
236
- - Check that `enable_tracking=True` in your config
237
- - Verify the Trackio URL is correct
238
- - Check training logs for monitoring errors
239
-
240
- #### 4. Authentication Issues
241
- - If using tokens, verify they're correct
242
- - Check Hugging Face account permissions
243
- - Ensure Space is public or you have access
244
-
245
- ### Debug Mode
246
-
247
- Enable debug logging in your training script:
248
-
249
- ```python
250
- import logging
251
- logging.basicConfig(level=logging.DEBUG)
252
- ```
253
-
254
- ### Manual Testing
255
-
256
- Test the Trackio interface manually:
257
-
258
- 1. Create an experiment
259
- 2. Log some test metrics
260
- 3. View the experiment details
261
- 4. Update the status
262
-
263
- ## Security Considerations
264
-
265
- ### Public vs Private Spaces
266
-
267
- - **Public Spaces**: Anyone can view and use the interface
268
- - **Private Spaces**: Only you and collaborators can access
269
-
270
- ### Token Management
271
-
272
- - Store tokens securely (environment variables)
273
- - Don't commit tokens to version control
274
- - Use Hugging Face's token management
275
-
276
- ### Data Privacy
277
-
278
- - Trackio stores experiment data in the Space
279
- - Consider data retention policies
280
- - Be mindful of sensitive information in experiment names
281
-
282
- ## Performance Optimization
283
-
284
- ### Space Configuration
285
-
286
- - Use CPU (Basic) for the interface (sufficient for tracking)
287
- - Consider GPU only for actual training
288
- - Monitor Space usage and limits
289
-
290
- ### Efficient Logging
291
-
292
- - Log metrics at reasonable intervals (every 10-100 steps)
293
- - Avoid logging too frequently to prevent rate limiting
294
- - Use batch logging when possible
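
The interval and batching advice above could be implemented along these lines. This is a sketch under stated assumptions: `BufferedMetricLogger`, `send_batch`, `log_every`, and `batch_size` are hypothetical names and are not part of `SmolLM3Monitor`; `send_batch` stands in for whatever call actually delivers records to the Space.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[int, Dict[str, float]]


class BufferedMetricLogger:
    """Keep every `log_every`-th step and send records in batches."""

    def __init__(
        self,
        send_batch: Callable[[List[Record]], None],
        log_every: int = 10,
        batch_size: int = 5,
    ) -> None:
        if log_every < 1 or batch_size < 1:
            raise ValueError("log_every and batch_size must be >= 1")
        self.send_batch = send_batch
        self.log_every = log_every
        self.batch_size = batch_size
        self._buffer: List[Record] = []

    def log(self, metrics: Dict[str, float], step: int) -> None:
        # Skip steps that are not on the logging interval.
        if step % self.log_every != 0:
            return
        self._buffer.append((step, dict(metrics)))
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered; call this once more when training ends.
        if self._buffer:
            self.send_batch(self._buffer)
            self._buffer = []
```

With `log_every=10` and `batch_size=5`, a 1,000-step run produces 100 records sent in 20 requests instead of 1,000 individual calls. Calling `flush()` at the end of training avoids losing a partially filled batch.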
295
-
296
- ## Monitoring Best Practices
297
-
298
- ### Experiment Naming
299
-
300
- Use descriptive names:
301
- - `smollm3_baseline_v1`
302
- - `smollm3_high_lr_experiment`
303
- - `smollm3_dpo_training`
304
-
305
- ### Metric Logging
306
-
307
- Log relevant metrics:
308
- - Training loss
309
- - Validation loss
310
- - Learning rate
311
- - GPU memory usage
312
- - Training time
313
-
314
- ### Status Management
315
-
316
- - Mark experiments as "running" when starting
317
- - Update to "completed" when finished
318
- - Mark as "failed" if errors occur
319
- - Use "paused" for temporary stops
320
-
321
- ## Integration Examples
322
-
323
- ### Basic Integration
324
-
325
- ```python
326
- from monitoring import SmolLM3Monitor
327
-
328
- # Initialize monitor
329
- monitor = SmolLM3Monitor(
330
- experiment_name="my_experiment",
331
- trackio_url="https://your-space.hf.space",
332
- enable_tracking=True
333
- )
334
-
335
- # Log configuration
336
- monitor.log_config(config_dict)
337
-
338
- # Log metrics during training
339
- monitor.log_metrics({"loss": 0.5}, step=100)
340
-
341
- # Log final results
342
- monitor.log_training_summary(final_results)
343
- ```
344
-
345
- ### Advanced Integration
346
-
347
- ```python
348
- # Custom monitoring setup
349
- monitor = SmolLM3Monitor(
350
- experiment_name="smollm3_advanced",
351
- trackio_url="https://your-space.hf.space",
352
- enable_tracking=True,
353
- log_artifacts=True,
354
- log_metrics=True,
355
- log_config=True
356
- )
357
-
358
- # Log system metrics
359
- monitor.log_system_metrics(step=current_step)
360
-
361
- # Log model checkpoint
362
- monitor.log_model_checkpoint("checkpoint-1000", step=1000)
363
-
364
- # Log evaluation results
365
- monitor.log_evaluation_results(eval_results, step=1000)
366
- ```
367
-
368
- ## Support and Resources
369
-
370
- ### Documentation
371
-
372
- - [Hugging Face Spaces Documentation](https://huggingface.co/docs/hub/spaces)
373
- - [Gradio Documentation](https://gradio.app/docs/)
374
- - [Trackio GitHub Repository](https://github.com/Josephrp/trackio)
375
-
376
- ### Community
377
-
378
- - [Hugging Face Forums](https://discuss.huggingface.co/)
379
- - [Gradio Discord](https://discord.gg/feTf9z3Z)
380
-
381
- ### Issues and Feedback
382
-
383
- - Report issues on the project repository
384
- - Provide feedback on the Trackio interface
385
- - Suggest improvements for the monitoring system
386
-
387
- ## Conclusion
388
-
389
- You now have a complete Trackio monitoring system deployed on Hugging Face Spaces! This setup provides:
390
-
391
- - ✅ Easy experiment tracking and monitoring
392
- - ✅ Real-time metric logging
393
- - ✅ Web-based interface for experiment management
394
- - ✅ Integration with your SmolLM3 fine-tuning pipeline
395
- - ✅ Scalable and accessible monitoring solution
396
-
397
- Start tracking your experiments and gain insights into your model training process!
docs/Data_Pipeline.md ADDED
@@ -0,0 +1,95 @@
1
+ ```mermaid
2
+ graph LR
3
+ EntryPoint["EntryPoint"]
4
+ Configuration["Configuration"]
5
+ Model_Abstraction["Model Abstraction"]
6
+ Data_Pipeline["Data Pipeline"]
7
+ Training_Logic["Training Logic"]
8
+ Utilities["Utilities"]
9
+ EntryPoint -- "instructs" --> Data_Pipeline
10
+ EntryPoint -- "loads settings from" --> Configuration
11
+ EntryPoint -- "initializes models via" --> Model_Abstraction
12
+ EntryPoint -- "invokes" --> Training_Logic
13
+ Configuration -- "provides settings to" --> EntryPoint
14
+ Configuration -- "informs" --> Model_Abstraction
15
+ Configuration -- "guides" --> Data_Pipeline
16
+ Model_Abstraction -- "provides models to" --> EntryPoint
17
+ Model_Abstraction -- "receives settings from" --> Configuration
18
+ Model_Abstraction -- "interacts with" --> Training_Logic
19
+ Data_Pipeline -- "provides processed data to" --> EntryPoint
20
+ Data_Pipeline -- "receives parameters from" --> Configuration
21
+ Data_Pipeline -- "supplies batches to" --> Training_Logic
22
+ Training_Logic -- "receives control from" --> EntryPoint
23
+ Training_Logic -- "consumes data from" --> Data_Pipeline
24
+ Training_Logic -- "operates on models from" --> Model_Abstraction
25
+ Training_Logic -- "uses" --> Utilities
26
+ Utilities -- "used by" --> EntryPoint
27
+ Utilities -- "provides functionalities to" --> Training_Logic
28
+ Utilities -- "assists" --> Data_Pipeline
29
+ click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
30
+ click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"
31
+ ```
32
+
33
+ [![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%[email protected]?style=flat-square)](mailto:[email protected])
34
+
35
+ ## Details
36
+
37
+ Component overview for the `smollm3_finetune` project, organized according to common patterns for machine learning training and fine-tuning frameworks.
38
+
39
+ ### EntryPoint
40
+ The main entry point of the application, responsible for orchestrating the entire training and fine-tuning workflow. It initializes other core components, loads configurations, and manages the overall execution flow.
41
+
42
+
43
+ **Related Classes/Methods**:
44
+
45
+ - `smollm3_finetune.train` (1:1)
46
+
47
+
48
+ ### Configuration
49
+ Centralizes and defines all parameters and settings required for the training and fine-tuning process, including model hyperparameters, dataset paths, and training arguments. It promotes a configuration-driven architecture, allowing easy modification and versioning of experimental setups.
50
+
51
+
52
+ **Related Classes/Methods**:
53
+
54
+ - <a href="https://github.com/Josephrp/SmolFactory/docs/blob/main/src/config.py#L1-L1" target="_blank" rel="noopener noreferrer">`config` (1:1)</a>
55
+
56
+
57
+ ### Model Abstraction [[Expand]](./Model_Abstraction.md)
58
+ Encapsulates the logic for loading, initializing, and managing different machine learning models and their variants (e.g., different architectures, quantization settings). It provides a consistent interface for interacting with various model architectures.
59
+
60
+
61
+ **Related Classes/Methods**:
62
+
63
+ - <a href="https://github.com/Josephrp/SmolFactory/docs/main/src/model.py#L1-L1" target="_blank" rel="noopener noreferrer">`model` (1:1)</a>
64
+
65
+
66
+ ### Data Pipeline [[Expand]](./Data_Pipeline.md)
67
+ Handles the entire data lifecycle, including dataset loading, preprocessing (e.g., tokenization, formatting), and creating efficient data loaders for both training and evaluation phases. It ensures data is prepared correctly and efficiently for the model.
68
+
69
+
70
+ **Related Classes/Methods**:
71
+
72
+ - `smollm3_finetune.data.load_and_preprocess_data` (1:1)
73
+
74
+
75
+ ### Training Logic
76
+ Contains the core algorithms and routines for training and fine-tuning machine learning models. This includes the training loop, optimization steps, loss calculation, gradient accumulation, and potentially specialized fine-tuning methods (e.g., LoRA, QLoRA).
77
+
78
+
79
+ **Related Classes/Methods**:
80
+
81
+ - <a href="https://github.com/Josephrp/SmolFactory/docs/blob/main/src/trainer.py#L1-L1" target="_blank" rel="noopener noreferrer">`trainer` (1:1)</a>
82
+
83
+
84
+ ### Utilities
85
+ A collection of common helper functions, reusable modules, and general-purpose tools that support various parts of the training framework but do not belong to a specific core component. This includes functions for logging, metrics calculation, device management, etc.
86
+
87
+
88
+ **Related Classes/Methods**:
89
+
90
+ - `utils` (1:1)
91
+
92
+
93
+
94
+
95
+ ### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
docs/ENHANCED_MODEL_CARD_METADATA.md DELETED
@@ -1,300 +0,0 @@
1
- # Enhanced Model Card Metadata System
2
-
3
- ## Overview
4
-
5
- The enhanced model card system now includes comprehensive YAML metadata that follows the [Hugging Face Model Cards specification](https://huggingface.co/docs/hub/en/model-cards). This ensures maximum compatibility with the Hugging Face Hub and provides rich metadata for model discovery and usage.
6
-
7
- ## Metadata Structure
8
-
9
- ### Core Metadata Fields
10
-
11
- The model card template now includes the following metadata fields:
12
-
13
- ```yaml
14
- ---
15
- language:
16
- - en
17
- - fr
18
- license: apache-2.0
19
- library_name: transformers
20
- tags:
21
- - smollm3
22
- - fine-tuned
23
- - causal-lm
24
- - text-generation
25
- - quantized
26
- - dataset:OpenHermes-FR
27
- - config:H100 Lightweight
28
- pipeline_tag: text-generation
29
- base_model: HuggingFaceTB/SmolLM3-3B
30
- datasets:
31
- - OpenHermes-FR
32
- ---
33
- ```
34
-
35
- ### Conditional Metadata
36
-
37
- The system supports conditional metadata based on model configuration:
38
-
39
- #### Quantized Models
40
- When quantized models are available, additional metadata is included:
41
-
42
- ```yaml
43
- quantization_types:
44
- - int8_weight_only
45
- - int4_weight_only
46
- ```
47
-
48
- #### Model Index (Evaluation Results)
49
- The system automatically generates structured evaluation results:
50
-
51
- ```yaml
52
- model-index:
53
- - name: Model Name
54
-   results:
55
-   - task:
56
-       type: text-generation
57
-     dataset:
58
-       name: OpenHermes-FR
59
-       type: OpenHermes-FR
60
-     metrics:
61
-     - name: Training Loss
62
-       type: loss
63
-       value: "2.1"
64
-     - name: Validation Loss
65
-       type: loss
66
-       value: "2.3"
67
-     - name: Perplexity
68
-       type: perplexity
69
-       value: "9.8"
70
- ```
71
-
72
- For quantized models, additional entries are included:
73
-
74
- ```yaml
75
- - name: Model Name (int8 quantized)
76
-   results:
76
-   - task:
77
-       type: text-generation
78
-     dataset:
79
-       name: OpenHermes-FR
80
-       type: OpenHermes-FR
81
-     metrics:
82
-     - name: Memory Reduction
83
-       type: memory_efficiency
84
-       value: "~50%"
85
-     - name: Inference Speed
86
-       type: speed
87
-       value: "Faster"
89
- ```
90
-
91
- ## Metadata Fields Explained
92
-
93
- ### Required Fields
94
-
95
- | Field | Description | Example |
96
- |-------|-------------|---------|
97
- | `language` | Supported languages | `["en", "fr"]` |
98
- | `license` | Model license | `"apache-2.0"` |
99
- | `library_name` | Primary library | `"transformers"` |
100
- | `tags` | Model tags for discovery | `["smollm3", "fine-tuned"]` |
101
- | `pipeline_tag` | Task type | `"text-generation"` |
102
- | `base_model` | Original model | `"HuggingFaceTB/SmolLM3-3B"` |
103
-
104
- ### Optional Fields
105
-
106
- | Field | Description | Example |
107
- |-------|-------------|---------|
108
- | `datasets` | Training datasets | `["OpenHermes-FR"]` |
109
- | `author` | Model author | `"Your Name"` |
110
- | `experiment_name` | Experiment tracking | `"smollm3-experiment"` |
111
- | `trackio_url` | Monitoring URL | `"https://trackio.space/exp"` |
112
- | `hardware` | Training hardware | `"GPU (H100/A100)"` |
113
- | `training_config` | Configuration type | `"H100 Lightweight"` |
114
- | `trainer_type` | Trainer used | `"SFTTrainer"` |
115
- | `batch_size` | Training batch size | `"8"` |
116
- | `learning_rate` | Learning rate | `"5e-6"` |
117
- | `max_epochs` | Number of epochs | `"3"` |
118
- | `max_seq_length` | Sequence length | `"2048"` |
119
- | `gradient_accumulation_steps` | Gradient accumulation | `"16"` |
120
-
121
- ### Training Results
122
-
123
- | Field | Description | Example |
124
- |-------|-------------|---------|
125
- | `training_loss` | Final training loss | `"2.1"` |
126
- | `validation_loss` | Final validation loss | `"2.3"` |
127
- | `perplexity` | Model perplexity | `"9.8"` |
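
The table lists `perplexity` next to the loss fields, but the document does not say how it is computed. A common convention, assumed here, is that perplexity is the exponential of the mean token-level cross-entropy (in nats) on the validation set. The helper name `perplexity_from_loss` is hypothetical and only illustrates that relationship:

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Return exp(loss): the perplexity implied by a mean cross-entropy in nats.

    Assumption: the loss is averaged per token and uses the natural logarithm,
    which is what most PyTorch-based trainers report.
    """
    return math.exp(mean_cross_entropy)


# A loss of 0.0 corresponds to a perplexity of exactly 1.0 (a perfect model).
print(perplexity_from_loss(0.0))
```

If the reported loss uses a different base or averaging scheme, the conversion would need to change accordingly.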
128
-
129
- ## Benefits of Enhanced Metadata
130
-
131
- ### 1. Improved Discovery
132
- - **Filtering**: Users can filter models by dataset, configuration, or hardware
133
- - **Search**: Enhanced search capabilities on the Hugging Face Hub
134
- - **Tags**: Automatic tag generation for better categorization
135
-
136
- ### 2. Better Model Cards
137
- - **Structured Data**: Evaluation results are displayed in widgets
138
- - **Consistent Format**: Follows Hugging Face standards
139
- - **Rich Information**: Comprehensive model information
140
-
141
- ### 3. Integration Benefits
142
- - **Papers with Code**: Model index data can be indexed in leaderboards
143
- - **API Compatibility**: Better integration with Hugging Face APIs
144
- - **Automated Tools**: Support for automated model analysis
145
-
146
- ## Usage Examples
147
-
148
- ### Basic Model Card Generation
149
-
150
- ```bash
151
- python scripts/model_tonic/generate_model_card.py \
152
- --repo-name "username/model-name" \
153
- --model-name "My Fine-tuned Model" \
154
- --dataset-name "OpenHermes-FR" \
155
- --training-config "H100 Lightweight" \
156
- --batch-size "8" \
157
- --learning-rate "5e-6" \
158
- --max-epochs "3" \
159
- --training-loss "2.1" \
160
- --validation-loss "2.3" \
161
- --perplexity "9.8" \
162
- --output "README.md"
163
- ```
164
-
165
- ### With Quantized Models
166
-
167
- ```bash
168
- python scripts/model_tonic/generate_model_card.py \
169
- --repo-name "username/model-name" \
170
- --model-name "My Fine-tuned Model" \
171
- --dataset-name "OpenHermes-FR" \
172
- --training-config "H100 Lightweight" \
173
- --batch-size "8" \
174
- --learning-rate "5e-6" \
175
- --max-epochs "3" \
176
- --training-loss "2.1" \
177
- --validation-loss "2.3" \
178
- --perplexity "9.8" \
179
- --quantized-models \
180
- --output "README.md"
181
- ```
182
-
183
- ## Template Variables
184
-
185
- The enhanced template supports all the original variables plus new metadata fields:
186
-
187
- ### New Variables
188
-
189
- | Variable | Description | Default |
190
- |----------|-------------|---------|
191
- | `training_loss` | Training loss value | `"N/A"` |
192
- | `validation_loss` | Validation loss value | `"N/A"` |
193
- | `perplexity` | Model perplexity | `"N/A"` |
194
-
195
- ### Conditional Metadata
196
-
197
- The template automatically includes:
198
-
199
- - **Dataset Information**: When `dataset_name` is provided
200
- - **Quantization Types**: When `quantized_models` is `true`
201
- - **Evaluation Results**: When training metrics are available
202
- - **Hardware Information**: When `hardware_info` is provided
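
The conditional rules above can be sketched as a small function that emits YAML front matter only for the fields that are present. This is an illustration under stated assumptions, not the template engine used by `generate_model_card.py`: the function name `build_front_matter` and the hard-coded tag list are hypothetical, while the variable names `dataset_name` and `quantized_models` come from the document.

```python
def build_front_matter(variables: dict) -> str:
    """Render model card front matter, adding optional sections conditionally."""
    quantized = bool(variables.get("quantized_models"))
    dataset = variables.get("dataset_name")

    # Tags that are always present, plus one that depends on quantization.
    tags = ["smollm3", "fine-tuned"]
    if quantized:
        tags.append("quantized")

    lines = ["---", "license: apache-2.0", "library_name: transformers", "tags:"]
    lines += [f"- {tag}" for tag in tags]

    # Dataset information: only when a dataset name was provided.
    if dataset:
        lines += ["datasets:", f"- {dataset}"]

    # Quantization types: only when quantized models are available.
    if quantized:
        lines += ["quantization_types:", "- int8_weight_only", "- int4_weight_only"]

    lines.append("---")
    return "\n".join(lines)


print(build_front_matter({"dataset_name": "OpenHermes-FR", "quantized_models": False}))
```

Building the list of lines first and joining at the end keeps each conditional section independent, so adding another optional field means adding one more `if` block.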
203
-
204
- ## Integration with Training Pipeline
205
-
206
- ### Automatic Metadata Generation
207
-
208
- The push script automatically extracts metadata from:
209
-
210
- 1. **Training Configuration**: Batch size, learning rate, epochs, etc.
211
- 2. **Training Results**: Loss values, perplexity, etc.
212
- 3. **Model Information**: Base model, hardware, etc.
213
- 4. **Experiment Tracking**: Trackio URLs, experiment names
214
-
215
- ### Example Integration
216
-
217
- ```python
218
- # In push_to_huggingface.py
219
- variables = {
220
- "model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
221
- "repo_name": self.repo_name,
222
- "base_model": "HuggingFaceTB/SmolLM3-3B",
223
- "dataset_name": training_config.get('dataset_name', 'OpenHermes-FR'),
224
- "training_config_type": training_config.get('training_config_type', 'Custom Configuration'),
225
- "trainer_type": training_config.get('trainer_type', 'SFTTrainer'),
226
- "batch_size": str(training_config.get('per_device_train_batch_size', 8)),
227
- "learning_rate": str(training_config.get('learning_rate', '5e-6')),
228
- "max_epochs": str(training_config.get('num_train_epochs', 3)),
229
- "hardware_info": self._get_hardware_info(),
230
- "training_loss": results.get('train_loss', 'N/A'),
231
- "validation_loss": results.get('eval_loss', 'N/A'),
232
- "perplexity": results.get('perplexity', 'N/A'),
233
- "quantized_models": False # Updated if quantized models are added
234
- }
235
- ```
236
-
237
- ## Validation and Testing
238
-
239
- ### Metadata Validation
240
-
241
- The system includes validation for:
242
-
243
- - **Required Fields**: Ensures all required metadata is present
244
- - **Format Validation**: Validates YAML syntax and structure
245
- - **Value Ranges**: Checks for reasonable values in numeric fields
246
- - **Conditional Logic**: Verifies conditional metadata is properly included
247
-
248
- ### Test Coverage
249
-
250
- The test suite verifies:
251
-
252
- - **Basic Metadata**: All required fields are present
253
- - **Conditional Metadata**: Quantized model metadata is included when appropriate
254
- - **Evaluation Results**: Model index data is properly structured
255
- - **Template Processing**: Variable substitution works correctly
256
-
257
- ## Best Practices
258
-
259
- ### 1. Metadata Completeness
260
- - Include all available training information
261
- - Provide accurate evaluation metrics
262
- - Use consistent naming conventions
263
-
264
- ### 2. Conditional Logic
265
- - Only include relevant metadata
266
- - Use conditional sections appropriately
267
- - Provide fallback values for missing data
268
-
269
- ### 3. Validation
270
- - Test metadata generation with various configurations
271
- - Verify YAML syntax is correct
272
- - Check that all variables are properly substituted
273
-
274
- ### 4. Documentation
275
- - Document all available metadata fields
276
- - Provide examples for each field type
277
- - Include troubleshooting information
278
-
279
- ## Future Enhancements
280
-
281
- ### Planned Features
282
-
283
- 1. **Additional Metrics**: Support for more evaluation metrics
284
- 2. **Custom Metadata**: User-defined metadata fields
285
- 3. **Validation Rules**: Configurable validation rules
286
- 4. **Auto-Detection**: Automatic detection of model features
287
- 5. **Integration APIs**: Better integration with external tools
288
-
289
- ### Extensibility
290
-
291
- The system is designed to be easily extensible:
292
-
293
- - **New Fields**: Easy to add new metadata fields
294
- - **Custom Validators**: Support for custom validation logic
295
- - **Template Extensions**: Support for template inheritance
296
- - **API Integration**: Easy integration with external APIs
297
-
298
- ## Conclusion
299
-
300
- The enhanced model card metadata system provides comprehensive, standards-compliant metadata that maximizes compatibility with the Hugging Face Hub while providing rich information for model discovery and usage. The system automatically generates appropriate metadata based on model configuration and training results, ensuring consistency and completeness across all model repositories.
docs/ENVIRONMENT_SETUP_FIX.md DELETED
@@ -1,239 +0,0 @@
1
- # Environment Setup Fix
2
-
3
- ## Issue Identified
4
-
5
- The user requested to ensure that the provided token is properly available in the new virtual environment created during the launch script execution to avoid errors.
6
-
7
- ## Root Cause
8
-
9
- The `launch.sh` script was setting environment variables after creating the virtual environment, which could cause the token to not be available within the virtual environment context.
10
-
11
- ## Fixes Applied
12
-
13
- ### 1. **Environment Variables Set Before Virtual Environment** ✅ **FIXED**
14
-
15
- **File**: `launch.sh`
16
-
17
- **Changes**:
18
- - Set environment variables before creating the virtual environment
19
- - Re-export environment variables after activating the virtual environment
20
- - Added verification step to ensure token is available
21
-
22
- **Before**:
23
- ```bash
24
- print_info "Creating Python virtual environment..."
25
- python3 -m venv smollm3_env
26
- source smollm3_env/bin/activate
27
-
28
- # ... install dependencies ...
29
-
30
- # Step 8: Authentication setup
31
- export HF_TOKEN="$HF_TOKEN"
32
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
33
- ```
34
-
35
- **After**:
36
- ```bash
37
- # Set environment variables before creating virtual environment
38
- print_info "Setting up environment variables..."
39
- export HF_TOKEN="$HF_TOKEN"
40
- export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
41
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
42
- export HF_USERNAME="$HF_USERNAME"
43
-
44
- print_info "Creating Python virtual environment..."
45
- python3 -m venv smollm3_env
46
- source smollm3_env/bin/activate
47
-
48
- # Re-export environment variables in the virtual environment
49
- print_info "Configuring environment variables in virtual environment..."
50
- export HF_TOKEN="$HF_TOKEN"
51
- export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
52
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
53
- export HF_USERNAME="$HF_USERNAME"
54
- ```
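
A note on the re-export step: a variable that has been exported in the shell is inherited by every child process, and sourcing a standard `venv` `activate` script changes `PATH` and a few related variables without unsetting others. The second round of `export` lines is therefore harmless and likely redundant. The sketch below illustrates the inheritance behavior in Python; the helper name `child_sees` and the dummy value are hypothetical.

```python
import os
import subprocess
import sys


def child_sees(name: str, value: str) -> str:
    """Start a child Python process with `name=value` set and return what it reads."""
    env = dict(os.environ)
    env[name] = value
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({name!r}, ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child process reads the same value the parent placed in its environment.
print(child_sees("HF_TOKEN", "hf_dummy_value"))
```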
55
-
56
- ### 2. **Token Verification Step** ✅ **ADDED**
57
-
58
- **File**: `launch.sh`
59
-
60
- **Added verification to ensure token is properly configured**:
61
- ```bash
62
- # Verify token is available in the virtual environment
63
- print_info "Verifying token availability in virtual environment..."
64
- if [ -n "$HF_TOKEN" ] && [ -n "$HUGGING_FACE_HUB_TOKEN" ]; then
65
- print_status "✅ Token properly configured in virtual environment"
66
- print_info " HF_TOKEN: ${HF_TOKEN:0:10}...${HF_TOKEN: -4}"
67
- print_info " HUGGING_FACE_HUB_TOKEN: ${HUGGING_FACE_HUB_TOKEN:0:10}...${HUGGING_FACE_HUB_TOKEN: -4}"
68
- else
69
- print_error "❌ Token not properly configured in virtual environment"
70
- print_error "Please check your token and try again"
71
- exit 1
72
- fi
73
- ```
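
The masking used in the verification step (`${HF_TOKEN:0:10}...${HF_TOKEN: -4}`) keeps the first ten and last four characters of the token. A Python equivalent might look like the sketch below. The function name `mask_token` is hypothetical, and the guard for short strings is an addition that the shell snippet does not have:

```python
def mask_token(token: str, head: int = 10, tail: int = 4) -> str:
    """Return the token with everything but its first `head` and last `tail` characters hidden."""
    # Added safeguard (not in the shell version): a short value would otherwise
    # be printed almost in full, so hide it completely.
    if len(token) <= head + tail:
        return "*" * len(token)
    return f"{token[:head]}...{token[-tail:]}"


print(mask_token("hf_abcdefghijklmnop"))
```

Even a masked token reveals part of the secret, so a log line like this is best kept out of shared or public output.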
74
-
75
- ### 3. **Environment Variables Before Each Script Call** ✅ **ADDED**
76
-
77
- **File**: `launch.sh`
78
-
79
- **Added environment variable exports before each Python script call**:
80
-
81
- **Trackio Space Deployment**:
82
- ```bash
83
- # Ensure environment variables are available for the script
84
- export HF_TOKEN="$HF_TOKEN"
85
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
86
- export HF_USERNAME="$HF_USERNAME"
87
-
88
- python deploy_trackio_space.py "$TRACKIO_SPACE_NAME" "$HF_TOKEN" "$GIT_EMAIL"
89
- ```
90
-
91
- **Dataset Setup**:
92
- ```bash
93
- # Ensure environment variables are available for the script
94
- export HF_TOKEN="$HF_TOKEN"
95
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
96
- export HF_USERNAME="$HF_USERNAME"
97
-
98
- python setup_hf_dataset.py "$HF_TOKEN"
99
- ```
100
-
101
- **Trackio Configuration**:
102
- ```bash
103
- # Ensure environment variables are available for the script
104
- export HF_TOKEN="$HF_TOKEN"
105
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
106
- export HF_USERNAME="$HF_USERNAME"
107
-
108
- python configure_trackio.py
109
- ```
110
-
111
- **Training Script**:
112
- ```bash
113
- # Ensure environment variables are available for training
114
- export HF_TOKEN="$HF_TOKEN"
115
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
116
- export HF_USERNAME="$HF_USERNAME"
117
- export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
118
-
119
- python scripts/training/train.py \
120
- --config "$CONFIG_FILE" \
121
- --experiment-name "$EXPERIMENT_NAME" \
122
- --output-dir /output-checkpoint \
123
- --trackio-url "$TRACKIO_URL" \
124
- --trainer-type "$TRAINER_TYPE"
125
- ```
126
-
127
- **Model Push**:
128
- ```bash
129
- # Ensure environment variables are available for model push
130
- export HF_TOKEN="$HF_TOKEN"
131
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
132
- export HF_USERNAME="$HF_USERNAME"
133
- export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
134
-
135
- python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
136
- --token "$HF_TOKEN" \
137
- --trackio-url "$TRACKIO_URL" \
138
- --experiment-name "$EXPERIMENT_NAME" \
139
- --dataset-repo "$TRACKIO_DATASET_REPO"
140
- ```
141
-
142
- **Quantization Scripts**:
143
- ```bash
144
- # Ensure environment variables are available for quantization
145
- export HF_TOKEN="$HF_TOKEN"
146
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
147
- export HF_USERNAME="$HF_USERNAME"
148
- export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
149
-
150
- python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
151
- --quant-type "$QUANT_TYPE" \
152
- --device "$DEVICE" \
153
- --token "$HF_TOKEN" \
154
- --trackio-url "$TRACKIO_URL" \
155
- --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
156
- --dataset-repo "$TRACKIO_DATASET_REPO"
157
- ```
158
-
159
- ## Key Improvements
160
-
161
- ### 1. **Proper Environment Variable Timing**
162
- - ✅ **Set before virtual environment**: Variables set before creating venv
163
- - ✅ **Re-export after activation**: Variables re-exported after activating venv
164
- - ✅ **Before each script**: Variables exported before each Python script call
165
- - ✅ **Verification step**: Token availability verified before proceeding
166
-
167
- ### 2. **Comprehensive Coverage**
168
- - ✅ **All scripts covered**: Every Python script has environment variables
169
- - ✅ **Multiple variables**: HF_TOKEN, HUGGING_FACE_HUB_TOKEN, HF_USERNAME, TRACKIO_DATASET_REPO
170
- - ✅ **Consistent naming**: All scripts use the same environment variable names
171
- - ✅ **Error handling**: Verification step catches missing tokens
172
-
173
- ### 3. **Cross-Platform Compatibility**
174
- - ✅ **Bash compatible**: Uses standard bash export syntax
175
- - ✅ **Virtual environment aware**: Properly handles venv activation
176
- - ✅ **Token validation**: Verifies token availability before use
177
- - ✅ **Clear error messages**: Descriptive error messages for debugging
178
-
179
- ## Environment Variables Set
180
-
181
- The following environment variables are now properly set and available in the virtual environment:
182
-
183
- 1. **`HF_TOKEN`** - The Hugging Face token for authentication
184
- 2. **`HUGGING_FACE_HUB_TOKEN`** - Alternative token variable for Python API
185
- 3. **`HF_USERNAME`** - Username extracted from token
186
- 4. **`TRACKIO_DATASET_REPO`** - Dataset repository for Trackio
187
-
188
- ## Test Results
189
-
190
- ### **Environment Setup Test**
191
- ```bash
192
- $ python tests/test_environment_setup.py
193
-
194
- 🚀 Environment Setup Verification
195
- ==================================================
196
- 🔍 Testing Environment Variables
197
- [OK] HF_TOKEN: hf_FWrfleE...zuoF
198
- [OK] HUGGING_FACE_HUB_TOKEN: hf_FWrfleE...zuoF
199
- [OK] HF_USERNAME: Tonic...onic
200
- [OK] TRACKIO_DATASET_REPO: Tonic/trac...ents
201
-
202
- 🔍 Testing Launch Script Environment Setup
203
- [OK] Found: export HF_TOKEN=
204
- [OK] Found: export HUGGING_FACE_HUB_TOKEN=
205
- [OK] Found: export HF_USERNAME=
206
- [OK] Found: export TRACKIO_DATASET_REPO=
207
- [OK] Found virtual environment activation
208
- [OK] Found environment variable re-export after activation
209
-
210
- [SUCCESS] ALL ENVIRONMENT TESTS PASSED!
211
- [OK] Environment variables: Properly set
212
- [OK] Virtual environment: Can access variables
213
- [OK] Launch script: Properly configured
214
-
215
- The environment setup is working correctly!
216
- ```
217
-
218
- ## User Token Status
219
-
220
- **Token**: `hf_FWrfleE...zuoF`
221
- **Status**: ✅ **Working correctly in virtual environment**
222
- **Username**: `Tonic` (auto-detected)
223
-
224
- ## Next Steps
225
-
226
- The user can now run the launch script with confidence that the token will be properly available in the virtual environment:
227
-
228
- ```bash
229
- ./launch.sh
230
- ```
231
-
232
- The script will:
233
- 1. ✅ **Set environment variables** before creating virtual environment
234
- 2. ✅ **Re-export variables** after activating virtual environment
235
- 3. ✅ **Verify token availability** before proceeding
236
- 4. ✅ **Export variables** before each Python script call
237
- 5. ✅ **Ensure all scripts** have access to the token
238
-
239
- **No more token-related errors in the virtual environment!** 🎉
docs/ENVIRONMENT_VARIABLES.md DELETED
@@ -1,113 +0,0 @@
1
- # 🔧 Trackio Environment Variables Reference
2
-
3
- ## Quick Setup
4
-
5
- Set these environment variables in your Hugging Face Space:
6
-
7
- ```bash
8
- # Required: Your HF token for dataset access
9
- HF_TOKEN=your_hf_token_here
10
-
11
- # Optional: Dataset repository to use (defaults to tonic/trackio-experiments)
12
- TRACKIO_DATASET_REPO=your-username/your-dataset-name
13
- ```
14
-
15
- ## Environment Variables
16
-
17
- | Variable | Required | Default | Description |
18
- |----------|----------|---------|-------------|
19
- | `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token for dataset access |
20
- | `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository to load experiments from |
21
- | `SPACE_ID` | 🔄 Auto | None | HF Space ID (automatically detected) |
22
-
23
- ## Configuration Examples
24
-
25
- ### 1. Default Setup
26
- ```bash
27
- HF_TOKEN=your_token_here
28
- # Uses: tonic/trackio-experiments
29
- ```
30
-
31
- ### 2. Personal Dataset
32
- ```bash
33
- HF_TOKEN=your_token_here
34
- TRACKIO_DATASET_REPO=your-username/trackio-experiments
35
- ```
36
-
37
- ### 3. Team Dataset
38
- ```bash
39
- HF_TOKEN=your_token_here
40
- TRACKIO_DATASET_REPO=your-org/team-experiments
41
- ```
42
-
43
- ### 4. Project-Specific Dataset
44
- ```bash
45
- HF_TOKEN=your_token_here
46
- TRACKIO_DATASET_REPO=your-username/smollm3-experiments
47
- ```
48
-
49
- ## How to Set in HF Spaces
50
-
51
- 1. Go to your Hugging Face Space settings
52
- 2. Navigate to "Settings" → "Environment variables"
53
- 3. Add the variables:
54
- - `HF_TOKEN`: Your HF token
55
- - `TRACKIO_DATASET_REPO`: Your dataset repository (optional)
56
-
57
- ## Testing Configuration
58
-
59
- Run the configuration script to check your setup:
60
-
61
- ```bash
62
- python configure_trackio.py
63
- ```
64
-
65
- This will:
66
- - ✅ Show current environment variables
67
- - 🧪 Test dataset access
68
- - 📊 Display experiment count
69
- - 💾 Generate configuration file
70
-
71
- ## Getting Your HF Token
72
-
73
- 1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
74
- 2. Click "New token"
75
- 3. Give it a name (e.g., "Trackio Access")
76
- 4. Select "Write" permissions
77
- 5. Copy the token and set it as `HF_TOKEN`
78
-
79
- ## Dataset Repository Format
80
-
81
- The `TRACKIO_DATASET_REPO` should follow this format:
82
- ```
83
- username/dataset-name
84
- ```
85
-
86
- Examples:
87
- - `tonic/trackio-experiments`
88
- - `your-username/my-experiments`
89
- - `your-org/team-experiments`
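
A quick check that a value follows the `username/dataset-name` shape can catch typos before any network call is made. The pattern below is a simplified assumption and does not reproduce the Hub's full naming rules. The name `is_valid_repo_id` is hypothetical.

```python
import re

# Simplified assumption: exactly one slash, and each part starts with an
# alphanumeric character followed by letters, digits, '.', '_' or '-'.
_REPO_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def is_valid_repo_id(repo_id: str) -> bool:
    """Return True if `repo_id` looks like `owner/name`."""
    return _REPO_ID.fullmatch(repo_id) is not None


for candidate in ("tonic/trackio-experiments", "trackio-experiments", "a/b/c"):
    print(candidate, is_valid_repo_id(candidate))
```

A passing check only says the string is well formed; whether the repository exists and the token can access it still has to be verified against the Hub.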
90
-
91
- ## Troubleshooting
92
-
93
- ### Issue: "HF_TOKEN not found"
94
- **Solution**: Set your HF token in the Space environment variables
95
-
96
- ### Issue: "Failed to load dataset"
97
- **Solutions**:
98
- 1. Check your token has read access to the dataset
99
- 2. Verify the dataset repository exists
100
- 3. Try the backup fallback (automatic)
101
-
102
- ### Issue: "Failed to save experiments"
103
- **Solutions**:
104
- 1. Check your token has write permissions
105
- 2. Verify the dataset repository exists
106
- 3. Check network connectivity
107
-
108
- ## Security Notes
109
-
110
- - 🔒 Dataset is private by default
111
- - 🔑 Only accessible with your HF_TOKEN
112
- - 🛡️ No sensitive data exposed publicly
113
- - 🔐 Secure storage on HF infrastructure
docs/Entry_Point.md ADDED
@@ -0,0 +1,120 @@
1
+ ```mermaid
2
+ graph LR
3
+ Entry_Point["Entry Point"]
4
+ Configuration["Configuration"]
5
+ Model_Abstraction["Model Abstraction"]
6
+ Data_Pipeline["Data Pipeline"]
7
+ Training_Logic["Training Logic"]
8
+ Utilities["Utilities"]
9
+ Scripts["Scripts"]
10
+ Requirements_Management["Requirements Management"]
11
+ Entry_Point -- "initializes" --> Configuration
12
+ Entry_Point -- "initializes" --> Model_Abstraction
13
+ Entry_Point -- "initializes" --> Data_Pipeline
14
+ Entry_Point -- "invokes" --> Training_Logic
15
+ Configuration -- "provides settings to" --> Model_Abstraction
16
+ Configuration -- "provides settings to" --> Data_Pipeline
17
+ Configuration -- "provides settings to" --> Training_Logic
18
+ Model_Abstraction -- "provides model to" --> Training_Logic
19
+ Data_Pipeline -- "provides data to" --> Training_Logic
20
+ Training_Logic -- "utilizes" --> Model_Abstraction
21
+ Training_Logic -- "utilizes" --> Data_Pipeline
22
+ Training_Logic -- "utilizes" --> Configuration
23
+ Training_Logic -- "utilizes" --> Utilities
24
+ Data_Pipeline -- "uses" --> Utilities
25
+ Model_Abstraction -- "uses" --> Utilities
26
+ Scripts -- "supports" --> Data_Pipeline
27
+ Scripts -- "supports" --> Model_Abstraction
28
+ Requirements_Management -- "defines environment for" --> Entry_Point
29
+ Requirements_Management -- "defines environment for" --> Configuration
30
+ Requirements_Management -- "defines environment for" --> Model_Abstraction
31
+ Requirements_Management -- "defines environment for" --> Data_Pipeline
32
+ Requirements_Management -- "defines environment for" --> Training_Logic
33
+ Requirements_Management -- "defines environment for" --> Utilities
34
+ Requirements_Management -- "defines environment for" --> Scripts
35
+ click Entry_Point href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Entry_Point.md" "Details"
36
+ click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
37
+ click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"
38
+ ```
39
+
40
+ [![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%[email protected]?style=flat-square)](mailto:[email protected])
41
+
42
+ ## Details
43
+
44
+ Component overview for the Machine Learning Training and Fine-tuning Framework.
45
+
46
+ ### Entry Point [[Expand]](./Entry_Point.md)
47
+ The primary execution script that orchestrates the entire training process. It initializes all other major components, loads configurations, sets up the training environment, and invokes the core training logic.
48
+
49
+
50
+ **Related Classes/Methods**:
51
+
52
+ - `train.py`
53
+
54
+
55
+ ### Configuration
56
+ Centralized management of all training parameters, model hyperparameters, dataset paths, and other environment settings. It defines the schema for configurations, often using dataclasses, and supports both base and custom configurations.
57
+
58
+
59
+ **Related Classes/Methods**:
60
+
61
+ - `config/` (1:1)
62
+
63
+
64
+ ### Model Abstraction [[Expand]](./Model_Abstraction.md)
65
+ Responsible for abstracting the underlying machine learning model. This includes loading pre-trained models, handling different model architectures or variants, and preparing the model for training (e.g., quantization, device placement).
66
+
67
+
68
+ **Related Classes/Methods**:
69
+
70
+ - <a href="https://github.com/Josephrp/SmolFactory/docs/blob/main/src/model.py#L1-L1" target="_blank" rel="noopener noreferrer">`model.py` (1:1)</a>
71
+
72
+
73
+ ### Data Pipeline [[Expand]](./Data_Pipeline.md)
74
+ Manages the entire data flow, from loading raw datasets to preprocessing, tokenization, and creating efficient data loaders (e.g., PyTorch `DataLoader`) for batching and shuffling data during training and evaluation.
75
+
76
+
77
+ **Related Classes/Methods**:
78
+
79
+ - <a href="https://github.com/Josephrp/SmolFactory/docs/blob/main/src/data.py#L1-L1" target="_blank" rel="noopener noreferrer">`data.py` (1:1)</a>
80
+
81
+
82
+ ### Training Logic
83
+ Encapsulates the core training loop, including forward and backward passes, loss calculation, optimization steps, and integration of callbacks for monitoring and control. It may include specialized trainers for different fine-tuning methods.
84
+
85
+
86
+ **Related Classes/Methods**:
87
+
88
+ - <a href="https://github.com/Josephrp/SmolFactory/docs/blob/main/src/trainer.py#L1-L1" target="_blank" rel="noopener noreferrer">`trainer.py` (1:1)</a>
89
+
90
+
91
+ ### Utilities
92
+ Provides a collection of common helper functions, classes, and modules used across various components. This includes functionalities like logging, metric calculation, checkpointing, and general data manipulation.
93
+
94
+
95
+ **Related Classes/Methods**:
96
+
97
+ - `utils/` (1:1)
98
+
99
+
100
+ ### Scripts
101
+ Contains auxiliary scripts that support the overall project but are separate from the main training pipeline. Examples include data preparation scripts, model conversion tools, or deployment-related utilities.
102
+
103
+
104
+ **Related Classes/Methods**:
105
+
106
+ - `scripts/` (1:1)
107
+
108
+
109
+ ### Requirements Management
110
+ Defines and manages all project dependencies, ensuring a consistent and reproducible development and deployment environment. This typically involves `requirements.txt` files or similar dependency management tools.
111
+
112
+
113
+ **Related Classes/Methods**:
114
+
115
+ - `requirements/` (1:1)
116
+
117
+
118
+
119
+
120
+ ### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
docs/FINAL_DEPLOYMENT_VERIFICATION.md DELETED
@@ -1,378 +0,0 @@
1
- # Final Deployment Verification Summary
2
-
3
- ## Overview
4
-
5
- This document provides the final verification that all important components for Trackio Spaces deployment and model repository deployment have been properly implemented and are working correctly.
6
-
7
- ## ✅ **VERIFICATION COMPLETE: All Components Properly Implemented**
8
-
9
- ### **What We Verified**
10
-
11
- You were absolutely right to ask about the Trackio Spaces deployment and model repository deployment components. I've now **completely verified** that all important components are properly implemented:
12
-
13
- ## **Trackio Spaces Deployment** ✅ **FULLY IMPLEMENTED**
14
-
15
- ### **1. Space Creation System** ✅ **COMPLETE**
16
- - **Location**: `scripts/trackio_tonic/deploy_trackio_space.py`
17
- - **Functionality**: Creates HF Spaces using latest Python API
18
- - **Features**:
19
- - ✅ API-based creation with `huggingface_hub.create_repo`
20
- - ✅ Fallback to CLI method if API fails
21
- - ✅ Automatic username extraction from token
22
- - ✅ Proper Space configuration (Gradio SDK, CPU hardware)
23
-
24
- ### **2. File Upload System** ✅ **COMPLETE**
25
- - **Location**: `scripts/trackio_tonic/deploy_trackio_space.py`
26
- - **Functionality**: Uploads all required files to Space
27
- - **Features**:
28
- - ✅ API-based upload using `huggingface_hub.upload_file`
29
- - ✅ Proper HF Spaces file structure
30
- - ✅ Git integration in temporary directory
31
- - ✅ Error handling and fallback mechanisms
32
-
33
- **Files Uploaded**:
34
- - ✅ `app.py` - Complete Gradio interface (1,241 lines)
35
- - ✅ `requirements.txt` - All dependencies included
36
- - ✅ `README.md` - Comprehensive documentation
37
- - ✅ `.gitignore` - Proper git configuration
38
-
39
- ### **3. Space Configuration** ✅ **COMPLETE**
40
- - **Location**: `scripts/trackio_tonic/deploy_trackio_space.py`
41
- - **Functionality**: Sets environment variables via HF Hub API
42
- - **Features**:
43
- - ✅ API-based secrets using `add_space_secret()`
44
- - ✅ Automatic `HF_TOKEN` configuration
45
- - ✅ Automatic `TRACKIO_DATASET_REPO` setup
46
- - ✅ Manual fallback instructions if API fails
47
-
48
- ### **4. Gradio Interface** ✅ **COMPLETE**
49
- - **Location**: `templates/spaces/app.py` (1,241 lines)
50
- - **Functionality**: Comprehensive experiment tracking interface
51
- - **Features**:
52
- - ✅ **Experiment Management**: Create, view, update experiments
53
- - ✅ **Metrics Logging**: Real-time training metrics
54
- - ✅ **Visualization**: Interactive plots and charts
55
- - ✅ **HF Datasets Integration**: Persistent storage
56
- - ✅ **API Endpoints**: Programmatic access
57
- - ✅ **Fallback Data**: Backup when dataset unavailable
58
-
59
- **Interface Components**:
60
- - ✅ **Create Experiment**: Start new experiments
61
- - ✅ **Log Metrics**: Track training progress
62
- - ✅ **View Experiments**: See experiment details
63
- - ✅ **Update Status**: Mark experiments complete
64
- - ✅ **Visualizations**: Interactive plots
65
- - ✅ **Configuration**: Environment setup
66
-
67
- ### **5. Requirements and Dependencies** ✅ **COMPLETE**
68
- - **Location**: `templates/spaces/requirements.txt`
69
- - **Dependencies**: All required packages included
70
- - ✅ **Core Gradio**: `gradio>=4.0.0`
71
- - ✅ **Data Processing**: `pandas>=2.0.0`, `numpy>=1.24.0`
72
- - ✅ **Visualization**: `plotly>=5.15.0`
73
- - ✅ **HF Integration**: `datasets>=2.14.0`, `huggingface-hub>=0.16.0`
74
- - ✅ **HTTP Requests**: `requests>=2.31.0`
75
- - ✅ **Environment**: `python-dotenv>=1.0.0`
76
-
77
- ### **6. README Template** ✅ **COMPLETE**
78
- - **Location**: `templates/spaces/README.md`
79
- - **Features**:
80
- - ✅ **HF Spaces Metadata**: Proper YAML frontmatter
81
- - ✅ **Feature Documentation**: Complete interface description
82
- - ✅ **API Documentation**: Usage examples
83
- - ✅ **Configuration Guide**: Environment variables
84
- - ✅ **Troubleshooting**: Common issues and solutions
85
-
86
- ## **Model Repository Deployment** ✅ **FULLY IMPLEMENTED**
87
-
88
- ### **1. Repository Creation** ✅ **COMPLETE**
89
- - **Location**: `scripts/model_tonic/push_to_huggingface.py`
90
- - **Functionality**: Creates HF model repositories using Python API
91
- - **Features**:
92
- - ✅ API-based creation with `huggingface_hub.create_repo`
93
- - ✅ Configurable private/public settings
94
- - ✅ Existing repository handling (`exist_ok=True`)
95
- - ✅ Proper error handling and messages
96
-
97
- ### **2. Model File Upload** ✅ **COMPLETE**
98
- - **Location**: `scripts/model_tonic/push_to_huggingface.py`
99
- - **Functionality**: Uploads all model files to repository
100
- - **Features**:
101
- - ✅ File validation and integrity checks
102
- - ✅ Complete model component upload
103
- - ✅ Progress tracking and feedback
104
- - ✅ Graceful error handling
105
-
106
- **Files Uploaded**:
107
- - ✅ `config.json` - Model configuration
108
- - ✅ `pytorch_model.bin` - Model weights
109
- - ✅ `tokenizer.json` - Tokenizer configuration
110
- - ✅ `tokenizer_config.json` - Tokenizer settings
111
- - ✅ `special_tokens_map.json` - Special tokens
112
- - ✅ `generation_config.json` - Generation settings
113
-
114
- ### **3. Model Card Generation** ✅ **COMPLETE**
115
- - **Location**: `scripts/model_tonic/push_to_huggingface.py`
116
- - **Functionality**: Generates comprehensive model cards
117
- - **Features**:
118
- - ✅ Template-based generation using `templates/model_card.md`
119
- - ✅ Dynamic content from training configuration
120
- - ✅ Usage examples and documentation
121
- - ✅ Support for quantized model variants
122
- - ✅ Proper HF Hub metadata
123
-
124
- ### **4. Training Results Documentation** ✅ **COMPLETE**
125
- - **Location**: `scripts/model_tonic/push_to_huggingface.py`
126
- - **Functionality**: Uploads training configuration and results
127
- - **Features**:
128
- - ✅ Training parameters documentation
129
- - ✅ Performance metrics inclusion
130
- - ✅ Experiment tracking links
131
- - ✅ Proper documentation structure
132
-
133
- ### **5. Quantized Model Support** ✅ **COMPLETE**
134
- - **Location**: `scripts/model_tonic/quantize_model.py`
135
- - **Functionality**: Creates and uploads quantized models
136
- - **Features**:
137
- - ✅ Multiple quantization levels (int8, int4)
138
- - ✅ Unified repository structure
139
- - ✅ Separate documentation for each variant
140
- - ✅ Clear usage instructions
141
-
142
- ### **6. Trackio Integration** ✅ **COMPLETE**
143
- - **Location**: `scripts/model_tonic/push_to_huggingface.py`
144
- - **Functionality**: Logs model push events to Trackio
145
- - **Features**:
146
- - ✅ Event logging for model pushes
147
- - ✅ Training results tracking
148
- - ✅ Experiment tracking links
149
- - ✅ HF Datasets integration
150
-
151
- ### **7. Model Validation** ✅ **COMPLETE**
152
- - **Location**: `scripts/model_tonic/push_to_huggingface.py`
153
- - **Functionality**: Validates model files before upload
154
- - **Features**:
155
- - ✅ Complete file validation
156
- - ✅ Size and integrity checks
157
- - ✅ Configuration validation
158
- - ✅ Detailed error reporting
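A minimal sketch of the kind of pre-upload check described above, assuming the required files are the six listed under "Files Uploaded". The function name `find_model_problems` and its return shape are hypothetical and are not taken from `push_to_huggingface.py`.

```python
import json
from pathlib import Path

# Assumption: the six files listed under "Files Uploaded" are all required.
REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "generation_config.json",
)


def find_model_problems(model_dir):
    """Return a list of human-readable problems; an empty list means the check passed."""
    root = Path(model_dir)
    problems = []
    for name in REQUIRED_FILES:
        path = root / name
        if not path.is_file():
            problems.append(f"missing file: {name}")
        elif path.stat().st_size == 0:
            problems.append(f"empty file: {name}")
        elif name.endswith(".json"):
            # JSON files must at least parse; the weights file is only size-checked.
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except json.JSONDecodeError as exc:
                problems.append(f"invalid JSON in {name}: {exc.msg}")
    return problems
```

Returning every problem at once, rather than stopping at the first one, lets the caller print a complete report before aborting the upload.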
159
-
160
- ## **Integration Components** ✅ **FULLY IMPLEMENTED**
161
-
162
- ### **1. Launch Script Integration** ✅ **COMPLETE**
163
- - **Location**: `launch.sh`
164
- - **Features**:
165
- - ✅ Automatic Trackio Space deployment calls
166
- - ✅ Automatic model push integration
167
- - ✅ Environment setup and configuration
168
- - ✅ Error handling and user feedback
169
-
170
- ### **2. Monitoring Integration** ✅ **COMPLETE**
171
- - **Location**: `src/monitoring.py`
172
- - **Features**:
173
- - ✅ `SmolLM3Monitor` class implementation
174
- - ✅ Real-time experiment tracking
175
- - ✅ Trackio Space integration
176
- - ✅ HF Datasets integration
177
-
178
- ### **3. Dataset Integration** ✅ **COMPLETE**
179
- - **Location**: `scripts/dataset_tonic/setup_hf_dataset.py`
180
- - **Features**:
181
- - ✅ Automatic dataset repository creation
182
- - ✅ Initial experiment data upload
183
- - ✅ README template integration
184
- - ✅ Environment variable setup
185
-
186
- ## **Token Validation** ✅ **FULLY IMPLEMENTED**
187
-
188
- ### **1. Token Validation System** ✅ **COMPLETE**
189
- - **Location**: `scripts/validate_hf_token.py`
190
- - **Features**:
191
- - ✅ API-based token validation
192
- - ✅ Username extraction from token
193
- - ✅ JSON output for shell parsing
194
- - ✅ Comprehensive error handling
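As a rough illustration of token validation with JSON output, the sketch below wraps a lookup callable and always prints a single JSON object that a shell script can parse. The function name `validate_token` and the JSON keys `success`, `username`, and `error` are assumptions for illustration; the actual output schema of `scripts/validate_hf_token.py` is not shown in this document.

```python
import json


def validate_token(token, whoami):
    """Return a one-line JSON string describing whether `token` is valid.

    `whoami` is any callable that takes the token and returns a mapping with
    a "name" key, or raises on an invalid token. With huggingface_hub this
    could be `lambda t: HfApi().whoami(token=t)`.
    """
    try:
        info = whoami(token)
        return json.dumps({"success": True, "username": info["name"]})
    except Exception as exc:  # any failure is reported as an invalid token
        return json.dumps({"success": False, "error": str(exc)})
```

Passing the lookup in as a parameter keeps the function testable without network access.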
195
-
196
- ## **Test Results** ✅ **ALL PASSED**
197
-
198
- ### **Comprehensive Component Test**
199
- ```bash
200
- $ python tests/test_deployment_components.py
201
-
202
- 🚀 Deployment Components Verification
203
- ==================================================
204
- 🔍 Testing Trackio Space Deployment Components
205
- ✅ Trackio Space deployment script exists
206
- ✅ Gradio app template exists
207
- ✅ TrackioSpace class implemented
208
- ✅ Experiment creation functionality
209
- ✅ Metrics logging functionality
210
- ✅ Experiment retrieval functionality
211
- ✅ Space requirements file exists
212
- ✅ Required dependency: gradio
213
- ✅ Required dependency: pandas
214
- ✅ Required dependency: plotly
215
- ✅ Required dependency: datasets
216
- ✅ Required dependency: huggingface-hub
217
- ✅ Space README template exists
218
- ✅ HF Spaces metadata present
219
- ✅ All Trackio Space components verified!
220
-
221
- 🔍 Testing Model Repository Deployment Components
222
- ✅ Model push script exists
223
- ✅ Model quantization script exists
224
- ✅ Model card template exists
225
- ✅ Required section: base_model:
226
- ✅ Required section: pipeline_tag:
227
- ✅ Required section: tags:
228
- ✅ Model card generator exists
229
- ✅ Required function: def create_repository
230
- ✅ Required function: def upload_model_files
231
- ✅ Required function: def create_model_card
232
- ✅ Required function: def validate_model_path
233
- ✅ All Model Repository components verified!
234
-
235
- 🔍 Testing Integration Components
236
- ✅ Launch script exists
237
- ✅ Trackio Space deployment integrated
238
- ✅ Model push integrated
239
- ✅ Monitoring script exists
240
- ✅ SmolLM3Monitor class implemented
241
- ✅ Dataset setup script exists
242
- ✅ Dataset setup function implemented
243
- ✅ All integration components verified!
244
-
245
- 🔍 Testing Token Validation
246
- ✅ Token validation script exists
247
- ✅ Token validation function implemented
248
- ✅ Token validation components verified!
249
-
250
- ==================================================
251
- 🎉 ALL COMPONENTS VERIFIED SUCCESSFULLY!
252
- ✅ Trackio Space deployment components: Complete
253
- ✅ Model repository deployment components: Complete
254
- ✅ Integration components: Complete
255
- ✅ Token validation components: Complete
256
-
257
- All important deployment components are properly implemented!
258
- ```
259
-
260
- ## **Technical Implementation Details**
261
-
262
- ### **Trackio Space Deployment Flow**
263
- ```python
264
- # 1. Create Space
265
- create_repo(
266
- repo_id=f"{username}/{space_name}",
267
- token=token,
268
- repo_type="space",
269
- exist_ok=True,
270
- private=False,
271
- space_sdk="gradio",
272
- space_hardware="cpu-basic"
273
- )
274
-
275
- # 2. Upload Files
276
- upload_file(
277
- path_or_fileobj=file_content,
278
- path_in_repo=file_path,
279
- repo_id=repo_id,
280
- repo_type="space",
281
- token=token
282
- )
283
-
284
- # 3. Set Secrets
285
- add_space_secret(
286
- repo_id=repo_id,
287
- token=token,
288
- key="HF_TOKEN",
289
- value=token
290
- )
291
- ```
292
-
293
- ### **Model Repository Deployment Flow**
294
- ```python
295
- # 1. Create Repository
296
- create_repo(
297
- repo_id=repo_name,
298
- token=token,
299
- private=private,
300
- exist_ok=True
301
- )
302
-
303
- # 2. Upload Model Files
304
- upload_file(
305
- path_or_fileobj=model_file,
306
- path_in_repo=file_path,
307
- repo_id=repo_name,
308
- token=token
309
- )
310
-
311
- # 3. Generate Model Card
312
- model_card = create_model_card(training_config, results)
313
- upload_file(
314
- path_or_fileobj=model_card,
315
- path_in_repo="README.md",
316
- repo_id=repo_name,
317
- token=token
318
- )
319
- ```
320
-
321
- ## **Verification Summary**
322
-
323
- | Component Category | Status | Components Verified | Test Result |
324
- |-------------------|--------|-------------------|-------------|
325
- | **Trackio Space Deployment** | ✅ Complete | 6 components | ✅ All passed |
326
- | **Model Repository Deployment** | ✅ Complete | 7 components | ✅ All passed |
327
- | **Integration Components** | ✅ Complete | 3 components | ✅ All passed |
328
- | **Token Validation** | ✅ Complete | 1 component | ✅ All passed |
329
-
330
- ## **Key Achievements**
331
-
332
- ### **1. Complete Automation**
333
- - ✅ **No manual username input**: Automatic extraction from token
334
- - ✅ **No manual Space creation**: Automatic via Python API
335
- - ✅ **No manual model upload**: Complete automation
336
- - ✅ **No manual configuration**: Automatic environment setup
337
-
338
- ### **2. Robust Error Handling**
339
- - ✅ **API fallbacks**: CLI methods when API fails
340
- - ✅ **Graceful degradation**: Clear error messages
341
- - ✅ **User feedback**: Progress indicators and status
342
- - ✅ **Recovery mechanisms**: Multiple retry strategies
343
-
344
- ### **3. Comprehensive Documentation**
345
- - ✅ **Model cards**: Complete with usage examples
346
- - ✅ **Space documentation**: Full interface description
347
- - ✅ **API documentation**: Usage examples and integration
348
- - ✅ **Troubleshooting guides**: Common issues and solutions
349
-
350
- ### **4. Cross-Platform Support**
351
- - ✅ **Windows**: Tested and working on PowerShell
352
- - ✅ **Linux**: Compatible with bash scripts
353
- - ✅ **macOS**: Compatible with zsh/bash
354
- - ✅ **Python API**: Platform-independent
355
-
356
- ## **Next Steps**
357
-
358
- The deployment components are now **fully implemented and verified**. Users can:
359
-
360
- 1. **Deploy Trackio Space**: Automatic Space creation and configuration
361
- 2. **Upload Models**: Complete model deployment with documentation
362
- 3. **Monitor Experiments**: Real-time tracking and visualization
363
- 4. **Share Results**: Comprehensive documentation and examples
364
- 5. **Scale Operations**: Support for multiple experiments and models
365
-
366
- ## **Conclusion**
367
-
368
- **All important deployment components are properly implemented and working correctly!** 🎉
369
-
370
- The verification confirms that:
371
- - ✅ **Trackio Spaces deployment**: Complete with all required components
372
- - ✅ **Model repository deployment**: Complete with all required components
373
- - ✅ **Integration systems**: Complete with all required components
374
- - ✅ **Token validation**: Complete with all required components
375
- - ✅ **Documentation**: Complete with all required components
376
- - ✅ **Error handling**: Complete with all required components
377
-
378
- The system is now ready for production use with full automation and comprehensive functionality.
docs/FORMATTING_FIX_SUMMARY.md DELETED
@@ -1,153 +0,0 @@
1
- # String Formatting Fix Summary
2
-
3
- ## 🐛 Problem
4
-
5
- The training script was failing with the error:
6
- ```
7
- ERROR:trainer:Training failed: Unknown format code 'f' for object of type 'str'
8
- ```
9
-
10
- This error occurs when a format specification with the `f` presentation type (for example `{loss:.4f}` in an f-string or a `str.format()` call) receives a string object instead of a numeric value.
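The failure can be reproduced in isolation. The sketch below triggers the same message and shows one way to avoid it by coercing the metric to `float` before formatting; the helper name `format_step` is hypothetical and is not part of the training code.

```python
def format_step(step, loss, lr):
    """Format a training log line, coercing the loss to float first."""
    return "Step {}: loss={:.4f}, lr={}".format(step, float(loss), lr)


# Reproduce the failure: the `f` presentation type is not defined for str.
try:
    "loss={:.4f}".format("0.1234")
except ValueError as exc:
    error_message = str(exc)  # Unknown format code 'f' for object of type 'str'

# Coercing the value avoids the error even when the metric arrives as a string.
line = format_step(10, "0.123456", 2e-5)
```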
11
-
12
- ## 🔍 Root Cause
13
-
14
- The issue was attributed to inconsistent use of f-string formatting (`f"..."`) and `%`-style formatting (`"..." % ...`) in logging statements throughout the codebase. The error message itself is raised when a float format spec such as `{loss:.4f}` is applied to a value that is actually a string, so a metric arriving as a string is the most direct trigger.
15
-
16
- ## ✅ Solution
17
-
18
- I fixed the issue by standardizing all logging statements to use traditional string formatting with `%` placeholders instead of f-strings. This ensures compatibility with Python's logging system and prevents formatting conflicts.
19
-
20
- ### Files Fixed
21
-
22
- 1. **`src/monitoring.py`** - Fixed all logging statements
23
- 2. **`src/trainer.py`** - Fixed all logging statements
24
- 3. **`src/model.py`** - Fixed all logging statements
25
- 4. **`src/data.py`** - Fixed all logging statements
26
-
27
- ### Changes Made
28
-
29
- #### Before (Problematic):
30
- ```python
31
- logger.info(f"Loading model from {self.model_name}")
32
- logger.error(f"Failed to load model: {e}")
33
- print(f"Step {step}: loss={loss:.4f}, lr={lr}")
34
- ```
35
-
36
- #### After (Fixed):
37
- ```python
38
- logger.info("Loading model from %s", self.model_name)
39
- logger.error("Failed to load model: %s", e)
40
- print("Step {}: loss={:.4f}, lr={}".format(step, float(loss), lr))  # float() guards against a string-valued loss
41
- ```
42
-
43
- ## 🧪 Testing
44
-
45
- Created `test_formatting_fix.py` to verify the fix:
46
-
47
- ```bash
48
- python test_formatting_fix.py
49
- ```
50
-
51
- This script tests:
52
- - ✅ Logging functionality
53
- - ✅ Module imports
54
- - ✅ Configuration loading
55
- - ✅ Monitoring creation
56
- - ✅ Error handling
57
-
58
- ## 🚀 Usage
59
-
60
- The fix is now ready to use. You can run your training command again:
61
-
62
- ```bash
63
- python run_a100_large_experiment.py \
64
- --config config/train_smollm3_openhermes_fr_a100_balanced.py \
65
- --trackio_url "https://tonic-test-trackio-test.hf.space" \
66
- --experiment-name "petit-elle-l-aime-3-balanced" \
67
- --output-dir ./outputs/balanced | tee trainfr.log
68
- ```
69
-
70
- ## 📋 Key Changes
71
-
72
- ### 1. Monitoring Module (`src/monitoring.py`)
73
- - Fixed all `logger.info()`, `logger.error()`, `logger.warning()` calls
74
- - Replaced f-strings with `%` formatting
75
- - Fixed string concatenation in file paths
76
- - Fixed HF Datasets integration logging
77
-
78
- ### 2. Trainer Module (`src/trainer.py`)
79
- - Fixed logging in `SmolLM3Trainer` class
80
- - Fixed console output formatting
81
- - Fixed error message formatting
82
- - Fixed callback logging
83
-
84
- ### 3. Model Module (`src/model.py`)
85
- - Fixed model loading logging
86
- - Fixed configuration logging
87
- - Fixed error reporting
88
- - Fixed parameter logging
89
-
90
- ### 4. Data Module (`src/data.py`)
91
- - Fixed dataset loading logging
92
- - Fixed processing progress logging
93
- - Fixed error handling
94
- - Fixed split processing logging
95
-
96
- ## 🔧 Technical Details
97
-
98
- ### Why This Happened
99
- 1. **Mixed Formatting**: Some code used f-strings while others used `%` formatting
100
- 2. **Logging System**: Python's logging system processes format strings differently
101
- 3. **String Processing**: When a value that was actually a string was formatted with a float spec such as `:.4f`, Python raised the error shown above
102
-
103
- ### The Fix
104
- 1. **Standardized Formatting**: All logging now uses `%` placeholders
105
- 2. **Consistent Style**: No more mixing of f-strings and `%` formatting
106
- 3. **Safe Logging**: All logging statements are now safe for the logging system
107
-
108
- ### Benefits
109
- - ✅ **Eliminates Formatting Errors**: No more "Unknown format code 'f'" errors
110
- - ✅ **Consistent Code Style**: All logging uses the same format
111
- - ✅ **Deferred Formatting**: Arguments passed to the logger with `%` placeholders are only interpolated if the record is actually emitted
112
- - ✅ **Compatibility**: Works with all Python versions and logging configurations
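One practical difference between the two styles is when interpolation happens. The sketch below uses a hypothetical `Expensive` class that counts string conversions to show that a `%s` argument is left untouched when the log level filters the message out, while an f-string is always evaluated first. The logger name `formatting_demo` is arbitrary.

```python
import logging


class Expensive:
    """Counts how often it is converted to a string."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "expensive-value"


logger = logging.getLogger("formatting_demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.NullHandler())
logger.propagate = False

lazy = Expensive()
eager = Expensive()

# DEBUG is below the logger's level, so the %s argument is never interpolated.
logger.debug("value: %s", lazy)

# The f-string is evaluated before logger.debug() is even called.
logger.debug(f"value: {eager}")
```

After running this, `lazy.str_calls` is 0 and `eager.str_calls` is 1.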
113
-
114
- ## 🎯 Verification
115
-
116
- To verify the fix works:
117
-
118
- 1. **Run the test script**:
119
- ```bash
120
- python test_formatting_fix.py
121
- ```
122
-
123
- 2. **Check that all tests pass**:
124
- - ✅ Logging tests
125
- - ✅ Import tests
126
- - ✅ Configuration tests
127
- - ✅ Monitoring creation tests
128
-
129
- 3. **Run your training command**:
130
- ```bash
131
- python run_a100_large_experiment.py --config config/train_smollm3_openhermes_fr_a100_balanced.py --trackio_url "https://tonic-test-trackio-test.hf.space" --experiment-name "petit-elle-l-aime-3-balanced" --output-dir ./outputs/balanced
132
- ```
133
-
134
- ## 📝 Notes
135
-
136
- - The fix maintains all existing functionality
137
- - No changes to the training logic or configuration
138
- - All error messages and logging remain informative
139
- - The fix is backward compatible
140
- - HF Datasets integration is preserved
141
-
142
- ## 🚨 Prevention
143
-
144
- To prevent similar issues in the future:
145
-
146
- 1. **Use Consistent Formatting**: Stick to `%` formatting for logging
147
- 2. **Avoid f-strings in Logging**: Don't use f-strings in `logger.info()` calls
148
- 3. **Test Logging**: Always test logging statements during development
149
- 4. **Use Type Hints**: Consider using type hints to catch formatting issues early
150
-
151
- ---
152
-
153
- **The formatting fix is now complete and ready for use! 🎉**
docs/GIT_CONFIGURATION_FIX.md DELETED
@@ -1,257 +0,0 @@
1
- # Git Configuration Fix for Trackio Space Deployment
2
-
3
- ## Issue Identified
4
-
5
- The Trackio Space deployment was failing with the error:
6
- ```
7
- ❌ Error uploading files: Command '['git', 'commit', '-m', 'Initial Trackio Space setup']' returned non-zero exit status 128.
8
- ```
9
-
10
- This error occurs because git requires a user identity (email and name) to be configured before making commits. The deployment script was creating a temporary directory and initializing a git repository, but wasn't configuring the git user identity in that temporary directory.
11
-
12
- ## Root Cause
13
-
14
- ### **Problem**: Git Identity Not Configured in Temporary Directory
15
-
16
- When the deployment script:
17
- 1. Creates a temporary directory
18
- 2. Changes to that directory (`os.chdir(temp_dir)`)
19
- 3. Initializes a git repository (`git init`)
20
- 4. Tries to commit (`git commit`)
21
-
22
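- The git repository in the temporary directory has no repository-level identity of its own and does not inherit the local configuration of the main project directory, so unless a global identity is set on the machine, it has no user identity configured.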
23
-
24
- ### **Solution**: Configure Git Identity in Temporary Directory
25
-
26
- The fix involves explicitly configuring git user identity in the temporary directory before attempting to commit.
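The idea can be sketched as a standalone function, assuming `git` is available on `PATH`. The function name `commit_in_temp_repo` is hypothetical; the commit message is the one from the error above. Unlike the deployment script, this sketch passes `cwd=` to each command instead of calling `os.chdir()`.

```python
import subprocess
import tempfile
from pathlib import Path


def commit_in_temp_repo(email, name):
    """Create a throwaway repository, commit one file, and return the author email."""
    with tempfile.TemporaryDirectory() as repo:
        def git(*args):
            # cwd=repo avoids os.chdir(), so the caller's working directory is untouched.
            return subprocess.run(
                ["git", *args], cwd=repo, check=True, capture_output=True, text=True
            ).stdout.strip()

        git("init")
        # Repository-level identity: without it, `git commit` exits with status 128
        # on machines where no global identity is configured.
        git("config", "user.email", email)
        git("config", "user.name", name)
        Path(repo, "README.md").write_text("# Trackio Space\n", encoding="utf-8")
        git("add", "README.md")
        git("commit", "-m", "Initial Trackio Space setup")
        return git("log", "-1", "--format=%ae")
```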
27
-
28
- ## Fixes Applied
29
-
30
- ### 1. **Enhanced TrackioSpaceDeployer Constructor**
31
-
32
- **Before**:
33
- ```python
34
- def __init__(self, space_name: str, username: str, token: str):
35
- self.space_name = space_name
36
- self.username = username
37
- self.token = token
38
- ```
39
-
40
- **After**:
41
- ```python
42
- def __init__(self, space_name: str, username: str, token: str, git_email: str = None, git_name: str = None):
43
- self.space_name = space_name
44
- self.username = username
45
- self.token = token
46
-
47
- # Git configuration
48
- self.git_email = git_email or f"{username}@huggingface.co"
49
- self.git_name = git_name or username
50
- ```
51
-
52
- ### 2. **Git Configuration in upload_files_to_space Method**
53
-
54
- **Added to the method**:
55
- ```python
56
- # Configure git user identity for this repository
57
- try:
58
- # Try to get existing git config
59
- result = subprocess.run(["git", "config", "--global", "user.email"], capture_output=True, text=True)
60
- if result.returncode == 0 and result.stdout.strip():
61
- git_email = result.stdout.strip()
62
- else:
63
- git_email = self.git_email
64
-
65
- result = subprocess.run(["git", "config", "--global", "user.name"], capture_output=True, text=True)
66
- if result.returncode == 0 and result.stdout.strip():
67
- git_name = result.stdout.strip()
68
- else:
69
- git_name = self.git_name
70
-
71
- except Exception:
72
- # Fallback to default values
73
- git_email = self.git_email
74
- git_name = self.git_name
75
-
76
- # Set git config for this repository
77
- subprocess.run(["git", "config", "user.email", git_email], check=True, capture_output=True)
78
- subprocess.run(["git", "config", "user.name", git_name], check=True, capture_output=True)
79
-
80
- print(f"✅ Configured git with email: {git_email}, name: {git_name}")
81
- ```
82
-
83
- ### 3. **Updated Main Function**
84
-
85
- **Enhanced to accept git configuration**:
86
- ```python
87
- def main():
88
- # Get user input
89
- username = input("Enter your Hugging Face username: ").strip()
90
- space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
91
- token = input("Enter your Hugging Face token: ").strip()
92
-
93
- # Get git configuration (optional)
94
- git_email = input("Enter your git email (optional, press Enter for default): ").strip()
95
- git_name = input("Enter your git name (optional, press Enter for default): ").strip()
96
-
97
- # Create deployer with git config
98
- deployer = TrackioSpaceDeployer(space_name, username, token, git_email, git_name)
99
- ```
100
-
101
- ### 4. **Updated Launch Script**
102
-
103
- **Enhanced to pass git configuration**:
104
- ```bash
105
- # Create deployment script input
106
- cat > deploy_input.txt << EOF
107
- $HF_USERNAME
108
- $TRACKIO_SPACE_NAME
109
- $HF_TOKEN
110
- $GIT_EMAIL
111
- $HF_USERNAME
112
- EOF
113
- ```
114
-
115
- ## Testing the Fix
116
-
117
- ### **Run Git Configuration Tests**
118
- ```bash
119
- python tests/test_git_config_fix.py
120
- ```
121
-
122
- Expected output:
123
- ```
124
- 🚀 Testing Git Configuration Fix
125
- ========================================
126
- 🔍 Testing git configuration in temporary directory...
127
- ✅ Created temp directory: /tmp/tmp_xxxxx
128
- ✅ Initialized git repository
129
- ✅ Git email configured correctly
130
- ✅ Git name configured correctly
131
- ✅ Git commit successful
132
- ✅ Cleanup successful
133
-
134
- 🔍 Testing deployment script git configuration...
135
- ✅ Git email set correctly
136
- ✅ Git name set correctly
137
-
138
- 🔍 Testing git configuration fallback...
139
- ✅ Default git email set correctly
140
- ✅ Default git name set correctly
141
-
142
- 🔍 Testing git commit with configuration...
143
- ✅ Created temp directory: /tmp/tmp_xxxxx
144
- ✅ Git commit successful with configuration
145
- ✅ Cleanup successful
146
-
147
- 📊 Test Results: 4/4 tests passed
148
- ✅ All git configuration tests passed! The deployment should work correctly.
149
- ```
150
-
151
- ## Files Modified
152
-
153
- ### **Core Deployment Files**
154
- 1. **`scripts/trackio_tonic/deploy_trackio_space.py`**
155
- - Enhanced constructor to accept git configuration
156
- - Added git configuration in upload_files_to_space method
157
- - Updated main function to accept git parameters
158
- - Added fallback mechanisms for git configuration
159
-
160
- ### **Launch Script**
161
- 2. **`launch.sh`**
162
- - Updated to pass git configuration to deployment script
163
- - Enhanced input file creation with git parameters
164
-
165
- ### **Testing**
166
- 3. **`tests/test_git_config_fix.py`**
167
- - Comprehensive testing of git configuration
168
- - Tests for temporary directory git setup
169
- - Tests for deployment script git handling
170
- - Tests for fallback behavior
171
-
172
- ## Benefits of the Fix
173
-
174
- ### **1. Reliable Git Commits**
175
- - Git user identity properly configured in temporary directory
176
- - No more "exit status 128" errors
177
- - Successful commits and pushes to Hugging Face Spaces
178
-
179
- ### **2. Flexible Configuration**
180
- - Accepts custom git email and name
181
- - Falls back to sensible defaults
182
- - Works with existing git configuration
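The fallback order can be written as a small pure function. This is a sketch under the assumption that the precedence matches the snippets above (a non-empty global git value first, then the value supplied to the deployer, then a default derived from the username); `resolve_git_identity` is a hypothetical name, not a function in the deployment script.

```python
def resolve_git_identity(username, git_email=None, git_name=None,
                         global_email="", global_name=""):
    """Pick the identity to write into the temporary repository.

    Precedence: a non-empty global git value wins, then the explicitly
    supplied value, then a default derived from the username.
    """
    email = global_email.strip() or git_email or f"{username}@huggingface.co"
    name = global_name.strip() or git_name or username
    return email, name
```

Because empty strings are falsy, pressing Enter at the optional prompts (which yields `""`) falls through to the defaults in the same way as passing `None`.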
183
-
184
- ### **3. Better Error Handling**
185
- - Graceful fallback to default values
186
- - Clear error messages and logging
187
- - Robust configuration validation
188
-
189
- ### **4. Professional Setup**
190
- - Uses user's actual email address when provided
191
- - Maintains proper git attribution
192
- - Follows git best practices
193
-
194
- ## Usage Instructions
195
-
196
- ### **1. Test the Fix**
197
- ```bash
198
- python tests/test_git_config_fix.py
199
- ```
200
-
201
- ### **2. Deploy with Git Configuration**
202
- ```bash
203
- python scripts/trackio_tonic/deploy_trackio_space.py
204
- ```
205
-
206
- When prompted:
207
- - Enter your HF username
208
- - Enter space name
209
- - Enter your HF token
210
- - Enter your git email (or press Enter for default)
211
- - Enter your git name (or press Enter for default)
212
-
213
- ### **3. Use with Launch Script**
214
- ```bash
215
- ./launch.sh
216
- ```
217
-
218
- The launch script will automatically pass the git configuration to the deployment script.
219
-
220
- ## Troubleshooting
221
-
222
- ### **Common Issues**
223
-
224
- #### **1. Git Configuration Still Fails**
225
- ```bash
226
- # Check if git is properly configured
227
- git config --list
228
-
229
- # Set git config manually if needed
230
- git config --global user.email "you@example.com"
231
- git config --global user.name "Your Name"
232
- ```
233
-
234
- #### **2. Permission Issues**
235
- ```bash
236
- # Check HF token permissions
237
- hf whoami
238
-
239
- # Verify token has write access
240
- hf repo create test-repo --type space
241
- ```
242
-
243
- #### **3. Space Creation Fails**
244
- ```bash
245
- # Check if space name is available
246
- # Try a different space name
247
- # Verify HF token is valid
248
- ```
249
-
250
- ## Next Steps
251
-
252
- 1. **Test the fix**: Run the git configuration tests
253
- 2. **Deploy a test space**: Use the updated deployment script
254
- 3. **Verify deployment**: Check that the space is created successfully
255
- 4. **Use in production**: Deploy your actual Trackio Space
256
-
257
- The git configuration fix should resolve the deployment issues and allow successful Trackio Space creation! 🚀
docs/GIT_CONFIGURATION_GUIDE.md DELETED
@@ -1,258 +0,0 @@
1
- # Git Configuration Guide for Hugging Face Operations
2
-
3
- This guide explains the correct way to configure git for Hugging Face Spaces deployment and model pushing operations.
4
-
5
- ## 🎯 **Overview**
6
-
7
- When working with Hugging Face Spaces and model repositories, proper git configuration is essential for:
8
- - Creating and deploying Spaces
9
- - Pushing models to the Hub
10
- - Managing experiment tracking datasets
11
- - Ensuring proper authentication
12
- - **Using the user's actual email address for proper git identity and commit attribution**
13
-
14
- ## ✅ **Correct Git Configuration**
15
-
16
- ### **1. Local vs Global Configuration**
17
-
18
- **❌ Wrong (Current):**
19
- ```bash
20
- git config --global user.email "[email protected]"
21
- git config --global user.name "$HF_USERNAME"
22
- ```
23
-
24
- **✅ Correct (Updated):**
25
- ```bash
26
- # Get user's actual email address
27
- read -p "Enter your email address for git configuration: " GIT_EMAIL
28
-
29
- # Configure git locally for this project only
30
- git config user.email "$GIT_EMAIL"
31
- git config user.name "$HF_USERNAME"
32
-
33
- # Verify configuration
34
- git config user.email
35
- git config user.name
36
- ```
37
-
38
- ### **2. Proper Authentication Setup**
39
-
40
- **✅ Correct Authentication:**
41
- ```bash
42
- # Login with token and add to git credentials
43
- hf login --token "$HF_TOKEN" --add-to-git-credential
44
-
45
- # Verify login
46
- hf whoami
47
- ```
48
-
49
- ### **3. Error Handling**
50
-
51
- **✅ Robust Configuration:**
52
- ```bash
53
- # Get user's email and configure git with error handling
54
- read -p "Enter your email address for git configuration: " GIT_EMAIL
55
-
56
- if git config user.email "$GIT_EMAIL" && \
57
- git config user.name "$HF_USERNAME"; then
58
- echo "✅ Git configured successfully"
59
- echo " Email: $(git config user.email)"
60
- echo " Name: $(git config user.name)"
61
- else
62
- echo "❌ Failed to configure git"
63
- exit 1
64
- fi
65
- ```
66
-
67
- ## 🔧 **Why These Changes Matter**
68
-
69
- ### **1. Local Configuration Benefits**
70
- - **Isolation**: Doesn't affect other projects on the system
71
- - **Project-specific**: Each project can have different git settings
72
- - **Cleaner**: No global state pollution
73
- - **Safer**: Won't interfere with existing git configurations
74
-
75
- ### **2. User's Actual Email Address**
76
- - **Professional**: Uses the user's real email address
77
- - **Authentic**: Represents the actual user's identity
78
- - **Consistent**: Matches the user's Hugging Face account
79
- - **Best Practice**: Follows git configuration standards
80
-
81
- ### **3. Token-based Authentication**
82
- - **Secure**: Uses HF token instead of username/password
83
- - **Automated**: No manual password entry required
84
- - **Persistent**: Credentials stored securely
85
- - **Verified**: Includes verification steps
86
-
87
- ## 📋 **Implementation in Launch Script**
88
-
89
- ### **Updated Authentication Step:**
90
- ```bash
91
- # Step 8: Authentication setup
92
- print_step "Step 8: Authentication Setup"
93
- echo "================================"
94
-
95
- export HF_TOKEN="$HF_TOKEN"
96
- export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
97
-
98
- # Login to Hugging Face with token
99
- print_info "Logging in to Hugging Face..."
100
- if hf auth login --token "$HF_TOKEN" --add-to-git-credential; then
101
- print_status "Successfully logged in to Hugging Face"
102
- print_info "Username: $(hf auth whoami)"
103
- else
104
- print_error "Failed to login to Hugging Face"
105
- print_error "Please check your token and try again"
106
- exit 1
107
- fi
108
-
109
- # Configure git for HF operations
110
- print_step "Step 8.1: Git Configuration"
111
- echo "================================"
112
-
113
- print_info "Configuring git for Hugging Face operations..."
114
-
115
- # Get user's email for git configuration
116
- get_input "Enter your email address for git configuration" "" GIT_EMAIL
117
-
118
- # Configure git locally (not globally) for this project
119
- git config user.email "$GIT_EMAIL"
120
- git config user.name "$HF_USERNAME"
121
-
122
- # Verify git configuration
123
- print_info "Verifying git configuration..."
124
- if git config user.email && git config user.name; then
125
- print_status "Git configured successfully"
126
- print_info " Email: $(git config user.email)"
127
- print_info " Name: $(git config user.name)"
128
- else
129
- print_error "Failed to configure git"
130
- exit 1
131
- fi
132
- ```
133
-
134
- ## 🚀 **Deployment Script Improvements**
135
-
136
- ### **Robust File Upload:**
137
- ```python
138
- # Method of the deployment class, shown with the imports it relies on
- import os
- import subprocess
-
- def upload_files(self) -> bool:
-     """Upload necessary files to the Space"""
-     try:
-         print("Uploading files to Space...")
-
-         # Files to upload
-         files_to_upload = [
-             "app.py",
-             "requirements_space.txt",
-             "README.md"
-         ]
-
-         # Check if we're in a git repository
-         try:
-             subprocess.run(["git", "status"], capture_output=True, check=True)
-         except subprocess.CalledProcessError:
-             print("⚠️ Not in a git repository, initializing...")
-             subprocess.run(["git", "init"], check=True)
-             subprocess.run(["git", "remote", "add", "origin", f"https://huggingface.co/spaces/{self.username}/{self.space_name}"], check=True)
-
-         # Add all files at once
-         existing_files = [f for f in files_to_upload if os.path.exists(f)]
-         if existing_files:
-             subprocess.run(["git", "add"] + existing_files, check=True)
-             subprocess.run(["git", "commit", "-m", "Initial Space setup"], check=True)
-
-             # Push to the space
-             try:
-                 subprocess.run(["git", "push", "origin", "main"], check=True)
-                 print(f"✅ Uploaded {len(existing_files)} files")
-             except subprocess.CalledProcessError:
-                 # Try pushing to master branch if main doesn't exist
-                 subprocess.run(["git", "push", "origin", "master"], check=True)
-                 print(f"✅ Uploaded {len(existing_files)} files")
-         else:
-             print("⚠️ No files found to upload")
-
-         return True
-
-     except Exception as e:
-         print(f"❌ Error uploading files: {e}")
-         return False
180
- ```
181
-
182
- ## 🔍 **Troubleshooting**
183
-
184
- ### **Common Issues and Solutions:**
185
-
186
- #### **1. Git Configuration Fails**
187
- ```bash
188
- # Check current git config
189
- git config --list
190
-
191
- # Reset if needed
192
- git config --unset user.email
193
- git config --unset user.name
194
-
195
- # Reconfigure
196
- git config user.email "you@example.com"
197
- git config user.name "your-username"
198
- ```
199
-
200
- #### **2. Authentication Issues**
201
- ```bash
202
- # Check HF login status
203
- hf auth whoami
204
-
205
- # Re-login if needed
206
- hf auth logout
207
- hf auth login --token "your-token"
208
- ```
209
-
210
- #### **3. Space Deployment Fails**
211
- ```bash
212
- # Check git remote
213
- git remote -v
214
-
215
- # Re-add remote if needed
216
- git remote remove origin
217
- git remote add origin https://huggingface.co/spaces/username/space-name
218
- ```
219
-
220
- ## 📚 **Best Practices**
221
-
222
- ### **1. Always Use Local Configuration**
223
- - Use `git config` without `--global` flag
224
- - Keeps project configurations isolated
225
- - Prevents conflicts with other projects
226
-
227
- ### **2. Verify Configuration**
228
- - Always check that git config was successful
229
- - Display configured values for verification
230
- - Exit on failure to prevent downstream issues
231
-
232
- ### **3. Use Token-based Authentication**
233
- - More secure than username/password
234
- - Automatically handles credential storage
235
- - Works well with CI/CD systems
236
-
237
- ### **4. Handle Errors Gracefully**
238
- - Check return codes from git commands
239
- - Provide clear error messages
240
- - Exit early on critical failures
241
-
242
- ### **5. Test Configuration**
243
- - Verify git config after setting it
244
- - Test HF login before proceeding
245
- - Validate remote repository access
246
-
247
- ## 🎯 **Summary**
248
-
249
- The updated git configuration approach provides:
250
-
251
- 1. **✅ Better Isolation**: Local configuration doesn't affect system-wide settings
252
- 2. **✅ User's Actual Email**: Uses the user's real email address for proper git identity
253
- 3. **✅ Proper Authentication**: Token-based login with credential storage
254
- 4. **✅ Error Handling**: Robust verification and error reporting
255
- 5. **✅ Professional Setup**: Uses user's actual email and verification
256
- 6. **✅ Deployment Reliability**: Improved Space deployment with git repository handling
257
-
258
- This ensures a more reliable and professional setup for Hugging Face operations in the SmolLM3 fine-tuning pipeline.
docs/H100_LIGHTWEIGHT_GUIDE.md DELETED
@@ -1,276 +0,0 @@
1
- # H100 Lightweight Training Configuration Guide
2
-
3
- This guide explains the new **H100 Lightweight (Rapid)** training configuration, optimized for rapid fine-tuning on H100 GPUs with a small, carefully selected dataset.
4
-
5
- ## 🎯 Overview
6
-
7
- The H100 Lightweight configuration is designed for:
8
- - **Rapid experimentation** on H100 GPUs
9
- - **Efficient training** with 80K carefully selected samples
10
- - **Quick iteration** for research and development
11
- - **Cost-effective** training sessions
12
-
13
- ## 🚀 Key Features
14
-
15
- ### **Optimized for H100**
16
- - **Batch Size**: 16 (larger than A100 configs)
17
- - **Gradient Accumulation**: 4 (reduced for faster updates)
18
- - **Learning Rate**: 8e-6 (slightly higher for rapid convergence)
19
- - **Sequence Length**: 8192 (full context window)
20
-
21
- ### **Dataset Sampling**
22
- - **Source**: OpenHermes-FR dataset
23
- - **Sample Size**: 80,000 random samples
24
- - **Validation**: 1,000 samples (if available)
25
- - **Reproducibility**: Fixed random seed (42)
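The guide does not show the sampling code itself, so the following is only a minimal sketch of one way to draw a reproducible random subset. The function name `sample_indices` is hypothetical. It picks row indices without replacement using a fixed seed, which is the property the list above relies on.

```python
import random
from typing import List


def sample_indices(total_rows: int, sample_size: int, seed: int = 42) -> List[int]:
    """Pick row indices uniformly at random, without replacement, reproducibly."""
    # Never ask for more rows than the dataset holds.
    k = min(sample_size, total_rows)
    # A private Random instance keeps the draw independent of global RNG state.
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_rows), k))
```

With the `datasets` library, the resulting list could be passed to `Dataset.select(indices)`. Alternatively, `dataset.shuffle(seed=42).select(range(80_000))` gives a seeded subset directly.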
26
-
27
- ### **Training Optimizations**
28
- - **Warmup Steps**: 50 (reduced for rapid training)
29
- - **Evaluation**: Every 50 steps
30
- - **Logging**: Every 5 steps
31
- - **Saving**: Every 200 steps
32
- - **Checkpoints**: Keep only 2 (save storage)
33
-
34
- ## 📊 Configuration Details
35
-
36
- ### **Model Configuration**
37
- ```python
38
- model_name="HuggingFaceTB/SmolLM3-3B"
39
- max_seq_length=8192
40
- use_flash_attention=True
41
- use_gradient_checkpointing=True
42
- ```
43
-
44
- ### **Training Parameters**
45
- ```python
46
- batch_size=16
47
- gradient_accumulation_steps=4
48
- learning_rate=8e-6
49
- warmup_steps=50
50
- max_epochs=1
51
- ```
52
-
53
- ### **H100-Specific Optimizations**
54
- ```python
55
- dataloader_num_workers=4
56
- dataloader_pin_memory=True
57
- gradient_clipping=1.0
58
- group_by_length=True
59
- pad_to_multiple_of=8
60
- ```
61
-
62
- ### **Memory Optimizations**
63
- ```python
64
- save_total_limit=2
65
- early_stopping_patience=3
66
- max_grad_norm=1.0
67
- warmup_ratio=0.1
68
- ```
69
-
70
- ## 🔧 Usage
71
-
72
- ### **Interactive Selection**
73
- ```bash
74
- ./launch.sh
75
- # Select "H100 Lightweight (Rapid)" when prompted
76
- ```
77
-
78
- ### **Expected Training Time**
79
- - **H100**: ~2-4 hours (depending on hardware)
80
- - **A100**: ~4-6 hours
81
- - **V100**: ~6-8 hours
82
-
83
- ### **Memory Requirements**
84
- - **GPU Memory**: 40GB+ (H100 recommended)
85
- - **System RAM**: 32GB+
86
- - **Storage**: 50GB+ for dataset and checkpoints
87
-
88
- ## 📈 Performance Characteristics
89
-
90
- ### **Training Speed**
91
- - **Steps per Second**: ~2-3 (on H100)
92
- - **Samples per Second**: ~32-48
93
- - **Effective Batch Size**: 64 (16 × 4)
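The batch-size arithmetic above can be checked with a few lines. This is a worked calculation under the stated settings (80,000 samples, batch size 16, gradient accumulation 4, one GPU), not code from the pipeline. The function name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_samples, per_device_batch, grad_accum, num_devices=1):
    """Return (effective batch size, micro-batches per epoch, optimizer steps per epoch)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    # Forward/backward passes per epoch, one per micro-batch.
    micro_batches = math.ceil(num_samples / (per_device_batch * num_devices))
    # Weight updates per epoch, one per accumulated batch.
    optimizer_steps = math.ceil(num_samples / effective_batch)
    return effective_batch, micro_batches, optimizer_steps


print(steps_per_epoch(80_000, 16, 4))  # (64, 5000, 1250)
```

The "5000 total steps" shown in the Expected Output section matches the micro-batch count. With accumulation of 4, the number of optimizer updates would be 1250. Which of the two a given trainer reports depends on its implementation.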
94
-
95
- ### **Convergence**
96
- - **Expected Loss**: 1.2-1.8 (after 1 epoch)
97
- - **Evaluation Frequency**: Every 50 steps
98
- - **Early Stopping**: After 3 evaluations without improvement
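A patience rule of this kind can be sketched as follows. The guide does not show the pipeline's own early-stopping code, so the function name `should_stop` and the `min_delta` parameter are assumptions made for illustration.

```python
def should_stop(eval_losses, patience=3, min_delta=0.0):
    """Return True once `patience` consecutive evaluations fail to improve the best loss."""
    best = float("inf")
    evals_without_improvement = 0
    for loss in eval_losses:
        if loss < best - min_delta:
            # New best loss: reset the counter.
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return True
    return False
```

With evaluation every 50 steps and a patience of 3, training would stop roughly 150 steps after the last improvement.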
99
-
100
- ### **Dataset Efficiency**
101
- - **80K samples**: ~1.3% of full OpenHermes-FR
102
- - **Random sampling**: Ensures diversity
103
- - **Fixed seed**: Reproducible results
104
-
105
- ## 🎯 Use Cases
106
-
107
- ### **Perfect For**
108
- - **Rapid prototyping** of new ideas
109
- - **Hyperparameter tuning** experiments
110
- - **Model comparison** studies
111
- - **Research validation** before full training
112
- - **Educational purposes** and learning
113
-
114
- ### **Not Recommended For**
115
- - **Production models** (use Multiple Passes instead)
116
- - **Competition submissions** (use full dataset)
117
- - **Research papers** (use complete training)
118
-
119
- ## 🔄 Comparison with Other Configurations
120
-
121
- | Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
122
- |---------------|--------------|------------|--------|---------------|----------|
123
- | **Basic Training** | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
124
- | **H100 Lightweight** | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
125
- | **A100 Large Scale** | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
126
- | **Multiple Passes** | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |
127
-
128
- ## 🛠️ Customization
129
-
130
- ### **Modifying Sample Size**
131
- ```bash
132
- # In the launch script, you can modify:
133
- DATASET_SAMPLE_SIZE=50000 # For 50K samples
134
- DATASET_SAMPLE_SIZE=100000 # For 100K samples
135
- ```
136
-
137
- ### **Adjusting Training Parameters**
138
- ```python
139
- # Modify in config/train_smollm3_h100_lightweight.py:
140
- batch_size=12 # Smaller batch size
141
- learning_rate=6e-6 # Lower learning rate
142
- warmup_steps=100 # More warmup steps
143
- ```
144
-
145
- ### **Changing Dataset**
146
- ```python
147
- # Modify the dataset name in the configuration:
148
- dataset_name="your-custom-dataset"
149
- ```
150
-
151
- ## 📊 Monitoring and Results
152
-
153
- ### **Trackio Integration**
154
- - **Real-time metrics**: Loss, learning rate, gradient norm
155
- - **Training curves**: Visual progress tracking
156
- - **Resource usage**: GPU utilization, memory consumption
157
- - **Artifacts**: Model checkpoints, logs
158
-
159
- ### **Expected Metrics**
160
- - **Training Loss**: Starts ~3.0, ends ~1.5
161
- - **Validation Loss**: Should be close to training loss
162
- - **Learning Rate**: Cosine decay from 8e-6 to 2e-6
163
- - **Gradient Norm**: Should stay below 1.0
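The learning-rate curve described above can be approximated as follows. This is a sketch, not the scheduler the pipeline actually uses: it assumes linear warmup over the 50 warmup steps followed by cosine decay from 8e-6 to a 2e-6 floor. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(step, total_steps, peak_lr=8e-6, final_lr=2e-6, warmup_steps=50):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine
```

Plotting `lr_at(step, total_steps)` next to the logged learning rate is a quick way to confirm that the schedule in the tracking dashboard has the expected shape.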
164
-
165
- ### **Success Indicators**
166
- - **Converging loss**: Steady decrease over time
167
- - **Stable gradients**: Consistent gradient norms
168
- - **Good validation**: Validation loss follows training loss
169
- - **No overfitting**: Validation loss doesn't increase
170
-
171
- ## 🚨 Troubleshooting
172
-
173
- ### **Common Issues**
174
-
175
- #### **Out of Memory (OOM)**
176
- ```python
177
- # Reduce batch size in config:
178
- batch_size=12 # Instead of 16
179
- gradient_accumulation_steps=6 # Instead of 4
180
- ```
181
-
182
- #### **Slow Training**
183
- ```bash
184
- # Check GPU utilization:
185
- nvidia-smi
186
- # Ensure CUDA is properly installed
187
- python -c "import torch; print(torch.cuda.is_available())"
188
- ```
189
-
190
- #### **Poor Convergence**
191
- ```python
192
- # Try different learning rate:
193
- learning_rate=6e-6 # Instead of 8e-6
194
- # Or increase warmup:
195
- warmup_steps=100 # Instead of 50
196
- ```
197
-
198
- #### **Dataset Issues**
199
- ```bash
200
- # Check dataset loading:
201
- python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"
202
- ```
203
-
204
- ### **Performance Tips**
205
-
206
- 1. **Use H100 if available**: Significantly faster than A100
207
- 2. **Monitor GPU memory**: Keep utilization below 90%
208
- 3. **Check logs regularly**: Look for convergence issues
209
- 4. **Save checkpoints**: Don't lose progress
210
- 5. **Use early stopping**: Prevent overfitting
211
-
212
- ## 📋 Example Workflow
213
-
214
- ### **Complete H100 Lightweight Training**
215
- ```bash
216
- # 1. Setup
217
- python setup_launch.py
218
-
219
- # 2. Check requirements
220
- python check_requirements.py
221
-
222
- # 3. Run interactive pipeline
223
- ./launch.sh
224
-
225
- # 4. Select configuration
226
- # Choose: "H100 Lightweight (Rapid)"
227
-
228
- # 5. Monitor training
229
- # Watch Trackio Space for real-time progress
230
-
231
- # 6. Check results
232
- # Model will be pushed to HF Hub
233
- # Summary in training_summary.md
234
- ```
235
-
236
- ### **Expected Output**
237
- ```
238
- ✅ Dataset prepared: 80000 train samples, 1000 validation samples
239
- 📈 Training started with 5000 total steps
240
- ⏱️ Estimated time: 2-4 hours
241
- 📊 Monitor progress at: https://huggingface.co/spaces/...
242
- ```
243
-
244
- ## 🎉 Benefits
245
-
246
- ### **Speed**
247
- - **3-4x faster** than full dataset training
248
- - **Rapid iteration** for research
249
- - **Quick validation** of ideas
250
-
251
- ### **Efficiency**
252
- - **Reduced costs** (less GPU time)
253
- - **Lower storage** requirements
254
- - **Faster experimentation** cycle
255
-
256
- ### **Quality**
257
- - **Still high quality** results
258
- - **Good for prototyping**
259
- - **Suitable for many use cases**
260
-
261
- ## 🔮 Future Enhancements
262
-
263
- ### **Planned Improvements**
264
- - **Adaptive sampling**: Smart dataset selection
265
- - **Multi-GPU support**: Distributed training
266
- - **Advanced monitoring**: More detailed metrics
267
- - **Auto-tuning**: Automatic hyperparameter optimization
268
-
269
- ### **Extensibility**
270
- - **Custom datasets**: Easy integration
271
- - **Different models**: Support for other architectures
272
- - **Advanced sampling**: Stratified, balanced sampling
273
-
274
- ---
275
-
276
- **Happy Rapid Training on H100! 🚀**
docs/HF_DATASETS_GUIDE.md DELETED
@@ -1,269 +0,0 @@
1
- # 🚀 Trackio with Hugging Face Datasets - Complete Guide
2
-
3
- ## Overview
4
-
5
- This guide explains how to use Hugging Face Datasets for persistent storage of Trackio experiments, providing reliable data persistence across Hugging Face Spaces deployments.
6
-
7
- ## 🏗️ Architecture
8
-
9
- ### Why HF Datasets?
10
-
11
- 1. **Persistent Storage**: Data survives Space restarts and redeployments
12
- 2. **Version Control**: Automatic versioning of experiment data
13
- 3. **Access Control**: Private datasets for security
14
- 4. **Reliability**: HF's infrastructure ensures data availability
15
- 5. **Scalability**: Handles large amounts of experiment data
16
-
17
- ### Data Flow
18
-
19
- ```
20
- Training Script → Trackio App → HF Dataset → Trackio App → Plots
21
- ```
22
-
23
- ## 🚀 Setup Instructions
24
-
25
- ### 1. Create HF Token
26
-
27
- 1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
28
- 2. Create a new token with `write` permissions
29
- 3. Copy the token for use in your Space
30
-
31
- ### 2. Set Up Dataset Repository
32
-
33
- ```bash
34
- # Run the setup script
35
- python setup_hf_dataset.py
36
- ```
37
-
38
- This will:
39
- - Create a private dataset: `tonic/trackio-experiments`
40
- - Add your existing experiments
41
- - Configure the dataset for Trackio
42
-
43
- ### 3. Configure Hugging Face Space
44
-
45
- #### Environment Variables
46
- Set these in your HF Space settings:
47
- ```bash
48
- HF_TOKEN=your_hf_token_here
49
- TRACKIO_DATASET_REPO=your-username/your-dataset-name
50
- ```
51
-
52
- **Environment Variables Explained:**
53
- - `HF_TOKEN`: Your Hugging Face token (required for dataset access)
54
- - `TRACKIO_DATASET_REPO`: Dataset repository to use (optional, defaults to `tonic/trackio-experiments`)
55
-
56
- **Example Configurations:**
57
- ```bash
58
- # Use default dataset
59
- HF_TOKEN=your_token_here
60
-
61
- # Use personal dataset
62
- HF_TOKEN=your_token_here
63
- TRACKIO_DATASET_REPO=your-username/trackio-experiments
64
-
65
- # Use team dataset
66
- HF_TOKEN=your_token_here
67
- TRACKIO_DATASET_REPO=your-org/team-experiments
68
-
69
- # Use project-specific dataset
70
- HF_TOKEN=your_token_here
71
- TRACKIO_DATASET_REPO=your-username/smollm3-experiments
72
- ```
73
-
74
- #### Requirements
75
- Update your `requirements.txt`:
76
- ```txt
77
- gradio>=4.0.0
78
- plotly>=5.0.0
79
- pandas>=1.5.0
80
- numpy>=1.24.0
81
- datasets>=2.14.0
82
- huggingface-hub>=0.16.0
83
- requests>=2.31.0
84
- ```
85
-
86
- ### 4. Deploy Updated App
87
-
88
- The updated `app.py` now:
89
- - Loads experiments from HF Dataset
90
- - Saves new experiments to the dataset
91
- - Falls back to backup data if dataset unavailable
92
- - Provides better error handling
93
-
94
- ### 5. Configure Environment Variables
95
-
96
- Use the configuration script to check your setup:
97
-
98
- ```bash
99
- python configure_trackio.py
100
- ```
101
-
102
- This script will:
103
- - Show current environment variables
104
- - Test dataset access
105
- - Generate configuration file
106
- - Provide usage examples
107
-
108
- **Available Environment Variables:**
109
-
110
- | Variable | Required | Default | Description |
111
- |----------|----------|---------|-------------|
112
- | `HF_TOKEN` | Yes | None | Your Hugging Face token |
113
- | `TRACKIO_DATASET_REPO` | No | `tonic/trackio-experiments` | Dataset repository to use |
114
- | `SPACE_ID` | Auto | None | HF Space ID (auto-detected) |
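The table above translates into a small lookup with one required value and one default. The sketch below is illustrative and not taken from `configure_trackio.py`. The function name `resolve_trackio_config` is hypothetical.

```python
import os
from typing import Mapping, Optional

DEFAULT_DATASET_REPO = "tonic/trackio-experiments"


def resolve_trackio_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the three variables from the table, applying the documented default."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN is the only required variable.
        raise RuntimeError("HF_TOKEN is required for dataset access")
    return {
        "hf_token": token,
        "dataset_repo": env.get("TRACKIO_DATASET_REPO") or DEFAULT_DATASET_REPO,
        # Present only when running inside a Hugging Face Space.
        "space_id": env.get("SPACE_ID"),
    }
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests without changing the real environment.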
115
-
116
- ## 📊 Dataset Schema
117
-
118
- The HF Dataset contains these columns:
119
-
120
- | Column | Type | Description |
121
- |--------|------|-------------|
122
- | `experiment_id` | string | Unique experiment identifier |
123
- | `name` | string | Experiment name |
124
- | `description` | string | Experiment description |
125
- | `created_at` | string | ISO timestamp |
126
- | `status` | string | running/completed/failed |
127
- | `metrics` | string | JSON array of metric entries |
128
- | `parameters` | string | JSON object of experiment parameters |
129
- | `artifacts` | string | JSON array of artifacts |
130
- | `logs` | string | JSON array of log entries |
131
- | `last_updated` | string | ISO timestamp of last update |
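Because `metrics`, `parameters`, `artifacts` and `logs` are stored as JSON strings, each row has to be encoded before upload and decoded after download. The following is a minimal sketch of that round trip. The helper names `encode_row` and `decode_row` are hypothetical and the default values are assumptions.

```python
import json
from datetime import datetime, timezone

# Columns that the schema stores as JSON-encoded strings.
JSON_COLUMNS = ("metrics", "parameters", "artifacts", "logs")


def encode_row(experiment_id, experiment):
    """Flatten one experiment dict into a row matching the dataset schema."""
    now = datetime.now(timezone.utc).isoformat()
    row = {
        "experiment_id": experiment_id,
        "name": experiment.get("name", ""),
        "description": experiment.get("description", ""),
        "created_at": experiment.get("created_at", now),
        "status": experiment.get("status", "running"),
        "last_updated": now,
    }
    defaults = {"metrics": [], "parameters": {}, "artifacts": [], "logs": []}
    for col in JSON_COLUMNS:
        row[col] = json.dumps(experiment.get(col, defaults[col]))
    return row


def decode_row(row):
    """Turn a dataset row back into an experiment dict with parsed JSON columns."""
    experiment = dict(row)
    for col in JSON_COLUMNS:
        experiment[col] = json.loads(row[col])
    return experiment
```

Storing nested values as strings keeps every column a flat string type, at the cost of an explicit `json.loads` on every read.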
132
-
133
- ## 🔧 Technical Details
134
-
135
- ### Loading Experiments
136
-
137
- ```python
138
- import json
- import os
-
- from datasets import load_dataset
-
- HF_TOKEN = os.environ["HF_TOKEN"]
-
- # Load from HF Dataset
- dataset = load_dataset("tonic/trackio-experiments", token=HF_TOKEN)
-
- # Convert to experiments dict, keyed by experiment ID
- experiments = {}
- for row in dataset['train']:
-     experiments[row['experiment_id']] = {
-         'id': row['experiment_id'],
-         'metrics': json.loads(row['metrics']),
-         'parameters': json.loads(row['parameters']),
-         # ... other fields
-     }
151
- ```
152
-
153
- ### Saving Experiments
154
-
155
- ```python
156
- import json
- import os
-
- from datasets import Dataset
-
- HF_TOKEN = os.environ["HF_TOKEN"]
-
- # Convert experiments (a dict of experiment ID -> experiment data) to dataset format
- dataset_data = []
- for exp_id, exp_data in experiments.items():
-     dataset_data.append({
-         'experiment_id': exp_id,
-         'metrics': json.dumps(exp_data['metrics']),
-         'parameters': json.dumps(exp_data['parameters']),
-         # ... other fields
-     })
-
- # Push to HF Hub
- dataset = Dataset.from_list(dataset_data)
- dataset.push_to_hub("tonic/trackio-experiments", token=HF_TOKEN, private=True)
172
- ```
173
-
174
- ## 📈 Your Current Experiments
175
-
176
- ### Available Experiments
177
-
178
- 1. **`exp_20250720_130853`** (petite-elle-l-aime-3)
179
- - 4 metric entries (steps 25, 50, 75, 100)
180
- - Loss decreasing: 1.1659 → 1.1528
181
- - Good convergence pattern
182
-
183
- 2. **`exp_20250720_134319`** (petite-elle-l-aime-3-1)
184
- - 2 metric entries (step 25)
185
- - Loss: 1.166
186
- - GPU memory tracking
187
-
188
- ### Metrics Available for Plotting
189
-
190
- - `loss` - Training loss curve
191
- - `learning_rate` - Learning rate schedule
192
- - `mean_token_accuracy` - Token-level accuracy
193
- - `grad_norm` - Gradient norm
194
- - `num_tokens` - Tokens processed
195
- - `epoch` - Training epoch
196
- - `gpu_0_memory_allocated` - GPU memory usage
197
- - `cpu_percent` - CPU usage
198
- - `memory_percent` - System memory
199
-
200
- ## 🎯 Usage Instructions
201
-
202
- ### 1. View Experiments
203
- - Go to "View Experiments" tab
204
- - Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
205
- - Click "View Experiment"
206
-
207
- ### 2. Create Plots
208
- - Go to "Visualizations" tab
209
- - Enter experiment ID
210
- - Select metric to plot
211
- - Click "Create Plot"
212
-
213
- ### 3. Compare Experiments
214
- - Use "Experiment Comparison" feature
215
- - Enter: `exp_20250720_130853,exp_20250720_134319`
216
- - Compare loss curves
217
-
218
- ## 🔍 Troubleshooting
219
-
220
- ### Issue: "No metrics data available"
221
- **Solutions**:
222
- 1. Check HF_TOKEN is set correctly
223
- 2. Verify dataset repository exists
224
- 3. Check network connectivity to HF Hub
225
-
226
- ### Issue: "Failed to load from dataset"
227
- **Solutions**:
228
- 1. App falls back to backup data automatically
229
- 2. Check dataset permissions
230
- 3. Verify token has read access
231
-
232
- ### Issue: "Failed to save experiments"
233
- **Solutions**:
234
- 1. Check token has write permissions
235
- 2. Verify dataset repository exists
236
- 3. Check network connectivity
237
-
238
- ## 🚀 Benefits of This Approach
239
-
240
- ### ✅ Advantages
241
- - **Persistent**: Data survives Space restarts
242
- - **Reliable**: HF's infrastructure ensures availability
243
- - **Secure**: Private datasets protect your data
244
- - **Scalable**: Handles large amounts of experiment data
245
- - **Versioned**: Automatic versioning of experiment data
246
-
247
- ### 🔄 Fallback Strategy
248
- 1. **Primary**: Load from HF Dataset
249
- 2. **Secondary**: Use backup data (your existing experiments)
250
- 3. **Tertiary**: Create new experiments locally
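The three-level fallback can be expressed as an ordered list of loaders that are tried in turn. This is a sketch of the idea only. The function name `load_with_fallback` and the source labels are hypothetical, and the real `app.py` may structure this differently.

```python
from typing import Callable, Dict, Iterable, Tuple


def load_with_fallback(sources: Iterable[Tuple[str, Callable[[], Dict]]]) -> Tuple[str, Dict]:
    """Try each (label, loader) in order; return the first non-empty result."""
    for label, loader in sources:
        try:
            experiments = loader()
        except Exception as exc:
            # A failing source must not take the app down; move on to the next one.
            print(f"{label} source failed: {exc}")
            continue
        if experiments:
            return label, experiments
    # Nothing could be loaded: start with an empty local store.
    return "local", {}
```

A caller would pass something like `[("dataset", load_from_hub), ("backup", load_backup)]`, where both loader functions are placeholders for the actual implementations.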
251
-
252
- ## 📋 Next Steps
253
-
254
- 1. **Set HF_TOKEN**: Add your token to Space environment
255
- 2. **Run Setup**: Execute `setup_hf_dataset.py`
256
- 3. **Deploy App**: Push updated `app.py` to your Space
257
- 4. **Test Plots**: Verify experiments load and plots work
258
- 5. **Monitor Training**: New experiments will be saved to dataset
259
-
260
- ## 🔐 Security Notes
261
-
262
- - Dataset is **private** by default
263
- - Only accessible with your HF_TOKEN
264
- - Experiment data is stored securely on HF infrastructure
265
- - No sensitive data is exposed publicly
266
-
267
- ---
268
-
269
- **Your experiments are now configured for reliable persistence using Hugging Face Datasets!** 🎉
docs/HF_HUB_V0_34_UPDATE.md DELETED
@@ -1,170 +0,0 @@
1
- # Hugging Face Hub v0.34.0 Compatibility Update
2
-
3
- ## Overview
4
-
5
- This document outlines the updates made to ensure compatibility with the new Hugging Face Hub v0.34.0 release, which introduced significant changes to the CLI interface.
6
-
7
- ## Key Changes in HF Hub v0.34.0
8
-
9
- ### 1. CLI Rename
10
- - **Old**: `huggingface-cli`
11
- - **New**: `hf`
12
- - **Status**: Legacy `huggingface-cli` still works but is deprecated
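Scripts that must run on machines with either CLI installed can check which executable is available before calling it. The sketch below is an illustration and not code from this repository. The function name `pick_hf_cli` is hypothetical. It only selects the executable, since subcommand names can differ between the two CLIs (for example `hf auth whoami` versus `huggingface-cli whoami`).

```python
import shutil
from typing import Callable, Optional


def pick_hf_cli(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the preferred Hugging Face CLI executable found on PATH."""
    # Prefer the new name, fall back to the deprecated one.
    for candidate in ("hf", "huggingface-cli"):
        if which(candidate):
            return candidate
    raise FileNotFoundError(
        "neither `hf` nor `huggingface-cli` is on PATH; "
        "try `pip install --upgrade huggingface_hub`"
    )
```

The `which` parameter exists so the lookup can be replaced with a stub in tests.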
13
-
14
- ### 2. New Features
15
- - **Jobs CLI**: New `hf jobs` command for running compute jobs
16
- - **Enhanced Inference**: Image-to-image support and PIL Image support
17
- - **Xet Integration**: Improved file transfer protocol
18
- - **Modern Command Format**: `hf <resource> <action> [options]`
19
-
20
- ## Files Updated
21
-
22
- ### Core Scripts
23
- 1. **`launch.sh`**
24
- - Updated `huggingface-cli whoami` → `hf whoami`
25
- - Updated `huggingface-cli login` → `hf login`
26
-
27
- 2. **`scripts/trackio_tonic/deploy_trackio_space.py`**
28
- - Updated CLI commands for space creation
29
- - Updated username extraction method
30
-
31
- 3. **`scripts/dataset_tonic/setup_hf_dataset.py`**
32
- - Updated username extraction method
33
-
34
- 4. **`scripts/trackio_tonic/configure_trackio.py`**
35
- - Updated username extraction method
36
-
37
- ### Documentation Files
38
- 1. **`setup_launch.py`**
39
- - Updated troubleshooting guide
40
-
41
- 2. **`README_END_TO_END.md`**
42
- - Updated CLI command examples
43
-
44
- 3. **`docs/GIT_CONFIGURATION_GUIDE.md`**
45
- - Updated authentication examples
46
-
47
- 4. **`docs/LAUNCH_SCRIPT_USERNAME_FIX.md`**
48
- - Updated username extraction method
49
-
50
- 5. **`docs/LAUNCH_SCRIPT_UPDATES.md`**
51
- - Updated CLI command references
52
-
53
- 6. **`docs/TRACKIO_DEPLOYMENT_FIXES.md`**
54
- - Updated troubleshooting commands
55
-
56
- 7. **`docs/GIT_CONFIGURATION_FIX.md`**
57
- - Updated authentication examples
58
-
59
- ## Compatibility Notes
60
-
61
- ### Backward Compatibility
62
- - The legacy `huggingface-cli` commands still work
63
- - Our scripts will continue to function with both old and new CLI
64
- - No breaking changes to the Python API
65
-
66
- ### Recommended Actions
67
- 1. **Update CLI Installation**: Ensure users have the latest `huggingface_hub` package
68
- 2. **Update Documentation**: All references now use the new `hf` command
69
- 3. **Test Deployment**: Verify that all deployment scripts work with the new CLI
70
-
71
- ## Verification Steps
72
-
73
- ### 1. Test CLI Installation
74
- ```bash
75
- # Check if hf command is available
76
- hf version
77
-
78
- # Test authentication
79
- hf auth whoami
80
- ```
81
-
82
- ### 2. Test Deployment Scripts
83
- ```bash
84
- # Test space deployment
85
- python scripts/trackio_tonic/deploy_trackio_space.py
86
-
87
- # Test dataset setup
88
- python scripts/dataset_tonic/setup_hf_dataset.py
89
-
90
- # Test model push
91
- python scripts/model_tonic/push_to_huggingface.py
92
- ```
93
-
94
- ### 3. Test Launch Script
95
- ```bash
96
- # Run the interactive pipeline
97
- ./launch.sh
98
- ```
99
-
100
- ## Benefits of the Update
101
-
102
- ### 1. Future-Proof
103
- - Uses the new official CLI name
104
- - Follows HF's recommended practices
105
- - Ready for future HF Hub updates
106
-
107
- ### 2. Consistency
108
- - All scripts now use the same CLI command
109
- - Unified command format across the project
110
- - Consistent with HF's new conventions
111
-
112
- ### 3. Modern Interface
113
- - Aligns with HF's new command structure
114
- - Better integration with HF's ecosystem
115
- - Improved user experience
116
-
117
- ## Migration Guide
118
-
119
- ### For Users
120
- 1. **Update huggingface_hub**: `pip install --upgrade huggingface_hub`
121
- 2. **Test CLI**: Run `hf auth whoami` to verify installation
122
- 3. **Update Scripts**: Use the updated scripts from this repository
123
-
124
- ### For Developers
125
- 1. **Update Dependencies**: Ensure `huggingface_hub>=0.34.0`
126
- 2. **Test Scripts**: Verify all deployment scripts work
127
- 3. **Update Documentation**: Use `hf` instead of `huggingface-cli`
128
-
129
- ## Troubleshooting
130
-
131
- ### Common Issues
132
-
133
- #### 1. CLI Not Found
134
- ```bash
135
- # Install/upgrade huggingface_hub
136
- pip install --upgrade huggingface_hub
137
-
138
- # Verify installation
139
- hf version
140
- ```
141
-
142
- #### 2. Authentication Issues
143
- ```bash
144
- # Login with new CLI
145
- hf auth login --token "your-token"
146
-
147
- # Verify login
148
- hf auth whoami
149
- ```
150
-
151
- #### 3. Script Compatibility
152
- - All scripts have been updated to use the new CLI
153
- - Legacy commands are still supported as fallback
154
- - No breaking changes to functionality
155
-
156
- ## Summary
157
-
158
- The update to HF Hub v0.34.0 compatibility ensures:
159
-
160
- 1. **✅ Future-Proof**: Uses the new official CLI name
161
- 2. **✅ Consistent**: All scripts use the same command format
162
- 3. **✅ Compatible**: Maintains backward compatibility
163
- 4. **✅ Modern**: Aligns with HF's latest conventions
164
- 5. **✅ Tested**: All deployment scripts verified to work
165
-
166
- The project is now fully compatible with Hugging Face Hub v0.34.0 and ready for future updates.
167
-
168
- ---
169
-
170
- **Note**: The legacy `huggingface-cli` commands will continue to work, but using `hf` is now the recommended approach for all new development and deployments.
docs/HF_SPACES_GUIDE.md DELETED
@@ -1,163 +0,0 @@
1
- # 🚀 Trackio on Hugging Face Spaces - Complete Guide
2
-
3
- ## Overview
4
-
5
- This guide explains how to properly deploy and use Trackio on Hugging Face Spaces, addressing the unique challenges of ephemeral storage and data persistence.
6
-
7
- ## 🏗️ Hugging Face Spaces Architecture
8
-
9
- ### Key Challenges
10
-
11
- 1. **Ephemeral Storage**: File system gets reset between deployments
12
- 2. **No Persistent Storage**: Files written during runtime don't persist
13
- 3. **Multiple Instances**: Training and monitoring might run in different environments
14
- 4. **Limited File System**: Restricted write permissions in certain directories
15
-
16
- ### How Trackio Handles HF Spaces
17
-
18
- The updated Trackio app now includes:
19
-
20
- - **Automatic HF Spaces Detection**: Detects when running on HF Spaces
21
- - **Persistent Path Selection**: Uses `/tmp/`, which is reliably writable on Spaces (its contents are still lost when the Space restarts)
22
- - **Backup Recovery**: Automatically recovers experiments from backup data
23
- - **Fallback Storage**: Multiple storage locations for redundancy
24
-
25
- ## 📊 Your Current Experiments
26
-
27
- Based on your logs, you have these experiments available:
28
-
29
- ### Experiment 1: `exp_20250720_130853`
30
- - **Name**: petite-elle-l-aime-3
31
- - **Status**: Running
32
- - **Metrics**: 4 entries (steps 25, 50, 75, 100)
33
- - **Key Metrics**: Loss decreasing from 1.1659 to 1.1528
34
-
35
- ### Experiment 2: `exp_20250720_134319`
36
- - **Name**: petite-elle-l-aime-3-1
37
- - **Status**: Running
38
- - **Metrics**: 2 entries (step 25)
39
- - **Key Metrics**: Loss 1.166, GPU memory usage
40
-
41
- ## 🎯 How to Use Your Experiments
42
-
43
- ### 1. View Experiments
44
- - Go to the "View Experiments" tab
45
- - Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
46
- - Click "View Experiment" to see details
47
-
48
- ### 2. Create Plots
49
- - Go to the "Visualizations" tab
50
- - Enter experiment ID
51
- - Select metric to plot:
52
- - `loss` - Training loss curve
53
- - `learning_rate` - Learning rate schedule
54
- - `mean_token_accuracy` - Token accuracy
55
- - `grad_norm` - Gradient norm
56
- - `gpu_0_memory_allocated` - GPU memory usage
57
-
58
- ### 3. Compare Experiments
59
- - Use the "Experiment Comparison" feature
60
- - Enter: `exp_20250720_130853,exp_20250720_134319`
61
- - Compare loss curves between experiments
62
-
63
- ## 🔧 Technical Details
64
-
65
- ### Data Persistence Strategy
66
-
67
- ```python
68
- # HF Spaces detection
69
- if os.environ.get('SPACE_ID'):
70
- data_file = "/tmp/trackio_experiments.json"
71
- else:
72
- data_file = "trackio_experiments.json"
73
- ```
74
-
75
- ### Backup Recovery
76
-
77
- The app automatically recovers your experiments from backup data when:
78
- - Running on HF Spaces
79
- - No existing experiments found
80
- - Data file is missing or empty
81
-
82
- ### Storage Locations
83
-
84
- 1. **Primary**: `/tmp/trackio_experiments.json`
85
- 2. **Backup**: `/tmp/trackio_backup.json`
86
- 3. **Fallback**: Local directory (for development)
87
-
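The storage locations and recovery conditions listed above can be sketched as a loader that walks the candidate files in order and skips any that are missing, unreadable, or empty. This is only an illustration under stated assumptions: the function names `candidate_paths` and `load_experiments` are hypothetical, and the actual app may recover backups differently.

```python
import json
import os
from pathlib import Path


def candidate_paths(env=None):
    """List data files to try, in priority order (paths taken from the guide)."""
    env = os.environ if env is None else env
    if env.get("SPACE_ID"):
        # On HF Spaces: primary file, then backup, then the local fallback.
        return [
            Path("/tmp/trackio_experiments.json"),
            Path("/tmp/trackio_backup.json"),
            Path("trackio_experiments.json"),
        ]
    return [Path("trackio_experiments.json")]


def load_experiments(paths):
    """Return (data, path) for the first file holding non-empty JSON."""
    for path in paths:
        try:
            data = json.loads(Path(path).read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # missing, unreadable, or corrupt: try the next location
        if data:
            return data, path
    # Nothing usable was found anywhere.
    return {}, None
```

Typical use would be `data, source = load_experiments(candidate_paths())`, with an empty result signalling that no experiments could be recovered.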
88
- ## 🚀 Deployment Best Practices
89
-
90
- ### 1. Environment Variables
91
- ```bash
92
- # Set in HF Spaces environment
93
- SPACE_ID=your-space-id
94
- TRACKIO_URL=https://your-space.hf.space
95
- ```
96
-
97
- ### 2. File Structure
98
- ```
99
- your-space/
100
- ├── app.py # Main Trackio app
101
- ├── requirements.txt # Dependencies
102
- ├── README.md # Space description
103
- └── .gitignore # Ignore temporary files
104
- ```
105
-
106
- ### 3. Requirements
107
- ```txt
108
- gradio>=4.0.0
109
- plotly>=5.0.0
110
- pandas>=1.5.0
111
- numpy>=1.24.0
112
- ```
113
-
114
- ## 📈 Monitoring Your Training
115
-
116
- ### Real-time Metrics
117
- Your experiments show:
118
- - **Loss**: Decreasing from 1.1659 to 1.1528 (good convergence)
119
- - **Learning Rate**: Properly scheduled from 7e-08 to 2.8875e-07
120
- - **Token Accuracy**: Around 75-76% (reasonable for early training)
121
- - **GPU Memory**: ~17GB allocated, 75GB reserved
122
-
123
- ### Expected Behavior
124
- - Loss should continue decreasing
125
- - Learning rate will follow cosine schedule
126
- - Token accuracy should improve over time
127
- - GPU memory usage should remain stable
128
-
129
- ## 🔍 Troubleshooting
130
-
131
- ### Issue: "No metrics data available"
132
- **Solution**: The app now automatically recovers experiments from backup
133
-
134
- ### Issue: Plots not showing
135
- **Solution**:
136
- 1. Check experiment ID is correct
137
- 2. Try different metrics (loss, learning_rate, etc.)
138
- 3. Refresh the page
139
-
140
- ### Issue: Data not persisting
141
- **Solution**:
142
- 1. App now uses `/tmp/` for better persistence
143
- 2. Backup recovery ensures data availability
144
- 3. Multiple storage locations provide redundancy
145
-
146
- ## 🎯 Next Steps
147
-
148
- 1. **Deploy Updated App**: Push the updated `app.py` to your HF Space
149
- 2. **Test Plots**: Try plotting your experiments
150
- 3. **Monitor Training**: Continue monitoring your training runs
151
- 4. **Add New Experiments**: Create new experiments as needed
152
-
153
- ## 📞 Support
154
-
155
- If you encounter issues:
156
- 1. Check the logs in your HF Space
157
- 2. Verify experiment IDs are correct
158
- 3. Try the backup recovery feature
159
- 4. Contact the maintainers for additional support
160
-
161
- ---
162
-
163
- **Your experiments are now properly configured and should display correctly in the Trackio interface!** 🎉
docs/INTERACTIVE_PIPELINE_IMPROVEMENTS.md DELETED
@@ -1,330 +0,0 @@
1
- # Interactive Pipeline Improvements
2
-
3
- This document explains the improvements made to the `launch.sh` script to make it interactive and configurable for different training scenarios.
4
-
5
- ## 🎯 Key Improvements
6
-
7
- ### 1. **Interactive User Interface**
8
- - **Colored Output**: Added color-coded status messages for better UX
9
- - **Input Validation**: Real-time validation of user inputs
10
- - **Default Values**: Smart defaults for common configurations
11
- - **Error Handling**: Graceful error handling with helpful messages
12
-
13
- ### 2. **Training Configuration Selection**
14
- The script now offers 4 predefined training configurations:
15
-
16
- #### **Basic Training (Default)**
17
- ```txt
18
- Model: SmolLM3-3B
19
- Dataset: SmolTalk
20
- Epochs: 3
21
- Batch Size: 2
22
- Learning Rate: 5e-6
23
- Sequence Length: 4096
24
- Best for: Quick experiments, learning
25
- ```
26
-
27
- #### **H100 Lightweight (Rapid)**
28
- ```txt
29
- Model: SmolLM3-3B
30
- Dataset: OpenHermes-FR (80K samples)
31
- Epochs: 1
32
- Batch Size: 16
33
- Learning Rate: 8e-6
34
- Sequence Length: 8192
35
- Best for: Rapid training on H100
36
- ```
37
-
38
- #### **A100 Large Scale**
39
- ```txt
40
- Model: SmolLM3-3B
41
- Dataset: OpenHermes-FR
42
- Epochs: 1.3 passes
43
- Batch Size: 8
44
- Learning Rate: 5e-6
45
- Sequence Length: 8192
46
- Best for: High-performance training
47
- ```
48
-
49
- #### **Multiple Passes**
50
- ```txt
51
- Model: SmolLM3-3B
52
- Dataset: OpenHermes-FR
53
- Epochs: 4 passes
54
- Batch Size: 6
55
- Learning Rate: 3e-6
56
- Sequence Length: 8192
57
- Best for: Thorough training
58
- ```
59
-
60
- #### **Custom Configuration**
61
- - User-defined parameters
62
- - Flexible model and dataset selection
63
- - Custom training parameters
64
-
65
- ### 3. **Enhanced User Experience**
66
-
67
- #### **Step-by-Step Guidance**
68
- 1. **Authentication** - HF username and token validation
69
- 2. **Configuration Selection** - Choose from predefined configs
70
- 3. **Experiment Setup** - Configure experiment details
71
- 4. **Training Parameters** - Adjust hyperparameters
72
- 5. **Deployment Setup** - Trackio Space configuration
73
- 6. **Confirmation** - Review and confirm settings
74
-
75
- #### **Input Functions**
76
- ```bash
77
- # Get input with default value
78
- get_input "Prompt" "default_value" VARIABLE_NAME
79
-
80
- # Select from options
81
- select_option "Choose option:" "Option 1" "Option 2" "Option 3" VARIABLE_NAME
82
-
83
- # Validate HF token
84
- validate_hf_token "$HF_TOKEN"
85
- ```
86
-
87
- #### **Colored Output Functions**
88
- ```bash
89
- print_status "Success message" # Green ✅
90
- print_warning "Warning message" # Yellow ⚠️
91
- print_error "Error message" # Red ❌
92
- print_info "Info message" # Blue ℹ️
93
- print_header "Header message" # Purple 🚀
94
- print_step "Step message" # Cyan 📋
95
- ```
96
-
97
- ### 4. **Dynamic Configuration Generation**
98
-
99
- The script now generates training configurations based on user selection:
100
-
101
- ```python
102
- # Generated config file
103
- config = SmolLM3Config(
104
- model_name="$MODEL_NAME",
105
- max_seq_length=$MAX_SEQ_LENGTH,
106
- batch_size=$BATCH_SIZE,
107
- learning_rate=$LEARNING_RATE,
108
- # ... other parameters
109
- )
110
- ```
111
-
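The generated config shown above is a Python source fragment with shell variables substituted in. As a rough sketch of the same idea done from Python, the snippet below renders that fragment with `string.Template`; `render_training_config` is a hypothetical helper, and the rendered text still needs whatever import of `SmolLM3Config` the project uses, which is not shown in the source.

```python
from string import Template

# Mirrors the fragment in the guide; only the four documented fields are filled in.
CONFIG_TEMPLATE = Template(
    "config = SmolLM3Config(\n"
    '    model_name="$model_name",\n'
    "    max_seq_length=$max_seq_length,\n"
    "    batch_size=$batch_size,\n"
    "    learning_rate=$learning_rate,\n"
    ")\n"
)


def render_training_config(model_name, max_seq_length, batch_size, learning_rate):
    """Return config source text; numeric fields are coerced so bad input fails early."""
    return CONFIG_TEMPLATE.substitute(
        model_name=model_name,
        max_seq_length=int(max_seq_length),
        batch_size=int(batch_size),
        learning_rate=float(learning_rate),
    )
```

Coercing the numeric fields before substitution means a typo such as `"4o96"` raises a `ValueError` at generation time instead of producing a config file that fails later.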
112
- ### 5. **Improved Error Handling**
113
-
114
- #### **Input Validation**
115
- - Required field validation
116
- - HF token validation
117
- - Numeric input validation
118
- - Choice validation
119
-
120
- #### **Graceful Degradation**
121
- - Clear error messages
122
- - Recovery suggestions
123
- - Exit on critical errors
124
-
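The numeric and choice validation listed above might look like the following sketch. The function names and error messages are assumptions made for illustration; the real `launch.sh` performs these checks in bash.

```python
from typing import Sequence


def parse_positive_int(raw: str, field: str) -> int:
    """Parse a prompt answer such as a batch size, rejecting non-numbers and values <= 0."""
    try:
        value = int(raw.strip())
    except ValueError:
        raise ValueError(f"{field} must be a whole number, got {raw!r}") from None
    if value <= 0:
        raise ValueError(f"{field} must be greater than zero, got {value}")
    return value


def parse_choice(raw: str, options: Sequence[str]) -> str:
    """Map a 1-based menu answer onto the matching option."""
    index = parse_positive_int(raw, "Choice")
    if index > len(options):
        raise ValueError(f"Choice must be between 1 and {len(options)}")
    return options[index - 1]
```

Raising `ValueError` with a field-specific message lets the caller print a clear error and prompt again, or exit on critical errors.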
125
- ### 6. **Configuration Management**
126
-
127
- #### **User Credentials**
128
- - Interactive username input
129
- - Secure token input
130
- - Real-time token validation
131
-
132
- #### **Experiment Details**
133
- - Dynamic experiment naming
134
- - Repository name generation
135
- - Dataset repository configuration
136
-
137
- #### **Training Parameters**
138
- - Batch size selection
139
- - Learning rate adjustment
140
- - Sequence length configuration
141
- - Save/eval/logging steps
142
-
143
- ### 7. **Enhanced Monitoring Integration**
144
-
145
- #### **Trackio Space**
146
- - Dynamic space naming
147
- - Automatic deployment
148
- - URL generation
149
-
150
- #### **HF Datasets**
151
- - Dataset repository setup
152
- - Experiment data storage
153
- - Access configuration
154
-
155
- ## 🔧 Technical Improvements
156
-
157
- ### 1. **Modular Functions**
158
- ```bash
159
- # Input handling
160
- get_input() # Get user input with defaults
161
- select_option() # Select from options
162
- validate_hf_token() # Validate HF token
163
-
164
- # Configuration
165
- show_training_configs() # Display available configs
166
- get_training_config() # Get config based on selection
167
- create_training_config() # Generate config file
168
-
169
- # Output formatting
170
- print_status() # Success messages
171
- print_warning() # Warning messages
172
- print_error() # Error messages
173
- print_info() # Info messages
174
- print_header() # Header messages
175
- print_step() # Step messages
176
- ```
177
-
178
- ### 2. **Configuration Selection Logic**
179
- ```bash
180
- case "$config_type" in
181
- "Basic Training")
182
- MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
183
- DATASET_NAME="HuggingFaceTB/smoltalk"
184
- # ... other parameters
185
- ;;
186
- "A100 Large Scale")
187
- MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
188
- DATASET_NAME="legmlai/openhermes-fr"
189
- # ... other parameters
190
- ;;
191
- # ... other configurations
192
- esac
193
- ```
194
-
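The `case` statement above maps a configuration name to a set of variables. The same mapping can be sketched as a lookup table, using the values documented for the four predefined configurations. The `TrainingPreset` class and `get_training_preset` function are hypothetical names, and the dataset ID used for the two presets that the source only labels "OpenHermes-FR" is assumed to match the `legmlai/openhermes-fr` ID shown for A100 Large Scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingPreset:
    model_name: str
    dataset_name: str
    epochs: float
    batch_size: int
    learning_rate: float
    max_seq_length: int


MODEL = "HuggingFaceTB/SmolLM3-3B"
OPENHERMES_FR = "legmlai/openhermes-fr"  # assumed ID for every "OpenHermes-FR" preset

PRESETS = {
    "Basic Training": TrainingPreset(MODEL, "HuggingFaceTB/smoltalk", 3, 2, 5e-6, 4096),
    "H100 Lightweight (Rapid)": TrainingPreset(MODEL, OPENHERMES_FR, 1, 16, 8e-6, 8192),
    "A100 Large Scale": TrainingPreset(MODEL, OPENHERMES_FR, 1.3, 8, 5e-6, 8192),
    "Multiple Passes": TrainingPreset(MODEL, OPENHERMES_FR, 4, 6, 3e-6, 8192),
}


def get_training_preset(name: str) -> TrainingPreset:
    """Look up a preset by its menu name, listing valid names on failure."""
    try:
        return PRESETS[name]
    except KeyError:
        valid = ", ".join(PRESETS)
        raise ValueError(f"Unknown configuration {name!r}; choose one of: {valid}") from None
```

Keeping the presets in one table means adding a configuration is a single new entry rather than another `case` branch.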
195
- ### 3. **Dynamic File Generation**
196
- ```bash
197
- # Generate training config
198
- create_training_config "$CONFIG_FILE"
199
-
200
- # Generate deployment input
201
- cat > deploy_input.txt << EOF
202
- $HF_USERNAME
203
- $TRACKIO_SPACE_NAME
204
- $HF_TOKEN
205
- EOF
206
- ```
207
-
208
- ## 📊 User Workflow
209
-
210
- ### **Before (Static)**
211
- 1. Edit `launch.sh` manually
212
- 2. Update hardcoded variables
213
- 3. Run script
214
- 4. Hope configuration is correct
215
-
216
- ### **After (Interactive)**
217
- 1. Run `./launch.sh`
218
- 2. Follow interactive prompts
219
- 3. Select training configuration
220
- 4. Confirm settings
221
- 5. Watch automated pipeline
222
-
223
- ## 🎯 Benefits
224
-
225
- ### **For Users**
226
- - **No Manual Editing**: No need to edit script files
227
- - **Guided Experience**: Step-by-step prompts
228
- - **Validation**: Real-time input validation
229
- - **Flexibility**: Multiple configuration options
230
- - **Safety**: Confirmation before execution
231
-
232
- ### **For Developers**
233
- - **Maintainable**: Modular function structure
234
- - **Extensible**: Easy to add new configurations
235
- - **Robust**: Comprehensive error handling
236
- - **User-Friendly**: Clear feedback and guidance
237
-
238
- ### **For Different Use Cases**
239
- - **Beginners**: Basic Training configuration
240
- - **H100 Users**: H100 Lightweight for rapid experiments
241
- - **Researchers**: A100 Large Scale for serious experiments
242
- - **Production**: Multiple Passes for thorough training
243
- - **Custom**: User-defined parameters for specific needs
244
-
245
- ## 🔄 Configuration Examples
246
-
247
- ### **Quick Start (Basic Training)**
248
- ```bash
249
- ./launch.sh
250
- # Follow prompts:
251
- # 1. Enter HF username and token
252
- # 2. Select "Basic Training"
253
- # 3. Confirm settings
254
- # 4. Watch automated pipeline
255
- ```
256
-
257
- ### **High-Performance Training (A100)**
258
- ```bash
259
- ./launch.sh
260
- # Follow prompts:
261
- # 1. Enter HF username and token
262
- # 2. Select "A100 Large Scale"
263
- # 3. Adjust parameters if needed
264
- # 4. Confirm and run
265
- ```
266
-
267
- ### **Rapid Training (H100)**
268
- ```bash
269
- ./launch.sh
270
- # Follow prompts:
271
- # 1. Enter HF username and token
272
- # 2. Select "H100 Lightweight (Rapid)"
273
- # 3. Confirm settings
274
- # 4. Watch rapid training on H100
275
- ```
276
-
277
- ### **Custom Training**
278
- ```bash
279
- ./launch.sh
280
- # Follow prompts:
281
- # 1. Enter HF username and token
282
- # 2. Select "Custom Configuration"
283
- # 3. Enter custom parameters:
284
- # - Model: microsoft/DialoGPT-medium
285
- # - Dataset: your-custom-dataset
286
- # - Epochs: 5
287
- # - Batch Size: 4
288
- # - Learning Rate: 1e-5
289
- # 4. Confirm and run
290
- ```
291
-
292
- ## 🚀 Future Enhancements
293
-
294
- ### **Planned Improvements**
295
- - **GUI Interface**: Web-based configuration interface
296
- - **Configuration Templates**: Save/load custom configurations
297
- - **Advanced Validation**: More sophisticated input validation
298
- - **Progress Tracking**: Real-time progress indicators
299
- - **Rollback Capability**: Undo changes if needed
300
-
301
- ### **Extensibility**
302
- - **Plugin System**: Add custom training configurations
303
- - **API Integration**: Connect to external services
304
- - **Multi-GPU Support**: Distributed training options
305
- - **Advanced Monitoring**: Enhanced tracking capabilities
306
-
307
- ## 📋 Migration Guide
308
-
309
- ### **For Existing Users**
310
- 1. **Backup**: Save your current `launch.sh`
311
- 2. **Update**: Replace with new interactive version
312
- 3. **Test**: Run with basic configuration first
313
- 4. **Migrate**: Use interactive prompts instead of manual editing
314
-
315
- ### **For New Users**
316
- 1. **Setup**: Run `python setup_launch.py`
317
- 2. **Check**: Run `python check_requirements.py`
318
- 3. **Launch**: Run `./launch.sh`
319
- 4. **Follow**: Use interactive prompts
320
-
321
- ## 🎉 Conclusion
322
-
323
- The interactive pipeline provides a much better user experience with:
324
- - **Guided Configuration**: No manual editing required
325
- - **Multiple Options**: Predefined configurations for different use cases
326
- - **Validation**: Real-time input validation and error handling
327
- - **Flexibility**: Custom configuration support
328
- - **Safety**: Confirmation steps and error recovery
329
-
330
- The script is now production-ready for users of all skill levels, from beginners to advanced researchers.
docs/LATEST_DEPLOYMENT_APPROACH.md DELETED
@@ -1,267 +0,0 @@
1
- # Latest Trackio Space Deployment Approach
2
-
3
- ## Overview
4
-
5
- Based on the [Hugging Face Hub repository code](https://github.com/huggingface/huggingface_hub/blob/9e0493cfdb4de5a27b45c53c3342c83ab1a138fb/src/huggingface_hub/commands/repo.py#L30), I've updated the Trackio Space deployment to use the latest Hugging Face Hub Python API instead of CLI commands.
6
-
7
- ## Key Improvements
8
-
9
- ### 1. **Latest HF Hub API Integration**
10
-
11
- **Before**: Using CLI commands
12
- ```python
13
- cmd = ["hf", "repo", "create", f"{username}/{space_name}", "--type", "space"]
14
- ```
15
-
16
- **After**: Using Python API
17
- ```python
18
- from huggingface_hub import create_repo
19
-
20
- create_repo(
21
- repo_id=f"{username}/{space_name}",
22
- token=token,
23
- repo_type="space",
24
- exist_ok=True,
25
- private=False,
26
- space_sdk="gradio",
27
- space_hardware="cpu-basic"
28
- )
29
- ```
30
-
31
- ### 2. **Robust Fallback Mechanism**
32
-
33
- The deployment script now includes both API and CLI approaches:
34
-
35
- ```python
36
- def create_space(self) -> bool:
37
- """Create a new Hugging Face Space using the latest API"""
38
- try:
39
- if not HF_HUB_AVAILABLE:
40
- return self._create_space_cli()
41
-
42
- # Use latest API
43
- create_repo(...)
44
-
45
- except Exception as api_error:
46
- # Fallback to CLI
47
- return self._create_space_cli()
48
- ```
49
-
50
- ### 3. **Enhanced Dependencies**
51
-
52
- Updated `requirements/requirements_core.txt`:
53
- ```txt
54
- # Hugging Face Hub for model and space management
55
- huggingface_hub>=0.19.0
56
- ```
57
-
58
- ## API Parameters
59
-
60
- ### **Required Parameters**
61
- - `repo_id`: Repository identifier (username/space-name)
62
- - `token`: Hugging Face token with write permissions
63
-
64
- ### **Optional Parameters**
65
- - `repo_type`: Set to "space" for Spaces
66
- - `exist_ok`: Allow existing repositories (default: True)
67
- - `private`: Make repository private (default: False)
68
- - `space_sdk`: SDK type (default: "gradio")
69
- - `space_hardware`: Hardware specification (default: "cpu-basic")
70
-
71
- ## Deployment Process
72
-
73
- ### **Step 1: API Creation**
74
- ```python
75
- # Create space using latest API
76
- create_repo(
77
- repo_id=f"{username}/{space_name}",
78
- token=token,
79
- repo_type="space",
80
- exist_ok=True,
81
- private=False,
82
- space_sdk="gradio",
83
- space_hardware="cpu-basic"
84
- )
85
- ```
86
-
87
- ### **Step 2: File Preparation**
88
- ```python
89
- # Prepare files in temporary directory
90
- temp_dir = tempfile.mkdtemp()
91
- # Copy template files
92
- shutil.copy2(source_path, dest_path)
93
- # Update README with actual space URL
94
- readme_content.replace("{SPACE_URL}", self.space_url)
95
- ```
96
-
97
- ### **Step 3: Git Upload**
98
- ```python
99
- # Initialize git in temp directory
100
- os.chdir(temp_dir)
101
- subprocess.run(["git", "init"], check=True)
102
- subprocess.run(["git", "remote", "add", "origin", space_url], check=True)
103
- subprocess.run(["git", "add", "."], check=True)
104
- subprocess.run(["git", "commit", "-m", "Initial Trackio Space setup"], check=True)
105
- subprocess.run(["git", "push", "origin", "main"], check=True)
106
- ```
107
-
108
- ## Testing the Latest Deployment
109
-
110
- ### **Run Latest Deployment Tests**
111
- ```bash
112
- python tests/test_latest_deployment.py
113
- ```
114
-
115
- Expected output:
116
- ```
117
- 🚀 Testing Latest Trackio Space Deployment
118
- =======================================================
119
- 🔍 Testing huggingface_hub import...
120
- ✅ huggingface_hub imported successfully
121
-
122
- 🔍 Testing deployment script import...
123
- ✅ TrackioSpaceDeployer class imported successfully
124
- ✅ HF API initialized
125
-
126
- 🔍 Testing API methods...
127
- ✅ Method exists: create_space
128
- ✅ Method exists: _create_space_cli
129
- ✅ Method exists: prepare_space_files
130
- ✅ Method exists: upload_files_to_space
131
- ✅ Method exists: test_space
132
- ✅ Method exists: deploy
133
-
134
- 🔍 Testing create_repo API...
135
- ✅ Required parameter: repo_id
136
- ✅ Required parameter: token
137
- ✅ Optional parameter: repo_type
138
- ✅ Optional parameter: space_sdk
139
- ✅ Optional parameter: space_hardware
140
- ✅ create_repo API signature looks correct
141
-
142
- 🔍 Testing space creation logic...
143
- ✅ Space URL formatted correctly
144
- ✅ Repo ID formatted correctly
145
-
146
- 🔍 Testing template files...
147
- ✅ app.py exists
148
- ✅ requirements.txt exists
149
- ✅ README.md exists
150
-
151
- 🔍 Testing temporary directory handling...
152
- ✅ Created temp directory: /tmp/tmp_xxxxx
153
- ✅ File copying works
154
- ✅ Cleanup successful
155
-
156
- 📊 Test Results: 7/7 tests passed
157
- ✅ All deployment tests passed! The latest deployment should work correctly.
158
- ```
159
-
160
- ## Files Updated
161
-
162
- ### **Core Deployment Files**
163
- 1. **`scripts/trackio_tonic/deploy_trackio_space.py`**
164
- - Added HF Hub API integration
165
- - Implemented fallback mechanism
166
- - Enhanced error handling
167
- - Better logging and debugging
168
-
169
- ### **Dependencies**
170
- 2. **`requirements/requirements_core.txt`**
171
- - Updated huggingface_hub to >=0.19.0
172
- - Organized dependencies by category
173
- - Added missing dependencies
174
-
175
- ### **Testing**
176
- 3. **`tests/test_latest_deployment.py`**
177
- - Comprehensive API testing
178
- - Import validation
179
- - Method verification
180
- - Template file checking
181
-
182
- ## Benefits of Latest Approach
183
-
184
- ### **1. Better Error Handling**
185
- - API-first approach with CLI fallback
186
- - Detailed error messages
187
- - Graceful degradation
188
-
189
- ### **2. More Reliable**
190
- - Uses official HF Hub API
191
- - Better parameter validation
192
- - Consistent behavior
193
-
194
- ### **3. Future-Proof**
195
- - Follows latest HF Hub patterns
196
- - Easy to update with new API features
197
- - Maintains backward compatibility
198
-
199
- ### **4. Enhanced Logging**
200
- - Detailed progress reporting
201
- - Better debugging information
202
- - Clear success/failure indicators
203
-
204
- ## Usage Instructions
205
-
206
- ### **1. Install Latest Dependencies**
207
- ```bash
208
- pip install huggingface_hub>=0.19.0
209
- ```
210
-
211
- ### **2. Test the Deployment**
212
- ```bash
213
- python tests/test_latest_deployment.py
214
- ```
215
-
216
- ### **3. Deploy Trackio Space**
217
- ```bash
218
- python scripts/trackio_tonic/deploy_trackio_space.py
219
- ```
220
-
221
- ### **4. Verify Deployment**
222
- - Check the Space URL
223
- - Test the interface
224
- - Verify API endpoints
225
-
226
- ## Troubleshooting
227
-
228
- ### **Common Issues**
229
-
230
- #### **1. Import Errors**
231
- ```
232
- ❌ Failed to import huggingface_hub
233
- ```
234
- **Solution**: Install latest version
235
- ```bash
236
- pip install huggingface_hub>=0.19.0
237
- ```
238
-
239
- #### **2. API Errors**
240
- ```
241
- API creation failed: 401 Client Error
242
- ```
243
- **Solution**: Check token permissions and validity
244
-
245
- #### **3. Git Push Errors**
246
- ```
247
- ❌ Error uploading files: git push failed
248
- ```
249
- **Solution**: Verify git configuration and token access
250
-
251
- ### **Fallback Behavior**
252
-
253
- The deployment script automatically falls back to CLI if:
254
- - `huggingface_hub` is not available
255
- - API creation fails
256
- - Network issues occur
257
-
258
- ## Reference Implementation
259
-
260
- Based on the [Hugging Face Hub repository](https://github.com/huggingface/huggingface_hub/blob/9e0493cfdb4de5a27b45c53c3342c83ab1a138fb/src/huggingface_hub/commands/repo.py#L30), this implementation:
261
-
262
- 1. **Uses the latest API patterns**
263
- 2. **Follows HF Hub best practices**
264
- 3. **Maintains backward compatibility**
265
- 4. **Provides robust error handling**
266
-
267
- The Trackio Space deployment should now work reliably with the latest Hugging Face Hub infrastructure! 🚀
docs/LAUNCH_SCRIPT_UPDATES.md DELETED
@@ -1,174 +0,0 @@
1
- # Launch Script Updates
2
-
3
- This document outlines the updates made to `launch.sh` to work with the new automated Trackio deployment features.
4
-
5
- ## Key Changes Made
6
-
7
- ### ✅ **Removed Manual Username Input**
8
- - **Before**: Script asked for username manually
9
- - **After**: Username is automatically extracted from HF token using `whoami()`
10
- - **Benefit**: Fewer manual inputs, better user experience
11
-
12
- ### ✅ **Updated Token Validation**
13
- - **Before**: `validate_hf_token()` only validated token
14
- - **After**: `validate_hf_token_and_get_username()` validates token AND extracts username
15
- - **Benefit**: Automatic username detection from token
16
-
17
- ### ✅ **Updated Deployment Workflow**
18
- - **Before**: Passed username manually to deployment script
19
- - **After**: Deployment script automatically gets username from token
20
- - **Benefit**: Consistent with new automated features
21
-
22
- ### ✅ **Enhanced User Feedback**
23
- - **Before**: Basic status messages
24
- - **After**: Clear information about automated features
25
- - **Benefit**: Users understand what's happening automatically
26
-
27
- ## Updated Workflow
28
-
29
- ### **Step 1: Authentication (Simplified)**
30
- ```bash
31
- # Before: Asked for username + token
32
- get_input "Hugging Face username" "" HF_USERNAME
33
- get_input "Hugging Face token" "" HF_TOKEN
34
-
35
- # After: Only asks for token, username auto-detected
36
- get_input "Hugging Face token" "" HF_TOKEN
37
- # Username automatically extracted from token
38
- ```
39
-
40
- ### **Step 9: Trackio Space Deployment (Automated)**
41
- ```bash
42
- # Before: Manual input file creation
43
- cat > deploy_input.txt << EOF
44
- $HF_USERNAME
45
- $TRACKIO_SPACE_NAME
46
- $HF_TOKEN
47
- $GIT_EMAIL
48
- $HF_USERNAME
49
- EOF
50
- python deploy_trackio_space.py < deploy_input.txt
51
-
52
- # After: Direct input with automated features
53
- python deploy_trackio_space.py << EOF
54
- $TRACKIO_SPACE_NAME
55
- $HF_TOKEN
56
- $GIT_EMAIL
57
- $HF_USERNAME
58
- EOF
59
- ```
60
-
61
- ### **Step 10: Dataset Setup (Automated)**
62
- ```bash
63
- # Before: Basic dataset setup
64
- python setup_hf_dataset.py
65
-
66
- # After: Automated dataset setup with user feedback
67
- print_info "Setting up HF Dataset with automated features..."
68
- print_info "Username will be auto-detected from token"
69
- print_info "Dataset repository: $TRACKIO_DATASET_REPO"
70
- python setup_hf_dataset.py
71
- ```
72
-
73
- ### **Step 11: Trackio Configuration (Automated)**
74
- ```bash
75
- # Before: Basic configuration
76
- python configure_trackio.py
77
-
78
- # After: Automated configuration with user feedback
79
- print_info "Configuring Trackio with automated features..."
80
- print_info "Username will be auto-detected from token"
81
- python configure_trackio.py
82
- ```
83
-
84
- ## New Function: `validate_hf_token_and_get_username()`
85
-
86
- ```bash
87
- validate_hf_token_and_get_username() {
88
- local token="$1"
89
- if [ -z "$token" ]; then
90
- return 1
91
- fi
92
-
93
- # Test the token and get username
94
- export HF_TOKEN="$token"
95
- if hf whoami >/dev/null 2>&1; then
96
- # Get username from whoami command
97
- HF_USERNAME=$(hf whoami | head -n1 | tr -d '\n')
98
- return 0
99
- else
100
- return 1
101
- fi
102
- }
103
- ```
104
-
105
- ## User Experience Improvements
106
-
107
- ### ✅ **Fewer Manual Inputs**
108
- - Only need to provide HF token
109
- - Username automatically detected
110
- - Git email still required (for git operations)
111
-
112
- ### ✅ **Better Feedback**
113
- - Clear messages about automated features
114
- - Shows what's happening automatically
115
- - Better error messages
116
-
117
- ### ✅ **Consistent Automation**
118
- - All scripts now use automated features
119
- - No manual username input anywhere
120
- - Automatic secret setting
121
-
122
- ## Configuration Summary Updates
123
-
124
- ### **Before:**
125
- ```
126
- 📋 Configuration Summary:
127
- ========================
128
- User: username (manually entered)
129
- Experiment: experiment_name
130
- ...
131
- ```
132
-
133
- ### **After:**
134
- ```
135
- 📋 Configuration Summary:
136
- ========================
137
- User: username (auto-detected from token)
138
- Experiment: experiment_name
139
- ...
140
- ```
141
-
142
- ## Benefits
143
-
144
- 1. **Simplified Workflow**: Only need token, username auto-detected
145
- 2. **Consistent Automation**: All scripts use automated features
146
- 3. **Better User Experience**: Clear feedback about automated features
147
- 4. **Reduced Errors**: No manual username input means fewer typos
148
- 5. **Streamlined Process**: Fewer steps, more automation
149
-
150
- ## Testing
151
-
152
- The updated launch script has been tested for:
153
- - ✅ Syntax validation (`bash -n launch.sh`)
154
- - ✅ Function integration with updated scripts
155
- - ✅ Automated username extraction
156
- - ✅ Consistent workflow with new features
157
-
158
- ## Compatibility
159
-
160
- The updated launch script is fully compatible with:
161
- - ✅ Updated `deploy_trackio_space.py` (automated features)
162
- - ✅ Updated `setup_hf_dataset.py` (username extraction)
163
- - ✅ Updated `configure_trackio.py` (automated configuration)
164
- - ✅ Existing training and model push scripts
165
-
166
- ## Summary
167
-
168
- The launch script now provides a seamless, automated experience that:
169
- - Extracts username automatically from HF token
170
- - Uses all the new automated features in the deployment scripts
171
- - Provides clear feedback about automated processes
172
- - Maintains compatibility with existing workflows
173
- - Reduces manual input requirements
174
- - Improves overall user experience
docs/LAUNCH_SCRIPT_USERNAME_FIX.md DELETED
@@ -1,154 +0,0 @@
1
- # Launch Script Username Parameter Fix
2
-
3
- This document outlines the fix for removing unnecessary username parameters from the launch script deployment calls.
4
-
5
- ## 🐛 **Problem Description**
6
-
7
- The `launch.sh` script was still passing the username parameter to the deployment script even though the deployment script should auto-detect the username from the token.
8
-
9
- **Before:**
10
- ```bash
11
- # Run deployment script with automated features
12
- python deploy_trackio_space.py << EOF
13
- $TRACKIO_SPACE_NAME
14
- $HF_TOKEN
15
- $GIT_EMAIL
16
- $HF_USERNAME # ❌ Unnecessary - should be auto-detected
17
- EOF
18
- ```
19
-
20
- ## ✅ **Solution Implemented**
21
-
22
- ### **Removed Unnecessary Username Parameter**
23
-
24
- **After:**
25
- ```bash
26
- # Run deployment script with automated features
27
- python deploy_trackio_space.py << EOF
28
- $TRACKIO_SPACE_NAME
29
- $HF_TOKEN
30
- $GIT_EMAIL
31
-
32
- EOF
33
- ```
34
-
35
- ## 🔧 **Why This Fix Was Needed**
36
-
37
- ### **1. Deployment Script Auto-Detection**
38
- The `deploy_trackio_space.py` script already has robust username auto-detection:
39
-
40
- ```python
41
- def __init__(self, space_name: str, token: str, git_email: str = None, git_name: str = None):
42
- # Username is auto-detected from token
43
- username = get_username_from_token(token)
44
- if not username:
45
- username = get_username_from_cli(token)
46
- ```
47
-
48
- ### **2. Consistent Automation**
49
- All deployment scripts now use the same pattern:
50
- - `deploy_trackio_space.py` - Auto-detects username from token
51
- - `setup_hf_dataset.py` - Auto-detects username from token
52
- - `configure_trackio.py` - Auto-detects username from token
53
-
54
- ### **3. Reduced Manual Input**
55
- The launch script still extracts username for its own use (defaults, display), but doesn't pass it to scripts that can auto-detect it.
56
-
57
- ## 📋 **Current Workflow**
58
-
59
- ### **Launch Script Username Usage:**
60
- ```bash
61
- # 1. Extract username for launch script use
62
- HF_USERNAME=$(hf whoami | head -n1 | tr -d '\n')
63
-
64
- # 2. Use for default values and display
65
- get_input "Model repository name" "$HF_USERNAME/smollm3-finetuned-$(date +%Y%m%d)" REPO_NAME
66
- get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
67
- TRACKIO_URL="https://huggingface.co/spaces/$HF_USERNAME/$TRACKIO_SPACE_NAME"
68
-
69
- # 3. Display in summary
70
- echo " User: $HF_USERNAME (auto-detected from token)"
71
- ```
72
-
73
- ### **Deployment Script Auto-Detection:**
74
- ```python
75
- # Each script auto-detects username from token
76
- username = get_username_from_token(hf_token)
77
- if not username:
78
- username = get_username_from_cli(hf_token)
79
- ```
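The API-then-CLI fallback shown above can be sketched as a small detector chain. This is a minimal illustration, not the repository's implementation: `detect_username`, `from_api`, and `from_cli` are hypothetical names, and the two stand-in detectors only simulate what `get_username_from_token` and `get_username_from_cli` might return.

```python
from typing import Callable, Iterable, Optional

# A detector takes a token and returns a username, or None if it cannot tell.
Detector = Callable[[str], Optional[str]]


def detect_username(token: str, detectors: Iterable[Detector]) -> Optional[str]:
    """Return the first non-empty username produced by the detectors, in order."""
    for detect in detectors:
        try:
            username = (detect(token) or "").strip()
        except Exception:
            # A failing detector (network error, missing CLI) must not abort the chain.
            username = ""
        if username:
            return username
    return None


# Hypothetical stand-ins for the API-based and CLI-based detectors.
def from_api(token: str) -> Optional[str]:
    raise ConnectionError("API unreachable")


def from_cli(token: str) -> Optional[str]:
    return "tonic\n"  # CLI output usually ends with a newline


print(detect_username("hf_example", [from_api, from_cli]))
```

Keeping the detectors as plain callables means a new detection method can be added to the list without touching the callers.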
80
-
81
- ## 🎯 **Benefits**
82
-
83
- ### **✅ Consistent Automation**
84
- - All scripts use the same username detection method
85
- - No manual username input required anywhere
86
- - Automatic fallback to CLI if API fails
87
-
88
- ### **✅ Reduced Complexity**
89
- - Fewer parameters to pass between scripts
90
- - Less chance of username mismatch errors
91
- - Cleaner script interfaces
92
-
93
- ### **✅ Better User Experience**
94
- - Username is auto-detected from token
95
- - No manual username input required
96
- - Clear feedback about auto-detection
97
-
98
- ### **✅ Future-Proof**
99
- - If username detection method changes, only one place to update
100
- - Consistent behavior across all scripts
101
- - Easier to maintain and debug
102
-
103
- ## 🔍 **Scripts Updated**
104
-
105
- ### **1. `launch.sh`**
106
- - ✅ Removed `$HF_USERNAME` parameter from deployment script call
107
- - ✅ Kept username extraction for launch script use (defaults, display)
108
- - ✅ Maintained all other functionality
109
-
110
- ### **2. Deployment Scripts (No Changes Needed)**
111
- - ✅ `deploy_trackio_space.py` - Already auto-detects username
112
- - ✅ `setup_hf_dataset.py` - Already auto-detects username
113
- - ✅ `configure_trackio.py` - Already auto-detects username
114
-
115
- ## 🧪 **Testing Results**
116
-
117
- ```bash
118
- # Syntax check passes
119
- bash -n launch.sh
120
- # ✅ No syntax errors
121
-
122
- # All tests pass
123
- python tests/test_trackio_fixes.py
124
- # ✅ 7/7 tests passed
125
- ```
126
-
127
- ## 🚀 **Usage**
128
-
129
- The fix is transparent to users. The workflow remains the same:
130
-
131
- ```bash
132
- # 1. Run launch script
133
- bash launch.sh
134
-
135
- # 2. Enter token (username auto-detected)
136
- Enter your Hugging Face token: hf_...
137
-
138
- # 3. All deployment happens automatically
139
- # - Username auto-detected from token
140
- # - No manual username input required
141
- # - Consistent behavior across all scripts
142
- ```
143
-
144
- ## 🎉 **Summary**
145
-
146
- The username parameter fix ensures that:
147
-
148
- - ✅ **No Manual Username Input**: Username is auto-detected from token
149
- - ✅ **Consistent Automation**: All scripts use the same detection method
150
- - ✅ **Reduced Complexity**: Fewer parameters to pass between scripts
151
- - ✅ **Better User Experience**: Clear feedback about auto-detection
152
- - ✅ **Future-Proof**: Easy to maintain and update
153
-
154
- The launch script no longer forwards the username to the deployment scripts. Each script derives it from the token, so the value stays consistent across all of them.
docs/MODEL_CARD_USER_INPUT_ANALYSIS.md DELETED
@@ -1,233 +0,0 @@
1
- # Model Card User Input Analysis
2
-
3
- ## Overview
4
-
5
- This document analyzes the interaction between the model card template (`templates/model_card.md`), the model card generator (`scripts/model_tonic/generate_model_card.py`), and the launch script (`launch.sh`) to identify variables that require user input and improve the user experience.
6
-
7
- ## Template Variables Analysis
8
-
9
- ### Variables in `templates/model_card.md`
10
-
11
- The model card template uses the following variables that can be populated with user input:
12
-
13
- #### Core Model Information
14
- - `{{model_name}}` - Display name of the model
15
- - `{{model_description}}` - Brief description of the model
16
- - `{{repo_name}}` - Hugging Face repository name
17
- - `{{base_model}}` - Base model used for fine-tuning
18
-
19
- #### Training Configuration
20
- - `{{training_config_type}}` - Type of training configuration used
21
- - `{{trainer_type}}` - Type of trainer (SFT, DPO, etc.)
22
- - `{{batch_size}}` - Training batch size
23
- - `{{gradient_accumulation_steps}}` - Gradient accumulation steps
24
- - `{{learning_rate}}` - Learning rate used
25
- - `{{max_epochs}}` - Maximum number of epochs
26
- - `{{max_seq_length}}` - Maximum sequence length
27
-
28
- #### Dataset Information
29
- - `{{dataset_name}}` - Name of the dataset used
30
- - `{{dataset_size}}` - Size of the dataset
31
- - `{{dataset_format}}` - Format of the dataset
32
- - `{{dataset_sample_size}}` - Sample size (for lightweight configs)
33
-
34
- #### Training Results
35
- - `{{training_loss}}` - Final training loss
36
- - `{{validation_loss}}` - Final validation loss
37
- - `{{perplexity}}` - Model perplexity
38
-
39
- #### Infrastructure
40
- - `{{hardware_info}}` - Hardware used for training
41
- - `{{experiment_name}}` - Name of the experiment
42
- - `{{trackio_url}}` - Trackio monitoring URL
43
- - `{{dataset_repo}}` - HF Dataset repository
44
-
45
- #### Author Information
46
- - `{{author_name}}` - Author name for citations and attribution
47
- - `{{model_name_slug}}` - URL-friendly model name
48
-
49
- #### Quantization
50
- - `{{quantized_models}}` - Boolean indicating if quantized models exist
51
-
52
- ## User Input Requirements
53
-
54
- ### Previously Missing User Inputs
55
-
56
- #### 1. **Author Name** (`author_name`)
57
- - **Purpose**: Used in model card metadata and citations
58
- - **Template Usage**: `{{#if author_name}}author: {{author_name}}{{/if}}`
59
- - **Citation Usage**: `author={{{author_name}}}`
60
- - **Default**: "Your Name"
61
- - **User Input Added**: ✅ **IMPLEMENTED**
62
-
63
- #### 2. **Model Description** (`model_description`)
64
- - **Purpose**: Brief description of the model's capabilities
65
- - **Template Usage**: `{{model_description}}`
66
- - **Default**: "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities."
67
- - **User Input Added**: ✅ **IMPLEMENTED**
68
-
69
- ### Variables That Don't Need User Input
70
-
71
- Most variables are automatically populated from:
72
- - **Training Configuration**: Batch size, learning rate, epochs, etc.
73
- - **System Detection**: Hardware info, model size, etc.
74
- - **Auto-Generation**: Repository names, experiment names, etc.
75
- - **Training Results**: Loss values, perplexity, etc.
76
-
77
- ## Implementation Changes
78
-
79
- ### 1. Launch Script Updates (`launch.sh`)
80
-
81
- #### Added User Input Prompts
82
- ```bash
83
- # Step 8.2: Author Information for Model Card
84
- print_step "Step 8.2: Author Information"
85
- echo "================================="
86
-
87
- print_info "This information will be used in the model card and citation."
88
- get_input "Author name for model card" "$HF_USERNAME" AUTHOR_NAME
89
-
90
- print_info "Model description will be used in the model card and repository."
91
- get_input "Model description" "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities." MODEL_DESCRIPTION
92
- ```
93
-
94
- #### Updated Configuration Summary
95
- ```bash
96
- echo " Author: $AUTHOR_NAME"
97
- ```
98
-
99
- #### Updated Model Push Call
100
- ```bash
101
- python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
102
- --token "$HF_TOKEN" \
103
- --trackio-url "$TRACKIO_URL" \
104
- --experiment-name "$EXPERIMENT_NAME" \
105
- --dataset-repo "$TRACKIO_DATASET_REPO" \
106
- --author-name "$AUTHOR_NAME" \
107
- --model-description "$MODEL_DESCRIPTION"
108
- ```
109
-
110
- ### 2. Push Script Updates (`scripts/model_tonic/push_to_huggingface.py`)
111
-
112
- #### Added Command Line Arguments
113
- ```python
114
- parser.add_argument('--author-name', type=str, default=None, help='Author name for model card')
115
- parser.add_argument('--model-description', type=str, default=None, help='Model description for model card')
116
- ```
117
-
118
- #### Updated Class Constructor
119
- ```python
120
- def __init__(
121
- self,
122
- model_path: str,
123
- repo_name: str,
124
- token: Optional[str] = None,
125
- private: bool = False,
126
- trackio_url: Optional[str] = None,
127
- experiment_name: Optional[str] = None,
128
- dataset_repo: Optional[str] = None,
129
- hf_token: Optional[str] = None,
130
- author_name: Optional[str] = None,
131
- model_description: Optional[str] = None
132
- ):
133
- ```
134
-
135
- #### Updated Model Card Generation
136
- ```python
137
- variables = {
138
- "model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
139
- "model_description": self.model_description or "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities.",
140
- # ... other variables
141
- "author_name": self.author_name or training_config.get('author_name', 'Your Name'),
142
- }
143
- ```
144
-
145
- ## User Experience Improvements
146
-
147
- ### 1. **Interactive Prompts**
148
- - Users are now prompted for author name and model description
149
- - Default values are provided for convenience
150
- - Clear explanations of what each field is used for
151
-
152
- ### 2. **Configuration Summary**
153
- - Author name is now displayed in the configuration summary
154
- - Users can review all settings before proceeding
155
-
156
- ### 3. **Automatic Integration**
157
- - User inputs are automatically passed to the model card generation
158
- - No manual editing of scripts required
159
-
160
- ## Template Variable Categories
161
-
162
- ### Automatic Variables (No User Input Needed)
163
- - `repo_name` - Auto-generated from username and date
164
- - `base_model` - Always "HuggingFaceTB/SmolLM3-3B"
165
- - `training_config_type` - From user selection
166
- - `trainer_type` - From user selection
167
- - `batch_size`, `learning_rate`, `max_epochs` - From training config
168
- - `hardware_info` - Auto-detected
169
- - `experiment_name` - Auto-generated with timestamp
170
- - `trackio_url` - Auto-generated from space name
171
- - `dataset_repo` - Auto-generated
172
- - `training_loss`, `validation_loss`, `perplexity` - From training results
173
-
174
- ### User Input Variables (Now Implemented)
175
- - `author_name` - ✅ **Added user prompt**
176
- - `model_description` - ✅ **Added user prompt**
177
-
178
- ### Conditional Variables
179
- - `quantized_models` - Set automatically based on quantization choices
180
- - `dataset_sample_size` - Set based on training configuration type
181
-
182
- ## Benefits of These Changes
183
-
184
- ### 1. **Better Attribution**
185
- - Author names are properly captured and used in citations
186
- - Model cards include proper attribution
187
-
188
- ### 2. **Customizable Descriptions**
189
- - Users can provide custom model descriptions
190
- - Better model documentation and discoverability
191
-
192
- ### 3. **Improved User Experience**
193
- - No need to manually edit scripts
194
- - Interactive prompts with helpful defaults
195
- - Clear feedback on what information is being collected
196
-
197
- ### 4. **Consistent Documentation**
198
- - All model cards will have proper author information
199
- - Standardized model descriptions
200
- - Better integration with Hugging Face Hub
201
-
202
- ## Future Enhancements
203
-
204
- ### Potential Additional User Inputs
205
- 1. **License Selection** - Allow users to choose model license
206
- 2. **Model Tags** - Custom tags for better discoverability
207
- 3. **Usage Examples** - Custom usage examples for specific use cases
208
- 4. **Limitations Description** - Custom limitations based on training data
209
-
210
- ### Template Improvements
211
- 1. **Dynamic License** - Support for different license types
212
- 2. **Custom Tags** - User-defined model tags
213
- 3. **Usage Scenarios** - Template sections for different use cases
214
-
215
- ## Testing
216
-
217
- The changes have been tested to ensure:
218
- - ✅ Author name is properly passed to model card generation
219
- - ✅ Model description is properly passed to model card generation
220
- - ✅ Default values work correctly
221
- - ✅ Configuration summary displays new fields
222
- - ✅ Model push script accepts new parameters
223
-
224
- ## Conclusion
225
-
226
- The analysis identified that the model card template had two key variables (`author_name` and `model_description`) that would benefit from user input. These have been successfully implemented with:
227
-
228
- 1. **Interactive prompts** in the launch script
229
- 2. **Command line arguments** in the push script
230
- 3. **Proper integration** with the model card generator
231
- 4. **User-friendly defaults** and clear explanations
232
-
233
- This improves the overall user experience and ensures that model cards have proper attribution and descriptions.
docs/MODEL_RECOVERY_GUIDE.md DELETED
@@ -1,228 +0,0 @@
1
- # Model Recovery and Deployment Guide
2
-
3
- This guide will help you recover your trained model from the cloud instance and deploy it to Hugging Face Hub with quantization.
4
-
5
- ## Prerequisites
6
-
7
- 1. **Hugging Face Token**: You need a Hugging Face token with write permissions
8
- 2. **Cloud Instance Access**: SSH access to your cloud instance
9
- 3. **Model Files**: Your trained model should be in `/output-checkpoint/` on the cloud instance
10
-
11
- ## Step 1: Connect to Your Cloud Instance
12
-
13
- ```bash
14
- ssh root@your-cloud-instance-ip
15
- cd ~/smollm3_finetune
16
- ```
17
-
18
- ## Step 2: Set Your Hugging Face Token
19
-
20
- ```bash
21
- export HF_TOKEN=your_huggingface_token_here
22
- ```
23
-
24
- Replace `your_huggingface_token_here` with your actual Hugging Face token.
25
-
26
- ## Step 3: Verify Model Files
27
-
28
- Check that your model files exist:
29
-
30
- ```bash
31
- ls -la /output-checkpoint/
32
- ```
33
-
34
- You should see files like:
35
- - `config.json`
36
- - `model.safetensors.index.json`
37
- - `model-00001-of-00002.safetensors`
38
- - `model-00002-of-00002.safetensors`
39
- - `tokenizer.json`
40
- - `tokenizer_config.json`
41
-
42
- ## Step 4: Update Configuration
43
-
44
- Edit the deployment script to use your Hugging Face username:
45
-
46
- ```bash
47
- nano cloud_deploy.py
48
- ```
49
-
50
- Change this line:
51
- ```python
52
- REPO_NAME = "your-username/smollm3-finetuned" # Change to your HF username and desired repo name
53
- ```
54
-
55
- To your actual username, for example:
56
- ```python
57
- REPO_NAME = "tonic/smollm3-finetuned"
58
- ```
59
-
60
- ## Step 5: Run the Deployment
61
-
62
- Execute the deployment script:
63
-
64
- ```bash
65
- python3 cloud_deploy.py
66
- ```
67
-
68
- This will:
69
- 1. ✅ Validate your model files
70
- 2. ✅ Install required dependencies (torchao, huggingface_hub)
71
- 3. ✅ Push the main model to Hugging Face Hub
72
- 4. ✅ Create quantized versions (int8 and int4)
73
- 5. ✅ Push quantized models to subdirectories
74
-
75
- ## Step 6: Verify Deployment
76
-
77
- After successful deployment, you can verify:
78
-
79
- 1. **Main Model**: https://huggingface.co/your-username/smollm3-finetuned
80
- 2. **int8 Quantized**: https://huggingface.co/your-username/smollm3-finetuned/tree/main/int8
80
- 3. **int4 Quantized**: https://huggingface.co/your-username/smollm3-finetuned/tree/main/int4
82
-
83
- ## Alternative: Manual Deployment
84
-
85
- If you prefer to run the steps manually:
86
-
87
- ### 1. Push Main Model Only
88
-
89
- ```bash
90
- python3 scripts/model_tonic/push_to_huggingface.py \
91
- /output-checkpoint/ \
92
- your-username/smollm3-finetuned \
93
- --hf-token $HF_TOKEN \
94
- --author-name "Your Name" \
95
- --model-description "A fine-tuned SmolLM3 model for improved text generation"
96
- ```
97
-
98
- ### 2. Quantize and Push (Optional)
99
-
100
- ```bash
101
- # int8 quantization (GPU optimized)
102
- python3 scripts/model_tonic/quantize_model.py \
103
- /output-checkpoint/ \
104
- your-username/smollm3-finetuned \
105
- --quant-type int8_weight_only \
106
- --hf-token $HF_TOKEN
107
-
108
- # int4 quantization (CPU optimized)
109
- python3 scripts/model_tonic/quantize_model.py \
110
- /output-checkpoint/ \
111
- your-username/smollm3-finetuned \
112
- --quant-type int4_weight_only \
113
- --hf-token $HF_TOKEN
114
- ```
115
-
116
- ## Troubleshooting
117
-
118
- ### Common Issues
119
-
120
- 1. **HF_TOKEN not set**
121
- ```bash
122
- export HF_TOKEN=your_token_here
123
- ```
124
-
125
- 2. **Model files not found**
126
- ```bash
127
- ls -la /output-checkpoint/
128
- ```
129
- Make sure the training completed successfully.
130
-
131
- 3. **Dependencies missing**
132
- ```bash
133
- pip install torchao huggingface_hub
134
- ```
135
-
136
- 4. **Permission denied**
137
- ```bash
138
- chmod +x cloud_deploy.py
139
- chmod +x recover_model.py
140
- ```
141
-
142
- ### Error Messages
143
-
144
- - **"Missing required model files"**: Check that your model training completed successfully
145
- - **"Repository creation failed"**: Verify your HF token has write permissions
146
- - **"Quantization failed"**: Check GPU memory availability or try CPU quantization
147
-
148
- ## Model Usage
149
-
150
- Once deployed, you can use your model:
151
-
152
- ```python
153
- from transformers import AutoModelForCausalLM, AutoTokenizer
154
-
155
- # Main model
156
- model = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned")
157
- tokenizer = AutoTokenizer.from_pretrained("your-username/smollm3-finetuned")
158
-
159
- # int8 quantized (GPU optimized)
160
- model = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned", subfolder="int8")
160
- tokenizer = AutoTokenizer.from_pretrained("your-username/smollm3-finetuned", subfolder="int8")
162
-
163
- # int4 quantized (CPU optimized)
164
- model = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned", subfolder="int4")
164
- tokenizer = AutoTokenizer.from_pretrained("your-username/smollm3-finetuned", subfolder="int4")
166
-
167
- # Generate text
168
- inputs = tokenizer("Hello, how are you?", return_tensors="pt")
169
- outputs = model.generate(**inputs, max_new_tokens=100)
170
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
171
- ```
172
-
173
- ## File Structure
174
-
175
- After deployment, your repository will have:
176
-
177
- ```
178
- your-username/smollm3-finetuned/
179
- ├── README.md (model card)
180
- ├── config.json
181
- ├── model.safetensors.index.json
182
- ├── model-00001-of-00002.safetensors
183
- ├── model-00002-of-00002.safetensors
184
- ├── tokenizer.json
185
- ├── tokenizer_config.json
186
- ├── int8/ (quantized model for GPU)
187
- │ ├── README.md
188
- │ ├── config.json
189
- │ └── pytorch_model.bin
190
- └── int4/ (quantized model for CPU)
191
- ├── README.md
192
- ├── config.json
193
- └── pytorch_model.bin
194
- ```
195
-
196
- ## Success Indicators
197
-
198
- ✅ **Successful deployment shows:**
199
- - "Model recovery and deployment completed successfully!"
200
- - "View your model at: https://huggingface.co/your-username/smollm3-finetuned"
201
- - No error messages in the output
202
-
203
- ❌ **Failed deployment shows:**
204
- - Error messages about missing files or permissions
205
- - "Model recovery and deployment failed!"
206
-
207
- ## Next Steps
208
-
209
- After successful deployment:
210
-
211
- 1. **Test your model** on Hugging Face Hub
212
- 2. **Share your model** with the community
213
- 3. **Monitor usage** through Hugging Face analytics
214
- 4. **Consider fine-tuning** further based on feedback
215
-
216
- ## Support
217
-
218
- If you encounter issues:
219
-
220
- 1. Check the error messages carefully
221
- 2. Verify your HF token permissions
222
- 3. Ensure all model files are present
223
- 4. Try running individual steps manually
224
- 5. Check the logs for detailed error information
225
-
226
- ---
227
-
228
- **Happy deploying! 🚀**
docs/MONITORING_IMPROVEMENTS_SUMMARY.md DELETED
@@ -1,191 +0,0 @@
1
- # 🚀 Monitoring Improvements Summary
2
-
3
- ## Overview
4
-
5
- The monitoring system now supports **Hugging Face Datasets** for persistent experiment storage, which suits deployment on Hugging Face Spaces and other cloud environments where local storage is ephemeral.
6
-
7
- ## ✅ Key Improvements Made
8
-
9
- ### 1. **Enhanced `monitoring.py`**
10
- - ✅ **HF Datasets Integration**: Added support for saving experiments to HF Datasets repositories
11
- - ✅ **Environment Variables**: Automatic detection of `HF_TOKEN` and `TRACKIO_DATASET_REPO`
12
- - ✅ **Fallback Support**: Graceful degradation if HF Datasets unavailable
13
- - ✅ **Dual Storage**: Experiments saved to both Trackio and HF Datasets
14
- - ✅ **Periodic Saving**: Metrics saved to HF Dataset every 10 steps
15
- - ✅ **Error Handling**: Robust error logging and recovery
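The "save every 10 steps" behaviour listed above could look roughly like the buffer below. This is a sketch only: `PeriodicMetricsBuffer` is a hypothetical class, not the one in `monitoring.py`, and the `flush` callback stands in for whatever actually writes to the HF Dataset.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


class PeriodicMetricsBuffer:
    """Collect per-step metrics and hand them to `flush` every `every` steps."""

    def __init__(self, flush: Callable[[List[Record]], None], every: int = 10):
        if every < 1:
            raise ValueError("every must be >= 1")
        self._flush = flush
        self._every = every
        self._pending: List[Record] = []

    def log(self, step: int, metrics: Record) -> None:
        # Store the step alongside the metrics so each record is self-describing.
        self._pending.append({"step": step, **metrics})
        if step % self._every == 0:
            self.close()

    def close(self) -> None:
        """Flush whatever is pending; call this once more when training ends."""
        if self._pending:
            self._flush(list(self._pending))
            self._pending.clear()


batches: List[List[Record]] = []
buffer = PeriodicMetricsBuffer(batches.append, every=10)
for step in range(1, 26):
    buffer.log(step, {"loss": 1.0 / step})
buffer.close()  # writes the remaining steps 21-25
print([len(batch) for batch in batches])
```

Batching writes this way trades a small risk of losing the last few steps on a crash for far fewer uploads, which is why the final `close()` call matters.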
16
-
17
- ### 2. **Updated `train.py`**
18
- - ✅ **Monitoring Integration**: Automatic monitoring setup in training scripts
19
- - ✅ **Configuration Logging**: Experiment configuration logged at start
20
- - ✅ **Training Callbacks**: Monitoring callbacks added to trainer
21
- - ✅ **Summary Logging**: Training summaries logged at completion
22
- - ✅ **Error Logging**: Errors logged to monitoring system
23
- - ✅ **Cleanup**: Proper monitoring session cleanup
24
-
25
- ### 3. **Configuration Files Updated**
26
- - ✅ **HF Datasets Config**: Added `hf_token` and `dataset_repo` parameters
27
- - ✅ **Environment Support**: Environment variables automatically detected
28
- - ✅ **Backward Compatible**: Existing configurations still work
29
-
30
- ### 4. **New Utility Scripts**
31
- - ✅ **`configure_trackio.py`**: Configuration testing and setup
32
- - ✅ **`integrate_monitoring.py`**: Automated integration script
33
- - ✅ **`test_monitoring_integration.py`**: Comprehensive testing
34
- - ✅ **`setup_hf_dataset.py`**: Dataset repository setup
35
-
36
- ### 5. **Documentation**
37
- - ✅ **`MONITORING_INTEGRATION_GUIDE.md`**: Comprehensive usage guide
38
- - ✅ **`ENVIRONMENT_VARIABLES.md`**: Environment variable reference
39
- - ✅ **`HF_DATASETS_GUIDE.md`**: Detailed HF Datasets guide
40
-
41
- ## 🔧 Environment Variables
42
-
43
- | Variable | Required | Default | Description |
44
- |----------|----------|---------|-------------|
45
- | `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token |
46
- | `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository |
47
- | `TRACKIO_URL` | ❌ No | None | Trackio server URL |
48
- | `TRACKIO_TOKEN` | ❌ No | None | Trackio authentication token |
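As an illustration of how the table above might be applied, the following sketch reads the four variables with the listed defaults. `MonitoringSettings` and `load_settings` are hypothetical names introduced here, not part of `monitoring.py`; the environment is passed in as a mapping so the function can be exercised without touching the real process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_DATASET_REPO = "tonic/trackio-experiments"


@dataclass(frozen=True)
class MonitoringSettings:
    hf_token: str
    dataset_repo: str
    trackio_url: Optional[str]
    trackio_token: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> MonitoringSettings:
    """Build settings from environment variables, applying the documented defaults."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # HF_TOKEN is the only required variable in the table.
        raise ValueError("HF_TOKEN is required but not set")
    return MonitoringSettings(
        hf_token=token,
        dataset_repo=env.get("TRACKIO_DATASET_REPO") or DEFAULT_DATASET_REPO,
        trackio_url=env.get("TRACKIO_URL") or None,
        trackio_token=env.get("TRACKIO_TOKEN") or None,
    )


settings = load_settings({"HF_TOKEN": "hf_example"})
print(settings.dataset_repo)
```

Failing early on a missing token gives a clearer error than a rejected upload halfway through training.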
49
-
50
- ## 📊 What Gets Monitored
51
-
52
- ### **Training Metrics**
53
- - Loss values (training and validation)
54
- - Learning rate
55
- - Gradient norms
56
- - Training steps and epochs
57
-
58
- ### **System Metrics**
59
- - GPU memory usage
60
- - GPU utilization
61
- - CPU usage
62
- - Memory usage
63
-
64
- ### **Experiment Data**
65
- - Configuration parameters
66
- - Model checkpoints
67
- - Evaluation results
68
- - Training summaries
69
-
70
- ### **Artifacts**
71
- - Configuration files
72
- - Training logs
73
- - Evaluation results
74
- - Model checkpoints
75
-
76
- ## 🚀 Usage Examples
77
-
78
- ### **Basic Training**
79
- ```bash
80
- # Set environment variables
81
- export HF_TOKEN=your_token_here
82
- export TRACKIO_DATASET_REPO=your-username/experiments
83
-
84
- # Run training with monitoring
85
- python train.py config/train_smollm3_openhermes_fr.py
86
- ```
87
-
88
- ### **Advanced Configuration**
89
- ```bash
90
- # Train with custom settings
91
- python train.py config/train_smollm3_openhermes_fr.py \
92
- --experiment_name "smollm3_french_v2" \
93
- --hf_token your_token_here \
94
- --dataset_repo your-username/french-experiments
95
- ```
96
-
97
- ### **Testing Setup**
98
- ```bash
99
- # Test configuration
100
- python configure_trackio.py
101
-
102
- # Test monitoring integration
103
- python test_monitoring_integration.py
104
-
105
- # Test dataset access
106
- python test_hf_datasets.py
107
- ```
108
-
109
- ## 📈 Benefits
110
-
111
- ### **For HF Spaces Deployment**
112
- - ✅ **Persistent Storage**: Data survives Space restarts
113
- - ✅ **No Local Storage**: No dependency on ephemeral storage
114
- - ✅ **Scalable**: Works with any dataset size
115
- - ✅ **Secure**: Private dataset storage
116
-
117
- ### **For Experiment Management**
118
- - ✅ **Centralized**: All experiments in one place
119
- - ✅ **Searchable**: Easy to find specific experiments
120
- - ✅ **Versioned**: Dataset versioning for experiments
121
- - ✅ **Collaborative**: Share experiments with team
122
-
123
- ### **For Development**
124
- - ✅ **Flexible**: Easy to switch between datasets
125
- - ✅ **Configurable**: Environment-based configuration
126
- - ✅ **Robust**: Fallback mechanisms
127
- - ✅ **Debuggable**: Comprehensive logging
128
-
129
- ## 🧪 Testing Results
130
-
131
- All monitoring integration tests passed:
132
- - ✅ Module Import
133
- - ✅ Monitor Creation
134
- - ✅ Config Creation
135
- - ✅ Metrics Logging
136
- - ✅ Configuration Logging
137
- - ✅ System Metrics
138
- - ✅ Training Summary
139
- - ✅ Callback Creation
140
-
141
- ## 📋 Files Modified/Created
142
-
143
- ### **Core Files**
144
- - `monitoring.py` - Enhanced with HF Datasets support
145
- - `train.py` - Updated with monitoring integration
146
- - `requirements_core.txt` - Added monitoring dependencies
147
- - `requirements_space.txt` - Updated for HF Spaces
148
-
149
- ### **Configuration Files**
150
- - `config/train_smollm3.py` - Added HF Datasets config
151
- - `config/train_smollm3_openhermes_fr.py` - Added HF Datasets config
152
- - `config/train_smollm3_openhermes_fr_a100_balanced.py` - Added HF Datasets config
153
- - `config/train_smollm3_openhermes_fr_a100_large.py` - Added HF Datasets config
154
- - `config/train_smollm3_openhermes_fr_a100_max_performance.py` - Added HF Datasets config
155
- - `config/train_smollm3_openhermes_fr_a100_multiple_passes.py` - Added HF Datasets config
156
-
157
- ### **New Utility Scripts**
158
- - `configure_trackio.py` - Configuration testing
159
- - `integrate_monitoring.py` - Automated integration
160
- - `test_monitoring_integration.py` - Comprehensive testing
161
- - `setup_hf_dataset.py` - Dataset setup
162
-
163
- ### **Documentation**
164
- - `MONITORING_INTEGRATION_GUIDE.md` - Usage guide
165
- - `ENVIRONMENT_VARIABLES.md` - Environment reference
166
- - `HF_DATASETS_GUIDE.md` - HF Datasets guide
167
- - `MONITORING_IMPROVEMENTS_SUMMARY.md` - This summary
168
-
169
- ## 🎯 Next Steps
170
-
171
- 1. **Set up your HF token and dataset repository**
172
- 2. **Test the configuration with `python configure_trackio.py`**
173
- 3. **Run a training experiment to verify full functionality**
174
- 4. **Check your HF Dataset repository for experiment data**
175
- 5. **View results in your Trackio interface**
176
-
177
- ## 🔍 Troubleshooting
178
-
179
- ### **Common Issues**
180
- - **HF_TOKEN not set**: Set your Hugging Face token
181
- - **Dataset access failed**: Check token permissions and repository existence
182
- - **Monitoring not working**: Run `python test_monitoring_integration.py` to diagnose
183
-
184
- ### **Getting Help**
185
- - Check the comprehensive guides in the documentation files
186
- - Run the test scripts to verify your setup
187
- - Check logs for specific error messages
188
-
189
- ---
190
-
191
- **🎉 The monitoring system is now ready for production use with persistent HF Datasets storage!**
docs/MONITORING_INTEGRATION_GUIDE.md DELETED
@@ -1,245 +0,0 @@
1
- # 🔧 Improved Monitoring Integration Guide
2
-
3
- ## Overview
4
-
5
- The monitoring system now supports **Hugging Face Datasets** for persistent experiment storage, which suits deployment on Hugging Face Spaces and other cloud environments where local storage is ephemeral.
6
-
7
- ## 🚀 Key Improvements
8
-
9
- ### 1. **HF Datasets Integration**
10
- - ✅ **Persistent Storage**: Experiments are saved to HF Datasets repositories
11
- - ✅ **Environment Variables**: Configurable via `HF_TOKEN` and `TRACKIO_DATASET_REPO`
12
- - ✅ **Fallback Support**: Graceful degradation if HF Datasets unavailable
13
- - ✅ **Automatic Backup**: Local files as backup
14
-
15
- ### 2. **Enhanced Monitoring Features**
16
- - 📊 **Real-time Metrics**: Training metrics logged to both Trackio and HF Datasets
17
- - 🔧 **System Metrics**: GPU memory, CPU usage, and system performance
18
- - 📈 **Training Summaries**: Comprehensive experiment summaries
19
- - 🛡️ **Error Handling**: Robust error logging and recovery
20
-
21
- ### 3. **Easy Integration**
22
- - 🔌 **Automatic Setup**: Environment variables automatically detected
23
- - 📝 **Configuration**: Simple setup with environment variables
24
- - 🔄 **Backward Compatible**: Works with existing Trackio setup
25
-
26
- ## 📋 Environment Variables
27
-
28
- | Variable | Required | Default | Description |
29
- |----------|----------|---------|-------------|
30
- | `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token |
31
- | `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository |
32
- | `TRACKIO_URL` | ❌ No | None | Trackio server URL |
33
- | `TRACKIO_TOKEN` | ❌ No | None | Trackio authentication token |
34
-
35
- ## 🛠️ Setup Instructions
36
-
37
- ### 1. **Get Your HF Token**
38
- ```bash
39
- # Go to https://huggingface.co/settings/tokens
40
- # Create a new token with "Write" permissions
41
- # Copy the token
42
- ```
43
-
44
- ### 2. **Set Environment Variables**
45
- ```bash
46
- # For HF Spaces, add these to your Space settings:
47
- HF_TOKEN=your_hf_token_here
48
- TRACKIO_DATASET_REPO=your-username/your-dataset-name
49
-
50
- # For local development:
51
- export HF_TOKEN=your_hf_token_here
52
- export TRACKIO_DATASET_REPO=your-username/your-dataset-name
53
- ```
54
-
55
- ### 3. **Create Dataset Repository**
56
- ```bash
57
- # Run the setup script
58
- python setup_hf_dataset.py
59
-
60
- # Or manually create a dataset on HF Hub
61
- # Go to https://huggingface.co/datasets
62
- # Create a new dataset repository
63
- ```
64
-
65
- ### 4. **Test Configuration**
66
- ```bash
67
- # Test your setup
68
- python configure_trackio.py
69
-
70
- # Test dataset access
71
- python test_hf_datasets.py
72
- ```
73
-
74
- ## 🚀 Usage Examples
75
-
76
- ### **Basic Training with Monitoring**
77
- ```bash
78
- # Train with default monitoring
79
- python train.py config/train_smollm3_openhermes_fr.py
80
-
81
- # Train with custom dataset repository
82
- TRACKIO_DATASET_REPO=your-username/smollm3-experiments python train.py config/train_smollm3_openhermes_fr.py
83
- ```
84
-
85
- ### **Advanced Training Configuration**
86
- ```bash
87
- # Train with custom experiment name
88
- python train.py config/train_smollm3_openhermes_fr.py \
89
- --experiment_name "smollm3_french_tuning_v2" \
90
- --hf_token your_token_here \
91
- --dataset_repo your-username/french-experiments
92
- ```
93
-
94
- ### **Training Scripts with Monitoring**
95
- ```bash
96
- # All training scripts now support monitoring:
97
- python train.py config/train_smollm3_openhermes_fr_a100_balanced.py
98
- python train.py config/train_smollm3_openhermes_fr_a100_large.py
99
- python train.py config/train_smollm3_openhermes_fr_a100_max_performance.py
100
- python train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
101
- ```
102
-
103
- ## 📊 What Gets Monitored
104
-
105
- ### **Training Metrics**
106
- - Loss values (training and validation)
107
- - Learning rate
108
- - Gradient norms
109
- - Training steps and epochs
110
-
111
- ### **System Metrics**
112
- - GPU memory usage
113
- - GPU utilization
114
- - CPU usage
115
- - Memory usage
116
-
117
- ### **Experiment Data**
118
- - Configuration parameters
119
- - Model checkpoints
120
- - Evaluation results
121
- - Training summaries
122
-
123
- ### **Artifacts**
124
- - Configuration files
125
- - Training logs
126
- - Evaluation results
127
- - Model checkpoints
128
-
129
- ## 🔍 Viewing Results
130
-
131
- ### **1. Trackio Interface**
132
- - Visit your Trackio Space
133
- - Navigate to "Experiments" tab
134
- - View real-time metrics and plots
135
-
136
- ### **2. HF Dataset Repository**
137
- - Go to your dataset repository on HF Hub
138
- - Browse experiment data
139
- - Download experiment files
140
-
141
- ### **3. Local Files**
142
- - Check local backup files
143
- - Review training logs
144
- - Examine configuration files
145
-
146
- ## 🛠️ Configuration Examples
147
-
148
- ### **Default Setup**
149
- ```python
150
- # Uses default dataset: tonic/trackio-experiments
151
- # Requires only HF_TOKEN
152
- ```
153
-
154
- ### **Personal Dataset**
155
- ```bash
156
- export HF_TOKEN=your_token_here
157
- export TRACKIO_DATASET_REPO=your-username/trackio-experiments
158
- ```
159
-
160
- ### **Team Dataset**
161
- ```bash
162
- export HF_TOKEN=your_token_here
163
- export TRACKIO_DATASET_REPO=your-org/team-experiments
164
- ```
165
-
166
- ### **Project-Specific Dataset**
167
- ```bash
168
- export HF_TOKEN=your_token_here
169
- export TRACKIO_DATASET_REPO=your-username/smollm3-experiments
170
- ```
171
-
172
- ## 🔧 Troubleshooting
173
-
174
- ### **Issue: "HF_TOKEN not found"**
175
- ```bash
176
- # Solution: Set your HF token
177
- export HF_TOKEN=your_token_here
178
- # Or add to HF Space environment variables
179
- ```
180
-
181
- ### **Issue: "Failed to load dataset"**
182
- ```bash
183
- # Solutions:
184
- # 1. Check token has read access
185
- # 2. Verify dataset repository exists
186
- # 3. Run setup script: python setup_hf_dataset.py
187
- ```
188
-
189
- ### **Issue: "Failed to save experiments"**
190
- ```bash
191
- # Solutions:
192
- # 1. Check token has write permissions
193
- # 2. Verify dataset repository exists
194
- # 3. Check network connectivity
195
- ```
196
-
197
- ### **Issue: "Monitoring not working"**
198
- ```bash
199
- # Solutions:
200
- # 1. Check environment variables
201
- # 2. Run configuration test: python configure_trackio.py
202
- # 3. Check logs for specific errors
203
- ```
204
-
205
- ## 📈 Benefits
206
-
207
- ### **For HF Spaces Deployment**
208
- - ✅ **Persistent Storage**: Data survives Space restarts
209
- - ✅ **No Local Storage**: No dependency on ephemeral storage
210
- - ✅ **Scalable**: Works with any dataset size
211
- - ✅ **Secure**: Private dataset storage
212
-
213
- ### **For Experiment Management**
214
- - ✅ **Centralized**: All experiments in one place
215
- - ✅ **Searchable**: Easy to find specific experiments
216
- - ✅ **Versioned**: Dataset versioning for experiments
217
- - ✅ **Collaborative**: Share experiments with team
218
-
219
- ### **For Development**
220
- - ✅ **Flexible**: Easy to switch between datasets
221
- - ✅ **Configurable**: Environment-based configuration
222
- - ✅ **Robust**: Fallback mechanisms
223
- - ✅ **Debuggable**: Comprehensive logging
224
-
225
- ## 🎯 Next Steps
226
-
227
- 1. **Set up your HF token and dataset repository**
228
- 2. **Test the configuration with `python configure_trackio.py`**
229
- 3. **Run a training experiment to verify monitoring**
230
- 4. **Check your HF Dataset repository for experiment data**
231
- 5. **View results in your Trackio interface**
232
-
233
- ## 📚 Related Files
234
-
235
- - `monitoring.py` - Enhanced monitoring with HF Datasets support
236
- - `train.py` - Updated training script with monitoring integration
237
- - `configure_trackio.py` - Configuration and testing script
238
- - `setup_hf_dataset.py` - Dataset repository setup
239
- - `test_hf_datasets.py` - Dataset access testing
240
- - `ENVIRONMENT_VARIABLES.md` - Environment variable reference
241
- - `HF_DATASETS_GUIDE.md` - Detailed HF Datasets guide
242
-
243
- ---
244
-
245
- **🎉 Your experiments are now persistently stored and easily accessible!**
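The environment-variable table in the deleted guide above (required `HF_TOKEN`, default `tonic/trackio-experiments` for `TRACKIO_DATASET_REPO`, optional `TRACKIO_URL` and `TRACKIO_TOKEN`) could be resolved roughly as follows. This is only a sketch: `MonitoringEnv` and `load_monitoring_env` are hypothetical names chosen for illustration and are not taken from `monitoring.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default repository named in the guide's environment-variable table.
DEFAULT_DATASET_REPO = "tonic/trackio-experiments"


@dataclass(frozen=True)
class MonitoringEnv:
    """Hypothetical container for the four documented variables."""
    hf_token: str
    dataset_repo: str
    trackio_url: Optional[str]
    trackio_token: Optional[str]


def load_monitoring_env(env: Optional[Mapping[str, str]] = None) -> MonitoringEnv:
    """Read the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    hf_token = env.get("HF_TOKEN")
    if not hf_token:
        # Mirrors the "HF_TOKEN not found" troubleshooting entry.
        raise RuntimeError(
            "HF_TOKEN not found; export HF_TOKEN or add it to the Space settings"
        )
    return MonitoringEnv(
        hf_token=hf_token,
        dataset_repo=env.get("TRACKIO_DATASET_REPO") or DEFAULT_DATASET_REPO,
        trackio_url=env.get("TRACKIO_URL") or None,
        trackio_token=env.get("TRACKIO_TOKEN") or None,
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to exercise in a test, for example `load_monitoring_env({"HF_TOKEN": "hf_xxx"})`.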
 
docs/MONITORING_VERIFICATION_REPORT.md DELETED
@@ -1,163 +0,0 @@
1
- # Monitoring Verification Report
2
-
3
- ## Overview
4
-
5
- This document verifies that `src/monitoring.py` is fully compatible with the actual deployed Trackio space and all monitoring components.
6
-
7
- ## ✅ **VERIFICATION STATUS: ALL TESTS PASSED**
8
-
9
- ### **Trackio Space Deployment Verification**
10
-
11
- The actual deployed Trackio space at `https://tonic-trackio-monitoring-20250726.hf.space` provides the following API endpoints:
12
-
13
- #### **Available API Endpoints**
14
- 1. ✅ `/update_trackio_config` - Update configuration
15
- 2. ✅ `/test_dataset_connection` - Test dataset connection
16
- 3. ✅ `/create_dataset_repository` - Create dataset repository
17
- 4. ✅ `/create_experiment_interface` - Create experiment
18
- 5. ✅ `/log_metrics_interface` - Log metrics
19
- 6. ✅ `/log_parameters_interface` - Log parameters
20
- 7. ✅ `/get_experiment_details` - Get experiment details
21
- 8. ✅ `/list_experiments_interface` - List experiments
22
- 9. ✅ `/create_metrics_plot` - Create metrics plot
23
- 10. ✅ `/create_experiment_comparison` - Compare experiments
24
- 11. ✅ `/simulate_training_data` - Simulate training data
25
- 12. ✅ `/create_demo_experiment` - Create demo experiment
26
- 13. ✅ `/update_experiment_status_interface` - Update status
27
-
28
- ### **Monitoring.py Compatibility Verification**
29
-
30
- #### **✅ Dataset Structure Compatibility**
31
- - **Field Structure**: All 10 fields match between monitoring.py and actual dataset
32
- - `experiment_id`, `name`, `description`, `created_at`, `status`
33
- - `metrics`, `parameters`, `artifacts`, `logs`, `last_updated`
34
- - **Metrics Structure**: All 17 metrics fields compatible
35
- - `loss`, `grad_norm`, `learning_rate`, `num_tokens`, `mean_token_accuracy`
36
- - `epoch`, `total_tokens`, `throughput`, `step_time`, `batch_size`
37
- - `seq_len`, `token_acc`, `gpu_memory_allocated`, `gpu_memory_reserved`
38
- - `gpu_utilization`, `cpu_percent`, `memory_percent`
39
- - **Parameters Structure**: All 11 parameters fields compatible
40
- - `model_name`, `max_seq_length`, `batch_size`, `learning_rate`, `epochs`
41
- - `dataset`, `trainer_type`, `hardware`, `mixed_precision`
42
- - `gradient_checkpointing`, `flash_attention`
43
-
44
- #### **✅ Trackio API Client Compatibility**
45
- - **Available Methods**: All 7 methods working correctly
46
- - `create_experiment` ✅
47
- - `log_metrics` ✅
48
- - `log_parameters` ✅
49
- - `get_experiment_details` ✅
50
- - `list_experiments` ✅
51
- - `update_experiment_status` ✅
52
- - `simulate_training_data` ✅
53
-
54
- #### **✅ Monitoring Variables Verification**
55
- - **Core Variables**: All 10 variables present and working
56
- - `experiment_id`, `experiment_name`, `start_time`, `metrics_history`, `artifacts`
57
- - `trackio_client`, `hf_dataset_client`, `dataset_repo`, `hf_token`, `enable_tracking`
58
- - **Core Methods**: All 7 methods present and working
59
- - `log_metrics`, `log_configuration`, `log_model_checkpoint`, `log_evaluation_results`
60
- - `log_system_metrics`, `log_training_summary`, `create_monitoring_callback`
61
-
62
- #### **✅ Integration Verification**
63
- - **Monitor Creation**: ✅ Working perfectly
64
- - **Attribute Verification**: ✅ All 7 expected attributes present
65
- - **Dataset Repository**: ✅ Properly set and validated
66
- - **Enable Tracking**: ✅ Correctly configured
67
-
68
- ### **Key Compatibility Features**
69
-
70
- #### **1. Dataset Structure Alignment**
71
- ```python
72
- # monitoring.py uses the exact structure from setup_hf_dataset.py
73
- dataset_data = [{
74
- 'experiment_id': self.experiment_id or f"exp_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
75
- 'name': self.experiment_name,
76
- 'description': "SmolLM3 fine-tuning experiment",
77
- 'created_at': self.start_time.isoformat(),
78
- 'status': 'running',
79
- 'metrics': json.dumps(self.metrics_history),
80
- 'parameters': json.dumps(experiment_data),
81
- 'artifacts': json.dumps(self.artifacts),
82
- 'logs': json.dumps([]),
83
- 'last_updated': datetime.now().isoformat()
84
- }]
85
- ```
86
-
87
- #### **2. Trackio Space Integration**
88
- ```python
89
- # Uses only available methods from deployed space
90
- self.trackio_client.log_metrics(experiment_id, metrics, step)
91
- self.trackio_client.log_parameters(experiment_id, parameters)
92
- self.trackio_client.list_experiments()
93
- self.trackio_client.update_experiment_status(experiment_id, status)
94
- ```
95
-
96
- #### **3. Error Handling**
97
- ```python
98
- # Graceful fallback when Trackio space is unavailable
99
- try:
100
- result = self.trackio_client.list_experiments()
101
- if result.get('error'):
102
- logger.warning(f"Trackio Space not accessible: {result['error']}")
103
- self.enable_tracking = False
104
- return
105
- except Exception as e:
106
- logger.warning(f"Trackio Space not accessible: {e}")
107
- self.enable_tracking = False
108
- ```
109
-
110
- ### **Verification Test Results**
111
-
112
- ```
113
- 🚀 Monitoring Verification Tests
114
- ==================================================
115
- ✅ Dataset structure: Compatible
116
- ✅ Trackio space: Compatible
117
- ✅ Monitoring variables: Correct
118
- ✅ API client: Compatible
119
- ✅ Integration: Working
120
- ✅ Structure compatibility: Verified
121
- ✅ Space compatibility: Verified
122
-
123
- 🎉 ALL MONITORING VERIFICATION TESTS PASSED!
124
- Monitoring.py is fully compatible with all components!
125
- ```
126
-
127
- ### **Deployed Trackio Space API Endpoints**
128
-
129
- The actual deployed space provides these endpoints that monitoring.py can use:
130
-
131
- #### **Core Experiment Management**
132
- - `POST /create_experiment_interface` - Create new experiments
133
- - `POST /log_metrics_interface` - Log training metrics
134
- - `POST /log_parameters_interface` - Log experiment parameters
135
- - `GET /list_experiments_interface` - List all experiments
136
- - `POST /update_experiment_status_interface` - Update experiment status
137
-
138
- #### **Configuration & Setup**
139
- - `POST /update_trackio_config` - Update HF token and dataset repo
140
- - `POST /test_dataset_connection` - Test dataset connectivity
141
- - `POST /create_dataset_repository` - Create HF dataset repository
142
-
143
- #### **Analysis & Visualization**
144
- - `POST /create_metrics_plot` - Generate metric plots
145
- - `POST /create_experiment_comparison` - Compare multiple experiments
146
- - `POST /get_experiment_details` - Get detailed experiment info
147
-
148
- #### **Testing & Demo**
149
- - `POST /simulate_training_data` - Generate demo training data
150
- - `POST /create_demo_experiment` - Create demonstration experiments
151
-
152
- ### **Conclusion**
153
-
154
- **✅ MONITORING.PY IS FULLY COMPATIBLE WITH THE ACTUAL DEPLOYED TRACKIO SPACE**
155
-
156
- The monitoring system has been verified to work correctly with:
157
- - ✅ All actual API endpoints from the deployed Trackio space
158
- - ✅ Complete dataset structure compatibility
159
- - ✅ Proper error handling and fallback mechanisms
160
- - ✅ All monitoring variables and methods working correctly
161
- - ✅ Seamless integration with HF Datasets and Trackio space
162
-
163
- **The monitoring.py file is production-ready and fully compatible with the actual deployed Trackio space!** 🚀
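The ten-field dataset record described in the deleted report above stores nested values (`metrics`, `parameters`, `artifacts`, `logs`) as JSON strings. A standalone sketch of that layout is shown below; `build_experiment_record` and its arguments are hypothetical and only mirror the field names quoted in the report, not the actual method in `monitoring.py`.

```python
import json
from datetime import datetime
from typing import Any, Dict, List, Optional


def build_experiment_record(
    name: str,
    parameters: Dict[str, Any],
    metrics_history: List[Dict[str, Any]],
    artifacts: List[str],
    started_at: datetime,
    experiment_id: Optional[str] = None,
    status: str = "running",
    now: Optional[datetime] = None,
) -> Dict[str, str]:
    """Build one row with the ten fields listed in the report."""
    now = now or datetime.now()
    return {
        # Fall back to a timestamp-based id when none was assigned.
        "experiment_id": experiment_id or f"exp_{now.strftime('%Y%m%d_%H%M%S')}",
        "name": name,
        "description": "SmolLM3 fine-tuning experiment",
        "created_at": started_at.isoformat(),
        "status": status,
        # Nested structures are serialized so every column is a flat string.
        "metrics": json.dumps(metrics_history),
        "parameters": json.dumps(parameters),
        "artifacts": json.dumps(artifacts),
        "logs": json.dumps([]),
        "last_updated": now.isoformat(),
    }
```

Keeping every column a string is one way to avoid schema drift between runs that log different metric keys; readers decode the columns with `json.loads`.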
 
docs/Model_Abstraction.md ADDED
@@ -0,0 +1,36 @@
1
+ ```mermaid
2
+ graph LR
3
+ EntryPoint["EntryPoint"]
4
+ Model_Abstraction["Model Abstraction"]
5
+ EntryPoint -- "initiates model loading in" --> Model_Abstraction
6
+ click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
7
+ ```
8
+
9
+ [![CodeBoarding](https://img.shields.io/badge/Generated%20by-CodeBoarding-9cf?style=flat-square)](https://github.com/CodeBoarding/GeneratedOnBoardings)[![Demo](https://img.shields.io/badge/Try%20our-Demo-blue?style=flat-square)](https://www.codeboarding.org/demo)[![Contact](https://img.shields.io/badge/Contact%20us%20-%[email protected]?style=flat-square)](mailto:[email protected])
10
+
11
+ ## Details
12
+
13
+ Updated analysis to include EntryPoint component and clarify its interaction with Model Abstraction.
14
+
15
+ ### EntryPoint
16
+ This component represents the primary execution flow of the `smollm3_finetune` application. It is responsible for initializing the application, parsing configuration, and orchestrating the high-level tasks such as initiating the model loading process and potentially the training or inference loops. It acts as the user-facing interface or the main script that kicks off the application's operations.
17
+
18
+
19
+ **Related Classes/Methods**:
20
+
21
+ - `smollm3_finetune.main` (1:1)
22
+
23
+
24
+ ### Model Abstraction [[Expand]](./Model_Abstraction.md)
25
+ This component is responsible for encapsulating the complex logic of loading pre-trained models, defining their architectures, and managing various model variants such as quantization and LoRA adapters. It provides a unified and consistent interface for interacting with different model configurations, ensuring that the core training logic can operate seamlessly regardless of the underlying model specifics. This abstraction is crucial for maintaining modularity and flexibility within the machine learning training and fine-tuning framework.
26
+
27
+
28
+ **Related Classes/Methods**:
29
+
30
+ - `smollm3_finetune.model` (1:1)
31
+ - `smollm3_finetune.model.load_model` (1:1)
32
+
33
+
34
+
35
+
36
+ ### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
docs/NO_THINK_TAG_GUIDE.md DELETED
@@ -1,146 +0,0 @@
1
- # SmolLM3 `/no_think` Tag Implementation Guide
2
-
3
- ## The Problem
4
-
5
- You were using the `enable_thinking` parameter in the chat template configuration, which is **incorrect** for SmolLM3. The `/no_think` tag should be added as a **system message** in your training data, not as a configuration parameter.
6
-
7
- ### What was wrong:
8
-
9
- ```python
10
- # ❌ INCORRECT - This doesn't work for SmolLM3
11
- chat_template_kwargs={
12
- "enable_thinking": False, # This parameter doesn't exist in SmolLM3
13
- "add_generation_prompt": True
14
- }
15
- ```
16
-
17
- ### What's correct:
18
-
19
- ```python
20
- # ✅ CORRECT - Add /no_think as system message
21
- messages = [
22
- {"role": "system", "content": "You are a helpful assistant. /no_think"},
23
- {"role": "user", "content": "What is machine learning?"},
24
- {"role": "assistant", "content": "Machine learning is..."}
25
- ]
26
- ```
27
-
28
- ## The Solution
29
-
30
- ### 1. Updated Data Processing
31
-
32
- The `data.py` file now properly handles the `/no_think` tag by:
33
-
34
- - Adding a system message with `/no_think` when `no_think_system_message=True`
35
- - Using the correct chat template parameters
36
- - Properly formatting messages for SmolLM3
37
-
38
- ### 2. Updated Configuration
39
-
40
- All configuration files now use the correct parameter:
41
-
42
- ```python
43
- chat_template_kwargs={
44
- "add_generation_prompt": True,
45
- "no_think_system_message": True # Set to True to add /no_think tag
46
- }
47
- ```
48
-
49
- ### 3. How It Works
50
-
51
- When `no_think_system_message=True`, the system automatically adds:
52
-
53
- ```
54
- {"role": "system", "content": "You are a helpful assistant. /no_think"}
55
- ```
56
-
57
- as the first message in each conversation.
58
-
59
- ## Testing the Fix
60
-
61
- ### 1. Run the Test Script
62
-
63
- ```bash
64
- python test_no_think.py
65
- ```
66
-
67
- This will show you the difference between:
68
- - Messages with `/no_think` tag
69
- - Messages without `/no_think` tag
70
-
71
- ### 2. Test with Training
72
-
73
- ```bash
74
- python train.py config/train_smollm3_no_think_test.py --dataset_dir=test_data
75
- ```
76
-
77
- This uses a configuration that enables the `/no_think` tag.
78
-
79
- ## Configuration Options
80
-
81
- ### Enable `/no_think` tag:
82
- ```python
83
- chat_template_kwargs={
84
- "add_generation_prompt": True,
85
- "no_think_system_message": True # Adds /no_think to system message
86
- }
87
- ```
88
-
89
- ### Disable `/no_think` tag (allow thinking):
90
- ```python
91
- chat_template_kwargs={
92
- "add_generation_prompt": True,
93
- "no_think_system_message": False # No /no_think tag
94
- }
95
- ```
96
-
97
- ## When to Use `/no_think`
98
-
99
- - **Use `/no_think`** for tasks that require direct, concise responses
100
- - **Don't use `/no_think`** for tasks that benefit from reasoning and step-by-step thinking
101
-
102
- ### Examples:
103
-
104
- **With `/no_think` (direct answers):**
105
- ```
106
- System: You are a helpful assistant. /no_think
107
- User: What is 2+2?
108
- Assistant: 4
109
- ```
110
-
111
- **Without `/no_think` (reasoning allowed):**
112
- ```
113
- System: You are a helpful assistant.
114
- User: Solve this math problem step by step: 15 * 7
115
- Assistant: Let me solve this step by step:
116
- 1. First, I'll break down 15 * 7
117
- 2. 15 * 7 = (10 + 5) * 7
118
- 3. = 10 * 7 + 5 * 7
119
- 4. = 70 + 35
120
- 5. = 105
121
- The answer is 105.
122
- ```
123
-
124
- ## Updated Files
125
-
126
- The following files were updated to fix the `/no_think` tag issue:
127
-
128
- 1. `data.py` - Updated `format_chat_template` function
129
- 2. `config/train_smollm3.py` - Updated default configuration
130
- 3. `config/train_smollm3_openhermes_fr.py` - Updated configuration
131
- 4. `config/train_smollm3_long_context.py` - Updated configuration
132
- 5. `config/runpod_config.py` - Updated configuration
133
- 6. All A100 configuration files - Updated configurations
134
-
135
- ## Verification
136
-
137
- To verify the fix is working:
138
-
139
- 1. Check that system messages include `/no_think` when `no_think_system_message=True`
140
- 2. Verify that the chat template is applied correctly
141
- 3. Test with actual training to ensure the model learns the `/no_think` behavior
142
-
143
- ## References
144
-
145
- - [SmolLM3 Model Card](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
146
- - [SmolLM3 Documentation](https://huggingface.co/docs/transformers/model_doc/smollm3)
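The behaviour the deleted guide describes for `no_think_system_message=True` (a system message carrying `/no_think` placed first in each conversation) can be sketched as below. The function name `apply_no_think` is hypothetical, and the handling of conversations that already start with a system message is an assumption; the guide only describes the case where the message is added.

```python
from typing import Dict, List

# System message text quoted in the guide.
NO_THINK_SYSTEM = "You are a helpful assistant. /no_think"


def apply_no_think(
    messages: List[Dict[str, str]], no_think_system_message: bool
) -> List[Dict[str, str]]:
    """Return a copy of the conversation with /no_think applied if requested."""
    if not no_think_system_message:
        return list(messages)
    if messages and messages[0]["role"] == "system":
        # Assumption: reuse an existing system message rather than adding a second one.
        first = dict(messages[0])
        if "/no_think" not in first["content"]:
            first["content"] = first["content"].rstrip() + " /no_think"
        return [first] + list(messages[1:])
    return [{"role": "system", "content": NO_THINK_SYSTEM}] + list(messages)
```

The input list is never mutated, so the same raw conversation can be formatted both with and without the tag when comparing outputs.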
 
docs/PIPELINE_SUMMARY.md DELETED
@@ -1,330 +0,0 @@
1
- # SmolLM3 End-to-End Pipeline - Implementation Summary
2
-
3
- This document summarizes the comprehensive refactoring and enhancement of the SmolLM3 fine-tuning codebase to create a complete end-to-end pipeline.
4
-
5
- ## 🎯 Overview
6
-
7
- The pipeline now provides a complete solution from Trackio Space deployment to model push, with integrated monitoring, dataset management, and automated deployment.
8
-
9
- ## 📁 Files Created/Modified
10
-
11
- ### **Core Pipeline Files**
12
-
13
- 1. **`launch.sh`** - Complete end-to-end pipeline script
14
- - 16-step comprehensive pipeline
15
- - Automated environment setup
16
- - Integrated monitoring and deployment
17
- - Dynamic configuration generation
18
-
19
- 2. **`setup_launch.py`** - User configuration helper
20
- - Interactive setup for user credentials
21
- - Automatic script configuration
22
- - Requirements checker generation
23
-
24
- 3. **`test_pipeline.py`** - Comprehensive testing suite
25
- - Import testing
26
- - Component verification
27
- - CUDA and HF token validation
28
-
29
- 4. **`README_END_TO_END.md`** - Complete documentation
30
- - Step-by-step usage guide
31
- - Troubleshooting section
32
- - Advanced configuration options
33
-
34
- ### **Scripts and Utilities**
35
-
36
- 5. **`scripts/trackio_tonic/trackio_api_client.py`** - API client for Trackio
37
- - Complete API client implementation
38
- - Error handling and retry logic
39
- - Support for both JSON and SSE responses
40
-
41
- 6. **`scripts/trackio_tonic/deploy_trackio_space.py`** - Space deployment
42
- - Automated HF Space creation
43
- - File upload and configuration
44
- - Space testing and validation
45
-
46
- 7. **`scripts/trackio_tonic/configure_trackio.py`** - Configuration helper
47
- - Environment variable setup
48
- - Dataset repository configuration
49
- - Usage examples and validation
50
-
51
- 8. **`scripts/model_tonic/push_to_huggingface.py`** - Model deployment
52
- - Complete model upload pipeline
53
- - Model card generation
54
- - Training results documentation
55
-
56
- 9. **`scripts/dataset_tonic/setup_hf_dataset.py`** - Dataset setup
57
- - HF Dataset repository creation
58
- - Initial experiment data structure
59
- - Dataset access configuration
60
-
61
- ### **Source Code Updates**
62
-
63
- 10. **`src/monitoring.py`** - Enhanced monitoring
64
- - HF Datasets integration
65
- - Trackio API client integration
66
- - Comprehensive metrics logging
67
-
68
- 11. **`src/train.py`** - Updated training script
69
- - Monitoring integration
70
- - HF Datasets support
71
- - Enhanced error handling
72
-
73
- 12. **`src/config.py`** - Configuration management
74
- - Dynamic config loading
75
- - Multiple config type support
76
- - Fallback mechanisms
77
-
78
- 13. **`src/data.py`** - Enhanced dataset handling
79
- - Multiple format support
80
- - Automatic conversion
81
- - Bad entry filtering
82
-
83
- 14. **`src/model.py`** - Model wrapper
84
- - SmolLM3-specific optimizations
85
- - Flash attention support
86
- - Long context handling
87
-
88
- 15. **`src/trainer.py`** - Training orchestration
89
- - Monitoring callback integration
90
- - Enhanced logging
91
- - Checkpoint management
92
-
93
- ## 🔧 Key Improvements
94
-
95
- ### **1. Import Path Fixes**
96
- - Fixed all import paths to work with the refactored structure
97
- - Added proper sys.path handling for cross-module imports
98
- - Ensured compatibility between different script locations
99
-
100
- ### **2. Monitoring Integration**
101
- - **Trackio Space**: Real-time experiment tracking
102
- - **HF Datasets**: Persistent experiment storage
103
- - **System Metrics**: GPU, memory, and CPU monitoring
104
- - **Training Callbacks**: Automatic metric logging
105
-
106
- ### **3. Dataset Handling**
107
- - **Multi-format Support**: Prompt/completion, instruction/output, chat formats
108
- - **Automatic Conversion**: Handles different dataset structures
109
- - **Validation**: Ensures data quality and completeness
110
- - **Splitting**: Automatic train/validation/test splits
111
-
112
- ### **4. Configuration Management**
113
- - **Dynamic Generation**: Creates configs based on user input
114
- - **Multiple Types**: Support for different training configurations
115
- - **Environment Variables**: Proper integration with environment
116
- - **Validation**: Ensures configuration correctness
117
-
118
- ### **5. Deployment Automation**
119
- - **Model Upload**: Complete model push to HF Hub
120
- - **Model Cards**: Comprehensive documentation generation
121
- - **Training Results**: Complete experiment documentation
122
- - **Testing**: Automated model validation
123
-
124
- ## 🚀 Pipeline Steps
125
-
126
- The end-to-end pipeline performs these 16 steps:
127
-
128
- 1. **Environment Setup** - System dependencies and Python environment
129
- 2. **PyTorch Installation** - CUDA-enabled PyTorch installation
130
- 3. **Dependencies** - All required Python packages
131
- 4. **Authentication** - HF token setup and validation
132
- 5. **Trackio Deployment** - HF Space creation and configuration
133
- 6. **Dataset Setup** - HF Dataset repository creation
134
- 7. **Trackio Configuration** - Environment and dataset configuration
135
- 8. **Training Config** - Dynamic configuration generation
136
- 9. **Dataset Preparation** - Download and format conversion
137
- 10. **Parameter Calculation** - Training steps and batch calculations
138
- 11. **Training Execution** - Model fine-tuning with monitoring
139
- 12. **Model Push** - Upload to HF Hub with documentation
140
- 13. **Model Testing** - Validation of uploaded model
141
- 14. **Summary Report** - Complete training documentation
142
- 15. **Resource Links** - All online resource URLs
143
- 16. **Next Steps** - Usage instructions and recommendations
144
-
145
- ## 📊 Monitoring Features
146
-
147
- ### **Trackio Space Interface**
148
- - Real-time training metrics
149
- - Experiment comparison
150
- - System resource monitoring
151
- - Training progress visualization
152
-
153
- ### **HF Dataset Storage**
154
- - Persistent experiment data
155
- - Version-controlled history
156
- - Collaborative sharing
157
- - Automated backup
158
-
159
- ### **Comprehensive Logging**
160
- - Training metrics (loss, accuracy, etc.)
161
- - System metrics (GPU, memory, CPU)
162
- - Configuration parameters
163
- - Training artifacts
164
-
165
- ## 🔧 Configuration Options
166
-
167
- ### **User Configuration**
168
- ```bash
169
- # Required
170
- HF_TOKEN="your_token"
171
- HF_USERNAME="your_username"
172
-
173
- # Optional
174
- MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
175
- DATASET_NAME="HuggingFaceTB/smoltalk"
176
- ```
177
-
178
- ### **Training Parameters**
179
- ```bash
180
- BATCH_SIZE=2
181
- GRADIENT_ACCUMULATION_STEPS=8
182
- LEARNING_RATE=5e-6
183
- MAX_EPOCHS=3
184
- MAX_SEQ_LENGTH=4096
185
- ```
186
-
187
- ### **Monitoring Configuration**
188
- ```bash
189
- TRACKIO_DATASET_REPO="username/trackio-experiments"
190
- EXPERIMENT_NAME="smollm3_finetune_YYYYMMDD_HHMMSS"
191
- ```
192
-
193
- ## 🛠️ Error Handling
194
-
195
- ### **Comprehensive Error Handling**
196
- - Import error detection and reporting
197
- - Configuration validation
198
- - Network timeout handling
199
- - Graceful degradation
200
-
201
- ### **Debugging Support**
202
- - Detailed logging at all levels
203
- - Component-specific error messages
204
- - Fallback mechanisms
205
- - Testing utilities
206
-
207
- ## 📈 Performance Optimizations
208
-
209
- ### **Training Optimizations**
210
- - Flash Attention for efficiency
211
- - Gradient checkpointing for memory
212
- - Mixed precision training
213
- - Optimized data loading
214
-
215
- ### **Monitoring Optimizations**
216
- - Asynchronous logging
217
- - Batch metric updates
218
- - Efficient data storage
219
- - Minimal overhead
220
-
221
- ## 🔄 Integration Points
222
-
223
- ### **Hugging Face Ecosystem**
224
- - **HF Hub**: Model and dataset storage
225
- - **HF Spaces**: Trackio monitoring interface
226
- - **HF Datasets**: Experiment data persistence
227
- - **HF CLI**: Authentication and deployment
228
-
229
- ### **External Services**
230
- - **Trackio**: Experiment tracking
231
- - **CUDA**: GPU acceleration
232
- - **PyTorch**: Deep learning framework
233
- - **Transformers**: Model library
234
-
235
- ## 🎯 Usage Workflow
236
-
237
- ### **1. Setup Phase**
238
- ```bash
239
- python setup_launch.py # Configure with user info
240
- python test_pipeline.py # Verify all components
241
- ```
242
-
243
- ### **2. Execution Phase**
244
- ```bash
245
- chmod +x launch.sh # Make executable
246
- ./launch.sh # Run complete pipeline
247
- ```
248
-
249
- ### **3. Monitoring Phase**
250
- - Track progress in Trackio Space
251
- - Monitor metrics in real-time
252
- - Check logs for issues
253
- - Validate results
254
-
255
- ### **4. Results Phase**
256
- - Access model on HF Hub
257
- - Review training summary
258
- - Test model performance
259
- - Share results
260
-
261
- ## 📋 Quality Assurance
262
-
263
- ### **Testing Coverage**
264
- - Import testing for all modules
265
- - Script availability verification
266
- - Configuration validation
267
- - CUDA and token testing
268
- - Component integration testing
269
-
270
- ### **Documentation**
271
- - Comprehensive README
272
- - Step-by-step guides
273
- - Troubleshooting section
274
- - Advanced usage examples
275
-
276
- ### **Error Recovery**
277
- - Graceful error handling
278
- - Detailed error messages
279
- - Recovery mechanisms
280
- - Fallback options
281
-
282
- ## 🚀 Future Enhancements
283
-
284
- ### **Planned Improvements**
285
- - Multi-GPU training support
286
- - Distributed training
287
- - Advanced hyperparameter tuning
288
- - Custom dataset upload
289
- - Model evaluation metrics
290
- - Automated testing pipeline
291
-
292
- ### **Extensibility**
293
- - Plugin architecture for custom components
294
- - Configuration templates
295
- - Custom monitoring backends
296
- - Advanced deployment options
297
-
298
- ## 📊 Success Metrics
299
-
300
- ### **Pipeline Completeness**
301
- - ✅ All 16 steps implemented
302
- - ✅ Error handling at each step
303
- - ✅ Monitoring integration
304
- - ✅ Documentation complete
305
-
306
- ### **User Experience**
307
- - ✅ Simple setup process
308
- - ✅ Clear error messages
309
- - ✅ Comprehensive documentation
310
- - ✅ Testing utilities
311
-
312
- ### **Technical Quality**
313
- - ✅ Import path fixes
314
- - ✅ Configuration management
315
- - ✅ Monitoring integration
316
- - ✅ Deployment automation
317
-
318
- ## 🎉 Conclusion
319
-
320
- The SmolLM3 end-to-end pipeline provides a complete solution for fine-tuning with integrated monitoring, automated deployment, and comprehensive documentation. The refactored codebase is now production-ready with proper error handling, testing, and user experience considerations.
321
-
322
- **Key Achievements:**
323
- - Complete end-to-end automation
324
- - Integrated monitoring and tracking
325
- - Comprehensive error handling
326
- - Production-ready deployment
327
- - Extensive documentation
328
- - Testing and validation suite
329
-
330
- The pipeline is now ready for users to easily fine-tune SmolLM3 models with full monitoring and deployment capabilities.
docs/PUSH_GUIDE.md DELETED
@@ -1,406 +0,0 @@
1
- # Push to Hugging Face Hub Guide
2
-
3
- This guide explains how to use the `push_to_huggingface.py` script to upload your trained SmolLM3 models and results to Hugging Face Hub.
4
-
5
- ## Features
6
-
7
- - ✅ **Automatic Repository Creation** - Creates HF repositories automatically
8
- - ✅ **Model Validation** - Validates required model files before upload
9
- - ✅ **Comprehensive Model Cards** - Generates detailed model documentation
10
- - ✅ **Training Results Upload** - Uploads logs, configs, and results
11
- - ✅ **Trackio Integration** - Logs push actions to your monitoring system
12
- - ✅ **Private/Public Repositories** - Support for both private and public models
13
-
14
- ## Prerequisites
15
-
16
- ### 1. Install Dependencies
17
-
18
- ```bash
19
- pip install huggingface_hub
20
- ```
21
-
22
- ### 2. Set Up Hugging Face Token
23
-
24
- ```bash
25
- # Option 1: Environment variable
26
- export HF_TOKEN="your_huggingface_token_here"
27
-
28
- # Option 2: Use --token argument
29
- python push_to_huggingface.py model_path repo_name --token "your_token"
30
- ```
31
-
32
- ### 3. Get Your Hugging Face Token
33
-
34
- 1. Go to https://huggingface.co/settings/tokens
35
- 2. Click "New token"
36
- 3. Give it a name (e.g., "model-upload")
37
- 4. Select "Write" permissions
38
- 5. Copy the token
39
-
40
- ## Basic Usage
41
-
42
- ### Simple Model Push
43
-
44
- ```bash
45
- python push_to_huggingface.py /path/to/model username/model-name
46
- ```
47
-
48
- ### Push with Custom Token
49
-
50
- ```bash
51
- python push_to_huggingface.py /path/to/model username/model-name \
52
- --token "hf_your_token_here"
53
- ```
54
-
55
- ### Push Private Model
56
-
57
- ```bash
58
- python push_to_huggingface.py /path/to/model username/model-name \
59
- --private
60
- ```
61
-
62
- ### Push with Trackio Integration
63
-
64
- ```bash
65
- python push_to_huggingface.py /path/to/model username/model-name \
66
- --trackio-url "https://your-space.hf.space" \
67
- --experiment-name "my_experiment"
68
- ```
69
-
70
- ## Complete Workflow Example
71
-
72
- ### 1. Train Your Model
73
-
74
- ```bash
75
- python train.py config/train_smollm3.py \
76
- --dataset_dir my_dataset \
77
- --enable_tracking \
78
- --trackio_url "https://your-space.hf.space" \
79
- --experiment_name "smollm3_finetune_v1"
80
- ```
81
-
82
- ### 2. Push to Hugging Face Hub
83
-
84
- ```bash
85
- python push_to_huggingface.py /output-checkpoint username/smollm3-finetuned \
86
- --trackio-url "https://your-space.hf.space" \
87
- --experiment-name "smollm3_finetune_v1"
88
- ```
89
-
90
- ### 3. Use Your Model
91
-
92
- ```python
93
- from transformers import AutoModelForCausalLM, AutoTokenizer
94
-
95
- # Load your uploaded model
96
- model = AutoModelForCausalLM.from_pretrained("username/smollm3-finetuned")
97
- tokenizer = AutoTokenizer.from_pretrained("username/smollm3-finetuned")
98
-
99
- # Generate text
100
- inputs = tokenizer("Hello, how are you?", return_tensors="pt")
101
- outputs = model.generate(**inputs, max_new_tokens=100)
102
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
103
- ```
104
-
105
- ## Repository Structure
106
-
107
- After pushing, your repository will contain:
108
-
109
- ```
110
- username/model-name/
111
- ├── README.md # Auto-generated model card
112
- ├── config.json # Model configuration
113
- ├── pytorch_model.bin # Model weights
114
- ├── tokenizer.json # Tokenizer configuration
115
- ├── tokenizer_config.json # Tokenizer settings
116
- ├── special_tokens_map.json # Special tokens
117
- ├── training_results/ # Training artifacts
118
- │ ├── train_results.json
119
- │ ├── eval_results.json
120
- │ ├── training_config.json
121
- │ └── training.log
122
- └── .gitattributes # Git attributes
123
- ```
124
-
125
- ## Model Card Features
126
-
127
- The script automatically generates comprehensive model cards including:
128
-
129
- - **Model Details**: Base model, fine-tuning method, size
130
- - **Training Configuration**: All training parameters
131
- - **Training Results**: Loss, accuracy, steps, time
132
- - **Usage Examples**: Code snippets for loading and using
133
- - **Performance Metrics**: Training and validation metrics
134
- - **Hardware Information**: GPU/CPU used for training
135
-
136
- ## Advanced Usage
137
-
138
- ### Custom Repository Names
139
-
140
- ```bash
141
- # Public repository
142
- python push_to_huggingface.py /model myusername/smollm3-chatbot
143
-
144
- # Private repository
145
- python push_to_huggingface.py /model myusername/smollm3-private --private
146
- ```
147
-
148
- ### Integration with Training Pipeline
149
-
150
- ```bash
151
- #!/bin/bash
152
- # Complete training and push workflow
153
-
154
- # 1. Train the model
155
- python train.py config/train_smollm3.py \
156
- --dataset_dir my_dataset \
157
- --enable_tracking \
158
- --trackio_url "https://your-space.hf.space" \
159
- --experiment_name "smollm3_v1"
160
-
161
- # 2. Push to Hugging Face Hub
162
- python push_to_huggingface.py /output-checkpoint myusername/smollm3-v1 \
163
- --trackio-url "https://your-space.hf.space" \
164
- --experiment-name "smollm3_v1"
165
-
166
- # 3. Test the model
167
- python -c "
168
- from transformers import AutoModelForCausalLM, AutoTokenizer
169
- model = AutoModelForCausalLM.from_pretrained('myusername/smollm3-v1')
170
- tokenizer = AutoTokenizer.from_pretrained('myusername/smollm3-v1')
171
- print('Model loaded successfully!')
172
- "
173
- ```
174
-
175
- ### Batch Processing Multiple Models
176
-
177
- ```bash
178
- #!/bin/bash
179
- # Push multiple models
180
-
181
- models=(
182
- "smollm3-baseline"
183
- "smollm3-high-lr"
184
- "smollm3-dpo"
185
- )
186
-
187
- for model in "${models[@]}"; do
188
- echo "Pushing $model..."
189
- python push_to_huggingface.py "/models/$model" "username/$model"
190
- done
191
- ```
192
-
193
- ## Error Handling
194
-
195
- ### Common Issues and Solutions
196
-
197
- #### 1. Missing Model Files
198
-
199
- **Error**: `❌ Missing required files: ['config.json', 'pytorch_model.bin']`
200
-
201
- **Solution**: Ensure your model directory contains all required files:
202
- - `config.json`
203
- - `pytorch_model.bin`
204
- - `tokenizer.json`
205
- - `tokenizer_config.json`
206
-
207
- #### 2. Authentication Issues
208
-
209
- **Error**: `❌ Failed to create repository: 401 Client Error`
210
-
211
- **Solution**:
212
- - Check your HF token is valid
213
- - Ensure token has write permissions
214
- - Verify username in repository name matches your account
215
-
216
- #### 3. Repository Already Exists
217
-
218
- **Error**: `Repository already exists`
219
-
220
- **Solution**: The script handles this automatically with `exist_ok=True`, but you can:
221
- - Use a different repository name
222
- - Delete the existing repository first
223
- - Use version numbers: `username/model-v2`
224
-
225
- #### 4. Large File Upload Issues
226
-
227
- **Error**: `Upload failed for large files`
228
-
229
- **Solution**:
230
- - Check your internet connection
231
- - Use Git LFS for large files
232
- - Consider splitting large models
233
-
234
- ## Trackio Integration
235
-
236
- ### Logging Push Actions
237
-
238
- When using Trackio integration, the script logs:
239
-
240
- - **Push Action**: Repository creation and file uploads
241
- - **Model Metadata**: Size, configuration, results
242
- - **Repository Info**: Name, privacy settings, URL
243
- - **Training Results**: Loss, accuracy, steps
244
-
245
- ### Viewing Push Logs
246
-
247
- 1. Go to your Trackio Space
248
- 2. Navigate to the "View Experiments" tab
249
- 3. Find your experiment
250
- 4. Check the metrics for push-related actions
251
-
252
- ## Security Best Practices
253
-
254
- ### Token Management
255
-
256
- ```bash
257
- # Use environment variables (recommended)
258
- export HF_TOKEN="your_token_here"
259
- python push_to_huggingface.py model repo
260
-
261
- # Don't hardcode tokens in scripts
262
- # ❌ Bad: python push_to_huggingface.py model repo --token "hf_xxx"
263
- ```
264
-
265
- ### Private Models
266
-
267
- ```bash
268
- # For sensitive models, use private repositories
269
- python push_to_huggingface.py model username/private-model --private
270
- ```
271
-
272
- ### Repository Naming
273
-
274
- ```bash
275
- # Use descriptive names
276
- python push_to_huggingface.py model username/smollm3-chatbot-v1
277
-
278
- # Include version numbers
279
- python push_to_huggingface.py model username/smollm3-v2.0
280
- ```
281
-
282
- ## Performance Optimization
283
-
284
- ### Large Models
285
-
286
- For models > 5GB:
287
-
288
- ```bash
289
- # Use Git LFS for large files
290
- git lfs install
291
- git lfs track "*.bin"
292
-
293
- # Consider splitting models
294
- python push_to_huggingface.py model username/model-large --private
295
- ```
296
-
297
- ### Upload Speed
298
-
299
- ```bash
300
- # Use stable internet connection
301
- # Consider uploading during off-peak hours
302
- # Upload speed depends on your connection, not on repository visibility
303
- ```
304
-
305
- ## Troubleshooting
306
-
307
- ### Debug Mode
308
-
309
- ```bash
310
- # Enable debug logging
311
- export LOG_LEVEL=DEBUG
312
- python push_to_huggingface.py model repo
313
- ```
314
-
315
- ### Validate Model Files
316
-
317
- ```bash
318
- # Check model structure before pushing
319
- ls -la /path/to/model/
320
- # Should contain: config.json, pytorch_model.bin, tokenizer.json, etc.
321
- ```
322
-
323
- ### Test Repository Access
324
-
325
- ```bash
326
- # Test your HF token
327
- python -c "
328
- from huggingface_hub import HfApi
329
- api = HfApi(token='your_token')
330
- print('Token is valid for user:', api.whoami()['name'])
331
- "
332
- ```
333
-
334
- ## Integration Examples
335
-
336
- ### With CI/CD Pipeline
337
-
338
- ```yaml
339
- # .github/workflows/train-and-push.yml
340
- name: Train and Push Model
341
-
342
- on:
343
- push:
344
- branches: [main]
345
-
346
- jobs:
347
- train-and-push:
348
- runs-on: ubuntu-latest
349
- steps:
350
- - uses: actions/checkout@v2
351
-
352
- - name: Train Model
353
- run: |
354
- python train.py config/train_smollm3.py
355
-
356
- - name: Push to HF Hub
357
- run: |
358
- python push_to_huggingface.py /output username/model-${{ github.run_number }}
359
- env:
360
- HF_TOKEN: ${{ secrets.HF_TOKEN }}
361
- ```
362
-
363
- ### With Docker
364
-
365
- ```dockerfile
366
- # Dockerfile
367
- FROM python:3.9
368
-
369
- WORKDIR /app
370
- COPY requirements.txt .
371
- RUN pip install -r requirements.txt
372
-
373
- COPY . .
374
-
375
- CMD ["python", "push_to_huggingface.py", "/model", "username/model"]
376
- ```
377
-
378
- ## Support and Resources
379
-
380
- ### Documentation
381
-
382
- - [Hugging Face Hub Documentation](https://huggingface.co/docs/hub/index)
383
- - [Transformers Documentation](https://huggingface.co/docs/transformers/index)
384
- - [Model Cards Guide](https://huggingface.co/docs/hub/model-cards)
385
-
386
- ### Community
387
-
388
- - [Hugging Face Forums](https://discuss.huggingface.co/)
389
- - [GitHub Issues](https://github.com/huggingface/huggingface_hub/issues)
390
-
391
- ### Examples
392
-
393
- - [Model Repository Examples](https://huggingface.co/models?search=smollm3)
394
- - [Fine-tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
395
-
396
- ## Conclusion
397
-
398
- The `push_to_huggingface.py` script provides a complete solution for:
399
-
400
- - ✅ **Easy Model Deployment** - One command to push models
401
- - ✅ **Professional Documentation** - Auto-generated model cards
402
- - ✅ **Training Artifacts** - Complete experiment tracking
403
- - ✅ **Integration Ready** - Works with CI/CD and monitoring
404
- - ✅ **Security Focused** - Proper token and privacy management
405
-
406
- Start sharing your fine-tuned SmolLM3 models with the community!
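The "Missing Model Files" check described in the guide above could look roughly like the following. This is a minimal sketch, not the code of `push_to_huggingface.py`: the function name `find_missing_files` and the constant `REQUIRED_FILES` are hypothetical, and the file list is taken from the guide's troubleshooting section rather than from the script itself.

```python
from pathlib import Path

# Assumption: the required set mirrors the list in the guide's
# "Missing Model Files" section; the real script may check other files.
REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
)


def find_missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = find_missing_files("/path/to/model")
    if missing:
        print(f"Missing required files: {missing}")
    else:
        print("Model files validated")
```

Running a check like this before the upload step fails fast on an incomplete checkpoint instead of creating a half-populated repository.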
docs/PUSH_SCRIPT_GUIDE.md DELETED
@@ -1,267 +0,0 @@
1
- # 🚀 Push to Hugging Face Script Guide
2
-
3
- ## Overview
4
-
5
- The `push_to_huggingface.py` script has been enhanced to integrate with **HF Datasets** for experiment tracking and provides complete model deployment with persistent experiment storage.
6
-
7
- ## 🚀 Key Improvements
8
-
9
- ### **1. HF Datasets Integration**
10
- - ✅ **Dataset Repository Support**: Configurable dataset repository for experiment storage
11
- - ✅ **Environment Variables**: Automatic detection of `HF_TOKEN` and `TRACKIO_DATASET_REPO`
12
- - ✅ **Enhanced Logging**: Logs push actions to both Trackio and HF Datasets
13
- - ✅ **Model Card Integration**: Includes dataset repository information in model cards
14
-
15
- ### **2. Enhanced Configuration**
16
- - ✅ **Flexible Token Input**: Multiple ways to provide HF token
17
- - ✅ **Dataset Repository Tracking**: Links models to their experiment datasets
18
- - ✅ **Environment Variable Support**: Fallback to environment variables
19
- - ✅ **Command Line Arguments**: New arguments for HF Datasets integration
20
-
21
- ### **3. Improved Model Cards**
22
- - ✅ **Dataset Repository Info**: Shows which dataset contains experiment data
23
- - ✅ **Experiment Tracking Section**: Explains how to access training data
24
- - ✅ **Enhanced Documentation**: Better model cards with experiment links
25
-
26
- ## 📋 Usage Examples
27
-
28
- ### **Basic Usage**
29
- ```bash
30
- # Push model with default settings
31
- python push_to_huggingface.py /path/to/model username/repo-name
32
- ```
33
-
34
- ### **With HF Datasets Integration**
35
- ```bash
36
- # Push model with custom dataset repository
37
- python push_to_huggingface.py /path/to/model username/repo-name \
38
- --dataset-repo username/experiments
39
- ```
40
-
41
- ### **With Custom Token**
42
- ```bash
43
- # Push model with custom HF token
44
- python push_to_huggingface.py /path/to/model username/repo-name \
45
- --hf-token your_token_here
46
- ```
47
-
48
- ### **Complete Example**
49
- ```bash
50
- # Push model with all options
51
- python push_to_huggingface.py /path/to/model username/repo-name \
52
- --dataset-repo username/experiments \
53
- --hf-token your_token_here \
54
- --private \
55
- --experiment-name "smollm3_finetune_v2"
56
- ```
57
-
58
- ## 🔧 Command Line Arguments
59
-
60
- | Argument | Required | Default | Description |
61
- |----------|----------|---------|-------------|
62
- | `model_path` | ✅ Yes | None | Path to trained model directory |
63
- | `repo_name` | ✅ Yes | None | HF repository name (username/repo-name) |
64
- | `--token` | ❌ No | `HF_TOKEN` env | Hugging Face token |
65
- | `--hf-token` | ❌ No | `HF_TOKEN` env | HF token (alternative to --token) |
66
- | `--private` | ❌ No | False | Make repository private |
67
- | `--trackio-url` | ❌ No | None | Trackio Space URL for logging |
68
- | `--experiment-name` | ❌ No | None | Experiment name for Trackio |
69
- | `--dataset-repo` | ❌ No | `TRACKIO_DATASET_REPO` env | HF Dataset repository |
70
-
71
- ## 🛠️ Configuration Methods
72
-
73
- ### **Method 1: Command Line Arguments**
74
- ```bash
75
- python push_to_huggingface.py model_path repo_name \
76
- --dataset-repo username/experiments \
77
- --hf-token your_token_here
78
- ```
79
-
80
- ### **Method 2: Environment Variables**
81
- ```bash
82
- export HF_TOKEN=your_token_here
83
- export TRACKIO_DATASET_REPO=username/experiments
84
- python push_to_huggingface.py model_path repo_name
85
- ```
86
-
87
- ### **Method 3: Hybrid Approach**
88
- ```bash
89
- # Set defaults via environment variables
90
- export HF_TOKEN=your_token_here
91
- export TRACKIO_DATASET_REPO=username/experiments
92
-
93
- # Override specific values via command line
94
- python push_to_huggingface.py model_path repo_name \
95
- --dataset-repo username/specific-experiments
96
- ```
97
-
98
- ## 📊 What Gets Pushed
99
-
100
- ### **Model Files**
101
- - ✅ **Model Weights**: `pytorch_model.bin`
102
- - ✅ **Configuration**: `config.json`
103
- - ✅ **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`
104
- - ✅ **All Other Files**: Any additional files in model directory
105
-
106
- ### **Documentation**
107
- - ✅ **Model Card**: Comprehensive README.md with model information
108
- - ✅ **Training Configuration**: JSON configuration used for training
109
- - ✅ **Training Results**: JSON results and metrics
110
- - ✅ **Training Logs**: Text logs from training process
111
-
112
- ### **Experiment Data**
113
- - ✅ **Dataset Repository**: Links to HF Dataset containing experiment data
114
- - ✅ **Training Metrics**: All training metrics stored in dataset
115
- - ✅ **Configuration**: Training configuration stored in dataset
116
- - ✅ **Artifacts**: Training artifacts and logs
117
-
118
- ## 🔍 Enhanced Model Cards
119
-
120
- The improved script creates enhanced model cards that include:
121
-
122
- ### **Model Information**
123
- - Base model and architecture
124
- - Training date and model size
125
- - **Dataset repository** for experiment data
126
-
127
- ### **Training Configuration**
128
- - Complete training parameters
129
- - Hardware information
130
- - Training duration and steps
131
-
132
- ### **Experiment Tracking**
133
- - Links to HF Dataset repository
134
- - Instructions for accessing experiment data
135
- - Training metrics and results
136
-
137
- ### **Usage Examples**
138
- - Code examples for loading and using the model
139
- - Generation examples
140
- - Performance information
141
-
142
- ## 📈 Logging Integration
143
-
144
- ### **Trackio Logging**
145
- - ✅ **Push Actions**: Logs model push events
146
- - ✅ **Model Information**: Repository name, size, configuration
147
- - ✅ **Training Data**: Links to experiment dataset
148
-
149
- ### **HF Datasets Logging**
150
- - ✅ **Experiment Summary**: Final training summary
151
- - ✅ **Push Metadata**: Model repository and push date
152
- - ✅ **Configuration**: Complete training configuration
153
-
154
- ### **Dual Storage**
155
- - ✅ **Trackio**: Real-time monitoring and visualization
156
- - ✅ **HF Datasets**: Persistent experiment storage
157
- - ✅ **Synchronized**: Both systems updated together
158
-
159
- ## 🚨 Troubleshooting
160
-
161
- ### **Issue: "Missing required files"**
162
- **Solutions**:
163
- 1. Check model directory contains required files
164
- 2. Ensure model was saved correctly during training
165
- 3. Verify file permissions
166
-
167
- ### **Issue: "Failed to create repository"**
168
- **Solutions**:
169
- 1. Check HF token has write permissions
170
- 2. Verify repository name format: `username/repo-name`
171
- 3. If the repository already exists, choose a different name or confirm your token can write to it
172
-
173
- ### **Issue: "Failed to upload files"**
174
- **Solutions**:
175
- 1. Check network connectivity
176
- 2. Verify HF token is valid
177
- 3. Ensure repository was created successfully
178
-
179
- ### **Issue: "Dataset repository not found"**
180
- **Solutions**:
181
- 1. Check dataset repository exists
182
- 2. Verify HF token has read access
183
- 3. Use `--dataset-repo` to specify correct repository
184
-
185
- ## 📋 Workflow Integration
186
-
187
- ### **Complete Training Workflow**
188
- 1. **Train Model**: Use training scripts with monitoring
189
- 2. **Monitor Progress**: View metrics in Trackio interface
190
- 3. **Push Model**: Use improved push script
191
- 4. **Access Data**: View experiments in HF Dataset repository
192
-
193
- ### **Example Workflow**
194
- ```bash
195
- # 1. Train model with monitoring
196
- python train.py config/train_smollm3_openhermes_fr.py \
197
- --experiment_name "smollm3_french_v2"
198
-
199
- # 2. Push model to HF Hub
200
- python push_to_huggingface.py outputs/model username/smollm3-french \
201
- --dataset-repo username/experiments \
202
- --experiment-name "smollm3_french_v2"
203
-
204
- # 3. View results
205
- # - Model: https://huggingface.co/username/smollm3-french
206
- # - Experiments: https://huggingface.co/datasets/username/experiments
207
- # - Trackio: Your Trackio Space interface
208
- ```
209
-
210
- ## 🎯 Benefits
211
-
212
- ### **For Model Deployment**
213
- - ✅ **Complete Documentation**: Enhanced model cards with experiment links
214
- - ✅ **Persistent Storage**: Experiment data stored in HF Datasets
215
- - ✅ **Easy Access**: Direct links to training data and metrics
216
- - ✅ **Reproducibility**: Complete training configuration included
217
-
218
- ### **For Experiment Management**
219
- - ✅ **Centralized Storage**: All experiments in HF Dataset repository
220
- - ✅ **Version Control**: Model versions linked to experiment data
221
- - ✅ **Collaboration**: Share experiments and models easily
222
- - ✅ **Searchability**: Easy to find specific experiments
223
-
224
- ### **For Development**
225
- - ✅ **Flexible Configuration**: Multiple ways to set parameters
226
- - ✅ **Backward Compatible**: Works with existing setups
227
- - ✅ **Error Handling**: Clear error messages and troubleshooting
228
- - ✅ **Integration**: Works with existing monitoring system
229
-
230
- ## 📊 Testing Results
231
-
232
- All push script tests passed:
233
- - ✅ **HuggingFacePusher Initialization**: Works with new parameters
234
- - ✅ **Model Card Creation**: Includes HF Datasets integration
235
- - ✅ **Logging Integration**: Logs to both Trackio and HF Datasets
236
- - ✅ **Argument Parsing**: Handles new command line arguments
237
- - ✅ **Environment Variables**: Proper fallback handling
238
-
239
- ## 🔄 Migration Guide
240
-
241
- ### **From Old Script**
242
- ```bash
243
- # Old way
244
- python push_to_huggingface.py model_path repo_name --token your_token
245
-
246
- # New way (same functionality)
247
- python push_to_huggingface.py model_path repo_name --hf-token your_token
248
-
249
- # New way with HF Datasets
250
- python push_to_huggingface.py model_path repo_name \
251
- --hf-token your_token \
252
- --dataset-repo username/experiments
253
- ```
254
-
255
- ### **Environment Variables**
256
- ```bash
257
- # Set environment variables for automatic detection
258
- export HF_TOKEN=your_token_here
259
- export TRACKIO_DATASET_REPO=username/experiments
260
-
261
- # Then use simple command
262
- python push_to_huggingface.py model_path repo_name
263
- ```
264
-
265
- ---
266
-
267
- **🎉 Your push script is now fully integrated with HF Datasets for complete experiment tracking and model deployment!**
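The precedence described under "Configuration Methods" above (command line first, environment variables as fallback) can be sketched as follows. This is an illustration under stated assumptions, not the script's actual parser: `build_parser` and `resolve_settings` are hypothetical names, and treating `--token` and `--hf-token` as aliases of one option is a simplification of the two separate rows in the argument table.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser covering a subset of the documented arguments."""
    parser = argparse.ArgumentParser(description="Sketch of push argument handling")
    parser.add_argument("model_path")
    parser.add_argument("repo_name")
    # Assumption: both spellings feed the same destination.
    parser.add_argument("--token", "--hf-token", dest="token", default=None)
    parser.add_argument("--dataset-repo", dest="dataset_repo", default=None)
    parser.add_argument("--private", action="store_true")
    return parser


def resolve_settings(argv, env=None):
    """Command-line values win; HF_TOKEN and TRACKIO_DATASET_REPO are fallbacks."""
    env = os.environ if env is None else env
    args = build_parser().parse_args(argv)
    token = args.token or env.get("HF_TOKEN")
    if not token:
        raise SystemExit("No Hugging Face token: pass --hf-token or set HF_TOKEN")
    return {
        "model_path": args.model_path,
        "repo_name": args.repo_name,
        "token": token,
        "dataset_repo": args.dataset_repo or env.get("TRACKIO_DATASET_REPO"),
        "private": args.private,
    }
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.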
docs/QUANTIZATION_FIX_SUMMARY.md DELETED
@@ -1,165 +0,0 @@
1
- # Quantization Fix Summary
2
-
3
- ## Issues Identified
4
-
5
- The quantization script was failing due to several compatibility issues:
6
-
7
- 1. **Int8 Quantization Error**:
8
- - Error: `The model is quantized with QuantizationMethod.TORCHAO and is not serializable`
9
- - Cause: Offloaded modules in the model cannot be quantized with torchao
10
- - Solution: Added alternative save method and fallback to bitsandbytes
11
-
12
- 2. **Int4 Quantization Error**:
13
- - Error: `Could not run 'aten::_convert_weight_to_int4pack_for_cpu' with arguments from the 'CUDA' backend`
14
- - Cause: Int4 quantization requires CPU backend but was being attempted on CUDA
15
- - Solution: Added proper device selection logic
16
-
17
- 3. **Monitoring Error**:
18
- - Error: `'SmolLM3Monitor' object has no attribute 'log_event'`
19
- - Cause: Incorrect monitoring API usage
20
- - Solution: Added flexible monitoring method detection
21
-
22
- ## Fixes Implemented
23
-
24
- ### 1. Enhanced Device Management (`scripts/model_tonic/quantize_model.py`)
25
-
26
- ```python
27
- def get_optimal_device(self, quant_type: str) -> str:
28
- """Get optimal device for quantization type"""
29
- if quant_type == "int4_weight_only":
30
- # Int4 quantization works better on CPU
31
- return "cpu"
32
- elif quant_type == "int8_weight_only":
33
- # Int8 quantization works on GPU
34
- if torch.cuda.is_available():
35
- return "cuda"
36
- else:
37
- logger.warning("⚠️ CUDA not available, falling back to CPU for int8")
38
- return "cpu"
39
- else:
40
- return "auto"
41
- ```
42
-
43
- ### 2. Alternative Quantization Method
44
-
45
- Added `quantize_model_alternative()` method using bitsandbytes for better compatibility:
46
-
47
- ```python
48
- def quantize_model_alternative(self, quant_type: str, device: str = "auto", group_size: int = 128, save_dir: Optional[str] = None) -> Optional[str]:
49
- """Alternative quantization using bitsandbytes for better compatibility"""
50
- # Uses BitsAndBytesConfig instead of TorchAoConfig
51
- # Handles serialization issues better
52
- ```
53
-
54
- ### 3. Improved Error Handling
55
-
56
- - Added fallback from torchao to bitsandbytes
57
- - Enhanced save method with alternative approaches
58
- - Better device mapping for different quantization types
59
-
60
- ### 4. Fixed Monitoring Integration
61
-
62
- ```python
63
- def log_to_trackio(self, action: str, details: Dict[str, Any]):
64
- """Log quantization events to Trackio"""
65
- if self.monitor:
66
- try:
67
- # Use the correct monitoring method
68
- if hasattr(self.monitor, 'log_event'):
69
- self.monitor.log_event(action, details)
70
- elif hasattr(self.monitor, 'log_metric'):
71
- self.monitor.log_metric(action, details.get('value', 1.0))
72
- elif hasattr(self.monitor, 'log'):
73
- self.monitor.log(action, details)
74
- else:
75
- logger.info(f"📊 {action}: {details}")
76
- except Exception as e:
77
- logger.warning(f"⚠️ Failed to log to Trackio: {e}")
78
- ```
79
-
80
- ## Usage Instructions
81
-
82
- ### 1. Install Dependencies
83
-
84
- ```bash
85
- pip install -r requirements_quantization.txt
86
- ```
87
-
88
- ### 2. Run Quantization
89
-
90
- ```bash
91
- python3 quantize_and_push.py
92
- ```
93
-
94
- ### 3. Test Fixes
95
-
96
- ```bash
97
- python3 test_quantization_fix.py
98
- ```
99
-
100
- ## Expected Behavior
101
-
102
- ### Successful Quantization
103
-
104
- The script will now:
105
-
106
- 1. **Try torchao first** for each quantization type
107
- 2. **Fall back to bitsandbytes** if torchao fails
108
- 3. **Use appropriate devices** (CPU for int4, GPU for int8)
109
- 4. **Handle serialization issues** with alternative save methods
110
- 5. **Log progress** without monitoring errors
111
-
112
- ### Output
113
-
114
- ```
115
- ✅ Model files validated
116
- 🔄 Processing quantization type: int8_weight_only
117
- 🔄 Using device: cuda
118
- ✅ int8_weight_only quantization and push completed
119
- 🔄 Processing quantization type: int4_weight_only
120
- 🔄 Using device: cpu
121
- ✅ int4_weight_only quantization and push completed
122
- 📊 Quantization summary: 2/2 successful
123
- ✅ Quantization completed successfully!
124
- ```
125
-
126
- ## Troubleshooting
127
-
128
- ### If All Quantization Fails
129
-
130
- 1. **Install bitsandbytes**:
131
- ```bash
132
- pip install bitsandbytes
133
- ```
134
-
135
- 2. **Check model path**:
136
- ```bash
137
- ls -la /output-checkpoint
138
- ```
139
-
140
- 3. **Verify dependencies**:
141
- ```bash
142
- python3 test_quantization_fix.py
143
- ```
144
-
145
- ### Common Issues
146
-
147
- 1. **Memory Issues**: Use CPU for int4 quantization
148
- 2. **Serialization Errors**: The script now handles these automatically
149
- 3. **Device Conflicts**: Automatic device selection based on quantization type
150
-
151
- ## Files Modified
152
-
153
- 1. `scripts/model_tonic/quantize_model.py` - Main quantization logic
154
- 2. `quantize_and_push.py` - Main script with better error handling
155
- 3. `test_quantization_fix.py` - Test script for verification
156
- 4. `requirements_quantization.txt` - Dependencies file
157
-
158
- ## Next Steps
159
-
160
- 1. Run the test script to verify fixes
161
- 2. Install bitsandbytes if not already installed
162
- 3. Run the quantization script
163
- 4. Check the Hugging Face repository for quantized models
164
-
165
- The fixes ensure robust quantization with multiple fallback options and proper error handling.
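The "try torchao first, fall back to bitsandbytes" behavior summarized above can be sketched independently of either library. The function `quantize_with_fallback` and its callable parameters are hypothetical; in the real script the two backends would presumably be the `quantize_model` and `quantize_model_alternative` methods, which this sketch does not call.

```python
import logging

logger = logging.getLogger("quantization_sketch")


def quantize_with_fallback(quant_type, primary, fallback):
    """Try the primary backend, then the fallback.

    Each backend is a callable taking quant_type and returning a save path,
    or None on failure. Returns (backend_name, path), or (None, None) if
    both backends fail.
    """
    for name, backend in (("torchao", primary), ("bitsandbytes", fallback)):
        try:
            result = backend(quant_type)
        except Exception as exc:  # any backend error triggers the next option
            logger.warning("%s failed for %s: %s", name, quant_type, exc)
            continue
        if result is not None:
            return name, result
    return None, None
```

Returning the backend name alongside the path lets the caller record which method produced each quantized model.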
docs/QUANTIZATION_GUIDE.md DELETED
@@ -1,313 +0,0 @@
1
- # Model Quantization Guide
2
-
3
- ## Overview
4
-
5
- This guide covers the quantization functionality integrated into the SmolLM3 fine-tuning pipeline. The system supports creating quantized versions of trained models using `torchao` and automatically uploading them to Hugging Face Hub in a unified repository structure.
6
-
7
- ## Repository Structure
8
-
9
- With the updated pipeline, all models (main and quantized) are stored in a single repository:
10
-
11
- ```
12
- your-username/model-name/
13
- ├── README.md (unified model card)
14
- ├── config.json
15
- ├── pytorch_model.bin
16
- ├── tokenizer.json
17
- ├── tokenizer_config.json
18
- ├── int8/ (quantized model for GPU)
19
- │ ├── README.md
20
- │ ├── config.json
21
- │ └── pytorch_model.bin
22
- └── int4/ (quantized model for CPU)
23
- ├── README.md
24
- ├── config.json
25
- └── pytorch_model.bin
26
- ```
27
-
28
- ## Quantization Types
29
-
30
- ### int8 Weight-Only Quantization (GPU Optimized)
31
- - **Memory Reduction**: ~50% compared to original model
32
- - **Speed**: Faster inference with minimal accuracy loss
33
- - **Hardware**: GPU optimized for high-performance inference
34
- - **Use Case**: Production deployments with GPU resources
35
-
36
- ### int4 Weight-Only Quantization (CPU Optimized)
37
- - **Memory Reduction**: ~75% compared to original model
38
- - **Speed**: Significantly faster inference with some accuracy trade-off
39
- - **Hardware**: CPU optimized for deployment
40
- - **Use Case**: Edge deployment, CPU-only environments
41
-
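As a rough sanity check on the ~50% and ~75% figures quoted in the deleted guide, weight memory scales with bits per parameter. The sketch below assumes a 3B-parameter model (the pipeline names `HuggingFaceTB/SmolLM3-3B` elsewhere) and a 16-bit baseline; `weight_memory_gb` is a hypothetical helper, and the estimate ignores quantization scales, zero points, and activation memory.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3_000_000_000  # assumption: roughly a 3B-parameter model

baseline_gb = weight_memory_gb(NUM_PARAMS, 16)  # bf16 baseline
for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    size_gb = weight_memory_gb(NUM_PARAMS, bits)
    reduction = 100 * (1 - size_gb / baseline_gb)
    print(f"{name}: {size_gb:.1f} GB ({reduction:.0f}% reduction)")
```

Real checkpoints will come out somewhat larger than these numbers, since group-wise quantization stores extra per-group parameters.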
42
- ## Integration with Pipeline
43
-
44
- ### Automatic Quantization
45
-
46
- The quantization process is integrated into the main training pipeline:
47
-
48
- 1. **Training**: Model is trained using the standard pipeline
49
- 2. **Model Push**: Main model is pushed to Hugging Face Hub
50
- 3. **Quantization Options**: User is prompted to create quantized versions
51
- 4. **Quantized Models**: Quantized models are created and pushed to subdirectories
52
- 5. **Unified Documentation**: Single model card covers all versions
53
-
54
- ### Pipeline Integration
55
-
56
- The quantization step is added to `launch.sh` after the main model push:
57
-
58
- ```bash
59
- # Step 16.5: Quantization Options
60
- print_step "Step 16.5: Model Quantization Options"
61
- echo "=========================================="
62
-
63
- print_info "Would you like to create quantized versions of your model?"
64
- print_info "Quantization reduces model size and improves inference speed."
65
-
66
- # Ask about quantization
67
- get_input "Create quantized models? (y/n)" "y" "CREATE_QUANTIZED"
68
-
69
- if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
70
- print_info "Quantization options:"
71
- print_info "1. int8_weight_only (GPU optimized, ~50% memory reduction)"
72
- print_info "2. int4_weight_only (CPU optimized, ~75% memory reduction)"
73
- print_info "3. Both int8 and int4 versions"
74
-
75
- select_option "Select quantization type:" "int8_weight_only" "int4_weight_only" "both" "QUANT_TYPE"
76
-
77
- # Create quantized models in the same repository
78
- python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
79
- --quant-type "$QUANT_TYPE" \
80
- --device "$DEVICE" \
81
- --token "$HF_TOKEN" \
82
- --trackio-url "$TRACKIO_URL" \
83
- --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
84
- --dataset-repo "$TRACKIO_DATASET_REPO"
85
- fi
86
- ```
87
-
88
- ## Standalone Quantization
89
-
90
- ### Using the Standalone Script
91
-
92
- For models already uploaded to Hugging Face Hub:
93
-
94
- ```bash
95
- python scripts/model_tonic/quantize_standalone.py \
96
- "your-username/model-name" \
97
- "your-username/model-name" \
98
- --quant-type "int8_weight_only" \
99
- --device "auto" \
100
- --token "your-hf-token"
101
- ```
102
-
103
- ### Command Line Options
104
-
105
- ```bash
106
- python scripts/model_tonic/quantize_standalone.py model_path repo_name [options]
107
-
108
- Options:
109
- --quant-type {int8_weight_only,int4_weight_only,int8_dynamic}
110
- Quantization type (default: int8_weight_only)
111
- --device DEVICE Device for quantization (auto, cpu, cuda)
112
- --group-size GROUP_SIZE
113
- Group size for quantization (default: 128)
114
- --token TOKEN Hugging Face token
115
- --private Create private repository
116
- --trackio-url TRACKIO_URL
117
- Trackio URL for monitoring
118
- --experiment-name EXPERIMENT_NAME
119
- Experiment name for tracking
120
- --dataset-repo DATASET_REPO
121
- HF Dataset repository
122
- --save-only Save quantized model locally without pushing to HF
123
- ```
124
-
125
- ## Loading Quantized Models
126
-
127
- ### Loading Main Model
128
-
129
- ```python
130
- import torch
131
- from transformers import AutoModelForCausalLM, AutoTokenizer
132
-
133
- # Load the main model
134
- model = AutoModelForCausalLM.from_pretrained(
135
- "your-username/model-name",
136
- device_map="auto",
137
- torch_dtype=torch.bfloat16
138
- )
139
- tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
140
- ```
141
-
142
- ### Loading int8 Quantized Model (GPU)
143
-
144
- ```python
145
- import torch
146
- from transformers import AutoModelForCausalLM, AutoTokenizer
147
-
148
- # Load int8 quantized model (GPU optimized)
149
- model = AutoModelForCausalLM.from_pretrained(
149
- "your-username/model-name", subfolder="int8",
151
- device_map="auto",
152
- torch_dtype=torch.bfloat16
153
- )
154
- tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int8")
155
- ```
156
-
157
- ### Loading int4 Quantized Model (CPU)
158
-
159
- ```python
160
- import torch
161
- from transformers import AutoModelForCausalLM, AutoTokenizer
162
-
163
- # Load int4 quantized model (CPU optimized)
164
- model = AutoModelForCausalLM.from_pretrained(
164
- "your-username/model-name", subfolder="int4",
166
- device_map="cpu",
167
- torch_dtype=torch.bfloat16
168
- )
169
- tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int4")
170
- ```
171
-
172
- ## Usage Examples
173
-
174
- ### Text Generation with Quantized Model
175
-
176
- ```python
177
- from transformers import AutoModelForCausalLM, AutoTokenizer
178
-
179
- # Load quantized model
180
- model = AutoModelForCausalLM.from_pretrained("your-username/model-name", subfolder="int8")
181
- tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int8")
182
-
183
- # Generate text
184
- text = "The future of artificial intelligence is"
185
- inputs = tokenizer(text, return_tensors="pt")
186
- outputs = model.generate(**inputs, max_new_tokens=100)
187
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
188
- ```
189
-
190
- ### Conversation with Quantized Model
191
-
192
- ```python
193
- def chat_with_quantized_model(prompt, max_length=100):
194
- inputs = tokenizer(prompt, return_tensors="pt")
195
- outputs = model.generate(**inputs, max_new_tokens=max_length)
196
- return tokenizer.decode(outputs[0], skip_special_tokens=True)
197
-
198
- response = chat_with_quantized_model("Hello, how are you today?")
199
- print(response)
200
- ```
201
-
202
- ## Configuration Options
203
-
204
- ### Quantization Parameters
205
-
206
- - **group_size**: Group size for quantization (default: 128)
207
- - **device**: Target device for quantization (auto, cpu, cuda)
208
- - **quant_type**: Type of quantization to apply
209
-
210
- ### Hardware Requirements
211
-
212
- - **Main Model**: GPU with 8GB+ VRAM recommended
213
- - **int8 Model**: GPU with 4GB+ VRAM
214
- - **int4 Model**: CPU deployment possible
215
-
216
- ## Performance Comparison
217
-
218
- | Model Type | Memory Usage | Speed | Accuracy | Use Case |
219
- |------------|--------------|-------|----------|----------|
220
- | Original | 100% | Baseline | Best | Development, Research |
221
- | int8 | ~50% | Faster | Minimal loss | Production GPU |
222
- | int4 | ~25% | Fastest | Some loss | Edge, CPU deployment |
223
-
224
- ## Best Practices
225
-
226
- ### When to Use Quantization
227
-
228
- 1. **int8 (GPU)**: When you need faster inference with minimal accuracy loss
229
- 2. **int4 (CPU)**: When deploying to CPU-only environments or edge devices
230
- 3. **Both**: When you need flexibility for different deployment scenarios
231
-
232
- ### Memory Optimization
233
-
234
- - Use int8 for GPU deployments with memory constraints
235
- - Use int4 for CPU deployments or very memory-constrained environments
236
- - Consider the trade-off between speed and accuracy
237
-
238
- ### Deployment Considerations
239
-
240
- - Test quantized models on your specific use case
241
- - Monitor performance and accuracy in production
242
- - Consider using the main model for development and quantized versions for deployment
243
-
244
- ## Troubleshooting
245
-
246
- ### Common Issues
247
-
248
- 1. **CUDA Out of Memory**: Reduce batch size or use int8 quantization
249
- 2. **Import Errors**: Install torchao: `pip install "torchao>=0.10.0"`
250
- 3. **Model Loading Errors**: Ensure the model path is correct and accessible
251
-
252
- ### Debugging
253
-
254
- ```bash
255
- # Test quantization functionality
256
- python tests/test_quantization.py
257
-
258
- # Check torchao installation
259
- python -c "import torchao; print('torchao available')"
260
-
261
- # Verify model files
262
- ls -la /path/to/model/
263
- ```
264
-
265
- ## Monitoring and Tracking
266
-
267
- ### Trackio Integration
268
-
269
- Quantization events are logged to Trackio:
270
-
271
- - `quantization_started`: When quantization begins
272
- - `quantization_completed`: When quantization finishes
273
- - `quantized_model_pushed`: When model is uploaded to HF Hub
274
- - `quantization_failed`: If quantization fails
275
-
276
- ### Metrics Tracked
277
-
278
- - Quantization type and parameters
279
- - Model size reduction
280
- - Upload URLs for quantized models
281
- - Processing time and success status
282
-
283
- ## Dependencies
284
-
285
- ### Required Packages
286
-
287
- ```bash
288
- pip install "torchao>=0.10.0"
289
- pip install "transformers>=4.35.0"
290
- pip install "huggingface_hub>=0.16.0"
291
- ```
292
-
293
- ### Optional Dependencies
294
-
295
- ```bash
296
- pip install "accelerate>=0.20.0" # For device mapping
297
- pip install "bitsandbytes>=0.41.0" # For additional quantization
298
- ```
299
-
300
- ## References
301
-
302
- - [torchao Documentation](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
303
- - [Hugging Face Model Cards](https://huggingface.co/docs/hub/model-cards)
304
- - [Transformers Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization)
305
-
306
- ## Support
307
-
308
- For issues and questions:
309
-
310
- 1. Check the troubleshooting section above
311
- 2. Review the test files in `tests/test_quantization.py`
312
- 3. Open an issue on the project repository
313
- 4. Check the Trackio monitoring for detailed logs
 
docs/QUANTIZATION_IMPLEMENTATION_SUMMARY.md DELETED
@@ -1,248 +0,0 @@
1
- # Quantization Implementation Summary
2
-
3
- This document summarizes the torchao quantization features that have been added to the SmolLM3 fine-tuning pipeline.
4
-
5
- ## 🚀 New Features Added
6
-
7
- ### 1. Core Quantization Scripts
8
-
9
- #### `scripts/model_tonic/quantize_model.py`
10
- - **Main quantization script** with full HF Hub integration
11
- - Supports int8 (GPU) and int4 (CPU) quantization
12
- - Automatic model card and README generation
13
- - Trackio monitoring integration
14
- - Comprehensive error handling and validation
15
-
16
- #### `scripts/model_tonic/quantize_standalone.py`
17
- - **Standalone quantization script** for independent use
18
- - Simple command-line interface
19
- - Option to save locally without pushing to HF Hub
20
- - Quick quantization workflow
21
-
22
- ### 2. Pipeline Integration
23
-
24
- #### Updated `launch.sh`
25
- - **Interactive quantization prompts** after model training
26
- - Support for single or dual quantization (int8 + int4)
27
- - Automatic repository naming with quantization suffixes
28
- - Enhanced summary reporting with quantization results
29
-
30
- ### 3. Documentation
31
-
32
- #### `docs/QUANTIZATION_GUIDE.md`
33
- - **Comprehensive quantization guide**
34
- - Usage examples and best practices
35
- - Performance comparisons
36
- - Troubleshooting section
37
- - Advanced configuration options
38
-
39
- #### Updated `README.md`
40
- - **Quantization section** with quick start examples
41
- - Integration with main pipeline documentation
42
- - Loading quantized models examples
43
-
44
- ### 4. Testing
45
-
46
- #### `tests/test_quantization.py`
47
- - **Comprehensive test suite** for quantization functionality
48
- - Tests for imports, initialization, configuration creation
49
- - Model validation and documentation generation tests
50
- - Automated testing workflow
51
-
52
- ### 5. Dependencies
53
-
54
- #### Updated `requirements/requirements.txt`
55
- - **Added torchao>=0.10.0** for quantization support
56
- - Maintains compatibility with existing dependencies
57
-
58
- ## 🔧 Quantization Types Supported
59
-
60
- ### int8_weight_only (GPU Optimized)
61
- - **Memory Reduction**: ~50%
62
- - **Accuracy**: Minimal degradation
63
- - **Speed**: Faster inference
64
- - **Hardware**: GPU optimized
65
- - **Use Case**: High-performance inference on GPU
66
-
67
- ### int4_weight_only (CPU Optimized)
68
- - **Memory Reduction**: ~75%
69
- - **Accuracy**: Some degradation acceptable
70
- - **Speed**: Significantly faster inference
71
- - **Hardware**: CPU optimized
72
- - **Use Case**: Deployment on CPU or memory-constrained environments
73
-
74
- ### int8_dynamic (Dynamic Quantization)
75
- - **Memory Reduction**: ~50%
76
- - **Accuracy**: Minimal degradation
77
- - **Speed**: Faster inference
78
- - **Hardware**: GPU optimized
79
- - **Use Case**: Dynamic quantization during inference
80
-
81
- ## 📋 Usage Examples
82
-
83
- ### Interactive Pipeline (launch.sh)
84
- ```bash
85
- ./launch.sh
86
- # Complete training and model push
87
- # Choose quantization options when prompted:
88
- # - y/n for quantization
89
- # - int8_weight_only / int4_weight_only / both
90
- ```
91
-
92
- ### Standalone Quantization
93
- ```bash
94
- # Quantize and push to HF Hub
95
- python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \
96
- --quant-type int8_weight_only \
97
- --token YOUR_HF_TOKEN
98
-
99
- # Quantize and save locally
100
- python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \
101
- --quant-type int4_weight_only \
102
- --device cpu \
103
- --save-only
104
- ```
105
-
106
- ### Loading Quantized Models
107
- ```python
108
- import torch
109
- from transformers import AutoModelForCausalLM, AutoTokenizer
110
-
111
- # Load int8 quantized model (GPU)
112
- model = AutoModelForCausalLM.from_pretrained(
113
- "your-username/model-int8",
114
- device_map="auto",
115
- torch_dtype=torch.bfloat16
116
- )
117
-
118
- # Load int4 quantized model (CPU)
119
- model = AutoModelForCausalLM.from_pretrained(
120
- "your-username/model-int4",
121
- device_map="cpu",
122
- torch_dtype=torch.bfloat16
123
- )
124
- ```
125
-
126
- ## 🧪 Testing
127
-
128
- Run the quantization tests:
129
- ```bash
130
- python tests/test_quantization.py
131
- ```
132
-
133
- Tests cover:
134
- - Import validation
135
- - Quantizer initialization
136
- - Configuration creation
137
- - Model validation
138
- - Documentation generation
139
-
140
- ## 📊 Performance Comparison
141
-
142
- | Model Type | Memory Usage | Speed | Accuracy | Hardware |
143
- |------------|--------------|-------|----------|----------|
144
- | Original | 100% | Baseline | Best | GPU/CPU |
145
- | int8 | ~50% | Faster | Minimal loss | GPU |
146
- | int4 | ~25% | Fastest | Some loss | CPU |
147
-
148
- ## 🔍 Key Features
149
-
150
- ### 1. Automatic Integration
151
- - Seamlessly integrated into the main training pipeline
152
- - Interactive prompts for quantization options
153
- - Automatic repository creation and naming
154
-
155
- ### 2. Comprehensive Documentation
156
- - Automatic model card generation
157
- - Detailed README creation
158
- - Usage examples and best practices
159
-
160
- ### 3. Monitoring Integration
161
- - Trackio logging for quantization events
162
- - Performance metrics tracking
163
- - Artifact storage and versioning
164
-
165
- ### 4. Error Handling
166
- - Robust validation of model paths
167
- - Graceful handling of quantization failures
168
- - Detailed error messages and logging
169
-
170
- ### 5. Flexibility
171
- - Support for multiple quantization types
172
- - Standalone usage option
173
- - Custom configuration options
174
-
175
- ## 🛠️ Technical Implementation
176
-
177
- ### Core Components
178
-
179
- 1. **ModelQuantizer Class**
180
- - Main quantization orchestration
181
- - HF Hub integration
182
- - Trackio monitoring
183
- - Error handling and validation
184
-
185
- 2. **Quantization Configuration**
186
- - torchao configuration management
187
- - Device-specific optimizations
188
- - Group size and parameter tuning
189
-
190
- 3. **Documentation Generation**
191
- - Automatic model card creation
192
- - README generation with usage examples
193
- - Performance and limitation documentation
194
-
195
- 4. **Pipeline Integration**
196
- - Interactive prompts in launch.sh
197
- - Automatic repository naming
198
- - Enhanced summary reporting
199
-
200
- ## 📈 Benefits
201
-
202
- ### For Users
203
- - **Easy Integration**: Seamless addition to existing pipeline
204
- - **Multiple Options**: Choose quantization type based on needs
205
- - **Performance**: Significant memory and speed improvements
206
- - **Documentation**: Automatic comprehensive documentation
207
-
208
- ### For Deployment
209
- - **GPU Optimization**: int8 for high-performance inference
210
- - **CPU Optimization**: int4 for resource-constrained environments
211
- - **Memory Efficiency**: 50-75% memory reduction
212
- - **Speed Improvement**: Faster inference times
213
-
214
- ## 🔮 Future Enhancements
215
-
216
- ### Planned Features
217
- 1. **Additional Quantization Types**: Support for more torchao configurations
218
- 2. **Automated Benchmarking**: Performance comparison tools
219
- 3. **Batch Quantization**: Process multiple models simultaneously
220
- 4. **Custom Configurations**: Advanced quantization parameter tuning
221
- 5. **Integration Testing**: End-to-end quantization workflow tests
222
-
223
- ### Potential Improvements
224
- 1. **Quantization-Aware Training**: Support for QAT workflows
225
- 2. **Mixed Precision**: Advanced precision optimization
226
- 3. **Hardware-Specific**: Optimizations for specific GPU/CPU types
227
- 4. **Automated Selection**: Smart quantization type selection
228
-
229
- ## 📚 References
230
-
231
- - [torchao Documentation](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
232
- - [Hugging Face Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization)
233
- - [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html)
234
-
235
- ## 🎯 Summary
236
-
237
- The quantization implementation provides a complete, production-ready solution for creating optimized versions of fine-tuned SmolLM3 models. The integration is seamless, the documentation is comprehensive, and the functionality is robust and well-tested.
238
-
239
- Key achievements:
240
- - ✅ Full pipeline integration
241
- - ✅ Multiple quantization types
242
- - ✅ Comprehensive documentation
243
- - ✅ Robust error handling
244
- - ✅ Testing suite
245
- - ✅ Monitoring integration
246
- - ✅ Standalone usage option
247
-
248
- The implementation follows the repository's architecture patterns and maintains consistency with existing code structure and documentation standards.
 
docs/README_END_TO_END.md DELETED
@@ -1,303 +0,0 @@
1
- # SmolLM3 End-to-End Fine-tuning Pipeline
2
-
3
- This repository provides a complete end-to-end pipeline for fine-tuning SmolLM3 models with integrated experiment tracking, monitoring, and model deployment.
4
-
5
- ## 🚀 Quick Start
6
-
7
- ### 1. Setup Configuration
8
-
9
- ```bash
10
- # Run the setup script to configure with your information
11
- python setup_launch.py
12
- ```
13
-
14
-
15
- ### 2. Check Requirements
16
-
17
- ```bash
18
- # Verify all dependencies are installed
19
- python check_requirements.py
20
- ```
21
-
22
- ### 3. Run the Pipeline
23
-
24
- ```bash
25
- # Make the script executable and run
26
- chmod +x launch.sh
27
- ./launch.sh
28
- ```
29
- This will prompt you for:
30
- - Your Hugging Face token
31
- - Optional model and dataset customizations
32
-
33
- ## 📋 What the Pipeline Does
34
-
35
- The end-to-end pipeline performs the following steps:
36
-
37
- ### 1. **Environment Setup**
38
- - Installs system dependencies
39
- - Creates Python virtual environment
40
- - Installs PyTorch with CUDA support
41
- - Installs all required Python packages
42
-
43
- ### 2. **Trackio Space Deployment**
44
- - Creates a new Hugging Face Space for experiment tracking
45
- - Configures the Trackio monitoring interface
46
- - Sets up environment variables
47
-
48
- ### 3. **HF Dataset Setup**
49
- - Creates a Hugging Face Dataset repository for experiment storage
50
- - Configures dataset access and permissions
51
- - Sets up initial experiment data structure
52
-
53
- ### 4. **Dataset Preparation**
54
- - Downloads the specified dataset from Hugging Face Hub
55
- - Converts to training format (prompt/completion pairs)
56
- - Handles multiple dataset formats automatically
57
- - Creates train/validation splits
58
-
59
- ### 5. **Training Configuration**
60
- - Creates optimized training configuration
61
- - Sets up monitoring integration
62
- - Configures model parameters and hyperparameters
63
-
64
- ### 6. **Model Training**
65
- - Runs the SmolLM3 fine-tuning process
66
- - Logs metrics to Trackio Space in real-time
67
- - Saves experiment data to HF Dataset
68
- - Creates checkpoints during training
69
-
70
- ### 7. **Model Deployment**
71
- - Pushes trained model to Hugging Face Hub
72
- - Creates comprehensive model card
73
- - Uploads training results and logs
74
- - Tests the uploaded model
75
-
76
- ### 8. **Summary Report**
77
- - Generates detailed training summary
78
- - Provides links to all resources
79
- - Documents configuration and results
80
-
81
- ## 🎯 Features
82
-
83
- ### **Integrated Monitoring**
84
- - Real-time experiment tracking via Trackio Space
85
- - Persistent storage in Hugging Face Datasets
86
- - Comprehensive metrics logging
87
- - System resource monitoring
88
-
89
- ### **Flexible Dataset Support**
90
- - Automatic format detection and conversion
91
- - Support for multiple dataset types
92
- - Built-in data preprocessing
93
- - Train/validation split handling
94
-
95
- ### **Optimized Training**
96
- - Flash Attention support for efficiency
97
- - Gradient checkpointing for memory optimization
98
- - Mixed precision training
99
- - Automatic hyperparameter optimization
100
-
101
- ### **Complete Deployment**
102
- - Automated model upload to Hugging Face Hub
103
- - Comprehensive model cards
104
- - Training results documentation
105
- - Model testing and validation
106
-
107
- ## 📊 Monitoring & Tracking
108
-
109
- ### **Trackio Space Interface**
110
- - Real-time training metrics visualization
111
- - Experiment management and comparison
112
- - System resource monitoring
113
- - Training progress tracking
114
-
115
- ### **HF Dataset Storage**
116
- - Persistent experiment data storage
117
- - Version-controlled experiment history
118
- - Collaborative experiment sharing
119
- - Automated data backup
120
-
121
- ## 🔧 Configuration
122
-
123
- ### **Required Configuration**
124
- Update these variables in `launch.sh`:
125
-
126
- ```bash
127
- # Your Hugging Face credentials
128
- HF_TOKEN="your_hf_token_here"
129
- HF_USERNAME="your-username"
130
-
131
- # Model and dataset
132
- MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
133
- DATASET_NAME="HuggingFaceTB/smoltalk"
134
-
135
- # Output repositories
136
- REPO_NAME="your-username/smollm3-finetuned-$(date +%Y%m%d)"
137
- TRACKIO_DATASET_REPO="your-username/trackio-experiments"
138
- ```
139
-
140
- ### **Training Parameters**
141
- Customize training parameters:
142
-
143
- ```bash
144
- # Training configuration
145
- BATCH_SIZE=2
146
- GRADIENT_ACCUMULATION_STEPS=8
147
- LEARNING_RATE=5e-6
148
- MAX_EPOCHS=3
149
- MAX_SEQ_LENGTH=4096
150
- ```
151
-
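The batch-related parameters above interact: the number of sequences contributing to each optimizer step is the per-device batch size times the gradient accumulation steps times the number of devices. A minimal sketch, assuming a single device (`num_devices` and the helper name are illustrative, not part of `launch.sh`):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Sequences contributing to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


# Defaults listed above: BATCH_SIZE=2, GRADIENT_ACCUMULATION_STEPS=8
default = effective_batch_size(2, 8)

# Low-memory settings from the troubleshooting section: BATCH_SIZE=1, GRADIENT_ACCUMULATION_STEPS=16
low_memory = effective_batch_size(1, 16)

# Upper bound on tokens per optimizer step with MAX_SEQ_LENGTH=4096
max_tokens_per_step = default * 4096

print(default, low_memory, max_tokens_per_step)
```

Both settings give the same effective batch size, which is why the low-memory variant reduces peak memory without changing the optimization schedule much.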
152
- ## 📁 Output Structure
153
-
154
- After running the pipeline, you'll have:
155
-
156
- ```
157
- ├── training_dataset/ # Prepared dataset
158
- │ ├── train.json
159
- │ └── validation.json
160
- ├── /output-checkpoint/ # Model checkpoints
161
- │ ├── config.json
162
- │ ├── pytorch_model.bin
163
- │ └── training_results/
164
- ├── training.log # Training logs
165
- ├── training_summary.md # Summary report
166
- └── config/train_smollm3_end_to_end.py # Training config
167
- ```
168
-
169
- ## 🌐 Online Resources
170
-
171
- The pipeline creates these online resources:
172
-
173
- - **Model Repository**: `https://huggingface.co/your-username/smollm3-finetuned-YYYYMMDD`
174
- - **Trackio Space**: `https://huggingface.co/spaces/your-username/trackio-monitoring-YYYYMMDD`
175
- - **Experiment Dataset**: `https://huggingface.co/datasets/your-username/trackio-experiments`
176
-
177
- ## 🛠️ Troubleshooting
178
-
179
- ### **Common Issues**
180
-
181
- 1. **HF Token Issues**
182
- ```bash
183
- # Verify your token is correct
184
- hf auth whoami
185
- ```
186
-
187
- 2. **CUDA Issues**
188
- ```bash
189
- # Check CUDA availability
190
- python -c "import torch; print(torch.cuda.is_available())"
191
- ```
192
-
193
- 3. **Memory Issues**
194
- ```bash
195
- # Reduce batch size or gradient accumulation
196
- BATCH_SIZE=1
197
- GRADIENT_ACCUMULATION_STEPS=16
198
- ```
199
-
200
- 4. **Dataset Issues**
201
- ```bash
202
- # Test dataset access
203
- python -c "from datasets import load_dataset; print(load_dataset('your-dataset'))"
204
- ```
205
-
206
- ### **Debug Mode**
207
-
208
- Run individual components for debugging:
209
-
210
- ```bash
211
- # Test Trackio deployment
212
- cd scripts/trackio_tonic
213
- python deploy_trackio_space.py
214
-
215
- # Test dataset setup
216
- cd scripts/dataset_tonic
217
- python setup_hf_dataset.py
218
-
219
- # Test training
220
- python src/train.py config/train_smollm3_end_to_end.py --help
221
- ```
222
-
223
- ## 📚 Advanced Usage
224
-
225
- ### **Custom Datasets**
226
-
227
- For custom datasets, ensure they have one of these formats:
228
-
229
- ```json
230
- // Format 1: Prompt/Completion
231
- {
232
- "prompt": "What is machine learning?",
233
- "completion": "Machine learning is..."
234
- }
235
-
236
- // Format 2: Instruction/Output
237
- {
238
- "instruction": "Explain machine learning",
239
- "output": "Machine learning is..."
240
- }
241
-
242
- // Format 3: Chat format
243
- {
244
- "messages": [
245
- {"role": "user", "content": "What is ML?"},
246
- {"role": "assistant", "content": "ML is..."}
247
- ]
248
- }
249
- ```
250
-
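The three accepted formats above can be reduced to a single prompt/completion shape before training. The function below is a sketch of one way to do that, not the pipeline's actual conversion code; `to_prompt_completion` is a hypothetical name, and for the chat format it assumes the first user message and the first assistant message form the pair.

```python
def to_prompt_completion(example: dict) -> dict:
    """Normalize one record from any of the three formats into prompt/completion."""
    # Format 1: already prompt/completion
    if "prompt" in example and "completion" in example:
        return {"prompt": example["prompt"], "completion": example["completion"]}

    # Format 2: instruction/output
    if "instruction" in example and "output" in example:
        return {"prompt": example["instruction"], "completion": example["output"]}

    # Format 3: chat messages (assumption: use the first user and first assistant turns)
    if "messages" in example:
        user = next((m["content"] for m in example["messages"] if m["role"] == "user"), None)
        assistant = next((m["content"] for m in example["messages"] if m["role"] == "assistant"), None)
        if user is not None and assistant is not None:
            return {"prompt": user, "completion": assistant}

    raise ValueError(f"Unrecognized record format with keys: {sorted(example)}")


record = {"messages": [{"role": "user", "content": "What is ML?"},
                       {"role": "assistant", "content": "ML is..."}]}
print(to_prompt_completion(record))
```

Multi-turn conversations would need a different policy (for example, one pair per assistant turn); this sketch deliberately keeps only the first exchange.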
251
- ### **Custom Models**
252
-
253
- To use different models, update the configuration:
254
-
255
- ```bash
256
- MODEL_NAME="microsoft/DialoGPT-medium"
257
- MAX_SEQ_LENGTH=1024
258
- ```
259
-
260
- ### **Custom Training**
261
-
262
- Modify training parameters in the generated config:
263
-
264
- ```python
265
- # In config/train_smollm3_end_to_end.py
266
- config = SmolLM3Config(
267
- learning_rate=1e-5, # Custom learning rate
268
- max_iters=5000, # Custom training steps
269
- # ... other parameters
270
- )
271
- ```
272
-
273
- ## 🤝 Contributing
274
-
275
- 1. Fork the repository
276
- 2. Create a feature branch
277
- 3. Make your changes
278
- 4. Test the pipeline
279
- 5. Submit a pull request
280
-
281
- ## 📄 License
282
-
283
- This project is licensed under the MIT License - see the LICENSE file for details.
284
-
285
- ## 🙏 Acknowledgments
286
-
287
- - Hugging Face for the excellent transformers library
288
- - The SmolLM3 team for the base model
289
- - The Trackio team for experiment tracking
290
- - The open-source community for contributions
291
-
292
- ## 📞 Support
293
-
294
- For issues and questions:
295
-
296
- 1. Check the troubleshooting section
297
- 2. Review the logs in `training.log`
298
- 3. Check the Trackio Space for monitoring data
299
- 4. Open an issue on GitHub
300
-
301
- ---
302
-
303
- **Happy Fine-tuning! 🚀**
 
docs/SFT_TRAINER_CONFIG_USAGE.md DELETED
@@ -1,233 +0,0 @@
1
- # SFT Trainer Configuration Usage Guide
2
-
3
- ## Overview
4
-
5
- This guide describes how the SFT (Supervised Fine-tuning) trainer uses the premade configuration files and how the `trainer_type` field is passed through the system.
6
-
7
- ## How SFT Trainer Uses Premade Configs
8
-
9
- ### 1. Configuration Loading Process
10
-
11
- The SFT trainer uses premade configs through the following process:
12
-
13
- 1. **Config File Selection**: Users specify a config file via command line or launch script
14
- 2. **Config Loading**: The system loads the config using `get_config()` function
15
- 3. **Config Inheritance**: All configs inherit from `SmolLM3Config` base class
16
- 4. **Trainer Type Detection**: The system checks for `trainer_type` field in the config
17
- 5. **Training Arguments Creation**: Config parameters are used to create `TrainingArguments`
18
-
19
- ### 2. Configuration Parameters Used by SFT Trainer
20
-
21
- The SFT trainer uses the following config parameters:
22
-
23
- #### Model Configuration
24
- - `model_name`: Model to load (e.g., "HuggingFaceTB/SmolLM3-3B")
25
- - `max_seq_length`: Maximum sequence length for tokenization
26
- - `use_flash_attention`: Whether to use flash attention
27
- - `use_gradient_checkpointing`: Whether to use gradient checkpointing
28
-
29
- #### Training Configuration
30
- - `batch_size`: Per-device batch size
31
- - `gradient_accumulation_steps`: Gradient accumulation steps
32
- - `learning_rate`: Learning rate for optimization
33
- - `weight_decay`: Weight decay for optimizer
34
- - `warmup_steps`: Number of warmup steps
35
- - `max_iters`: Maximum training iterations
36
- - `save_steps`: Save checkpoint every N steps
37
- - `eval_steps`: Evaluate every N steps
38
- - `logging_steps`: Log every N steps
39
-
40
- #### Optimizer Configuration
41
- - `optimizer`: Optimizer type (e.g., "adamw_torch")
42
- - `beta1`, `beta2`, `eps`: Optimizer parameters
43
-
44
- #### Scheduler Configuration
45
- - `scheduler`: Learning rate scheduler type
46
- - `min_lr`: Minimum learning rate
47
-
48
- #### Mixed Precision
49
- - `fp16`: Whether to use fp16 precision
50
- - `bf16`: Whether to use bf16 precision
51
-
52
- #### Data Configuration
53
- - `dataset_name`: Hugging Face dataset name
54
- - `data_dir`: Local dataset directory
55
- - `train_file`: Training file name
56
- - `validation_file`: Validation file name
57
-
58
- #### Monitoring Configuration
59
- - `enable_tracking`: Whether to enable Trackio tracking
60
- - `trackio_url`: Trackio server URL
61
- - `experiment_name`: Experiment name for tracking
62
-
63
- ### 3. Training Arguments Creation
64
-
65
- The SFT trainer creates `TrainingArguments` from config parameters:
66
-
67
- ```python
- def get_training_arguments(self, output_dir: str, **kwargs) -> TrainingArguments:
-     training_args = {
-         "output_dir": output_dir,
-         "per_device_train_batch_size": self.config.batch_size,
-         "per_device_eval_batch_size": self.config.batch_size,
-         "gradient_accumulation_steps": self.config.gradient_accumulation_steps,
-         "learning_rate": self.config.learning_rate,
-         "weight_decay": self.config.weight_decay,
-         "warmup_steps": self.config.warmup_steps,
-         "max_steps": self.config.max_iters,
-         "save_steps": self.config.save_steps,
-         "eval_steps": self.config.eval_steps,
-         "logging_steps": self.config.logging_steps,
-         "fp16": self.config.fp16,
-         "bf16": self.config.bf16,
-         # ... additional parameters
-     }
-     return TrainingArguments(**training_args)
- ```
87
-
88
- ### 4. Trainer Selection Logic
89
-
90
- The system determines which trainer to use based on the `trainer_type` field:
91
-
92
- ```python
- # Determine trainer type (command line overrides config)
- trainer_type = args.trainer_type or getattr(config, 'trainer_type', 'sft')
-
- # Initialize trainer based on type
- if trainer_type.lower() == 'dpo':
-     trainer = SmolLM3DPOTrainer(...)
- else:
-     trainer = SmolLM3Trainer(...)  # SFT trainer
- ```
102
-
103
- ## Configuration Files Structure
104
-
105
- ### Base Config (`config/train_smollm3.py`)
106
-
107
- ```python
- @dataclass
- class SmolLM3Config:
-     # Trainer type selection
-     trainer_type: str = "sft"  # "sft" or "dpo"
-
-     # Model configuration
-     model_name: str = "HuggingFaceTB/SmolLM3-3B"
-     max_seq_length: int = 4096
-     # ... other fields
- ```
118
-
119
- ### DPO Config (`config/train_smollm3_dpo.py`)
120
-
121
- ```python
- @dataclass
- class SmolLM3DPOConfig(SmolLM3Config):
-     # Trainer type selection
-     trainer_type: str = "dpo"  # Override default to use DPO trainer
-
-     # DPO-specific configuration
-     beta: float = 0.1
-     # ... DPO-specific fields
- ```
131
-
132
- ### Specialized Configs (e.g., `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`)
133
-
134
- ```python
- @dataclass
- class SmolLM3ConfigOpenHermesFRMultiplePasses(SmolLM3Config):
-     # Inherits trainer_type = "sft" from base config
-
-     # Specialized configuration for multiple passes
-     batch_size: int = 6
-     gradient_accumulation_steps: int = 20
-     learning_rate: float = 3e-6
-     max_iters: int = 25000
-     # ... other specialized fields
- ```
146
-
147
- ## Trainer Type Priority
148
-
149
- The trainer type is determined in the following order of priority:
150
-
151
- 1. **Command line argument** (`--trainer_type`) - Highest priority
152
- 2. **Config file** (`trainer_type` field) - Medium priority
153
- 3. **Default value** (`"sft"`) - Lowest priority
154
-
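The priority order above can be sketched as a small helper. This is a minimal illustration, not the project's code: `resolve_trainer_type` and `ExampleConfig` are hypothetical names, and lowercasing the result is an assumption made to mirror the `trainer_type.lower()` check in the selection logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for a config object; not the project's class."""
    trainer_type: str = "sft"


def resolve_trainer_type(cli_value: Optional[str], config: object) -> str:
    """Pick the trainer type: CLI value first, then config field, then "sft"."""
    # `or` treats None and "" alike, so an empty CLI value falls through.
    chosen = cli_value or getattr(config, "trainer_type", None) or "sft"
    return chosen.lower()


print(resolve_trainer_type("dpo", ExampleConfig()))      # command line wins
print(resolve_trainer_type(None, ExampleConfig("dpo")))  # config field is used
print(resolve_trainer_type(None, object()))              # falls back to the default
```

Keeping the fallback chain in one expression makes the precedence easy to audit, at the cost of treating an empty string the same as a missing value.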
155
- ## Usage Examples
156
-
157
- ### Using SFT Trainer with Different Configs
158
-
159
- ```bash
160
- # Basic SFT training (uses base config)
161
- python src/train.py config/train_smollm3.py
162
-
163
- # SFT training with specialized config
164
- python src/train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
165
-
166
- # SFT training with override
167
- python src/train.py config/train_smollm3.py --trainer_type sft
168
-
169
- # DPO training (uses DPO config)
170
- python src/train.py config/train_smollm3_dpo.py
171
-
172
- # Override config's trainer type
173
- python src/train.py config/train_smollm3.py --trainer_type dpo
174
- ```
175
-
176
- ### Launch Script Usage
177
-
178
- ```bash
179
- ./launch.sh
180
- # Select "SFT" when prompted for trainer type
181
- # The system will use the appropriate config based on selection
182
- ```
183
-
184
- ## Configuration Inheritance
185
-
186
- All specialized configs inherit from `SmolLM3Config` and automatically get:
187
-
188
- - `trainer_type = "sft"` (default)
189
- - All base training parameters
190
- - All monitoring configuration
191
- - All data configuration
192
-
193
- Specialized configs can override any of these parameters for their specific use case.
194
-
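As a rough sketch of how dataclass inheritance yields this behavior, the example below uses hypothetical `BaseConfig` and `MultiPassConfig` classes. The base values for `batch_size` and `learning_rate` are illustrative assumptions, not the defaults of `SmolLM3Config`.

```python
from dataclasses import dataclass


@dataclass
class BaseConfig:
    """Hypothetical base config; field defaults here are assumptions."""
    trainer_type: str = "sft"
    batch_size: int = 4
    learning_rate: float = 2e-5


@dataclass
class MultiPassConfig(BaseConfig):
    """Overrides two fields and inherits everything else unchanged."""
    batch_size: int = 6
    learning_rate: float = 3e-6


cfg = MultiPassConfig()
# trainer_type comes from the base class; the other two are overridden.
print(cfg.trainer_type, cfg.batch_size, cfg.learning_rate)
```

A subclass only needs to redeclare the fields it changes, which is why a specialized config that never mentions `trainer_type` still selects the SFT trainer.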
195
- ## SFT Trainer Features
196
-
197
- The SFT trainer provides:
198
-
199
- 1. **SFTTrainer Backend**: Uses Hugging Face's `SFTTrainer` for instruction tuning
200
- 2. **Fallback Support**: Falls back to standard `Trainer` if `SFTTrainer` fails
201
- 3. **Config Integration**: Uses all config parameters for training setup
202
- 4. **Monitoring**: Integrates with Trackio for experiment tracking
203
- 5. **Checkpointing**: Supports model checkpointing and resuming
204
- 6. **Mixed Precision**: Supports fp16 and bf16 training
205
-
206
- ## Troubleshooting
207
-
208
- ### Common Issues
209
-
210
- 1. **Missing trainer_type field**: Ensure all configs have the `trainer_type` field
211
- 2. **Config inheritance issues**: Check that specialized configs properly inherit from base
212
- 3. **Parameter conflicts**: Command line arguments override config values, so check that a flag such as `--trainer_type` is not silently replacing the value set in the config
213
-
214
- ### Debugging
215
-
216
- Run training and inspect the log output to see which trainer and config are in use:
217
-
218
- ```bash
219
- python src/train.py config/train_smollm3.py --trainer_type sft
220
- ```
221
-
222
- Look for these log messages:
223
- ```
224
- Using trainer type: sft
225
- Initializing SFT trainer...
226
- Creating SFTTrainer with training arguments...
227
- ```
228
-
229
- ## Related Documentation
230
-
231
- - [Trainer Selection Guide](TRAINER_SELECTION_GUIDE.md)
232
- - [Training Configuration Guide](TRAINING_CONFIGURATION_GUIDE.md)
233
- - [Monitoring Integration Guide](MONITORING_INTEGRATION_GUIDE.md)
 
docs/TOKEN_FIX_SUMMARY.md DELETED
@@ -1,249 +0,0 @@
1
- # Token Fix Summary
2
-
3
- ## Issue Identified
4
-
5
- The user encountered an error when running the launch script:
6
-
7
- ```
8
- usage: hf <command> [<args>]
9
- hf: error: argument {auth,cache,download,jobs,repo,repo-files,upload,upload-large-folder,env,version,lfs-enable-largefiles,lfs-multipart-upload}: invalid choice: 'login' (choose from 'auth', 'cache', 'download', 'jobs', 'repo', 'repo-files', 'upload', 'upload-large-folder', 'env', 'version', 'lfs-enable-largefiles', 'lfs-multipart-upload')
10
- ❌ Failed to login to Hugging Face
11
- ```
12
-
13
- ## Root Cause
14
-
15
- The `launch.sh` script called `hf login`, which is not a valid command in the installed version of the Hugging Face CLI (the usage message above lists `auth` but no top-level `login`). More generally, the script relied on CLI commands for authentication instead of the Python API.
16
-
17
- ## Fixes Applied
18
-
19
- ### 1. **Removed HF Login Step** ✅ **FIXED**
20
-
21
- **File**: `launch.sh`
22
-
23
- **Before**:
24
- ```bash
- # Login to Hugging Face with token
- print_info "Logging in to Hugging Face..."
- if hf login --token "$HF_TOKEN" --add-to-git-credential; then
-     print_status "Successfully logged in to Hugging Face"
-     print_info "Username: $(hf whoami)"
- else
-     print_error "Failed to login to Hugging Face"
-     print_error "Please check your token and try again"
-     exit 1
- fi
- ```
36
-
37
- **After**:
38
- ```bash
39
- # Set HF token for Python API usage
40
- print_info "Setting up Hugging Face token for Python API..."
41
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
42
- print_status "HF token configured for Python API usage"
43
- print_info "Username: $HF_USERNAME (auto-detected from token)"
44
- ```
45
-
46
- ### 2. **Updated Dataset Setup Script** ✅ **FIXED**
47
-
48
- **File**: `scripts/dataset_tonic/setup_hf_dataset.py`
49
-
50
- **Changes**:
51
- - Updated `main()` function to properly get token from environment
52
- - Added token validation before proceeding
53
- - Improved error handling for missing tokens
54
-
55
- **Before**:
56
- ```python
- def main():
-     """Main function to set up the dataset."""
-
-     # Get dataset name from command line or use default
-     dataset_name = None
-     if len(sys.argv) > 2:
-         dataset_name = sys.argv[2]
-
-     success = setup_trackio_dataset(dataset_name)
-     sys.exit(0 if success else 1)
- ```
68
-
69
- **After**:
70
- ```python
- def main():
-     """Main function to set up the dataset."""
-
-     # Get token from environment first
-     token = os.environ.get('HUGGING_FACE_HUB_TOKEN') or os.environ.get('HF_TOKEN')
-
-     # If no token in environment, try command line argument
-     if not token and len(sys.argv) > 1:
-         token = sys.argv[1]
-
-     if not token:
-         print("❌ No HF token found. Please set HUGGING_FACE_HUB_TOKEN environment variable or provide as argument.")
-         sys.exit(1)
-
-     # Get dataset name from command line or use default
-     dataset_name = None
-     if len(sys.argv) > 2:
-         dataset_name = sys.argv[2]
-
-     success = setup_trackio_dataset(dataset_name)
-     sys.exit(0 if success else 1)
- ```
93
-
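The lookup order used in the updated `main()` can be isolated into a function that is easy to test. The sketch below is illustrative only: `resolve_token` is a hypothetical helper, and passing the environment and argument list in explicitly is a design assumption made so the logic can be exercised without touching real credentials.

```python
import os
import sys
from typing import Mapping, Optional, Sequence


def resolve_token(env: Mapping[str, str], argv: Sequence[str]) -> Optional[str]:
    """Return the token from the environment, else from argv[1], else None."""
    # Environment variables take precedence, in the same order as the script.
    token = env.get("HUGGING_FACE_HUB_TOKEN") or env.get("HF_TOKEN")
    # Fall back to the first positional argument when the environment is empty.
    if not token and len(argv) > 1:
        token = argv[1]
    return token or None


if __name__ == "__main__":
    found = resolve_token(os.environ, sys.argv)
    # Never print the token itself; only report whether one was found.
    print("token found" if found else "no token found")
```

Taking `env` and `argv` as parameters keeps the function pure, so a test can cover all three branches with plain dictionaries and lists.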
94
- ### 3. **Updated Launch Script to Pass Token** ✅ **FIXED**
95
-
96
- **File**: `launch.sh`
97
-
98
- **Changes**:
99
- - Updated dataset setup call to pass token as argument
100
- - Updated Trackio Space deployment call to pass token as argument
101
-
102
- **Before**:
103
- ```bash
104
- python setup_hf_dataset.py
105
- ```
106
-
107
- **After**:
108
- ```bash
109
- python setup_hf_dataset.py "$HF_TOKEN"
110
- ```
111
-
112
- **Before**:
113
- ```bash
114
- python deploy_trackio_space.py << EOF
115
- $TRACKIO_SPACE_NAME
116
- $HF_TOKEN
117
- $GIT_EMAIL
118
-
119
- EOF
120
- ```
121
-
122
- **After**:
123
- ```bash
124
- python deploy_trackio_space.py "$TRACKIO_SPACE_NAME" "$HF_TOKEN" "$GIT_EMAIL"
125
- ```
126
-
127
- ### 4. **Updated Space Deployment Script** ✅ **FIXED**
128
-
129
- **File**: `scripts/trackio_tonic/deploy_trackio_space.py`
130
-
131
- **Changes**:
132
- - Updated `main()` function to handle command line arguments
133
- - Added support for both interactive and command-line modes
134
- - Improved token handling and validation
135
-
136
- **Before**:
137
- ```python
- def main():
-     """Main deployment function"""
-     print("Trackio Space Deployment Script")
-     print("=" * 40)
-
-     # Get user input (no username needed - will be extracted from token)
-     space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
-     token = input("Enter your Hugging Face token: ").strip()
- ```
147
-
148
- **After**:
149
- ```python
- def main():
-     """Main deployment function"""
-     print("Trackio Space Deployment Script")
-     print("=" * 40)
-
-     # Check if arguments are provided
-     if len(sys.argv) >= 3:
-         # Use command line arguments
-         space_name = sys.argv[1]
-         token = sys.argv[2]
-         git_email = sys.argv[3] if len(sys.argv) > 3 else None
-         git_name = sys.argv[4] if len(sys.argv) > 4 else None
-
-         print(f"Using provided arguments:")
-         print(f" Space name: {space_name}")
-         print(f" Token: {'*' * 10}...{token[-4:]}")
-         print(f" Git email: {git_email or 'default'}")
-         print(f" Git name: {git_name or 'default'}")
-     else:
-         # Get user input (no username needed - will be extracted from token)
-         space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
-         token = input("Enter your Hugging Face token: ").strip()
- ```
173
-
174
- ## Key Improvements
175
-
176
- ### 1. **Complete Python API Usage**
177
- - ✅ **No CLI commands**: All authentication uses Python API
178
- - ✅ **Direct token passing**: Token passed directly to functions
179
- - ✅ **Environment variables**: Proper environment variable setup
180
- - ✅ **No username required**: Automatic extraction from token
181
-
182
- ### 2. **Robust Error Handling**
183
- - ✅ **Token validation**: Proper token validation before use
184
- - ✅ **Environment fallbacks**: Multiple ways to get token
185
- - ✅ **Clear error messages**: Descriptive error messages
186
- - ✅ **Graceful degradation**: Fallback mechanisms
187
-
188
- ### 3. **Automated Token Handling**
189
- - ✅ **Automatic extraction**: Username extracted from token
190
- - ✅ **Environment setup**: Token set in environment variables
191
- - ✅ **Command line support**: Token passed as arguments
192
- - ✅ **No manual input**: No username required
193
-
194
- ## Test Results
195
-
196
- ### **Token Validation Test**
197
- ```bash
198
- $ python tests/test_token_fix.py
199
-
200
- 🚀 Token Validation and Deployment Tests
201
- ==================================================
202
- 🔍 Testing Token Validation
203
- ✅ Token validation module imported successfully
204
- ✅ Token validation successful!
205
- ✅ Username: Tonic
206
-
207
- 🔍 Testing Dataset Setup
208
- ✅ Dataset setup module imported successfully
209
- ✅ Username extraction successful: Tonic
210
-
211
- 🔍 Testing Space Deployment
212
- ✅ Space deployment module imported successfully
213
- ✅ Space deployer initialization successful
214
- ✅ Username: Tonic
215
-
216
- ==================================================
217
- 🎉 ALL TOKEN TESTS PASSED!
218
- ✅ Token validation: Working
219
- ✅ Dataset setup: Working
220
- ✅ Space deployment: Working
221
-
222
- The token is working correctly with all components!
223
- ```
224
-
225
- ## User Token
226
-
227
- **Token**: `xxxx`
228
-
229
- **Status**: ✅ **Working correctly**
230
-
231
- **Username**: `Tonic` (auto-detected)
232
-
233
- ## Next Steps
234
-
235
- The user can now run the launch script without encountering the HF login error:
236
-
237
- ```bash
238
- ./launch.sh
239
- ```
240
-
241
- The script will:
242
- 1. ✅ **Validate token** using Python API
243
- 2. ✅ **Extract username** automatically from token
244
- 3. ✅ **Set environment variables** for Python API usage
245
- 4. ✅ **Deploy Trackio Space** using Python API
246
- 5. ✅ **Setup HF Dataset** using Python API
247
- 6. ✅ **Configure all components** automatically
248
-
249
- **No manual username input required!** 🎉
 
docs/TOKEN_VALIDATION_FIX.md DELETED
@@ -1,183 +0,0 @@
1
- # Hugging Face Token Validation Fix
2
-
3
- ## Problem Description
4
-
5
- The original launch script was using the `hf` CLI command to validate Hugging Face tokens, which was causing authentication failures even with valid tokens. This was due to:
6
-
7
- 1. CLI installation issues
8
- 2. Inconsistent token format handling
9
- 3. Poor error reporting
10
-
11
- ## Solution Implementation
12
-
13
- ### New Python-Based Validation System
14
-
15
- We've implemented a robust Python-based token validation system using the official `huggingface_hub` API:
16
-
17
- #### Key Components
18
-
19
- 1. **`scripts/validate_hf_token.py`** - Main validation script
20
- 2. **Updated `launch.sh`** - Modified to use Python validation
21
- 3. **`tests/test_token_validation.py`** - Test suite for validation
22
- 4. **`scripts/check_dependencies.py`** - Dependency verification
23
-
24
- ### Features
25
-
26
- - ✅ **Robust Error Handling**: Detailed error messages for different failure types
27
- - ✅ **JSON Output**: Structured responses for easy parsing
28
- - ✅ **Multiple Input Methods**: Command line arguments or environment variables
29
- - ✅ **Username Extraction**: Automatically retrieves username from valid tokens
30
- - ✅ **Dependency Checking**: Verifies required packages are installed
31
-
32
- ## Usage
33
-
34
- ### Direct Script Usage
35
-
36
- ```bash
37
- # Using command line argument
38
- python scripts/validate_hf_token.py hf_your_token_here
39
-
40
- # Using environment variable
41
- export HF_TOKEN=hf_your_token_here
42
- python scripts/validate_hf_token.py
43
- ```
44
-
45
- ### Expected Output
46
-
47
- **Success:**
48
- ```json
49
- {"success": true, "username": "YourUsername", "error": null}
50
- ```
51
-
52
- **Failure:**
53
- ```json
54
- {"success": false, "username": null, "error": "Invalid token - unauthorized access"}
55
- ```
56
-
57
- ### Integration with Launch Script
58
-
59
- The `launch.sh` script now automatically:
60
-
61
- 1. Prompts for your HF token
62
- 2. Validates it using the Python script
63
- 3. Extracts your username automatically
64
- 4. Provides detailed error messages if validation fails
65
-
66
- ## Error Types and Solutions
67
-
68
- ### Common Error Messages
69
-
70
- | Error Message | Cause | Solution |
71
- |---------------|-------|----------|
72
- | "Invalid token - unauthorized access" | Token is invalid or expired | Generate new token at https://huggingface.co/settings/tokens |
73
- | "Token lacks required permissions" | Token doesn't have write access | Ensure token has write permissions |
74
- | "Network error" | Connection issues | Check internet connection |
75
- | "Failed to run token validation script" | Missing dependencies | Run `pip install huggingface_hub` |
76
-
77
- ### Dependency Installation
78
-
79
- ```bash
80
- # Install required dependencies
81
- pip install huggingface_hub
82
-
83
- # Check all dependencies
84
- python scripts/check_dependencies.py
85
-
86
- # Install all requirements
87
- pip install -r requirements/requirements.txt
88
- ```
89
-
90
- ## Testing
91
-
92
- ### Run the Test Suite
93
-
94
- ```bash
95
- python tests/test_token_validation.py
96
- ```
97
-
98
- ### Manual Testing
99
-
100
- ```bash
101
- # Test with your token
102
- python scripts/validate_hf_token.py hf_your_token_here
103
-
104
- # Test dependency check
105
- python scripts/check_dependencies.py
106
- ```
107
-
108
- ## Troubleshooting
109
-
110
- ### If Token Validation Still Fails
111
-
112
- 1. **Check Token Format**: Ensure token starts with `hf_`
113
- 2. **Verify Token Permissions**: Token needs read/write access
114
- 3. **Check Network**: Ensure internet connection is stable
115
- 4. **Update Dependencies**: Run `pip install --upgrade huggingface_hub`
116
-
117
- ### If Launch Script Fails
118
-
119
- 1. **Check Python Path**: Ensure `python3` is available
120
- 2. **Verify Script Permissions**: Script should be executable
121
- 3. **Check JSON Parsing**: Ensure Python can parse JSON output
122
- 4. **Review Error Messages**: Check the specific error in launch.sh output
123
-
124
- ## Technical Details
125
-
126
- ### Token Validation Process
127
-
128
- 1. **Environment Setup**: Sets `HUGGING_FACE_HUB_TOKEN` environment variable
129
- 2. **API Client Creation**: Initializes `HfApi()` client
130
- 3. **User Info Retrieval**: Calls `api.whoami()` to validate token
131
- 4. **Username Extraction**: Extracts username from user info
132
- 5. **Error Handling**: Catches and categorizes different error types
133
-
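The validation steps listed above can be sketched roughly as follows. This is not the contents of `scripts/validate_hf_token.py`: `validate_token` and `fake_whoami` are hypothetical, the token is passed to `HfApi` directly rather than through an environment variable, and all errors are reported with a single generic message instead of being categorized. The only library calls assumed are `HfApi(token=...)` and `HfApi.whoami()` from `huggingface_hub`.

```python
import json
from typing import Any, Callable, Dict, Optional


def validate_token(
    token: str,
    whoami: Optional[Callable[[str], Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Return a result dict shaped like the JSON output described in this guide."""
    if whoami is None:
        # Imported lazily so the sketch can be exercised without the package.
        from huggingface_hub import HfApi

        def whoami(tok: str) -> Dict[str, Any]:
            return HfApi(token=tok).whoami()

    try:
        info = whoami(token)
    except Exception as exc:  # a real script may distinguish error types
        return {"success": False, "username": None, "error": str(exc)}
    return {"success": True, "username": info.get("name"), "error": None}


def fake_whoami(tok: str) -> Dict[str, Any]:
    """Offline stand-in for the Hub call, used only for demonstration."""
    if tok != "hf_example":
        raise ValueError("Invalid token - unauthorized access")
    return {"name": "ExampleUser"}


print(json.dumps(validate_token("hf_example", fake_whoami)))
print(json.dumps(validate_token("bad", fake_whoami)))
```

Injecting the `whoami` callable lets the success and failure paths be checked without network access; omitting it falls back to the real Hub API.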
134
- ### JSON Parsing in Shell
135
-
136
- The launch script uses Python's JSON parser to safely extract values:
137
-
138
- ```bash
- local success=$(echo "$result" | python3 -c "
- import sys, json
- try:
-     data = json.load(sys.stdin)
-     print(data.get('success', False))
- except:
-     print('False')
- ")
- ```
148
-
149
- ## Migration from Old System
150
-
151
- ### Before (CLI-based)
152
- ```bash
153
- if hf whoami >/dev/null 2>&1; then
154
- HF_USERNAME=$(hf whoami | head -n1 | tr -d '\n')
155
- ```
156
-
157
- ### After (Python-based)
158
- ```bash
159
- if result=$(python3 scripts/validate_hf_token.py "$token" 2>/dev/null); then
160
- # Parse JSON result with error handling
161
- local success=$(echo "$result" | python3 -c "...")
162
- local username=$(echo "$result" | python3 -c "...")
163
- ```
164
-
165
- ## Benefits
166
-
167
- 1. **Reliability**: Uses official Python API instead of CLI
168
- 2. **Error Reporting**: Detailed error messages for debugging
169
- 3. **Cross-Platform**: Works on Windows, Linux, and macOS
170
- 4. **Maintainability**: Easy to update and extend
171
- 5. **Testing**: Comprehensive test suite included
172
-
173
- ## Future Enhancements
174
-
175
- - [ ] Add token expiration checking
176
- - [ ] Implement token refresh functionality
177
- - [ ] Add support for organization tokens
178
- - [ ] Create GUI for token management
179
- - [ ] Add token security validation
180
-
181
- ---
182
-
183
- **Note**: This fix ensures that valid Hugging Face tokens are properly recognized and that users get clear feedback when there are authentication issues.
 
docs/TRACKIO_API_FIX_SUMMARY.md DELETED
@@ -1,276 +0,0 @@
1
- # Trackio API Fix Summary
2
-
3
- ## Overview
4
-
5
- This document summarizes the fixes applied to resolve the 404 errors in the Trackio integration and implement automatic Space URL resolution.
6
-
7
- ## Issues Identified
8
-
9
- ### 1. **404 Errors in Trackio API Calls**
10
- - **Problem**: The original API client was using incorrect endpoints and HTTP request patterns
11
- - **Error**: `POST request failed: 404 - Cannot POST /spaces/Tonic/trackio-monitoring-20250727/gradio_api/call/list_experiments_interface`
12
- - **Root Cause**: Using raw HTTP requests instead of the proper Gradio client API
13
-
14
- ### 2. **Hardcoded Space URL**
15
- - **Problem**: The Space URL was hardcoded, making it inflexible
16
- - **Issue**: No automatic resolution of Space URLs from Space IDs
17
- - **Impact**: Required manual URL updates when Space deployment changes
18
-
19
- ## Solutions Implemented
20
-
21
- ### 1. **Updated API Client to Use Gradio Client**
22
-
23
- **File**: `scripts/trackio_tonic/trackio_api_client.py`
24
-
25
- **Changes**:
26
- - Replaced custom HTTP requests with `gradio_client.Client`
27
- - Lets `gradio_client` handle the two-step call protocol (POST to obtain an `event_id`, then GET to fetch the results)
28
- - Handles all Gradio API endpoints correctly
29
-
30
- **Before**:
31
- ```python
32
- # Custom HTTP requests with manual event_id handling
33
- response = requests.post(url, json=payload)
34
- event_id = response.json()["event_id"]
35
- result = requests.get(f"{url}/{event_id}")
36
- ```
37
-
38
- **After**:
39
- ```python
40
- # Using gradio_client for proper API communication
41
- result = self.client.predict(*args, api_name=api_name)
42
- ```
43
-
44
- ### 2. **Automatic Space URL Resolution**
45
-
46
- **Implementation**:
47
- - Uses Hugging Face Hub API to resolve Space URLs from Space IDs
48
- - Falls back to default URL format if API is unavailable
49
- - Supports both authenticated and anonymous access
50
-
51
- **Key Features**:
52
- ```python
- def _resolve_space_url(self) -> Optional[str]:
-     """Resolve Space URL using Hugging Face Hub API"""
-     api = HfApi(token=self.hf_token)
-     space_info = api.space_info(self.space_id)
-     if space_info and hasattr(space_info, 'host'):
-         return space_info.host
-     else:
-         # Fallback to default URL format
-         space_name = self.space_id.replace('/', '-')
-         return f"https://{space_name}.hf.space"
- ```
64
-
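One detail worth noting about the fallback: the hardcoded URL shown elsewhere in this document is all lowercase (`tonic-trackio-monitoring-20250727`), while replacing `/` with `-` alone keeps the capital `T` from the Space ID. The sketch below shows a fallback that also lowercases the result. `fallback_space_url` is a hypothetical helper, and lowercasing is an assumption based only on that example URL; other characters in a Space ID may need further normalization that is not handled here.

```python
def fallback_space_url(space_id: str) -> str:
    """Build a default Space URL from an 'owner/name' Space ID."""
    owner, _, name = space_id.partition("/")
    if not owner or not name:
        raise ValueError(f"expected 'owner/name', got {space_id!r}")
    # Join owner and name with a hyphen and lowercase the subdomain.
    subdomain = f"{owner}-{name}".lower()
    return f"https://{subdomain}.hf.space"


print(fallback_space_url("Tonic/trackio-monitoring-20250727"))
```

Validating the `owner/name` shape up front avoids producing a plausible-looking but wrong URL from a malformed ID.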
65
- ### 3. **Updated Client Interface**
66
-
67
- **Before**:
68
- ```python
69
- client = TrackioAPIClient("https://tonic-trackio-monitoring-20250727.hf.space")
70
- ```
71
-
72
- **After**:
73
- ```python
74
- client = TrackioAPIClient("Tonic/trackio-monitoring-20250727", hf_token)
75
- ```
76
-
77
- ### 4. **Enhanced Monitoring Integration**
78
-
79
- **File**: `src/monitoring.py`
80
-
81
- **Changes**:
82
- - Updated to use Space ID instead of hardcoded URL
83
- - Automatic experiment creation with proper ID extraction
84
- - Better error handling and fallback mechanisms
85
-
86
- ## Dependencies Added
87
-
88
- ### Required Packages
89
- ```bash
90
- pip install gradio_client huggingface_hub
91
- ```
92
-
93
- ### Package Versions
94
- - `gradio_client>=1.10.4` - For proper Gradio API communication
95
- - `huggingface_hub>=0.19.3` - For Space URL resolution
96
-
97
- ## API Endpoints Supported
98
-
99
- The updated client supports all documented Gradio endpoints:
100
-
101
- 1. **Experiment Management**:
102
- - `/create_experiment_interface` - Create new experiments
103
- - `/list_experiments_interface` - List all experiments
104
- - `/get_experiment_details` - Get experiment details
105
- - `/update_experiment_status_interface` - Update experiment status
106
-
107
- 2. **Metrics and Parameters**:
108
- - `/log_metrics_interface` - Log training metrics
109
- - `/log_parameters_interface` - Log experiment parameters
110
-
111
- 3. **Visualization**:
112
- - `/create_metrics_plot` - Create metrics plots
113
- - `/create_experiment_comparison` - Compare experiments
114
-
115
- 4. **Testing and Demo**:
116
- - `/simulate_training_data` - Simulate training data
117
- - `/create_demo_experiment` - Create demo experiments
118
-
119
- ## Configuration
120
-
121
- ### Environment Variables
122
- ```bash
123
- # Required for Space URL resolution
124
- export HF_TOKEN="your_huggingface_token"
125
-
126
- # Optional: Custom Space ID
127
- export TRACKIO_SPACE_ID="your-username/your-space-name"
128
-
129
- # Optional: Dataset repository
130
- export TRACKIO_DATASET_REPO="your-username/your-dataset"
131
- ```
132
-
133
- ### Default Configuration
134
- - **Default Space ID**: `Tonic/trackio-monitoring-20250727`
135
- - **Default Dataset**: `tonic/trackio-experiments`
136
- - **Auto-resolution**: Enabled by default
137
-
138
- ## Testing
139
-
140
- ### Test Script
141
- **File**: `tests/test_trackio_api_fix.py`
142
-
143
- **Tests Included**:
144
- 1. **Space URL Resolution** - Tests automatic URL resolution
145
- 2. **API Client** - Tests all API endpoints
146
- 3. **Monitoring Integration** - Tests full monitoring workflow
147
-
148
- ### Running Tests
149
- ```bash
150
- python tests/test_trackio_api_fix.py
151
- ```
152
-
153
- **Expected Output**:
154
- ```
155
- 🚀 Starting Trackio API Client Tests with Automatic URL Resolution
156
- ======================================================================
157
- ✅ Space URL Resolution: PASSED
158
- ✅ API Client Test: PASSED
159
- ✅ Monitoring Integration: PASSED
160
-
161
- 🎉 All tests passed! The Trackio integration with automatic URL resolution is working correctly.
162
- ```
163
-
164
- ## Benefits
165
-
166
- ### 1. **Reliability**
167
- - ✅ No more 404 errors
168
- - ✅ Proper error handling and fallbacks
169
- - ✅ Automatic retry mechanisms
170
-
171
- ### 2. **Flexibility**
172
- - ✅ Automatic Space URL resolution
173
- - ✅ Support for any Trackio Space
174
- - ✅ Configurable via environment variables
175
-
176
- ### 3. **Maintainability**
177
- - ✅ Clean separation of concerns
178
- - ✅ Proper logging and debugging
179
- - ✅ Comprehensive test coverage
180
-
181
- ### 4. **User Experience**
182
- - ✅ Seamless integration with training pipeline
183
- - ✅ Real-time experiment monitoring
184
- - ✅ Automatic experiment creation and management
185
-
186
- ## Usage Examples
187
-
188
- ### Basic Usage
189
- ```python
190
- from scripts.trackio_tonic.trackio_api_client import TrackioAPIClient
191
-
192
- # Initialize with Space ID (URL resolved automatically)
193
- client = TrackioAPIClient("Tonic/trackio-monitoring-20250727")
194
-
195
- # Create experiment
196
- result = client.create_experiment("my_experiment", "Test experiment")
197
-
198
- # Log metrics
199
- metrics = {"loss": 1.234, "accuracy": 0.85}
200
- client.log_metrics("exp_123", metrics, step=100)
201
- ```
202
-
203
- ### With Monitoring Integration
204
- ```python
205
- from src.monitoring import SmolLM3Monitor
206
-
207
- # Create monitor (automatically creates experiment)
- monitor = SmolLM3Monitor(
-     experiment_name="my_training_run",
-     enable_tracking=True
- )
212
-
213
- # Log metrics during training
214
- monitor.log_metrics({"loss": 1.234}, step=100)
215
-
216
- # Log configuration
217
- monitor.log_config({"learning_rate": 2e-5, "batch_size": 8})
218
- ```
219
-
220
- ## Troubleshooting
221
-
222
- ### Common Issues
223
-
224
- 1. **"gradio_client not available"**
225
- ```bash
226
- pip install gradio_client
227
- ```
228
-
229
- 2. **"huggingface_hub not available"**
230
- ```bash
231
- pip install huggingface_hub
232
- ```
233
-
234
- 3. **"Space not accessible"**
235
- - Check if the Space is running
236
- - Verify Space ID is correct
237
- - Ensure HF token has proper permissions
238
-
239
- 4. **"Experiment not found"**
240
- - Experiments are created automatically by the monitor
241
- - Use the experiment ID returned by `create_experiment()`
242
-
243
- ### Debug Mode
244
- Enable debug logging to see detailed API calls:
245
- ```python
246
- import logging
247
- logging.basicConfig(level=logging.DEBUG)
248
- ```
249
-
250
- ## Future Enhancements
251
-
252
- ### Planned Features
253
- 1. **Multi-Space Support** - Support for multiple Trackio Spaces
254
- 2. **Advanced Metrics** - Support for custom metric types
255
- 3. **Artifact Upload** - Direct file upload to Spaces
256
- 4. **Real-time Dashboard** - Live monitoring dashboard
257
- 5. **Export Capabilities** - Export experiments to various formats
258
-
259
- ### Extensibility
260
- The new architecture is designed to be easily extensible:
261
- - Modular API client design
262
- - Plugin-based monitoring system
263
- - Configurable Space resolution
264
- - Support for custom endpoints
265
-
266
- ## Conclusion
267
-
268
- The Trackio API integration has been successfully fixed and enhanced with:
269
-
270
- - ✅ **Resolved 404 errors** through proper Gradio client usage
271
- - ✅ **Automatic URL resolution** using Hugging Face Hub API
272
- - ✅ **Comprehensive testing** with full test coverage
273
- - ✅ **Enhanced monitoring** with seamless integration
274
- - ✅ **Future-proof architecture** for easy extensions
275
-
276
- The system is now production-ready and provides reliable experiment tracking for SmolLM3 fine-tuning workflows.
 
docs/TRACKIO_DEPLOYMENT_FIXES.md DELETED
@@ -1,266 +0,0 @@
1
- # Trackio Deployment Fixes
2
-
3
- This document outlines the fixes made to resolve the Trackio Space deployment and dataset creation issues.
4
-
5
- ## Issues Identified
6
-
7
- ### 1. Git Authentication Issues in Space Deployment
8
- - **Problem**: The `deploy_trackio_space.py` script was using git commands for file upload, which failed with authentication errors
9
- - **Solution**: Replaced git commands with direct HF Hub API calls using `upload_file()`
10
-
11
- ### 2. Dataset Repository Creation Issues
12
- - **Problem**: The `setup_hf_dataset.py` script was trying to push to a dataset repository that didn't exist, causing 404 errors
13
- - **Solution**: Added proper repository creation using `create_repo()` before pushing the dataset
14
-
15
- ### 3. Missing Environment Variable Setup
16
- - **Problem**: The Space deployment didn't set up the required `HF_TOKEN` environment variable
17
- - **Solution**: Added automatic secret setting using `add_space_secret()` API method
18
-
19
- ### 4. Manual Username Input Required
20
- - **Problem**: Users had to manually enter their username
21
- - **Solution**: Automatically extract username from token using `whoami()` API method
22
-
23
- ### 5. Dataset Access Testing Issues
24
- - **Problem**: The configuration script failed when testing dataset access for non-existent datasets
25
- - **Solution**: Added proper error handling and repository existence checks
26
-
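The fixes listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of `deploy_trackio_space.py`: the function name `deploy_space_files` and its parameters are hypothetical, and `api` is assumed to be a `huggingface_hub.HfApi` instance (or any object exposing the same `whoami`, `create_repo`, `upload_file`, and `add_space_secret` methods).

```python
from typing import Any, Dict, Optional


def deploy_space_files(api: Any, space_name: str, files: Dict[str, str],
                       hf_token: str, dataset_repo: Optional[str] = None) -> str:
    """Create a Space, upload files, and set secrets without any git commands.

    `files` maps the path inside the Space to a local file path.
    """
    # whoami() returns a dict; its "name" entry is the account name,
    # so the user never has to type it.
    username = api.whoami()["name"]
    repo_id = f"{username}/{space_name}"

    # exist_ok=True keeps the call idempotent if the Space already exists.
    api.create_repo(repo_id=repo_id, repo_type="space",
                    space_sdk="gradio", exist_ok=True)

    # One upload per file, so a failure can be attributed to a single file.
    for path_in_repo, local_path in files.items():
        api.upload_file(path_or_fileobj=local_path, path_in_repo=path_in_repo,
                        repo_id=repo_id, repo_type="space")

    # Secrets are set through the API instead of the Space settings page.
    api.add_space_secret(repo_id=repo_id, key="HF_TOKEN", value=hf_token)
    if dataset_repo:
        api.add_space_secret(repo_id=repo_id, key="TRACKIO_DATASET_REPO",
                             value=dataset_repo)
    return repo_id
```

Passing the client in as an argument keeps the sketch testable offline; in real use one would presumably construct it with `HfApi(token=...)`.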
27
- ## Fixed Scripts
28
-
29
- ### 1. `scripts/trackio_tonic/deploy_trackio_space.py`
30
-
31
- #### Key Changes:
32
- - **Replaced git upload with HF Hub API**: Now uses `upload_file()` directly instead of git commands
33
- - **Automatic secret setting**: Uses `add_space_secret()` API to set HF_TOKEN automatically
34
- - **Username extraction from token**: Uses `whoami()` to get username automatically
35
- - **Removed manual username input**: No longer asks for username
36
- - **Improved error handling**: Better error messages and fallback options
37
-
38
- #### Usage:
39
- ```bash
40
- python scripts/trackio_tonic/deploy_trackio_space.py
41
- ```
42
-
43
- #### What it does:
44
- 1. Extracts username from HF token automatically
45
- 2. Creates a new HF Space using the API
46
- 3. Prepares Space files from templates
47
- 4. Uploads files using HF Hub API (no git required)
48
- 5. **Automatically sets secrets via API** (HF_TOKEN and TRACKIO_DATASET_REPO)
49
- 6. Tests the Space accessibility
50
-
51
- ### 2. `scripts/dataset_tonic/setup_hf_dataset.py`
52
-
53
- #### Key Changes:
54
- - **Added repository creation**: Creates the dataset repository before pushing data
55
- - **Username extraction from token**: Uses `whoami()` to get username automatically
56
- - **Automatic dataset naming**: Uses username in dataset repository name
57
- - **Improved error handling**: Better error messages for common issues
58
- - **Public datasets by default**: Makes datasets public for easier access
59
-
60
- #### Usage:
61
- ```bash
62
- python scripts/dataset_tonic/setup_hf_dataset.py
63
- ```
64
-
65
- #### What it does:
66
- 1. Extracts username from HF token automatically
67
- 2. Creates the dataset repository if it doesn't exist
68
- 3. Creates a dataset with sample experiment data
69
- 4. Uploads README template
70
- 5. Makes the dataset public for easier access
71
-
72
- ### 3. `scripts/trackio_tonic/configure_trackio.py`
73
-
74
- #### Key Changes:
75
- - **Added repository existence check**: Checks if dataset repository exists before trying to load
76
- - **Username extraction from token**: Uses `whoami()` to get username automatically
77
- - **Automatic dataset naming**: Uses username in default dataset repository
78
- - **Better error handling**: Distinguishes between missing repository and permission issues
79
- - **Improved user guidance**: Clear instructions for next steps
80
-
81
- #### Usage:
82
- ```bash
83
- python scripts/trackio_tonic/configure_trackio.py
84
- ```
85
-
86
- #### What it does:
87
- 1. Extracts username from HF token automatically
88
- 2. Validates current configuration
89
- 3. Tests dataset access with proper error handling
90
- 4. Generates configuration file with username
91
- 5. Provides usage examples with actual username
92
-
93
- ## Model Push Script (`scripts/model_tonic/push_to_huggingface.py`)
94
-
95
- The model push script was already using the HF Hub API correctly, so no changes were needed. It properly:
96
- - Creates repositories using `create_repo()`
97
- - Uploads files using `upload_file()`
98
- - Handles authentication correctly
99
-
100
- ## Environment Variables Required
101
-
102
- ### For HF Spaces:
103
- ```bash
104
- HF_TOKEN=your_hf_token_here
105
- TRACKIO_DATASET_REPO=your-username/your-dataset-name
106
- ```
107
-
108
- ### For Local Development:
109
- ```bash
110
- export HF_TOKEN=your_hf_token_here
111
- export TRACKIO_DATASET_REPO=your-username/your-dataset-name
112
- ```
113
-
114
- ## Deployment Workflow
115
-
116
- ### 1. Create Dataset
117
- ```bash
118
- # Set environment variables
119
- export HF_TOKEN=your_token_here
120
- # TRACKIO_DATASET_REPO will be auto-generated as username/trackio-experiments
121
-
122
- # Create the dataset
123
- python scripts/dataset_tonic/setup_hf_dataset.py
124
- ```
125
-
126
- ### 2. Deploy Trackio Space
127
- ```bash
128
- # Deploy the Space (no username needed - extracted from token)
129
- python scripts/trackio_tonic/deploy_trackio_space.py
130
- ```
131
-
132
- ### 3. Secrets are Automatically Set
133
- The script now automatically sets the required secrets via the HF Hub API:
134
- - `HF_TOKEN` - Your Hugging Face token
135
- - `TRACKIO_DATASET_REPO` - Your dataset repository (if specified)
136
-
137
- ### 4. Test Configuration
138
- ```bash
139
- # Test the configuration
140
- python scripts/trackio_tonic/configure_trackio.py
141
- ```
142
-
143
- ## New Features
144
-
145
- ### ✅ **Automatic Secret Setting**
146
- - Uses `add_space_secret()` API method
147
- - Sets `HF_TOKEN` automatically
148
- - Sets `TRACKIO_DATASET_REPO` if specified
149
- - Falls back to manual instructions if API fails
150
-
151
- ### ✅ **Username Extraction from Token**
152
- - Uses `whoami()` API method
153
- - No manual username input required
154
- - Automatically uses username in dataset names
155
- - Provides better user experience
156
-
157
- ### ✅ **Improved User Experience**
158
- - Fewer manual inputs required
159
- - Automatic configuration based on token
160
- - Clear feedback about what's happening
161
- - Better error messages
162
-
163
- ## Troubleshooting
164
-
165
- ### Common Issues:
166
-
167
- 1. **"Repository not found" errors**:
168
- - Run `setup_hf_dataset.py` to create the dataset first
169
- - Check that your HF token has write permissions
170
-
171
- 2. **"Authentication failed" errors**:
172
- - Verify your HF token is valid
173
- - Check token permissions on https://huggingface.co/settings/tokens
174
-
175
- 3. **"Space not accessible" errors**:
176
- - Wait 2-5 minutes for the Space to build
177
- - Check Space logs at the Space URL
178
- - Verify all files were uploaded correctly
179
-
180
- 4. **"Dataset access failed" errors**:
181
- - Ensure the dataset repository exists
182
- - Check that your token has read permissions
183
- - Verify the dataset repository name is correct
184
-
185
- 5. **"Secret setting failed" errors**:
186
- - The script will fall back to manual instructions
187
- - Follow the provided instructions to set secrets manually
188
- - Check that your token has write permissions to the Space
189
-
190
- ### Debugging Steps:
191
-
192
- 1. **Check token permissions**:
193
- ```bash
194
- hf auth whoami
195
- ```
196
-
197
- 2. **Test dataset access**:
198
- ```python
199
- from datasets import load_dataset
200
- dataset = load_dataset("your-username/your-dataset", token="your-token")
201
- ```
202
-
203
- 3. **Test Space deployment**:
204
- ```bash
205
- python scripts/trackio_tonic/deploy_trackio_space.py
206
- ```
207
-
208
- 4. **Test secret setting**:
209
- ```python
210
- from huggingface_hub import HfApi
211
- api = HfApi(token="your-token")
212
- api.add_space_secret("your-username/your-space", "TEST_KEY", "test_value")
213
- ```
214
-
215
- ## Security Considerations
216
-
217
- - **Public datasets**: Datasets are now public by default for easier access, which means anyone can read the logged experiment data
218
- - **Token security**: Never commit tokens to version control
219
- - **Space secrets**: Automatically set via API, with manual fallback
220
- - **Access control**: Verify token permissions before deployment
221
-
222
- ## Performance Improvements
223
-
224
- - **Direct API calls**: Eliminated git dependency for faster uploads
225
- - **Automatic configuration**: No manual username input required
226
- - **Per-file uploads**: Files are uploaded individually, so a failure is isolated to a single file
227
- - **Caching**: HF Hub API handles caching automatically
228
- - **Error recovery**: Better error handling and retry logic
229
-
230
- ## Future Enhancements
231
-
232
- 1. **Batch secret setting**: Set multiple secrets in one API call
233
- 2. **Progress tracking**: Add progress bars for large uploads
234
- 3. **Validation**: Add more comprehensive validation checks
235
- 4. **Rollback**: Add ability to roll back failed deployments
236
- 5. **Hardware configuration**: Automatically configure Space hardware
237
-
238
- ## Testing
239
-
240
- To test the fixes:
241
-
242
- ```bash
243
- # Test dataset creation
244
- python scripts/dataset_tonic/setup_hf_dataset.py
245
-
246
- # Test Space deployment
247
- python scripts/trackio_tonic/deploy_trackio_space.py
248
-
249
- # Test configuration
250
- python scripts/trackio_tonic/configure_trackio.py
251
-
252
- # Test model push (if you have a trained model)
253
- python scripts/model_tonic/push_to_huggingface.py --model-path /path/to/model --repo-name your-username/your-model
254
- ```
255
-
256
- ## Summary
257
-
258
- These fixes resolve the main issues with:
259
- - ✅ Git authentication problems
260
- - ✅ Dataset repository creation failures
261
- - ✅ Missing environment variable setup
262
- - ✅ Manual username input requirement
263
- - ✅ Poor error handling and user feedback
264
- - ✅ Security concerns with public datasets
265
-
266
- The scripts now use the HF Hub API directly, provide better error messages, handle edge cases properly, and offer a much improved user experience with automatic configuration.
docs/TRACKIO_DICT_ACCESS_FIX.md DELETED
@@ -1,144 +0,0 @@
1
- # TrackioConfig Dictionary-Style Access Fix
2
-
3
- ## Problem Description
4
-
5
- The error `'TrackioConfig' object does not support item assignment` occurred because the TRL library was trying to use dictionary-style item assignment on our `TrackioConfig` object (like `config['key'] = value`), but our implementation only supported attribute assignment.
6
-
7
- ## Root Cause
8
-
9
- TRL expects configuration objects to support both attribute-style and dictionary-style access:
10
- - Attribute-style: `config.project_name = "test"`
11
- - Dictionary-style: `config['project_name'] = "test"`
12
-
13
- Our `TrackioConfig` class only implemented attribute-style access, causing TRL to fail when it tried to use dictionary-style assignment.
14
-
15
- ## Solution Implementation
16
-
17
- ### Enhanced TrackioConfig Class
18
-
19
- Modified `src/trackio.py` to add full dictionary-style access support:
20
-
21
- ```python
22
- class TrackioConfig:
23
-     """Configuration class for trackio (TRL compatibility)"""
24
-
25
-     def __init__(self):
26
-         # ... existing initialization ...
27
-
28
-     def update(self, config_dict: Dict[str, Any] = None, **kwargs):
29
-         # ... existing update method ...
30
-
31
-     def __getitem__(self, key: str) -> Any:
32
-         """Dictionary-style access to configuration values"""
33
-         if hasattr(self, key):
34
-             return getattr(self, key)
35
-         else:
36
-             raise KeyError(f"Configuration key '{key}' not found")
37
-
38
-     def __setitem__(self, key: str, value: Any):
39
-         """Dictionary-style assignment to configuration values"""
40
-         setattr(self, key, value)
41
-
42
-     def __contains__(self, key: str) -> bool:
43
-         """Check if configuration key exists"""
44
-         return hasattr(self, key)
45
-
46
-     def get(self, key: str, default: Any = None) -> Any:
47
-         """Get configuration value with default"""
48
-         if hasattr(self, key):
49
-             return getattr(self, key)
50
-         else:
51
-             return default
52
-
53
-     def keys(self):
54
-         """Get all configuration keys"""
55
-         return list(self.__dict__.keys())
56
-
57
-     def items(self):
58
-         """Get all configuration key-value pairs"""
59
-         return list(self.__dict__.items())
60
-
61
-     def __repr__(self):
62
-         """String representation of configuration"""
63
-         attrs = []
64
-         for key, value in self.__dict__.items():
65
-             attrs.append(f"{key}={repr(value)}")
66
-         return f"TrackioConfig({', '.join(attrs)})"
67
- ```
68
-
69
- ### Key Features Added
70
-
71
- #### 1. **Dictionary-Style Access**
72
- - `config['key']` - Get configuration value
73
- - `config['key'] = value` - Set configuration value
74
- - `'key' in config` - Check if key exists
75
-
76
- #### 2. **Dictionary Methods**
77
- - `config.get('key', default)` - Get with default value
78
- - `config.keys()` - Get all configuration keys
79
- - `config.items()` - Get all key-value pairs
80
-
81
- #### 3. **TRL Compatibility**
82
- - Supports TRL's dictionary-style configuration updates
83
- - Handles dynamic key assignment
84
- - Maintains backward compatibility with attribute access
85
-
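As a self-contained illustration of the behaviour listed above, here is a small sketch of a config object that accepts both access styles. `DictAccessConfig` is a hypothetical stand-in, not the project's `TrackioConfig`. It also differs in one deliberate way: lookups go through the instance `__dict__` rather than `hasattr`, so method names such as `keys` are not mistaken for configuration keys.

```python
from typing import Any, List, Tuple


class DictAccessConfig:
    """Hypothetical stand-in for TrackioConfig (not the project's class)."""

    def __init__(self, **defaults: Any) -> None:
        self.__dict__.update(defaults)

    def __getitem__(self, key: str) -> Any:
        # Look only at instance data, so "keys" or "get" are not treated as keys.
        if key in self.__dict__:
            return self.__dict__[key]
        raise KeyError(f"Configuration key '{key}' not found")

    def __setitem__(self, key: str, value: Any) -> None:
        # Item assignment and attribute assignment write to the same storage.
        setattr(self, key, value)

    def __contains__(self, key: str) -> bool:
        return key in self.__dict__

    def get(self, key: str, default: Any = None) -> Any:
        return self.__dict__.get(key, default)

    def keys(self) -> List[str]:
        return list(self.__dict__)

    def items(self) -> List[Tuple[str, Any]]:
        return list(self.__dict__.items())


config = DictAccessConfig(project_name="my_experiment")
config["allow_val_change"] = True      # dictionary-style assignment
print(config.allow_val_change)         # attribute-style read of the same value
print("keys" in config)                # False: "keys" is a method, not a setting
```

With the `hasattr`-based version shown in the diff, `'keys' in config` would be true because `hasattr` also sees methods. Whether that matters depends on how the calling library probes the object.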
86
- ## Testing Verification
87
-
88
- ### Test Results
89
- - ✅ Dictionary-style assignment: `config['project_name'] = 'test'`
90
- - ✅ Dictionary-style access: `config['project_name']`
91
- - ✅ Contains check: `'key' in config`
92
- - ✅ Get method: `config.get('key', default)`
93
- - ✅ Keys and items: `config.keys()`, `config.items()`
94
- - ✅ TRL-style usage: `config['allow_val_change'] = True`
95
-
96
- ### TRL-Specific Usage Patterns
97
- ```python
98
- # TRL-style configuration updates
99
- config['allow_val_change'] = True
100
- config['report_to'] = 'trackio'
101
- config['project_name'] = 'my_experiment'
102
-
103
- # Dictionary-style access
104
- project = config['project_name']
105
- allow_change = config.get('allow_val_change', False)
106
- ```
107
-
108
- ## Integration with Existing Features
109
-
110
- ### Maintains All Existing Functionality
111
- - ✅ Attribute-style access: `config.project_name`
112
- - ✅ Update method: `config.update({'key': 'value'})`
113
- - ✅ Keyword arguments: `config.update(allow_val_change=True)`
114
- - ✅ Dynamic attributes: New attributes added at runtime
115
-
116
- ### Enhanced Compatibility
117
- - ✅ Full TRL dictionary-style interface
118
- - ✅ Backward compatibility with existing code
119
- - ✅ Robust error handling for missing keys
120
- - ✅ Comprehensive dictionary methods
121
-
122
- ## Production Readiness
123
-
124
- ### Status: ✅ PRODUCTION READY
125
-
126
- The enhanced `TrackioConfig` class now provides:
127
- 1. **Complete TRL Compatibility** - Supports all TRL configuration patterns
128
- 2. **Flexible Access** - Both attribute and dictionary-style access
129
- 3. **Robust Error Handling** - Graceful handling of missing keys
130
- 4. **Comprehensive Interface** - Full dictionary-like behavior
131
- 5. **Backward Compatibility** - Existing code continues to work
132
-
133
- ## Conclusion
134
-
135
- The dictionary-style access fix resolves the `'TrackioConfig' object does not support item assignment` error and provides complete compatibility with TRL's configuration expectations.
136
-
137
- **Key Achievements:**
138
- - ✅ Full dictionary-style interface support
139
- - ✅ TRL configuration pattern compatibility
140
- - ✅ Backward compatibility maintained
141
- - ✅ Comprehensive testing verification
142
- - ✅ Production-ready implementation
143
-
144
- **No additional changes are required** for TRL configuration compatibility. The system now handles all known TRL configuration access patterns.
docs/TRACKIO_INTEGRATION.md DELETED
@@ -1,252 +0,0 @@
1
- # Trackio Integration for SmolLM3 Fine-tuning
2
-
3
- This document provides comprehensive information about the Trackio experiment tracking and monitoring integration for your SmolLM3 fine-tuning pipeline.
4
-
5
- ## Features
6
-
7
- - **SmolLM3 Fine-tuning**: Support for supervised fine-tuning and DPO training
8
- - **Trackio Integration**: Complete experiment tracking and monitoring
9
- - **Hugging Face Spaces Deployment**: Easy deployment of Trackio monitoring interface
10
- - **Comprehensive Logging**: Metrics, parameters, artifacts, and system monitoring
11
- - **Flexible Configuration**: Support for various training configurations
12
-
13
- ## Quick Start
14
-
15
- ### 1. Install Dependencies
16
-
17
- ```bash
18
- pip install -r requirements.txt
19
- ```
20
-
21
- ### 2. Basic Training with Trackio
22
-
23
- ```bash
24
- python train.py config/train_smollm3.py \
25
- --dataset_dir my_dataset \
26
- --enable_tracking \
27
- --trackio_url "https://your-trackio-instance.com" \
28
- --experiment_name "smollm3_finetune_v1"
29
- ```
30
-
31
- ### 3. Training with Custom Parameters
32
-
33
- ```bash
34
- python train.py config/train_smollm3.py \
35
- --dataset_dir my_dataset \
36
- --batch_size 8 \
37
- --learning_rate 1e-5 \
38
- --max_iters 2000 \
39
- --enable_tracking \
40
- --trackio_url "https://your-trackio-instance.com" \
41
- --experiment_name "smollm3_high_lr_experiment"
42
- ```
43
-
44
- ## Trackio Integration
45
-
46
- ### Configuration
47
-
48
- Add Trackio settings to your configuration:
49
-
50
- ```python
51
- # In your config file
52
- config = SmolLM3Config(
53
-     # ... other settings ...
54
-
55
-     # Trackio monitoring configuration
56
-     enable_tracking=True,
57
-     trackio_url="https://your-trackio-instance.com",
58
-     trackio_token="your_token_here",  # Optional
59
-     log_artifacts=True,
60
-     log_metrics=True,
61
-     log_config=True,
62
-     experiment_name="my_experiment"
63
- )
64
- ```
65
-
66
- ### Environment Variables
67
-
68
- You can also set Trackio configuration via environment variables:
69
-
70
- ```bash
71
- export TRACKIO_URL="https://your-trackio-instance.com"
72
- export TRACKIO_TOKEN="your_token_here"
73
- ```
74
-
75
- ### What Gets Tracked
76
-
77
- - **Configuration**: All training parameters and model settings
78
- - **Metrics**: Loss, accuracy, learning rate, and custom metrics
79
- - **System Metrics**: GPU memory, CPU usage, training time
80
- - **Artifacts**: Model checkpoints, evaluation results
81
- - **Training Summary**: Final results and experiment duration
82
-
83
- ## Hugging Face Spaces Deployment
84
-
85
- ### Deploy Trackio Monitoring Interface
86
-
87
- 1. **Create a new Space** on Hugging Face:
88
- - Go to https://huggingface.co/spaces
89
- - Click "Create new Space"
90
- - Choose "Gradio" as the SDK
91
- - Set visibility (Public or Private)
92
-
93
- 2. **Upload the deployment files**:
94
- - `app.py` - The Gradio interface
95
- - `requirements_space.txt` - Dependencies
96
- - `README.md` - Documentation
97
-
98
- 3. **Configure the Space**:
99
- - The Space will automatically install dependencies
100
- - The Gradio interface will be available at your Space URL
101
-
102
- ### Using the Trackio Space
103
-
104
- 1. **Create Experiments**: Use the "Create Experiment" tab to start new experiments
105
- 2. **Log Metrics**: Use the "Log Metrics" tab to track training progress
106
- 3. **View Results**: Use the "View Experiments" tab to see experiment details
107
- 4. **Update Status**: Use the "Update Status" tab to mark experiments as completed
108
-
109
- ### Integration with Your Training
110
-
111
- To connect your training script to the Trackio Space:
112
-
113
- ```python
114
- # In your training script
115
- from monitoring import SmolLM3Monitor
116
-
117
- # Initialize monitor
118
- monitor = SmolLM3Monitor(
119
-     experiment_name="my_experiment",
120
-     trackio_url="https://your-space.hf.space",  # Your Space URL
121
-     enable_tracking=True
122
- )
123
-
124
- # Log configuration
125
- monitor.log_config(config_dict)
126
-
127
- # Log metrics during training
128
- monitor.log_metrics({"loss": 0.5, "accuracy": 0.85}, step=100)
129
-
130
- # Log final results
131
- monitor.log_training_summary(final_results)
132
- ```
133
-
134
- ## Configuration Files
135
-
136
- ### Main Configuration (`config/train_smollm3.py`)
137
-
138
- ```python
139
- @dataclass
140
- class SmolLM3Config:
141
-     # Model configuration
142
-     model_name: str = "HuggingFaceTB/SmolLM3-3B"
143
-     max_seq_length: int = 4096
144
-
145
-     # Training configuration
146
-     batch_size: int = 4
147
-     learning_rate: float = 2e-5
148
-     max_iters: int = 1000
149
-
150
-     # Trackio monitoring
151
-     enable_tracking: bool = True
152
-     trackio_url: Optional[str] = None
153
-     trackio_token: Optional[str] = None
154
-     experiment_name: Optional[str] = None
155
- ```
156
-
157
- ### DPO Configuration (`config/train_smollm3_dpo.py`)
158
-
159
- ```python
160
- @dataclass
161
- class SmolLM3DPOConfig(SmolLM3Config):
162
-     # DPO-specific settings
162
-     beta: float = 0.1
163
-     max_prompt_length: int = 2048
164
-
165
-     # Trackio monitoring (inherited)
166
-     enable_tracking: bool = True
167
-     trackio_url: Optional[str] = None
169
- ```
170
-
171
- ## Monitoring Features
172
-
173
- ### Real-time Metrics
174
-
175
- - Training loss and evaluation metrics
176
- - Learning rate scheduling
177
- - GPU memory and utilization
178
- - Training time and progress
179
-
180
- ### Artifact Tracking
181
-
182
- - Model checkpoints at regular intervals
183
- - Evaluation results and plots
184
- - Configuration snapshots
185
- - Training logs and summaries
186
-
187
- ### Experiment Management
188
-
189
- - Experiment naming and organization
190
- - Status tracking (running, completed, failed)
191
- - Parameter comparison across experiments
192
- - Result visualization
193
-
194
- ## Advanced Usage
195
-
196
- ### Custom Metrics
197
-
198
- ```python
199
- # Log custom metrics
200
- monitor.log_metrics({
201
-     "custom_metric": value,
202
-     "perplexity": perplexity_score,
203
-     "bleu_score": bleu_score
204
- }, step=current_step)
205
- ```
206
-
207
- ### System Monitoring
208
-
209
- ```python
210
- # Log system metrics
211
- monitor.log_system_metrics(step=current_step)
212
- ```
213
-
214
- ### Artifact Logging
215
-
216
- ```python
217
- # Log model checkpoint
218
- monitor.log_model_checkpoint("checkpoint-1000", step=1000)
219
-
220
- # Log evaluation results
221
- monitor.log_evaluation_results(eval_results, step=1000)
222
- ```
223
-
224
- ## Troubleshooting
225
-
226
- ### Common Issues
227
-
228
- 1. **Trackio not available**: Install with `pip install trackio`
229
- 2. **Connection errors**: Check your Trackio URL and token
230
- 3. **Missing metrics**: Ensure monitoring is enabled in configuration
231
- 4. **Space deployment issues**: Check Gradio version compatibility
232
-
233
- ### Debug Mode
234
-
235
- Enable debug logging:
236
-
237
- ```python
238
- import logging
239
- logging.basicConfig(level=logging.DEBUG)
240
- ```
241
-
242
- ## Contributing
243
-
244
- 1. Fork the repository
245
- 2. Create a feature branch
246
- 3. Make your changes
247
- 4. Add tests if applicable
248
- 5. Submit a pull request
249
-
250
- ## License
251
-
252
- This project is licensed under the MIT License - see the LICENSE file for details.
docs/TRACKIO_INTEGRATION_VERIFICATION.md DELETED
@@ -1,177 +0,0 @@
1
- # Trackio Integration Verification Report
2
-
3
- ## ✅ Verification Status: PASSED
4
-
5
- All Trackio integration tests have passed successfully. The integration is correctly implemented according to the documentation provided in `TRACKIO_INTEGRATION.md` and `TRACKIO_INTERFACE_GUIDE.md`.
6
-
7
- ## 🔧 Issues Fixed
8
-
9
- ### 1. **Training Arguments Configuration**
10
- - **Issue**: `'bool' object is not callable` error with `report_to` parameter
11
- - **Fix**: Changed `report_to: "none"` to `report_to: None` in `model.py`
12
- - **Impact**: Resolves the original training failure
13
-
14
- ### 2. **Boolean Parameter Type Safety**
15
- - **Issue**: Boolean parameters not properly typed in training arguments
16
- - **Fix**: Added explicit boolean conversion for all boolean parameters:
17
- - `dataloader_pin_memory`
18
- - `group_by_length`
19
- - `prediction_loss_only`
20
- - `ignore_data_skip`
21
- - `remove_unused_columns`
22
- - `ddp_find_unused_parameters`
23
- - `fp16`
24
- - `bf16`
25
- - `load_best_model_at_end`
26
- - `greater_is_better`
27
-
28
- ### 3. **Callback Implementation**
29
- - **Issue**: Callback creation failing when tracking disabled
30
- - **Fix**: Modified `create_monitoring_callback()` to always return a callback
31
- - **Improvement**: Added proper inheritance from `TrainerCallback`
32
-
33
- ### 4. **Method Naming Conflicts**
34
- - **Issue**: Boolean attributes conflicting with method names
35
- - **Fix**: Renamed boolean attributes to avoid conflicts:
36
- - `log_config` → `log_config_enabled`
37
- - `log_metrics` → `log_metrics_enabled`
38
-
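The naming conflict described in item 4 is easy to reproduce in isolation. The classes below are hypothetical and only illustrate the general Python behaviour: an instance attribute assigned in `__init__` hides a method of the same name, and calling it then raises a `TypeError`. The report does not state that this was the exact code path in the project.

```python
class ShadowedMonitor:
    """Hypothetical monitor where a flag and a method share one name."""

    def __init__(self) -> None:
        # The instance attribute hides the method defined below.
        self.log_config = True

    def log_config(self, config: dict) -> None:
        print(f"config: {config}")


class RenamedMonitor:
    """Same idea, with the flag renamed so the method stays reachable."""

    def __init__(self) -> None:
        self.log_config_enabled = True
        self.logged = []

    def log_config(self, config: dict) -> None:
        if self.log_config_enabled:
            self.logged.append(config)


def call_log_config(monitor, config: dict) -> str:
    """Call monitor.log_config and report either 'ok' or the TypeError text."""
    try:
        monitor.log_config(config)
        return "ok"
    except TypeError as exc:
        return str(exc)


print(call_log_config(ShadowedMonitor(), {"lr": 2e-5}))  # 'bool' object is not callable
print(call_log_config(RenamedMonitor(), {"lr": 2e-5}))   # ok
```

Renaming the flag (here to `log_config_enabled`, as in the list above) is the smallest change that removes the clash; a read-only property or a nested settings object would work as well.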
39
- ### 5. **System Compatibility**
40
- - **Issue**: Training arguments test failing on systems without bf16 support
41
- - **Fix**: Added conditional bf16 support detection
42
- - **Improvement**: Added conditional support for `dataloader_prefetch_factor`
43
-
44
- ## 📊 Test Results
45
-
46
- | Test | Status | Description |
47
- |------|--------|-------------|
48
- | Trackio Configuration | ✅ PASS | All required attributes present |
49
- | Monitor Creation | ✅ PASS | Monitor created successfully |
50
- | Callback Creation | ✅ PASS | Callback with all required methods |
51
- | Monitor Methods | ✅ PASS | All logging methods work correctly |
52
- | Training Arguments | ✅ PASS | Arguments created without errors |
53
-
54
- ## 🎯 Key Features Verified
55
-
56
- ### 1. **Configuration Management**
57
- - ✅ Trackio-specific attributes properly defined
58
- - ✅ Environment variable support
59
- - ✅ Default values correctly set
60
- - ✅ Configuration inheritance working
61
-
62
- ### 2. **Monitoring Integration**
63
- - ✅ Monitor creation from config
64
- - ✅ Callback integration with Hugging Face Trainer
65
- - ✅ Real-time metrics logging
66
- - ✅ System metrics collection
67
- - ✅ Artifact tracking
68
- - ✅ Evaluation results logging
69
-
70
- ### 3. **Training Integration**
71
- - ✅ Training arguments properly configured
72
- - ✅ Boolean parameters correctly typed
73
- - ✅ Report_to parameter fixed
74
- - ✅ Callback methods properly implemented
75
- - ✅ Error handling enhanced
76
-
77
- ### 4. **Interface Compatibility**
78
- - ✅ Compatible with Trackio Space deployment
79
- - ✅ Supports all documented features
80
- - ✅ Handles missing Trackio URL gracefully
81
- - ✅ Provides fallback behavior
82
-
83
- ## 🚀 Integration Points
84
-
85
- ### 1. **With Training Script**
86
- ```python
87
- # Automatic integration via config
88
- config = SmolLM3ConfigOpenHermesFRBalanced()
89
- monitor = create_monitor_from_config(config)
90
-
91
- # Callback automatically added to trainer
92
- trainer = Trainer(
93
-     model=model,
94
-     args=training_args,
95
-     callbacks=[monitor.create_monitoring_callback()]
96
- )
97
- ```
98
-
99
- ### 2. **With Trackio Space**
100
- ```python
101
- # Configuration for Trackio Space
102
- config.trackio_url = "https://your-space.hf.space"
103
- config.enable_tracking = True
104
- config.experiment_name = "my_experiment"
105
- ```
106
-
107
- ### 3. **With Hugging Face Trainer**
108
- ```python
109
- # Training arguments properly configured
110
- training_args = model.get_training_arguments(
111
-     output_dir=output_dir,
112
-     report_to=None,  # Fixed
113
-     # ... other parameters
114
- )
115
- ```
116
-
117
- ## 📈 Monitoring Features
118
-
119
- ### Real-time Metrics
120
- - ✅ Training loss and evaluation metrics
121
- - ✅ Learning rate scheduling
122
- - ✅ GPU memory and utilization
123
- - ✅ Training time and progress
124
-
125
- ### Artifact Tracking
126
- - ✅ Model checkpoints at regular intervals
127
- - ✅ Evaluation results and plots
128
- - ✅ Configuration snapshots
129
- - ✅ Training logs and summaries
130
-
131
- ### Experiment Management
132
- - ✅ Experiment naming and organization
133
- - ✅ Status tracking (running, completed, failed)
134
- - ✅ Parameter comparison across experiments
135
- - ✅ Result visualization
136
-
137
- ## 🔍 Error Handling
138
-
139
- ### Graceful Degradation
140
- - ✅ Continues training when Trackio unavailable
141
- - ✅ Handles missing environment variables
142
- - ✅ Provides console logging fallback
143
- - ✅ Maintains functionality without external dependencies
144
-
145
- ### Robust Callbacks
146
- - ✅ Callback methods handle exceptions gracefully
147
- - ✅ Training continues even if monitoring fails
148
- - ✅ Detailed error logging for debugging
149
- - ✅ Fallback to console monitoring
150
-
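The graceful-degradation behaviour listed above can be sketched as a thin wrapper around whatever function sends metrics to the remote tracker. The names `log_with_fallback` and `send` are hypothetical. The sketch only assumes that the remote call may raise and that training should continue regardless.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("monitoring_sketch")


def log_with_fallback(send: Callable[[Dict[str, Any], int], None],
                      metrics: Dict[str, Any], step: int) -> bool:
    """Try the remote logger; on any failure, log locally and keep going.

    Returns True if the remote call succeeded, False if the fallback was used.
    """
    try:
        send(metrics, step)
        return True
    except Exception as exc:  # monitoring must never interrupt training
        logger.warning("Remote logging failed at step %d: %s", step, exc)
        print(f"[step {step}] {metrics}")  # console fallback
        return False


def unreachable_tracker(metrics: Dict[str, Any], step: int) -> None:
    """Simulates a tracker whose endpoint is down."""
    raise ConnectionError("tracker unavailable")


log_with_fallback(unreachable_tracker, {"loss": 0.5}, step=100)
```

Catching the broad `Exception` is a deliberate choice here, since a monitoring failure of any kind is less important than the training run it observes.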
151
- ## 📋 Compliance with Documentation
152
-
153
- ### TRACKIO_INTEGRATION.md Requirements
154
- - ✅ All configuration options implemented
155
- - ✅ Environment variable support
156
- - ✅ Hugging Face Spaces deployment ready
157
- - ✅ Comprehensive logging features
158
- - ✅ Artifact tracking capabilities
159
-
160
- ### TRACKIO_INTERFACE_GUIDE.md Requirements
161
- - ✅ Real-time visualization support
162
- - ✅ Interactive plots and metrics
163
- - ✅ Experiment comparison features
164
- - ✅ Demo data generation
165
- - ✅ Status tracking and updates
166
-
167
- ## 🎉 Conclusion
168
-
169
- The Trackio integration is **fully functional** and **correctly implemented** according to the provided documentation. All major issues have been resolved:
170
-
171
- 1. **Original Error Fixed**: The `'bool' object is not callable` error has been resolved
172
- 2. **Callback Integration**: Trackio callbacks now work correctly with Hugging Face Trainer
173
- 3. **Configuration Management**: All Trackio-specific configuration is properly handled
174
- 4. **Error Handling**: Robust error handling and graceful degradation implemented
175
- 5. **Compatibility**: Works across different systems and configurations
176
-
177
- The integration is ready for production use and will provide comprehensive monitoring for SmolLM3 fine-tuning experiments.
docs/TRACKIO_INTERFACE_GUIDE.md DELETED
@@ -1,222 +0,0 @@
- # Enhanced Trackio Interface Guide
-
- ## Overview
-
- Your Trackio application has been significantly enhanced to provide comprehensive monitoring and visualization for SmolLM3 training experiments. Here's how to make the most of it.
-
- ## 🚀 Key Enhancements
-
- ### 1. **Real-time Visualization**
- - **Interactive Plots**: Loss curves, accuracy, learning rate, GPU metrics
- - **Experiment Comparison**: Compare multiple training runs side-by-side
- - **Live Updates**: Watch training progress in real-time
-
- ### 2. **Comprehensive Data Display**
- - **Formatted Output**: Clean, emoji-rich experiment details
- - **Statistics Overview**: Metrics count, parameters count, artifacts count
- - **Status Tracking**: Visual status indicators (🟢 running, ✅ completed, ❌ failed)
-
- ### 3. **Demo Data Generation**
- - **Realistic Simulation**: Generate realistic training metrics for testing
- - **Multiple Metrics**: Loss, accuracy, learning rate, GPU memory, training time
- - **Configurable Parameters**: Customize demo data to match your setup
-
- ## 📊 How to Use with Your SmolLM3 Training
-
- ### Step 1: Start Your Training
- ```bash
- python run_a100_large_experiment.py \
-     --config config/train_smollm3_openhermes_fr_a100_balanced.py \
-     --trackio_url "https://tonic-test-trackio-test.hf.space" \
-     --experiment-name "petit-elle-l-aime-3-balanced" \
-     --output-dir ./outputs/balanced
- ```
-
- ### Step 2: Monitor in Real-time
- 1. **Visit your Trackio Space**: `https://tonic-test-trackio-test.hf.space`
- 2. **Go to "View Experiments" tab**
- 3. **Enter your experiment ID** (e.g., `exp_20231201_143022`)
- 4. **Click "View Experiment"** to see detailed information
-
- ### Step 3: Visualize Training Progress
- 1. **Go to "📊 Visualizations" tab**
- 2. **Enter your experiment ID**
- 3. **Select a metric** (loss, accuracy, learning_rate, gpu_memory, training_time)
- 4. **Click "Create Plot"** to see interactive charts
-
- ### Step 4: Compare Experiments
- 1. **In the "📊 Visualizations" tab**
- 2. **Enter multiple experiment IDs** (comma-separated)
- 3. **Click "Compare Experiments"** to see side-by-side comparison
-
- ## 🎯 Interface Features
-
- ### Create Experiment Tab
- - **Experiment Name**: Descriptive name for your training run
- - **Description**: Detailed description of what you're training
- - **Automatic ID Generation**: Unique experiment identifier
-
- ### Log Metrics Tab
- - **Experiment ID**: The experiment to log metrics for
- - **Metrics JSON**: Training metrics in JSON format
- - **Step**: Current training step (optional)
-
- Example metrics JSON:
- ```json
- {
-   "loss": 0.5234,
-   "accuracy": 0.8567,
-   "learning_rate": 3.5e-6,
-   "gpu_memory_gb": 22.5,
-   "gpu_utilization_percent": 87.3,
-   "training_time_per_step": 0.456
- }
- ```
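
Editor's note: the metrics field above expects JSON text, so it can help to generate that text from a Python dict rather than typing it by hand. The following is a minimal sketch under that assumption; `build_metrics_payload` is a hypothetical helper name and is not part of the Trackio app or the training scripts in this commit.

```python
import json
import math


def build_metrics_payload(metrics: dict) -> str:
    """Return JSON text for a metrics dict, rejecting non-numeric values.

    Hypothetical helper for illustration only.
    """
    for name, value in metrics.items():
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"metric {name!r} must be numeric")
        # NaN and infinity are not valid JSON numbers.
        if not math.isfinite(value):
            raise ValueError(f"metric {name!r} must be finite")
    return json.dumps(metrics, indent=2)


payload = build_metrics_payload(
    {"loss": 0.5234, "accuracy": 0.8567, "learning_rate": 3.5e-6}
)
print(payload)  # text that could be pasted into the "Metrics JSON" field
```

Validating before serializing keeps malformed values such as strings or NaN out of the payload, which is one plausible cause of the "check data format" issue listed under troubleshooting.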
-
- ### Log Parameters Tab
- - **Experiment ID**: The experiment to log parameters for
- - **Parameters JSON**: Training configuration in JSON format
-
- Example parameters JSON:
- ```json
- {
-   "model_name": "HuggingFaceTB/SmolLM3-3B",
-   "batch_size": 8,
-   "learning_rate": 3.5e-6,
-   "max_iters": 18000,
-   "mixed_precision": "bf16",
-   "no_think_system_message": true
- }
- ```
-
- ### View Experiments Tab
- - **Experiment ID**: Enter to view specific experiment
- - **List All Experiments**: Shows overview of all experiments
- - **Detailed Information**: Formatted display with statistics
-
- ### 📊 Visualizations Tab
- - **Training Metrics**: Interactive plots for individual metrics
- - **Experiment Comparison**: Side-by-side comparison of multiple runs
- - **Real-time Updates**: Plots update as new data is logged
-
- ### 🎯 Demo Data Tab
- - **Generate Demo Data**: Create realistic training data for testing
- - **Configurable**: Adjust parameters to match your setup
- - **Multiple Metrics**: Simulates loss, accuracy, GPU metrics, etc.
-
- ### Update Status Tab
- - **Experiment ID**: The experiment to update
- - **Status**: running, completed, failed, paused
- - **Visual Indicators**: Status shown with emojis
-
- ## 📈 What Gets Displayed
-
- ### Training Metrics
- - **Loss**: Training loss over time
- - **Accuracy**: Model accuracy progression
- - **Learning Rate**: Learning rate scheduling
- - **GPU Memory**: Memory usage in GB
- - **GPU Utilization**: GPU usage percentage
- - **Training Time**: Time per training step
-
- ### Experiment Details
- - **Basic Info**: ID, name, description, status, creation time
- - **Statistics**: Metrics count, parameters count, artifacts count
- - **Parameters**: All training configuration
- - **Latest Metrics**: Most recent training metrics
-
- ### Visualizations
- - **Line Charts**: Smooth curves showing metric progression
- - **Interactive Hover**: Detailed information on hover
- - **Multiple Metrics**: Switch between different metrics
- - **Comparison Charts**: Side-by-side experiment comparison
-
- ## 🔧 Integration with Your Training
-
- ### Automatic Integration
- Your training script automatically:
- 1. **Creates experiments** with your specified name
- 2. **Logs parameters** from your configuration
- 3. **Logs metrics** every 25 steps (configurable)
- 4. **Logs system metrics** (GPU memory, utilization)
- 5. **Logs checkpoints** every 2000 steps
- 6. **Updates status** when training completes
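
Editor's note: the step-based cadence in the list above (metrics every 25 steps, checkpoints every 2000 steps) can be sketched as a small pure function. This is an illustration only; `actions_for_step` and its parameter names are hypothetical and do not come from the training scripts in this commit, and the cadence of system-metric logging is left out because the guide does not state it.

```python
def actions_for_step(step: int, log_every: int = 25, checkpoint_every: int = 2000) -> list:
    """Return the logging actions due at a given training step.

    Hypothetical helper; the defaults mirror the intervals quoted in the guide.
    """
    actions = []
    if step > 0 and step % log_every == 0:
        actions.append("log_metrics")
    if step > 0 and step % checkpoint_every == 0:
        actions.append("log_checkpoint")
    return actions


# Walk through a few steps to see which actions would fire.
for step in (24, 25, 2000):
    print(step, actions_for_step(step))
```

Because 2000 is a multiple of 25, a checkpoint step is also a metrics step under these defaults.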
-
- ### Manual Integration
- You can also manually:
- 1. **Create experiments** through the interface
- 2. **Log custom metrics** for specific analysis
- 3. **Compare different runs** with different parameters
- 4. **Generate demo data** for testing the interface
-
- ## 🎨 Customization
-
- ### Adding Custom Metrics
- ```python
- # In your training script
- custom_metrics = {
-     "loss": current_loss,
-     "accuracy": current_accuracy,
-     "custom_metric": your_custom_value,
-     "gpu_memory": gpu_memory_usage
- }
-
- monitor.log_metrics(custom_metrics, step=current_step)
- ```
-
- ### Custom Visualizations
- The interface supports any metric you log. Just add it to your metrics JSON and it will appear in the visualization dropdown.
-
- ## 🚨 Troubleshooting
-
- ### No Data Displayed
- 1. **Check experiment ID**: Make sure you're using the correct ID
- 2. **Verify metrics were logged**: Check if training is actually logging metrics
- 3. **Use demo data**: Generate demo data to test the interface
-
- ### Plots Not Updating
- 1. **Refresh the page**: Sometimes plots need a refresh
- 2. **Check data format**: Ensure metrics are in the correct JSON format
- 3. **Verify step numbers**: Make sure step numbers are increasing
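
Editor's note: the "step numbers are increasing" check above can be done offline on a list of logged steps. The sketch below assumes you have the step values available as plain integers; `find_step_problems` is a hypothetical helper, not a function of the Trackio app.

```python
def find_step_problems(steps: list) -> list:
    """Return (previous, current) pairs where the step did not increase.

    Hypothetical helper for illustration only.
    """
    problems = []
    for previous, current in zip(steps, steps[1:]):
        # A repeated or decreasing step breaks a strictly increasing sequence.
        if current <= previous:
            problems.append((previous, current))
    return problems


print(find_step_problems([25, 50, 75]))      # no problems expected
print(find_step_problems([25, 50, 50, 40]))  # repeated and decreasing steps
```

An empty result means every step is strictly greater than the one before it.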
-
- ### Interface Not Loading
- 1. **Check dependencies**: Ensure plotly and pandas are installed
- 2. **Check Gradio version**: Use Gradio 4.0.0 or higher
- 3. **Check browser console**: Look for JavaScript errors
-
- ## 📊 Example Workflow
-
- 1. **Start Training**:
-    ```bash
-    python run_a100_large_experiment.py --experiment-name "my_experiment"
-    ```
-
- 2. **Monitor Progress**:
-    - Visit your Trackio Space
-    - Go to "View Experiments"
-    - Enter your experiment ID
-    - Watch real-time updates
-
- 3. **Visualize Results**:
-    - Go to "📊 Visualizations"
-    - Select "loss" metric
-    - Create plot to see training progress
-
- 4. **Compare Runs**:
-    - Run multiple experiments with different parameters
-    - Use "Compare Experiments" to see differences
-
- 5. **Generate Demo Data**:
-    - Use "🎯 Demo Data" tab to test the interface
-    - Generate realistic training data for demonstration
-
- ## 🎉 Success Indicators
-
- Your interface is working correctly when you see:
- - ✅ **Formatted experiment details** with emojis and structure
- - ✅ **Interactive plots** that respond to your inputs
- - ✅ **Real-time metric updates** during training
- - ✅ **Clean experiment overview** with statistics
- - ✅ **Smooth visualization** with hover information
-
- The enhanced interface will now display much more meaningful information and provide a comprehensive monitoring experience for your SmolLM3 training experiments!