Tonic commited on
Commit
ebe598e
·
verified ·
1 Parent(s): 96fd5b3

adds formatting fix

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. .gitignore +14 -11
  2. FORMATTING_FIX_SUMMARY.md +15 -8
  3. H100_LIGHTWEIGHT_GUIDE.md +276 -0
  4. INTERACTIVE_PIPELINE_IMPROVEMENTS.md +330 -0
  5. PIPELINE_SUMMARY.md +330 -0
  6. README.md +1 -1
  7. README_END_TO_END.md +304 -0
  8. cloud_deployment.sh +0 -279
  9. config/train_smollm3.py +4 -0
  10. config/train_smollm3_h100_lightweight.py +112 -0
  11. config/train_smollm3_openhermes_fr.py +4 -0
  12. config/train_smollm3_openhermes_fr_a100_balanced.py +4 -0
  13. config/train_smollm3_openhermes_fr_a100_large.py +4 -0
  14. config/train_smollm3_openhermes_fr_a100_max_performance.py +4 -0
  15. config/train_smollm3_openhermes_fr_a100_multiple_passes.py +4 -0
  16. A100_LARGE_SCALE_GUIDE.md → docs/A100_LARGE_SCALE_GUIDE.md +0 -0
  17. docs/APP_CONFIGURATION_GUIDE.md +234 -0
  18. CLOUD_DEPLOYMENT_GUIDE.md → docs/CLOUD_DEPLOYMENT_GUIDE.md +0 -0
  19. CLOUD_TRAINING_GUIDE.md → docs/CLOUD_TRAINING_GUIDE.md +0 -0
  20. DEPLOYMENT_GUIDE.md → docs/DEPLOYMENT_GUIDE.md +0 -0
  21. docs/ENVIRONMENT_VARIABLES.md +113 -0
  22. docs/HF_DATASETS_GUIDE.md +269 -0
  23. docs/HF_SPACES_GUIDE.md +163 -0
  24. docs/MONITORING_IMPROVEMENTS_SUMMARY.md +191 -0
  25. docs/MONITORING_INTEGRATION_GUIDE.md +245 -0
  26. NO_THINK_TAG_GUIDE.md → docs/NO_THINK_TAG_GUIDE.md +0 -0
  27. PUSH_GUIDE.md → docs/PUSH_GUIDE.md +0 -0
  28. docs/PUSH_SCRIPT_GUIDE.md +267 -0
  29. TRACKIO_INTEGRATION.md → docs/TRACKIO_INTEGRATION.md +0 -0
  30. TRACKIO_INTEGRATION_VERIFICATION.md → docs/TRACKIO_INTEGRATION_VERIFICATION.md +0 -0
  31. TRACKIO_INTERFACE_GUIDE.md → docs/TRACKIO_INTERFACE_GUIDE.md +0 -0
  32. launch.sh +690 -0
  33. requirements.txt → requirements/requirements.txt +0 -0
  34. requirements_core.txt → requirements/requirements_core.txt +7 -1
  35. requirements_minimal.txt → requirements/requirements_minimal.txt +0 -0
  36. add_demo_data.py → scripts/dataset_tonic/add_demo_data.py +0 -0
  37. scripts/dataset_tonic/setup_hf_dataset.py +275 -0
  38. push_to_huggingface.py → scripts/model_tonic/push_to_huggingface.py +56 -13
  39. scripts/trackio_tonic/configure_trackio.py +145 -0
  40. deploy_trackio_space.py → scripts/trackio_tonic/deploy_trackio_space.py +1 -1
  41. scripts/trackio_tonic/trackio_api_client.py +286 -0
  42. run_a100_large_experiment.py → scripts/training/train.py +0 -0
  43. setup_launch.py +283 -0
  44. config.py → src/config.py +0 -0
  45. data.py → src/data.py +0 -0
  46. model.py → src/model.py +0 -0
  47. monitoring.py → src/monitoring.py +157 -58
  48. train.py → src/train.py +76 -0
  49. trainer.py → src/trainer.py +0 -0
  50. templates/datasets/readme.md +0 -0
.gitignore CHANGED
@@ -1,3 +1,6 @@
 
 
 
1
  # Python
2
  __pycache__/
3
  *.py[cod]
@@ -59,17 +62,17 @@ Thumbs.db
59
  logs/
60
  tensorboard_logs/
61
 
62
- # Model outputs
63
- output/
64
- checkpoints/
65
- models/
66
- wandb/
67
 
68
  # Datasets
69
- data/
70
- datasets/
71
- my_dataset/
72
- test_dataset/
73
 
74
  # Temporary files
75
  tmp/
@@ -86,9 +89,9 @@ accelerate_config.yaml
86
 
87
  # Training outputs
88
  runs/
89
- *.json
90
  !config/*.json
91
- !*.json.example
92
 
93
  # Evaluation results
94
  eval_results/
 
1
+ .cursorrules/
2
+ *.mdc
3
+
4
  # Python
5
  __pycache__/
6
  *.py[cod]
 
62
  logs/
63
  tensorboard_logs/
64
 
65
+ # # Model outputs
66
+ # output/
67
+ # checkpoints/
68
+ # models/
69
+ # wandb/
70
 
71
  # Datasets
72
+ # data/
73
+ # datasets/
74
+ # my_dataset/
75
+ # test_dataset/
76
 
77
  # Temporary files
78
  tmp/
 
89
 
90
  # Training outputs
91
  runs/
92
+ #*.json
93
  !config/*.json
94
+ #!*.json.example
95
 
96
  # Evaluation results
97
  eval_results/
FORMATTING_FIX_SUMMARY.md CHANGED
@@ -19,10 +19,10 @@ I fixed the issue by standardizing all logging statements to use traditional str
19
 
20
  ### Files Fixed
21
 
22
- 1. **`monitoring.py`** - Fixed all logging statements
23
- 2. **`trainer.py`** - Fixed all logging statements
24
- 3. **`model.py`** - Fixed all logging statements
25
- 4. **`data.py`** - Fixed all logging statements
26
 
27
  ### Changes Made
28
 
@@ -52,6 +52,7 @@ This script tests:
52
  - ✅ Logging functionality
53
  - ✅ Module imports
54
  - ✅ Configuration loading
 
55
  - ✅ Error handling
56
 
57
  ## 🚀 Usage
@@ -68,25 +69,29 @@ python run_a100_large_experiment.py \
68
 
69
  ## 📋 Key Changes
70
 
71
- ### 1. Monitoring Module (`monitoring.py`)
72
  - Fixed all `logger.info()`, `logger.error()`, `logger.warning()` calls
73
  - Replaced f-strings with `%` formatting
74
  - Fixed string concatenation in file paths
 
75
 
76
- ### 2. Trainer Module (`trainer.py`)
77
  - Fixed logging in `SmolLM3Trainer` class
78
  - Fixed console output formatting
79
  - Fixed error message formatting
 
80
 
81
- ### 3. Model Module (`model.py`)
82
  - Fixed model loading logging
83
  - Fixed configuration logging
84
  - Fixed error reporting
 
85
 
86
- ### 4. Data Module (`data.py`)
87
  - Fixed dataset loading logging
88
  - Fixed processing progress logging
89
  - Fixed error handling
 
90
 
91
  ## 🔧 Technical Details
92
 
@@ -119,6 +124,7 @@ To verify the fix works:
119
  - ✅ Logging tests
120
  - ✅ Import tests
121
  - ✅ Configuration tests
 
122
 
123
  3. **Run your training command**:
124
  ```bash
@@ -131,6 +137,7 @@ To verify the fix works:
131
  - No changes to the training logic or configuration
132
  - All error messages and logging remain informative
133
  - The fix is backward compatible
 
134
 
135
  ## 🚨 Prevention
136
 
 
19
 
20
  ### Files Fixed
21
 
22
+ 1. **`src/monitoring.py`** - Fixed all logging statements
23
+ 2. **`src/trainer.py`** - Fixed all logging statements
24
+ 3. **`src/model.py`** - Fixed all logging statements
25
+ 4. **`src/data.py`** - Fixed all logging statements
26
 
27
  ### Changes Made
28
 
 
52
  - ✅ Logging functionality
53
  - ✅ Module imports
54
  - ✅ Configuration loading
55
+ - ✅ Monitoring creation
56
  - ✅ Error handling
57
 
58
  ## 🚀 Usage
 
69
 
70
  ## 📋 Key Changes
71
 
72
+ ### 1. Monitoring Module (`src/monitoring.py`)
73
  - Fixed all `logger.info()`, `logger.error()`, `logger.warning()` calls
74
  - Replaced f-strings with `%` formatting
75
  - Fixed string concatenation in file paths
76
+ - Fixed HF Datasets integration logging
77
 
78
+ ### 2. Trainer Module (`src/trainer.py`)
79
  - Fixed logging in `SmolLM3Trainer` class
80
  - Fixed console output formatting
81
  - Fixed error message formatting
82
+ - Fixed callback logging
83
 
84
+ ### 3. Model Module (`src/model.py`)
85
  - Fixed model loading logging
86
  - Fixed configuration logging
87
  - Fixed error reporting
88
+ - Fixed parameter logging
89
 
90
+ ### 4. Data Module (`src/data.py`)
91
  - Fixed dataset loading logging
92
  - Fixed processing progress logging
93
  - Fixed error handling
94
+ - Fixed split processing logging
95
 
96
  ## 🔧 Technical Details
97
 
 
124
  - ✅ Logging tests
125
  - ✅ Import tests
126
  - ✅ Configuration tests
127
+ - ✅ Monitoring creation tests
128
 
129
  3. **Run your training command**:
130
  ```bash
 
137
  - No changes to the training logic or configuration
138
  - All error messages and logging remain informative
139
  - The fix is backward compatible
140
+ - HF Datasets integration is preserved
141
 
142
  ## 🚨 Prevention
143
 
H100_LIGHTWEIGHT_GUIDE.md ADDED
@@ -0,0 +1,276 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # H100 Lightweight Training Configuration Guide
2
+
3
+ This guide explains the new **H100 Lightweight (Rapid)** training configuration, optimized for rapid fine-tuning on H100 GPUs with a small, carefully selected dataset.
4
+
5
+ ## 🎯 Overview
6
+
7
+ The H100 Lightweight configuration is designed for:
8
+ - **Rapid experimentation** on H100 GPUs
9
+ - **Efficient training** with 80K carefully selected samples
10
+ - **Quick iteration** for research and development
11
+ - **Cost-effective** training sessions
12
+
13
+ ## 🚀 Key Features
14
+
15
+ ### **Optimized for H100**
16
+ - **Batch Size**: 16 (larger than A100 configs)
17
+ - **Gradient Accumulation**: 4 (reduced for faster updates)
18
+ - **Learning Rate**: 8e-6 (slightly higher for rapid convergence)
19
+ - **Sequence Length**: 8192 (full context window)
20
+
21
+ ### **Dataset Sampling**
22
+ - **Source**: OpenHermes-FR dataset
23
+ - **Sample Size**: 80,000 random samples
24
+ - **Validation**: 1,000 samples (if available)
25
+ - **Reproducibility**: Fixed random seed (42)
26
+
27
+ ### **Training Optimizations**
28
+ - **Warmup Steps**: 50 (reduced for rapid training)
29
+ - **Evaluation**: Every 50 steps
30
+ - **Logging**: Every 5 steps
31
+ - **Saving**: Every 200 steps
32
+ - **Checkpoints**: Keep only 2 (save storage)
33
+
34
+ ## 📊 Configuration Details
35
+
36
+ ### **Model Configuration**
37
+ ```python
38
+ model_name="HuggingFaceTB/SmolLM3-3B"
39
+ max_seq_length=8192
40
+ use_flash_attention=True
41
+ use_gradient_checkpointing=True
42
+ ```
43
+
44
+ ### **Training Parameters**
45
+ ```python
46
+ batch_size=16
47
+ gradient_accumulation_steps=4
48
+ learning_rate=8e-6
49
+ warmup_steps=50
50
+ max_epochs=1
51
+ ```
52
+
53
+ ### **H100-Specific Optimizations**
54
+ ```python
55
+ dataloader_num_workers=4
56
+ dataloader_pin_memory=True
57
+ gradient_clipping=1.0
58
+ group_by_length=True
59
+ pad_to_multiple_of=8
60
+ ```
61
+
62
+ ### **Memory Optimizations**
63
+ ```python
64
+ save_total_limit=2
65
+ early_stopping_patience=3
66
+ max_grad_norm=1.0
67
+ warmup_ratio=0.1
68
+ ```
69
+
70
+ ## 🔧 Usage
71
+
72
+ ### **Interactive Selection**
73
+ ```bash
74
+ ./launch.sh
75
+ # Select "H100 Lightweight (Rapid)" when prompted
76
+ ```
77
+
78
+ ### **Expected Training Time**
79
+ - **H100**: ~2-4 hours (depending on hardware)
80
+ - **A100**: ~4-6 hours
81
+ - **V100**: ~6-8 hours
82
+
83
+ ### **Memory Requirements**
84
+ - **GPU Memory**: 40GB+ (H100 recommended)
85
+ - **System RAM**: 32GB+
86
+ - **Storage**: 50GB+ for dataset and checkpoints
87
+
88
+ ## 📈 Performance Characteristics
89
+
90
+ ### **Training Speed**
91
+ - **Steps per Second**: ~2-3 (on H100)
92
+ - **Samples per Second**: ~32-48
93
+ - **Effective Batch Size**: 64 (16 × 4)
94
+
95
+ ### **Convergence**
96
+ - **Expected Loss**: 1.2-1.8 (after 1 epoch)
97
+ - **Evaluation Frequency**: Every 50 steps
98
+ - **Early Stopping**: After 3 evaluations without improvement
99
+
100
+ ### **Dataset Efficiency**
101
+ - **80K samples**: ~1.3% of full OpenHermes-FR
102
+ - **Random sampling**: Ensures diversity
103
+ - **Fixed seed**: Reproducible results
104
+
105
+ ## 🎯 Use Cases
106
+
107
+ ### **Perfect For**
108
+ - **Rapid prototyping** of new ideas
109
+ - **Hyperparameter tuning** experiments
110
+ - **Model comparison** studies
111
+ - **Research validation** before full training
112
+ - **Educational purposes** and learning
113
+
114
+ ### **Not Recommended For**
115
+ - **Production models** (use Multiple Passes instead)
116
+ - **Competition submissions** (use full dataset)
117
+ - **Research papers** (use complete training)
118
+
119
+ ## 🔄 Comparison with Other Configurations
120
+
121
+ | Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
122
+ |---------------|--------------|------------|--------|---------------|----------|
123
+ | **Basic Training** | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
124
+ | **H100 Lightweight** | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
125
+ | **A100 Large Scale** | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
126
+ | **Multiple Passes** | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |
127
+
128
+ ## 🛠️ Customization
129
+
130
+ ### **Modifying Sample Size**
131
+ ```bash
132
+ # In the launch script, you can modify:
133
+ DATASET_SAMPLE_SIZE=50000 # For 50K samples
134
+ DATASET_SAMPLE_SIZE=100000 # For 100K samples
135
+ ```
136
+
137
+ ### **Adjusting Training Parameters**
138
+ ```bash
139
+ # Modify in config/train_smollm3_h100_lightweight.py:
140
+ batch_size=12 # Smaller batch size
141
+ learning_rate=6e-6 # Lower learning rate
142
+ warmup_steps=100 # More warmup steps
143
+ ```
144
+
145
+ ### **Changing Dataset**
146
+ ```bash
147
+ # Modify the dataset name in the configuration:
148
+ dataset_name="your-custom-dataset"
149
+ ```
150
+
151
+ ## 📊 Monitoring and Results
152
+
153
+ ### **Trackio Integration**
154
+ - **Real-time metrics**: Loss, learning rate, gradient norm
155
+ - **Training curves**: Visual progress tracking
156
+ - **Resource usage**: GPU utilization, memory consumption
157
+ - **Artifacts**: Model checkpoints, logs
158
+
159
+ ### **Expected Metrics**
160
+ - **Training Loss**: Starts ~3.0, ends ~1.5
161
+ - **Validation Loss**: Should be close to training loss
162
+ - **Learning Rate**: Cosine decay from 8e-6 to 2e-6
163
+ - **Gradient Norm**: Should stay below 1.0
164
+
165
+ ### **Success Indicators**
166
+ - **Converging loss**: Steady decrease over time
167
+ - **Stable gradients**: Consistent gradient norms
168
+ - **Good validation**: Validation loss follows training loss
169
+ - **No overfitting**: Validation loss doesn't increase
170
+
171
+ ## 🚨 Troubleshooting
172
+
173
+ ### **Common Issues**
174
+
175
+ #### **Out of Memory (OOM)**
176
+ ```bash
177
+ # Reduce batch size in config:
178
+ batch_size=12 # Instead of 16
179
+ gradient_accumulation_steps=6 # Instead of 4
180
+ ```
181
+
182
+ #### **Slow Training**
183
+ ```bash
184
+ # Check GPU utilization:
185
+ nvidia-smi
186
+ # Ensure CUDA is properly installed
187
+ python -c "import torch; print(torch.cuda.is_available())"
188
+ ```
189
+
190
+ #### **Poor Convergence**
191
+ ```bash
192
+ # Try different learning rate:
193
+ learning_rate=6e-6 # Instead of 8e-6
194
+ # Or increase warmup:
195
+ warmup_steps=100 # Instead of 50
196
+ ```
197
+
198
+ #### **Dataset Issues**
199
+ ```bash
200
+ # Check dataset loading:
201
+ python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"
202
+ ```
203
+
204
+ ### **Performance Tips**
205
+
206
+ 1. **Use H100 if available**: Significantly faster than A100
207
+ 2. **Monitor GPU memory**: Keep utilization below 90%
208
+ 3. **Check logs regularly**: Look for convergence issues
209
+ 4. **Save checkpoints**: Don't lose progress
210
+ 5. **Use early stopping**: Prevent overfitting
211
+
212
+ ## 📋 Example Workflow
213
+
214
+ ### **Complete H100 Lightweight Training**
215
+ ```bash
216
+ # 1. Setup
217
+ python setup_launch.py
218
+
219
+ # 2. Check requirements
220
+ python check_requirements.py
221
+
222
+ # 3. Run interactive pipeline
223
+ ./launch.sh
224
+
225
+ # 4. Select configuration
226
+ # Choose: "H100 Lightweight (Rapid)"
227
+
228
+ # 5. Monitor training
229
+ # Watch Trackio Space for real-time progress
230
+
231
+ # 6. Check results
232
+ # Model will be pushed to HF Hub
233
+ # Summary in training_summary.md
234
+ ```
235
+
236
+ ### **Expected Output**
237
+ ```
238
+ ✅ Dataset prepared: 80000 train samples, 1000 validation samples
239
+ 📈 Training started with 5000 total steps
240
+ ⏱️ Estimated time: 2-4 hours
241
+ 📊 Monitor progress at: https://huggingface.co/spaces/...
242
+ ```
243
+
244
+ ## 🎉 Benefits
245
+
246
+ ### **Speed**
247
+ - **3-4x faster** than full dataset training
248
+ - **Rapid iteration** for research
249
+ - **Quick validation** of ideas
250
+
251
+ ### **Efficiency**
252
+ - **Reduced costs** (less GPU time)
253
+ - **Lower storage** requirements
254
+ - **Faster experimentation** cycle
255
+
256
+ ### **Quality**
257
+ - **Still high quality** results
258
+ - **Good for prototyping**
259
+ - **Suitable for many use cases**
260
+
261
+ ## 🔮 Future Enhancements
262
+
263
+ ### **Planned Improvements**
264
+ - **Adaptive sampling**: Smart dataset selection
265
+ - **Multi-GPU support**: Distributed training
266
+ - **Advanced monitoring**: More detailed metrics
267
+ - **Auto-tuning**: Automatic hyperparameter optimization
268
+
269
+ ### **Extensibility**
270
+ - **Custom datasets**: Easy integration
271
+ - **Different models**: Support for other architectures
272
+ - **Advanced sampling**: Stratified, balanced sampling
273
+
274
+ ---
275
+
276
+ **Happy Rapid Training on H100! 🚀**
INTERACTIVE_PIPELINE_IMPROVEMENTS.md ADDED
@@ -0,0 +1,330 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Interactive Pipeline Improvements
2
+
3
+ This document explains the improvements made to the `launch.sh` script to make it interactive and configurable for different training scenarios.
4
+
5
+ ## 🎯 Key Improvements
6
+
7
+ ### 1. **Interactive User Interface**
8
+ - **Colored Output**: Added color-coded status messages for better UX
9
+ - **Input Validation**: Real-time validation of user inputs
10
+ - **Default Values**: Smart defaults for common configurations
11
+ - **Error Handling**: Graceful error handling with helpful messages
12
+
13
+ ### 2. **Training Configuration Selection**
14
+ The script now offers 4 predefined training configurations:
15
+
16
+ #### **Basic Training (Default)**
17
+ ```bash
18
+ Model: SmolLM3-3B
19
+ Dataset: SmolTalk
20
+ Epochs: 3
21
+ Batch Size: 2
22
+ Learning Rate: 5e-6
23
+ Sequence Length: 4096
24
+ Best for: Quick experiments, learning
25
+ ```
26
+
27
+ #### **H100 Lightweight (Rapid)**
28
+ ```bash
29
+ Model: SmolLM3-3B
30
+ Dataset: OpenHermes-FR (80K samples)
31
+ Epochs: 1
32
+ Batch Size: 16
33
+ Learning Rate: 8e-6
34
+ Sequence Length: 8192
35
+ Best for: Rapid training on H100
36
+ ```
37
+
38
+ #### **A100 Large Scale**
39
+ ```bash
40
+ Model: SmolLM3-3B
41
+ Dataset: OpenHermes-FR
42
+ Epochs: 1.3 passes
43
+ Batch Size: 8
44
+ Learning Rate: 5e-6
45
+ Sequence Length: 8192
46
+ Best for: High-performance training
47
+ ```
48
+
49
+ #### **Multiple Passes**
50
+ ```bash
51
+ Model: SmolLM3-3B
52
+ Dataset: OpenHermes-FR
53
+ Epochs: 4 passes
54
+ Batch Size: 6
55
+ Learning Rate: 3e-6
56
+ Sequence Length: 8192
57
+ Best for: Thorough training
58
+ ```
59
+
60
+ #### **Custom Configuration**
61
+ - User-defined parameters
62
+ - Flexible model and dataset selection
63
+ - Custom training parameters
64
+
65
+ ### 3. **Enhanced User Experience**
66
+
67
+ #### **Step-by-Step Guidance**
68
+ 1. **Authentication** - HF username and token validation
69
+ 2. **Configuration Selection** - Choose from predefined configs
70
+ 3. **Experiment Setup** - Configure experiment details
71
+ 4. **Training Parameters** - Adjust hyperparameters
72
+ 5. **Deployment Setup** - Trackio Space configuration
73
+ 6. **Confirmation** - Review and confirm settings
74
+
75
+ #### **Input Functions**
76
+ ```bash
77
+ # Get input with default value
78
+ get_input "Prompt" "default_value" VARIABLE_NAME
79
+
80
+ # Select from options
81
+ select_option "Choose option:" "Option 1" "Option 2" "Option 3" VARIABLE_NAME
82
+
83
+ # Validate HF token
84
+ validate_hf_token "$HF_TOKEN"
85
+ ```
86
+
87
+ #### **Colored Output Functions**
88
+ ```bash
89
+ print_status "Success message" # Green ✅
90
+ print_warning "Warning message" # Yellow ⚠️
91
+ print_error "Error message" # Red ❌
92
+ print_info "Info message" # Blue ℹ️
93
+ print_header "Header message" # Purple 🚀
94
+ print_step "Step message" # Cyan 📋
95
+ ```
96
+
97
+ ### 4. **Dynamic Configuration Generation**
98
+
99
+ The script now generates training configurations based on user selection:
100
+
101
+ ```python
102
+ # Generated config file
103
+ config = SmolLM3Config(
104
+ model_name="$MODEL_NAME",
105
+ max_seq_length=$MAX_SEQ_LENGTH,
106
+ batch_size=$BATCH_SIZE,
107
+ learning_rate=$LEARNING_RATE,
108
+ # ... other parameters
109
+ )
110
+ ```
111
+
112
+ ### 5. **Improved Error Handling**
113
+
114
+ #### **Input Validation**
115
+ - Required field validation
116
+ - HF token validation
117
+ - Numeric input validation
118
+ - Choice validation
119
+
120
+ #### **Graceful Degradation**
121
+ - Clear error messages
122
+ - Recovery suggestions
123
+ - Exit on critical errors
124
+
125
+ ### 6. **Configuration Management**
126
+
127
+ #### **User Credentials**
128
+ - Interactive username input
129
+ - Secure token input
130
+ - Real-time token validation
131
+
132
+ #### **Experiment Details**
133
+ - Dynamic experiment naming
134
+ - Repository name generation
135
+ - Dataset repository configuration
136
+
137
+ #### **Training Parameters**
138
+ - Batch size selection
139
+ - Learning rate adjustment
140
+ - Sequence length configuration
141
+ - Save/eval/logging steps
142
+
143
+ ### 7. **Enhanced Monitoring Integration**
144
+
145
+ #### **Trackio Space**
146
+ - Dynamic space naming
147
+ - Automatic deployment
148
+ - URL generation
149
+
150
+ #### **HF Datasets**
151
+ - Dataset repository setup
152
+ - Experiment data storage
153
+ - Access configuration
154
+
155
+ ## 🔧 Technical Improvements
156
+
157
+ ### 1. **Modular Functions**
158
+ ```bash
159
+ # Input handling
160
+ get_input() # Get user input with defaults
161
+ select_option() # Select from options
162
+ validate_hf_token() # Validate HF token
163
+
164
+ # Configuration
165
+ show_training_configs() # Display available configs
166
+ get_training_config() # Get config based on selection
167
+ create_training_config() # Generate config file
168
+
169
+ # Output formatting
170
+ print_status() # Success messages
171
+ print_warning() # Warning messages
172
+ print_error() # Error messages
173
+ print_info() # Info messages
174
+ print_header() # Header messages
175
+ print_step() # Step messages
176
+ ```
177
+
178
+ ### 2. **Configuration Selection Logic**
179
+ ```bash
180
+ case "$config_type" in
181
+ "Basic Training")
182
+ MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
183
+ DATASET_NAME="HuggingFaceTB/smoltalk"
184
+ # ... other parameters
185
+ ;;
186
+ "A100 Large Scale")
187
+ MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
188
+ DATASET_NAME="legmlai/openhermes-fr"
189
+ # ... other parameters
190
+ ;;
191
+ # ... other configurations
192
+ esac
193
+ ```
194
+
195
+ ### 3. **Dynamic File Generation**
196
+ ```bash
197
+ # Generate training config
198
+ create_training_config "$CONFIG_FILE"
199
+
200
+ # Generate deployment input
201
+ cat > deploy_input.txt << EOF
202
+ $HF_USERNAME
203
+ $TRACKIO_SPACE_NAME
204
+ $HF_TOKEN
205
+ EOF
206
+ ```
207
+
208
+ ## 📊 User Workflow
209
+
210
+ ### **Before (Static)**
211
+ 1. Edit `launch.sh` manually
212
+ 2. Update hardcoded variables
213
+ 3. Run script
214
+ 4. Hope configuration is correct
215
+
216
+ ### **After (Interactive)**
217
+ 1. Run `./launch.sh`
218
+ 2. Follow interactive prompts
219
+ 3. Select training configuration
220
+ 4. Confirm settings
221
+ 5. Watch automated pipeline
222
+
223
+ ## 🎯 Benefits
224
+
225
+ ### **For Users**
226
+ - **No Manual Editing**: No need to edit script files
227
+ - **Guided Experience**: Step-by-step prompts
228
+ - **Validation**: Real-time input validation
229
+ - **Flexibility**: Multiple configuration options
230
+ - **Safety**: Confirmation before execution
231
+
232
+ ### **For Developers**
233
+ - **Maintainable**: Modular function structure
234
+ - **Extensible**: Easy to add new configurations
235
+ - **Robust**: Comprehensive error handling
236
+ - **User-Friendly**: Clear feedback and guidance
237
+
238
+ ### **For Different Use Cases**
239
+ - **Beginners**: Basic Training configuration
240
+ - **H100 Users**: H100 Lightweight for rapid experiments
241
+ - **Researchers**: A100 Large Scale for serious experiments
242
+ - **Production**: Multiple Passes for thorough training
243
+ - **Custom**: User-defined parameters for specific needs
244
+
245
+ ## 🔄 Configuration Examples
246
+
247
+ ### **Quick Start (Basic Training)**
248
+ ```bash
249
+ ./launch.sh
250
+ # Follow prompts:
251
+ # 1. Enter HF username and token
252
+ # 2. Select "Basic Training"
253
+ # 3. Confirm settings
254
+ # 4. Watch automated pipeline
255
+ ```
256
+
257
+ ### **High-Performance Training (A100)**
258
+ ```bash
259
+ ./launch.sh
260
+ # Follow prompts:
261
+ # 1. Enter HF username and token
262
+ # 2. Select "A100 Large Scale"
263
+ # 3. Adjust parameters if needed
264
+ # 4. Confirm and run
265
+ ```
266
+
267
+ ### **Rapid Training (H100)**
268
+ ```bash
269
+ ./launch.sh
270
+ # Follow prompts:
271
+ # 1. Enter HF username and token
272
+ # 2. Select "H100 Lightweight (Rapid)"
273
+ # 3. Confirm settings
274
+ # 4. Watch rapid training on H100
275
+ ```
276
+
277
+ ### **Custom Training**
278
+ ```bash
279
+ ./launch.sh
280
+ # Follow prompts:
281
+ # 1. Enter HF username and token
282
+ # 2. Select "Custom Configuration"
283
+ # 3. Enter custom parameters:
284
+ # - Model: microsoft/DialoGPT-medium
285
+ # - Dataset: your-custom-dataset
286
+ # - Epochs: 5
287
+ # - Batch Size: 4
288
+ # - Learning Rate: 1e-5
289
+ # 4. Confirm and run
290
+ ```
291
+
292
+ ## 🚀 Future Enhancements
293
+
294
+ ### **Planned Improvements**
295
+ - **GUI Interface**: Web-based configuration interface
296
+ - **Configuration Templates**: Save/load custom configurations
297
+ - **Advanced Validation**: More sophisticated input validation
298
+ - **Progress Tracking**: Real-time progress indicators
299
+ - **Rollback Capability**: Undo changes if needed
300
+
301
+ ### **Extensibility**
302
+ - **Plugin System**: Add custom training configurations
303
+ - **API Integration**: Connect to external services
304
+ - **Multi-GPU Support**: Distributed training options
305
+ - **Advanced Monitoring**: Enhanced tracking capabilities
306
+
307
+ ## 📋 Migration Guide
308
+
309
+ ### **For Existing Users**
310
+ 1. **Backup**: Save your current `launch.sh`
311
+ 2. **Update**: Replace with new interactive version
312
+ 3. **Test**: Run with basic configuration first
313
+ 4. **Migrate**: Use interactive prompts instead of manual editing
314
+
315
+ ### **For New Users**
316
+ 1. **Setup**: Run `python setup_launch.py`
317
+ 2. **Check**: Run `python check_requirements.py`
318
+ 3. **Launch**: Run `./launch.sh`
319
+ 4. **Follow**: Use interactive prompts
320
+
321
+ ## 🎉 Conclusion
322
+
323
+ The interactive pipeline provides a much better user experience with:
324
+ - **Guided Configuration**: No manual editing required
325
+ - **Multiple Options**: Predefined configurations for different use cases
326
+ - **Validation**: Real-time input validation and error handling
327
+ - **Flexibility**: Custom configuration support
328
+ - **Safety**: Confirmation steps and error recovery
329
+
330
+ The script is now production-ready for users of all skill levels, from beginners to advanced researchers.
PIPELINE_SUMMARY.md ADDED
@@ -0,0 +1,330 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # SmolLM3 End-to-End Pipeline - Implementation Summary
2
+
3
+ This document summarizes the comprehensive refactoring and enhancement of the SmolLM3 fine-tuning codebase to create a complete end-to-end pipeline.
4
+
5
+ ## 🎯 Overview
6
+
7
+ The pipeline now provides a complete solution from Trackio Space deployment to model push, with integrated monitoring, dataset management, and automated deployment.
8
+
9
+ ## 📁 Files Created/Modified
10
+
11
+ ### **Core Pipeline Files**
12
+
13
+ 1. **`launch.sh`** - Complete end-to-end pipeline script
14
+ - 16-step comprehensive pipeline
15
+ - Automated environment setup
16
+ - Integrated monitoring and deployment
17
+ - Dynamic configuration generation
18
+
19
+ 2. **`setup_launch.py`** - User configuration helper
20
+ - Interactive setup for user credentials
21
+ - Automatic script configuration
22
+ - Requirements checker generation
23
+
24
+ 3. **`test_pipeline.py`** - Comprehensive testing suite
25
+ - Import testing
26
+ - Component verification
27
+ - CUDA and HF token validation
28
+
29
+ 4. **`README_END_TO_END.md`** - Complete documentation
30
+ - Step-by-step usage guide
31
+ - Troubleshooting section
32
+ - Advanced configuration options
33
+
34
+ ### **Scripts and Utilities**
35
+
36
+ 5. **`scripts/trackio_tonic/trackio_api_client.py`** - API client for Trackio
37
+ - Complete API client implementation
38
+ - Error handling and retry logic
39
+ - Support for both JSON and SSE responses
40
+
41
+ 6. **`scripts/trackio_tonic/deploy_trackio_space.py`** - Space deployment
42
+ - Automated HF Space creation
43
+ - File upload and configuration
44
+ - Space testing and validation
45
+
46
+ 7. **`scripts/trackio_tonic/configure_trackio.py`** - Configuration helper
47
+ - Environment variable setup
48
+ - Dataset repository configuration
49
+ - Usage examples and validation
50
+
51
+ 8. **`scripts/model_tonic/push_to_huggingface.py`** - Model deployment
52
+ - Complete model upload pipeline
53
+ - Model card generation
54
+ - Training results documentation
55
+
56
+ 9. **`scripts/dataset_tonic/setup_hf_dataset.py`** - Dataset setup
57
+ - HF Dataset repository creation
58
+ - Initial experiment data structure
59
+ - Dataset access configuration
60
+
61
+ ### **Source Code Updates**
62
+
63
+ 10. **`src/monitoring.py`** - Enhanced monitoring
64
+ - HF Datasets integration
65
+ - Trackio API client integration
66
+ - Comprehensive metrics logging
67
+
68
+ 11. **`src/train.py`** - Updated training script
69
+ - Monitoring integration
70
+ - HF Datasets support
71
+ - Enhanced error handling
72
+
73
+ 12. **`src/config.py`** - Configuration management
74
+ - Dynamic config loading
75
+ - Multiple config type support
76
+ - Fallback mechanisms
77
+
78
+ 13. **`src/data.py`** - Enhanced dataset handling
79
+ - Multiple format support
80
+ - Automatic conversion
81
+ - Bad entry filtering
82
+
83
+ 14. **`src/model.py`** - Model wrapper
84
+ - SmolLM3-specific optimizations
85
+ - Flash attention support
86
+ - Long context handling
87
+
88
+ 15. **`src/trainer.py`** - Training orchestration
89
+ - Monitoring callback integration
90
+ - Enhanced logging
91
+ - Checkpoint management
92
+
93
+ ## 🔧 Key Improvements
94
+
95
+ ### **1. Import Path Fixes**
96
+ - Fixed all import paths to work with the refactored structure
97
+ - Added proper sys.path handling for cross-module imports
98
+ - Ensured compatibility between different script locations
99
+
100
+ ### **2. Monitoring Integration**
101
+ - **Trackio Space**: Real-time experiment tracking
102
+ - **HF Datasets**: Persistent experiment storage
103
+ - **System Metrics**: GPU, memory, and CPU monitoring
104
+ - **Training Callbacks**: Automatic metric logging
105
+
106
+ ### **3. Dataset Handling**
107
+ - **Multi-format Support**: Prompt/completion, instruction/output, chat formats
108
+ - **Automatic Conversion**: Handles different dataset structures
109
+ - **Validation**: Ensures data quality and completeness
110
+ - **Splitting**: Automatic train/validation/test splits
111
+
112
+ ### **4. Configuration Management**
113
+ - **Dynamic Generation**: Creates configs based on user input
114
+ - **Multiple Types**: Support for different training configurations
115
+ - **Environment Variables**: Proper integration with environment
116
+ - **Validation**: Ensures configuration correctness
117
+
118
+ ### **5. Deployment Automation**
119
+ - **Model Upload**: Complete model push to HF Hub
120
+ - **Model Cards**: Comprehensive documentation generation
121
+ - **Training Results**: Complete experiment documentation
122
+ - **Testing**: Automated model validation
123
+
124
+ ## 🚀 Pipeline Steps
125
+
126
+ The end-to-end pipeline performs these 16 steps:
127
+
128
+ 1. **Environment Setup** - System dependencies and Python environment
129
+ 2. **PyTorch Installation** - CUDA-enabled PyTorch installation
130
+ 3. **Dependencies** - All required Python packages
131
+ 4. **Authentication** - HF token setup and validation
132
+ 5. **Trackio Deployment** - HF Space creation and configuration
133
+ 6. **Dataset Setup** - HF Dataset repository creation
134
+ 7. **Trackio Configuration** - Environment and dataset configuration
135
+ 8. **Training Config** - Dynamic configuration generation
136
+ 9. **Dataset Preparation** - Download and format conversion
137
+ 10. **Parameter Calculation** - Training steps and batch calculations
138
+ 11. **Training Execution** - Model fine-tuning with monitoring
139
+ 12. **Model Push** - Upload to HF Hub with documentation
140
+ 13. **Model Testing** - Validation of uploaded model
141
+ 14. **Summary Report** - Complete training documentation
142
+ 15. **Resource Links** - All online resource URLs
143
+ 16. **Next Steps** - Usage instructions and recommendations
144
+
145
+ ## 📊 Monitoring Features
146
+
147
+ ### **Trackio Space Interface**
148
+ - Real-time training metrics
149
+ - Experiment comparison
150
+ - System resource monitoring
151
+ - Training progress visualization
152
+
153
+ ### **HF Dataset Storage**
154
+ - Persistent experiment data
155
+ - Version-controlled history
156
+ - Collaborative sharing
157
+ - Automated backup
158
+
159
+ ### **Comprehensive Logging**
160
+ - Training metrics (loss, accuracy, etc.)
161
+ - System metrics (GPU, memory, CPU)
162
+ - Configuration parameters
163
+ - Training artifacts
164
+
165
+ ## 🔧 Configuration Options
166
+
167
+ ### **User Configuration**
168
+ ```bash
169
+ # Required
170
+ HF_TOKEN="your_token"
171
+ HF_USERNAME="your_username"
172
+
173
+ # Optional
174
+ MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
175
+ DATASET_NAME="HuggingFaceTB/smoltalk"
176
+ ```
177
+
178
+ ### **Training Parameters**
179
+ ```bash
180
+ BATCH_SIZE=2
181
+ GRADIENT_ACCUMULATION_STEPS=8
182
+ LEARNING_RATE=5e-6
183
+ MAX_EPOCHS=3
184
+ MAX_SEQ_LENGTH=4096
185
+ ```
186
+
187
+ ### **Monitoring Configuration**
188
+ ```bash
189
+ TRACKIO_DATASET_REPO="username/trackio-experiments"
190
+ EXPERIMENT_NAME="smollm3_finetune_YYYYMMDD_HHMMSS"
191
+ ```
192
+
193
+ ## 🛠️ Error Handling
194
+
195
+ ### **Comprehensive Error Handling**
196
+ - Import error detection and reporting
197
+ - Configuration validation
198
+ - Network timeout handling
199
+ - Graceful degradation
200
+
201
+ ### **Debugging Support**
202
+ - Detailed logging at all levels
203
+ - Component-specific error messages
204
+ - Fallback mechanisms
205
+ - Testing utilities
206
+
207
+ ## 📈 Performance Optimizations
208
+
209
+ ### **Training Optimizations**
210
+ - Flash Attention for efficiency
211
+ - Gradient checkpointing for memory
212
+ - Mixed precision training
213
+ - Optimized data loading
214
+
215
+ ### **Monitoring Optimizations**
216
+ - Asynchronous logging
217
+ - Batch metric updates
218
+ - Efficient data storage
219
+ - Minimal overhead
220
+
221
+ ## 🔄 Integration Points
222
+
223
+ ### **Hugging Face Ecosystem**
224
+ - **HF Hub**: Model and dataset storage
225
+ - **HF Spaces**: Trackio monitoring interface
226
+ - **HF Datasets**: Experiment data persistence
227
+ - **HF CLI**: Authentication and deployment
228
+
229
+ ### **External Services**
230
+ - **Trackio**: Experiment tracking
231
+ - **CUDA**: GPU acceleration
232
+ - **PyTorch**: Deep learning framework
233
+ - **Transformers**: Model library
234
+
235
+ ## 🎯 Usage Workflow
236
+
237
+ ### **1. Setup Phase**
238
+ ```bash
239
+ python setup_launch.py # Configure with user info
240
+ python test_pipeline.py # Verify all components
241
+ ```
242
+
243
+ ### **2. Execution Phase**
244
+ ```bash
245
+ chmod +x launch.sh # Make executable
246
+ ./launch.sh # Run complete pipeline
247
+ ```
248
+
249
+ ### **3. Monitoring Phase**
250
+ - Track progress in Trackio Space
251
+ - Monitor metrics in real-time
252
+ - Check logs for issues
253
+ - Validate results
254
+
255
+ ### **4. Results Phase**
256
+ - Access model on HF Hub
257
+ - Review training summary
258
+ - Test model performance
259
+ - Share results
260
+
261
+ ## 📋 Quality Assurance
262
+
263
+ ### **Testing Coverage**
264
+ - Import testing for all modules
265
+ - Script availability verification
266
+ - Configuration validation
267
+ - CUDA and token testing
268
+ - Component integration testing
269
+
270
+ ### **Documentation**
271
+ - Comprehensive README
272
+ - Step-by-step guides
273
+ - Troubleshooting section
274
+ - Advanced usage examples
275
+
276
+ ### **Error Recovery**
277
+ - Graceful error handling
278
+ - Detailed error messages
279
+ - Recovery mechanisms
280
+ - Fallback options
281
+
282
+ ## 🚀 Future Enhancements
283
+
284
+ ### **Planned Improvements**
285
+ - Multi-GPU training support
286
+ - Distributed training
287
+ - Advanced hyperparameter tuning
288
+ - Custom dataset upload
289
+ - Model evaluation metrics
290
+ - Automated testing pipeline
291
+
292
+ ### **Extensibility**
293
+ - Plugin architecture for custom components
294
+ - Configuration templates
295
+ - Custom monitoring backends
296
+ - Advanced deployment options
297
+
298
+ ## 📊 Success Metrics
299
+
300
+ ### **Pipeline Completeness**
301
+ - ✅ All 16 steps implemented
302
+ - ✅ Error handling at each step
303
+ - ✅ Monitoring integration
304
+ - ✅ Documentation complete
305
+
306
+ ### **User Experience**
307
+ - ✅ Simple setup process
308
+ - ✅ Clear error messages
309
+ - ✅ Comprehensive documentation
310
+ - ✅ Testing utilities
311
+
312
+ ### **Technical Quality**
313
+ - ✅ Import path fixes
314
+ - ✅ Configuration management
315
+ - ✅ Monitoring integration
316
+ - ✅ Deployment automation
317
+
318
+ ## 🎉 Conclusion
319
+
320
+ The SmolLM3 end-to-end pipeline provides a complete solution for fine-tuning with integrated monitoring, automated deployment, and comprehensive documentation. The refactored codebase is now production-ready with proper error handling, testing, and user experience considerations.
321
+
322
+ **Key Achievements:**
323
+ - Complete end-to-end automation
324
+ - Integrated monitoring and tracking
325
+ - Comprehensive error handling
326
+ - Production-ready deployment
327
+ - Extensive documentation
328
+ - Testing and validation suite
329
+
330
+ The pipeline is now ready for users to easily fine-tune SmolLM3 models with full monitoring and deployment capabilities.
README.md CHANGED
@@ -1,4 +1,4 @@
1
- # SmolLM3 Fine-tuning for FlexAI Console
2
 
3
  This repository provides a complete setup for fine-tuning SmolLM3 models using the FlexAI console, following the nanoGPT structure but adapted for modern transformer models.
4
 
 
1
+ # SmolLM3 Fine-tuning
2
 
3
  This repository provides a complete setup for fine-tuning SmolLM3 models using the FlexAI console, following the nanoGPT structure but adapted for modern transformer models.
4
 
README_END_TO_END.md ADDED
@@ -0,0 +1,304 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # SmolLM3 End-to-End Fine-tuning Pipeline
2
+
3
+ This repository provides a complete end-to-end pipeline for fine-tuning SmolLM3 models with integrated experiment tracking, monitoring, and model deployment.
4
+
5
+ ## 🚀 Quick Start
6
+
7
+ ### 1. Setup Configuration
8
+
9
+ ```bash
10
+ # Run the setup script to configure with your information
11
+ python setup_launch.py
12
+ ```
13
+
14
+ This will prompt you for:
15
+ - Your Hugging Face username
16
+ - Your Hugging Face token
17
+ - Optional model and dataset customizations
18
+
19
+ ### 2. Check Requirements
20
+
21
+ ```bash
22
+ # Verify all dependencies are installed
23
+ python check_requirements.py
24
+ ```
25
+
26
+ ### 3. Run the Pipeline
27
+
28
+ ```bash
29
+ # Make the script executable and run
30
+ chmod +x launch.sh
31
+ ./launch.sh
32
+ ```
33
+
34
+ ## 📋 What the Pipeline Does
35
+
36
+ The end-to-end pipeline performs the following steps:
37
+
38
+ ### 1. **Environment Setup**
39
+ - Installs system dependencies
40
+ - Creates Python virtual environment
41
+ - Installs PyTorch with CUDA support
42
+ - Installs all required Python packages
43
+
44
+ ### 2. **Trackio Space Deployment**
45
+ - Creates a new Hugging Face Space for experiment tracking
46
+ - Configures the Trackio monitoring interface
47
+ - Sets up environment variables
48
+
49
+ ### 3. **HF Dataset Setup**
50
+ - Creates a Hugging Face Dataset repository for experiment storage
51
+ - Configures dataset access and permissions
52
+ - Sets up initial experiment data structure
53
+
54
+ ### 4. **Dataset Preparation**
55
+ - Downloads the specified dataset from Hugging Face Hub
56
+ - Converts to training format (prompt/completion pairs)
57
+ - Handles multiple dataset formats automatically
58
+ - Creates train/validation splits
59
+
60
+ ### 5. **Training Configuration**
61
+ - Creates optimized training configuration
62
+ - Sets up monitoring integration
63
+ - Configures model parameters and hyperparameters
64
+
65
+ ### 6. **Model Training**
66
+ - Runs the SmolLM3 fine-tuning process
67
+ - Logs metrics to Trackio Space in real-time
68
+ - Saves experiment data to HF Dataset
69
+ - Creates checkpoints during training
70
+
71
+ ### 7. **Model Deployment**
72
+ - Pushes trained model to Hugging Face Hub
73
+ - Creates comprehensive model card
74
+ - Uploads training results and logs
75
+ - Tests the uploaded model
76
+
77
+ ### 8. **Summary Report**
78
+ - Generates detailed training summary
79
+ - Provides links to all resources
80
+ - Documents configuration and results
81
+
82
+ ## 🎯 Features
83
+
84
+ ### **Integrated Monitoring**
85
+ - Real-time experiment tracking via Trackio Space
86
+ - Persistent storage in Hugging Face Datasets
87
+ - Comprehensive metrics logging
88
+ - System resource monitoring
89
+
90
+ ### **Flexible Dataset Support**
91
+ - Automatic format detection and conversion
92
+ - Support for multiple dataset types
93
+ - Built-in data preprocessing
94
+ - Train/validation split handling
95
+
96
+ ### **Optimized Training**
97
+ - Flash Attention support for efficiency
98
+ - Gradient checkpointing for memory optimization
99
+ - Mixed precision training
100
+ - Automatic hyperparameter optimization
101
+
102
+ ### **Complete Deployment**
103
+ - Automated model upload to Hugging Face Hub
104
+ - Comprehensive model cards
105
+ - Training results documentation
106
+ - Model testing and validation
107
+
108
+ ## 📊 Monitoring & Tracking
109
+
110
+ ### **Trackio Space Interface**
111
+ - Real-time training metrics visualization
112
+ - Experiment management and comparison
113
+ - System resource monitoring
114
+ - Training progress tracking
115
+
116
+ ### **HF Dataset Storage**
117
+ - Persistent experiment data storage
118
+ - Version-controlled experiment history
119
+ - Collaborative experiment sharing
120
+ - Automated data backup
121
+
122
+ ## 🔧 Configuration
123
+
124
+ ### **Required Configuration**
125
+ Update these variables in `launch.sh`:
126
+
127
+ ```bash
128
+ # Your Hugging Face credentials
129
+ HF_TOKEN="your_hf_token_here"
130
+ HF_USERNAME="your-username"
131
+
132
+ # Model and dataset
133
+ MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
134
+ DATASET_NAME="HuggingFaceTB/smoltalk"
135
+
136
+ # Output repositories
137
+ REPO_NAME="your-username/smollm3-finetuned-$(date +%Y%m%d)"
138
+ TRACKIO_DATASET_REPO="your-username/trackio-experiments"
139
+ ```
140
+
141
+ ### **Training Parameters**
142
+ Customize training parameters:
143
+
144
+ ```bash
145
+ # Training configuration
146
+ BATCH_SIZE=2
147
+ GRADIENT_ACCUMULATION_STEPS=8
148
+ LEARNING_RATE=5e-6
149
+ MAX_EPOCHS=3
150
+ MAX_SEQ_LENGTH=4096
151
+ ```
152
+
153
+ ## 📁 Output Structure
154
+
155
+ After running the pipeline, you'll have:
156
+
157
+ ```
158
+ ├── training_dataset/ # Prepared dataset
159
+ │ ├── train.json
160
+ │ └── validation.json
161
+ ├── /output-checkpoint/ # Model checkpoints
162
+ │ ├── config.json
163
+ │ ├── pytorch_model.bin
164
+ │ └── training_results/
165
+ ├── training.log # Training logs
166
+ ├── training_summary.md # Summary report
167
+ └── config/train_smollm3_end_to_end.py # Training config
168
+ ```
169
+
170
+ ## 🌐 Online Resources
171
+
172
+ The pipeline creates these online resources:
173
+
174
+ - **Model Repository**: `https://huggingface.co/your-username/smollm3-finetuned-YYYYMMDD`
175
+ - **Trackio Space**: `https://huggingface.co/spaces/your-username/trackio-monitoring-YYYYMMDD`
176
+ - **Experiment Dataset**: `https://huggingface.co/datasets/your-username/trackio-experiments`
177
+
178
+ ## 🛠️ Troubleshooting
179
+
180
+ ### **Common Issues**
181
+
182
+ 1. **HF Token Issues**
183
+ ```bash
184
+ # Verify your token is correct
185
+ huggingface-cli whoami
186
+ ```
187
+
188
+ 2. **CUDA Issues**
189
+ ```bash
190
+ # Check CUDA availability
191
+ python -c "import torch; print(torch.cuda.is_available())"
192
+ ```
193
+
194
+ 3. **Memory Issues**
195
+ ```bash
196
+ # Reduce batch size or gradient accumulation
197
+ BATCH_SIZE=1
198
+ GRADIENT_ACCUMULATION_STEPS=16
199
+ ```
200
+
201
+ 4. **Dataset Issues**
202
+ ```bash
203
+ # Test dataset access
204
+ python -c "from datasets import load_dataset; print(load_dataset('your-dataset'))"
205
+ ```
206
+
207
+ ### **Debug Mode**
208
+
209
+ Run individual components for debugging:
210
+
211
+ ```bash
212
+ # Test Trackio deployment
213
+ cd scripts/trackio_tonic
214
+ python deploy_trackio_space.py
215
+
216
+ # Test dataset setup
217
+ cd scripts/dataset_tonic
218
+ python setup_hf_dataset.py
219
+
220
+ # Test training
221
+ python src/train.py config/train_smollm3_end_to_end.py --help
222
+ ```
223
+
224
+ ## 📚 Advanced Usage
225
+
226
+ ### **Custom Datasets**
227
+
228
+ For custom datasets, ensure they have one of these formats:
229
+
230
+ ```json
231
+ // Format 1: Prompt/Completion
232
+ {
233
+ "prompt": "What is machine learning?",
234
+ "completion": "Machine learning is..."
235
+ }
236
+
237
+ // Format 2: Instruction/Output
238
+ {
239
+ "instruction": "Explain machine learning",
240
+ "output": "Machine learning is..."
241
+ }
242
+
243
+ // Format 3: Chat format
244
+ {
245
+ "messages": [
246
+ {"role": "user", "content": "What is ML?"},
247
+ {"role": "assistant", "content": "ML is..."}
248
+ ]
249
+ }
250
+ ```
251
+
252
+ ### **Custom Models**
253
+
254
+ To use different models, update the configuration:
255
+
256
+ ```bash
257
+ MODEL_NAME="microsoft/DialoGPT-medium"
258
+ MAX_SEQ_LENGTH=1024
259
+ ```
260
+
261
+ ### **Custom Training**
262
+
263
+ Modify training parameters in the generated config:
264
+
265
+ ```python
266
+ # In config/train_smollm3_end_to_end.py
267
+ config = SmolLM3Config(
268
+ learning_rate=1e-5, # Custom learning rate
269
+ max_iters=5000, # Custom training steps
270
+ # ... other parameters
271
+ )
272
+ ```
273
+
274
+ ## 🤝 Contributing
275
+
276
+ 1. Fork the repository
277
+ 2. Create a feature branch
278
+ 3. Make your changes
279
+ 4. Test the pipeline
280
+ 5. Submit a pull request
281
+
282
+ ## 📄 License
283
+
284
+ This project is licensed under the MIT License - see the LICENSE file for details.
285
+
286
+ ## 🙏 Acknowledgments
287
+
288
+ - Hugging Face for the excellent transformers library
289
+ - The SmolLM3 team for the base model
290
+ - The Trackio team for experiment tracking
291
+ - The open-source community for contributions
292
+
293
+ ## 📞 Support
294
+
295
+ For issues and questions:
296
+
297
+ 1. Check the troubleshooting section
298
+ 2. Review the logs in `training.log`
299
+ 3. Check the Trackio Space for monitoring data
300
+ 4. Open an issue on GitHub
301
+
302
+ ---
303
+
304
+ **Happy Fine-tuning! 🚀**
cloud_deployment.sh DELETED
@@ -1,279 +0,0 @@
1
- #!/bin/bash
2
- # Cloud Deployment Script for SmolLM3 DPO Training
3
- # This script sets up a cloud instance for training and uploading to Hugging Face
4
-
5
- set -e # Exit on any error
6
-
7
- echo "🚀 Starting SmolLM3 DPO Cloud Deployment"
8
- echo "=========================================="
9
-
10
- # Configuration
11
- MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
12
- DATASET_NAME="HuggingFaceTB/smoltalk"
13
- EXPERIMENT_NAME="smollm3_dpo_6epochs"
14
- REPO_NAME="your-username/smollm3-dpo-6epochs" # Change this to your username
15
- TRACKIO_URL="https://your-trackio-space.hf.space" # Change this to your Trackio Space URL
16
- HF_TOKEN="your_hf_token_here" # Change this to your HF token
17
-
18
- # Training Configuration
19
- BATCH_SIZE=2
20
- GRADIENT_ACCUMULATION_STEPS=8
21
- LEARNING_RATE=5e-6
22
- MAX_EPOCHS=6
23
- MAX_SEQ_LENGTH=4096
24
- SAVE_STEPS=500
25
- EVAL_STEPS=100
26
- LOGGING_STEPS=10
27
-
28
- echo "📋 Configuration:"
29
- echo " Model: $MODEL_NAME"
30
- echo " Dataset: $DATASET_NAME"
31
- echo " Experiment: $EXPERIMENT_NAME"
32
- echo " Repository: $REPO_NAME"
33
- echo " Epochs: $MAX_EPOCHS"
34
- echo " Batch Size: $BATCH_SIZE"
35
- echo " Learning Rate: $LEARNING_RATE"
36
-
37
- # Step 1: Update system and install dependencies
38
- echo ""
39
- echo "🔧 Step 1: Installing system dependencies..."
40
- sudo apt-get update
41
- sudo apt-get install -y git curl wget unzip
42
-
43
- # Step 2: Install Python and pip
44
- echo ""
45
- echo "🐍 Step 2: Installing Python dependencies..."
46
- sudo apt-get install -y python3 python3-pip python3-venv
47
-
48
- # Step 3: Create virtual environment
49
- echo ""
50
- echo "📦 Step 3: Setting up Python virtual environment..."
51
- python3 -m venv smollm3_env
52
- source smollm3_env/bin/activate
53
-
54
- # Step 4: Install PyTorch and CUDA
55
- echo ""
56
- echo "🔥 Step 4: Installing PyTorch with CUDA support..."
57
- pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
58
-
59
- # Step 5: Install project dependencies
60
- echo ""
61
- echo "📚 Step 5: Installing project dependencies..."
62
- pip install -r requirements.txt
63
-
64
- # Step 6: Install additional dependencies for DPO
65
- echo ""
66
- echo "🎯 Step 6: Installing DPO-specific dependencies..."
67
- pip install trl>=0.7.0
68
- pip install peft>=0.4.0
69
- pip install accelerate>=0.20.0
70
-
71
- # Step 7: Set up Hugging Face token
72
- echo ""
73
- echo "🔑 Step 7: Setting up Hugging Face authentication..."
74
- export HF_TOKEN="$HF_TOKEN"
75
- huggingface-cli login --token $HF_TOKEN
76
-
77
- # Step 8: Create DPO configuration
78
- echo ""
79
- echo "⚙️ Step 8: Creating DPO configuration..."
80
- cat > config/train_smollm3_dpo_6epochs.py << EOF
81
- """
82
- SmolLM3 DPO Training Configuration - 6 Epochs
83
- Optimized for cloud deployment
84
- """
85
-
86
- from config.train_smollm3_dpo import SmolLM3DPOConfig
87
-
88
- config = SmolLM3DPOConfig(
89
- # Model configuration
90
- model_name="$MODEL_NAME",
91
- max_seq_length=$MAX_SEQ_LENGTH,
92
- use_flash_attention=True,
93
- use_gradient_checkpointing=True,
94
-
95
- # Training configuration
96
- batch_size=$BATCH_SIZE,
97
- gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS,
98
- learning_rate=$LEARNING_RATE,
99
- weight_decay=0.01,
100
- warmup_steps=100,
101
- max_iters=None, # Will be calculated based on epochs
102
- eval_interval=100,
103
- log_interval=10,
104
- save_interval=500,
105
-
106
- # DPO configuration
107
- beta=0.1,
108
- max_prompt_length=$((MAX_SEQ_LENGTH // 2)),
109
-
110
- # Optimizer configuration
111
- optimizer="adamw",
112
- beta1=0.9,
113
- beta2=0.95,
114
- eps=1e-8,
115
-
116
- # Scheduler configuration
117
- scheduler="cosine",
118
- min_lr=1e-6,
119
-
120
- # Mixed precision
121
- fp16=True,
122
- bf16=False,
123
-
124
- # Logging and saving
125
- save_steps=$SAVE_STEPS,
126
- eval_steps=$EVAL_STEPS,
127
- logging_steps=$LOGGING_STEPS,
128
- save_total_limit=3,
129
-
130
- # Evaluation
131
- eval_strategy="steps",
132
- metric_for_best_model="eval_loss",
133
- greater_is_better=False,
134
- load_best_model_at_end=True,
135
-
136
- # Data configuration
137
- data_dir="smoltalk_dataset",
138
- train_file="train.json",
139
- validation_file="validation.json",
140
-
141
- # Chat template configuration
142
- use_chat_template=True,
143
- chat_template_kwargs={
144
- "enable_thinking": False,
145
- "add_generation_prompt": True
146
- },
147
-
148
- # Trackio monitoring configuration
149
- enable_tracking=True,
150
- trackio_url="$TRACKIO_URL",
151
- trackio_token=None,
152
- log_artifacts=True,
153
- log_metrics=True,
154
- log_config=True,
155
- experiment_name="$EXPERIMENT_NAME"
156
- )
157
- EOF
158
-
159
- # Step 9: Download and prepare dataset
160
- echo ""
161
- echo "📊 Step 9: Downloading and preparing dataset..."
162
- python -c "
163
- from datasets import load_dataset
164
- import json
165
- import os
166
-
167
- # Load SmolTalk dataset
168
- print('Loading SmolTalk dataset...')
169
- dataset = load_dataset('$DATASET_NAME')
170
-
171
- # Create dataset directory
172
- os.makedirs('smoltalk_dataset', exist_ok=True)
173
-
174
- # Convert to DPO format (preference pairs)
175
- def convert_to_dpo_format(example):
176
- # For SmolTalk, we'll create preference pairs based on response quality
177
- # This is a simplified example - you may need to adjust based on your needs
178
- return {
179
- 'prompt': example.get('prompt', ''),
180
- 'chosen': example.get('chosen', ''),
181
- 'rejected': example.get('rejected', '')
182
- }
183
-
184
- # Process train split
185
- train_data = []
186
- for example in dataset['train']:
187
- dpo_example = convert_to_dpo_format(example)
188
- if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
189
- train_data.append(dpo_example)
190
-
191
- # Process validation split
192
- val_data = []
193
- for example in dataset['validation']:
194
- dpo_example = convert_to_dpo_format(example)
195
- if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
196
- val_data.append(dpo_example)
197
-
198
- # Save to files
199
- with open('smoltalk_dataset/train.json', 'w') as f:
200
- json.dump(train_data, f, indent=2)
201
-
202
- with open('smoltalk_dataset/validation.json', 'w') as f:
203
- json.dump(val_data, f, indent=2)
204
-
205
- print(f'Dataset prepared: {len(train_data)} train samples, {len(val_data)} validation samples')
206
- "
207
-
208
- # Step 10: Calculate training steps based on epochs
209
- echo ""
210
- echo "📈 Step 10: Calculating training parameters..."
211
- TOTAL_SAMPLES=$(python -c "import json; data=json.load(open('smoltalk_dataset/train.json')); print(len(data))")
212
- EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
213
- STEPS_PER_EPOCH=$((TOTAL_SAMPLES / EFFECTIVE_BATCH_SIZE))
214
- MAX_STEPS=$((STEPS_PER_EPOCH * MAX_EPOCHS))
215
-
216
- echo " Total samples: $TOTAL_SAMPLES"
217
- echo " Effective batch size: $EFFECTIVE_BATCH_SIZE"
218
- echo " Steps per epoch: $STEPS_PER_EPOCH"
219
- echo " Total training steps: $MAX_STEPS"
220
-
221
- # Step 11: Start DPO training
222
- echo ""
223
- echo "🎯 Step 11: Starting DPO training..."
224
- python train.py config/train_smollm3_dpo_6epochs.py \
225
- --dataset_dir smoltalk_dataset \
226
- --out_dir /output-checkpoint \
227
- --init_from scratch \
228
- --max_iters $MAX_STEPS \
229
- --batch_size $BATCH_SIZE \
230
- --learning_rate $LEARNING_RATE \
231
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
232
- --max_seq_length $MAX_SEQ_LENGTH \
233
- --save_steps $SAVE_STEPS \
234
- --eval_steps $EVAL_STEPS \
235
- --logging_steps $LOGGING_STEPS \
236
- --enable_tracking \
237
- --trackio_url "$TRACKIO_URL" \
238
- --experiment_name "$EXPERIMENT_NAME"
239
-
240
- # Step 12: Push model to Hugging Face Hub
241
- echo ""
242
- echo "📤 Step 12: Pushing model to Hugging Face Hub..."
243
- python push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
244
- --token "$HF_TOKEN" \
245
- --trackio-url "$TRACKIO_URL" \
246
- --experiment-name "$EXPERIMENT_NAME"
247
-
248
- # Step 13: Test the uploaded model
249
- echo ""
250
- echo "🧪 Step 13: Testing uploaded model..."
251
- python -c "
252
- from transformers import AutoModelForCausalLM, AutoTokenizer
253
- import torch
254
-
255
- print('Loading uploaded model...')
256
- model = AutoModelForCausalLM.from_pretrained('$REPO_NAME', torch_dtype=torch.float16, device_map='auto')
257
- tokenizer = AutoTokenizer.from_pretrained('$REPO_NAME')
258
-
259
- print('Testing model generation...')
260
- prompt = 'Hello, how are you?'
261
- inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
262
- outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
263
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
264
- print(f'Prompt: {prompt}')
265
- print(f'Response: {response}')
266
- print('✅ Model test completed successfully!')
267
- "
268
-
269
- echo ""
270
- echo "🎉 Deployment completed successfully!"
271
- echo "====================================="
272
- echo "📊 Model: https://huggingface.co/$REPO_NAME"
273
- echo "📈 Trackio: $TRACKIO_URL"
274
- echo "📋 Experiment: $EXPERIMENT_NAME"
275
- echo ""
276
- echo "Next steps:"
277
- echo "1. Monitor training progress in your Trackio Space"
278
- echo "2. Check the model repository on Hugging Face Hub"
279
- echo "3. Use the model in your applications"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
config/train_smollm3.py CHANGED
@@ -76,6 +76,10 @@ class SmolLM3Config:
76
  log_metrics: bool = True
77
  log_config: bool = True
78
  experiment_name: Optional[str] = None
 
 
 
 
79
 
80
  def __post_init__(self):
81
  if self.chat_template_kwargs is None:
 
76
  log_metrics: bool = True
77
  log_config: bool = True
78
  experiment_name: Optional[str] = None
79
+ # HF Datasets configuration
80
+ hf_token: Optional[str] = None
81
+ dataset_repo: Optional[str] = None
82
+
83
 
84
  def __post_init__(self):
85
  if self.chat_template_kwargs is None:
config/train_smollm3_h100_lightweight.py ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ SmolLM3 H100 Lightweight Training Configuration
3
+ Optimized for rapid training on H100 with 80K Hermes-FR samples
4
+ """
5
+
6
+ from config.train_smollm3 import SmolLM3Config
7
+
8
+ config = SmolLM3Config(
9
+ # Model configuration
10
+ model_name="HuggingFaceTB/SmolLM3-3B",
11
+ max_seq_length=8192,
12
+ use_flash_attention=True,
13
+ use_gradient_checkpointing=True,
14
+
15
+ # Training configuration - Optimized for H100
16
+ batch_size=16, # Larger batch size for H100
17
+ gradient_accumulation_steps=4, # Reduced for faster updates
18
+ learning_rate=8e-6, # Slightly higher for rapid convergence
19
+ weight_decay=0.01,
20
+ warmup_steps=50, # Reduced warmup for rapid training
21
+ max_iters=None, # Will be calculated based on epochs
22
+ eval_interval=50, # More frequent evaluation
23
+ log_interval=5, # More frequent logging
24
+ save_interval=200, # More frequent saving
25
+
26
+ # Optimizer configuration - Optimized for rapid training
27
+ optimizer="adamw",
28
+ beta1=0.9,
29
+ beta2=0.95,
30
+ eps=1e-8,
31
+
32
+ # Scheduler configuration - Faster learning
33
+ scheduler="cosine",
34
+ min_lr=2e-6, # Higher minimum LR
35
+
36
+ # Mixed precision - fp16 half precision for H100 (bf16 disabled)
37
+ fp16=True,
38
+ bf16=False,
39
+
40
+ # Logging and saving - More frequent for rapid training
41
+ save_steps=200,
42
+ eval_steps=50,
43
+ logging_steps=5,
44
+ save_total_limit=2, # Keep fewer checkpoints
45
+
46
+ # Evaluation
47
+ eval_strategy="steps",
48
+ metric_for_best_model="eval_loss",
49
+ greater_is_better=False,
50
+ load_best_model_at_end=True,
51
+
52
+ # Data configuration - Hermes-FR with sampling
53
+ dataset_name="legmlai/openhermes-fr",
54
+ dataset_split="train",
55
+ input_field="prompt",
56
+ target_field="completion",
57
+ filter_bad_entries=False,
58
+ bad_entry_field="bad_entry",
59
+
60
+ # Chat template configuration
61
+ use_chat_template=True,
62
+ chat_template_kwargs={
63
+ "enable_thinking": False,
64
+ "add_generation_prompt": True,
65
+ "no_think_system_message": True
66
+ },
67
+
68
+ # Trackio monitoring configuration
69
+ enable_tracking=True,
70
+ trackio_url=None, # Will be set by launch script
71
+ trackio_token=None,
72
+ log_artifacts=True,
73
+ log_metrics=True,
74
+ log_config=True,
75
+ experiment_name=None, # Will be set by launch script
76
+
77
+ # HF Datasets configuration
78
+ dataset_repo=None, # Will be set by launch script
79
+
80
+ # H100-specific optimizations
81
+ dataloader_num_workers=4, # Optimized for H100
82
+ dataloader_pin_memory=True,
83
+ gradient_clipping=1.0, # Prevent gradient explosion
84
+
85
+ # Memory optimizations for rapid training
86
+ max_grad_norm=1.0,
87
+ warmup_ratio=0.1, # 10% warmup
88
+ lr_scheduler_type="cosine",
89
+
90
+ # Early stopping for rapid training
91
+ early_stopping_patience=3,
92
+ early_stopping_threshold=0.001,
93
+
94
+ # H100-specific training optimizations
95
+ remove_unused_columns=False,
96
+ group_by_length=True, # Group similar length sequences
97
+ length_column_name="length",
98
+ ignore_data_skip=False,
99
+
100
+ # Reporting
101
+ report_to=["tensorboard"],
102
+ run_name="smollm3-h100-lightweight",
103
+
104
+ # Seed for reproducibility
105
+ seed=42,
106
+
107
+ # Data collator settings
108
+ data_collator_kwargs={
109
+ "pad_to_multiple_of": 8, # Optimized for H100
110
+ "return_tensors": "pt"
111
+ }
112
+ )
config/train_smollm3_openhermes_fr.py CHANGED
@@ -85,6 +85,10 @@ class SmolLM3ConfigOpenHermesFR(SmolLM3Config):
85
  log_metrics: bool = True
86
  log_config: bool = True
87
  experiment_name: Optional[str] = None
 
 
 
 
88
 
89
  def __post_init__(self):
90
  if self.chat_template_kwargs is None:
 
85
  log_metrics: bool = True
86
  log_config: bool = True
87
  experiment_name: Optional[str] = None
88
+ # HF Datasets configuration
89
+ hf_token: Optional[str] = None
90
+ dataset_repo: Optional[str] = None
91
+
92
 
93
  def __post_init__(self):
94
  if self.chat_template_kwargs is None:
config/train_smollm3_openhermes_fr_a100_balanced.py CHANGED
@@ -91,6 +91,10 @@ class SmolLM3ConfigOpenHermesFRBalanced(SmolLM3Config):
91
  log_metrics: bool = True
92
  log_config: bool = True
93
  experiment_name: Optional[str] = None
 
 
 
 
94
 
95
  # Additional A100 optimizations for balanced performance
96
  dataloader_num_workers: int = 10 # More workers for faster data loading
 
91
  log_metrics: bool = True
92
  log_config: bool = True
93
  experiment_name: Optional[str] = None
94
+ # HF Datasets configuration
95
+ hf_token: Optional[str] = None
96
+ dataset_repo: Optional[str] = None
97
+
98
 
99
  # Additional A100 optimizations for balanced performance
100
  dataloader_num_workers: int = 10 # More workers for faster data loading
config/train_smollm3_openhermes_fr_a100_large.py CHANGED
@@ -85,6 +85,10 @@ class SmolLM3ConfigOpenHermesFRA100Large(SmolLM3Config):
85
  log_metrics: bool = True
86
  log_config: bool = True
87
  experiment_name: Optional[str] = None
 
 
 
 
88
 
89
  # Additional A100 optimizations
90
  dataloader_num_workers: int = 8 # More workers for faster data loading
 
85
  log_metrics: bool = True
86
  log_config: bool = True
87
  experiment_name: Optional[str] = None
88
+ # HF Datasets configuration
89
+ hf_token: Optional[str] = None
90
+ dataset_repo: Optional[str] = None
91
+
92
 
93
  # Additional A100 optimizations
94
  dataloader_num_workers: int = 8 # More workers for faster data loading
config/train_smollm3_openhermes_fr_a100_max_performance.py CHANGED
@@ -85,6 +85,10 @@ class SmolLM3ConfigOpenHermesFRMaxPerformance(SmolLM3Config):
85
  log_metrics: bool = True
86
  log_config: bool = True
87
  experiment_name: Optional[str] = None
 
 
 
 
88
 
89
  # Additional A100 optimizations for maximum performance
90
  dataloader_num_workers: int = 12 # More workers for faster data loading
 
85
  log_metrics: bool = True
86
  log_config: bool = True
87
  experiment_name: Optional[str] = None
88
+ # HF Datasets configuration
89
+ hf_token: Optional[str] = None
90
+ dataset_repo: Optional[str] = None
91
+
92
 
93
  # Additional A100 optimizations for maximum performance
94
  dataloader_num_workers: int = 12 # More workers for faster data loading
config/train_smollm3_openhermes_fr_a100_multiple_passes.py CHANGED
@@ -85,6 +85,10 @@ class SmolLM3ConfigOpenHermesFRMultiplePasses(SmolLM3Config):
85
  log_metrics: bool = True
86
  log_config: bool = True
87
  experiment_name: Optional[str] = None
 
 
 
 
88
 
89
  # Additional A100 optimizations
90
  dataloader_num_workers: int = 8 # More workers for faster data loading
 
85
  log_metrics: bool = True
86
  log_config: bool = True
87
  experiment_name: Optional[str] = None
88
+ # HF Datasets configuration
89
+ hf_token: Optional[str] = None
90
+ dataset_repo: Optional[str] = None
91
+
92
 
93
  # Additional A100 optimizations
94
  dataloader_num_workers: int = 8 # More workers for faster data loading
A100_LARGE_SCALE_GUIDE.md → docs/A100_LARGE_SCALE_GUIDE.md RENAMED
File without changes
docs/APP_CONFIGURATION_GUIDE.md ADDED
@@ -0,0 +1,234 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ⚙️ App Configuration Guide
2
+
3
+ ## Overview
4
+
5
+ The Trackio app now includes a **Configuration tab** that allows you to set your Hugging Face token and dataset repository directly through the interface, providing an alternative to environment variables.
6
+
7
+ ## 🚀 New Features
8
+
9
+ ### **Configuration Tab**
10
+ - ✅ **HF Token Input**: Secure password field for your Hugging Face token
11
+ - ✅ **Dataset Repository Input**: Text field for your dataset repository
12
+ - ✅ **Update Configuration**: Apply new settings and reload experiments
13
+ - ✅ **Test Connection**: Verify access to the dataset repository
14
+ - ✅ **Create Dataset**: Create a new dataset repository if it doesn't exist
15
+
16
+ ### **Flexible Configuration**
17
+ - ✅ **Environment Variables**: Still supported as fallback
18
+ - ✅ **Interface Input**: New direct input method
19
+ - ✅ **Dynamic Updates**: Change configuration without restarting
20
+ - ✅ **Validation**: Input validation and error handling
21
+
22
+ ## 📋 Configuration Tab Usage
23
+
24
+ ### **1. Access the Configuration Tab**
25
+ - Open the Trackio app
26
+ - Click on the "⚙️ Configuration" tab
27
+ - You'll see input fields for HF Token and Dataset Repository
28
+
29
+ ### **2. Set Your HF Token**
30
+ ```
31
+ Hugging Face Token: hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
32
+ ```
33
+ - **Type**: Password field (hidden for security)
34
+ - **Required**: Yes (for dataset access)
35
+ - **Format**: Your HF token starting with `hf_`
36
+ - **Help**: Click the help text for instructions on getting your token
37
+
38
+ ### **3. Set Your Dataset Repository**
39
+ ```
40
+ Dataset Repository: your-username/your-dataset-name
41
+ ```
42
+ - **Type**: Text field
43
+ - **Required**: No (defaults to `tonic/trackio-experiments`)
44
+ - **Format**: `username/dataset-name`
45
+ - **Examples**:
46
+ - `tonic/trackio-experiments`
47
+ - `your-username/my-experiments`
48
+ - `your-org/team-experiments`
49
+
50
+ ### **4. Use the Action Buttons**
51
+
52
+ #### **Update Configuration**
53
+ - Applies new settings immediately
54
+ - Reloads experiments with new configuration
55
+ - Shows current status and experiment count
56
+
57
+ #### **Test Connection**
58
+ - Verifies access to the dataset repository
59
+ - Tests HF token permissions
60
+ - Shows dataset information and experiment count
61
+
62
+ #### **Create Dataset**
63
+ - Creates a new dataset repository if it doesn't exist
64
+ - Sets up the correct schema for experiments
65
+ - Makes the dataset private by default
66
+
67
+ ## 🔧 Configuration Methods
68
+
69
+ ### **Method 1: Interface Input (New)**
70
+ 1. Go to "⚙️ Configuration" tab
71
+ 2. Enter your HF token and dataset repository
72
+ 3. Click "Update Configuration"
73
+ 4. Verify with "Test Connection"
74
+
75
+ ### **Method 2: Environment Variables (Existing)**
76
+ ```bash
77
+ # Set environment variables
78
+ export HF_TOKEN=your_hf_token_here
79
+ export TRACKIO_DATASET_REPO=your-username/your-dataset-name
80
+
81
+ # Or for HF Spaces, add to Space settings
82
+ HF_TOKEN=your_hf_token_here
83
+ TRACKIO_DATASET_REPO=your-username/your-dataset-name
84
+ ```
85
+
86
+ ### **Method 3: Hybrid Approach**
87
+ - Set environment variables as defaults
88
+ - Override specific values through the interface
89
+ - Interface values take precedence over environment variables
90
+
91
+ ## 📊 Configuration Priority
92
+
93
+ The app uses this priority order for configuration:
94
+
95
+ 1. **Interface Input** (highest priority)
96
+ 2. **Environment Variables** (fallback)
97
+ 3. **Default Values** (lowest priority)
98
+
99
+ ## 🛠️ Getting Your HF Token
100
+
101
+ ### **Step-by-Step Instructions**
102
+ 1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
103
+ 2. Click "New token"
104
+ 3. Give it a name (e.g., "Trackio Access")
105
+ 4. Select "Write" permissions
106
+ 5. Click "Generate token"
107
+ 6. Copy the token (starts with `hf_`)
108
+ 7. Paste it in the app's HF Token field
109
+
110
+ ### **Token Permissions**
111
+ - **Read**: Required for loading experiments
112
+ - **Write**: Required for saving experiments
113
+ - **Scope**: Should have access to your dataset repositories
114
+
115
+ ## 📁 Dataset Repository Format
116
+
117
+ ### **Correct Format**
118
+ ```
119
+ username/dataset-name
120
+ ```
121
+
122
+ ### **Examples**
123
+ - `tonic/trackio-experiments` (default)
124
+ - `your-username/my-experiments`
125
+ - `your-org/team-experiments`
126
+ - `your-username/smollm3-experiments`
127
+
128
+ ### **Validation**
129
+ - Must contain exactly one `/`
130
+ - Username must be valid HF username
131
+ - Dataset name must be valid (alphanumeric + hyphens)
132
+
133
+ ## 🔍 Testing Your Configuration
134
+
135
+ ### **1. Test Connection**
136
+ - Enter your HF token and dataset repository
137
+ - Click "Test Connection"
138
+ - Should show: "✅ Connection successful!"
139
+
140
+ ### **2. Create Dataset (if needed)**
141
+ - If dataset doesn't exist, click "Create Dataset"
142
+ - Should show: "✅ Dataset created successfully!"
143
+
144
+ ### **3. Update Configuration**
145
+ - Click "Update Configuration"
146
+ - Should show: "✅ Configuration updated successfully!"
147
+
148
+ ## 🚨 Troubleshooting
149
+
150
+ ### **Issue: "Please provide a Hugging Face token"**
151
+ **Solution**:
152
+ - Enter your HF token in the interface
153
+ - Or set the `HF_TOKEN` environment variable
154
+
155
+ ### **Issue: "Connection failed: 401 Unauthorized"**
156
+ **Solutions**:
157
+ 1. Check your HF token is correct
158
+ 2. Verify the token has read access to the dataset
159
+ 3. Ensure the dataset repository exists
160
+
161
+ ### **Issue: "Failed to create dataset"**
162
+ **Solutions**:
163
+ 1. Check your HF token has write permissions
164
+ 2. Verify the username in the repository name
165
+ 3. Ensure the dataset name is valid
166
+
167
+ ### **Issue: "Dataset repository must be in format: username/dataset-name"**
168
+ **Solution**:
169
+ - Use the correct format: `username/dataset-name`
170
+ - Example: `your-username/my-experiments`
171
+
172
+ ## 📈 Benefits
173
+
174
+ ### **For Users**
175
+ - ✅ **Easy Setup**: No need to set environment variables
176
+ - ✅ **Visual Interface**: Clear input fields and validation
177
+ - ✅ **Immediate Feedback**: Test connection and see results
178
+ - ✅ **Flexible**: Can change configuration anytime
179
+
180
+ ### **For Development**
181
+ - ✅ **Backward Compatible**: Environment variables still work
182
+ - ✅ **Fallback Support**: Graceful degradation
183
+ - ✅ **Error Handling**: Clear error messages
184
+ - ✅ **Validation**: Input validation and testing
185
+
186
+ ### **For Deployment**
187
+ - ✅ **HF Spaces Ready**: Works on Hugging Face Spaces
188
+ - ✅ **No Restart Required**: Dynamic configuration updates
189
+ - ✅ **Secure**: Password field for token input
190
+ - ✅ **User-Friendly**: Clear instructions and help text
191
+
192
+ ## 🎯 Usage Examples
193
+
194
+ ### **Basic Setup**
195
+ 1. Open the app
196
+ 2. Go to "⚙️ Configuration" tab
197
+ 3. Enter your HF token
198
+ 4. Enter your dataset repository
199
+ 5. Click "Update Configuration"
200
+ 6. Click "Test Connection" to verify
201
+
202
+ ### **Advanced Setup**
203
+ 1. Set environment variables as defaults
204
+ 2. Use interface to override specific values
205
+ 3. Test connection to verify access
206
+ 4. Create dataset if it doesn't exist
207
+ 5. Start using the app with persistent storage
208
+
209
+ ### **Team Setup**
210
+ 1. Create a shared dataset repository
211
+ 2. Share the repository name with team
212
+ 3. Each team member sets their own HF token
213
+ 4. All experiments are stored in the shared dataset
214
+
215
+ ## 📋 Configuration Status
216
+
217
+ The app shows current configuration status:
218
+ ```
219
+ 📊 Dataset: your-username/your-dataset
220
+ 🔑 HF Token: Set
221
+ 📈 Experiments: 5
222
+ ```
223
+
224
+ ## 🔄 Updating Configuration
225
+
226
+ You can update configuration at any time:
227
+ 1. Go to "⚙️ Configuration" tab
228
+ 2. Change HF token or dataset repository
229
+ 3. Click "Update Configuration"
230
+ 4. Experiments will reload with new settings
231
+
232
+ ---
233
+
234
+ **🎉 Your Trackio app is now more flexible and user-friendly with direct configuration input!**
CLOUD_DEPLOYMENT_GUIDE.md → docs/CLOUD_DEPLOYMENT_GUIDE.md RENAMED
File without changes
CLOUD_TRAINING_GUIDE.md → docs/CLOUD_TRAINING_GUIDE.md RENAMED
File without changes
DEPLOYMENT_GUIDE.md → docs/DEPLOYMENT_GUIDE.md RENAMED
File without changes
docs/ENVIRONMENT_VARIABLES.md ADDED
@@ -0,0 +1,113 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 🔧 Trackio Environment Variables Reference
2
+
3
+ ## Quick Setup
4
+
5
+ Set these environment variables in your Hugging Face Space:
6
+
7
+ ```bash
8
+ # Required: Your HF token for dataset access
9
+ HF_TOKEN=your_hf_token_here
10
+
11
+ # Optional: Dataset repository to use (defaults to tonic/trackio-experiments)
12
+ TRACKIO_DATASET_REPO=your-username/your-dataset-name
13
+ ```
14
+
15
+ ## Environment Variables
16
+
17
+ | Variable | Required | Default | Description |
18
+ |----------|----------|---------|-------------|
19
+ | `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token for dataset access |
20
+ | `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository to load experiments from |
21
+ | `SPACE_ID` | 🔄 Auto | None | HF Space ID (automatically detected) |
22
+
23
+ ## Configuration Examples
24
+
25
+ ### 1. Default Setup
26
+ ```bash
27
+ HF_TOKEN=your_token_here
28
+ # Uses: tonic/trackio-experiments
29
+ ```
30
+
31
+ ### 2. Personal Dataset
32
+ ```bash
33
+ HF_TOKEN=your_token_here
34
+ TRACKIO_DATASET_REPO=your-username/trackio-experiments
35
+ ```
36
+
37
+ ### 3. Team Dataset
38
+ ```bash
39
+ HF_TOKEN=your_token_here
40
+ TRACKIO_DATASET_REPO=your-org/team-experiments
41
+ ```
42
+
43
+ ### 4. Project-Specific Dataset
44
+ ```bash
45
+ HF_TOKEN=your_token_here
46
+ TRACKIO_DATASET_REPO=your-username/smollm3-experiments
47
+ ```
48
+
49
+ ## How to Set in HF Spaces
50
+
51
+ 1. Go to your Hugging Face Space settings
52
+ 2. Navigate to "Settings" → "Environment variables"
53
+ 3. Add the variables:
54
+ - `HF_TOKEN`: Your HF token
55
+ - `TRACKIO_DATASET_REPO`: Your dataset repository (optional)
56
+
57
+ ## Testing Configuration
58
+
59
+ Run the configuration script to check your setup:
60
+
61
+ ```bash
62
+ python configure_trackio.py
63
+ ```
64
+
65
+ This will:
66
+ - ✅ Show current environment variables
67
+ - 🧪 Test dataset access
68
+ - 📊 Display experiment count
69
+ - 💾 Generate configuration file
70
+
71
+ ## Getting Your HF Token
72
+
73
+ 1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
74
+ 2. Click "New token"
75
+ 3. Give it a name (e.g., "Trackio Access")
76
+ 4. Select "Write" permissions
77
+ 5. Copy the token and set it as `HF_TOKEN`
78
+
79
+ ## Dataset Repository Format
80
+
81
+ The `TRACKIO_DATASET_REPO` should follow this format:
82
+ ```
83
+ username/dataset-name
84
+ ```
85
+
86
+ Examples:
87
+ - `tonic/trackio-experiments`
88
+ - `your-username/my-experiments`
89
+ - `your-org/team-experiments`
90
+
91
+ ## Troubleshooting
92
+
93
+ ### Issue: "HF_TOKEN not found"
94
+ **Solution**: Set your HF token in the Space environment variables
95
+
96
+ ### Issue: "Failed to load dataset"
97
+ **Solutions**:
98
+ 1. Check your token has read access to the dataset
99
+ 2. Verify the dataset repository exists
100
+ 3. Try the backup fallback (automatic)
101
+
102
+ ### Issue: "Failed to save experiments"
103
+ **Solutions**:
104
+ 1. Check your token has write permissions
105
+ 2. Verify the dataset repository exists
106
+ 3. Check network connectivity
107
+
108
+ ## Security Notes
109
+
110
+ - 🔒 Dataset is private by default
111
+ - 🔑 Only accessible with your HF_TOKEN
112
+ - 🛡️ No sensitive data exposed publicly
113
+ - 🔐 Secure storage on HF infrastructure
docs/HF_DATASETS_GUIDE.md ADDED
@@ -0,0 +1,269 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 🚀 Trackio with Hugging Face Datasets - Complete Guide
2
+
3
+ ## Overview
4
+
5
+ This guide explains how to use Hugging Face Datasets for persistent storage of Trackio experiments, providing reliable data persistence across Hugging Face Spaces deployments.
6
+
7
+ ## 🏗️ Architecture
8
+
9
+ ### Why HF Datasets?
10
+
11
+ 1. **Persistent Storage**: Data survives Space restarts and redeployments
12
+ 2. **Version Control**: Automatic versioning of experiment data
13
+ 3. **Access Control**: Private datasets for security
14
+ 4. **Reliability**: HF's infrastructure ensures data availability
15
+ 5. **Scalability**: Handles large amounts of experiment data
16
+
17
+ ### Data Flow
18
+
19
+ ```
20
+ Training Script → Trackio App → HF Dataset → Trackio App → Plots
21
+ ```
22
+
23
+ ## 🚀 Setup Instructions
24
+
25
+ ### 1. Create HF Token
26
+
27
+ 1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
28
+ 2. Create a new token with `write` permissions
29
+ 3. Copy the token for use in your Space
30
+
31
+ ### 2. Set Up Dataset Repository
32
+
33
+ ```bash
34
+ # Run the setup script
35
+ python setup_hf_dataset.py
36
+ ```
37
+
38
+ This will:
39
+ - Create a private dataset: `tonic/trackio-experiments`
40
+ - Add your existing experiments
41
+ - Configure the dataset for Trackio
42
+
43
+ ### 3. Configure Hugging Face Space
44
+
45
+ #### Environment Variables
46
+ Set these in your HF Space settings:
47
+ ```bash
48
+ HF_TOKEN=your_hf_token_here
49
+ TRACKIO_DATASET_REPO=your-username/your-dataset-name
50
+ ```
51
+
52
+ **Environment Variables Explained:**
53
+ - `HF_TOKEN`: Your Hugging Face token (required for dataset access)
54
+ - `TRACKIO_DATASET_REPO`: Dataset repository to use (optional, defaults to `tonic/trackio-experiments`)
55
+
56
+ **Example Configurations:**
57
+ ```bash
58
+ # Use default dataset
59
+ HF_TOKEN=your_token_here
60
+
61
+ # Use personal dataset
62
+ HF_TOKEN=your_token_here
63
+ TRACKIO_DATASET_REPO=your-username/trackio-experiments
64
+
65
+ # Use team dataset
66
+ HF_TOKEN=your_token_here
67
+ TRACKIO_DATASET_REPO=your-org/team-experiments
68
+
69
+ # Use project-specific dataset
70
+ HF_TOKEN=your_token_here
71
+ TRACKIO_DATASET_REPO=your-username/smollm3-experiments
72
+ ```
73
+
74
+ #### Requirements
75
+ Update your `requirements.txt`:
76
+ ```txt
77
+ gradio>=4.0.0
78
+ plotly>=5.0.0
79
+ pandas>=1.5.0
80
+ numpy>=1.24.0
81
+ datasets>=2.14.0
82
+ huggingface-hub>=0.16.0
83
+ requests>=2.31.0
84
+ ```
85
+
86
+ ### 4. Deploy Updated App
87
+
88
+ The updated `app.py` now:
89
+ - Loads experiments from HF Dataset
90
+ - Saves new experiments to the dataset
91
+ - Falls back to backup data if dataset unavailable
92
+ - Provides better error handling
93
+
94
+ ### 5. Configure Environment Variables
95
+
96
+ Use the configuration script to check your setup:
97
+
98
+ ```bash
99
+ python configure_trackio.py
100
+ ```
101
+
102
+ This script will:
103
+ - Show current environment variables
104
+ - Test dataset access
105
+ - Generate configuration file
106
+ - Provide usage examples
107
+
108
+ **Available Environment Variables:**
109
+
110
+ | Variable | Required | Default | Description |
111
+ |----------|----------|---------|-------------|
112
+ | `HF_TOKEN` | Yes | None | Your Hugging Face token |
113
+ | `TRACKIO_DATASET_REPO` | No | `tonic/trackio-experiments` | Dataset repository to use |
114
+ | `SPACE_ID` | Auto | None | HF Space ID (auto-detected) |
115
+
116
+ ## 📊 Dataset Schema
117
+
118
+ The HF Dataset contains these columns:
119
+
120
+ | Column | Type | Description |
121
+ |--------|------|-------------|
122
+ | `experiment_id` | string | Unique experiment identifier |
123
+ | `name` | string | Experiment name |
124
+ | `description` | string | Experiment description |
125
+ | `created_at` | string | ISO timestamp |
126
+ | `status` | string | running/completed/failed |
127
+ | `metrics` | string | JSON array of metric entries |
128
+ | `parameters` | string | JSON object of experiment parameters |
129
+ | `artifacts` | string | JSON array of artifacts |
130
+ | `logs` | string | JSON array of log entries |
131
+ | `last_updated` | string | ISO timestamp of last update |
132
+
133
+ ## 🔧 Technical Details
134
+
135
+ ### Loading Experiments
136
+
137
+ ```python
138
+ from datasets import load_dataset
139
+
140
+ # Load from HF Dataset
141
+ dataset = load_dataset("tonic/trackio-experiments", token=HF_TOKEN)
142
+
143
+ # Convert to experiments dict
144
+ for row in dataset['train']:
145
+ experiment = {
146
+ 'id': row['experiment_id'],
147
+ 'metrics': json.loads(row['metrics']),
148
+ 'parameters': json.loads(row['parameters']),
149
+ # ... other fields
150
+ }
151
+ ```
152
+
153
+ ### Saving Experiments
154
+
155
+ ```python
156
+ from datasets import Dataset
157
+ from huggingface_hub import HfApi
158
+
159
+ # Convert experiments to dataset format
160
+ dataset_data = []
161
+ for exp_id, exp_data in experiments.items():
162
+ dataset_data.append({
163
+ 'experiment_id': exp_id,
164
+ 'metrics': json.dumps(exp_data['metrics']),
165
+ 'parameters': json.dumps(exp_data['parameters']),
166
+ # ... other fields
167
+ })
168
+
169
+ # Push to HF Hub
170
+ dataset = Dataset.from_list(dataset_data)
171
+ dataset.push_to_hub("tonic/trackio-experiments", token=HF_TOKEN, private=True)
172
+ ```
173
+
174
+ ## 📈 Your Current Experiments
175
+
176
+ ### Available Experiments
177
+
178
+ 1. **`exp_20250720_130853`** (petite-elle-l-aime-3)
179
+ - 4 metric entries (steps 25, 50, 75, 100)
180
+ - Loss decreasing: 1.1659 → 1.1528
181
+ - Good convergence pattern
182
+
183
+ 2. **`exp_20250720_134319`** (petite-elle-l-aime-3-1)
184
+ - 2 metric entries (step 25)
185
+ - Loss: 1.166
186
+ - GPU memory tracking
187
+
188
+ ### Metrics Available for Plotting
189
+
190
+ - `loss` - Training loss curve
191
+ - `learning_rate` - Learning rate schedule
192
+ - `mean_token_accuracy` - Token-level accuracy
193
+ - `grad_norm` - Gradient norm
194
+ - `num_tokens` - Tokens processed
195
+ - `epoch` - Training epoch
196
+ - `gpu_0_memory_allocated` - GPU memory usage
197
+ - `cpu_percent` - CPU usage
198
+ - `memory_percent` - System memory
199
+
200
+ ## 🎯 Usage Instructions
201
+
202
+ ### 1. View Experiments
203
+ - Go to "View Experiments" tab
204
+ - Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
205
+ - Click "View Experiment"
206
+
207
+ ### 2. Create Plots
208
+ - Go to "Visualizations" tab
209
+ - Enter experiment ID
210
+ - Select metric to plot
211
+ - Click "Create Plot"
212
+
213
+ ### 3. Compare Experiments
214
+ - Use "Experiment Comparison" feature
215
+ - Enter: `exp_20250720_130853,exp_20250720_134319`
216
+ - Compare loss curves
217
+
218
+ ## 🔍 Troubleshooting
219
+
220
+ ### Issue: "No metrics data available"
221
+ **Solutions**:
222
+ 1. Check HF_TOKEN is set correctly
223
+ 2. Verify dataset repository exists
224
+ 3. Check network connectivity to HF Hub
225
+
226
+ ### Issue: "Failed to load from dataset"
227
+ **Solutions**:
228
+ 1. App falls back to backup data automatically
229
+ 2. Check dataset permissions
230
+ 3. Verify token has read access
231
+
232
+ ### Issue: "Failed to save experiments"
233
+ **Solutions**:
234
+ 1. Check token has write permissions
235
+ 2. Verify dataset repository exists
236
+ 3. Check network connectivity
237
+
238
+ ## 🚀 Benefits of This Approach
239
+
240
+ ### ✅ Advantages
241
+ - **Persistent**: Data survives Space restarts
242
+ - **Reliable**: HF's infrastructure ensures availability
243
+ - **Secure**: Private datasets protect your data
244
+ - **Scalable**: Handles large amounts of experiment data
245
+ - **Versioned**: Automatic versioning of experiment data
246
+
247
+ ### 🔄 Fallback Strategy
248
+ 1. **Primary**: Load from HF Dataset
249
+ 2. **Secondary**: Use backup data (your existing experiments)
250
+ 3. **Tertiary**: Create new experiments locally
251
+
252
+ ## 📋 Next Steps
253
+
254
+ 1. **Set HF_TOKEN**: Add your token to Space environment
255
+ 2. **Run Setup**: Execute `setup_hf_dataset.py`
256
+ 3. **Deploy App**: Push updated `app.py` to your Space
257
+ 4. **Test Plots**: Verify experiments load and plots work
258
+ 5. **Monitor Training**: New experiments will be saved to dataset
259
+
260
+ ## 🔐 Security Notes
261
+
262
+ - Dataset is **private** by default
263
+ - Only accessible with your HF_TOKEN
264
+ - Experiment data is stored securely on HF infrastructure
265
+ - No sensitive data is exposed publicly
266
+
267
+ ---
268
+
269
+ **Your experiments are now configured for reliable persistence using Hugging Face Datasets!** 🎉
docs/HF_SPACES_GUIDE.md ADDED
@@ -0,0 +1,163 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 🚀 Trackio on Hugging Face Spaces - Complete Guide
2
+
3
+ ## Overview
4
+
5
+ This guide explains how to properly deploy and use Trackio on Hugging Face Spaces, addressing the unique challenges of ephemeral storage and data persistence.
6
+
7
+ ## 🏗️ Hugging Face Spaces Architecture
8
+
9
+ ### Key Challenges
10
+
11
+ 1. **Ephemeral Storage**: File system gets reset between deployments
12
+ 2. **No Persistent Storage**: Files written during runtime don't persist
13
+ 3. **Multiple Instances**: Training and monitoring might run in different environments
14
+ 4. **Limited File System**: Restricted write permissions in certain directories
15
+
16
+ ### How Trackio Handles HF Spaces
17
+
18
+ The updated Trackio app now includes:
19
+
20
+ - **Automatic HF Spaces Detection**: Detects when running on HF Spaces
21
+ - **Persistent Path Selection**: Uses `/tmp/` for better persistence
22
+ - **Backup Recovery**: Automatically recovers experiments from backup data
23
+ - **Fallback Storage**: Multiple storage locations for redundancy
24
+
25
+ ## 📊 Your Current Experiments
26
+
27
+ Based on your logs, you have these experiments available:
28
+
29
+ ### Experiment 1: `exp_20250720_130853`
30
+ - **Name**: petite-elle-l-aime-3
31
+ - **Status**: Running
32
+ - **Metrics**: 4 entries (steps 25, 50, 75, 100)
33
+ - **Key Metrics**: Loss decreasing from 1.1659 to 1.1528
34
+
35
+ ### Experiment 2: `exp_20250720_134319`
36
+ - **Name**: petite-elle-l-aime-3-1
37
+ - **Status**: Running
38
+ - **Metrics**: 2 entries (step 25)
39
+ - **Key Metrics**: Loss 1.166, GPU memory usage
40
+
41
+ ## 🎯 How to Use Your Experiments
42
+
43
+ ### 1. View Experiments
44
+ - Go to the "View Experiments" tab
45
+ - Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
46
+ - Click "View Experiment" to see details
47
+
48
+ ### 2. Create Plots
49
+ - Go to the "Visualizations" tab
50
+ - Enter experiment ID
51
+ - Select metric to plot:
52
+ - `loss` - Training loss curve
53
+ - `learning_rate` - Learning rate schedule
54
+ - `mean_token_accuracy` - Token accuracy
55
+ - `grad_norm` - Gradient norm
56
+ - `gpu_0_memory_allocated` - GPU memory usage
57
+
58
+ ### 3. Compare Experiments
59
+ - Use the "Experiment Comparison" feature
60
+ - Enter: `exp_20250720_130853,exp_20250720_134319`
61
+ - Compare loss curves between experiments
62
+
63
+ ## 🔧 Technical Details
64
+
65
+ ### Data Persistence Strategy
66
+
67
+ ```python
68
+ # HF Spaces detection
69
+ if os.environ.get('SPACE_ID'):
70
+ data_file = "/tmp/trackio_experiments.json"
71
+ else:
72
+ data_file = "trackio_experiments.json"
73
+ ```
74
+
75
+ ### Backup Recovery
76
+
77
+ The app automatically recovers your experiments from backup data when:
78
+ - Running on HF Spaces
79
+ - No existing experiments found
80
+ - Data file is missing or empty
81
+
82
+ ### Storage Locations
83
+
84
+ 1. **Primary**: `/tmp/trackio_experiments.json`
85
+ 2. **Backup**: `/tmp/trackio_backup.json`
86
+ 3. **Fallback**: Local directory (for development)
87
+
88
+ ## 🚀 Deployment Best Practices
89
+
90
+ ### 1. Environment Variables
91
+ ```bash
92
+ # Set in HF Spaces environment
93
+ SPACE_ID=your-space-id
94
+ TRACKIO_URL=https://your-space.hf.space
95
+ ```
96
+
97
+ ### 2. File Structure
98
+ ```
99
+ your-space/
100
+ ├── app.py # Main Trackio app
101
+ ├── requirements.txt # Dependencies
102
+ ├── README.md # Space description
103
+ └── .gitignore # Ignore temporary files
104
+ ```
105
+
106
+ ### 3. Requirements
107
+ ```txt
108
+ gradio>=4.0.0
109
+ plotly>=5.0.0
110
+ pandas>=1.5.0
111
+ numpy>=1.24.0
112
+ ```
113
+
114
+ ## 📈 Monitoring Your Training
115
+
116
+ ### Real-time Metrics
117
+ Your experiments show:
118
+ - **Loss**: Decreasing from 1.1659 to 1.1528 (good convergence)
119
+ - **Learning Rate**: Properly scheduled from 7e-08 to 2.8875e-07
120
+ - **Token Accuracy**: Around 75-76% (reasonable for early training)
121
+ - **GPU Memory**: ~17GB allocated, 75GB reserved
122
+
123
+ ### Expected Behavior
124
+ - Loss should continue decreasing
125
+ - Learning rate will follow cosine schedule
126
+ - Token accuracy should improve over time
127
+ - GPU memory usage should remain stable
128
+
129
+ ## 🔍 Troubleshooting
130
+
131
+ ### Issue: "No metrics data available"
132
+ **Solution**: The app now automatically recovers experiments from backup
133
+
134
+ ### Issue: Plots not showing
135
+ **Solution**:
136
+ 1. Check experiment ID is correct
137
+ 2. Try different metrics (loss, learning_rate, etc.)
138
+ 3. Refresh the page
139
+
140
+ ### Issue: Data not persisting
141
+ **Solution**:
142
+ 1. App now uses `/tmp/` for better persistence
143
+ 2. Backup recovery ensures data availability
144
+ 3. Multiple storage locations provide redundancy
145
+
146
+ ## 🎯 Next Steps
147
+
148
+ 1. **Deploy Updated App**: Push the updated `app.py` to your HF Space
149
+ 2. **Test Plots**: Try plotting your experiments
150
+ 3. **Monitor Training**: Continue monitoring your training runs
151
+ 4. **Add New Experiments**: Create new experiments as needed
152
+
153
+ ## 📞 Support
154
+
155
+ If you encounter issues:
156
+ 1. Check the logs in your HF Space
157
+ 2. Verify experiment IDs are correct
158
+ 3. Try the backup recovery feature
159
+ 4. Contact for additional support
160
+
161
+ ---
162
+
163
+ **Your experiments are now properly configured and should display correctly in the Trackio interface!** 🎉
docs/MONITORING_IMPROVEMENTS_SUMMARY.md ADDED
@@ -0,0 +1,191 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 🚀 Monitoring Improvements Summary
2
+
3
+ ## Overview
4
+
5
+ The monitoring system has been significantly enhanced to support **Hugging Face Datasets** for persistent experiment storage, making it ideal for deployment on Hugging Face Spaces and other cloud environments.
6
+
7
+ ## ✅ Key Improvements Made
8
+
9
+ ### 1. **Enhanced `monitoring.py`**
10
+ - ✅ **HF Datasets Integration**: Added support for saving experiments to HF Datasets repositories
11
+ - ✅ **Environment Variables**: Automatic detection of `HF_TOKEN` and `TRACKIO_DATASET_REPO`
12
+ - ✅ **Fallback Support**: Graceful degradation if HF Datasets unavailable
13
+ - ✅ **Dual Storage**: Experiments saved to both Trackio and HF Datasets
14
+ - ✅ **Periodic Saving**: Metrics saved to HF Dataset every 10 steps
15
+ - ✅ **Error Handling**: Robust error logging and recovery
16
+
17
+ ### 2. **Updated `train.py`**
18
+ - ✅ **Monitoring Integration**: Automatic monitoring setup in training scripts
19
+ - ✅ **Configuration Logging**: Experiment configuration logged at start
20
+ - ✅ **Training Callbacks**: Monitoring callbacks added to trainer
21
+ - ✅ **Summary Logging**: Training summaries logged at completion
22
+ - ✅ **Error Logging**: Errors logged to monitoring system
23
+ - ✅ **Cleanup**: Proper monitoring session cleanup
24
+
25
+ ### 3. **Configuration Files Updated**
26
+ - ✅ **HF Datasets Config**: Added `hf_token` and `dataset_repo` parameters
27
+ - ✅ **Environment Support**: Environment variables automatically detected
28
+ - ✅ **Backward Compatible**: Existing configurations still work
29
+
30
+ ### 4. **New Utility Scripts**
31
+ - ✅ **`configure_trackio.py`**: Configuration testing and setup
32
+ - ✅ **`integrate_monitoring.py`**: Automated integration script
33
+ - ✅ **`test_monitoring_integration.py`**: Comprehensive testing
34
+ - ✅ **`setup_hf_dataset.py`**: Dataset repository setup
35
+
36
+ ### 5. **Documentation**
37
+ - ✅ **`MONITORING_INTEGRATION_GUIDE.md`**: Comprehensive usage guide
38
+ - ✅ **`ENVIRONMENT_VARIABLES.md`**: Environment variable reference
39
+ - ✅ **`HF_DATASETS_GUIDE.md`**: Detailed HF Datasets guide
40
+
41
+ ## 🔧 Environment Variables
42
+
43
+ | Variable | Required | Default | Description |
44
+ |----------|----------|---------|-------------|
45
+ | `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token |
46
+ | `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository |
47
+ | `TRACKIO_URL` | ❌ No | None | Trackio server URL |
48
+ | `TRACKIO_TOKEN` | ❌ No | None | Trackio authentication token |
49
+
50
+ ## 📊 What Gets Monitored
51
+
52
+ ### **Training Metrics**
53
+ - Loss values (training and validation)
54
+ - Learning rate
55
+ - Gradient norms
56
+ - Training steps and epochs
57
+
58
+ ### **System Metrics**
59
+ - GPU memory usage
60
+ - GPU utilization
61
+ - CPU usage
62
+ - Memory usage
63
+
64
+ ### **Experiment Data**
65
+ - Configuration parameters
66
+ - Model checkpoints
67
+ - Evaluation results
68
+ - Training summaries
69
+
70
+ ### **Artifacts**
71
+ - Configuration files
72
+ - Training logs
73
+ - Evaluation results
74
+ - Model checkpoints
75
+
76
+ ## 🚀 Usage Examples
77
+
78
+ ### **Basic Training**
79
+ ```bash
80
+ # Set environment variables
81
+ export HF_TOKEN=your_token_here
82
+ export TRACKIO_DATASET_REPO=your-username/experiments
83
+
84
+ # Run training with monitoring
85
+ python train.py config/train_smollm3_openhermes_fr.py
86
+ ```
87
+
88
+ ### **Advanced Configuration**
89
+ ```bash
90
+ # Train with custom settings
91
+ python train.py config/train_smollm3_openhermes_fr.py \
92
+ --experiment_name "smollm3_french_v2" \
93
+ --hf_token your_token_here \
94
+ --dataset_repo your-username/french-experiments
95
+ ```
96
+
97
+ ### **Testing Setup**
98
+ ```bash
99
+ # Test configuration
100
+ python configure_trackio.py
101
+
102
+ # Test monitoring integration
103
+ python test_monitoring_integration.py
104
+
105
+ # Test dataset access
106
+ python test_hf_datasets.py
107
+ ```
108
+
109
+ ## 📈 Benefits
110
+
111
+ ### **For HF Spaces Deployment**
112
+ - ✅ **Persistent Storage**: Data survives Space restarts
113
+ - ✅ **No Local Storage**: No dependency on ephemeral storage
114
+ - ✅ **Scalable**: Works with any dataset size
115
+ - ✅ **Secure**: Private dataset storage
116
+
117
+ ### **For Experiment Management**
118
+ - ✅ **Centralized**: All experiments in one place
119
+ - ✅ **Searchable**: Easy to find specific experiments
120
+ - ✅ **Versioned**: Dataset versioning for experiments
121
+ - ✅ **Collaborative**: Share experiments with team
122
+
123
+ ### **For Development**
124
+ - ✅ **Flexible**: Easy to switch between datasets
125
+ - ✅ **Configurable**: Environment-based configuration
126
+ - ✅ **Robust**: Fallback mechanisms
127
+ - ✅ **Debuggable**: Comprehensive logging
128
+
129
+ ## 🧪 Testing Results
130
+
131
+ All monitoring integration tests passed:
132
+ - ✅ Module Import
133
+ - ✅ Monitor Creation
134
+ - ✅ Config Creation
135
+ - ✅ Metrics Logging
136
+ - ✅ Configuration Logging
137
+ - ✅ System Metrics
138
+ - ✅ Training Summary
139
+ - ✅ Callback Creation
140
+
141
+ ## 📋 Files Modified/Created
142
+
143
+ ### **Core Files**
144
+ - `monitoring.py` - Enhanced with HF Datasets support
145
+ - `train.py` - Updated with monitoring integration
146
+ - `requirements/requirements_core.txt` - Added monitoring dependencies
147
+ - `requirements_space.txt` - Updated for HF Spaces
148
+
149
+ ### **Configuration Files**
150
+ - `config/train_smollm3.py` - Added HF Datasets config
151
+ - `config/train_smollm3_openhermes_fr.py` - Added HF Datasets config
152
+ - `config/train_smollm3_openhermes_fr_a100_balanced.py` - Added HF Datasets config
153
+ - `config/train_smollm3_openhermes_fr_a100_large.py` - Added HF Datasets config
154
+ - `config/train_smollm3_openhermes_fr_a100_max_performance.py` - Added HF Datasets config
155
+ - `config/train_smollm3_openhermes_fr_a100_multiple_passes.py` - Added HF Datasets config
156
+
157
+ ### **New Utility Scripts**
158
+ - `configure_trackio.py` - Configuration testing
159
+ - `integrate_monitoring.py` - Automated integration
160
+ - `test_monitoring_integration.py` - Comprehensive testing
161
+ - `setup_hf_dataset.py` - Dataset setup
162
+
163
+ ### **Documentation**
164
+ - `MONITORING_INTEGRATION_GUIDE.md` - Usage guide
165
+ - `ENVIRONMENT_VARIABLES.md` - Environment reference
166
+ - `HF_DATASETS_GUIDE.md` - HF Datasets guide
167
+ - `MONITORING_IMPROVEMENTS_SUMMARY.md` - This summary
168
+
169
+ ## 🎯 Next Steps
170
+
171
+ 1. **Set up your HF token and dataset repository**
172
+ 2. **Test the configuration with `python configure_trackio.py`**
173
+ 3. **Run a training experiment to verify full functionality**
174
+ 4. **Check your HF Dataset repository for experiment data**
175
+ 5. **View results in your Trackio interface**
176
+
177
+ ## 🔍 Troubleshooting
178
+
179
+ ### **Common Issues**
180
+ - **HF_TOKEN not set**: Set your Hugging Face token
181
+ - **Dataset access failed**: Check token permissions and repository existence
182
+ - **Monitoring not working**: Run `python test_monitoring_integration.py` to diagnose
183
+
184
+ ### **Getting Help**
185
+ - Check the comprehensive guides in the documentation files
186
+ - Run the test scripts to verify your setup
187
+ - Check logs for specific error messages
188
+
189
+ ---
190
+
191
+ **🎉 The monitoring system is now ready for production use with persistent HF Datasets storage!**
docs/MONITORING_INTEGRATION_GUIDE.md ADDED
@@ -0,0 +1,245 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 🔧 Improved Monitoring Integration Guide
2
+
3
+ ## Overview
4
+
5
+ The monitoring system has been enhanced to support **Hugging Face Datasets** for persistent experiment storage, making it ideal for deployment on Hugging Face Spaces and other cloud environments.
6
+
7
+ ## 🚀 Key Improvements
8
+
9
+ ### 1. **HF Datasets Integration**
10
+ - ✅ **Persistent Storage**: Experiments are saved to HF Datasets repositories
11
+ - ✅ **Environment Variables**: Configurable via `HF_TOKEN` and `TRACKIO_DATASET_REPO`
12
+ - ✅ **Fallback Support**: Graceful degradation if HF Datasets unavailable
13
+ - ✅ **Automatic Backup**: Local files as backup
14
+
15
+ ### 2. **Enhanced Monitoring Features**
16
+ - 📊 **Real-time Metrics**: Training metrics logged to both Trackio and HF Datasets
17
+ - 🔧 **System Metrics**: GPU memory, CPU usage, and system performance
18
+ - 📈 **Training Summaries**: Comprehensive experiment summaries
19
+ - 🛡️ **Error Handling**: Robust error logging and recovery
20
+
21
+ ### 3. **Easy Integration**
22
+ - 🔌 **Automatic Setup**: Environment variables automatically detected
23
+ - 📝 **Configuration**: Simple setup with environment variables
24
+ - 🔄 **Backward Compatible**: Works with existing Trackio setup
25
+
26
+ ## 📋 Environment Variables
27
+
28
+ | Variable | Required | Default | Description |
29
+ |----------|----------|---------|-------------|
30
+ | `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token |
31
+ | `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository |
32
+ | `TRACKIO_URL` | ❌ No | None | Trackio server URL |
33
+ | `TRACKIO_TOKEN` | ❌ No | None | Trackio authentication token |
34
+
35
+ ## 🛠️ Setup Instructions
36
+
37
+ ### 1. **Get Your HF Token**
38
+ ```bash
39
+ # Go to https://huggingface.co/settings/tokens
40
+ # Create a new token with "Write" permissions
41
+ # Copy the token
42
+ ```
43
+
44
+ ### 2. **Set Environment Variables**
45
+ ```bash
46
+ # For HF Spaces, add these to your Space settings:
47
+ HF_TOKEN=your_hf_token_here
48
+ TRACKIO_DATASET_REPO=your-username/your-dataset-name
49
+
50
+ # For local development:
51
+ export HF_TOKEN=your_hf_token_here
52
+ export TRACKIO_DATASET_REPO=your-username/your-dataset-name
53
+ ```
54
+
55
+ ### 3. **Create Dataset Repository**
56
+ ```bash
57
+ # Run the setup script
58
+ python setup_hf_dataset.py
59
+
60
+ # Or manually create a dataset on HF Hub
61
+ # Go to https://huggingface.co/datasets
62
+ # Create a new dataset repository
63
+ ```
64
+
65
+ ### 4. **Test Configuration**
66
+ ```bash
67
+ # Test your setup
68
+ python configure_trackio.py
69
+
70
+ # Test dataset access
71
+ python test_hf_datasets.py
72
+ ```
73
+
74
+ ## 🚀 Usage Examples
75
+
76
+ ### **Basic Training with Monitoring**
77
+ ```bash
78
+ # Train with default monitoring
79
+ python train.py config/train_smollm3_openhermes_fr.py
80
+
81
+ # Train with custom dataset repository
82
+ TRACKIO_DATASET_REPO=your-username/smollm3-experiments python train.py config/train_smollm3_openhermes_fr.py
83
+ ```
84
+
85
+ ### **Advanced Training Configuration**
86
+ ```bash
87
+ # Train with custom experiment name
88
+ python train.py config/train_smollm3_openhermes_fr.py \
89
+ --experiment_name "smollm3_french_tuning_v2" \
90
+ --hf_token your_token_here \
91
+ --dataset_repo your-username/french-experiments
92
+ ```
93
+
94
+ ### **Training Scripts with Monitoring**
95
+ ```bash
96
+ # All training scripts now support monitoring:
97
+ python train.py config/train_smollm3_openhermes_fr_a100_balanced.py
98
+ python train.py config/train_smollm3_openhermes_fr_a100_large.py
99
+ python train.py config/train_smollm3_openhermes_fr_a100_max_performance.py
100
+ python train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
101
+ ```
102
+
103
+ ## 📊 What Gets Monitored
104
+
105
+ ### **Training Metrics**
106
+ - Loss values (training and validation)
107
+ - Learning rate
108
+ - Gradient norms
109
+ - Training steps and epochs
110
+
111
+ ### **System Metrics**
112
+ - GPU memory usage
113
+ - GPU utilization
114
+ - CPU usage
115
+ - Memory usage
116
+
117
+ ### **Experiment Data**
118
+ - Configuration parameters
119
+ - Model checkpoints
120
+ - Evaluation results
121
+ - Training summaries
122
+
123
+ ### **Artifacts**
124
+ - Configuration files
125
+ - Training logs
126
+ - Evaluation results
127
+ - Model checkpoints
128
+
129
+ ## 🔍 Viewing Results
130
+
131
+ ### **1. Trackio Interface**
132
+ - Visit your Trackio Space
133
+ - Navigate to "Experiments" tab
134
+ - View real-time metrics and plots
135
+
136
+ ### **2. HF Dataset Repository**
137
+ - Go to your dataset repository on HF Hub
138
+ - Browse experiment data
139
+ - Download experiment files
140
+
141
+ ### **3. Local Files**
142
+ - Check local backup files
143
+ - Review training logs
144
+ - Examine configuration files
145
+
146
+ ## 🛠️ Configuration Examples
147
+
148
+ ### **Default Setup**
149
+ ```python
150
+ # Uses default dataset: tonic/trackio-experiments
151
+ # Requires only HF_TOKEN
152
+ ```
153
+
154
+ ### **Personal Dataset**
155
+ ```bash
156
+ export HF_TOKEN=your_token_here
157
+ export TRACKIO_DATASET_REPO=your-username/trackio-experiments
158
+ ```
159
+
160
+ ### **Team Dataset**
161
+ ```bash
162
+ export HF_TOKEN=your_token_here
163
+ export TRACKIO_DATASET_REPO=your-org/team-experiments
164
+ ```
165
+
166
+ ### **Project-Specific Dataset**
167
+ ```bash
168
+ export HF_TOKEN=your_token_here
169
+ export TRACKIO_DATASET_REPO=your-username/smollm3-experiments
170
+ ```
171
+
172
+ ## 🔧 Troubleshooting
173
+
174
+ ### **Issue: "HF_TOKEN not found"**
175
+ ```bash
176
+ # Solution: Set your HF token
177
+ export HF_TOKEN=your_token_here
178
+ # Or add to HF Space environment variables
179
+ ```
180
+
181
+ ### **Issue: "Failed to load dataset"**
182
+ ```bash
183
+ # Solutions:
184
+ # 1. Check token has read access
185
+ # 2. Verify dataset repository exists
186
+ # 3. Run setup script: python setup_hf_dataset.py
187
+ ```
188
+
189
+ ### **Issue: "Failed to save experiments"**
190
+ ```bash
191
+ # Solutions:
192
+ # 1. Check token has write permissions
193
+ # 2. Verify dataset repository exists
194
+ # 3. Check network connectivity
195
+ ```
196
+
197
+ ### **Issue: "Monitoring not working"**
198
+ ```bash
199
+ # Solutions:
200
+ # 1. Check environment variables
201
+ # 2. Run configuration test: python configure_trackio.py
202
+ # 3. Check logs for specific errors
203
+ ```
204
+
205
+ ## 📈 Benefits
206
+
207
+ ### **For HF Spaces Deployment**
208
+ - ✅ **Persistent Storage**: Data survives Space restarts
209
+ - ✅ **No Local Storage**: No dependency on ephemeral storage
210
+ - ✅ **Scalable**: Works with any dataset size
211
+ - ✅ **Secure**: Private dataset storage
212
+
213
+ ### **For Experiment Management**
214
+ - ✅ **Centralized**: All experiments in one place
215
+ - ✅ **Searchable**: Easy to find specific experiments
216
+ - ✅ **Versioned**: Dataset versioning for experiments
217
+ - ✅ **Collaborative**: Share experiments with team
218
+
219
+ ### **For Development**
220
+ - ✅ **Flexible**: Easy to switch between datasets
221
+ - ✅ **Configurable**: Environment-based configuration
222
+ - ✅ **Robust**: Fallback mechanisms
223
+ - ✅ **Debuggable**: Comprehensive logging
224
+
225
+ ## 🎯 Next Steps
226
+
227
+ 1. **Set up your HF token and dataset repository**
228
+ 2. **Test the configuration with `python configure_trackio.py`**
229
+ 3. **Run a training experiment to verify monitoring**
230
+ 4. **Check your HF Dataset repository for experiment data**
231
+ 5. **View results in your Trackio interface**
232
+
233
+ ## 📚 Related Files
234
+
235
+ - `monitoring.py` - Enhanced monitoring with HF Datasets support
236
+ - `train.py` - Updated training script with monitoring integration
237
+ - `configure_trackio.py` - Configuration and testing script
238
+ - `setup_hf_dataset.py` - Dataset repository setup
239
+ - `test_hf_datasets.py` - Dataset access testing
240
+ - `ENVIRONMENT_VARIABLES.md` - Environment variable reference
241
+ - `HF_DATASETS_GUIDE.md` - Detailed HF Datasets guide
242
+
243
+ ---
244
+
245
+ **🎉 Your experiments are now persistently stored and easily accessible!**
NO_THINK_TAG_GUIDE.md → docs/NO_THINK_TAG_GUIDE.md RENAMED
File without changes
PUSH_GUIDE.md → docs/PUSH_GUIDE.md RENAMED
File without changes
docs/PUSH_SCRIPT_GUIDE.md ADDED
@@ -0,0 +1,267 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 🚀 Push to Hugging Face Script Guide
2
+
3
+ ## Overview
4
+
5
+ The `push_to_huggingface.py` script has been enhanced to integrate with **HF Datasets** for experiment tracking and provides complete model deployment with persistent experiment storage.
6
+
7
+ ## 🚀 Key Improvements
8
+
9
+ ### **1. HF Datasets Integration**
10
+ - ✅ **Dataset Repository Support**: Configurable dataset repository for experiment storage
11
+ - ✅ **Environment Variables**: Automatic detection of `HF_TOKEN` and `TRACKIO_DATASET_REPO`
12
+ - ✅ **Enhanced Logging**: Logs push actions to both Trackio and HF Datasets
13
+ - ✅ **Model Card Integration**: Includes dataset repository information in model cards
14
+
15
+ ### **2. Enhanced Configuration**
16
+ - ✅ **Flexible Token Input**: Multiple ways to provide HF token
17
+ - ✅ **Dataset Repository Tracking**: Links models to their experiment datasets
18
+ - ✅ **Environment Variable Support**: Fallback to environment variables
19
+ - ✅ **Command Line Arguments**: New arguments for HF Datasets integration
20
+
21
+ ### **3. Improved Model Cards**
22
+ - ✅ **Dataset Repository Info**: Shows which dataset contains experiment data
23
+ - ✅ **Experiment Tracking Section**: Explains how to access training data
24
+ - ✅ **Enhanced Documentation**: Better model cards with experiment links
25
+
26
+ ## 📋 Usage Examples
27
+
28
+ ### **Basic Usage**
29
+ ```bash
30
+ # Push model with default settings
31
+ python push_to_huggingface.py /path/to/model username/repo-name
32
+ ```
33
+
34
+ ### **With HF Datasets Integration**
35
+ ```bash
36
+ # Push model with custom dataset repository
37
+ python push_to_huggingface.py /path/to/model username/repo-name \
38
+ --dataset-repo username/experiments
39
+ ```
40
+
41
+ ### **With Custom Token**
42
+ ```bash
43
+ # Push model with custom HF token
44
+ python push_to_huggingface.py /path/to/model username/repo-name \
45
+ --hf-token your_token_here
46
+ ```
47
+
48
+ ### **Complete Example**
49
+ ```bash
50
+ # Push model with all options
51
+ python push_to_huggingface.py /path/to/model username/repo-name \
52
+ --dataset-repo username/experiments \
53
+ --hf-token your_token_here \
54
+ --private \
55
+ --experiment-name "smollm3_finetune_v2"
56
+ ```
57
+
58
+ ## 🔧 Command Line Arguments
59
+
60
+ | Argument | Required | Default | Description |
61
+ |----------|----------|---------|-------------|
62
+ | `model_path` | ✅ Yes | None | Path to trained model directory |
63
+ | `repo_name` | ✅ Yes | None | HF repository name (username/repo-name) |
64
+ | `--token` | ❌ No | `HF_TOKEN` env | Hugging Face token |
65
+ | `--hf-token` | ❌ No | `HF_TOKEN` env | HF token (alternative to --token) |
66
+ | `--private` | ❌ No | False | Make repository private |
67
+ | `--trackio-url` | ❌ No | None | Trackio Space URL for logging |
68
+ | `--experiment-name` | ❌ No | None | Experiment name for Trackio |
69
+ | `--dataset-repo` | ❌ No | `TRACKIO_DATASET_REPO` env | HF Dataset repository |
70
+
71
+ ## 🛠️ Configuration Methods
72
+
73
+ ### **Method 1: Command Line Arguments**
74
+ ```bash
75
+ python push_to_huggingface.py model_path repo_name \
76
+ --dataset-repo username/experiments \
77
+ --hf-token your_token_here
78
+ ```
79
+
80
+ ### **Method 2: Environment Variables**
81
+ ```bash
82
+ export HF_TOKEN=your_token_here
83
+ export TRACKIO_DATASET_REPO=username/experiments
84
+ python push_to_huggingface.py model_path repo_name
85
+ ```
86
+
87
+ ### **Method 3: Hybrid Approach**
88
+ ```bash
89
+ # Set defaults via environment variables
90
+ export HF_TOKEN=your_token_here
91
+ export TRACKIO_DATASET_REPO=username/experiments
92
+
93
+ # Override specific values via command line
94
+ python push_to_huggingface.py model_path repo_name \
95
+ --dataset-repo username/specific-experiments
96
+ ```
97
+
98
+ ## 📊 What Gets Pushed
99
+
100
+ ### **Model Files**
101
+ - ✅ **Model Weights**: `pytorch_model.bin`
102
+ - ✅ **Configuration**: `config.json`
103
+ - ✅ **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`
104
+ - ✅ **All Other Files**: Any additional files in model directory
105
+
106
+ ### **Documentation**
107
+ - ✅ **Model Card**: Comprehensive README.md with model information
108
+ - ✅ **Training Configuration**: JSON configuration used for training
109
+ - ✅ **Training Results**: JSON results and metrics
110
+ - ✅ **Training Logs**: Text logs from training process
111
+
112
+ ### **Experiment Data**
113
+ - ✅ **Dataset Repository**: Links to HF Dataset containing experiment data
114
+ - ✅ **Training Metrics**: All training metrics stored in dataset
115
+ - ✅ **Configuration**: Training configuration stored in dataset
116
+ - ✅ **Artifacts**: Training artifacts and logs
117
+
118
+ ## 🔍 Enhanced Model Cards
119
+
120
+ The improved script creates enhanced model cards that include:
121
+
122
+ ### **Model Information**
123
+ - Base model and architecture
124
+ - Training date and model size
125
+ - **Dataset repository** for experiment data
126
+
127
+ ### **Training Configuration**
128
+ - Complete training parameters
129
+ - Hardware information
130
+ - Training duration and steps
131
+
132
+ ### **Experiment Tracking**
133
+ - Links to HF Dataset repository
134
+ - Instructions for accessing experiment data
135
+ - Training metrics and results
136
+
137
+ ### **Usage Examples**
138
+ - Code examples for loading and using the model
139
+ - Generation examples
140
+ - Performance information
141
+
142
+ ## 📈 Logging Integration
143
+
144
+ ### **Trackio Logging**
145
+ - ✅ **Push Actions**: Logs model push events
146
+ - ✅ **Model Information**: Repository name, size, configuration
147
+ - ✅ **Training Data**: Links to experiment dataset
148
+
149
+ ### **HF Datasets Logging**
150
+ - ✅ **Experiment Summary**: Final training summary
151
+ - ✅ **Push Metadata**: Model repository and push date
152
+ - ✅ **Configuration**: Complete training configuration
153
+
154
+ ### **Dual Storage**
155
+ - ✅ **Trackio**: Real-time monitoring and visualization
156
+ - ✅ **HF Datasets**: Persistent experiment storage
157
+ - ✅ **Synchronized**: Both systems updated together
158
+
159
+ ## 🚨 Troubleshooting
160
+
161
+ ### **Issue: "Missing required files"**
162
+ **Solutions**:
163
+ 1. Check model directory contains required files
164
+ 2. Ensure model was saved correctly during training
165
+ 3. Verify file permissions
166
+
167
+ ### **Issue: "Failed to create repository"**
168
+ **Solutions**:
169
+ 1. Check HF token has write permissions
170
+ 2. Verify repository name format: `username/repo-name`
171
+ 3. Ensure repository doesn't already exist (or use `--private`)
172
+
173
+ ### **Issue: "Failed to upload files"**
174
+ **Solutions**:
175
+ 1. Check network connectivity
176
+ 2. Verify HF token is valid
177
+ 3. Ensure repository was created successfully
178
+
179
+ ### **Issue: "Dataset repository not found"**
180
+ **Solutions**:
181
+ 1. Check dataset repository exists
182
+ 2. Verify HF token has read access
183
+ 3. Use `--dataset-repo` to specify correct repository
184
+
185
+ ## 📋 Workflow Integration
186
+
187
+ ### **Complete Training Workflow**
188
+ 1. **Train Model**: Use training scripts with monitoring
189
+ 2. **Monitor Progress**: View metrics in Trackio interface
190
+ 3. **Push Model**: Use improved push script
191
+ 4. **Access Data**: View experiments in HF Dataset repository
192
+
193
+ ### **Example Workflow**
194
+ ```bash
195
+ # 1. Train model with monitoring
196
+ python train.py config/train_smollm3_openhermes_fr.py \
197
+ --experiment_name "smollm3_french_v2"
198
+
199
+ # 2. Push model to HF Hub
200
+ python push_to_huggingface.py outputs/model username/smollm3-french \
201
+ --dataset-repo username/experiments \
202
+ --experiment-name "smollm3_french_v2"
203
+
204
+ # 3. View results
205
+ # - Model: https://huggingface.co/username/smollm3-french
206
+ # - Experiments: https://huggingface.co/datasets/username/experiments
207
+ # - Trackio: Your Trackio Space interface
208
+ ```
209
+
210
+ ## 🎯 Benefits
211
+
212
+ ### **For Model Deployment**
213
+ - ✅ **Complete Documentation**: Enhanced model cards with experiment links
214
+ - ✅ **Persistent Storage**: Experiment data stored in HF Datasets
215
+ - ✅ **Easy Access**: Direct links to training data and metrics
216
+ - ✅ **Reproducibility**: Complete training configuration included
217
+
218
+ ### **For Experiment Management**
219
+ - ✅ **Centralized Storage**: All experiments in HF Dataset repository
220
+ - ✅ **Version Control**: Model versions linked to experiment data
221
+ - ✅ **Collaboration**: Share experiments and models easily
222
+ - ✅ **Searchability**: Easy to find specific experiments
223
+
224
+ ### **For Development**
225
+ - ✅ **Flexible Configuration**: Multiple ways to set parameters
226
+ - ✅ **Backward Compatible**: Works with existing setups
227
+ - ✅ **Error Handling**: Clear error messages and troubleshooting
228
+ - ✅ **Integration**: Works with existing monitoring system
229
+
230
+ ## 📊 Testing Results
231
+
232
+ All push script tests passed:
233
+ - ✅ **HuggingFacePusher Initialization**: Works with new parameters
234
+ - ✅ **Model Card Creation**: Includes HF Datasets integration
235
+ - ✅ **Logging Integration**: Logs to both Trackio and HF Datasets
236
+ - ✅ **Argument Parsing**: Handles new command line arguments
237
+ - ✅ **Environment Variables**: Proper fallback handling
238
+
239
+ ## 🔄 Migration Guide
240
+
241
+ ### **From Old Script**
242
+ ```bash
243
+ # Old way
244
+ python push_to_huggingface.py model_path repo_name --token your_token
245
+
246
+ # New way (same functionality)
247
+ python push_to_huggingface.py model_path repo_name --hf-token your_token
248
+
249
+ # New way with HF Datasets
250
+ python push_to_huggingface.py model_path repo_name \
251
+ --hf-token your_token \
252
+ --dataset-repo username/experiments
253
+ ```
254
+
255
+ ### **Environment Variables**
256
+ ```bash
257
+ # Set environment variables for automatic detection
258
+ export HF_TOKEN=your_token_here
259
+ export TRACKIO_DATASET_REPO=username/experiments
260
+
261
+ # Then use simple command
262
+ python push_to_huggingface.py model_path repo_name
263
+ ```
264
+
265
+ ---
266
+
267
+ **🎉 Your push script is now fully integrated with HF Datasets for complete experiment tracking and model deployment!**
TRACKIO_INTEGRATION.md → docs/TRACKIO_INTEGRATION.md RENAMED
File without changes
TRACKIO_INTEGRATION_VERIFICATION.md → docs/TRACKIO_INTEGRATION_VERIFICATION.md RENAMED
File without changes
TRACKIO_INTERFACE_GUIDE.md → docs/TRACKIO_INTERFACE_GUIDE.md RENAMED
File without changes
launch.sh ADDED
@@ -0,0 +1,690 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/bash
2
+ # Interactive SmolLM3 End-to-End Fine-tuning Pipeline
3
+ # This script creates a complete finetuning pipeline with user configuration
4
+
5
+ set -e # Exit on any error
6
+
7
+ # Colors for output
8
+ RED='\033[0;31m'
9
+ GREEN='\033[0;32m'
10
+ YELLOW='\033[1;33m'
11
+ BLUE='\033[0;34m'
12
+ PURPLE='\033[0;35m'
13
+ CYAN='\033[0;36m'
14
+ NC='\033[0m' # No Color
15
+
16
+ # Function to print colored output
17
+ print_status() {
18
+ echo -e "${GREEN}✅ $1${NC}"
19
+ }
20
+
21
+ print_warning() {
22
+ echo -e "${YELLOW}⚠️ $1${NC}"
23
+ }
24
+
25
+ print_error() {
26
+ echo -e "${RED}❌ $1${NC}"
27
+ }
28
+
29
+ print_info() {
30
+ echo -e "${BLUE}ℹ️ $1${NC}"
31
+ }
32
+
33
+ print_header() {
34
+ echo -e "${PURPLE}🚀 $1${NC}"
35
+ }
36
+
37
+ print_step() {
38
+ echo -e "${CYAN}📋 $1${NC}"
39
+ }
40
+
41
+ # Function to get user input with default value
42
+ get_input() {
43
+ local prompt="$1"
44
+ local default="$2"
45
+ local var_name="$3"
46
+
47
+ if [ -n "$default" ]; then
48
+ read -p "$prompt [$default]: " input
49
+ if [ -z "$input" ]; then
50
+ input="$default"
51
+ fi
52
+ else
53
+ read -p "$prompt: " input
54
+ while [ -z "$input" ]; do
55
+ print_error "This field is required!"
56
+ read -p "$prompt: " input
57
+ done
58
+ fi
59
+
60
+ eval "$var_name=\"$input\""
61
+ }
62
+
63
+ # Function to select from options
64
+ select_option() {
65
+ local prompt="$1"
66
+ local options=("${@:2}")
67
+ local var_name="${!#}"
68
+
69
+ echo "$prompt"
70
+ for i in "${!options[@]}"; do
71
+ echo " $((i+1)). ${options[$i]}"
72
+ done
73
+
74
+ while true; do
75
+ read -p "Enter your choice (1-${#options[@]}): " choice
76
+ if [[ "$choice" =~ ^[0-9]+$ ]] && [ "$choice" -ge 1 ] && [ "$choice" -le "${#options[@]}" ]; then
77
+ eval "$var_name=\"${options[$((choice-1))]}\""
78
+ break
79
+ else
80
+ print_error "Invalid choice. Please enter a number between 1 and ${#options[@]}"
81
+ fi
82
+ done
83
+ }
84
+
85
+ # Function to validate HF token
86
+ validate_hf_token() {
87
+ local token="$1"
88
+ if [ -z "$token" ]; then
89
+ return 1
90
+ fi
91
+
92
+ # Test the token
93
+ export HF_TOKEN="$token"
94
+ if huggingface-cli whoami >/dev/null 2>&1; then
95
+ return 0
96
+ else
97
+ return 1
98
+ fi
99
+ }
100
+
101
# Print the menu of built-in training presets. Purely informational; the
# actual selection happens via select_option in the main script.
show_training_configs() {
    echo ""
    print_header "Available Training Configurations"
    # One quoted heredoc instead of a long run of echo statements; the text
    # is written literally (no variable expansion needed here).
    cat <<'MENU'
======================================

1. Basic Training (Default)
   - Model: SmolLM3-3B
   - Dataset: SmolTalk
   - Epochs: 3
   - Batch Size: 2
   - Learning Rate: 5e-6

2. H100 Lightweight (Rapid)
   - Model: SmolLM3-3B
   - Dataset: OpenHermes-FR (80K samples)
   - Epochs: 1
   - Batch Size: 16
   - Learning Rate: 8e-6
   - Sequence Length: 8192
   - Optimized for H100 rapid training

3. A100 Large Scale
   - Model: SmolLM3-3B
   - Dataset: OpenHermes-FR
   - Epochs: 1.3 passes
   - Batch Size: 8
   - Learning Rate: 5e-6
   - Sequence Length: 8192

4. Multiple Passes
   - Model: SmolLM3-3B
   - Dataset: OpenHermes-FR
   - Epochs: 4 passes
   - Batch Size: 6
   - Learning Rate: 3e-6
   - Sequence Length: 8192

5. Custom Configuration
   - User-defined parameters

MENU
}
143
+
144
# Populate the global hyper-parameter variables (MODEL_NAME, DATASET_NAME,
# MAX_EPOCHS, BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS, LEARNING_RATE,
# MAX_SEQ_LENGTH, CONFIG_FILE, and for H100 also DATASET_SAMPLE_SIZE)
# for a named preset.
#   $1 - one of the menu labels offered to the user by select_option.
get_training_config() {
    local config_type="$1"

    case "$config_type" in
        "Basic Training")
            MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
            DATASET_NAME="HuggingFaceTB/smoltalk"
            MAX_EPOCHS=3
            BATCH_SIZE=2
            GRADIENT_ACCUMULATION_STEPS=8
            LEARNING_RATE=5e-6
            MAX_SEQ_LENGTH=4096
            CONFIG_FILE="config/train_smollm3.py"
            ;;
        "H100 Lightweight (Rapid)")
            MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
            DATASET_NAME="legmlai/openhermes-fr"
            MAX_EPOCHS=1
            BATCH_SIZE=16
            GRADIENT_ACCUMULATION_STEPS=4
            LEARNING_RATE=8e-6
            MAX_SEQ_LENGTH=8192
            # Only this preset subsamples the dataset (consumed in Step 13
            # of the main script).
            DATASET_SAMPLE_SIZE=80000
            CONFIG_FILE="config/train_smollm3_h100_lightweight.py"
            ;;
        "A100 Large Scale")
            MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
            DATASET_NAME="legmlai/openhermes-fr"
            MAX_EPOCHS=1
            BATCH_SIZE=8
            GRADIENT_ACCUMULATION_STEPS=16
            LEARNING_RATE=5e-6
            MAX_SEQ_LENGTH=8192
            CONFIG_FILE="config/train_smollm3_openhermes_fr_a100_large.py"
            ;;
        "Multiple Passes")
            MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
            DATASET_NAME="legmlai/openhermes-fr"
            MAX_EPOCHS=4
            BATCH_SIZE=6
            GRADIENT_ACCUMULATION_STEPS=20
            LEARNING_RATE=3e-6
            MAX_SEQ_LENGTH=8192
            CONFIG_FILE="config/train_smollm3_openhermes_fr_a100_multiple_passes.py"
            ;;
        "Custom Configuration")
            # Interactive fallback: prompt the user for every value.
            get_custom_config
            ;;
    esac
}
195
+
196
# Interactively collect every hyper-parameter from the user (the "Custom
# Configuration" preset). Sets the same globals as get_training_config.
get_custom_config() {
    print_step "Custom Configuration Setup"
    echo "============================="

    get_input "Model name" "HuggingFaceTB/SmolLM3-3B" MODEL_NAME
    get_input "Dataset name" "HuggingFaceTB/smoltalk" DATASET_NAME
    get_input "Number of epochs" "3" MAX_EPOCHS
    get_input "Batch size" "2" BATCH_SIZE
    get_input "Gradient accumulation steps" "8" GRADIENT_ACCUMULATION_STEPS
    get_input "Learning rate" "5e-6" LEARNING_RATE
    get_input "Max sequence length" "4096" MAX_SEQ_LENGTH

    # Select config file based on dataset: OpenHermes datasets get the
    # dataset-specific base config, everything else the generic one.
    if [[ "$DATASET_NAME" == *"openhermes"* ]]; then
        CONFIG_FILE="config/train_smollm3_openhermes_fr.py"
    else
        CONFIG_FILE="config/train_smollm3.py"
    fi
}
216
+
217
# Write a Python training-config file to the path given in $1, filled in from
# the globals collected by the interactive prompts.
# The heredoc delimiter is deliberately unquoted so $MODEL_NAME, $BATCH_SIZE,
# etc. are expanded by the shell; everything else lands literally in the
# generated file. Requires MODEL_NAME, MAX_SEQ_LENGTH, BATCH_SIZE,
# GRADIENT_ACCUMULATION_STEPS, LEARNING_RATE, SAVE_STEPS, EVAL_STEPS,
# LOGGING_STEPS, DATASET_NAME, TRACKIO_URL, EXPERIMENT_NAME,
# TRACKIO_DATASET_REPO and TRAINING_CONFIG_TYPE to be set by the caller.
# NOTE(review): fp16=True/bf16=False is emitted for every preset — confirm
# this is intended for H100/A100 runs where bf16 is typically preferred.
create_training_config() {
    local config_file="$1"

    cat > "$config_file" << EOF
"""
SmolLM3 Training Configuration - Generated by launch.sh
Optimized for: $TRAINING_CONFIG_TYPE
"""

from config.train_smollm3 import SmolLM3Config

config = SmolLM3Config(
    # Model configuration
    model_name="$MODEL_NAME",
    max_seq_length=$MAX_SEQ_LENGTH,
    use_flash_attention=True,
    use_gradient_checkpointing=True,

    # Training configuration
    batch_size=$BATCH_SIZE,
    gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS,
    learning_rate=$LEARNING_RATE,
    weight_decay=0.01,
    warmup_steps=100,
    max_iters=None,  # Will be calculated based on epochs
    eval_interval=100,
    log_interval=10,
    save_interval=500,

    # Optimizer configuration
    optimizer="adamw",
    beta1=0.9,
    beta2=0.95,
    eps=1e-8,

    # Scheduler configuration
    scheduler="cosine",
    min_lr=1e-6,

    # Mixed precision
    fp16=True,
    bf16=False,

    # Logging and saving
    save_steps=$SAVE_STEPS,
    eval_steps=$EVAL_STEPS,
    logging_steps=$LOGGING_STEPS,
    save_total_limit=3,

    # Evaluation
    eval_strategy="steps",
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    load_best_model_at_end=True,

    # Data configuration
    dataset_name="$DATASET_NAME",
    dataset_split="train",
    input_field="prompt",
    target_field="completion",
    filter_bad_entries=False,
    bad_entry_field="bad_entry",

    # Chat template configuration
    use_chat_template=True,
    chat_template_kwargs={
        "enable_thinking": False,
        "add_generation_prompt": True,
        "no_think_system_message": True
    },

    # Trackio monitoring configuration
    enable_tracking=True,
    trackio_url="$TRACKIO_URL",
    trackio_token=None,
    log_artifacts=True,
    log_metrics=True,
    log_config=True,
    experiment_name="$EXPERIMENT_NAME",

    # HF Datasets configuration
    dataset_repo="$TRACKIO_DATASET_REPO"
)
EOF
}
303
+
304
# Main script: interactive end-to-end pipeline — credentials, preset
# selection, environment setup, Trackio deployment, dataset prep, training,
# model push, smoke test, and summary report.
print_header "SmolLM3 End-to-End Fine-tuning Pipeline"
echo "=============================================="
echo ""

# Step 1: Get user credentials
print_step "Step 1: User Authentication"
echo "================================"

get_input "Hugging Face username" "" HF_USERNAME
get_input "Hugging Face token (get from https://huggingface.co/settings/tokens)" "" HF_TOKEN

# Validate HF token before doing anything expensive.
print_info "Validating Hugging Face token..."
if validate_hf_token "$HF_TOKEN"; then
    print_status "HF token validated successfully"
else
    print_error "Invalid HF token. Please check your token and try again."
    exit 1
fi

# Step 2: Select training configuration
print_step "Step 2: Training Configuration"
echo "=================================="

show_training_configs
select_option "Select training configuration:" "Basic Training" "H100 Lightweight (Rapid)" "A100 Large Scale" "Multiple Passes" "Custom Configuration" TRAINING_CONFIG_TYPE

get_training_config "$TRAINING_CONFIG_TYPE"

# Step 3: Get experiment details
print_step "Step 3: Experiment Details"
echo "=============================="

get_input "Experiment name" "smollm3_finetune_$(date +%Y%m%d_%H%M%S)" EXPERIMENT_NAME
get_input "Model repository name" "$HF_USERNAME/smollm3-finetuned-$(date +%Y%m%d)" REPO_NAME
get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO

# Step 4: Training parameters
print_step "Step 4: Training Parameters"
echo "==============================="

echo "Current configuration:"
echo "  Model: $MODEL_NAME"
echo "  Dataset: $DATASET_NAME"
if [ "$TRAINING_CONFIG_TYPE" = "H100 Lightweight (Rapid)" ]; then
    echo "  Dataset Sample Size: ${DATASET_SAMPLE_SIZE:-80000}"
fi
echo "  Epochs: $MAX_EPOCHS"
echo "  Batch Size: $BATCH_SIZE"
echo "  Gradient Accumulation: $GRADIENT_ACCUMULATION_STEPS"
echo "  Learning Rate: $LEARNING_RATE"
echo "  Sequence Length: $MAX_SEQ_LENGTH"

get_input "Save steps" "500" SAVE_STEPS
get_input "Evaluation steps" "100" EVAL_STEPS
get_input "Logging steps" "10" LOGGING_STEPS

# Step 5: Trackio Space configuration
print_step "Step 5: Trackio Space Configuration"
echo "======================================"

get_input "Trackio Space name" "trackio-monitoring-$(date +%Y%m%d)" TRACKIO_SPACE_NAME
TRACKIO_URL="https://huggingface.co/spaces/$HF_USERNAME/$TRACKIO_SPACE_NAME"

# Step 6: Confirm configuration
print_step "Step 6: Configuration Summary"
echo "================================="

echo ""
echo "📋 Configuration Summary:"
echo "========================"
echo "  User: $HF_USERNAME"
echo "  Experiment: $EXPERIMENT_NAME"
echo "  Model: $MODEL_NAME"
echo "  Dataset: $DATASET_NAME"
echo "  Training Config: $TRAINING_CONFIG_TYPE"
if [ "$TRAINING_CONFIG_TYPE" = "H100 Lightweight (Rapid)" ]; then
    echo "  Dataset Sample Size: ${DATASET_SAMPLE_SIZE:-80000}"
fi
echo "  Epochs: $MAX_EPOCHS"
echo "  Batch Size: $BATCH_SIZE"
echo "  Learning Rate: $LEARNING_RATE"
echo "  Model Repo: $REPO_NAME"
echo "  Trackio Space: $TRACKIO_URL"
echo "  HF Dataset: $TRACKIO_DATASET_REPO"
echo ""

# -r keeps backslashes in the reply intact.
read -r -p "Proceed with this configuration? (y/N): " confirm
if [[ ! "$confirm" =~ ^[Yy]$ ]]; then
    print_info "Configuration cancelled. Exiting."
    exit 0
fi

# Step 7: Environment setup
print_step "Step 7: Environment Setup"
echo "============================"

print_info "Installing system dependencies..."
sudo apt-get update
sudo apt-get install -y git curl wget unzip python3-pip python3-venv

print_info "Creating Python virtual environment..."
python3 -m venv smollm3_env
source smollm3_env/bin/activate

print_info "Installing PyTorch with CUDA support..."
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

print_info "Installing project dependencies..."
pip install -r requirements/requirements_core.txt

print_info "Installing additional dependencies..."
# Version specifiers MUST be quoted: an unquoted 'trl>=0.7.0' is parsed by
# the shell as 'pip install trl' with stdout redirected to a file named
# '=0.7.0', silently dropping the version constraint.
pip install "trl>=0.7.0"
pip install "peft>=0.4.0"
pip install "accelerate>=0.20.0"
pip install "huggingface-hub>=0.16.0"
pip install "datasets>=2.14.0"
pip install "requests>=2.31.0"

# Step 8: Authentication setup
print_step "Step 8: Authentication Setup"
echo "================================"

export HF_TOKEN="$HF_TOKEN"
export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
# Quote the token so it cannot be word-split or glob-expanded.
huggingface-cli login --token "$HF_TOKEN"

# Step 9: Deploy Trackio Space
print_step "Step 9: Deploying Trackio Space"
echo "==================================="

cd scripts/trackio_tonic

# Feed the deployment script its interactive answers via a temp file.
# NOTE(review): deploy_input.txt contains the HF token in plain text and is
# not deleted afterwards — consider removing it once deployment finishes.
cat > deploy_input.txt << EOF
$HF_USERNAME
$TRACKIO_SPACE_NAME
$HF_TOKEN
EOF

# Run deployment script
python deploy_trackio_space.py < deploy_input.txt

print_status "Trackio Space deployed: $TRACKIO_URL"

# Step 10: Setup HF Dataset
print_step "Step 10: Setting up HF Dataset"
echo "=================================="

cd ../dataset_tonic
python setup_hf_dataset.py

# Step 11: Configure Trackio
print_step "Step 11: Configuring Trackio"
echo "================================="

cd ../trackio_tonic
python configure_trackio.py

# Step 12: Create training configuration
print_step "Step 12: Creating Training Configuration"
echo "==========================================="

cd ../..
create_training_config "$CONFIG_FILE"

# Step 13: Download and prepare dataset
print_step "Step 13: Preparing Dataset"
echo "==============================="

# The Python source below is a double-quoted shell string, so $DATASET_NAME,
# $TRAINING_CONFIG_TYPE and ${DATASET_SAMPLE_SIZE:-...} are expanded by the
# shell BEFORE Python sees the code.
python -c "
from datasets import load_dataset
import json
import os
import random

# Load dataset
print('Loading dataset: $DATASET_NAME')
dataset = load_dataset('$DATASET_NAME')

# Create dataset directory
os.makedirs('training_dataset', exist_ok=True)

# Convert to training format
def convert_to_training_format(example):
    # Handle different dataset formats
    if 'prompt' in example and 'completion' in example:
        return {
            'prompt': example['prompt'],
            'completion': example['completion']
        }
    elif 'instruction' in example and 'output' in example:
        return {
            'prompt': example['instruction'],
            'completion': example['output']
        }
    elif 'messages' in example:
        # Handle chat format
        messages = example['messages']
        if len(messages) >= 2:
            return {
                'prompt': messages[0]['content'],
                'completion': messages[1]['content']
            }
        # Previously this path returned None, which crashed the
        # training_example['prompt'] check below; emit an empty pair so
        # short chat entries are silently filtered out instead.
        return {'prompt': '', 'completion': ''}
    else:
        # Fallback
        return {
            'prompt': str(example.get('input', '')),
            'completion': str(example.get('output', ''))
        }

# Process train split
train_data = []
for example in dataset['train']:
    training_example = convert_to_training_format(example)
    if training_example['prompt'] and training_example['completion']:
        train_data.append(training_example)

# Apply dataset sampling for lightweight configuration
if '$TRAINING_CONFIG_TYPE' == 'H100 Lightweight (Rapid)' and len(train_data) > ${DATASET_SAMPLE_SIZE:-0}:
    print(f'Sampling {${DATASET_SAMPLE_SIZE:-80000}} random samples from {len(train_data)} total samples')
    random.seed(42)  # For reproducibility
    train_data = random.sample(train_data, ${DATASET_SAMPLE_SIZE:-80000})
    print(f'Selected {len(train_data)} samples for lightweight training')

# Process validation split if available
val_data = []
if 'validation' in dataset:
    for example in dataset['validation']:
        training_example = convert_to_training_format(example)
        if training_example['prompt'] and training_example['completion']:
            val_data.append(training_example)

    # For lightweight config, also sample validation if it's large
    if '$TRAINING_CONFIG_TYPE' == 'H100 Lightweight (Rapid)' and len(val_data) > 1000:
        print(f'Sampling 1000 random validation samples from {len(val_data)} total')
        random.seed(42)  # For reproducibility
        val_data = random.sample(val_data, 1000)

# Save to files
with open('training_dataset/train.json', 'w') as f:
    json.dump(train_data, f, indent=2)

if val_data:
    with open('training_dataset/validation.json', 'w') as f:
        json.dump(val_data, f, indent=2)

print(f'Dataset prepared: {len(train_data)} train samples, {len(val_data)} validation samples')
"

# Step 14: Calculate training parameters
print_step "Step 14: Calculating Training Parameters"
echo "============================================"

TOTAL_SAMPLES=$(python -c "import json; data=json.load(open('training_dataset/train.json')); print(len(data))")
EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
STEPS_PER_EPOCH=$((TOTAL_SAMPLES / EFFECTIVE_BATCH_SIZE))
MAX_STEPS=$((STEPS_PER_EPOCH * MAX_EPOCHS))

echo "  Total samples: $TOTAL_SAMPLES"
echo "  Effective batch size: $EFFECTIVE_BATCH_SIZE"
echo "  Steps per epoch: $STEPS_PER_EPOCH"
echo "  Total training steps: $MAX_STEPS"

# Step 15: Start training
print_step "Step 15: Starting Training"
echo "=============================="

python src/train.py "$CONFIG_FILE" \
    --dataset_dir training_dataset \
    --out_dir /output-checkpoint \
    --init_from scratch \
    --max_iters $MAX_STEPS \
    --batch_size $BATCH_SIZE \
    --learning_rate $LEARNING_RATE \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --max_seq_length $MAX_SEQ_LENGTH \
    --save_steps $SAVE_STEPS \
    --eval_steps $EVAL_STEPS \
    --logging_steps $LOGGING_STEPS \
    --enable_tracking \
    --trackio_url "$TRACKIO_URL" \
    --experiment_name "$EXPERIMENT_NAME" \
    --hf_token "$HF_TOKEN" \
    --dataset_repo "$TRACKIO_DATASET_REPO"

# Step 16: Push model to Hugging Face Hub
print_step "Step 16: Pushing Model to HF Hub"
echo "====================================="

python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
    --token "$HF_TOKEN" \
    --trackio-url "$TRACKIO_URL" \
    --experiment-name "$EXPERIMENT_NAME" \
    --dataset-repo "$TRACKIO_DATASET_REPO"

# Step 17: Test the uploaded model
print_step "Step 17: Testing Uploaded Model"
echo "==================================="

python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

print('Loading uploaded model...')
model = AutoModelForCausalLM.from_pretrained('$REPO_NAME', torch_dtype=torch.float16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('$REPO_NAME')

print('Testing model generation...')
prompt = 'Hello, how are you?'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f'Prompt: {prompt}')
print(f'Response: {response}')
print('✅ Model test completed successfully!')
"

# Step 18: Create summary report
print_step "Step 18: Creating Summary Report"
echo "===================================="

cat > training_summary.md << EOF
# SmolLM3 Fine-tuning Summary

## Configuration
- **Model**: $MODEL_NAME
- **Dataset**: $DATASET_NAME
- **Experiment**: $EXPERIMENT_NAME
- **Repository**: $REPO_NAME
- **Trackio Space**: $TRACKIO_URL
- **HF Dataset**: $TRACKIO_DATASET_REPO
- **Training Config**: $TRAINING_CONFIG_TYPE
$(if [ "$TRAINING_CONFIG_TYPE" = "H100 Lightweight (Rapid)" ]; then
echo "- **Dataset Sample Size**: ${DATASET_SAMPLE_SIZE:-80000}"
fi)

## Training Parameters
- **Batch Size**: $BATCH_SIZE
- **Gradient Accumulation**: $GRADIENT_ACCUMULATION_STEPS
- **Learning Rate**: $LEARNING_RATE
- **Max Epochs**: $MAX_EPOCHS
- **Max Steps**: $MAX_STEPS
- **Total Samples**: $TOTAL_SAMPLES
- **Sequence Length**: $MAX_SEQ_LENGTH

## Results
- **Model Repository**: https://huggingface.co/$REPO_NAME
- **Trackio Monitoring**: $TRACKIO_URL
- **Experiment Data**: https://huggingface.co/datasets/$TRACKIO_DATASET_REPO

## Next Steps
1. Monitor training progress in your Trackio Space
2. Check the model repository on Hugging Face Hub
3. Use the model in your applications
4. Share your results with the community

## Files Created
- Training configuration: \`$CONFIG_FILE\`
- Dataset: \`training_dataset/\`
- Model checkpoint: \`/output-checkpoint/\`
- Training logs: \`training.log\`
- Summary report: \`training_summary.md\`
EOF

print_status "Summary report saved to: training_summary.md"

# Final summary
echo ""
print_header "🎉 End-to-End Pipeline Completed Successfully!"
echo "=================================================="
echo ""
echo "📊 Model: https://huggingface.co/$REPO_NAME"
echo "📈 Trackio: $TRACKIO_URL"
echo "📋 Experiment: $EXPERIMENT_NAME"
echo "📊 Dataset: https://huggingface.co/datasets/$TRACKIO_DATASET_REPO"
echo ""
echo "📋 Summary report saved to: training_summary.md"
echo ""
echo "🚀 Next steps:"
echo "1. Monitor training progress in your Trackio Space"
echo "2. Check the model repository on Hugging Face Hub"
echo "3. Use the model in your applications"
echo "4. Share your results with the community"
echo ""
print_status "Pipeline completed successfully!"
requirements.txt → requirements/requirements.txt RENAMED
File without changes
requirements_core.txt → requirements/requirements_core.txt RENAMED
@@ -9,6 +9,12 @@ tokenizers>=0.13.0
9
  bitsandbytes>=0.41.0
10
  numpy>=1.24.0
11
  tqdm>=4.65.0
 
 
 
 
 
 
12
  trackio>=0.1.0
13
  psutil>=5.9.0
14
- pynvml>=12.0.0
 
9
  bitsandbytes>=0.41.0
10
  numpy>=1.24.0
11
  tqdm>=4.65.0
12
+
13
+
14
+ # Monitoring dependencies
15
+ requests>=2.31.0
16
+ pandas>=2.0.0
17
+ plotly>=5.0.0
18
  trackio>=0.1.0
19
  psutil>=5.9.0
20
+ pynvml>=12.0.0
requirements_minimal.txt → requirements/requirements_minimal.txt RENAMED
File without changes
add_demo_data.py → scripts/dataset_tonic/add_demo_data.py RENAMED
File without changes
scripts/dataset_tonic/setup_hf_dataset.py ADDED
@@ -0,0 +1,275 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Setup script for Hugging Face Dataset repository for Trackio experiments
4
+ """
5
+
6
+ import os
7
+ import json
8
+ from datetime import datetime
9
+ from datasets import Dataset
10
+ from huggingface_hub import HfApi
11
+
12
+ def setup_trackio_dataset():
13
+ """Set up the Trackio experiments dataset on Hugging Face Hub"""
14
+
15
+ # Configuration - get from environment variables with fallbacks
16
+ dataset_repo = os.environ.get('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments')
17
+ hf_token = os.environ.get('HF_TOKEN')
18
+
19
+ if not hf_token:
20
+ print("❌ HF_TOKEN not found. Please set the HF_TOKEN environment variable.")
21
+ print("You can get your token from: https://huggingface.co/settings/tokens")
22
+ return False
23
+
24
+ print(f"🚀 Setting up Trackio dataset: {dataset_repo}")
25
+ print(f"🔧 Using dataset repository: {dataset_repo}")
26
+
27
+ # Initial experiment data
28
+ initial_experiments = [
29
+ {
30
+ 'experiment_id': 'exp_20250720_130853',
31
+ 'name': 'petite-elle-l-aime-3',
32
+ 'description': 'SmolLM3 fine-tuning experiment',
33
+ 'created_at': '2025-07-20T11:20:01.780908',
34
+ 'status': 'running',
35
+ 'metrics': json.dumps([
36
+ {
37
+ 'timestamp': '2025-07-20T11:20:01.780908',
38
+ 'step': 25,
39
+ 'metrics': {
40
+ 'loss': 1.1659,
41
+ 'grad_norm': 10.3125,
42
+ 'learning_rate': 7e-08,
43
+ 'num_tokens': 1642080.0,
44
+ 'mean_token_accuracy': 0.75923578992486,
45
+ 'epoch': 0.004851130919895701
46
+ }
47
+ },
48
+ {
49
+ 'timestamp': '2025-07-20T11:26:39.042155',
50
+ 'step': 50,
51
+ 'metrics': {
52
+ 'loss': 1.165,
53
+ 'grad_norm': 10.75,
54
+ 'learning_rate': 1.4291666666666667e-07,
55
+ 'num_tokens': 3324682.0,
56
+ 'mean_token_accuracy': 0.7577659255266189,
57
+ 'epoch': 0.009702261839791402
58
+ }
59
+ },
60
+ {
61
+ 'timestamp': '2025-07-20T11:33:16.203045',
62
+ 'step': 75,
63
+ 'metrics': {
64
+ 'loss': 1.1639,
65
+ 'grad_norm': 10.6875,
66
+ 'learning_rate': 2.1583333333333334e-07,
67
+ 'num_tokens': 4987941.0,
68
+ 'mean_token_accuracy': 0.7581205774843692,
69
+ 'epoch': 0.014553392759687101
70
+ }
71
+ },
72
+ {
73
+ 'timestamp': '2025-07-20T11:39:53.453917',
74
+ 'step': 100,
75
+ 'metrics': {
76
+ 'loss': 1.1528,
77
+ 'grad_norm': 10.75,
78
+ 'learning_rate': 2.8875e-07,
79
+ 'num_tokens': 6630190.0,
80
+ 'mean_token_accuracy': 0.7614579878747463,
81
+ 'epoch': 0.019404523679582803
82
+ }
83
+ }
84
+ ]),
85
+ 'parameters': json.dumps({
86
+ 'model_name': 'HuggingFaceTB/SmolLM3-3B',
87
+ 'max_seq_length': 12288,
88
+ 'use_flash_attention': True,
89
+ 'use_gradient_checkpointing': False,
90
+ 'batch_size': 8,
91
+ 'gradient_accumulation_steps': 16,
92
+ 'learning_rate': 3.5e-06,
93
+ 'weight_decay': 0.01,
94
+ 'warmup_steps': 1200,
95
+ 'max_iters': 18000,
96
+ 'eval_interval': 1000,
97
+ 'log_interval': 25,
98
+ 'save_interval': 2000,
99
+ 'optimizer': 'adamw_torch',
100
+ 'beta1': 0.9,
101
+ 'beta2': 0.999,
102
+ 'eps': 1e-08,
103
+ 'scheduler': 'cosine',
104
+ 'min_lr': 3.5e-07,
105
+ 'fp16': False,
106
+ 'bf16': True,
107
+ 'ddp_backend': 'nccl',
108
+ 'ddp_find_unused_parameters': False,
109
+ 'save_steps': 2000,
110
+ 'eval_steps': 1000,
111
+ 'logging_steps': 25,
112
+ 'save_total_limit': 5,
113
+ 'eval_strategy': 'steps',
114
+ 'metric_for_best_model': 'eval_loss',
115
+ 'greater_is_better': False,
116
+ 'load_best_model_at_end': True,
117
+ 'data_dir': None,
118
+ 'train_file': None,
119
+ 'validation_file': None,
120
+ 'test_file': None,
121
+ 'use_chat_template': True,
122
+ 'chat_template_kwargs': {'add_generation_prompt': True, 'no_think_system_message': True},
123
+ 'enable_tracking': True,
124
+ 'trackio_url': 'https://tonic-test-trackio-test.hf.space',
125
+ 'trackio_token': None,
126
+ 'log_artifacts': True,
127
+ 'log_metrics': True,
128
+ 'log_config': True,
129
+ 'experiment_name': 'petite-elle-l-aime-3',
130
+ 'dataset_name': 'legmlai/openhermes-fr',
131
+ 'dataset_split': 'train',
132
+ 'input_field': 'prompt',
133
+ 'target_field': 'accepted_completion',
134
+ 'filter_bad_entries': True,
135
+ 'bad_entry_field': 'bad_entry',
136
+ 'packing': False,
137
+ 'max_prompt_length': 12288,
138
+ 'max_completion_length': 8192,
139
+ 'truncation': True,
140
+ 'dataloader_num_workers': 10,
141
+ 'dataloader_pin_memory': True,
142
+ 'dataloader_prefetch_factor': 3,
143
+ 'max_grad_norm': 1.0,
144
+ 'group_by_length': True
145
+ }),
146
+ 'artifacts': json.dumps([]),
147
+ 'logs': json.dumps([]),
148
+ 'last_updated': datetime.now().isoformat()
149
+ },
150
+ {
151
+ 'experiment_id': 'exp_20250720_134319',
152
+ 'name': 'petite-elle-l-aime-3-1',
153
+ 'description': 'SmolLM3 fine-tuning experiment',
154
+ 'created_at': '2025-07-20T11:54:31.993219',
155
+ 'status': 'running',
156
+ 'metrics': json.dumps([
157
+ {
158
+ 'timestamp': '2025-07-20T11:54:31.993219',
159
+ 'step': 25,
160
+ 'metrics': {
161
+ 'loss': 1.166,
162
+ 'grad_norm': 10.375,
163
+ 'learning_rate': 7e-08,
164
+ 'num_tokens': 1642080.0,
165
+ 'mean_token_accuracy': 0.7590958896279335,
166
+ 'epoch': 0.004851130919895701
167
+ }
168
+ },
169
+ {
170
+ 'timestamp': '2025-07-20T11:54:33.589487',
171
+ 'step': 25,
172
+ 'metrics': {
173
+ 'gpu_0_memory_allocated': 17.202261447906494,
174
+ 'gpu_0_memory_reserved': 75.474609375,
175
+ 'gpu_0_utilization': 0,
176
+ 'cpu_percent': 2.7,
177
+ 'memory_percent': 10.1
178
+ }
179
+ }
180
+ ]),
181
+ 'parameters': json.dumps({
182
+ 'model_name': 'HuggingFaceTB/SmolLM3-3B',
183
+ 'max_seq_length': 12288,
184
+ 'use_flash_attention': True,
185
+ 'use_gradient_checkpointing': False,
186
+ 'batch_size': 8,
187
+ 'gradient_accumulation_steps': 16,
188
+ 'learning_rate': 3.5e-06,
189
+ 'weight_decay': 0.01,
190
+ 'warmup_steps': 1200,
191
+ 'max_iters': 18000,
192
+ 'eval_interval': 1000,
193
+ 'log_interval': 25,
194
+ 'save_interval': 2000,
195
+ 'optimizer': 'adamw_torch',
196
+ 'beta1': 0.9,
197
+ 'beta2': 0.999,
198
+ 'eps': 1e-08,
199
+ 'scheduler': 'cosine',
200
+ 'min_lr': 3.5e-07,
201
+ 'fp16': False,
202
+ 'bf16': True,
203
+ 'ddp_backend': 'nccl',
204
+ 'ddp_find_unused_parameters': False,
205
+ 'save_steps': 2000,
206
+ 'eval_steps': 1000,
207
+ 'logging_steps': 25,
208
+ 'save_total_limit': 5,
209
+ 'eval_strategy': 'steps',
210
+ 'metric_for_best_model': 'eval_loss',
211
+ 'greater_is_better': False,
212
+ 'load_best_model_at_end': True,
213
+ 'data_dir': None,
214
+ 'train_file': None,
215
+ 'validation_file': None,
216
+ 'test_file': None,
217
+ 'use_chat_template': True,
218
+ 'chat_template_kwargs': {'add_generation_prompt': True, 'no_think_system_message': True},
219
+ 'enable_tracking': True,
220
+ 'trackio_url': 'https://tonic-test-trackio-test.hf.space',
221
+ 'trackio_token': None,
222
+ 'log_artifacts': True,
223
+ 'log_metrics': True,
224
+ 'log_config': True,
225
+ 'experiment_name': 'petite-elle-l-aime-3-1',
226
+ 'dataset_name': 'legmlai/openhermes-fr',
227
+ 'dataset_split': 'train',
228
+ 'input_field': 'prompt',
229
+ 'target_field': 'accepted_completion',
230
+ 'filter_bad_entries': True,
231
+ 'bad_entry_field': 'bad_entry',
232
+ 'packing': False,
233
+ 'max_prompt_length': 12288,
234
+ 'max_completion_length': 8192,
235
+ 'truncation': True,
236
+ 'dataloader_num_workers': 10,
237
+ 'dataloader_pin_memory': True,
238
+ 'dataloader_prefetch_factor': 3,
239
+ 'max_grad_norm': 1.0,
240
+ 'group_by_length': True
241
+ }),
242
+ 'artifacts': json.dumps([]),
243
+ 'logs': json.dumps([]),
244
+ 'last_updated': datetime.now().isoformat()
245
+ }
246
+ ]
247
+
248
+ try:
249
+ # Create dataset
250
+ dataset = Dataset.from_list(initial_experiments)
251
+
252
+ # Push to HF Hub
253
+ api = HfApi(token=hf_token)
254
+ dataset.push_to_hub(
255
+ dataset_repo,
256
+ token=hf_token,
257
+ private=True # Make it private for security
258
+ )
259
+
260
+ print(f"✅ Successfully created dataset: {dataset_repo}")
261
+ print(f"📊 Added {len(initial_experiments)} experiments")
262
+ print("🔒 Dataset is private (only accessible with your token)")
263
+ print("\n🎯 Next steps:")
264
+ print("1. Set HF_TOKEN in your Hugging Face Space environment")
265
+ print("2. Deploy the updated app.py to your Space")
266
+ print("3. The app will now load experiments from the dataset")
267
+
268
+ return True
269
+
270
+ except Exception as e:
271
+ print(f"❌ Failed to create dataset: {e}")
272
+ return False
273
+
274
# Allow running this module directly as a one-off setup script.
if __name__ == "__main__":
    setup_trackio_dataset()
push_to_huggingface.py → scripts/model_tonic/push_to_huggingface.py RENAMED
@@ -1,7 +1,7 @@
1
  #!/usr/bin/env python3
2
  """
3
  Push Trained Model and Results to Hugging Face Hub
4
- Integrates with Trackio monitoring and provides complete model deployment
5
  """
6
 
7
  import os
@@ -23,6 +23,9 @@ except ImportError:
23
  print("Warning: huggingface_hub not available. Install with: pip install huggingface_hub")
24
 
25
  try:
 
 
 
26
  from monitoring import SmolLM3Monitor
27
  MONITORING_AVAILABLE = True
28
  except ImportError:
@@ -32,7 +35,7 @@ except ImportError:
32
  logger = logging.getLogger(__name__)
33
 
34
  class HuggingFacePusher:
35
- """Push trained models and results to Hugging Face Hub"""
36
 
37
  def __init__(
38
  self,
@@ -41,15 +44,21 @@ class HuggingFacePusher:
41
  token: Optional[str] = None,
42
  private: bool = False,
43
  trackio_url: Optional[str] = None,
44
- experiment_name: Optional[str] = None
 
 
45
  ):
46
  self.model_path = Path(model_path)
47
  self.repo_name = repo_name
48
- self.token = token or os.getenv('HF_TOKEN')
49
  self.private = private
50
  self.trackio_url = trackio_url
51
  self.experiment_name = experiment_name
52
 
 
 
 
 
53
  # Initialize HF API
54
  if HF_AVAILABLE:
55
  self.api = HfApi(token=self.token)
@@ -58,14 +67,17 @@ class HuggingFacePusher:
58
 
59
  # Initialize monitoring if available
60
  self.monitor = None
61
- if MONITORING_AVAILABLE and trackio_url:
62
  self.monitor = SmolLM3Monitor(
63
  experiment_name=experiment_name or "model_push",
64
  trackio_url=trackio_url,
65
- enable_tracking=True
 
 
66
  )
67
 
68
  logger.info(f"Initialized HuggingFacePusher for {repo_name}")
 
69
 
70
  def create_repository(self) -> bool:
71
  """Create the Hugging Face repository"""
@@ -131,6 +143,7 @@ This is a fine-tuned SmolLM3 model based on the HuggingFaceTB/SmolLM3-3B archite
131
  - **Fine-tuning Method**: Supervised Fine-tuning
132
  - **Training Date**: {datetime.now().strftime('%Y-%m-%d')}
133
  - **Model Size**: {self._get_model_size():.1f} GB
 
134
 
135
  ## Training Configuration
136
 
@@ -166,6 +179,7 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
166
  - **Training Time**: {results.get('training_time_hours', 'Unknown')} hours
167
  - **Final Loss**: {results.get('final_loss', 'Unknown')}
168
  - **Final Accuracy**: {results.get('final_accuracy', 'Unknown')}
 
169
 
170
  ## Model Performance
171
 
@@ -173,6 +187,10 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
173
  - **Validation Loss**: {results.get('eval_loss', 'Unknown')}
174
  - **Training Steps**: {results.get('total_steps', 'Unknown')}
175
 
 
 
 
 
176
  ## Limitations and Biases
177
 
178
  This model is fine-tuned for specific tasks and may not generalize well to all use cases. Please evaluate the model's performance on your specific task before deployment.
@@ -293,6 +311,7 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
293
  - **Model Size**: {self._get_model_size():.1f} GB
294
  - **Training Steps**: {results.get('total_steps', 'Unknown')}
295
  - **Final Loss**: {results.get('final_loss', 'Unknown')}
 
296
 
297
  ## Training Configuration
298
 
@@ -306,6 +325,10 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
306
  {json.dumps(results, indent=2)}
307
  ```
308
 
 
 
 
 
309
  ## Files
310
 
311
  - `pytorch_model.bin`: Model weights
@@ -327,8 +350,8 @@ MIT License
327
  upload_file(
328
  path_or_fileobj=str(readme_path),
329
  path_in_repo="README.md",
330
- repo_id=self.repo_name,
331
- token=self.token
332
  )
333
 
334
  # Clean up
@@ -342,23 +365,36 @@ MIT License
342
  return False
343
 
344
  def log_to_trackio(self, action: str, details: Dict[str, Any]):
345
- """Log push action to Trackio"""
346
  if self.monitor:
347
  try:
 
348
  self.monitor.log_metrics({
349
  "push_action": action,
350
  "repo_name": self.repo_name,
351
  "model_size_gb": self._get_model_size(),
 
 
 
 
 
 
 
 
 
 
352
  **details
353
  })
354
- logger.info(f"✅ Logged {action} to Trackio")
 
355
  except Exception as e:
356
  logger.error(f"❌ Failed to log to Trackio: {e}")
357
 
358
  def push_model(self, training_config: Optional[Dict[str, Any]] = None,
359
  results: Optional[Dict[str, Any]] = None) -> bool:
360
- """Complete model push process"""
361
  logger.info(f"🚀 Starting model push to {self.repo_name}")
 
362
 
363
  # Validate model path
364
  if not self.validate_model_path():
@@ -399,7 +435,7 @@ MIT License
399
  if results:
400
  self.upload_training_results(str(self.model_path))
401
 
402
- # Log to Trackio
403
  self.log_to_trackio("model_push", {
404
  "model_path": str(self.model_path),
405
  "repo_name": self.repo_name,
@@ -409,6 +445,7 @@ MIT License
409
  })
410
 
411
  logger.info(f"🎉 Model successfully pushed to: https://huggingface.co/{self.repo_name}")
 
412
  return True
413
 
414
  def _load_training_config(self) -> Dict[str, Any]:
@@ -437,9 +474,11 @@ def parse_args():
437
 
438
  # Optional arguments
439
  parser.add_argument('--token', type=str, default=None, help='Hugging Face token')
 
440
  parser.add_argument('--private', action='store_true', help='Make repository private')
441
  parser.add_argument('--trackio-url', type=str, default=None, help='Trackio Space URL for logging')
442
  parser.add_argument('--experiment-name', type=str, default=None, help='Experiment name for Trackio')
 
443
 
444
  return parser.parse_args()
445
 
@@ -463,7 +502,9 @@ def main():
463
  token=args.token,
464
  private=args.private,
465
  trackio_url=args.trackio_url,
466
- experiment_name=args.experiment_name
 
 
467
  )
468
 
469
  # Push model
@@ -472,6 +513,8 @@ def main():
472
  if success:
473
  logger.info("✅ Model push completed successfully!")
474
  logger.info(f"🌐 View your model at: https://huggingface.co/{args.repo_name}")
 
 
475
  else:
476
  logger.error("❌ Model push failed!")
477
  return 1
 
1
  #!/usr/bin/env python3
2
  """
3
  Push Trained Model and Results to Hugging Face Hub
4
+ Integrates with Trackio monitoring and HF Datasets for complete model deployment
5
  """
6
 
7
  import os
 
23
  print("Warning: huggingface_hub not available. Install with: pip install huggingface_hub")
24
 
25
  try:
26
+ import sys
27
+ import os
28
+ sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..', 'src'))
29
  from monitoring import SmolLM3Monitor
30
  MONITORING_AVAILABLE = True
31
  except ImportError:
 
35
  logger = logging.getLogger(__name__)
36
 
37
  class HuggingFacePusher:
38
+ """Push trained models and results to Hugging Face Hub with HF Datasets integration"""
39
 
40
  def __init__(
41
  self,
 
44
  token: Optional[str] = None,
45
  private: bool = False,
46
  trackio_url: Optional[str] = None,
47
+ experiment_name: Optional[str] = None,
48
+ dataset_repo: Optional[str] = None,
49
+ hf_token: Optional[str] = None
50
  ):
51
  self.model_path = Path(model_path)
52
  self.repo_name = repo_name
53
+ self.token = token or hf_token or os.getenv('HF_TOKEN')
54
  self.private = private
55
  self.trackio_url = trackio_url
56
  self.experiment_name = experiment_name
57
 
58
+ # HF Datasets configuration
59
+ self.dataset_repo = dataset_repo or os.getenv('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments')
60
+ self.hf_token = hf_token or os.getenv('HF_TOKEN')
61
+
62
  # Initialize HF API
63
  if HF_AVAILABLE:
64
  self.api = HfApi(token=self.token)
 
67
 
68
  # Initialize monitoring if available
69
  self.monitor = None
70
+ if MONITORING_AVAILABLE:
71
  self.monitor = SmolLM3Monitor(
72
  experiment_name=experiment_name or "model_push",
73
  trackio_url=trackio_url,
74
+ enable_tracking=bool(trackio_url),
75
+ hf_token=self.hf_token,
76
+ dataset_repo=self.dataset_repo
77
  )
78
 
79
  logger.info(f"Initialized HuggingFacePusher for {repo_name}")
80
+ logger.info(f"Dataset repository: {self.dataset_repo}")
81
 
82
  def create_repository(self) -> bool:
83
  """Create the Hugging Face repository"""
 
143
  - **Fine-tuning Method**: Supervised Fine-tuning
144
  - **Training Date**: {datetime.now().strftime('%Y-%m-%d')}
145
  - **Model Size**: {self._get_model_size():.1f} GB
146
+ - **Dataset Repository**: {self.dataset_repo}
147
 
148
  ## Training Configuration
149
 
 
179
  - **Training Time**: {results.get('training_time_hours', 'Unknown')} hours
180
  - **Final Loss**: {results.get('final_loss', 'Unknown')}
181
  - **Final Accuracy**: {results.get('final_accuracy', 'Unknown')}
182
+ - **Dataset Repository**: {self.dataset_repo}
183
 
184
  ## Model Performance
185
 
 
187
  - **Validation Loss**: {results.get('eval_loss', 'Unknown')}
188
  - **Training Steps**: {results.get('total_steps', 'Unknown')}
189
 
190
+ ## Experiment Tracking
191
+
192
+ This model was trained with experiment tracking enabled. Training metrics and configuration are stored in the HF Dataset repository: `{self.dataset_repo}`
193
+
194
  ## Limitations and Biases
195
 
196
  This model is fine-tuned for specific tasks and may not generalize well to all use cases. Please evaluate the model's performance on your specific task before deployment.
 
311
  - **Model Size**: {self._get_model_size():.1f} GB
312
  - **Training Steps**: {results.get('total_steps', 'Unknown')}
313
  - **Final Loss**: {results.get('final_loss', 'Unknown')}
314
+ - **Dataset Repository**: {self.dataset_repo}
315
 
316
  ## Training Configuration
317
 
 
325
  {json.dumps(results, indent=2)}
326
  ```
327
 
328
+ ## Experiment Tracking
329
+
330
+ Training metrics and configuration are stored in the HF Dataset repository: `{self.dataset_repo}`
331
+
332
  ## Files
333
 
334
  - `pytorch_model.bin`: Model weights
 
350
  upload_file(
351
  path_or_fileobj=str(readme_path),
352
  path_in_repo="README.md",
353
+ token=self.token,
354
+ repo_id=self.repo_name
355
  )
356
 
357
  # Clean up
 
365
  return False
366
 
367
    def log_to_trackio(self, action: str, details: Dict[str, Any]):
        """Log a push action to Trackio and the configured HF Dataset repo.

        Best-effort by design: when no monitor is configured this is a no-op,
        and any logging failure is reported via logger.error instead of being
        raised, so a monitoring outage never blocks a model push.

        Args:
            action: Short identifier for the push step (e.g. "model_push").
            details: Extra key/value pairs merged into both log payloads.
        """
        if self.monitor:
            try:
                # Log to Trackio: a single metrics record describing this push.
                self.monitor.log_metrics({
                    "push_action": action,
                    "repo_name": self.repo_name,
                    "model_size_gb": self._get_model_size(),
                    "dataset_repo": self.dataset_repo,
                    **details
                })

                # Log training summary — presumably persisted by the monitor
                # into self.dataset_repo (see SmolLM3Monitor) — TODO confirm.
                self.monitor.log_training_summary({
                    "model_push": True,
                    "model_repo": self.repo_name,
                    "dataset_repo": self.dataset_repo,
                    "push_date": datetime.now().isoformat(),
                    **details
                })

                logger.info(f"✅ Logged {action} to Trackio and HF Datasets")
            except Exception as e:
                logger.error(f"❌ Failed to log to Trackio: {e}")
392
 
393
  def push_model(self, training_config: Optional[Dict[str, Any]] = None,
394
  results: Optional[Dict[str, Any]] = None) -> bool:
395
+ """Complete model push process with HF Datasets integration"""
396
  logger.info(f"🚀 Starting model push to {self.repo_name}")
397
+ logger.info(f"📊 Dataset repository: {self.dataset_repo}")
398
 
399
  # Validate model path
400
  if not self.validate_model_path():
 
435
  if results:
436
  self.upload_training_results(str(self.model_path))
437
 
438
+ # Log to Trackio and HF Datasets
439
  self.log_to_trackio("model_push", {
440
  "model_path": str(self.model_path),
441
  "repo_name": self.repo_name,
 
445
  })
446
 
447
  logger.info(f"🎉 Model successfully pushed to: https://huggingface.co/{self.repo_name}")
448
+ logger.info(f"📊 Experiment data stored in: {self.dataset_repo}")
449
  return True
450
 
451
  def _load_training_config(self) -> Dict[str, Any]:
 
474
 
475
  # Optional arguments
476
  parser.add_argument('--token', type=str, default=None, help='Hugging Face token')
477
+ parser.add_argument('--hf-token', type=str, default=None, help='Hugging Face token (alternative to --token)')
478
  parser.add_argument('--private', action='store_true', help='Make repository private')
479
  parser.add_argument('--trackio-url', type=str, default=None, help='Trackio Space URL for logging')
480
  parser.add_argument('--experiment-name', type=str, default=None, help='Experiment name for Trackio')
481
+ parser.add_argument('--dataset-repo', type=str, default=None, help='HF Dataset repository for experiment storage')
482
 
483
  return parser.parse_args()
484
 
 
502
  token=args.token,
503
  private=args.private,
504
  trackio_url=args.trackio_url,
505
+ experiment_name=args.experiment_name,
506
+ dataset_repo=args.dataset_repo,
507
+ hf_token=args.hf_token
508
  )
509
 
510
  # Push model
 
513
  if success:
514
  logger.info("✅ Model push completed successfully!")
515
  logger.info(f"🌐 View your model at: https://huggingface.co/{args.repo_name}")
516
+ if args.dataset_repo:
517
+ logger.info(f"📊 View experiment data at: https://huggingface.co/datasets/{args.dataset_repo}")
518
  else:
519
  logger.error("❌ Model push failed!")
520
  return 1
scripts/trackio_tonic/configure_trackio.py ADDED
@@ -0,0 +1,145 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Configuration script for Trackio environment variables
4
+ """
5
+
6
+ import os
7
+ import json
8
+ from datetime import datetime
9
+
10
def configure_trackio():
    """Interactively validate the Trackio environment configuration.

    Reads HF_TOKEN, TRACKIO_DATASET_REPO and SPACE_ID from the environment,
    optionally test-loads the experiments dataset (only when a token is set),
    writes the resolved configuration to ``trackio_config.json`` in the
    current directory, and prints setup guidance for an HF Space.

    Returns:
        None. All results are communicated via stdout and the config file.
    """

    print("🔧 Trackio Configuration")
    print("=" * 40)

    # Snapshot the relevant environment variables, with defaults.
    current_config = {
        'HF_TOKEN': os.environ.get('HF_TOKEN', 'Not set'),
        'TRACKIO_DATASET_REPO': os.environ.get('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments'),
        'SPACE_ID': os.environ.get('SPACE_ID', 'Not set')
    }

    print("📋 Current Configuration:")
    for key, value in current_config.items():
        status = "✅" if value != "Not set" else "❌"
        print(f" {status} {key}: {value}")

    print("\n🎯 Configuration Options:")
    print("1. Set HF_TOKEN - Required for dataset access")
    print("2. Set TRACKIO_DATASET_REPO - Dataset repository (optional)")
    print("3. Set SPACE_ID - HF Space ID (auto-detected)")

    # SPACE_ID is injected automatically when running on Hugging Face Spaces.
    if os.environ.get('SPACE_ID'):
        print("\n🚀 Running on Hugging Face Spaces")
        print(f" Space ID: {os.environ.get('SPACE_ID')}")

    print("\n🔍 Configuration Validation:")

    # Check HF_TOKEN
    if current_config['HF_TOKEN'] != 'Not set':
        print("✅ HF_TOKEN is set")
        print(" This allows the app to read/write to HF Datasets")
    else:
        print("❌ HF_TOKEN is not set")
        print(" Please set HF_TOKEN to enable dataset functionality")
        print(" Get your token from: https://huggingface.co/settings/tokens")

    # Check dataset repository
    dataset_repo = current_config['TRACKIO_DATASET_REPO']
    print(f"📊 Dataset Repository: {dataset_repo}")

    # Test dataset access only when a token is available; any failure here
    # (missing `datasets` package, missing repo, no network) is non-fatal.
    if current_config['HF_TOKEN'] != 'Not set':
        print("\n🧪 Testing Dataset Access...")
        try:
            from datasets import load_dataset

            dataset = load_dataset(dataset_repo, token=current_config['HF_TOKEN'])
            print(f"✅ Successfully loaded dataset: {dataset_repo}")

            # Show experiment count
            if 'train' in dataset:
                experiment_count = len(dataset['train'])
                print(f"📈 Found {experiment_count} experiments in dataset")

                # Show sample experiments
                if experiment_count > 0:
                    print("🔬 Sample experiments:")
                    # BUGFIX: slicing an HF Dataset (dataset['train'][:3])
                    # returns a dict of *columns*, so iterating it yields
                    # column names (strings) and row.get() would crash.
                    # Index row-by-row instead: dataset[i] is a per-row dict.
                    for i in range(min(3, experiment_count)):
                        row = dataset['train'][i]
                        exp_id = row.get('experiment_id', 'Unknown')
                        name = row.get('name', 'Unnamed')
                        print(f" {i+1}. {exp_id}: {name}")

        except Exception as e:
            print(f"❌ Failed to load dataset: {e}")
            print(" This might be normal if the dataset doesn't exist yet")

    # Persist the resolved configuration for reference.
    config_file = "trackio_config.json"
    config_data = {
        'hf_token': current_config['HF_TOKEN'],
        'dataset_repo': current_config['TRACKIO_DATASET_REPO'],
        'space_id': current_config['SPACE_ID'],
        'last_updated': datetime.now().isoformat(),
        'notes': 'Trackio configuration - set these as environment variables in your HF Space'
    }

    with open(config_file, 'w') as f:
        json.dump(config_data, f, indent=2)

    print(f"\n💾 Configuration saved to: {config_file}")

    # Show environment variable commands
    print("\n📝 Environment Variables for HF Space:")
    print("=" * 50)
    print(f"HF_TOKEN={current_config['HF_TOKEN']}")
    print(f"TRACKIO_DATASET_REPO={current_config['TRACKIO_DATASET_REPO']}")

    print("\n🎯 Next Steps:")
    print("1. Set HF_TOKEN in your HF Space environment variables")
    print("2. Optionally set TRACKIO_DATASET_REPO to use a different dataset")
    print("3. Deploy your updated app.py to the Space")
    print("4. Run setup_hf_dataset.py if you haven't created the dataset yet")
107
def show_usage_examples():
    """Print example TRACKIO_DATASET_REPO values for common setups."""

    print("\n📚 Usage Examples")
    print("=" * 30)

    # (title, repository, description) triples for the common scenarios.
    sample_repos = (
        ('Default Dataset', 'tonic/trackio-experiments',
         'Default dataset for your experiments'),
        ('Personal Dataset', 'your-username/trackio-experiments',
         'Your personal experiment dataset'),
        ('Team Dataset', 'your-org/team-experiments',
         'Shared dataset for team experiments'),
        ('Project Dataset', 'your-username/smollm3-experiments',
         'Dataset specific to SmolLM3 experiments'),
    )

    for idx, (title, repo, blurb) in enumerate(sample_repos, 1):
        print(f"{idx}. {title}")
        print(f" Repository: {repo}")
        print(f" Description: {blurb}")
        print(f" Set with: TRACKIO_DATASET_REPO={repo}")
        print()
142
+
143
# Script entry point: run the interactive configuration check, then print
# example dataset-repository settings the user can copy into their Space.
if __name__ == "__main__":
    configure_trackio()
    show_usage_examples()
deploy_trackio_space.py → scripts/trackio_tonic/deploy_trackio_space.py RENAMED
@@ -95,7 +95,7 @@ class TrackioSpaceDeployer:
95
 
96
  # Write README.md for the space
97
  space_readme = f"""---
98
- title: Trackio for Petite Elle L'Aime
99
  emoji: 🐠
100
  colorFrom: indigo
101
  colorTo: yellow
 
95
 
96
  # Write README.md for the space
97
  space_readme = f"""---
98
+ title: Trackio Tonic
99
  emoji: 🐠
100
  colorFrom: indigo
101
  colorTo: yellow
scripts/trackio_tonic/trackio_api_client.py ADDED
@@ -0,0 +1,286 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Trackio API Client for Hugging Face Spaces
4
+ Connects to the Trackio Space using the actual API endpoints
5
+ """
6
+
7
+ import requests
8
+ import json
9
+ import time
10
+ import logging
11
+ from typing import Dict, Any, Optional
12
+ from datetime import datetime
13
+
14
+ # Setup logging
15
+ logging.basicConfig(level=logging.INFO)
16
+ logger = logging.getLogger(__name__)
17
+
18
class TrackioAPIClient:
    """API client for a Trackio Hugging Face Space.

    Talks to the Space's Gradio API using its two-step call protocol:
    a POST to ``/gradio_api/call/<endpoint>`` returns an event id, and a
    follow-up GET on ``.../<event_id>`` returns the result, either as plain
    JSON or as a Server-Sent-Events (SSE) stream.
    """

    def __init__(self, space_url: str):
        # Normalise the URL so joining never produces a double slash.
        self.space_url = space_url.rstrip('/')
        self.base_url = f"{self.space_url}/gradio_api/call"

    def _make_api_call(self, endpoint: str, data: list, max_retries: int = 3) -> Dict[str, Any]:
        """POST the arguments to a Gradio endpoint, then GET the queued result.

        Retries up to ``max_retries`` times with exponential backoff and
        handles both JSON and SSE response bodies.

        Returns:
            ``{"success": True, "data": ...}`` on success, otherwise a dict
            with an ``"error"`` key (and possibly the raw payload).
        """
        url = f"{self.base_url}/{endpoint}"

        payload = {
            "data": data
        }

        for attempt in range(max_retries):
            try:
                logger.debug(f"Attempt {attempt + 1}: Making POST request to {url}")

                # Step 1: POST the call to obtain an EVENT_ID.
                response = requests.post(
                    url,
                    json=payload,
                    headers={"Content-Type": "application/json"},
                    timeout=30
                )

                if response.status_code != 200:
                    logger.error(f"POST request failed: {response.status_code} - {response.text}")
                    if attempt < max_retries - 1:
                        time.sleep(2 ** attempt)  # Exponential backoff
                        continue
                    return {"error": f"POST failed: {response.status_code}"}

                response_data = response.json()
                logger.debug(f"POST response: {response_data}")

                # Newer Gradio versions return "event_id"; older builds used "hash".
                if "event_id" in response_data:
                    event_id = response_data["event_id"]
                elif "hash" in response_data:
                    event_id = response_data["hash"]
                else:
                    logger.error(f"No event_id or hash in response: {response_data}")
                    return {"error": "No EVENT_ID in response"}

                # Step 2: GET the result for that event.
                get_url = f"{url}/{event_id}"
                logger.debug(f"Making GET request to: {get_url}")

                # Give the Space a moment to finish processing the call.
                time.sleep(1)

                get_response = requests.get(get_url, timeout=30)

                if get_response.status_code != 200:
                    logger.error(f"GET request failed: {get_response.status_code} - {get_response.text}")
                    if attempt < max_retries - 1:
                        time.sleep(2 ** attempt)
                        continue
                    return {"error": f"GET failed: {get_response.status_code}"}

                if not get_response.content:
                    logger.warning(f"Empty response from GET request (attempt {attempt + 1})")
                    if attempt < max_retries - 1:
                        time.sleep(2 ** attempt)
                        continue
                    return {"error": "Empty response from server"}

                response_text = get_response.text.strip()
                logger.debug(f"Raw response: {response_text}")

                # The result may be plain JSON or an SSE stream — try JSON first.
                try:
                    result_data = get_response.json()
                    logger.debug(f"Parsed as JSON: {result_data}")

                    if "data" in result_data and len(result_data["data"]) > 0:
                        return {"success": True, "data": result_data["data"][0]}
                    logger.warning(f"No data in JSON response (attempt {attempt + 1}): {result_data}")
                    if attempt < max_retries - 1:
                        time.sleep(2 ** attempt)
                        continue
                    return {"error": "No data in JSON response", "raw": result_data}

                except json.JSONDecodeError:
                    # SSE format: 'event: complete\ndata: ["message"]'
                    logger.debug("Response is not JSON, trying SSE format")

                    data_line = None
                    for line in response_text.split('\n'):
                        if line.startswith('data: '):
                            data_line = line[6:]  # Strip the 'data: ' prefix
                            break

                    if data_line is None:
                        logger.error("No data line found in SSE response")
                        if attempt < max_retries - 1:
                            time.sleep(2 ** attempt)
                            continue
                        return {"error": "No data line in SSE response", "raw": response_text}

                    try:
                        # BUGFIX: the SSE payload is JSON, so parse it with
                        # json.loads — ast.literal_eval chokes on JSON literals
                        # such as true/false/null. Keep a literal_eval fallback
                        # for Python-repr style payloads.
                        try:
                            data_array = json.loads(data_line)
                        except json.JSONDecodeError:
                            import ast
                            data_array = ast.literal_eval(data_line)

                        if isinstance(data_array, list) and len(data_array) > 0:
                            result_message = data_array[0]
                            logger.debug(f"Parsed SSE data: {result_message}")
                            return {"success": True, "data": result_message}
                        logger.warning(f"Invalid SSE data format (attempt {attempt + 1}): {data_array}")
                        if attempt < max_retries - 1:
                            time.sleep(2 ** attempt)
                            continue
                        return {"error": "Invalid SSE data format", "raw": data_array}

                    except (ValueError, SyntaxError) as e:
                        logger.error(f"Failed to parse SSE data: {e}")
                        logger.debug(f"Raw SSE data: {data_line}")
                        if attempt < max_retries - 1:
                            time.sleep(2 ** attempt)
                            continue
                        return {"error": f"Failed to parse SSE data: {e}"}

            except requests.exceptions.RequestException as e:
                logger.error(f"API call failed (attempt {attempt + 1}): {e}")
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)
                    continue
                return {"error": f"Request failed: {e}"}
            except Exception as e:
                logger.error(f"Unexpected error (attempt {attempt + 1}): {e}")
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)
                    continue
                return {"error": f"Unexpected error: {e}"}

        return {"error": f"Failed after {max_retries} attempts"}

    def _call_and_log(self, endpoint: str, args: list,
                      success_msg: str, failure_msg: str) -> Dict[str, Any]:
        """Run an API call and log success/failure uniformly (shared by all
        public wrappers to avoid nine copies of the same if/else)."""
        result = self._make_api_call(endpoint, args)
        if "success" in result:
            logger.info(f"{success_msg}: {result['data']}")
        else:
            logger.error(f"{failure_msg}: {result}")
        return result

    def create_experiment(self, name: str, description: str = "") -> Dict[str, Any]:
        """Create a new experiment"""
        logger.info(f"Creating experiment: {name}")
        return self._call_and_log(
            "create_experiment_interface", [name, description],
            "Experiment created successfully", "Failed to create experiment")

    def log_metrics(self, experiment_id: str, metrics: Dict[str, Any], step: Optional[int] = None) -> Dict[str, Any]:
        """Log metrics for an experiment"""
        metrics_json = json.dumps(metrics)
        # The Space API expects the step as a string; empty means "no step".
        step_str = str(step) if step is not None else ""

        logger.info(f"Logging metrics for experiment {experiment_id} at step {step}")
        return self._call_and_log(
            "log_metrics_interface", [experiment_id, metrics_json, step_str],
            "Metrics logged successfully", "Failed to log metrics")

    def log_parameters(self, experiment_id: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
        """Log parameters for an experiment"""
        parameters_json = json.dumps(parameters)

        logger.info(f"Logging parameters for experiment {experiment_id}")
        return self._call_and_log(
            "log_parameters_interface", [experiment_id, parameters_json],
            "Parameters logged successfully", "Failed to log parameters")

    def get_experiment_details(self, experiment_id: str) -> Dict[str, Any]:
        """Get experiment details"""
        logger.info(f"Getting details for experiment {experiment_id}")
        return self._call_and_log(
            "get_experiment_details_interface", [experiment_id],
            "Experiment details retrieved", "Failed to get experiment details")

    def list_experiments(self) -> Dict[str, Any]:
        """List all experiments"""
        logger.info("Listing experiments")
        return self._call_and_log(
            "list_experiments_interface", [],
            "Experiments listed successfully", "Failed to list experiments")

    def update_experiment_status(self, experiment_id: str, status: str) -> Dict[str, Any]:
        """Update experiment status"""
        logger.info(f"Updating experiment {experiment_id} status to {status}")
        return self._call_and_log(
            "update_experiment_status_interface", [experiment_id, status],
            "Experiment status updated successfully", "Failed to update experiment status")

    def simulate_training_data(self, experiment_id: str) -> Dict[str, Any]:
        """Simulate training data for testing"""
        logger.info(f"Simulating training data for experiment {experiment_id}")
        return self._call_and_log(
            "simulate_training_data_interface", [experiment_id],
            "Training data simulated successfully", "Failed to simulate training data")

    def get_training_metrics(self, experiment_id: str) -> Dict[str, Any]:
        """Get training metrics for an experiment"""
        logger.info(f"Getting training metrics for experiment {experiment_id}")
        return self._call_and_log(
            "get_training_metrics_interface", [experiment_id],
            "Training metrics retrieved", "Failed to get training metrics")

    def get_experiment_metrics_history(self, experiment_id: str) -> Dict[str, Any]:
        """Get experiment metrics history"""
        logger.info(f"Getting metrics history for experiment {experiment_id}")
        return self._call_and_log(
            "get_experiment_metrics_history_interface", [experiment_id],
            "Metrics history retrieved", "Failed to get metrics history")
run_a100_large_experiment.py → scripts/training/train.py RENAMED
File without changes
setup_launch.py ADDED
@@ -0,0 +1,283 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Setup script for the interactive SmolLM3 end-to-end fine-tuning pipeline
4
+ Helps users prepare for the interactive launch script
5
+ """
6
+
7
+ import os
8
+ import re
9
+ from pathlib import Path
10
+
11
def setup_launch_script():
    """Setup the launch.sh script with user configuration.

    The launch script is fully interactive, so there is nothing to template;
    this only verifies that launch.sh is present in the working directory.

    Returns:
        bool: True if launch.sh exists, False otherwise.
    """

    print("🚀 SmolLM3 Interactive End-to-End Fine-tuning Setup")
    print("=" * 60)

    print("\n📋 This setup will help you prepare for the interactive pipeline.")
    print("The launch script will now prompt you for all necessary information.")

    script = Path("launch.sh")
    if script.exists():
        print("\n✅ launch.sh found - no configuration needed!")
        print("The script is now interactive and will prompt you for all settings.")
        return True

    print("❌ launch.sh not found")
    return False
30
+
31
def create_requirements_check():
    """Write ``check_requirements.py``, a standalone script that verifies the
    Python version, required packages, and CUDA availability for fine-tuning.

    The file is written to the current working directory with UTF-8 encoding
    (the content contains emoji, so relying on the platform default encoding
    would crash on e.g. Windows/cp1252).
    """

    # Raw string so the \n escapes inside the generated prints stay literal.
    check_script = r'''#!/usr/bin/env python3
"""
Requirements check for SmolLM3 fine-tuning
"""

import sys
import subprocess

def check_requirements():
    """Check if all requirements are met"""

    print("🔍 Checking requirements...")

    # Check Python version
    if sys.version_info < (3, 8):
        print("❌ Python 3.8+ required")
        return False
    else:
        print(f"✅ Python {sys.version_info.major}.{sys.version_info.minor}")

    # Check required packages
    required_packages = [
        'torch',
        'transformers',
        'datasets',
        'accelerate',
        'trl',
        'huggingface_hub',
        'requests'
    ]

    missing_packages = []
    for package in required_packages:
        try:
            __import__(package)
            print(f"✅ {package}")
        except ImportError:
            print(f"❌ {package}")
            missing_packages.append(package)

    if missing_packages:
        print(f"\n📦 Install missing packages:")
        print(f"pip install {' '.join(missing_packages)}")
        return False

    # Check CUDA
    try:
        import torch
        if torch.cuda.is_available():
            print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
        else:
            print("⚠️ CUDA not available (training will be slower)")
    except Exception:
        # BUGFIX: was a bare `except:`, which also swallows SystemExit
        # and KeyboardInterrupt.
        print("⚠️ Could not check CUDA availability")

    print("\n✅ All requirements met!")
    return True

if __name__ == "__main__":
    check_requirements()
'''

    # Explicit encoding so the emoji-laden script is written portably.
    with open("check_requirements.py", 'w', encoding='utf-8') as f:
        f.write(check_script)

    print("✅ Created check_requirements.py")
100
+
101
def create_quick_start_guide():
    """Write QUICK_START_GUIDE.md describing how to run the interactive pipeline.

    The guide is written to the current working directory. The file is opened
    with an explicit UTF-8 encoding because the content contains emoji and
    other non-ASCII characters, which would raise UnicodeEncodeError under a
    non-UTF-8 locale default (e.g. cp1252 on Windows).
    """

    guide = """# SmolLM3 Interactive Pipeline - Quick Start Guide

## 🚀 Quick Start

### 1. Check Requirements
```bash
python check_requirements.py
```

### 2. Run the Interactive Pipeline
```bash
chmod +x launch.sh
./launch.sh
```

## 📋 What the Interactive Pipeline Does

The pipeline will guide you through:

1. **Authentication** - Enter your HF username and token
2. **Configuration Selection** - Choose from predefined training configs:
   - Basic Training (SmolLM3 + SmolTalk)
   - H100 Lightweight (Rapid training on H100)
   - A100 Large Scale (SmolLM3 + OpenHermes-FR)
   - Multiple Passes (Extended training)
   - Custom Configuration (User-defined)
3. **Experiment Setup** - Configure experiment name and repositories
4. **Training Parameters** - Adjust batch size, learning rate, etc.
5. **Deployment** - Automatic Trackio Space and HF Dataset setup
6. **Training** - Monitored fine-tuning with real-time tracking
7. **Model Push** - Upload to HF Hub with documentation

## 🎯 Available Training Configurations

### 1. Basic Training (Default)
- **Model**: SmolLM3-3B
- **Dataset**: SmolTalk
- **Epochs**: 3
- **Batch Size**: 2
- **Learning Rate**: 5e-6
- **Best for**: Quick experiments, learning

### 2. H100 Lightweight (Rapid)
- **Model**: SmolLM3-3B
- **Dataset**: OpenHermes-FR (80K samples)
- **Epochs**: 1
- **Batch Size**: 16
- **Learning Rate**: 8e-6
- **Sequence Length**: 8192
- **Best for**: Rapid training on H100

### 3. A100 Large Scale
- **Model**: SmolLM3-3B
- **Dataset**: OpenHermes-FR
- **Epochs**: 1.3 passes
- **Batch Size**: 8
- **Learning Rate**: 5e-6
- **Sequence Length**: 8192
- **Best for**: High-performance training

### 4. Multiple Passes
- **Model**: SmolLM3-3B
- **Dataset**: OpenHermes-FR
- **Epochs**: 4 passes
- **Batch Size**: 6
- **Learning Rate**: 3e-6
- **Sequence Length**: 8192
- **Best for**: Thorough training

### 5. Custom Configuration
- **User-defined parameters**
- **Flexible model and dataset selection**
- **Custom training parameters**

## 🔧 Prerequisites

1. **Hugging Face Account**
   - Create account at https://huggingface.co
   - Generate token at https://huggingface.co/settings/tokens

2. **System Requirements**
   - Python 3.8+
   - CUDA-compatible GPU (recommended)
   - 16GB+ RAM
   - 50GB+ storage

3. **Dependencies**
   - PyTorch with CUDA
   - Transformers
   - Datasets
   - Accelerate
   - TRL

## 📊 Expected Outputs

After running the pipeline, you'll have:

- **Model Repository**: `https://huggingface.co/your-username/smollm3-finetuned-YYYYMMDD`
- **Trackio Space**: `https://huggingface.co/spaces/your-username/trackio-monitoring-YYYYMMDD`
- **Experiment Dataset**: `https://huggingface.co/datasets/your-username/trackio-experiments`
- **Training Summary**: `training_summary.md`

## 🛠️ Troubleshooting

### Common Issues

1. **HF Token Issues**
   ```bash
   huggingface-cli whoami
   ```

2. **CUDA Issues**
   ```bash
   python -c "import torch; print(torch.cuda.is_available())"
   ```

3. **Memory Issues**
   - Reduce batch size in custom configuration
   - Increase gradient accumulation steps

4. **Network Issues**
   - Check internet connection
   - Verify HF token permissions

## 🎯 Tips for Success

1. **Start with Basic Training** for your first run
2. **Use H100 Lightweight** for rapid experiments on H100
3. **Use A100 Large Scale** for serious experiments
4. **Monitor in Trackio Space** for real-time progress
5. **Check logs** if something goes wrong
6. **Test the model** after training completes

## 📞 Support

- Check the troubleshooting section
- Review logs in `training.log`
- Monitor progress in Trackio Space
- Open an issue on GitHub

---

**Happy Fine-tuning! 🚀**
"""

    # Explicit UTF-8: the guide contains emoji/non-ASCII characters.
    with open("QUICK_START_GUIDE.md", 'w', encoding='utf-8') as f:
        f.write(guide)

    print("✅ Created QUICK_START_GUIDE.md")
253
+
254
def main():
    """Entry point for the setup helper.

    Builds the launch script via setup_launch_script(); on success also
    generates the requirement checker and the quick start guide, then prints
    the follow-up steps. On failure it only reports the error and returns.
    """
    print("Welcome to SmolLM3 Interactive End-to-End Fine-tuning Setup!")
    print("This will help you prepare for the interactive pipeline.")

    # Guard clause: bail out early if the launch script could not be created.
    if not setup_launch_script():
        print("\n❌ Setup failed. Please check your input and try again.")
        return

    create_requirements_check()
    create_quick_start_guide()

    print("\n🎉 Setup completed successfully!")
    print("\n📋 Files created:")
    print(" - check_requirements.py (requirement checker)")
    print(" - QUICK_START_GUIDE.md (usage guide)")

    print("\n🚀 Ready to start training!")
    print("Next steps:")
    print("1. Run: python check_requirements.py")
    print("2. Run: chmod +x launch.sh")
    print("3. Run: ./launch.sh")
    print("4. Follow the interactive prompts")

    print("\n📚 For detailed information, see:")
    print(" - QUICK_START_GUIDE.md")
    print(" - README_END_TO_END.md")

if __name__ == "__main__":
    main()
config.py → src/config.py RENAMED
File without changes
data.py → src/data.py RENAMED
File without changes
model.py → src/model.py RENAMED
File without changes
monitoring.py → src/monitoring.py RENAMED
@@ -1,6 +1,6 @@
1
  """
2
  Trackio Monitoring Integration for SmolLM3 Fine-tuning
3
- Provides comprehensive experiment tracking and monitoring capabilities
4
  """
5
 
6
  import os
@@ -13,7 +13,7 @@ from pathlib import Path
13
 
14
  # Import the real API client
15
  try:
16
- from trackio_api_client import TrackioAPIClient
17
  TRACKIO_AVAILABLE = True
18
  except ImportError:
19
  TRACKIO_AVAILABLE = False
@@ -22,7 +22,7 @@ except ImportError:
22
  logger = logging.getLogger(__name__)
23
 
24
  class SmolLM3Monitor:
25
- """Monitoring and tracking for SmolLM3 fine-tuning experiments"""
26
 
27
  def __init__(
28
  self,
@@ -32,7 +32,9 @@ class SmolLM3Monitor:
32
  enable_tracking: bool = True,
33
  log_artifacts: bool = True,
34
  log_metrics: bool = True,
35
- log_config: bool = True
 
 
36
  ):
37
  self.experiment_name = experiment_name
38
  self.enable_tracking = enable_tracking and TRACKIO_AVAILABLE
@@ -40,6 +42,10 @@ class SmolLM3Monitor:
40
  self.log_metrics_enabled = log_metrics # Rename to avoid conflict
41
  self.log_config_enabled = log_config # Rename to avoid conflict
42
 
 
 
 
 
43
  # Initialize experiment metadata first
44
  self.experiment_id = None
45
  self.start_time = datetime.now()
@@ -51,7 +57,33 @@ class SmolLM3Monitor:
51
  if self.enable_tracking:
52
  self._setup_trackio(trackio_url, trackio_token)
53
 
 
 
 
 
 
54
  logger.info("Initialized monitoring for experiment: %s", experiment_name)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
55
 
56
  def _setup_trackio(self, trackio_url: Optional[str], trackio_token: Optional[str]):
57
  """Setup Trackio API client"""
@@ -91,6 +123,44 @@ class SmolLM3Monitor:
91
  logger.error("Failed to initialize Trackio API: %s", e)
92
  self.enable_tracking = False
93
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
94
  def log_configuration(self, config: Dict[str, Any]):
95
  """Log experiment configuration"""
96
  if not self.enable_tracking or not self.log_config_enabled:
@@ -98,24 +168,30 @@ class SmolLM3Monitor:
98
 
99
  try:
100
  # Log configuration as parameters
101
- result = self.trackio_client.log_parameters(
102
- experiment_id=self.experiment_id,
103
- parameters=config
104
- )
105
-
106
- if "success" in result:
107
- # Also save config locally
108
- config_path = "config_{}_{}.json".format(
109
- self.experiment_name,
110
- self.start_time.strftime('%Y%m%d_%H%M%S')
111
  )
112
- with open(config_path, 'w') as f:
113
- json.dump(config, f, indent=2, default=str)
114
 
115
- self.artifacts.append(config_path)
116
- logger.info("Configuration logged to Trackio and saved to %s", config_path)
117
- else:
118
- logger.error("Failed to log configuration: %s", result)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
119
 
120
  except Exception as e:
121
  logger.error("Failed to log configuration: %s", e)
@@ -136,18 +212,26 @@ class SmolLM3Monitor:
136
  metrics['step'] = step
137
 
138
  # Log to Trackio
139
- result = self.trackio_client.log_metrics(
140
- experiment_id=self.experiment_id,
141
- metrics=metrics,
142
- step=step
143
- )
 
 
 
 
 
 
144
 
145
- if "success" in result:
146
- # Store locally
147
- self.metrics_history.append(metrics)
148
- logger.debug("Metrics logged: %s", metrics)
149
- else:
150
- logger.error("Failed to log metrics: %s", result)
 
 
151
 
152
  except Exception as e:
153
  logger.error("Failed to log metrics: %s", e)
@@ -166,16 +250,19 @@ class SmolLM3Monitor:
166
  "checkpoint_size": os.path.getsize(checkpoint_path) if os.path.exists(checkpoint_path) else 0
167
  }
168
 
169
- result = self.trackio_client.log_parameters(
170
- experiment_id=self.experiment_id,
171
- parameters=checkpoint_info
172
- )
 
 
 
 
 
 
173
 
174
- if "success" in result:
175
- self.artifacts.append(checkpoint_path)
176
- logger.info("Checkpoint logged: %s", checkpoint_path)
177
- else:
178
- logger.error("Failed to log checkpoint: %s", result)
179
 
180
  except Exception as e:
181
  logger.error("Failed to log checkpoint: %s", e)
@@ -245,25 +332,31 @@ class SmolLM3Monitor:
245
  summary['experiment_duration_seconds'] = duration
246
  summary['experiment_duration_hours'] = duration / 3600
247
 
248
- # Log final summary
249
- result = self.trackio_client.log_parameters(
250
- experiment_id=self.experiment_id,
251
- parameters=summary
252
- )
253
-
254
- if "success" in result:
255
- # Save summary locally
256
- summary_path = "training_summary_{}_{}.json".format(
257
- self.experiment_name,
258
- self.start_time.strftime('%Y%m%d_%H%M%S')
259
  )
260
- with open(summary_path, 'w') as f:
261
- json.dump(summary, f, indent=2, default=str)
262
 
263
- self.artifacts.append(summary_path)
264
- logger.info("Training summary logged and saved to %s", summary_path)
265
- else:
266
- logger.error("Failed to log training summary: %s", result)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
267
 
268
  except Exception as e:
269
  logger.error("Failed to log training summary: %s", e)
@@ -356,6 +449,10 @@ class SmolLM3Monitor:
356
  logger.error("Failed to close monitoring session: %s", result)
357
  except Exception as e:
358
  logger.error("Failed to close monitoring session: %s", e)
 
 
 
 
359
 
360
  # Utility function to create monitor from config
361
  def create_monitor_from_config(config, experiment_name: Optional[str] = None) -> SmolLM3Monitor:
@@ -370,5 +467,7 @@ def create_monitor_from_config(config, experiment_name: Optional[str] = None) ->
370
  enable_tracking=getattr(config, 'enable_tracking', True),
371
  log_artifacts=getattr(config, 'log_artifacts', True),
372
  log_metrics=getattr(config, 'log_metrics', True),
373
- log_config=getattr(config, 'log_config', True)
 
 
374
  )
 
1
  """
2
  Trackio Monitoring Integration for SmolLM3 Fine-tuning
3
+ Provides comprehensive experiment tracking and monitoring capabilities with HF Datasets support
4
  """
5
 
6
  import os
 
13
 
14
  # Import the real API client
15
  try:
16
+ from scripts.trackio_tonic.trackio_api_client import TrackioAPIClient
17
  TRACKIO_AVAILABLE = True
18
  except ImportError:
19
  TRACKIO_AVAILABLE = False
 
22
  logger = logging.getLogger(__name__)
23
 
24
  class SmolLM3Monitor:
25
+ """Monitoring and tracking for SmolLM3 fine-tuning experiments with HF Datasets support"""
26
 
27
  def __init__(
28
  self,
 
32
  enable_tracking: bool = True,
33
  log_artifacts: bool = True,
34
  log_metrics: bool = True,
35
+ log_config: bool = True,
36
+ hf_token: Optional[str] = None,
37
+ dataset_repo: Optional[str] = None
38
  ):
39
  self.experiment_name = experiment_name
40
  self.enable_tracking = enable_tracking and TRACKIO_AVAILABLE
 
42
  self.log_metrics_enabled = log_metrics # Rename to avoid conflict
43
  self.log_config_enabled = log_config # Rename to avoid conflict
44
 
45
+ # HF Datasets configuration
46
+ self.hf_token = hf_token or os.environ.get('HF_TOKEN')
47
+ self.dataset_repo = dataset_repo or os.environ.get('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments')
48
+
49
  # Initialize experiment metadata first
50
  self.experiment_id = None
51
  self.start_time = datetime.now()
 
57
  if self.enable_tracking:
58
  self._setup_trackio(trackio_url, trackio_token)
59
 
60
+ # Initialize HF Datasets client
61
+ self.hf_dataset_client = None
62
+ if self.hf_token:
63
+ self._setup_hf_datasets()
64
+
65
  logger.info("Initialized monitoring for experiment: %s", experiment_name)
66
+ logger.info("Dataset repository: %s", self.dataset_repo)
67
+
68
+ def _setup_hf_datasets(self):
69
+ """Setup HF Datasets client for persistent storage"""
70
+ try:
71
+ from datasets import Dataset
72
+ from huggingface_hub import HfApi
73
+
74
+ self.hf_dataset_client = {
75
+ 'Dataset': Dataset,
76
+ 'HfApi': HfApi,
77
+ 'api': HfApi(token=self.hf_token)
78
+ }
79
+ logger.info("✅ HF Datasets client initialized for %s", self.dataset_repo)
80
+
81
+ except ImportError:
82
+ logger.warning("⚠️ datasets or huggingface-hub not available. Install with: pip install datasets huggingface-hub")
83
+ self.hf_dataset_client = None
84
+ except Exception as e:
85
+ logger.error("Failed to initialize HF Datasets client: %s", e)
86
+ self.hf_dataset_client = None
87
 
88
  def _setup_trackio(self, trackio_url: Optional[str], trackio_token: Optional[str]):
89
  """Setup Trackio API client"""
 
123
  logger.error("Failed to initialize Trackio API: %s", e)
124
  self.enable_tracking = False
125
 
126
+ def _save_to_hf_dataset(self, experiment_data: Dict[str, Any]):
127
+ """Save experiment data to HF Dataset"""
128
+ if not self.hf_dataset_client:
129
+ return False
130
+
131
+ try:
132
+ # Convert experiment data to dataset format
133
+ dataset_data = [{
134
+ 'experiment_id': self.experiment_id or "exp_{}".format(datetime.now().strftime('%Y%m%d_%H%M%S')),
135
+ 'name': self.experiment_name,
136
+ 'description': "SmolLM3 fine-tuning experiment",
137
+ 'created_at': self.start_time.isoformat(),
138
+ 'status': 'running',
139
+ 'metrics': json.dumps(self.metrics_history),
140
+ 'parameters': json.dumps(experiment_data),
141
+ 'artifacts': json.dumps(self.artifacts),
142
+ 'logs': json.dumps([]),
143
+ 'last_updated': datetime.now().isoformat()
144
+ }]
145
+
146
+ # Create dataset
147
+ Dataset = self.hf_dataset_client['Dataset']
148
+ dataset = Dataset.from_list(dataset_data)
149
+
150
+ # Push to HF Hub
151
+ dataset.push_to_hub(
152
+ self.dataset_repo,
153
+ token=self.hf_token,
154
+ private=True
155
+ )
156
+
157
+ logger.info("✅ Saved experiment data to %s", self.dataset_repo)
158
+ return True
159
+
160
+ except Exception as e:
161
+ logger.error("Failed to save to HF Dataset: %s", e)
162
+ return False
163
+
164
  def log_configuration(self, config: Dict[str, Any]):
165
  """Log experiment configuration"""
166
  if not self.enable_tracking or not self.log_config_enabled:
 
168
 
169
  try:
170
  # Log configuration as parameters
171
+ if self.trackio_client:
172
+ result = self.trackio_client.log_parameters(
173
+ experiment_id=self.experiment_id,
174
+ parameters=config
 
 
 
 
 
 
175
  )
 
 
176
 
177
+ if "success" in result:
178
+ logger.info("Configuration logged to Trackio")
179
+ else:
180
+ logger.error("Failed to log configuration: %s", result)
181
+
182
+ # Save to HF Dataset
183
+ self._save_to_hf_dataset(config)
184
+
185
+ # Also save config locally
186
+ config_path = "config_{}_{}.json".format(
187
+ self.experiment_name,
188
+ self.start_time.strftime('%Y%m%d_%H%M%S')
189
+ )
190
+ with open(config_path, 'w') as f:
191
+ json.dump(config, f, indent=2, default=str)
192
+
193
+ self.artifacts.append(config_path)
194
+ logger.info("Configuration saved to %s", config_path)
195
 
196
  except Exception as e:
197
  logger.error("Failed to log configuration: %s", e)
 
212
  metrics['step'] = step
213
 
214
  # Log to Trackio
215
+ if self.trackio_client:
216
+ result = self.trackio_client.log_metrics(
217
+ experiment_id=self.experiment_id,
218
+ metrics=metrics,
219
+ step=step
220
+ )
221
+
222
+ if "success" in result:
223
+ logger.debug("Metrics logged to Trackio")
224
+ else:
225
+ logger.error("Failed to log metrics to Trackio: %s", result)
226
 
227
+ # Store locally
228
+ self.metrics_history.append(metrics)
229
+
230
+ # Save to HF Dataset periodically
231
+ if len(self.metrics_history) % 10 == 0: # Save every 10 metrics
232
+ self._save_to_hf_dataset({'metrics': self.metrics_history})
233
+
234
+ logger.debug("Metrics logged: %s", metrics)
235
 
236
  except Exception as e:
237
  logger.error("Failed to log metrics: %s", e)
 
250
  "checkpoint_size": os.path.getsize(checkpoint_path) if os.path.exists(checkpoint_path) else 0
251
  }
252
 
253
+ if self.trackio_client:
254
+ result = self.trackio_client.log_parameters(
255
+ experiment_id=self.experiment_id,
256
+ parameters=checkpoint_info
257
+ )
258
+
259
+ if "success" in result:
260
+ logger.info("Checkpoint logged to Trackio")
261
+ else:
262
+ logger.error("Failed to log checkpoint to Trackio: %s", result)
263
 
264
+ self.artifacts.append(checkpoint_path)
265
+ logger.info("Checkpoint logged: %s", checkpoint_path)
 
 
 
266
 
267
  except Exception as e:
268
  logger.error("Failed to log checkpoint: %s", e)
 
332
  summary['experiment_duration_seconds'] = duration
333
  summary['experiment_duration_hours'] = duration / 3600
334
 
335
+ # Log final summary to Trackio
336
+ if self.trackio_client:
337
+ result = self.trackio_client.log_parameters(
338
+ experiment_id=self.experiment_id,
339
+ parameters=summary
 
 
 
 
 
 
340
  )
 
 
341
 
342
+ if "success" in result:
343
+ logger.info("Training summary logged to Trackio")
344
+ else:
345
+ logger.error("Failed to log training summary to Trackio: %s", result)
346
+
347
+ # Save to HF Dataset
348
+ self._save_to_hf_dataset(summary)
349
+
350
+ # Save summary locally
351
+ summary_path = "training_summary_{}_{}.json".format(
352
+ self.experiment_name,
353
+ self.start_time.strftime('%Y%m%d_%H%M%S')
354
+ )
355
+ with open(summary_path, 'w') as f:
356
+ json.dump(summary, f, indent=2, default=str)
357
+
358
+ self.artifacts.append(summary_path)
359
+ logger.info("Training summary logged and saved to %s", summary_path)
360
 
361
  except Exception as e:
362
  logger.error("Failed to log training summary: %s", e)
 
449
  logger.error("Failed to close monitoring session: %s", result)
450
  except Exception as e:
451
  logger.error("Failed to close monitoring session: %s", e)
452
+
453
+ # Final save to HF Dataset
454
+ if self.hf_dataset_client:
455
+ self._save_to_hf_dataset({'status': 'completed'})
456
 
457
  # Utility function to create monitor from config
458
  def create_monitor_from_config(config, experiment_name: Optional[str] = None) -> SmolLM3Monitor:
 
467
  enable_tracking=getattr(config, 'enable_tracking', True),
468
  log_artifacts=getattr(config, 'log_artifacts', True),
469
  log_metrics=getattr(config, 'log_metrics', True),
470
+ log_config=getattr(config, 'log_config', True),
471
+ hf_token=getattr(config, 'hf_token', None),
472
+ dataset_repo=getattr(config, 'dataset_repo', None)
473
  )
train.py → src/train.py RENAMED
@@ -20,6 +20,7 @@ from config import get_config
20
  from model import SmolLM3Model
21
  from data import SmolLM3Dataset
22
  from trainer import SmolLM3Trainer
 
23
 
24
  def setup_logging():
25
  """Setup logging configuration"""
@@ -86,6 +87,12 @@ def parse_args():
86
  parser.add_argument('--experiment_name', type=str, default=None,
87
  help='Custom experiment name for tracking')
88
 
 
 
 
 
 
 
89
  return parser.parse_args()
90
 
91
  def main():
@@ -119,6 +126,12 @@ def main():
119
  if args.experiment_name is not None:
120
  config.experiment_name = args.experiment_name
121
 
 
 
 
 
 
 
122
  # Setup paths
123
  output_path = args.out_dir
124
 
@@ -127,6 +140,22 @@ def main():
127
 
128
  logger.info(f"Output path: {output_path}")
129
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
130
  # Initialize model
131
  model = SmolLM3Model(
132
  model_name=args.model_name,
@@ -162,13 +191,60 @@ def main():
162
  init_from=args.init_from
163
  )
164
 
 
 
 
 
 
 
 
 
 
165
  # Start training
166
  try:
167
  trainer.train()
168
  logger.info("Training completed successfully!")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
169
  except Exception as e:
170
  logger.error(f"Training failed: {e}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
171
  raise
 
 
 
 
 
 
 
 
172
 
173
  if __name__ == '__main__':
174
  main()
 
20
  from model import SmolLM3Model
21
  from data import SmolLM3Dataset
22
  from trainer import SmolLM3Trainer
23
+ from monitoring import create_monitor_from_config
24
 
25
  def setup_logging():
26
  """Setup logging configuration"""
 
87
  parser.add_argument('--experiment_name', type=str, default=None,
88
  help='Custom experiment name for tracking')
89
 
90
+ # HF Datasets arguments
91
+ parser.add_argument('--hf_token', type=str, default=None,
92
+ help='Hugging Face token for dataset access')
93
+ parser.add_argument('--dataset_repo', type=str, default=None,
94
+ help='HF Dataset repository for experiment storage')
95
+
96
  return parser.parse_args()
97
 
98
  def main():
 
126
  if args.experiment_name is not None:
127
  config.experiment_name = args.experiment_name
128
 
129
+ # Override HF Datasets configuration
130
+ if args.hf_token is not None:
131
+ os.environ['HF_TOKEN'] = args.hf_token
132
+ if args.dataset_repo is not None:
133
+ os.environ['TRACKIO_DATASET_REPO'] = args.dataset_repo
134
+
135
  # Setup paths
136
  output_path = args.out_dir
137
 
 
140
 
141
  logger.info(f"Output path: {output_path}")
142
 
143
+ # Initialize monitoring
144
+ monitor = None
145
+ if config.enable_tracking:
146
+ try:
147
+ monitor = create_monitor_from_config(config, args.experiment_name)
148
+ logger.info(f"✅ Monitoring initialized for experiment: {monitor.experiment_name}")
149
+ logger.info(f"📊 Dataset repository: {monitor.dataset_repo}")
150
+
151
+ # Log configuration
152
+ config_dict = {k: v for k, v in vars(config).items() if not k.startswith('_')}
153
+ monitor.log_configuration(config_dict)
154
+
155
+ except Exception as e:
156
+ logger.error(f"Failed to initialize monitoring: {e}")
157
+ logger.warning("Continuing without monitoring...")
158
+
159
  # Initialize model
160
  model = SmolLM3Model(
161
  model_name=args.model_name,
 
191
  init_from=args.init_from
192
  )
193
 
194
+ # Add monitoring callback if available
195
+ if monitor:
196
+ try:
197
+ callback = monitor.create_monitoring_callback()
198
+ trainer.add_callback(callback)
199
+ logger.info("✅ Monitoring callback added to trainer")
200
+ except Exception as e:
201
+ logger.error(f"Failed to add monitoring callback: {e}")
202
+
203
  # Start training
204
  try:
205
  trainer.train()
206
  logger.info("Training completed successfully!")
207
+
208
+ # Log training summary
209
+ if monitor:
210
+ try:
211
+ summary = {
212
+ 'final_loss': getattr(trainer, 'final_loss', None),
213
+ 'total_steps': getattr(trainer, 'total_steps', None),
214
+ 'training_duration': getattr(trainer, 'training_duration', None),
215
+ 'model_path': output_path,
216
+ 'config_file': args.config
217
+ }
218
+ monitor.log_training_summary(summary)
219
+ logger.info("✅ Training summary logged")
220
+ except Exception as e:
221
+ logger.error(f"Failed to log training summary: {e}")
222
+
223
  except Exception as e:
224
  logger.error(f"Training failed: {e}")
225
+
226
+ # Log error to monitoring
227
+ if monitor:
228
+ try:
229
+ error_summary = {
230
+ 'error': str(e),
231
+ 'status': 'failed',
232
+ 'model_path': output_path,
233
+ 'config_file': args.config
234
+ }
235
+ monitor.log_training_summary(error_summary)
236
+ except Exception as log_error:
237
+ logger.error(f"Failed to log error to monitoring: {log_error}")
238
+
239
  raise
240
+ finally:
241
+ # Close monitoring
242
+ if monitor:
243
+ try:
244
+ monitor.close()
245
+ logger.info("✅ Monitoring session closed")
246
+ except Exception as e:
247
+ logger.error(f"Failed to close monitoring: {e}")
248
 
249
  if __name__ == '__main__':
250
  main()
trainer.py → src/trainer.py RENAMED
File without changes
templates/datasets/readme.md ADDED
File without changes