Spaces: Tonic/SmolFactory

Tonic committed on Jul 19

Commit 5fe83da (verified) · 1 parent: 231fcd0

adds A100 large experiments

Files changed (26)

A100_LARGE_SCALE_GUIDE.md +195 -0
CLOUD_DEPLOYMENT_GUIDE.md +462 -0
CLOUD_TRAINING_GUIDE.md +440 -0
DEPLOYMENT_GUIDE.md +397 -0
PUSH_GUIDE.md +406 -0
README.md +14 -1
TRACKIO_INTEGRATION.md +252 -0
app.py +318 -0
cloud_deployment.sh +279 -0
config/__init__.py +19 -0
config/runpod_config.py +47 -0
config/train_smollm3.py +9 -0
config/train_smollm3_dpo.py +85 -28
config/train_smollm3_openhermes_fr.py +129 -0
config/train_smollm3_openhermes_fr_a100_large.py +161 -0
config/train_smollm3_openhermes_fr_a100_multiple_passes.py +164 -0
data.py +21 -1
deploy_trackio_space.py +235 -0
monitoring.py +298 -0
push_to_huggingface.py +486 -0
requirements.txt +8 -1
requirements_space.txt +18 -0
run_a100_large_experiment.py +134 -0
test_monitoring.py +181 -0
train.py +34 -4
trainer.py +41 -0

A100_LARGE_SCALE_GUIDE.md ADDED

	@@ -0,0 +1,195 @@

+# A100 Large Scale Training Guide
+This guide provides configurations and instructions for running fully-fledged experiments with multiple passes on the full OpenHermes-FR dataset (800k+ datapoints) using A100 GPUs.
+## Available Configurations
+### 1. A100 Large Batch Configuration
+**File**: `config/train_smollm3_openhermes_fr_a100_large.py`
+**Key Features**:
+- **Effective Batch Size**: 128 (8 × 16 gradient accumulation)
+- **Training Duration**: ~1.3 passes (8,000 steps)
+- **Learning Rate**: 5e-6 (optimized for large batches)
+- **Mixed Precision**: bf16 (A100 optimized)
+- **Sequence Length**: 8192 tokens
+- **Gradient Checkpointing**: Disabled (uses more memory but speeds up each step; the A100 80GB has the headroom)
+**Estimated Training Time**: ~6-8 hours on A100
+### 2. Multiple Passes Configuration
+**File**: `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`
+**Key Features**:
+- **Effective Batch Size**: 120 (6 × 20 gradient accumulation)
+- **Training Duration**: ~3.75 passes (25,000 steps)
+- **Learning Rate**: 3e-6 (conservative for long training)
+- **Warmup Steps**: 2000 (longer warmup for stability)
+- **Checkpoint Strategy**: Saves every 2000 steps (the large batch configuration saves every 1000)
+**Estimated Training Time**: ~20-24 hours on A100
+## Training Commands
+### Quick Start - Large Batch Experiment
+```bash
+python run_a100_large_experiment.py \
+    --config config/train_smollm3_openhermes_fr_a100_large.py \
+    --experiment-name "smollm3_openhermes_fr_large_batch" \
+    --output-dir ./outputs/large_batch
+```
+### Multiple Passes Experiment
+```bash
+python run_a100_large_experiment.py \
+    --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
+    --experiment-name "smollm3_openhermes_fr_multiple_passes" \
+    --output-dir ./outputs/multiple_passes
+```
+### Dry Run (Check Configuration)
+```bash
+python run_a100_large_experiment.py \
+    --config config/train_smollm3_openhermes_fr_a100_large.py \
+    --dry-run
+```
+### Resume Training
+```bash
+python run_a100_large_experiment.py \
+    --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
+    --resume ./outputs/multiple_passes/checkpoint-10000 \
+    --output-dir ./outputs/multiple_passes
+```
+## Configuration Details
+### Memory Usage Optimization
+- **Gradient Checkpointing**: Disabled; this costs memory but avoids recomputing activations, which the A100 80GB can afford
+- **Flash Attention**: Enabled for memory efficiency
+- **bf16 Mixed Precision**: Better for A100 than fp16
+- **Gradient Clipping**: 1.0 for stability
+- **Group by Length**: Enabled for better batching
+### Data Loading Optimization
+- **Num Workers**: 8 for faster data loading
+- **Pin Memory**: Enabled for GPU transfer efficiency
+- **Prefetch Factor**: 2 for pipeline optimization
+### Training Stability
+- **Conservative Learning Rate**: Lower LR for large effective batch sizes
+- **Longer Warmup**: More warmup steps for stability
+- **Higher Beta2**: 0.999 for AdamW stability
+- **Gradient Clipping**: Prevents gradient explosion
+## Expected Results
+### Large Batch Configuration (1.3 passes)
+- **Training Steps**: 8,000
+- **Effective Batch Size**: 128
+- **Steps per Epoch**: ~6,250
+- **Epochs**: ~1.3
+- **Expected Loss**: Should converge to ~1.5-2.0
+### Multiple Passes Configuration (~3.75 passes)
+- **Training Steps**: 25,000
+- **Effective Batch Size**: 120
+- **Steps per Epoch**: ~6,667
+- **Epochs**: ~3.75
+- **Expected Loss**: Should converge to ~1.2-1.5
+## Monitoring and Logging
+### Trackio Integration
+Both configurations include Trackio monitoring:
+- **Metrics Logging**: Every 25-50 steps
+- **Artifact Logging**: Model checkpoints
+- **Config Logging**: Training configuration
+### Checkpoint Strategy
+- **Large Batch**: Save every 1000 steps (8 checkpoints)
+- **Multiple Passes**: Save every 2000 steps (12 checkpoints)
+- **Best Model**: Automatically load best model at end
+## Hardware Requirements
+### Minimum Requirements
+- **GPU**: A100 80GB (or multiple A100s)
+- **RAM**: 64GB+ system RAM
+- **Storage**: 100GB+ for checkpoints and logs
+- **Network**: Fast internet for dataset download
+### Recommended Setup
+- **GPU**: 2-4x A100 80GB
+- **RAM**: 128GB+ system RAM
+- **Storage**: 500GB+ NVMe SSD
+- **Network**: 10Gbps+ connection
+## Troubleshooting
+### Out of Memory (OOM)
+If you encounter OOM errors:
+1. Reduce `batch_size` from 8 to 6 or 4
+2. Increase `gradient_accumulation_steps` to maintain effective batch size
+3. Reduce `max_seq_length` from 8192 to 4096
+### Slow Training
+If training is too slow:
+1. Increase `dataloader_num_workers` to 12-16
+2. Ensure you're using bf16 mixed precision
+3. Check that gradient checkpointing is disabled
+4. Verify flash attention is enabled
+### Convergence Issues
+If loss doesn't converge:
+1. Reduce learning rate by 2x
+2. Increase warmup steps
+3. Check gradient norms in logs
+4. Verify dataset quality
+## Customization
+### For Different Dataset Sizes
+Adjust `max_iters` based on your dataset size:
+```python
+# For 1M datapoints with effective batch size 120
+steps_per_epoch = 1000000 // 120  # ~8,333 steps
+max_iters = steps_per_epoch * desired_epochs
+```
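The arithmetic above can be wrapped in a small helper. The following is a minimal sketch, not code from this repository: `plan_training`, `passes_for`, and `TrainingPlan` are hypothetical names, and the sketch assumes a single GPU, so that one optimizer step consumes exactly `batch_size * gradient_accumulation_steps` samples.

```python
from dataclasses import dataclass


@dataclass
class TrainingPlan:
    effective_batch_size: int
    steps_per_epoch: int
    max_iters: int


def plan_training(dataset_size: int, batch_size: int,
                  gradient_accumulation_steps: int,
                  desired_epochs: float) -> TrainingPlan:
    """Derive max_iters from the dataset size and the desired number of passes.

    Hypothetical helper: assumes a single GPU, so one optimizer step
    consumes batch_size * gradient_accumulation_steps samples.
    """
    effective = batch_size * gradient_accumulation_steps
    steps_per_epoch = dataset_size // effective  # a partial last batch is ignored
    return TrainingPlan(effective, steps_per_epoch,
                        int(steps_per_epoch * desired_epochs))


def passes_for(max_iters: int, plan: TrainingPlan) -> float:
    """Number of passes over the data that max_iters steps correspond to."""
    return max_iters / plan.steps_per_epoch


if __name__ == "__main__":
    # Large batch configuration from this guide: 8 x 16 on ~800k datapoints
    large = plan_training(800_000, 8, 16, desired_epochs=1)
    print(large.steps_per_epoch, round(passes_for(8_000, large), 2))
```

Running the same helper with the multiple passes settings (batch size 6, accumulation 20) gives a quick check that a chosen `max_iters` matches the number of passes you intend.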
+### For Different GPU Memory
+Adjust batch size and gradient accumulation:
+```python
+# For 40GB A100
+batch_size = 4
+gradient_accumulation_steps = 32  # Effective batch size = 128
+# For 24GB GPU
+batch_size = 2
+gradient_accumulation_steps = 64  # Effective batch size = 128
+```
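When the per-device batch size has to shrink, the accumulation steps needed to keep the effective batch size can be computed rather than guessed. This is a sketch under the same single-GPU assumption; `accumulation_steps_for` is a hypothetical helper, not part of the repository.

```python
import math


def accumulation_steps_for(target_effective_batch: int, batch_size: int) -> int:
    """Gradient accumulation steps needed to reach at least the target effective batch size."""
    if batch_size <= 0 or target_effective_batch <= 0:
        raise ValueError("batch sizes must be positive")
    # Round up so the effective batch size never falls below the target
    return math.ceil(target_effective_batch / batch_size)


for per_device in (8, 4, 2):
    steps = accumulation_steps_for(128, per_device)
    print(f"batch_size={per_device} -> gradient_accumulation_steps={steps} "
          f"(effective {per_device * steps})")
```

Rounding up means a batch size that does not divide the target evenly (for example 6 against 128) yields a slightly larger effective batch rather than a smaller one.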
+## Performance Tips
+1. **Use bf16**: Better than fp16 for A100
+2. **Disable Gradient Checkpointing**: A100 has enough memory
+3. **Use Flash Attention**: Memory efficient attention
+4. **Group by Length**: Better batching efficiency
+5. **Pin Memory**: Faster GPU transfers
+6. **Multiple Workers**: Faster data loading
+## Expected Timeline
+- **Large Batch**: 6-8 hours for 1.3 passes
+- **Multiple Passes**: 20-24 hours for ~3.75 passes
+- **Full Dataset (5+ passes)**: 30+ hours
+## Next Steps
+After training completes:
+1. Evaluate on validation set
+2. Test generation quality
+3. Push to Hugging Face Hub
+4. Deploy for inference
+For deployment instructions, see `DEPLOYMENT_GUIDE.md`.

CLOUD_DEPLOYMENT_GUIDE.md ADDED

	@@ -0,0 +1,462 @@

+# Cloud Deployment Guide for SmolLM3 DPO Training
+This guide provides the exact sequence of commands to deploy and run SmolLM3 DPO training on a cloud computing instance with 6 epochs.
+## Prerequisites
+### Cloud Instance Requirements
+- **GPU**: NVIDIA GPU with 16GB+ VRAM (for example T4, V100, A10G, A100, or H100)
+- **RAM**: 64GB+ system memory
+- **Storage**: 100GB+ SSD storage
+- **OS**: Ubuntu 20.04 or 22.04
+### Required Information
+Before starting, gather these details:
+- Your Hugging Face username
+- Your Hugging Face token (with write permissions)
+- Your Trackio Space URL (if using monitoring)
+## Step-by-Step Deployment
+### Step 1: Launch Cloud Instance
+Choose your cloud provider and launch an instance:
+#### AWS (g5.2xlarge or g5.4xlarge)
+```bash
+# Launch instance with Ubuntu 22.04 and appropriate GPU
+aws ec2 run-instances \
+    --image-id ami-0c7217cdde317cfec \
+    --instance-type g5.2xlarge \
+    --key-name your-key-pair \
+    --security-group-ids sg-xxxxxxxxx
+```
+#### Google Cloud (n1-standard-8 with T4/V100)
+```bash
+gcloud compute instances create smollm3-dpo \
+    --zone=us-central1-a \
+    --machine-type=n1-standard-8 \
+    --accelerator="type=nvidia-tesla-t4,count=1" \
+    --maintenance-policy=TERMINATE \
+    --image-family=ubuntu-2204-lts \
+    --image-project=ubuntu-os-cloud
+```
+#### Azure (Standard_NC6s_v3)
+```bash
+az vm create \
+    --resource-group your-rg \
+    --name smollm3-dpo \
+    --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
+    --size Standard_NC6s_v3 \
+    --admin-username azureuser
+```
+### Step 2: Connect to Instance
+```bash
+# SSH to your instance
+ssh -i your-key.pem ubuntu@your-instance-ip
+# Or for Azure
+ssh azureuser@your-instance-ip
+```
+### Step 3: Update System and Install Dependencies
+```bash
+# Update system
+sudo apt-get update
+sudo apt-get upgrade -y
+# Install system dependencies
+sudo apt-get install -y git curl wget unzip python3 python3-pip python3-venv
+# Install the NVIDIA Container Toolkit (needed only for GPU containers; it does not install the GPU driver itself)
+curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
+curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
+    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
+    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+sudo apt-get update
+sudo apt-get install -y nvidia-container-toolkit
+```
+### Step 4: Clone Repository and Setup Environment
+```bash
+# Clone your repository
+git clone https://github.com/your-username/flexai-finetune.git
+cd flexai-finetune
+# Create virtual environment
+python3 -m venv smollm3_env
+source smollm3_env/bin/activate
+# Install PyTorch with CUDA
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+# Install project dependencies
+pip install -r requirements.txt
+# Install additional DPO dependencies
+pip install "trl>=0.7.0"
+pip install "peft>=0.4.0"
+pip install "accelerate>=0.20.0"
+```
+### Step 5: Configure Authentication
+```bash
+# Set your Hugging Face token
+export HF_TOKEN="your_huggingface_token_here"
+# Login to Hugging Face
+huggingface-cli login --token "$HF_TOKEN"
+```
+### Step 6: Create Configuration Files
+Create the DPO configuration file:
+```bash
+cat > config/train_smollm3_dpo_6epochs.py << 'EOF'
+"""
+SmolLM3 DPO Training Configuration - 6 Epochs
+Optimized for cloud deployment
+"""
+from config.train_smollm3_dpo import SmolLM3DPOConfig
+config = SmolLM3DPOConfig(
+    # Model configuration
+    model_name="HuggingFaceTB/SmolLM3-3B",
+    max_seq_length=4096,
+    use_flash_attention=True,
+    use_gradient_checkpointing=True,
+    # Training configuration
+    batch_size=2,
+    gradient_accumulation_steps=8,
+    learning_rate=5e-6,
+    weight_decay=0.01,
+    warmup_steps=100,
+    max_iters=None,  # Will be calculated based on epochs
+    eval_interval=100,
+    log_interval=10,
+    save_interval=500,
+    # DPO configuration
+    beta=0.1,
+    max_prompt_length=2048,
+    # Optimizer configuration
+    optimizer="adamw",
+    beta1=0.9,
+    beta2=0.95,
+    eps=1e-8,
+    # Scheduler configuration
+    scheduler="cosine",
+    min_lr=1e-6,
+    # Mixed precision
+    fp16=True,
+    bf16=False,
+    # Logging and saving
+    save_steps=500,
+    eval_steps=100,
+    logging_steps=10,
+    save_total_limit=3,
+    # Evaluation
+    eval_strategy="steps",
+    metric_for_best_model="eval_loss",
+    greater_is_better=False,
+    load_best_model_at_end=True,
+    # Data configuration
+    data_dir="smoltalk_dataset",
+    train_file="train.json",
+    validation_file="validation.json",
+    # Chat template configuration
+    use_chat_template=True,
+    chat_template_kwargs={
+        "enable_thinking": False,
+        "add_generation_prompt": True
+    },
+    # Trackio monitoring configuration
+    enable_tracking=True,
+    trackio_url="https://your-trackio-space.hf.space",  # Change this
+    trackio_token=None,
+    log_artifacts=True,
+    log_metrics=True,
+    log_config=True,
+    experiment_name="smollm3_dpo_6epochs"
+)
+EOF
+```
+### Step 7: Download and Prepare Dataset
+```bash
+# Create dataset preparation script
+cat > prepare_dataset.py << 'EOF'
+from datasets import load_dataset
+import json
+import os
+# Load SmolTalk dataset
+print('Loading SmolTalk dataset...')
+dataset = load_dataset('HuggingFaceTB/smoltalk')
+# Create dataset directory
+os.makedirs('smoltalk_dataset', exist_ok=True)
+# Convert to DPO format (preference pairs)
+# NOTE: this assumes each example already carries 'prompt', 'chosen', and 'rejected' fields.
+# If your dataset has no preference pairs, every example is skipped below; adapt this function.
+def convert_to_dpo_format(example):
+    return {
+        'prompt': example.get('prompt', ''),
+        'chosen': example.get('chosen', ''),
+        'rejected': example.get('rejected', '')
+    }
+def build_split(split):
+    # Keep only examples where all three DPO fields are non-empty
+    rows = (convert_to_dpo_format(example) for example in split)
+    return [row for row in rows if row['prompt'] and row['chosen'] and row['rejected']]
+# Process train and validation splits
+train_data = build_split(dataset['train'])
+val_data = build_split(dataset['validation'])
+# Save to files
+with open('smoltalk_dataset/train.json', 'w') as f:
+    json.dump(train_data, f, indent=2)
+with open('smoltalk_dataset/validation.json', 'w') as f:
+    json.dump(val_data, f, indent=2)
+print(f'Dataset prepared: {len(train_data)} train samples, {len(val_data)} validation samples')
+EOF
+# Run dataset preparation
+python prepare_dataset.py
+```
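Before training, it may be worth confirming that the prepared file actually contains usable preference pairs. The following is a minimal sketch, assuming the list-of-dicts JSON layout written by `prepare_dataset.py` above; `summarize_dpo_records` and `summarize_dpo_file` are hypothetical helpers, not part of the repository.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def summarize_dpo_records(records):
    """Count records that have a non-empty string for every DPO field."""
    valid = sum(
        1 for r in records
        if all(isinstance(r.get(f), str) and r[f].strip() for f in REQUIRED_FIELDS)
    )
    return {"total": len(records), "valid": valid, "invalid": len(records) - valid}


def summarize_dpo_file(path):
    """Load a JSON list of records from disk and summarize it."""
    with Path(path).open(encoding="utf-8") as f:
        return summarize_dpo_records(json.load(f))


if __name__ == "__main__":
    sample = [
        {"prompt": "Hi", "chosen": "Hello!", "rejected": "Go away."},
        {"prompt": "Hi", "chosen": "", "rejected": "Go away."},
    ]
    print(summarize_dpo_records(sample))
```

Calling `summarize_dpo_file("smoltalk_dataset/train.json")` after Step 7 would show whether any pairs survived the conversion; a `valid` count of zero suggests the source dataset does not expose the expected fields.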
+### Step 8: Calculate Training Parameters
+```bash
+# Calculate training steps based on epochs
+TOTAL_SAMPLES=$(python -c "import json; data=json.load(open('smoltalk_dataset/train.json')); print(len(data))")
+BATCH_SIZE=2
+GRADIENT_ACCUMULATION_STEPS=8
+MAX_EPOCHS=6
+EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
+STEPS_PER_EPOCH=$((TOTAL_SAMPLES / EFFECTIVE_BATCH_SIZE))
+MAX_STEPS=$((STEPS_PER_EPOCH * MAX_EPOCHS))
+echo "Training Configuration:"
+echo "  Total samples: $TOTAL_SAMPLES"
+echo "  Effective batch size: $EFFECTIVE_BATCH_SIZE"
+echo "  Steps per epoch: $STEPS_PER_EPOCH"
+echo "  Total training steps: $MAX_STEPS"
+echo "  Training epochs: $MAX_EPOCHS"
+```
+### Step 9: Start DPO Training
+```bash
+# Start training with all parameters
+python train.py config/train_smollm3_dpo_6epochs.py \
+    --dataset_dir smoltalk_dataset \
+    --out_dir /output-checkpoint \
+    --init_from scratch \
+    --max_iters $MAX_STEPS \
+    --batch_size $BATCH_SIZE \
+    --learning_rate 5e-6 \
+    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
+    --max_seq_length 4096 \
+    --save_steps 500 \
+    --eval_steps 100 \
+    --logging_steps 10 \
+    --enable_tracking \
+    --trackio_url "https://your-trackio-space.hf.space" \
+    --experiment_name "smollm3_dpo_6epochs"
+```
+### Step 10: Push Model to Hugging Face Hub
+```bash
+# Push the trained model
+python push_to_huggingface.py /output-checkpoint "your-username/smollm3-dpo-6epochs" \
+    --token "$HF_TOKEN" \
+    --trackio-url "https://your-trackio-space.hf.space" \
+    --experiment-name "smollm3_dpo_6epochs"
+```
+### Step 11: Test the Uploaded Model
+```bash
+# Test the model
+python -c "
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+print('Loading uploaded model...')
+model = AutoModelForCausalLM.from_pretrained('your-username/smollm3-dpo-6epochs', torch_dtype=torch.float16, device_map='auto')
+tokenizer = AutoTokenizer.from_pretrained('your-username/smollm3-dpo-6epochs')
+print('Testing model generation...')
+prompt = 'Hello, how are you?'
+inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(f'Prompt: {prompt}')
+print(f'Response: {response}')
+print('✅ Model test completed successfully!')
+"
+```
+## Complete One-Line Deployment
+If you want to run everything automatically, use the deployment script:
+```bash
+# Make script executable
+chmod +x cloud_deployment.sh
+# Edit configuration in the script first
+nano cloud_deployment.sh
+# Change these variables:
+# - REPO_NAME="your-username/smollm3-dpo-6epochs"
+# - TRACKIO_URL="https://your-trackio-space.hf.space"
+# - HF_TOKEN="your_hf_token_here"
+# Run the complete deployment
+./cloud_deployment.sh
+```
+## Monitoring and Debugging
+### Check GPU Usage
+```bash
+# Monitor GPU usage during training
+watch -n 1 nvidia-smi
+```
+### Check Training Logs
+```bash
+# Monitor training progress
+tail -f training.log
+# Check system resources
+htop
+```
+### Monitor Trackio
+```bash
+# Check if Trackio is logging properly
+curl -s "https://your-trackio-space.hf.space" | grep -i "experiment"
+```
+## Expected Timeline
+- **Setup**: 15-30 minutes
+- **Dataset preparation**: 5-10 minutes
+- **Training (6 epochs)**: 4-8 hours (depending on GPU)
+- **Model upload**: 10-30 minutes
+- **Testing**: 5-10 minutes
+## Troubleshooting
+### Common Issues
+#### 1. Out of Memory (OOM)
+```bash
+# Reduce batch size
+BATCH_SIZE=1
+GRADIENT_ACCUMULATION_STEPS=16
+# Or use gradient checkpointing
+# Already enabled in config
+```
+#### 2. Slow Training
+```bash
+# Check GPU utilization
+nvidia-smi
+# Check if mixed precision is working
+# Look for "fp16" in training logs
+```
+#### 3. Dataset Issues
+```bash
+# Check dataset format
+head -n 5 smoltalk_dataset/train.json
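+# Verify dataset size (count records, not lines: the file is a single indented JSON array)
+python -c "import json; print(len(json.load(open('smoltalk_dataset/train.json'))))"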
+```
+#### 4. Authentication Issues
+```bash
+# Test HF token
+python -c "
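+from huggingface_hub import HfApi
+api = HfApi(token='$HF_TOKEN')
+# whoami() contacts the Hub and raises an error if the token is rejected
+print('Token is valid for user:', api.whoami()['name'])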
+"
+```
+## Cost Estimation
+The hourly rates below are approximate; on-demand prices vary by region and change over time.
+### AWS (g5.2xlarge)
+- **Instance**: $0.526/hour
+- **Training time**: 6 hours
+- **Total cost**: ~$3.16
+### Google Cloud (n1-standard-8 + T4)
+- **Instance**: $0.38/hour
+- **Training time**: 6 hours
+- **Total cost**: ~$2.28
+### Azure (Standard_NC6s_v3)
+- **Instance**: $0.90/hour
+- **Training time**: 6 hours
+- **Total cost**: ~$5.40
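The totals above are simply the hourly rate multiplied by the training time. As a small sketch (the `estimate_cost` helper is hypothetical, and the rates are the illustrative figures quoted in this guide rather than current prices), the same arithmetic can be applied to the 4-8 hour range given in the expected timeline:

```python
def estimate_cost(hourly_rate_usd: float, hours: float) -> float:
    """Total on-demand cost in USD, rounded to cents."""
    return round(hourly_rate_usd * hours, 2)


# Rates are the illustrative figures from this guide, not current prices.
rates = {
    "AWS g5.2xlarge": 0.526,
    "GCP n1-standard-8 + T4": 0.38,
    "Azure Standard_NC6s_v3": 0.90,
}

for name, rate in rates.items():
    low, high = estimate_cost(rate, 4), estimate_cost(rate, 8)
    print(f"{name}: ${low:.2f} to ${high:.2f} for 4-8 hours of training")
```

Setup, dataset preparation, and upload time are billed as well, so the instance will usually run somewhat longer than the training time alone.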
+## Next Steps
+After successful deployment:
+1. **Monitor training** in your Trackio Space
+2. **Check model repository** on Hugging Face Hub
+3. **Test the model** with different prompts
+4. **Share your model** with the community
+5. **Iterate and improve** based on results
+## Support
+- **Training issues**: Check logs and GPU utilization
+- **Upload issues**: Verify HF token and repository permissions
+- **Monitoring issues**: Check Trackio Space configuration
+- **Performance issues**: Adjust batch size and learning rate
+Your SmolLM3 DPO model will be ready for use after training completes!

CLOUD_TRAINING_GUIDE.md ADDED

	@@ -0,0 +1,440 @@

+# Cloud Training Guide for OpenHermes-FR Dataset
+This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the [legmlai/openhermes-fr](https://huggingface.co/datasets/legmlai/openhermes-fr) dataset.
+## Overview
+The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, perfect for fine-tuning SmolLM3 models for French language tasks. This guide covers:
+- ✅ **Cloud Instance Setup** - Complete environment configuration
+- ✅ **Dataset Integration** - Automatic loading and filtering
+- ✅ **Training Configuration** - Optimized for French instruction tuning
+- ✅ **Monitoring Integration** - Trackio experiment tracking
+- ✅ **Model Deployment** - Push to Hugging Face Hub
+## Dataset Information
+### Schema
+```json
+{
+  "prompt": "Explique la différence entre la photosynthèse C3 et C4.",
+  "accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)",
+  "bad_prompt_detected": false,
+  "bad_response_detected": false,
+  "bad_entry": false
+}
+```
+### Key Features
+- **Size**: 799,875 examples (~1.4GB)
+- **Language**: 100% French
+- **Quality**: GPT-4o generated responses with automatic filtering
+- **License**: ODC-BY 1.0
+## Cloud Instance Setup
+### 1. Choose Your Cloud Provider
+#### **AWS EC2 (Recommended)**
+```bash
+# Launch instance with GPU
+# Recommended: g4dn.xlarge or g5.xlarge
+# AMI: Deep Learning AMI (Ubuntu 20.04)
+```
+#### **Google Cloud Platform**
+```bash
+# Launch instance with GPU
+# Recommended: n1-standard-4 with Tesla T4 or V100
+```
+#### **Azure**
+```bash
+# Launch instance with GPU
+# Recommended: Standard_NC6s_v3 or Standard_NC12s_v3
+```
+### 2. Instance Specifications
+#### **Minimum Requirements**
+- **GPU**: 16GB+ VRAM (Tesla T4, V100, or A100)
+- **RAM**: 32GB+ system memory
+- **Storage**: 100GB+ SSD
+- **CPU**: 8+ cores
+#### **Recommended Specifications**
+- **GPU**: A100 (40GB) or H100 (80GB)
+- **RAM**: 64GB+ system memory
+- **Storage**: 200GB+ NVMe SSD
+- **CPU**: 16+ cores
+### 3. Environment Setup
+```bash
+# Update system
+sudo apt update && sudo apt upgrade -y
+# Install CUDA (if not pre-installed)
+# Follow NVIDIA CUDA installation guide for your GPU
+# Install Python dependencies
+sudo apt install python3-pip python3-venv git -y
+# Create virtual environment
+python3 -m venv smollm3_env
+source smollm3_env/bin/activate
+# Clone repository
+git clone <your-repo-url>
+cd <your-repo-directory>
+# Install dependencies
+pip install -r requirements.txt
+# Install additional dependencies for cloud training
+pip install accelerate transformers datasets huggingface_hub
+```
+## Training Configuration
+### 1. Use the OpenHermes-FR Config
+The repository includes a specialized configuration for the OpenHermes-FR dataset:
+```bash
+python train.py config/train_smollm3_openhermes_fr.py \
+    --enable_tracking \
+    --trackio_url "https://your-space.hf.space" \
+    --experiment_name "smollm3_fr_openhermes_v1"
+```
+### 2. Configuration Details
+The `config/train_smollm3_openhermes_fr.py` includes:
+#### **Dataset Configuration**
+```python
+dataset_name: str = "legmlai/openhermes-fr"
+dataset_split: str = "train"
+input_field: str = "prompt"
+target_field: str = "accepted_completion"
+filter_bad_entries: bool = True
+bad_entry_field: str = "bad_entry"
+```
+#### **Training Optimization**
+```python
+batch_size: int = 2  # Reduced for French text (longer sequences)
+gradient_accumulation_steps: int = 8  # Maintains effective batch size
+learning_rate: float = 1e-5  # Lower for instruction tuning
+max_iters: int = 2000  # 2000 steps x 16 samples = 32,000 samples (~4% of the dataset); raise for fuller coverage
+```
+#### **Monitoring Integration**
+```python
+enable_tracking: bool = True
+experiment_name: str = "smollm3_openhermes_fr"
+```
+## Training Commands
+### Basic Training
+```bash
+python train.py config/train_smollm3_openhermes_fr.py
+```
+### Training with Monitoring
+```bash
+python train.py config/train_smollm3_openhermes_fr.py \
+    --enable_tracking \
+    --trackio_url "https://your-trackio-space.hf.space" \
+    --experiment_name "smollm3_fr_openhermes_v1"
+```
+### Training with Custom Parameters
+```bash
+python train.py config/train_smollm3_openhermes_fr.py \
+    --batch_size 4 \
+    --learning_rate 2e-5 \
+    --max_iters 3000 \
+    --enable_tracking \
+    --trackio_url "https://your-trackio-space.hf.space" \
+    --experiment_name "smollm3_fr_high_lr"
+```
+### Training with Checkpoint Resume
+```bash
+python train.py config/train_smollm3_openhermes_fr.py \
+    --init_from resume \
+    --enable_tracking \
+    --trackio_url "https://your-trackio-space.hf.space" \
+    --experiment_name "smollm3_fr_resume"
+```
+## Dataset Processing
+### Automatic Filtering
+The training script automatically:
+- ✅ **Loads** the OpenHermes-FR dataset from Hugging Face
+- ✅ **Filters** out bad entries (`bad_entry = true`)
+- ✅ **Splits** data into train/validation/test (98/1/1)
+- ✅ **Formats** prompts and completions for instruction tuning
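The filtering, splitting, and formatting steps listed above can be illustrated on plain Python dictionaries. This is a sketch of the idea only, not the repository's `data.py`: `filter_and_split` and `format_example` are hypothetical, the shuffle seed is arbitrary, and the user/assistant message layout is an assumption about how prompts and completions are formatted.

```python
import random


def filter_and_split(examples, bad_field="bad_entry", seed=42,
                     train_frac=0.98, val_frac=0.01):
    """Drop flagged rows, shuffle deterministically, and split train/val/test."""
    clean = [ex for ex in examples if not ex.get(bad_field, False)]
    random.Random(seed).shuffle(clean)
    n_train = round(len(clean) * train_frac)
    n_val = round(len(clean) * val_frac)
    return (clean[:n_train],
            clean[n_train:n_train + n_val],
            clean[n_train + n_val:])  # remainder becomes the test split


def format_example(ex, input_field="prompt", target_field="accepted_completion"):
    """Turn one row into a user/assistant message pair (assumed chat layout)."""
    return [
        {"role": "user", "content": ex[input_field]},
        {"role": "assistant", "content": ex[target_field]},
    ]


if __name__ == "__main__":
    rows = [{"prompt": f"q{i}", "accepted_completion": f"a{i}", "bad_entry": i % 10 == 0}
            for i in range(1000)]
    train, val, test = filter_and_split(rows)
    print(len(train), len(val), len(test))
    print(format_example(train[0]))
```

The field names match the `input_field`, `target_field`, and `bad_entry_field` settings shown in the configuration, so the same logic maps onto the real dataset columns.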
+### Manual Dataset Inspection
+```python
+from datasets import load_dataset
+# Load dataset
+dataset = load_dataset("legmlai/openhermes-fr")
+# Check dataset info
+print(f"Dataset size: {len(dataset['train'])}")
+print(f"Sample columns: {dataset['train'].column_names}")
+# Check filtering
+bad_entries = dataset['train'].filter(lambda x: x['bad_entry'])
+print(f"Bad entries: {len(bad_entries)}")
+# Sample data
+sample = dataset['train'][0]
+print(f"Prompt: {sample['prompt']}")
+print(f"Completion: {sample['accepted_completion']}")
+```
+## Monitoring and Tracking
+### Trackio Integration
+The training automatically logs:
+- **Training metrics**: Loss, accuracy, learning rate
+- **System metrics**: GPU memory, CPU usage
+- **Dataset info**: Size, filtering statistics
+- **Model checkpoints**: Regular saves with metadata
+### View Training Progress
+1. **Trackio Space**: Visit your Trackio Space URL
+2. **Experiment Details**: Check the "View Experiments" tab
+3. **Metrics**: Monitor loss curves and system usage
+4. **Logs**: Download training logs for analysis
+## Model Deployment
+### Push to Hugging Face Hub
+After training, deploy your model:
+```bash
+python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \
+    --trackio-url "https://your-trackio-space.hf.space" \
+    --experiment-name "smollm3_fr_openhermes_v1"
+```
+### Use Your Model
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load your fine-tuned model
+model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes")
+tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes")
+# Generate French text
+prompt = "Expliquez le concept de l'intelligence artificielle."
+inputs = tokenizer(prompt, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=200)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+## Performance Optimization
+### GPU Memory Management
+```bash
+# Monitor GPU usage
+nvidia-smi -l 1
+# Optimize for your GPU
+# For 16GB VRAM: batch_size=2, gradient_accumulation_steps=8
+# For 24GB VRAM: batch_size=4, gradient_accumulation_steps=4
+# For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2
+```
+### Training Speed
+```python
+# Use mixed precision (enabled by default)
+fp16: bool = True
+# Enable gradient checkpointing (enabled by default)
+use_gradient_checkpointing: bool = True
+# Use flash attention (enabled by default)
+use_flash_attention: bool = True
+```
+## Troubleshooting
+### Common Issues
+#### 1. **Out of Memory (OOM)**
+```bash
+# Reduce batch size
+python train.py config/train_smollm3_openhermes_fr.py --batch_size 1
+# Increase gradient accumulation
+# Edit config: gradient_accumulation_steps = 16
+```
+#### 2. **Slow Training**
+```bash
+# Check GPU utilization
+nvidia-smi
+# Verify data loading
+# Check if dataset is cached locally
+```
+#### 3. **Dataset Loading Issues**
+```bash
+# Clear cache
+rm -rf ~/.cache/huggingface/datasets  # clears only the datasets cache; your saved login token is kept
+# Check internet connection
+# Verify dataset name: "legmlai/openhermes-fr"
+```
+#### 4. **Monitoring Connection Issues**
+```bash
+# Test Trackio connection
+curl -I https://your-trackio-space.hf.space
+# Check token permissions
+# Verify experiment name format
+```
+### Debug Mode
+```bash
+# Enable debug logging
+export LOG_LEVEL=DEBUG
+python train.py config/train_smollm3_openhermes_fr.py
+```
+## Cost Optimization
+### Cloud Provider Tips
+#### **AWS EC2**
+- Use Spot Instances for cost savings
+- Monitor usage with CloudWatch
+- Use appropriate instance types
+#### **Google Cloud Platform**
+- Use Preemptible VMs for non-critical training
+- Monitor with Cloud Monitoring
+- Use committed use discounts
+#### **Azure**
+- Use Spot VMs for cost optimization
+- Monitor with Azure Monitor
+- Use reserved instances for long training
+### Training Time Estimates
+| GPU Type | Batch Size | Estimated Time |
+|----------|------------|----------------|
+| Tesla T4 (16GB) | 2 | 8-12 hours |
+| V100 (32GB) | 4 | 4-6 hours |
+| A100 (40GB) | 8 | 2-3 hours |
+| H100 (80GB) | 16 | 1-2 hours |
+## Security Best Practices
+### Token Management
+```bash
+# Use environment variables
+export HF_TOKEN="your_token_here"
+export TRACKIO_TOKEN="your_trackio_token"
+# Don't hardcode in scripts
+# Use IAM roles when possible
+```
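On the Python side, scripts can read these variables at startup instead of embedding the values. The following is a minimal sketch; `require_env` and `mask` are hypothetical helpers, and the token value set in the demo is a placeholder, not a real credential.

```python
import os


def require_env(name: str) -> str:
    """Return a non-empty environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running this script")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Mask a secret for logging, keeping only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


if __name__ == "__main__":
    # Placeholder for the demo only; in practice the variable is exported in the shell
    os.environ.setdefault("HF_TOKEN", "hf_example_token")
    print("Using HF token:", mask(require_env("HF_TOKEN")))
```

Failing early when a variable is missing avoids discovering an authentication problem only after training has finished, and masking keeps the full token out of logs.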
+### Data Privacy
+```bash
+# Use private repositories for sensitive models
+python push_to_huggingface.py model username/private-model --private
+# Secure your cloud instance
+# Use VPC and security groups
+```
+## Complete Workflow Example
+### 1. Setup Cloud Instance
+```bash
+# Launch GPU instance
+# Install dependencies
+git clone <your-repo>
+cd <your-repo>
+pip install -r requirements.txt
+```
+### 2. Train Model
+```bash
+python train.py config/train_smollm3_openhermes_fr.py \
+    --enable_tracking \
+    --trackio_url "https://your-space.hf.space" \
+    --experiment_name "smollm3_fr_v1"
+```
+### 3. Deploy Model
+```bash
+python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \
+    --trackio-url "https://your-space.hf.space" \
+    --experiment-name "smollm3_fr_v1"
+```
+### 4. Test Model
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1")
+tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1")
+# Test French generation
+prompt = "Qu'est-ce que l'apprentissage automatique?"
+inputs = tokenizer(prompt, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+## Support and Resources
+### Documentation
+- [OpenHermes-FR Dataset](https://huggingface.co/datasets/legmlai/openhermes-fr)
+- [SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
+- [Trackio Monitoring](https://github.com/Josephrp/trackio)
+### Community
+- [Hugging Face Forums](https://discuss.huggingface.co/)
+- [Transformers Documentation](https://huggingface.co/docs/transformers/)
+### Examples
+- [French Language Models](https://huggingface.co/models?search=french)
+- [Instruction Tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
+## Conclusion
+This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud:
+- ✅ **Complete Setup** - From cloud instance to model deployment
+- ✅ **Optimized Configuration** - Tailored for French instruction tuning
+- ✅ **Monitoring Integration** - Trackio experiment tracking
+- ✅ **Cost Optimization** - Tips for efficient cloud usage
+- ✅ **Troubleshooting** - Solutions for common issues
+Start training your French language model today!

DEPLOYMENT_GUIDE.md ADDED

	@@ -0,0 +1,397 @@

+# Trackio Deployment Guide for Hugging Face Spaces
+This guide provides step-by-step instructions for deploying Trackio experiment tracking to Hugging Face Spaces and integrating it with your SmolLM3 fine-tuning pipeline.
+## Prerequisites
+- Hugging Face account
+- Hugging Face CLI installed (`pip install huggingface_hub`)
+- Git configured with your Hugging Face credentials
+## Method 1: Automated Deployment (Recommended)
+### Step 1: Run the Deployment Script
+```bash
+python deploy_trackio_space.py
+```
+The script will prompt you for:
+- Your Hugging Face username
+- Space name (e.g., `trackio-monitoring`)
+- Hugging Face token (with write permissions)
+### Step 2: Wait for Build
+After deployment, wait 2-5 minutes for the Space to build and become available.
+### Step 3: Test the Interface
+Visit your Space URL to test the interface:
+```
+https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
+```
+## Method 2: Manual Deployment
+### Step 1: Create a New Space
+1. Go to https://huggingface.co/spaces
+2. Click "Create new Space"
+3. Configure the Space:
+   - **Owner**: Your username
+   - **Space name**: `trackio-monitoring` (or your preferred name)
+   - **SDK**: Gradio
+   - **Hardware**: CPU (Basic)
+   - **License**: MIT
+### Step 2: Upload Files
+Upload these files to your Space:
+#### `app.py`
+The main Gradio interface (already created in this repository)
+#### `requirements_space.txt`
+```
+gradio>=4.0.0
+gradio-client>=0.10.0
+requests>=2.31.0
+numpy>=1.24.0
+pandas>=2.0.0
+jsonschema>=4.17.0
+plotly>=5.15.0
+matplotlib>=3.7.0
+python-dotenv>=1.0.0
+```
+#### `README.md`
+````markdown
+# Trackio Experiment Tracking
+A Gradio interface for experiment tracking and monitoring.
+## Features
+- Create and manage experiments
+- Log training metrics and parameters
+- View experiment details and results
+- Update experiment status
+## Usage
+1. Create a new experiment using the "Create Experiment" tab
+2. Log metrics during training using the "Log Metrics" tab
+3. View experiment details using the "View Experiments" tab
+4. Update experiment status using the "Update Status" tab
+## Integration
+To connect your training script to this Trackio Space:
+```python
+from monitoring import SmolLM3Monitor
+monitor = SmolLM3Monitor(
+    experiment_name="my_experiment",
+    trackio_url="https://your-space.hf.space",
+    enable_tracking=True
+)
+```
+````
+### Step 3: Configure Space Settings
+In your Space settings, ensure:
+- **App file**: `app.py`
+- **Python version**: 3.9 or higher
+- **Hardware**: CPU (Basic) is sufficient
+## Integration with Your Training Script
+### Step 1: Update Your Configuration
+Add Trackio settings to your training configuration:
+```python
+# config/train_smollm3.py
+from dataclasses import dataclass
+from typing import Optional
+@dataclass
+class SmolLM3Config:
+    # ... existing settings ...
+    # Trackio monitoring configuration
+    enable_tracking: bool = True
+    trackio_url: Optional[str] = None  # Your Space URL
+    trackio_token: Optional[str] = None
+    log_artifacts: bool = True
+    log_metrics: bool = True
+    log_config: bool = True
+    experiment_name: Optional[str] = None
+```
+### Step 2: Run Training with Trackio
+```bash
+python train.py config/train_smollm3.py \
+    --dataset_dir my_dataset \
+    --enable_tracking \
+    --trackio_url "https://your-username-trackio-monitoring.hf.space" \
+    --experiment_name "smollm3_finetune_v1"
+```
+### Step 3: Monitor Your Experiments
+1. **Create Experiment**: Use the "Create Experiment" tab in your Space
+2. **Log Metrics**: Your training script will automatically log metrics
+3. **View Results**: Use the "View Experiments" tab to see progress
+4. **Update Status**: Mark experiments as completed when done
+## Advanced Configuration
+### Environment Variables
+You can set Trackio configuration via environment variables:
+```bash
+export TRACKIO_URL="https://your-space.hf.space"
+export TRACKIO_TOKEN="your_token_here"
+```
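A minimal sketch of how a training script might pick these variables up, assuming explicit arguments should take precedence over the environment. The helper name `resolve_trackio_settings` is hypothetical and is not part of `monitoring.py`; only the variable names `TRACKIO_URL` and `TRACKIO_TOKEN` come from this guide.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_trackio_settings(
    url: Optional[str] = None,
    token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (url, token), preferring explicit arguments over the environment.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    resolved_url = url or env.get("TRACKIO_URL")
    resolved_token = token or env.get("TRACKIO_TOKEN")
    # Strip a trailing slash so later URL joins do not produce "//".
    if resolved_url:
        resolved_url = resolved_url.rstrip("/")
    return resolved_url, resolved_token
```

Passing `env` explicitly keeps the function easy to test; in a real run it falls back to `os.environ`.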
+### Custom Experiment Names
+```bash
+python train.py config/train_smollm3.py \
+    --experiment_name "smollm3_high_lr_experiment" \
+    --trackio_url "https://your-space.hf.space"
+```
+### Multiple Experiments
+You can run multiple experiments and track them separately:
+```bash
+# Experiment 1
+python train.py config/train_smollm3.py \
+    --experiment_name "smollm3_baseline" \
+    --learning_rate 2e-5
+# Experiment 2
+python train.py config/train_smollm3.py \
+    --experiment_name "smollm3_high_lr" \
+    --learning_rate 5e-5
+```
+## Using the Trackio Interface
+### Creating Experiments
+1. Go to the "Create Experiment" tab
+2. Enter experiment name (e.g., "smollm3_finetune_v1")
+3. Add description (optional)
+4. Click "Create Experiment"
+5. Note the experiment ID for logging metrics
+### Logging Metrics
+1. Go to the "Log Metrics" tab
+2. Enter your experiment ID
+3. Add metrics in JSON format:
+   ```json
+   {
+     "loss": 0.5,
+     "accuracy": 0.85,
+     "learning_rate": 2e-5
+   }
+   ```
+4. Add step number (optional)
+5. Click "Log Metrics"
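The "Log Metrics" tab parses the metrics field with `json.loads`, so a malformed payload is rejected. The sketch below checks a payload locally before you paste it in; `parse_metrics` is a hypothetical helper, and the numeric-only rule is an assumption about what is useful to plot rather than something the Space enforces.

```python
import json
from typing import Dict


def parse_metrics(metrics_json: str) -> Dict[str, float]:
    """Parse a metrics payload and check that every value is a number."""
    metrics = json.loads(metrics_json)  # raises json.JSONDecodeError on bad syntax
    if not isinstance(metrics, dict):
        raise ValueError("metrics must be a JSON object")
    for name, value in metrics.items():
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"metric {name!r} is not numeric: {value!r}")
    return metrics


print(parse_metrics('{"loss": 0.5, "accuracy": 0.85, "learning_rate": 2e-5}'))
```

Scientific notation such as `2e-5` is valid JSON, so learning rates can be written the same way as in the training configuration.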
+### Viewing Experiments
+1. Go to the "View Experiments" tab
+2. Enter experiment ID to view specific experiment
+3. Or click "List All Experiments" to see all experiments
+### Updating Status
+1. Go to the "Update Status" tab
+2. Enter experiment ID
+3. Select new status (running, completed, failed, paused)
+4. Click "Update Status"
+## Troubleshooting
+### Common Issues
+#### 1. Space Not Building
+- Check that all required files are uploaded
+- Verify `app.py` is the main file
+- Check the Space logs for errors
+#### 2. Connection Errors
+- Verify your Space URL is correct
+- Check that the Space is running (not paused)
+- Ensure your training script can reach the Space URL
+#### 3. Missing Metrics
+- Check that `enable_tracking=True` in your config
+- Verify the Trackio URL is correct
+- Check training logs for monitoring errors
+#### 4. Authentication Issues
+- If using tokens, verify they're correct
+- Check Hugging Face account permissions
+- Ensure Space is public or you have access
+### Debug Mode
+Enable debug logging in your training script:
+```python
+import logging
+logging.basicConfig(level=logging.DEBUG)
+```
+### Manual Testing
+Test the Trackio interface manually:
+1. Create an experiment
+2. Log some test metrics
+3. View the experiment details
+4. Update the status
+## Security Considerations
+### Public vs Private Spaces
+- **Public Spaces**: Anyone can view and use the interface
+- **Private Spaces**: Only you and collaborators can access
+### Token Management
+- Store tokens securely (environment variables)
+- Don't commit tokens to version control
+- Use Hugging Face's token management
+### Data Privacy
+- Trackio stores experiment data in the Space; the `app.py` in this repository keeps it in memory, so it does not survive a Space restart
+- Consider data retention policies
+- Be mindful of sensitive information in experiment names
+## Performance Optimization
+### Space Configuration
+- Use CPU (Basic) for the interface (sufficient for tracking)
+- Consider GPU only for actual training
+- Monitor Space usage and limits
+### Efficient Logging
+- Log metrics at reasonable intervals (every 10-100 steps)
+- Avoid logging too frequently to prevent rate limiting
+- Use batch logging when possible
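One way to follow these points is to buffer per-step values and forward only their mean every N steps. This is a sketch under assumptions: `IntervalLogger` is a hypothetical class, and `log_fn` stands for any callable shaped like the `monitor.log_metrics(metrics, step=...)` call used elsewhere in this guide.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class IntervalLogger:
    """Accumulate per-step metrics and forward their mean every `interval` steps."""

    def __init__(self, log_fn: Callable[..., None], interval: int = 50):
        if interval < 1:
            raise ValueError("interval must be >= 1")
        self.log_fn = log_fn
        self.interval = interval
        self._buffer: Dict[str, List[float]] = defaultdict(list)

    def record(self, metrics: Dict[str, float], step: int) -> None:
        """Buffer one step of metrics; flush when the step hits the interval."""
        for name, value in metrics.items():
            self._buffer[name].append(float(value))
        if step % self.interval == 0:
            self.flush(step)

    def flush(self, step: int) -> None:
        """Send the mean of everything buffered so far, then clear the buffer."""
        if not self._buffer:
            return
        averaged = {k: sum(v) / len(v) for k, v in self._buffer.items()}
        self.log_fn(averaged, step=step)
        self._buffer.clear()
```

Call `flush` once more at the end of training so the last partial interval is not dropped.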
+## Monitoring Best Practices
+### Experiment Naming
+Use descriptive names:
+- `smollm3_baseline_v1`
+- `smollm3_high_lr_experiment`
+- `smollm3_dpo_training`
+### Metric Logging
+Log relevant metrics:
+- Training loss
+- Validation loss
+- Learning rate
+- GPU memory usage
+- Training time
+### Status Management
+- Mark experiments as "running" when starting
+- Update to "completed" when finished
+- Mark as "failed" if errors occur
+- Use "paused" for temporary stops
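The status rules above can be automated with a small context manager. This is only a sketch: `tracked_run` and its `set_status` callback are hypothetical, and treating Ctrl+C as "paused" rather than "failed" is a design choice you may want to change.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def tracked_run(set_status: Callable[[str], None]) -> Iterator[None]:
    """Mark a run as running, then completed, failed, or paused by outcome."""
    set_status("running")
    try:
        yield
    except KeyboardInterrupt:
        # A manual interrupt is treated as a temporary stop.
        set_status("paused")
        raise
    except Exception:
        # Any other error marks the run as failed; re-raise so it stays visible.
        set_status("failed")
        raise
    else:
        set_status("completed")
```

Wrap the training call in `with tracked_run(callback):` and wire `callback` to whatever updates the experiment status in your setup.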
+## Integration Examples
+### Basic Integration
+```python
+from monitoring import SmolLM3Monitor
+# Initialize monitor
+monitor = SmolLM3Monitor(
+    experiment_name="my_experiment",
+    trackio_url="https://your-space.hf.space",
+    enable_tracking=True
+)
+# Log configuration
+monitor.log_config(config_dict)
+# Log metrics during training
+monitor.log_metrics({"loss": 0.5}, step=100)
+# Log final results
+monitor.log_training_summary(final_results)
+```
+### Advanced Integration
+```python
+# Custom monitoring setup
+monitor = SmolLM3Monitor(
+    experiment_name="smollm3_advanced",
+    trackio_url="https://your-space.hf.space",
+    enable_tracking=True,
+    log_artifacts=True,
+    log_metrics=True,
+    log_config=True
+)
+# Log system metrics
+monitor.log_system_metrics(step=current_step)
+# Log model checkpoint
+monitor.log_model_checkpoint("checkpoint-1000", step=1000)
+# Log evaluation results
+monitor.log_evaluation_results(eval_results, step=1000)
+```
+## Support and Resources
+### Documentation
+- [Hugging Face Spaces Documentation](https://huggingface.co/docs/hub/spaces)
+- [Gradio Documentation](https://gradio.app/docs/)
+- [Trackio GitHub Repository](https://github.com/Josephrp/trackio)
+### Community
+- [Hugging Face Forums](https://discuss.huggingface.co/)
+- [Gradio Discord](https://discord.gg/feTf9z3Z)
+### Issues and Feedback
+- Report issues on the project repository
+- Provide feedback on the Trackio interface
+- Suggest improvements for the monitoring system
+## Conclusion
+You now have a complete Trackio monitoring system deployed on Hugging Face Spaces! This setup provides:
+- ✅ Easy experiment tracking and monitoring
+- ✅ Real-time metric logging
+- ✅ Web-based interface for experiment management
+- ✅ Integration with your SmolLM3 fine-tuning pipeline
+- ✅ Scalable and accessible monitoring solution
+Start tracking your experiments and gain insights into your model training process!

PUSH_GUIDE.md ADDED Viewed

	@@ -0,0 +1,406 @@

+# Push to Hugging Face Hub Guide
+This guide explains how to use the `push_to_huggingface.py` script to upload your trained SmolLM3 models and results to Hugging Face Hub.
+## Features
+- ✅ **Automatic Repository Creation** - Creates HF repositories automatically
+- ✅ **Model Validation** - Validates required model files before upload
+- ✅ **Comprehensive Model Cards** - Generates detailed model documentation
+- ✅ **Training Results Upload** - Uploads logs, configs, and results
+- ✅ **Trackio Integration** - Logs push actions to your monitoring system
+- ✅ **Private/Public Repositories** - Support for both private and public models
+## Prerequisites
+### 1. Install Dependencies
+```bash
+pip install huggingface_hub
+```
+### 2. Set Up Hugging Face Token
+```bash
+# Option 1: Environment variable
+export HF_TOKEN="your_huggingface_token_here"
+# Option 2: Use --token argument
+python push_to_huggingface.py model_path repo_name --token "your_token"
+```
+### 3. Get Your Hugging Face Token
+1. Go to https://huggingface.co/settings/tokens
+2. Click "New token"
+3. Give it a name (e.g., "model-upload")
+4. Select "Write" permissions
+5. Copy the token
+## Basic Usage
+### Simple Model Push
+```bash
+python push_to_huggingface.py /path/to/model username/model-name
+```
+### Push with Custom Token
+```bash
+python push_to_huggingface.py /path/to/model username/model-name \
+    --token "hf_your_token_here"
+```
+### Push Private Model
+```bash
+python push_to_huggingface.py /path/to/model username/model-name \
+    --private
+```
+### Push with Trackio Integration
+```bash
+python push_to_huggingface.py /path/to/model username/model-name \
+    --trackio-url "https://your-space.hf.space" \
+    --experiment-name "my_experiment"
+```
+## Complete Workflow Example
+### 1. Train Your Model
+```bash
+python train.py config/train_smollm3.py \
+    --dataset_dir my_dataset \
+    --enable_tracking \
+    --trackio_url "https://your-space.hf.space" \
+    --experiment_name "smollm3_finetune_v1"
+```
+### 2. Push to Hugging Face Hub
+```bash
+python push_to_huggingface.py /output-checkpoint username/smollm3-finetuned \
+    --trackio-url "https://your-space.hf.space" \
+    --experiment-name "smollm3_finetune_v1"
+```
+### 3. Use Your Model
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load your uploaded model
+model = AutoModelForCausalLM.from_pretrained("username/smollm3-finetuned")
+tokenizer = AutoTokenizer.from_pretrained("username/smollm3-finetuned")
+# Generate text
+inputs = tokenizer("Hello, how are you?", return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+## Repository Structure
+After pushing, your repository will contain:
+```
+username/model-name/
+├── README.md                    # Auto-generated model card
+├── config.json                  # Model configuration
+├── pytorch_model.bin           # Model weights
+├── tokenizer.json              # Tokenizer configuration
+├── tokenizer_config.json       # Tokenizer settings
+├── special_tokens_map.json     # Special tokens
+├── training_results/           # Training artifacts
+│   ├── train_results.json
+│   ├── eval_results.json
+│   ├── training_config.json
+│   └── training.log
+└── .gitattributes             # Git attributes
+```
+## Model Card Features
+The script automatically generates comprehensive model cards including:
+- **Model Details**: Base model, fine-tuning method, size
+- **Training Configuration**: All training parameters
+- **Training Results**: Loss, accuracy, steps, time
+- **Usage Examples**: Code snippets for loading and using
+- **Performance Metrics**: Training and validation metrics
+- **Hardware Information**: GPU/CPU used for training
+## Advanced Usage
+### Custom Repository Names
+```bash
+# Public repository
+python push_to_huggingface.py /model myusername/smollm3-chatbot
+# Private repository
+python push_to_huggingface.py /model myusername/smollm3-private --private
+```
+### Integration with Training Pipeline
+```bash
+#!/bin/bash
+# Complete training and push workflow
+# 1. Train the model
+python train.py config/train_smollm3.py \
+    --dataset_dir my_dataset \
+    --enable_tracking \
+    --trackio_url "https://your-space.hf.space" \
+    --experiment_name "smollm3_v1"
+# 2. Push to Hugging Face Hub
+python push_to_huggingface.py /output-checkpoint myusername/smollm3-v1 \
+    --trackio-url "https://your-space.hf.space" \
+    --experiment-name "smollm3_v1"
+# 3. Test the model
+python -c "
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained('myusername/smollm3-v1')
+tokenizer = AutoTokenizer.from_pretrained('myusername/smollm3-v1')
+print('Model loaded successfully!')
+"
+```
+### Batch Processing Multiple Models
+```bash
+#!/bin/bash
+# Push multiple models
+models=(
+    "smollm3-baseline"
+    "smollm3-high-lr"
+    "smollm3-dpo"
+)
+for model in "${models[@]}"; do
+    echo "Pushing $model..."
+    python push_to_huggingface.py "/models/$model" "username/$model"
+done
+```
+## Error Handling
+### Common Issues and Solutions
+#### 1. Missing Model Files
+**Error**: `❌ Missing required files: ['config.json', 'pytorch_model.bin']`
+**Solution**: Ensure your model directory contains all required files:
+- `config.json`
+- `pytorch_model.bin`
+- `tokenizer.json`
+- `tokenizer_config.json`
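To check a model directory before pushing, a short script along these lines can help. It is a sketch, not the validation that `push_to_huggingface.py` performs: `missing_model_files` is a hypothetical helper, and the file names are simply the ones listed above.

```python
from pathlib import Path
from typing import Iterable, List

# File names taken from the list in this guide; adjust if your checkpoint differs.
REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_model_files(model_dir: str, required: Iterable[str] = REQUIRED_FILES) -> List[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

An empty list means every listed file is present.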
+#### 2. Authentication Issues
+**Error**: `❌ Failed to create repository: 401 Client Error`
+**Solution**:
+- Check your HF token is valid
+- Ensure token has write permissions
+- Verify username in repository name matches your account
+#### 3. Repository Already Exists
+**Error**: `Repository already exists`
+**Solution**: The script handles this automatically with `exist_ok=True`, but you can:
+- Use a different repository name
+- Delete the existing repository first
+- Use version numbers: `username/model-v2`
+#### 4. Large File Upload Issues
+**Error**: `Upload failed for large files`
+**Solution**:
+- Check your internet connection
+- Use Git LFS for large files
+- Consider splitting large models
+## Trackio Integration
+### Logging Push Actions
+When using Trackio integration, the script logs:
+- **Push Action**: Repository creation and file uploads
+- **Model Metadata**: Size, configuration, results
+- **Repository Info**: Name, privacy settings, URL
+- **Training Results**: Loss, accuracy, steps
+### Viewing Push Logs
+1. Go to your Trackio Space
+2. Navigate to the "View Experiments" tab
+3. Find your experiment
+4. Check the metrics for push-related actions
+## Security Best Practices
+### Token Management
+```bash
+# Use environment variables (recommended)
+export HF_TOKEN="your_token_here"
+python push_to_huggingface.py model repo
+# Don't hardcode tokens in scripts
+# ❌ Bad: python push_to_huggingface.py model repo --token "hf_xxx"
+```
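If you wrap the push in your own Python code, the same practice applies: read the token from the environment and never print it in full. Both helper names below are hypothetical and are not part of `push_to_huggingface.py`.

```python
import os
from typing import Mapping, Optional


def load_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Read HF_TOKEN from the environment and fail loudly if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before pushing")
    return token


def mask_token(token: str, visible: int = 4) -> str:
    """Return a log-safe form that shows only the last few characters."""
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]
```

Log `mask_token(token)` instead of the token itself when you need to confirm which credential is in use.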
+### Private Models
+```bash
+# For sensitive models, use private repositories
+python push_to_huggingface.py model username/private-model --private
+```
+### Repository Naming
+```bash
+# Use descriptive names
+python push_to_huggingface.py model username/smollm3-chatbot-v1
+# Include version numbers
+python push_to_huggingface.py model username/smollm3-v2.0
+```
+## Performance Optimization
+### Large Models
+For models > 5GB:
+```bash
+# Use Git LFS for large files
+git lfs install
+git lfs track "*.bin"
+# Consider splitting (sharding) very large models, then push as usual
+python push_to_huggingface.py model username/model-large --private
+```
+### Upload Speed
+```bash
+# Use stable internet connection
+# Consider uploading during off-peak hours
+# Use private repositories for faster uploads
+```
+## Troubleshooting
+### Debug Mode
+```bash
+# Enable debug logging
+export LOG_LEVEL=DEBUG
+python push_to_huggingface.py model repo
+```
+### Validate Model Files
+```bash
+# Check model structure before pushing
+ls -la /path/to/model/
+# Should contain: config.json, pytorch_model.bin, tokenizer.json, etc.
+```
+### Test Repository Access
+```bash
+# Test your HF token
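+python -c "
+import os
+from huggingface_hub import HfApi
+api = HfApi(token=os.environ['HF_TOKEN'])
+# whoami() contacts the Hub and raises an error if the token is rejected
+print('Token is valid for user:', api.whoami()['name'])
+"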
+```
+## Integration Examples
+### With CI/CD Pipeline
+```yaml
+# .github/workflows/train-and-push.yml
+name: Train and Push Model
+on:
+  push:
+    branches: [main]
+jobs:
+  train-and-push:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Train Model
+        run: |
+          python train.py config/train_smollm3.py
+      - name: Push to HF Hub
+        run: |
+          python push_to_huggingface.py /output username/model-${{ github.run_number }}
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+```
+### With Docker
+```dockerfile
+# Dockerfile
+FROM python:3.9
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install -r requirements.txt
+COPY . .
+CMD ["python", "push_to_huggingface.py", "/model", "username/model"]
+```
+## Support and Resources
+### Documentation
+- [Hugging Face Hub Documentation](https://huggingface.co/docs/hub/index)
+- [Transformers Documentation](https://huggingface.co/docs/transformers/index)
+- [Model Cards Guide](https://huggingface.co/docs/hub/model-cards)
+### Community
+- [Hugging Face Forums](https://discuss.huggingface.co/)
+- [GitHub Issues](https://github.com/huggingface/huggingface_hub/issues)
+### Examples
+- [Model Repository Examples](https://huggingface.co/models?search=smollm3)
+- [Fine-tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
+## Conclusion
+The `push_to_huggingface.py` script provides a complete solution for:
+- ✅ **Easy Model Deployment** - One command to push models
+- ✅ **Professional Documentation** - Auto-generated model cards
+- ✅ **Training Artifacts** - Complete experiment tracking
+- ✅ **Integration Ready** - Works with CI/CD and monitoring
+- ✅ **Security Focused** - Proper token and privacy management
+Start sharing your fine-tuned SmolLM3 models with the community!

README.md CHANGED Viewed

@@ -288,4 +288,17 @@ python -m llama_cpp.convert_model ./output-checkpoint --outfile model.gguf
 ## License
-This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.

 ## License
+This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.
+{
+  "id": "exp_20250718_195852",
+  "name": "petit-elle-l-aime-3",
+  "description": "SmolLM3 fine-tuning experiment",
+  "created_at": "2025-07-18T19:58:52.689087",
+  "status": "running",
+  "metrics": [],
+  "parameters": {},
+  "artifacts": [],
+  "logs": []
+}

TRACKIO_INTEGRATION.md ADDED Viewed

	@@ -0,0 +1,252 @@

+# Trackio Integration for SmolLM3 Fine-tuning
+This document provides comprehensive information about the Trackio experiment tracking and monitoring integration for your SmolLM3 fine-tuning pipeline.
+## Features
+- **SmolLM3 Fine-tuning**: Support for supervised fine-tuning and DPO training
+- **Trackio Integration**: Complete experiment tracking and monitoring
+- **Hugging Face Spaces Deployment**: Easy deployment of Trackio monitoring interface
+- **Comprehensive Logging**: Metrics, parameters, artifacts, and system monitoring
+- **Flexible Configuration**: Support for various training configurations
+## Quick Start
+### 1. Install Dependencies
+```bash
+pip install -r requirements.txt
+```
+### 2. Basic Training with Trackio
+```bash
+python train.py config/train_smollm3.py \
+    --dataset_dir my_dataset \
+    --enable_tracking \
+    --trackio_url "https://your-trackio-instance.com" \
+    --experiment_name "smollm3_finetune_v1"
+```
+### 3. Training with Custom Parameters
+```bash
+python train.py config/train_smollm3.py \
+    --dataset_dir my_dataset \
+    --batch_size 8 \
+    --learning_rate 1e-5 \
+    --max_iters 2000 \
+    --enable_tracking \
+    --trackio_url "https://your-trackio-instance.com" \
+    --experiment_name "smollm3_custom_params"
+```
+## Trackio Integration
+### Configuration
+Add Trackio settings to your configuration:
+```python
+# In your config file
+config = SmolLM3Config(
+    # ... other settings ...
+    # Trackio monitoring configuration
+    enable_tracking=True,
+    trackio_url="https://your-trackio-instance.com",
+    trackio_token="your_token_here",  # Optional
+    log_artifacts=True,
+    log_metrics=True,
+    log_config=True,
+    experiment_name="my_experiment"
+)
+```
+### Environment Variables
+You can also set Trackio configuration via environment variables:
+```bash
+export TRACKIO_URL="https://your-trackio-instance.com"
+export TRACKIO_TOKEN="your_token_here"
+```
+### What Gets Tracked
+- **Configuration**: All training parameters and model settings
+- **Metrics**: Loss, accuracy, learning rate, and custom metrics
+- **System Metrics**: GPU memory, CPU usage, training time
+- **Artifacts**: Model checkpoints, evaluation results
+- **Training Summary**: Final results and experiment duration
+## Hugging Face Spaces Deployment
+### Deploy Trackio Monitoring Interface
+1. **Create a new Space** on Hugging Face:
+   - Go to https://huggingface.co/spaces
+   - Click "Create new Space"
+   - Choose "Gradio" as the SDK
+   - Set visibility (Public or Private)
+2. **Upload the deployment files**:
+   - `app.py` - The Gradio interface
+   - `requirements_space.txt` - Dependencies
+   - `README.md` - Documentation
+3. **Configure the Space**:
+   - The Space will automatically install dependencies
+   - The Gradio interface will be available at your Space URL
+### Using the Trackio Space
+1. **Create Experiments**: Use the "Create Experiment" tab to start new experiments
+2. **Log Metrics**: Use the "Log Metrics" tab to track training progress
+3. **View Results**: Use the "View Experiments" tab to see experiment details
+4. **Update Status**: Use the "Update Status" tab to mark experiments as completed
+### Integration with Your Training
+To connect your training script to the Trackio Space:
+```python
+# In your training script
+from monitoring import SmolLM3Monitor
+# Initialize monitor
+monitor = SmolLM3Monitor(
+    experiment_name="my_experiment",
+    trackio_url="https://your-space.hf.space",  # Your Space URL
+    enable_tracking=True
+)
+# Log configuration
+monitor.log_config(config_dict)
+# Log metrics during training
+monitor.log_metrics({"loss": 0.5, "accuracy": 0.85}, step=100)
+# Log final results
+monitor.log_training_summary(final_results)
+```
+## Configuration Files
+### Main Configuration (`config/train_smollm3.py`)
+```python
+from dataclasses import dataclass
+from typing import Optional
+@dataclass
+class SmolLM3Config:
+    # Model configuration
+    model_name: str = "HuggingFaceTB/SmolLM3-3B"
+    max_seq_length: int = 4096
+    # Training configuration
+    batch_size: int = 4
+    learning_rate: float = 2e-5
+    max_iters: int = 1000
+    # Trackio monitoring
+    enable_tracking: bool = True
+    trackio_url: Optional[str] = None
+    trackio_token: Optional[str] = None
+    experiment_name: Optional[str] = None
+```
+### DPO Configuration (`config/train_smollm3_dpo.py`)
+```python
+@dataclass
+class SmolLM3DPOConfig(SmolLM3Config):
+    # DPO-specific settings
+    beta: float = 0.1
+    max_prompt_length: int = 2048
+    # Trackio monitoring (inherited)
+    enable_tracking: bool = True
+    trackio_url: Optional[str] = None
+```
+## Monitoring Features
+### Real-time Metrics
+- Training loss and evaluation metrics
+- Learning rate scheduling
+- GPU memory and utilization
+- Training time and progress
+### Artifact Tracking
+- Model checkpoints at regular intervals
+- Evaluation results and plots
+- Configuration snapshots
+- Training logs and summaries
+### Experiment Management
+- Experiment naming and organization
+- Status tracking (running, completed, failed)
+- Parameter comparison across experiments
+- Result visualization
+## Advanced Usage
+### Custom Metrics
+```python
+# Log custom metrics
+monitor.log_metrics({
+    "custom_metric": value,
+    "perplexity": perplexity_score,
+    "bleu_score": bleu_score
+}, step=current_step)
+```
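The `perplexity_score` above is not defined in this guide. A common way to derive it, assuming the logged loss is the mean token-level cross-entropy in nats, is to exponentiate the loss. The function name is hypothetical.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Return exp(loss); assumes the loss is mean cross-entropy in nats."""
    try:
        return math.exp(mean_cross_entropy)
    except OverflowError:
        # A diverged loss would overflow exp(); report it as infinite.
        return float("inf")
```

The result can be passed as the `"perplexity"` entry in the metrics dictionary.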
+### System Monitoring
+```python
+# Log system metrics
+monitor.log_system_metrics(step=current_step)
+```
+### Artifact Logging
+```python
+# Log model checkpoint
+monitor.log_model_checkpoint("checkpoint-1000", step=1000)
+# Log evaluation results
+monitor.log_evaluation_results(eval_results, step=1000)
+```
+## Troubleshooting
+### Common Issues
+1. **Trackio not available**: Install with `pip install trackio`
+2. **Connection errors**: Check your Trackio URL and token
+3. **Missing metrics**: Ensure monitoring is enabled in configuration
+4. **Space deployment issues**: Check Gradio version compatibility
+### Debug Mode
+Enable debug logging:
+```python
+import logging
+logging.basicConfig(level=logging.DEBUG)
+```
+## Contributing
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Add tests if applicable
+5. Submit a pull request
+## License
+This project is licensed under the MIT License - see the LICENSE file for details.

app.py ADDED Viewed

	@@ -0,0 +1,318 @@

+"""
+Trackio Deployment on Hugging Face Spaces
+A Gradio interface for experiment tracking and monitoring
+"""
+import gradio as gr
+import os
+import json
+import logging
+from datetime import datetime
+from typing import Dict, Any, Optional
+import requests
+# Setup logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+class TrackioSpace:
+    """Trackio deployment for Hugging Face Spaces"""
+    def __init__(self):
+        self.experiments = {}
+        self.current_experiment = None
+    def create_experiment(self, name: str, description: str = "") -> Dict[str, Any]:
+        """Create a new experiment"""
+        experiment_id = f"exp_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
+        experiment = {
+            'id': experiment_id,
+            'name': name,
+            'description': description,
+            'created_at': datetime.now().isoformat(),
+            'status': 'running',
+            'metrics': [],
+            'parameters': {},
+            'artifacts': [],
+            'logs': []
+        }
+        self.experiments[experiment_id] = experiment
+        self.current_experiment = experiment_id
+        logger.info(f"Created experiment: {experiment_id} - {name}")
+        return experiment
+    def log_metrics(self, experiment_id: str, metrics: Dict[str, Any], step: Optional[int] = None):
+        """Log metrics for an experiment"""
+        if experiment_id not in self.experiments:
+            raise ValueError(f"Experiment {experiment_id} not found")
+        metric_entry = {
+            'timestamp': datetime.now().isoformat(),
+            'step': step,
+            'metrics': metrics
+        }
+        self.experiments[experiment_id]['metrics'].append(metric_entry)
+        logger.info(f"Logged metrics for experiment {experiment_id}: {metrics}")
+    def log_parameters(self, experiment_id: str, parameters: Dict[str, Any]):
+        """Log parameters for an experiment"""
+        if experiment_id not in self.experiments:
+            raise ValueError(f"Experiment {experiment_id} not found")
+        self.experiments[experiment_id]['parameters'].update(parameters)
+        logger.info(f"Logged parameters for experiment {experiment_id}: {parameters}")
+    def log_artifact(self, experiment_id: str, artifact_name: str, artifact_data: str):
+        """Log an artifact for an experiment"""
+        if experiment_id not in self.experiments:
+            raise ValueError(f"Experiment {experiment_id} not found")
+        artifact_entry = {
+            'name': artifact_name,
+            'timestamp': datetime.now().isoformat(),
+            'data': artifact_data
+        }
+        self.experiments[experiment_id]['artifacts'].append(artifact_entry)
+        logger.info(f"Logged artifact for experiment {experiment_id}: {artifact_name}")
+    def get_experiment(self, experiment_id: str) -> Optional[Dict[str, Any]]:
+        """Get experiment details"""
+        return self.experiments.get(experiment_id)
+    def list_experiments(self) -> Dict[str, Any]:
+        """List all experiments"""
+        return {
+            'experiments': list(self.experiments.keys()),
+            'current_experiment': self.current_experiment,
+            'total_experiments': len(self.experiments)
+        }
+    def update_experiment_status(self, experiment_id: str, status: str):
+        """Update experiment status"""
+        if experiment_id not in self.experiments:
+            raise ValueError(f"Experiment {experiment_id} not found")
+        self.experiments[experiment_id]['status'] = status
+        logger.info(f"Updated experiment {experiment_id} status to {status}")
+# Initialize Trackio space
+trackio_space = TrackioSpace()
+def create_experiment_interface(name: str, description: str) -> str:
+    """Create a new experiment"""
+    try:
+        experiment = trackio_space.create_experiment(name, description)
+        return f"✅ Experiment created successfully!\nID: {experiment['id']}\nName: {experiment['name']}"
+    except Exception as e:
+        return f"❌ Error creating experiment: {str(e)}"
+def log_metrics_interface(experiment_id: str, metrics_json: str, step: str) -> str:
+    """Log metrics for an experiment"""
+    try:
+        metrics = json.loads(metrics_json)
+        step_int = int(step) if step else None
+        trackio_space.log_metrics(experiment_id, metrics, step_int)
+        return f"✅ Metrics logged successfully for experiment {experiment_id}"
+    except Exception as e:
+        return f"❌ Error logging metrics: {str(e)}"
+def log_parameters_interface(experiment_id: str, parameters_json: str) -> str:
+    """Log parameters for an experiment"""
+    try:
+        parameters = json.loads(parameters_json)
+        trackio_space.log_parameters(experiment_id, parameters)
+        return f"✅ Parameters logged successfully for experiment {experiment_id}"
+    except Exception as e:
+        return f"❌ Error logging parameters: {str(e)}"
+def get_experiment_details(experiment_id: str) -> str:
+    """Get experiment details"""
+    try:
+        experiment = trackio_space.get_experiment(experiment_id)
+        if experiment:
+            return json.dumps(experiment, indent=2)
+        else:
+            return f"❌ Experiment {experiment_id} not found"
+    except Exception as e:
+        return f"❌ Error getting experiment details: {str(e)}"
+def list_experiments_interface() -> str:
+    """List all experiments"""
+    try:
+        experiments_info = trackio_space.list_experiments()
+        return json.dumps(experiments_info, indent=2)
+    except Exception as e:
+        return f"❌ Error listing experiments: {str(e)}"
+def update_experiment_status_interface(experiment_id: str, status: str) -> str:
+    """Update experiment status"""
+    try:
+        trackio_space.update_experiment_status(experiment_id, status)
+        return f"✅ Experiment {experiment_id} status updated to {status}"
+    except Exception as e:
+        return f"❌ Error updating experiment status: {str(e)}"
+# Create Gradio interface
+with gr.Blocks(title="Trackio - Experiment Tracking", theme=gr.themes.Soft()) as demo:
+    gr.Markdown("# 🚀 Trackio Experiment Tracking")
+    gr.Markdown("Monitor and track your ML experiments with ease!")
+    with gr.Tabs():
+        # Create Experiment Tab
+        with gr.Tab("Create Experiment"):
+            gr.Markdown("### Create a New Experiment")
+            with gr.Row():
+                with gr.Column():
+                    experiment_name = gr.Textbox(
+                        label="Experiment Name",
+                        placeholder="my_smollm3_finetune",
+                        value="smollm3_finetune"
+                    )
+                    experiment_description = gr.Textbox(
+                        label="Description",
+                        placeholder="Fine-tuning SmolLM3 model on custom dataset",
+                        value="SmolLM3 fine-tuning experiment"
+                    )
+                    create_btn = gr.Button("Create Experiment", variant="primary")
+                with gr.Column():
+                    create_output = gr.Textbox(
+                        label="Result",
+                        lines=5,
+                        interactive=False
+                    )
+            create_btn.click(
+                create_experiment_interface,
+                inputs=[experiment_name, experiment_description],
+                outputs=create_output
+            )
+        # Log Metrics Tab
+        with gr.Tab("Log Metrics"):
+            gr.Markdown("### Log Training Metrics")
+            with gr.Row():
+                with gr.Column():
+                    metrics_exp_id = gr.Textbox(
+                        label="Experiment ID",
+                        placeholder="exp_20231201_143022"
+                    )
+                    metrics_json = gr.Textbox(
+                        label="Metrics (JSON)",
+                        placeholder='{"loss": 0.5, "accuracy": 0.85}',
+                        value='{"loss": 0.5, "accuracy": 0.85}'
+                    )
+                    metrics_step = gr.Textbox(
+                        label="Step (optional)",
+                        placeholder="100"
+                    )
+                    log_metrics_btn = gr.Button("Log Metrics", variant="primary")
+                with gr.Column():
+                    metrics_output = gr.Textbox(
+                        label="Result",
+                        lines=3,
+                        interactive=False
+                    )
+            log_metrics_btn.click(
+                log_metrics_interface,
+                inputs=[metrics_exp_id, metrics_json, metrics_step],
+                outputs=metrics_output
+            )
+        # Log Parameters Tab
+        with gr.Tab("Log Parameters"):
+            gr.Markdown("### Log Experiment Parameters")
+            with gr.Row():
+                with gr.Column():
+                    params_exp_id = gr.Textbox(
+                        label="Experiment ID",
+                        placeholder="exp_20231201_143022"
+                    )
+                    parameters_json = gr.Textbox(
+                        label="Parameters (JSON)",
+                        placeholder='{"learning_rate": 2e-5, "batch_size": 4}',
+                        value='{"learning_rate": 2e-5, "batch_size": 4, "model_name": "HuggingFaceTB/SmolLM3-3B"}'
+                    )
+                    log_params_btn = gr.Button("Log Parameters", variant="primary")
+                with gr.Column():
+                    params_output = gr.Textbox(
+                        label="Result",
+                        lines=3,
+                        interactive=False
+                    )
+            log_params_btn.click(
+                log_parameters_interface,
+                inputs=[params_exp_id, parameters_json],
+                outputs=params_output
+            )
+        # View Experiments Tab
+        with gr.Tab("View Experiments"):
+            gr.Markdown("### View Experiment Details")
+            with gr.Row():
+                with gr.Column():
+                    view_exp_id = gr.Textbox(
+                        label="Experiment ID",
+                        placeholder="exp_20231201_143022"
+                    )
+                    view_btn = gr.Button("View Experiment", variant="primary")
+                    list_btn = gr.Button("List All Experiments", variant="secondary")
+                with gr.Column():
+                    view_output = gr.Textbox(
+                        label="Experiment Details",
+                        lines=15,
+                        interactive=False
+                    )
+            view_btn.click(
+                get_experiment_details,
+                inputs=[view_exp_id],
+                outputs=view_output
+            )
+            list_btn.click(
+                list_experiments_interface,
+                inputs=[],
+                outputs=view_output
+            )
+        # Update Status Tab
+        with gr.Tab("Update Status"):
+            gr.Markdown("### Update Experiment Status")
+            with gr.Row():
+                with gr.Column():
+                    status_exp_id = gr.Textbox(
+                        label="Experiment ID",
+                        placeholder="exp_20231201_143022"
+                    )
+                    status_dropdown = gr.Dropdown(
+                        label="Status",
+                        choices=["running", "completed", "failed", "paused"],
+                        value="running"
+                    )
+                    update_status_btn = gr.Button("Update Status", variant="primary")
+                with gr.Column():
+                    status_output = gr.Textbox(
+                        label="Result",
+                        lines=3,
+                        interactive=False
+                    )
+            update_status_btn.click(
+                update_experiment_status_interface,
+                inputs=[status_exp_id, status_dropdown],
+                outputs=status_output
+            )
+# Launch the app
+if __name__ == "__main__":
+    demo.launch()
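
The interface functions above share one pattern: parse the JSON text from a textbox, call the tracker, and return a status string rather than raising. A minimal sketch of that pattern follows. `InMemoryTracker` and `log_metrics_handler` are hypothetical names used only for illustration, and the tracker is a stand-in for `trackio_space`, whose real implementation is not shown in this section.

```python
import json


class InMemoryTracker:
    """Hypothetical stand-in for trackio_space; keeps metrics in a dict."""

    def __init__(self):
        self.metrics = {}

    def log_metrics(self, experiment_id, metrics, step=None):
        # Append one record per call, keyed by experiment ID.
        self.metrics.setdefault(experiment_id, []).append(
            {"step": step, "metrics": metrics}
        )


def log_metrics_handler(tracker, experiment_id, metrics_json, step):
    """Parse textbox input, log it, and report the outcome as a string."""
    try:
        metrics = json.loads(metrics_json)
        # An empty "Step" textbox arrives as "", which maps to None.
        step_int = int(step) if step else None
        tracker.log_metrics(experiment_id, metrics, step_int)
        return f"OK: logged {len(metrics)} metric(s) for {experiment_id}"
    except Exception as e:
        return f"Error logging metrics: {e}"


tracker = InMemoryTracker()
ok = log_metrics_handler(tracker, "exp_demo", '{"loss": 0.5}', "100")
bad = log_metrics_handler(tracker, "exp_demo", "not json", "")
print(ok)
print(bad)
```

Returning the error text keeps the Gradio result box informative when the JSON is malformed, at the cost of hiding the traceback from the caller.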

cloud_deployment.sh ADDED

	@@ -0,0 +1,279 @@

+#!/bin/bash
+# Cloud Deployment Script for SmolLM3 DPO Training
+# This script sets up a cloud instance for training and uploading to Hugging Face
+set -e  # Exit on any error
+echo "🚀 Starting SmolLM3 DPO Cloud Deployment"
+echo "=========================================="
+# Configuration
+MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
+DATASET_NAME="HuggingFaceTB/smoltalk"
+EXPERIMENT_NAME="smollm3_dpo_6epochs"
+REPO_NAME="your-username/smollm3-dpo-6epochs"  # Change this to your username
+TRACKIO_URL="https://your-trackio-space.hf.space"  # Change this to your Trackio Space URL
+HF_TOKEN="your_hf_token_here"  # Change this to your HF token
+# Training Configuration
+BATCH_SIZE=2
+GRADIENT_ACCUMULATION_STEPS=8
+LEARNING_RATE=5e-6
+MAX_EPOCHS=6
+MAX_SEQ_LENGTH=4096
+SAVE_STEPS=500
+EVAL_STEPS=100
+LOGGING_STEPS=10
+echo "📋 Configuration:"
+echo "  Model: $MODEL_NAME"
+echo "  Dataset: $DATASET_NAME"
+echo "  Experiment: $EXPERIMENT_NAME"
+echo "  Repository: $REPO_NAME"
+echo "  Epochs: $MAX_EPOCHS"
+echo "  Batch Size: $BATCH_SIZE"
+echo "  Learning Rate: $LEARNING_RATE"
+# Step 1: Update system and install dependencies
+echo ""
+echo "🔧 Step 1: Installing system dependencies..."
+sudo apt-get update
+sudo apt-get install -y git curl wget unzip
+# Step 2: Install Python and pip
+echo ""
+echo "🐍 Step 2: Installing Python dependencies..."
+sudo apt-get install -y python3 python3-pip python3-venv
+# Step 3: Create virtual environment
+echo ""
+echo "📦 Step 3: Setting up Python virtual environment..."
+python3 -m venv smollm3_env
+source smollm3_env/bin/activate
+# Step 4: Install PyTorch and CUDA
+echo ""
+echo "🔥 Step 4: Installing PyTorch with CUDA support..."
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+# Step 5: Install project dependencies
+echo ""
+echo "📚 Step 5: Installing project dependencies..."
+pip install -r requirements.txt
+# Step 6: Install additional dependencies for DPO
+echo ""
+echo "🎯 Step 6: Installing DPO-specific dependencies..."
+pip install "trl>=0.7.0"
+pip install "peft>=0.4.0"
+pip install "accelerate>=0.20.0"
+# Step 7: Set up Hugging Face token
+echo ""
+echo "🔑 Step 7: Setting up Hugging Face authentication..."
+export HF_TOKEN="$HF_TOKEN"
+huggingface-cli login --token "$HF_TOKEN"
+# Step 8: Create DPO configuration
+echo ""
+echo "⚙️ Step 8: Creating DPO configuration..."
+cat > config/train_smollm3_dpo_6epochs.py << EOF
+"""
+SmolLM3 DPO Training Configuration - 6 Epochs
+Optimized for cloud deployment
+"""
+from config.train_smollm3_dpo import SmolLM3DPOConfig
+config = SmolLM3DPOConfig(
+    # Model configuration
+    model_name="$MODEL_NAME",
+    max_seq_length=$MAX_SEQ_LENGTH,
+    use_flash_attention=True,
+    use_gradient_checkpointing=True,
+    # Training configuration
+    batch_size=$BATCH_SIZE,
+    gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS,
+    learning_rate=$LEARNING_RATE,
+    weight_decay=0.01,
+    warmup_steps=100,
+    max_iters=None,  # Will be calculated based on epochs
+    eval_interval=100,
+    log_interval=10,
+    save_interval=500,
+    # DPO configuration
+    beta=0.1,
+    max_prompt_length=$((MAX_SEQ_LENGTH / 2)),
+    # Optimizer configuration
+    optimizer="adamw",
+    beta1=0.9,
+    beta2=0.95,
+    eps=1e-8,
+    # Scheduler configuration
+    scheduler="cosine",
+    min_lr=1e-6,
+    # Mixed precision
+    fp16=True,
+    bf16=False,
+    # Logging and saving
+    save_steps=$SAVE_STEPS,
+    eval_steps=$EVAL_STEPS,
+    logging_steps=$LOGGING_STEPS,
+    save_total_limit=3,
+    # Evaluation
+    eval_strategy="steps",
+    metric_for_best_model="eval_loss",
+    greater_is_better=False,
+    load_best_model_at_end=True,
+    # Data configuration
+    data_dir="smoltalk_dataset",
+    train_file="train.json",
+    validation_file="validation.json",
+    # Chat template configuration
+    use_chat_template=True,
+    chat_template_kwargs={
+        "enable_thinking": False,
+        "add_generation_prompt": True
+    },
+    # Trackio monitoring configuration
+    enable_tracking=True,
+    trackio_url="$TRACKIO_URL",
+    trackio_token=None,
+    log_artifacts=True,
+    log_metrics=True,
+    log_config=True,
+    experiment_name="$EXPERIMENT_NAME"
+)
+EOF
+# Step 9: Download and prepare dataset
+echo ""
+echo "📊 Step 9: Downloading and preparing dataset..."
+python -c "
+from datasets import load_dataset
+import json
+import os
+# Load SmolTalk dataset
+print('Loading SmolTalk dataset...')
+dataset = load_dataset('$DATASET_NAME')
+# Create dataset directory
+os.makedirs('smoltalk_dataset', exist_ok=True)
+# Convert to DPO format (preference pairs)
+def convert_to_dpo_format(example):
+    # This mapping assumes each example already has 'prompt', 'chosen' and 'rejected' fields.
+    # Examples missing any of them are skipped below; adjust the mapping to your dataset schema.
+    return {
+        'prompt': example.get('prompt', ''),
+        'chosen': example.get('chosen', ''),
+        'rejected': example.get('rejected', '')
+    }
+# Process train split
+train_data = []
+for example in dataset['train']:
+    dpo_example = convert_to_dpo_format(example)
+    if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
+        train_data.append(dpo_example)
+# Process validation split
+val_data = []
+for example in dataset['validation']:
+    dpo_example = convert_to_dpo_format(example)
+    if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
+        val_data.append(dpo_example)
+# Save to files
+with open('smoltalk_dataset/train.json', 'w') as f:
+    json.dump(train_data, f, indent=2)
+with open('smoltalk_dataset/validation.json', 'w') as f:
+    json.dump(val_data, f, indent=2)
+print(f'Dataset prepared: {len(train_data)} train samples, {len(val_data)} validation samples')
+"
+# Step 10: Calculate training steps based on epochs
+echo ""
+echo "📈 Step 10: Calculating training parameters..."
+TOTAL_SAMPLES=$(python -c "import json; data=json.load(open('smoltalk_dataset/train.json')); print(len(data))")
+EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
+STEPS_PER_EPOCH=$((TOTAL_SAMPLES / EFFECTIVE_BATCH_SIZE))
+MAX_STEPS=$((STEPS_PER_EPOCH * MAX_EPOCHS))
+echo "  Total samples: $TOTAL_SAMPLES"
+echo "  Effective batch size: $EFFECTIVE_BATCH_SIZE"
+echo "  Steps per epoch: $STEPS_PER_EPOCH"
+echo "  Total training steps: $MAX_STEPS"
+# Step 11: Start DPO training
+echo ""
+echo "🎯 Step 11: Starting DPO training..."
+python train.py config/train_smollm3_dpo_6epochs.py \
+    --dataset_dir smoltalk_dataset \
+    --out_dir /output-checkpoint \
+    --init_from scratch \
+    --max_iters $MAX_STEPS \
+    --batch_size $BATCH_SIZE \
+    --learning_rate $LEARNING_RATE \
+    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
+    --max_seq_length $MAX_SEQ_LENGTH \
+    --save_steps $SAVE_STEPS \
+    --eval_steps $EVAL_STEPS \
+    --logging_steps $LOGGING_STEPS \
+    --enable_tracking \
+    --trackio_url "$TRACKIO_URL" \
+    --experiment_name "$EXPERIMENT_NAME"
+# Step 12: Push model to Hugging Face Hub
+echo ""
+echo "📤 Step 12: Pushing model to Hugging Face Hub..."
+python push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
+    --token "$HF_TOKEN" \
+    --trackio-url "$TRACKIO_URL" \
+    --experiment-name "$EXPERIMENT_NAME"
+# Step 13: Test the uploaded model
+echo ""
+echo "🧪 Step 13: Testing uploaded model..."
+python -c "
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+print('Loading uploaded model...')
+model = AutoModelForCausalLM.from_pretrained('$REPO_NAME', torch_dtype=torch.float16, device_map='auto')
+tokenizer = AutoTokenizer.from_pretrained('$REPO_NAME')
+print('Testing model generation...')
+prompt = 'Hello, how are you?'
+inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(f'Prompt: {prompt}')
+print(f'Response: {response}')
+print('✅ Model test completed successfully!')
+"
+echo ""
+echo "🎉 Deployment completed successfully!"
+echo "====================================="
+echo "📊 Model: https://huggingface.co/$REPO_NAME"
+echo "📈 Trackio: $TRACKIO_URL"
+echo "📋 Experiment: $EXPERIMENT_NAME"
+echo ""
+echo "Next steps:"
+echo "1. Monitor training progress in your Trackio Space"
+echo "2. Check the model repository on Hugging Face Hub"
+echo "3. Use the model in your applications"
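
Step 9 of the script keeps only examples that carry a non-empty prompt, chosen response, and rejected response. The sketch below shows that filter in isolation on a small in-memory list, so it runs without downloading anything. The helper names `to_dpo_record` and `keep_complete_pairs` and the sample rows are hypothetical and not part of the script.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def to_dpo_record(example: Dict) -> Dict[str, str]:
    """Project an example onto the three DPO fields, defaulting to ''."""
    return {field: example.get(field, "") for field in REQUIRED_FIELDS}


def keep_complete_pairs(examples: List[Dict]) -> List[Dict[str, str]]:
    """Drop every example where any of the three fields is empty."""
    records = (to_dpo_record(ex) for ex in examples)
    return [r for r in records if all(r[f] for f in REQUIRED_FIELDS)]


# Hypothetical rows: one complete pair, one incomplete, one chat-style row.
raw_examples = [
    {"prompt": "Bonjour ?", "chosen": "Bonjour !", "rejected": "..."},
    {"prompt": "No rejected answer", "chosen": "ok"},
    {"messages": [{"role": "user", "content": "chat-style row"}]},
]

pairs = keep_complete_pairs(raw_examples)
print(f"kept {len(pairs)} of {len(raw_examples)} examples")
```

If a dataset stores conversations under a different schema, every row is filtered out and the step count computed in Step 10 becomes zero, so it may be worth checking the kept count before training starts.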

config/__init__.py ADDED

	@@ -0,0 +1,19 @@

+"""
+Configuration package for SmolLM3 training
+"""
+from .train_smollm3 import SmolLM3Config, get_config as get_base_config
+from .train_smollm3_openhermes_fr import SmolLM3ConfigOpenHermesFR, get_config as get_openhermes_fr_config
+from .train_smollm3_openhermes_fr_a100_large import SmolLM3ConfigOpenHermesFRA100Large, get_config as get_a100_large_config
+from .train_smollm3_openhermes_fr_a100_multiple_passes import SmolLM3ConfigOpenHermesFRMultiplePasses, get_config as get_multiple_passes_config
+__all__ = [
+    'SmolLM3Config',
+    'SmolLM3ConfigOpenHermesFR',
+    'SmolLM3ConfigOpenHermesFRA100Large',
+    'SmolLM3ConfigOpenHermesFRMultiplePasses',
+    'get_base_config',
+    'get_openhermes_fr_config',
+    'get_a100_large_config',
+    'get_multiple_passes_config',
+]

config/runpod_config.py ADDED

	@@ -0,0 +1,47 @@

+"""
+RunPod Optimized Configuration for SmolLM3 Fine-tuning
+Optimized for cloud GPU training on RunPod
+"""
+from config.train_smollm3 import SmolLM3Config
+config = SmolLM3Config(
+    # Model configuration
+    model_name="HuggingFaceTB/SmolLM3-3B",
+    max_seq_length=4096,
+    use_flash_attention=True,
+    use_gradient_checkpointing=True,
+    # Training configuration - optimized for cloud GPUs
+    batch_size=2,  # Conservative for cloud stability
+    gradient_accumulation_steps=8,  # Effective batch size = 16
+    learning_rate=2e-5,
+    weight_decay=0.01,
+    warmup_steps=100,
+    max_iters=1500,
+    # Mixed precision for efficiency
+    fp16=True,
+    bf16=False,
+    # Logging and saving - more frequent for cloud
+    save_steps=200,
+    eval_steps=100,
+    logging_steps=10,
+    save_total_limit=5,  # Keep more checkpoints
+    # Cloud-specific optimizations
+    ddp_backend="nccl",
+    ddp_find_unused_parameters=False,
+    # Data loading optimizations
+    dataloader_num_workers=4,
+    dataloader_pin_memory=True,
+    # Chat template configuration
+    use_chat_template=True,
+    chat_template_kwargs={
+        "enable_thinking": False,
+        "add_generation_prompt": True
+    }
+)

config/train_smollm3.py CHANGED

@@ -68,6 +68,15 @@ class SmolLM3Config:
     use_chat_template: bool = True
     chat_template_kwargs: dict = None
+    # Trackio monitoring configuration
+    enable_tracking: bool = True
+    trackio_url: Optional[str] = None
+    trackio_token: Optional[str] = None
+    log_artifacts: bool = True
+    log_metrics: bool = True
+    log_config: bool = True
+    experiment_name: Optional[str] = None
     def __post_init__(self):
         if self.chat_template_kwargs is None:
             self.chat_template_kwargs = {

config/train_smollm3_dpo.py CHANGED

@@ -1,38 +1,95 @@
 """
 SmolLM3 DPO Training Configuration
-Optimized for Direct Preference Optimization
+Based on nanoGPT structure but adapted for SmolLM3 DPO training
 """
+import os
+from dataclasses import dataclass
+from typing import Optional
 from config.train_smollm3 import SmolLM3Config
-config = SmolLM3Config(
-    # Model configuration
-    model_name="HuggingFaceTB/SmolLM3-3B-Instruct",  # Start from instruction-tuned model
-    max_seq_length=4096,
-    use_flash_attention=True,
-    use_gradient_checkpointing=True,
-    # Training configuration
-    batch_size=2,  # Smaller batch size for DPO
-    gradient_accumulation_steps=4,
-    learning_rate=5e-6,  # Very low learning rate for DPO
-    weight_decay=0.01,
-    warmup_steps=100,
-    max_iters=1000,
-    # Mixed precision
-    fp16=True,
-    bf16=False,
-    # Logging and saving
-    save_steps=200,
-    eval_steps=100,
-    logging_steps=20,
-    # Chat template configuration
-    use_chat_template=True,
-    chat_template_kwargs={
-        "enable_thinking": False,  # Disable reasoning for preference learning
-        "add_generation_prompt": True
-    }
-)
+@dataclass
+class SmolLM3DPOConfig(SmolLM3Config):
+    """Configuration for SmolLM3 DPO fine-tuning"""
+    # DPO-specific configuration
+    beta: float = 0.1
+    max_prompt_length: int = 2048
+    max_length: int = 4096
+    # DPO training configuration
+    dpo_beta: float = 0.1
+    dpo_loss_type: str = "sigmoid"  # "sigmoid" or "hinge"
+    dpo_alpha: float = 0.5
+    # Reference model configuration
+    ref_model_name: Optional[str] = None  # If None, will use the same as model_name
+    ref_model_peft_config: Optional[dict] = None
+    # Preference dataset configuration
+    preference_dataset_format: str = "dpo"  # "dpo", "rlhf", "custom"
+    preference_dataset_text_field: str = "text"
+    preference_dataset_prompt_field: str = "prompt"
+    preference_dataset_chosen_field: str = "chosen"
+    preference_dataset_rejected_field: str = "rejected"
+    # DPO training arguments
+    dpo_gradient_checkpointing: bool = True
+    dpo_gradient_checkpointing_kwargs: Optional[dict] = None
+    dpo_precompute_ref_log_probs: bool = False
+    dpo_peft_config: Optional[dict] = None
+    def __post_init__(self):
+        super().__post_init__()
+        # Set default values for DPO-specific settings
+        if self.ref_model_name is None:
+            self.ref_model_name = self.model_name
+        if self.dpo_gradient_checkpointing_kwargs is None:
+            self.dpo_gradient_checkpointing_kwargs = {
+                "use_reentrant": False
+            }
+        if self.dpo_peft_config is None:
+            self.dpo_peft_config = {
+                "r": 16,
+                "lora_alpha": 32,
+                "lora_dropout": 0.1,
+                "bias": "none",
+                "task_type": "CAUSAL_LM"
+            }
+        # Validate DPO configuration
+        if self.beta <= 0:
+            raise ValueError("beta must be positive")
+        if self.max_prompt_length > self.max_seq_length:
+            raise ValueError("max_prompt_length cannot exceed max_seq_length")
+        if self.max_length > self.max_seq_length:
+            raise ValueError("max_length cannot exceed max_seq_length")
+def get_dpo_config(config_path: str) -> SmolLM3DPOConfig:
+    """Load DPO configuration from file or return default"""
+    if os.path.exists(config_path):
+        # Load from file if it exists
+        import importlib.util
+        spec = importlib.util.spec_from_file_location("config_module", config_path)
+        config_module = importlib.util.module_from_spec(spec)
+        spec.loader.exec_module(config_module)
+        if hasattr(config_module, 'config'):
+            return config_module.config
+        else:
+            # Otherwise, look for any module-level SmolLM3DPOConfig instance
+            for attr_name in dir(config_module):
+                attr = getattr(config_module, attr_name)
+                if isinstance(attr, SmolLM3DPOConfig):
+                    return attr
+    # Return default configuration
+    return SmolLM3DPOConfig()
+# Default DPO configuration instance
+config = SmolLM3DPOConfig()
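
`get_dpo_config` loads a Python file by path with `importlib.util` and reads its module-level `config` attribute. The sketch below isolates that mechanism. To stay runnable outside this repository it writes a throwaway config file holding a plain dict instead of a `SmolLM3DPOConfig`; the helper name `load_config_object` and the file name `demo_config.py` are hypothetical.

```python
import importlib.util
import os
import tempfile


def load_config_object(config_path, attr="config"):
    """Execute the file at config_path as a module and return one attribute."""
    spec = importlib.util.spec_from_file_location("config_module", config_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, attr, None)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_config.py")
    with open(path, "w") as f:
        # Stand-in for a real config file; a dict replaces the dataclass here.
        f.write("config = {'beta': 0.1, 'max_prompt_length': 2048}\n")
    loaded = load_config_object(path)

print(loaded)
```

Because `exec_module` runs the file's top-level code, a config path should only ever point at trusted files.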

config/train_smollm3_openhermes_fr.py ADDED

	@@ -0,0 +1,129 @@

+"""
+SmolLM3 Training Configuration for OpenHermes-FR Dataset
+Optimized for French instruction tuning using legmlai/openhermes-fr
+"""
+import os
+from dataclasses import dataclass
+from typing import Optional
+from config.train_smollm3 import SmolLM3Config
+@dataclass
+class SmolLM3ConfigOpenHermesFR(SmolLM3Config):
+    """Configuration for SmolLM3 fine-tuning on OpenHermes-FR dataset"""
+    # Model configuration
+    model_name: str = "HuggingFaceTB/SmolLM3-3B"
+    max_seq_length: int = 4096
+    use_flash_attention: bool = True
+    use_gradient_checkpointing: bool = True
+    # Training configuration - optimized for French instruction tuning
+    batch_size: int = 2  # Reduced for French text (longer sequences)
+    gradient_accumulation_steps: int = 8  # Increased to maintain effective batch size
+    learning_rate: float = 1e-5  # Slightly lower for instruction tuning
+    weight_decay: float = 0.01
+    warmup_steps: int = 500  # More warmup for instruction tuning
+    max_iters: int = 2000  # More iterations for large dataset
+    eval_interval: int = 200
+    log_interval: int = 10
+    save_interval: int = 500
+    # Optimizer configuration
+    optimizer: str = "adamw"
+    beta1: float = 0.9
+    beta2: float = 0.95
+    eps: float = 1e-8
+    # Scheduler configuration
+    scheduler: str = "cosine"
+    min_lr: float = 1e-6
+    # Mixed precision
+    fp16: bool = True
+    bf16: bool = False
+    # DDP configuration
+    ddp_backend: str = "nccl"
+    ddp_find_unused_parameters: bool = False
+    # Logging and saving
+    save_steps: int = 500
+    eval_steps: int = 200
+    logging_steps: int = 10
+    save_total_limit: Optional[int] = 3
+    # Evaluation
+    eval_strategy: str = "steps"
+    metric_for_best_model: str = "eval_loss"
+    greater_is_better: bool = False
+    load_best_model_at_end: bool = True
+    # OpenHermes-FR Dataset configuration
+    dataset_name: str = "legmlai/openhermes-fr"
+    dataset_split: str = "train"
+    input_field: str = "prompt"
+    target_field: str = "accepted_completion"
+    filter_bad_entries: bool = True
+    bad_entry_field: str = "bad_entry"
+    # Data configuration (not used for HF datasets but kept for compatibility)
+    data_dir: Optional[str] = None
+    train_file: Optional[str] = None
+    validation_file: Optional[str] = None
+    test_file: Optional[str] = None
+    # Chat template configuration
+    use_chat_template: bool = True
+    chat_template_kwargs: dict = None
+    # Trackio monitoring configuration
+    enable_tracking: bool = True
+    trackio_url: Optional[str] = None
+    trackio_token: Optional[str] = None
+    log_artifacts: bool = True
+    log_metrics: bool = True
+    log_config: bool = True
+    experiment_name: Optional[str] = None
+    def __post_init__(self):
+        if self.chat_template_kwargs is None:
+            self.chat_template_kwargs = {
+                "enable_thinking": False,
+                "add_generation_prompt": True
+            }
+        # Validate configuration
+        if self.fp16 and self.bf16:
+            raise ValueError("Cannot use both fp16 and bf16")
+        if self.max_seq_length > 131072:  # 128k limit
+            raise ValueError("max_seq_length cannot exceed 131072")
+        # Set default experiment name if not provided
+        if self.experiment_name is None:
+            self.experiment_name = "smollm3_openhermes_fr"
+def get_config(config_path: str) -> SmolLM3ConfigOpenHermesFR:
+    """Load configuration from file or return default"""
+    if os.path.exists(config_path):
+        # Load from file if it exists
+        import importlib.util
+        spec = importlib.util.spec_from_file_location("config_module", config_path)
+        config_module = importlib.util.module_from_spec(spec)
+        spec.loader.exec_module(config_module)
+        if hasattr(config_module, 'config'):
+            return config_module.config
+        else:
+            # Otherwise, look for any module-level SmolLM3ConfigOpenHermesFR instance
+            for attr_name in dir(config_module):
+                attr = getattr(config_module, attr_name)
+                if isinstance(attr, SmolLM3ConfigOpenHermesFR):
+                    return attr
+    # Return default configuration
+    return SmolLM3ConfigOpenHermesFR()
+# Default configuration instance
+config = SmolLM3ConfigOpenHermesFR()

config/train_smollm3_openhermes_fr_a100_large.py ADDED

	@@ -0,0 +1,161 @@

+"""
+SmolLM3 Training Configuration for OpenHermes-FR Dataset - A100 Large Scale
+Optimized for A100 GPUs with large batch sizes and multiple passes on 800k+ datapoints
+"""
+import os
+from dataclasses import dataclass
+from typing import Optional
+from config.train_smollm3 import SmolLM3Config
+@dataclass
+class SmolLM3ConfigOpenHermesFRA100Large(SmolLM3Config):
+    """Configuration for SmolLM3 fine-tuning on OpenHermes-FR dataset - A100 Large Scale"""
+    # Model configuration - optimized for A100
+    model_name: str = "HuggingFaceTB/SmolLM3-3B"
+    max_seq_length: int = 8192  # Increased for better context understanding
+    use_flash_attention: bool = True
+    use_gradient_checkpointing: bool = False  # Disabled for A100 efficiency
+    # Training configuration - A100 optimized with large batch sizes
+    batch_size: int = 8  # Large batch size for A100 (80GB VRAM)
+    gradient_accumulation_steps: int = 16  # Effective batch size = 8 * 16 = 128
+    learning_rate: float = 5e-6  # Lower LR for large effective batch size
+    weight_decay: float = 0.01
+    warmup_steps: int = 1000  # More warmup for large dataset
+    max_iters: int = 8000  # ~1.3 passes on 800k dataset (see calculations below)
+    eval_interval: int = 500  # Less frequent evaluation
+    log_interval: int = 25  # Less frequent logging
+    save_interval: int = 1000  # Less frequent saving
+    # Optimizer configuration - optimized for large batches
+    optimizer: str = "adamw"
+    beta1: float = 0.9
+    beta2: float = 0.999  # Higher beta2 for stability with large batches
+    eps: float = 1e-8
+    # Scheduler configuration - longer training
+    scheduler: str = "cosine"
+    min_lr: float = 5e-7  # Lower min LR
+    # Mixed precision - A100 optimized
+    fp16: bool = False  # Use bf16 for A100
+    bf16: bool = True  # Better for A100
+    # DDP configuration
+    ddp_backend: str = "nccl"
+    ddp_find_unused_parameters: bool = False
+    # Logging and saving - optimized for long training
+    save_steps: int = 1000
+    eval_steps: int = 500
+    logging_steps: int = 25
+    save_total_limit: Optional[int] = 5  # Keep more checkpoints
+    # Evaluation
+    eval_strategy: str = "steps"
+    metric_for_best_model: str = "eval_loss"
+    greater_is_better: bool = False
+    load_best_model_at_end: bool = True
+    # OpenHermes-FR Dataset configuration
+    dataset_name: str = "legmlai/openhermes-fr"
+    dataset_split: str = "train"
+    input_field: str = "prompt"
+    target_field: str = "accepted_completion"
+    filter_bad_entries: bool = True
+    bad_entry_field: str = "bad_entry"
+    # Data configuration (not used for HF datasets but kept for compatibility)
+    data_dir: Optional[str] = None
+    train_file: Optional[str] = None
+    validation_file: Optional[str] = None
+    test_file: Optional[str] = None
+    # Chat template configuration
+    use_chat_template: bool = True
+    chat_template_kwargs: dict = None
+    # Trackio monitoring configuration
+    enable_tracking: bool = True
+    trackio_url: Optional[str] = None
+    trackio_token: Optional[str] = None
+    log_artifacts: bool = True
+    log_metrics: bool = True
+    log_config: bool = True
+    experiment_name: Optional[str] = None
+    # Additional A100 optimizations
+    dataloader_num_workers: int = 8  # More workers for faster data loading
+    dataloader_pin_memory: bool = True
+    dataloader_prefetch_factor: int = 2
+    # Memory optimizations
+    max_grad_norm: float = 1.0  # Gradient clipping
+    group_by_length: bool = True  # Group similar length sequences
+    # Training duration calculations
+    # With 800k datapoints and effective batch size of 128:
+    # Steps per epoch = 800,000 / 128 = 6,250 steps
+    # For 3 passes: 6,250 * 3 = 18,750 steps
+    # For 5 passes: 6,250 * 5 = 31,250 steps
+    # Current max_iters = 8,000 (about 1.3 passes)
+    def __post_init__(self):
+        if self.chat_template_kwargs is None:
+            self.chat_template_kwargs = {
+                "enable_thinking": False,
+                "add_generation_prompt": True
+            }
+        # Validate configuration
+        if self.fp16 and self.bf16:
+            raise ValueError("Cannot use both fp16 and bf16")
+        if self.max_seq_length > 131072:  # 128k limit
+            raise ValueError("max_seq_length cannot exceed 131072")
+        # Calculate training statistics
+        effective_batch_size = self.batch_size * self.gradient_accumulation_steps
+        steps_per_epoch = 800000 // effective_batch_size  # Approximate for 800k dataset
+        epochs_for_max_iters = self.max_iters / steps_per_epoch
+        print("=== A100 Large Scale Training Configuration ===")
+        print(f"Effective batch size: {effective_batch_size}")
+        print(f"Steps per epoch: ~{steps_per_epoch}")
+        print(f"Training for ~{epochs_for_max_iters:.1f} epochs")
+        print(f"Total training steps: {self.max_iters}")
+        print(f"Learning rate: {self.learning_rate}")
+        print(f"Mixed precision: {'bf16' if self.bf16 else 'fp16' if self.fp16 else 'none'}")
+        print(f"Max sequence length: {self.max_seq_length}")
+        print(f"Gradient checkpointing: {self.use_gradient_checkpointing}")
+        print("=" * 50)
+        # Set default experiment name if not provided
+        if self.experiment_name is None:
+            self.experiment_name = "smollm3_openhermes_fr_a100_large"
+def get_config(config_path: str) -> SmolLM3ConfigOpenHermesFRA100Large:
+    """Load configuration from file or return default"""
+    if os.path.exists(config_path):
+        # Load from file if it exists
+        import importlib.util
+        spec = importlib.util.spec_from_file_location("config_module", config_path)
+        config_module = importlib.util.module_from_spec(spec)
+        spec.loader.exec_module(config_module)
+        if hasattr(config_module, 'config'):
+            return config_module.config
+        else:
+            # Otherwise fall back to any module-level instance of the config class
+            for attr_name in dir(config_module):
+                attr = getattr(config_module, attr_name)
+                if isinstance(attr, SmolLM3ConfigOpenHermesFRA100Large):
+                    return attr
+    # Return default configuration
+    return SmolLM3ConfigOpenHermesFRA100Large()
+# Default configuration instance
+config = SmolLM3ConfigOpenHermesFRA100Large()
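The `get_config` helper above executes a Python file as a module and returns its module-level `config` object. A minimal, self-contained sketch of that loading pattern follows. The `load_config_object` and `demo` names, the temporary file, and the plain dict standing in for the dataclass instance are all hypothetical and used only for illustration.

```python
import importlib.util
import os
import tempfile


def load_config_object(config_path, default=None):
    """Return the module-level `config` from a Python file, or `default`."""
    if not os.path.exists(config_path):
        return default
    spec = importlib.util.spec_from_file_location("config_module", config_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, "config", default)


def demo():
    # Write a throwaway config file (hypothetical content) and load it back.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_config.py")
        with open(path, "w") as f:
            f.write("config = {'batch_size': 8, 'max_iters': 8000}\n")
        loaded = load_config_object(path)
        missing = load_config_object(os.path.join(tmp, "absent.py"), default={})
    return loaded, missing
```

Because `exec_module` runs the whole file, this pattern should only be pointed at config files you trust.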

config/train_smollm3_openhermes_fr_a100_multiple_passes.py ADDED Viewed

	@@ -0,0 +1,164 @@

+"""
+SmolLM3 Training Configuration for OpenHermes-FR Dataset - Multiple Passes
+Optimized for A100 GPUs with multiple passes (3-5 epochs) on 800k+ datapoints
+"""
+import os
+from dataclasses import dataclass
+from typing import Optional
+from config.train_smollm3 import SmolLM3Config
+@dataclass
+class SmolLM3ConfigOpenHermesFRMultiplePasses(SmolLM3Config):
+    """Configuration for SmolLM3 fine-tuning with multiple passes on OpenHermes-FR dataset"""
+    # Model configuration - optimized for A100
+    model_name: str = "HuggingFaceTB/SmolLM3-3B"
+    max_seq_length: int = 8192  # Increased for better context understanding
+    use_flash_attention: bool = True
+    use_gradient_checkpointing: bool = False  # Disabled for A100 efficiency
+    # Training configuration - Multiple passes optimized
+    batch_size: int = 6  # Slightly smaller for stability during long training
+    gradient_accumulation_steps: int = 20  # Effective batch size = 6 * 20 = 120
+    learning_rate: float = 3e-6  # Conservative LR for multiple passes
+    weight_decay: float = 0.01
+    warmup_steps: int = 2000  # Longer warmup for multiple passes
+    max_iters: int = 25000  # ~3.75 passes on 800k dataset (see duration calculations below)
+    eval_interval: int = 1000  # Less frequent evaluation
+    log_interval: int = 50  # Less frequent logging
+    save_interval: int = 2000  # Less frequent saving
+    # Optimizer configuration - stability focused
+    optimizer: str = "adamw"
+    beta1: float = 0.9
+    beta2: float = 0.999  # Higher beta2 for stability
+    eps: float = 1e-8
+    # Scheduler configuration - longer training with multiple passes
+    scheduler: str = "cosine"
+    min_lr: float = 3e-7  # Lower min LR
+    # Mixed precision - A100 optimized
+    fp16: bool = False  # Use bf16 for A100
+    bf16: bool = True  # Better for A100
+    # DDP configuration
+    ddp_backend: str = "nccl"
+    ddp_find_unused_parameters: bool = False
+    # Logging and saving - optimized for long training
+    save_steps: int = 2000
+    eval_steps: int = 1000
+    logging_steps: int = 50
+    save_total_limit: Optional[int] = 8  # Keep more checkpoints for long training
+    # Evaluation
+    eval_strategy: str = "steps"
+    metric_for_best_model: str = "eval_loss"
+    greater_is_better: bool = False
+    load_best_model_at_end: bool = True
+    # OpenHermes-FR Dataset configuration
+    dataset_name: str = "legmlai/openhermes-fr"
+    dataset_split: str = "train"
+    input_field: str = "prompt"
+    target_field: str = "accepted_completion"
+    filter_bad_entries: bool = True
+    bad_entry_field: str = "bad_entry"
+    # Data configuration (not used for HF datasets but kept for compatibility)
+    data_dir: Optional[str] = None
+    train_file: Optional[str] = None
+    validation_file: Optional[str] = None
+    test_file: Optional[str] = None
+    # Chat template configuration
+    use_chat_template: bool = True
+    chat_template_kwargs: Optional[dict] = None
+    # Trackio monitoring configuration
+    enable_tracking: bool = True
+    trackio_url: Optional[str] = None
+    trackio_token: Optional[str] = None
+    log_artifacts: bool = True
+    log_metrics: bool = True
+    log_config: bool = True
+    experiment_name: Optional[str] = None
+    # Additional A100 optimizations
+    dataloader_num_workers: int = 8  # More workers for faster data loading
+    dataloader_pin_memory: bool = True
+    dataloader_prefetch_factor: int = 2
+    # Memory optimizations
+    max_grad_norm: float = 1.0  # Gradient clipping
+    group_by_length: bool = True  # Group similar length sequences
+    # Training duration calculations
+    # With 800k datapoints and effective batch size of 120:
+    # Steps per epoch = 800,000 / 120 ~= 6,667 steps
+    # For 3 passes: ~20,000 steps
+    # For 4 passes: ~26,667 steps
+    # For 5 passes: ~33,333 steps
+    # Current max_iters = 25,000 (about 3.75 passes)
+    def __post_init__(self):
+        if self.chat_template_kwargs is None:
+            self.chat_template_kwargs = {
+                "enable_thinking": False,
+                "add_generation_prompt": True
+            }
+        # Validate configuration
+        if self.fp16 and self.bf16:
+            raise ValueError("Cannot use both fp16 and bf16")
+        if self.max_seq_length > 131072:  # 128k limit
+            raise ValueError("max_seq_length cannot exceed 131072")
+        # Calculate training statistics
+        effective_batch_size = self.batch_size * self.gradient_accumulation_steps
+        steps_per_epoch = 800000 // effective_batch_size  # Approximate for 800k dataset
+        epochs_for_max_iters = self.max_iters / steps_per_epoch
+        print("=== Multiple Passes Training Configuration ===")
+        print(f"Effective batch size: {effective_batch_size}")
+        print(f"Steps per epoch: ~{steps_per_epoch}")
+        print(f"Training for ~{epochs_for_max_iters:.1f} epochs")
+        print(f"Total training steps: {self.max_iters}")
+        print(f"Learning rate: {self.learning_rate}")
+        print(f"Mixed precision: {'bf16' if self.bf16 else 'fp16'}")
+        print(f"Max sequence length: {self.max_seq_length}")
+        print(f"Gradient checkpointing: {self.use_gradient_checkpointing}")
+        print(f"Warmup steps: {self.warmup_steps}")
+        print(f"Save interval: {self.save_interval}")
+        print("=" * 50)
+        # Set default experiment name if not provided
+        if self.experiment_name is None:
+            self.experiment_name = "smollm3_openhermes_fr_multiple_passes"
+def get_config(config_path: str) -> SmolLM3ConfigOpenHermesFRMultiplePasses:
+    """Load configuration from file or return default"""
+    if os.path.exists(config_path):
+        # Load from file if it exists
+        import importlib.util
+        spec = importlib.util.spec_from_file_location("config_module", config_path)
+        config_module = importlib.util.module_from_spec(spec)
+        spec.loader.exec_module(config_module)
+        if hasattr(config_module, 'config'):
+            return config_module.config
+        else:
+            # Otherwise fall back to any module-level instance of the config class
+            for attr_name in dir(config_module):
+                attr = getattr(config_module, attr_name)
+                if isinstance(attr, SmolLM3ConfigOpenHermesFRMultiplePasses):
+                    return attr
+    # Return default configuration
+    return SmolLM3ConfigOpenHermesFRMultiplePasses()
+# Default configuration instance
+config = SmolLM3ConfigOpenHermesFRMultiplePasses()
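The duration comments in both configs follow the same arithmetic: steps per pass is the dataset size divided by the effective batch size. A small sketch of that calculation is below. The helper names are hypothetical, the 800,000 figure is the approximation used in the config comments, and a single training process is assumed (under DDP the effective batch would also scale with the number of processes). The configs use floor division, while this sketch rounds up so a partial final batch still counts as a step.

```python
import math


def steps_per_epoch(dataset_size, batch_size, grad_accum_steps):
    """Optimizer steps needed for one full pass (single process assumed)."""
    effective_batch = batch_size * grad_accum_steps
    return math.ceil(dataset_size / effective_batch)


def steps_for_passes(dataset_size, batch_size, grad_accum_steps, passes):
    """max_iters value that covers the requested number of passes."""
    return steps_per_epoch(dataset_size, batch_size, grad_accum_steps) * passes


def passes_for_steps(max_iters, dataset_size, batch_size, grad_accum_steps):
    """Number of passes a given max_iters corresponds to."""
    return max_iters / steps_per_epoch(dataset_size, batch_size, grad_accum_steps)


DATASET_SIZE = 800_000  # approximate figure used in the config comments

# Multiple-passes config: batch_size=6, gradient_accumulation_steps=20
print(steps_per_epoch(DATASET_SIZE, 6, 20))                    # 6667
print(steps_for_passes(DATASET_SIZE, 6, 20, 4))                # 26668
print(f"{passes_for_steps(25_000, DATASET_SIZE, 6, 20):.2f}")  # 3.75

# A100 large config: batch_size=8, gradient_accumulation_steps=16
print(f"{passes_for_steps(8_000, DATASET_SIZE, 8, 16):.2f}")   # 1.28
```

Filtering bad entries and the 98/1/1 split both shrink the training set, so the real number of passes for a fixed `max_iters` will be slightly higher than these estimates.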

data.py CHANGED Viewed

@@ -22,13 +22,17 @@ class SmolLM3Dataset:
         tokenizer: PreTrainedTokenizer,
         max_seq_length: int = 4096,
         use_chat_template: bool = True,
-        chat_template_kwargs: Optional[Dict] = None
     ):
         self.data_path = data_path
         self.tokenizer = tokenizer
         self.max_seq_length = max_seq_length
         self.use_chat_template = use_chat_template
         self.chat_template_kwargs = chat_template_kwargs or {}
         # Load and process dataset
         self.dataset = self._load_dataset()
@@ -74,6 +78,17 @@ class SmolLM3Dataset:
         try:
             dataset = load_dataset(self.data_path)
             logger.info(f"Loaded Hugging Face dataset: {self.data_path}")
             # If only 'train' split exists, create validation and test splits
             if ("train" in dataset) and ("validation" not in dataset or "test" not in dataset):
                 logger.info("Automatically splitting train into train/validation/test (98/1/1)")
@@ -123,6 +138,11 @@ class SmolLM3Dataset:
                             {"role": "user", "content": example["prompt"]},
                             {"role": "assistant", "content": example["accepted_completion"]}
                         ]
                     else:
                         # Fallback: treat as plain text
                         return {"text": str(example)}

         tokenizer: PreTrainedTokenizer,
         max_seq_length: int = 4096,
         use_chat_template: bool = True,
+        chat_template_kwargs: Optional[Dict] = None,
+        filter_bad_entries: bool = False,
+        bad_entry_field: str = "bad_entry"
     ):
         self.data_path = data_path
         self.tokenizer = tokenizer
         self.max_seq_length = max_seq_length
         self.use_chat_template = use_chat_template
         self.chat_template_kwargs = chat_template_kwargs or {}
+        self.filter_bad_entries = filter_bad_entries
+        self.bad_entry_field = bad_entry_field
         # Load and process dataset
         self.dataset = self._load_dataset()
         try:
             dataset = load_dataset(self.data_path)
             logger.info(f"Loaded Hugging Face dataset: {self.data_path}")
+            # Filter bad entries if requested
+            # Each split is checked for the field below, so no 'train' split is assumed here
+            if self.filter_bad_entries:
+                logger.info(f"Filtering out bad entries using field: {self.bad_entry_field}")
+                for split in dataset:
+                    if self.bad_entry_field in dataset[split].column_names:
+                        original_size = len(dataset[split])
+                        dataset[split] = dataset[split].filter(lambda x: not x[self.bad_entry_field])
+                        filtered_size = len(dataset[split])
+                        logger.info(f"Filtered {split}: {original_size} -> {filtered_size} samples")
             # If only 'train' split exists, create validation and test splits
             if ("train" in dataset) and ("validation" not in dataset or "test" not in dataset):
                 logger.info("Automatically splitting train into train/validation/test (98/1/1)")
                             {"role": "user", "content": example["prompt"]},
                             {"role": "assistant", "content": example["accepted_completion"]}
                         ]
+                    elif "prompt" in example and "completion" in example:
+                        messages = [
+                            {"role": "user", "content": example["prompt"]},
+                            {"role": "assistant", "content": example["completion"]}
+                        ]
                     else:
                         # Fallback: treat as plain text
                         return {"text": str(example)}
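The two behaviours added to `data.py` (dropping flagged rows and accepting a plain `completion` field) can be illustrated without the `datasets` library. The sketch below uses plain dicts; `keep_example`, `to_messages`, and the sample records are hypothetical, and only the two prompt-based branches visible in the diff are mirrored.

```python
def keep_example(example, bad_entry_field="bad_entry"):
    """True when the example is not flagged as bad (a missing flag counts as good)."""
    return not example.get(bad_entry_field, False)


def to_messages(example):
    """Map a prompt/completion record to chat messages, or None if fields are absent."""
    if "prompt" not in example:
        return None
    # Prefer the OpenHermes-FR field name, then fall back to the generic one.
    for target in ("accepted_completion", "completion"):
        if target in example:
            return [
                {"role": "user", "content": example["prompt"]},
                {"role": "assistant", "content": example[target]},
            ]
    return None


# Hypothetical sample records for illustration only
records = [
    {"prompt": "Bonjour ?", "accepted_completion": "Salut !", "bad_entry": False},
    {"prompt": "x", "accepted_completion": "y", "bad_entry": True},
    {"prompt": "Question", "completion": "Réponse"},
]

kept = [r for r in records if keep_example(r)]
conversations = [to_messages(r) for r in kept]
print(len(conversations))  # 2
```

Unlike this sketch, the filter in the diff indexes the field directly, which is safe there because the column's presence is checked per split first.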

deploy_trackio_space.py ADDED Viewed

	@@ -0,0 +1,235 @@

+#!/usr/bin/env python3
+"""
+Deployment script for Trackio on Hugging Face Spaces
+Automates the process of creating and configuring a Trackio Space
+"""
+import os
+import json
+import requests
+import subprocess
+import sys
+from pathlib import Path
+from typing import Dict, Any, Optional
+class TrackioSpaceDeployer:
+    """Deployer for Trackio on Hugging Face Spaces"""
+    def __init__(self, space_name: str, username: str, token: str):
+        self.space_name = space_name
+        self.username = username
+        self.token = token
+        self.space_url = f"https://huggingface.co/spaces/{username}/{space_name}"
+    def create_space(self) -> bool:
+        """Create a new Hugging Face Space"""
+        try:
+            print(f"Creating Space: {self.space_name}")
+            # Create space using Hugging Face CLI
+            cmd = [
+                "huggingface-cli", "repo", "create",
+                f"{self.username}/{self.space_name}",
+                "--type", "space",
+                "--space-sdk", "gradio",
+                "--space-hardware", "cpu-basic"
+            ]
+            result = subprocess.run(cmd, capture_output=True, text=True)
+            if result.returncode == 0:
+                print(f"✅ Space created successfully: {self.space_url}")
+                return True
+            else:
+                print(f"❌ Failed to create space: {result.stderr}")
+                return False
+        except Exception as e:
+            print(f"❌ Error creating space: {e}")
+            return False
+    def upload_files(self) -> bool:
+        """Upload necessary files to the Space"""
+        try:
+            print("Uploading files to Space...")
+            # Files to upload
+            files_to_upload = [
+                "app.py",
+                "requirements_space.txt",
+                "README.md"
+            ]
+            for file_path in files_to_upload:
+                if os.path.exists(file_path):
+                    # Use git to add and push files
+                    subprocess.run(["git", "add", file_path], check=True)
+                    subprocess.run(["git", "commit", "-m", f"Add {file_path}"], check=True)
+                    subprocess.run(["git", "push"], check=True)
+                    print(f"✅ Uploaded {file_path}")
+                else:
+                    print(f"⚠️  File not found: {file_path}")
+            return True
+        except Exception as e:
+            print(f"❌ Error uploading files: {e}")
+            return False
+    def configure_space(self) -> bool:
+        """Configure the Space settings"""
+        try:
+            print("Configuring Space settings...")
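+            # Space metadata. NOTE: this dict is not used below; the YAML front matter
+            # in space_readme is what actually gets written, and its values differ.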
+            space_config = {
+                "title": "Trackio - Experiment Tracking",
+                "emoji": "🚀",
+                "colorFrom": "blue",
+                "colorTo": "purple",
+                "sdk": "gradio",
+                "sdk_version": "4.0.0",
+                "app_file": "app.py",
+                "pinned": False
+            }
+            # Write README.md for the space
+            space_readme = f"""---
+title: Trackio for Petite Elle L'Aime
+emoji: 🐠
+colorFrom: indigo
+colorTo: yellow
+sdk: gradio
+sdk_version: 5.38.0
+app_file: app.py
+pinned: true
+license: mit
+short_description: trackio for training monitoring
+---
+# Trackio Experiment Tracking
+A Gradio interface for experiment tracking and monitoring.
+## Features
+- Create and manage experiments
+- Log training metrics and parameters
+- View experiment details and results
+- Update experiment status
+## Usage
+1. Create a new experiment using the "Create Experiment" tab
+2. Log metrics during training using the "Log Metrics" tab
+3. View experiment details using the "View Experiments" tab
+4. Update experiment status using the "Update Status" tab
+## Integration
+To connect your training script to this Trackio Space:
+```python
+from monitoring import SmolLM3Monitor
+monitor = SmolLM3Monitor(
+    experiment_name="my_experiment",
+    trackio_url="{self.space_url}",
+    enable_tracking=True
+)
+```
+Visit: {self.space_url}
+"""
+            with open("README.md", "w") as f:
+                f.write(space_readme)
+            return True
+        except Exception as e:
+            print(f"❌ Error configuring space: {e}")
+            return False
+    def test_space(self) -> bool:
+        """Test if the Space is working correctly"""
+        try:
+            print("Testing Space...")
+            # Wait a bit for the space to build
+            import time
+            time.sleep(30)
+            # Try to access the space
+            response = requests.get(self.space_url, timeout=10)
+            if response.status_code == 200:
+                print(f"✅ Space is accessible: {self.space_url}")
+                return True
+            else:
+                print(f"⚠️  Space returned status code: {response.status_code}")
+                return False
+        except Exception as e:
+            print(f"❌ Error testing space: {e}")
+            return False
+    def deploy(self) -> bool:
+        """Complete deployment process"""
+        print("🚀 Starting Trackio Space deployment...")
+        # Step 1: Create space
+        if not self.create_space():
+            return False
+        # Step 2: Configure space
+        if not self.configure_space():
+            return False
+        # Step 3: Upload files
+        if not self.upload_files():
+            return False
+        # Step 4: Test space
+        if not self.test_space():
+            print("⚠️  Space created but may need time to build")
+        print("🎉 Deployment completed!")
+        print(f"📊 Trackio Space URL: {self.space_url}")
+        print(f"🔧 Space configuration: {self.space_url}/settings")
+        return True
+def main():
+    """Main deployment function"""
+    print("Trackio Space Deployment Script")
+    print("=" * 40)
+    # Get user input
+    username = input("Enter your Hugging Face username: ").strip()
+    space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
+    # Read the token without echoing it to the terminal
+    from getpass import getpass
+    token = getpass("Enter your Hugging Face token (optional): ").strip()
+    if not username or not space_name:
+        print("❌ Username and Space name are required")
+        sys.exit(1)
+    # Create deployer
+    deployer = TrackioSpaceDeployer(space_name, username, token)
+    # Run deployment
+    success = deployer.deploy()
+    if success:
+        print("\n✅ Deployment successful!")
+        print(f"🌐 Your Trackio Space: {deployer.space_url}")
+        print("\nNext steps:")
+        print("1. Wait for the Space to build (usually 2-5 minutes)")
+        print("2. Test the interface by visiting the Space URL")
+        print("3. Use the Space URL in your training scripts")
+    else:
+        print("\n❌ Deployment failed!")
+        print("Check the error messages above and try again.")
+if __name__ == "__main__":
+    main()
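The deployer above shells out to `huggingface-cli` and `git`. As an alternative, the same create-and-upload flow can be sketched with the `huggingface_hub` Python API, which avoids depending on the state of the local git checkout. This is a sketch under stated assumptions, not the script's method: `deploy_space` is a hypothetical helper, and `api` is assumed to be a `huggingface_hub.HfApi` instance (or anything exposing the same `create_repo` and `upload_file` keyword arguments).

```python
import os


def deploy_space(api, repo_id, files):
    """Create a Gradio Space if needed and upload the given local files.

    `api` is expected to behave like huggingface_hub.HfApi.
    Returns the list of local paths that were uploaded.
    """
    api.create_repo(
        repo_id=repo_id,
        repo_type="space",
        space_sdk="gradio",
        exist_ok=True,  # do not fail when the Space already exists
    )
    uploaded = []
    for local_path in files:
        if not os.path.exists(local_path):
            print(f"Skipping missing file: {local_path}")
            continue
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=os.path.basename(local_path),
            repo_id=repo_id,
            repo_type="space",
        )
        uploaded.append(local_path)
    return uploaded


# Example call (performs network requests, so it is left commented out):
# from huggingface_hub import HfApi
# api = HfApi(token=os.getenv("HF_TOKEN"))
# deploy_space(api, "your-username/trackio-monitoring",
#              ["app.py", "requirements_space.txt", "README.md"])
```

Passing the API object in keeps the function easy to exercise with a stub, and `path_in_repo` can be changed if a file should land in the Space under a different name.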

monitoring.py ADDED Viewed

	@@ -0,0 +1,298 @@

+"""
+Trackio Monitoring Integration for SmolLM3 Fine-tuning
+Provides comprehensive experiment tracking and monitoring capabilities
+"""
+import os
+import json
+import logging
+from typing import Dict, Any, Optional, List
+from datetime import datetime
+import torch
+from pathlib import Path
+try:
+    import trackio
+    from trackio import TrackioClient
+    TRACKIO_AVAILABLE = True
+except ImportError:
+    TRACKIO_AVAILABLE = False
+    print("Warning: Trackio not available. Install with: pip install trackio")
+logger = logging.getLogger(__name__)
+class SmolLM3Monitor:
+    """Monitoring and tracking for SmolLM3 fine-tuning experiments"""
+    def __init__(
+        self,
+        experiment_name: str,
+        trackio_url: Optional[str] = None,
+        trackio_token: Optional[str] = None,
+        enable_tracking: bool = True,
+        log_artifacts: bool = True,
+        log_metrics: bool = True,
+        log_config: bool = True
+    ):
+        self.experiment_name = experiment_name
+        self.enable_tracking = enable_tracking and TRACKIO_AVAILABLE
+        self.log_artifacts = log_artifacts
+        # Stored under *_enabled names so the flags do not shadow the
+        # log_metrics() and log_config() methods defined below
+        self.log_metrics_enabled = log_metrics
+        self.log_config_enabled = log_config
+        # Experiment metadata (set before _setup_trackio, which reads start_time
+        # and assigns experiment_id)
+        self.experiment_id = None
+        self.start_time = datetime.now()
+        self.metrics_history = []
+        self.artifacts = []
+        # Initialize Trackio client
+        self.trackio_client = None
+        if self.enable_tracking:
+            self._setup_trackio(trackio_url, trackio_token)
+        logger.info(f"Initialized monitoring for experiment: {experiment_name}")
+    def _setup_trackio(self, trackio_url: Optional[str], trackio_token: Optional[str]):
+        """Setup Trackio client"""
+        try:
+            # Get Trackio configuration from environment or parameters
+            url = trackio_url or os.getenv('TRACKIO_URL')
+            token = trackio_token or os.getenv('TRACKIO_TOKEN')
+            if not url:
+                logger.warning("Trackio URL not provided. Set TRACKIO_URL environment variable.")
+                self.enable_tracking = False
+                return
+            self.trackio_client = TrackioClient(
+                url=url,
+                token=token
+            )
+            # Create or get experiment
+            self.experiment_id = self.trackio_client.create_experiment(
+                name=self.experiment_name,
+                description=f"SmolLM3 fine-tuning experiment started at {self.start_time}"
+            )
+            logger.info(f"Trackio client initialized. Experiment ID: {self.experiment_id}")
+        except Exception as e:
+            logger.error(f"Failed to initialize Trackio: {e}")
+            self.enable_tracking = False
+    def log_config(self, config: Dict[str, Any]):
+        """Log experiment configuration"""
+        if not self.enable_tracking or not self.log_config_enabled:
+            return
+        try:
+            # Log configuration as parameters
+            self.trackio_client.log_parameters(
+                experiment_id=self.experiment_id,
+                parameters=config
+            )
+            # Also save config locally
+            config_path = f"config_{self.experiment_name}_{self.start_time.strftime('%Y%m%d_%H%M%S')}.json"
+            with open(config_path, 'w') as f:
+                json.dump(config, f, indent=2, default=str)
+            self.artifacts.append(config_path)
+            logger.info(f"Configuration logged to Trackio and saved to {config_path}")
+        except Exception as e:
+            logger.error(f"Failed to log configuration: {e}")
+    def log_metrics(self, metrics: Dict[str, Any], step: Optional[int] = None):
+        """Log training metrics"""
+        if not self.enable_tracking or not self.log_metrics_enabled:
+            return
+        try:
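+            # Work on a copy so the caller's dict (e.g. the Trainer's logs) is not mutated
+            metrics = dict(metrics)
+            # Add timestamp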
+            metrics['timestamp'] = datetime.now().isoformat()
+            if step is not None:
+                metrics['step'] = step
+            # Log to Trackio
+            self.trackio_client.log_metrics(
+                experiment_id=self.experiment_id,
+                metrics=metrics,
+                step=step
+            )
+            # Store locally
+            self.metrics_history.append(metrics)
+            logger.debug(f"Metrics logged: {metrics}")
+        except Exception as e:
+            logger.error(f"Failed to log metrics: {e}")
+    def log_model_checkpoint(self, checkpoint_path: str, step: Optional[int] = None):
+        """Log model checkpoint"""
+        if not self.enable_tracking or not self.log_artifacts:
+            return
+        try:
+            # Log checkpoint as artifact
+            self.trackio_client.log_artifact(
+                experiment_id=self.experiment_id,
+                file_path=checkpoint_path,
+                artifact_name=f"checkpoint_step_{step}" if step is not None else "checkpoint"
+            )
+            self.artifacts.append(checkpoint_path)
+            logger.info(f"Checkpoint logged: {checkpoint_path}")
+        except Exception as e:
+            logger.error(f"Failed to log checkpoint: {e}")
+    def log_evaluation_results(self, results: Dict[str, Any], step: Optional[int] = None):
+        """Log evaluation results"""
+        if not self.enable_tracking:
+            return
+        try:
+            # Add evaluation prefix to metrics
+            eval_metrics = {f"eval_{k}": v for k, v in results.items()}
+            self.log_metrics(eval_metrics, step)
+            # Save evaluation results locally
+            eval_path = f"eval_results_step_{step}_{self.start_time.strftime('%Y%m%d_%H%M%S')}.json"
+            with open(eval_path, 'w') as f:
+                json.dump(results, f, indent=2, default=str)
+            self.artifacts.append(eval_path)
+            logger.info(f"Evaluation results logged and saved to {eval_path}")
+        except Exception as e:
+            logger.error(f"Failed to log evaluation results: {e}")
+    def log_system_metrics(self, step: Optional[int] = None):
+        """Log system metrics (GPU, memory, etc.)"""
+        if not self.enable_tracking:
+            return
+        try:
+            system_metrics = {}
+            # GPU metrics
+            if torch.cuda.is_available():
+                for i in range(torch.cuda.device_count()):
+                    system_metrics[f'gpu_{i}_memory_allocated'] = torch.cuda.memory_allocated(i) / 1024**3  # GB
+                    system_metrics[f'gpu_{i}_memory_reserved'] = torch.cuda.memory_reserved(i) / 1024**3  # GB
+                    system_metrics[f'gpu_{i}_utilization'] = torch.cuda.utilization(i) if hasattr(torch.cuda, 'utilization') else 0
+            # CPU and memory metrics (basic)
+            import psutil
+            system_metrics['cpu_percent'] = psutil.cpu_percent()
+            system_metrics['memory_percent'] = psutil.virtual_memory().percent
+            self.log_metrics(system_metrics, step)
+        except Exception as e:
+            logger.error(f"Failed to log system metrics: {e}")
+    def log_training_summary(self, summary: Dict[str, Any]):
+        """Log training summary at the end"""
+        if not self.enable_tracking:
+            return
+        try:
+            # Add experiment duration
+            end_time = datetime.now()
+            duration = (end_time - self.start_time).total_seconds()
+            summary['experiment_duration_seconds'] = duration
+            summary['experiment_duration_hours'] = duration / 3600
+            # Log final summary
+            self.trackio_client.log_parameters(
+                experiment_id=self.experiment_id,
+                parameters=summary
+            )
+            # Save summary locally
+            summary_path = f"training_summary_{self.experiment_name}_{self.start_time.strftime('%Y%m%d_%H%M%S')}.json"
+            with open(summary_path, 'w') as f:
+                json.dump(summary, f, indent=2, default=str)
+            self.artifacts.append(summary_path)
+            logger.info(f"Training summary logged and saved to {summary_path}")
+        except Exception as e:
+            logger.error(f"Failed to log training summary: {e}")
+    def create_monitoring_callback(self):
+        """Create a callback for integration with Hugging Face Trainer"""
+        if not self.enable_tracking:
+            return None
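+        # Subclass TrainerCallback so the Trainer finds default no-op handlers
+        # for the events this class does not override
+        from transformers import TrainerCallback
+        class TrackioCallback(TrainerCallback):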
+            def __init__(self, monitor):
+                self.monitor = monitor
+            def on_log(self, args, state, control, logs=None, **kwargs):
+                """Called when logs are created"""
+                if logs:
+                    self.monitor.log_metrics(logs, state.global_step)
+                    self.monitor.log_system_metrics(state.global_step)
+            def on_save(self, args, state, control, **kwargs):
+                """Called when a checkpoint is saved"""
+                checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
+                if os.path.exists(checkpoint_path):
+                    self.monitor.log_model_checkpoint(checkpoint_path, state.global_step)
+            def on_evaluate(self, args, state, control, metrics=None, **kwargs):
+                """Called when evaluation is performed"""
+                if metrics:
+                    self.monitor.log_evaluation_results(metrics, state.global_step)
+        return TrackioCallback(self)
+    def get_experiment_url(self) -> Optional[str]:
+        """Get the URL to view the experiment in Trackio"""
+        if self.trackio_client and self.experiment_id:
+            return f"{self.trackio_client.url}/experiments/{self.experiment_id}"
+        return None
+    def close(self):
+        """Close the monitoring session"""
+        if self.enable_tracking and self.trackio_client:
+            try:
+                # Mark experiment as completed
+                self.trackio_client.update_experiment_status(
+                    experiment_id=self.experiment_id,
+                    status="completed"
+                )
+                logger.info("Monitoring session closed")
+            except Exception as e:
+                logger.error(f"Failed to close monitoring session: {e}")
+# Utility function to create monitor from config
+def create_monitor_from_config(config, experiment_name: Optional[str] = None) -> SmolLM3Monitor:
+    """Create a monitor instance from configuration"""
+    if experiment_name is None:
+        experiment_name = f"smollm3_finetune_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
+    # Extract monitoring configuration
+    trackio_url = getattr(config, 'trackio_url', None)
+    trackio_token = getattr(config, 'trackio_token', None)
+    enable_tracking = getattr(config, 'enable_tracking', True)
+    log_artifacts = getattr(config, 'log_artifacts', True)
+    log_metrics = getattr(config, 'log_metrics', True)
+    log_config = getattr(config, 'log_config', True)
+    return SmolLM3Monitor(
+        experiment_name=experiment_name,
+        trackio_url=trackio_url,
+        trackio_token=trackio_token,
+        enable_tracking=enable_tracking,
+        log_artifacts=log_artifacts,
+        log_metrics=log_metrics,
+        log_config=log_config
+    )
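`SmolLM3Monitor` accepts constructor flags named `log_metrics` and `log_config` and also defines methods with those names, so how the flags are stored matters: in Python, an instance attribute takes precedence over a method of the same name. The sketch below illustrates the pitfall and one way around it. Both classes are hypothetical stand-ins, not the monitor itself.

```python
class ShadowedMonitor:
    """The boolean flag reuses the method name, so the method becomes unreachable."""

    def __init__(self, log_metrics=True):
        self.log_metrics = log_metrics  # instance attribute hides the method

    def log_metrics(self, metrics):
        return dict(metrics)


class SafeMonitor:
    """The flag lives under a distinct name, so the method stays callable."""

    def __init__(self, log_metrics=True):
        self.log_metrics_enabled = log_metrics
        self.history = []

    def log_metrics(self, metrics):
        if not self.log_metrics_enabled:
            return None
        record = dict(metrics)  # copy so the caller's dict is untouched
        self.history.append(record)
        return record


def call_shadowed():
    """Try to call the shadowed method and report what happens."""
    try:
        ShadowedMonitor().log_metrics({"loss": 1.0})
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"


print(call_shadowed())
print(SafeMonitor().log_metrics({"loss": 1.0}))
```

Since the monitor wraps its logging calls in broad `try`/`except` blocks, an error of this kind would surface only as a logged message rather than a crash, which makes it easy to miss.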

push_to_huggingface.py ADDED Viewed

	@@ -0,0 +1,486 @@

+#!/usr/bin/env python3
+"""
+Push Trained Model and Results to Hugging Face Hub
+Integrates with Trackio monitoring and provides complete model deployment
+"""
+import os
+import json
+import argparse
+import logging
+from pathlib import Path
+from typing import Dict, Any, Optional, List
+from datetime import datetime
+import subprocess
+import shutil
+try:
+    from huggingface_hub import HfApi, create_repo, upload_file
+    from huggingface_hub import snapshot_download, hf_hub_download
+    HF_AVAILABLE = True
+except ImportError:
+    HF_AVAILABLE = False
+    print("Warning: huggingface_hub not available. Install with: pip install huggingface_hub")
+try:
+    from monitoring import SmolLM3Monitor
+    MONITORING_AVAILABLE = True
+except ImportError:
+    MONITORING_AVAILABLE = False
+    print("Warning: monitoring module not available")
+logger = logging.getLogger(__name__)
+class HuggingFacePusher:
+    """Push trained models and results to Hugging Face Hub"""
+    def __init__(
+        self,
+        model_path: str,
+        repo_name: str,
+        token: Optional[str] = None,
+        private: bool = False,
+        trackio_url: Optional[str] = None,
+        experiment_name: Optional[str] = None
+    ):
+        self.model_path = Path(model_path)
+        self.repo_name = repo_name
+        self.token = token or os.getenv('HF_TOKEN')
+        self.private = private
+        self.trackio_url = trackio_url
+        self.experiment_name = experiment_name
+        # Initialize HF API
+        if HF_AVAILABLE:
+            self.api = HfApi(token=self.token)
+        else:
+            raise ImportError("huggingface_hub is required. Install with: pip install huggingface_hub")
+        # Initialize monitoring if available
+        self.monitor = None
+        if MONITORING_AVAILABLE and trackio_url:
+            self.monitor = SmolLM3Monitor(
+                experiment_name=experiment_name or "model_push",
+                trackio_url=trackio_url,
+                enable_tracking=True
+            )
+        logger.info(f"Initialized HuggingFacePusher for {repo_name}")
+    def create_repository(self) -> bool:
+        """Create the Hugging Face repository"""
+        try:
+            logger.info(f"Creating repository: {self.repo_name}")
+            # Create repository
+            create_repo(
+                repo_id=self.repo_name,
+                token=self.token,
+                private=self.private,
+                exist_ok=True
+            )
+            logger.info(f"✅ Repository created: https://huggingface.co/{self.repo_name}")
+            return True
+        except Exception as e:
+            logger.error(f"❌ Failed to create repository: {e}")
+            return False
+    def validate_model_path(self) -> bool:
+        """Validate that the model path contains required files"""
+        required_files = [
+            "config.json",
+            "pytorch_model.bin",
+            "tokenizer.json",
+            "tokenizer_config.json"
+        ]
+        missing_files = []
+        for file in required_files:
+            if not (self.model_path / file).exists():
+                missing_files.append(file)
+        if missing_files:
+            logger.error(f"❌ Missing required files: {missing_files}")
+            return False
+        logger.info("✅ Model files validated")
+        return True
+    def create_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
+        """Create a comprehensive model card"""
+        model_card = f"""---
+language:
+- en
+license: mit
+tags:
+- smollm3
+- fine-tuned
+- text-generation
+- transformers
+---
+# {self.repo_name.split('/')[-1]}
+This is a fine-tuned SmolLM3 model based on the HuggingFaceTB/SmolLM3-3B architecture.
+## Model Details
+- **Base Model**: HuggingFaceTB/SmolLM3-3B
+- **Fine-tuning Method**: Supervised Fine-tuning
+- **Training Date**: {datetime.now().strftime('%Y-%m-%d')}
+- **Model Size**: {self._get_model_size():.1f} GB
+## Training Configuration
+```json
+{json.dumps(training_config, indent=2)}
+```
+## Training Results
+```json
+{json.dumps(results, indent=2)}
+```
+## Usage
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load model and tokenizer
+model = AutoModelForCausalLM.from_pretrained("{self.repo_name}")
+tokenizer = AutoTokenizer.from_pretrained("{self.repo_name}")
+# Generate text
+inputs = tokenizer("Hello, how are you?", return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+## Training Information
+- **Framework**: Transformers
+- **Hardware**: {self._get_hardware_info()}
+- **Training Time**: {results.get('training_time_hours', 'Unknown')} hours
+- **Final Loss**: {results.get('final_loss', 'Unknown')}
+- **Final Accuracy**: {results.get('final_accuracy', 'Unknown')}
+## Model Performance
+- **Training Loss**: {results.get('train_loss', 'Unknown')}
+- **Validation Loss**: {results.get('eval_loss', 'Unknown')}
+- **Training Steps**: {results.get('total_steps', 'Unknown')}
+## Limitations and Biases
+This model is fine-tuned for specific tasks and may not generalize well to all use cases. Please evaluate the model's performance on your specific task before deployment.
+## License
+This model is licensed under the MIT License.
+"""
+        return model_card
+    def _get_model_size(self) -> float:
+        """Get model size in GB"""
+        try:
+            total_size = 0
+            for file in self.model_path.rglob("*"):
+                if file.is_file():
+                    total_size += file.stat().st_size
+            return total_size / (1024**3)  # Convert to GB
+        except Exception:
+            return 0.0
+    def _get_hardware_info(self) -> str:
+        """Get hardware information"""
+        try:
+            import torch
+            if torch.cuda.is_available():
+                gpu_name = torch.cuda.get_device_name(0)
+                return f"GPU: {gpu_name}"
+            else:
+                return "CPU"
+        except Exception:
+            return "Unknown"
+    def upload_model_files(self) -> bool:
+        """Upload model files to Hugging Face Hub"""
+        try:
+            logger.info("Uploading model files...")
+            # Upload all files in the model directory
+            for file_path in self.model_path.rglob("*"):
+                if file_path.is_file():
+                    relative_path = file_path.relative_to(self.model_path)
+                    remote_path = str(relative_path)
+                    logger.info(f"Uploading {relative_path}")
+                    upload_file(
+                        path_or_fileobj=str(file_path),
+                        path_in_repo=remote_path,
+                        repo_id=self.repo_name,
+                        token=self.token
+                    )
+            logger.info("✅ Model files uploaded successfully")
+            return True
+        except Exception as e:
+            logger.error(f"❌ Failed to upload model files: {e}")
+            return False
+    def upload_training_results(self, results_path: str) -> bool:
+        """Upload training results and logs"""
+        try:
+            logger.info("Uploading training results...")
+            results_files = [
+                "train_results.json",
+                "eval_results.json",
+                "training_config.json",
+                "training.log"
+            ]
+            for file_name in results_files:
+                file_path = Path(results_path) / file_name
+                if file_path.exists():
+                    logger.info(f"Uploading {file_name}")
+                    upload_file(
+                        path_or_fileobj=str(file_path),
+                        path_in_repo=f"training_results/{file_name}",
+                        repo_id=self.repo_name,
+                        token=self.token
+                    )
+            logger.info("✅ Training results uploaded successfully")
+            return True
+        except Exception as e:
+            logger.error(f"❌ Failed to upload training results: {e}")
+            return False
+    def create_readme(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> bool:
+        """Create and upload README.md"""
+        try:
+            logger.info("Creating README.md...")
+            readme_content = f"""# {self.repo_name.split('/')[-1]}
+A fine-tuned SmolLM3 model for text generation tasks.
+## Quick Start
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained("{self.repo_name}")
+tokenizer = AutoTokenizer.from_pretrained("{self.repo_name}")
+# Generate text
+text = "Hello, how are you?"
+inputs = tokenizer(text, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+## Model Information
+- **Base Model**: HuggingFaceTB/SmolLM3-3B
+- **Fine-tuning Date**: {datetime.now().strftime('%Y-%m-%d')}
+- **Model Size**: {self._get_model_size():.1f} GB
+- **Training Steps**: {results.get('total_steps', 'Unknown')}
+- **Final Loss**: {results.get('final_loss', 'Unknown')}
+## Training Configuration
+```json
+{json.dumps(training_config, indent=2)}
+```
+## Performance Metrics
+```json
+{json.dumps(results, indent=2)}
+```
+## Files
+- `pytorch_model.bin`: Model weights
+- `config.json`: Model configuration
+- `tokenizer.json`: Tokenizer configuration
+- `training_results/`: Training logs and results
+## License
+MIT License
+"""
+            # Upload README straight from memory; upload_file accepts bytes,
+            # so no temporary file is left behind if the upload fails
+            upload_file(
+                path_or_fileobj=readme_content.encode("utf-8"),
+                path_in_repo="README.md",
+                repo_id=self.repo_name,
+                token=self.token
+            )
+            logger.info("✅ README.md uploaded successfully")
+            return True
+        except Exception as e:
+            logger.error(f"❌ Failed to create README: {e}")
+            return False
+    def log_to_trackio(self, action: str, details: Dict[str, Any]):
+        """Log push action to Trackio"""
+        if self.monitor:
+            try:
+                self.monitor.log_metrics({
+                    "push_action": action,
+                    "repo_name": self.repo_name,
+                    "model_size_gb": self._get_model_size(),
+                    **details
+                })
+                logger.info(f"✅ Logged {action} to Trackio")
+            except Exception as e:
+                logger.error(f"❌ Failed to log to Trackio: {e}")
+    def push_model(self, training_config: Optional[Dict[str, Any]] = None,
+                   results: Optional[Dict[str, Any]] = None) -> bool:
+        """Complete model push process"""
+        logger.info(f"🚀 Starting model push to {self.repo_name}")
+        # Validate model path
+        if not self.validate_model_path():
+            return False
+        # Create repository
+        if not self.create_repository():
+            return False
+        # Load training config and results if not provided
+        if training_config is None:
+            training_config = self._load_training_config()
+        if results is None:
+            results = self._load_training_results()
+        # Create and upload model card
+        model_card = self.create_model_card(training_config, results)
+        model_card_path = Path("temp_model_card.md")
+        with open(model_card_path, "w") as f:
+            f.write(model_card)
+        try:
+            upload_file(
+                path_or_fileobj=str(model_card_path),
+                path_in_repo="README.md",
+                repo_id=self.repo_name,
+                token=self.token
+            )
+        finally:
+            model_card_path.unlink()
+        # Upload model files
+        if not self.upload_model_files():
+            return False
+        # Upload training results
+        if results:
+            self.upload_training_results(str(self.model_path))
+        # Log to Trackio
+        self.log_to_trackio("model_push", {
+            "model_path": str(self.model_path),
+            "repo_name": self.repo_name,
+            "private": self.private,
+            "training_config": training_config,
+            "results": results
+        })
+        logger.info(f"🎉 Model successfully pushed to: https://huggingface.co/{self.repo_name}")
+        return True
+    def _load_training_config(self) -> Dict[str, Any]:
+        """Load training configuration"""
+        config_path = self.model_path / "training_config.json"
+        if config_path.exists():
+            with open(config_path, "r") as f:
+                return json.load(f)
+        return {"model_name": "HuggingFaceTB/SmolLM3-3B"}
+    def _load_training_results(self) -> Dict[str, Any]:
+        """Load training results"""
+        results_path = self.model_path / "train_results.json"
+        if results_path.exists():
+            with open(results_path, "r") as f:
+                return json.load(f)
+        return {"final_loss": "Unknown", "total_steps": "Unknown"}
+def parse_args():
+    """Parse command line arguments"""
+    parser = argparse.ArgumentParser(description='Push trained model to Hugging Face Hub')
+    # Required arguments
+    parser.add_argument('model_path', type=str, help='Path to trained model directory')
+    parser.add_argument('repo_name', type=str, help='Hugging Face repository name (username/repo-name)')
+    # Optional arguments
+    parser.add_argument('--token', type=str, default=None, help='Hugging Face token')
+    parser.add_argument('--private', action='store_true', help='Make repository private')
+    parser.add_argument('--trackio-url', type=str, default=None, help='Trackio Space URL for logging')
+    parser.add_argument('--experiment-name', type=str, default=None, help='Experiment name for Trackio')
+    return parser.parse_args()
+def main():
+    """Main function"""
+    args = parse_args()
+    # Setup logging
+    logging.basicConfig(
+        level=logging.INFO,
+        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+    )
+    logger.info("Starting model push to Hugging Face Hub")
+    # Initialize pusher
+    try:
+        pusher = HuggingFacePusher(
+            model_path=args.model_path,
+            repo_name=args.repo_name,
+            token=args.token,
+            private=args.private,
+            trackio_url=args.trackio_url,
+            experiment_name=args.experiment_name
+        )
+        # Push model
+        success = pusher.push_model()
+        if success:
+            logger.info("✅ Model push completed successfully!")
+            logger.info(f"🌐 View your model at: https://huggingface.co/{args.repo_name}")
+        else:
+            logger.error("❌ Model push failed!")
+            return 1
+    except Exception as e:
+        logger.error(f"❌ Error during model push: {e}")
+        return 1
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())

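`validate_model_path` above insists on `pytorch_model.bin`, so a checkpoint saved in safetensors format (the default in recent `transformers` releases) or as a sharded checkpoint would be rejected. A minimal sketch of a more tolerant check follows. The helper name `find_missing_model_files` and the list of accepted weight file names are assumptions for illustration, not part of this commit.

```python
from pathlib import Path
from typing import List

# Files that must always be present (the commit's list, minus the weights).
REQUIRED_FILES = ["config.json", "tokenizer.json", "tokenizer_config.json"]

# Any one of these is accepted as the model weights (assumed naming conventions).
WEIGHT_FILES = [
    "model.safetensors",
    "model.safetensors.index.json",  # sharded safetensors checkpoint
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",  # sharded PyTorch checkpoint
]


def find_missing_model_files(model_path: str) -> List[str]:
    """Return the names of missing files; an empty list means the folder looks complete."""
    root = Path(model_path)
    missing = [name for name in REQUIRED_FILES if not (root / name).exists()]
    # The weights count as present if at least one known weight file exists.
    if not any((root / name).exists() for name in WEIGHT_FILES):
        missing.append("model weights (" + " | ".join(WEIGHT_FILES) + ")")
    return missing
```

Returning the list of missing names, rather than a bare boolean, lets the caller log exactly what is absent before deciding whether to abort the push.
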
requirements.txt CHANGED

@@ -32,4 +32,11 @@ sentencepiece>=0.1.99
 # Development
 pytest>=7.0.0
 black>=23.0.0
-isort>=5.12.0
+isort>=5.12.0
+# Experiment tracking and monitoring
+trackio>=0.1.0
+psutil>=5.9.0
+# Hugging Face Hub integration
+huggingface_hub>=0.16.0

requirements_space.txt ADDED

	@@ -0,0 +1,18 @@

+# Gradio and web interface
+gradio>=4.0.0
+gradio-client>=0.10.0
+# Core dependencies for Trackio Space
+requests>=2.31.0
+numpy>=1.24.0
+pandas>=2.0.0
+# JSON and data handling
+jsonschema>=4.17.0
+# Optional: for better UI
+plotly>=5.15.0
+matplotlib>=3.7.0
+# Development and debugging
+python-dotenv>=1.0.0

run_a100_large_experiment.py ADDED

	@@ -0,0 +1,134 @@

+#!/usr/bin/env python3
+"""
+Script to run A100 large-scale experiments on OpenHermes-FR dataset
+Supports multiple configurations for different training scenarios
+"""
+import argparse
+import os
+import sys
+from pathlib import Path
+def main():
+    parser = argparse.ArgumentParser(description="Run A100 large-scale experiments")
+    parser.add_argument(
+        "--config",
+        type=str,
+        default="config/train_smollm3_openhermes_fr_a100_large.py",
+        help="Configuration file to use"
+    )
+    parser.add_argument(
+        "--experiment-name",
+        type=str,
+        help="Custom experiment name for tracking"
+    )
+    parser.add_argument(
+        "--output-dir",
+        type=str,
+        default="./outputs",
+        help="Output directory for checkpoints and logs"
+    )
+    parser.add_argument(
+        "--resume",
+        type=str,
+        help="Resume training from checkpoint"
+    )
+    parser.add_argument(
+        "--dry-run",
+        action="store_true",
+        help="Print configuration without starting training"
+    )
+    args = parser.parse_args()
+    # Add the current directory to Python path
+    sys.path.insert(0, str(Path(__file__).parent))
+    # Import the configuration
+    try:
+        from config.train_smollm3_openhermes_fr_a100_large import get_config as get_large_config
+        from config.train_smollm3_openhermes_fr_a100_multiple_passes import get_config as get_multiple_passes_config
+        # Map config files to their respective functions
+        config_map = {
+            "config/train_smollm3_openhermes_fr_a100_large.py": get_large_config,
+            "config/train_smollm3_openhermes_fr_a100_multiple_passes.py": get_multiple_passes_config,
+        }
+        if args.config in config_map:
+            config = config_map[args.config](args.config)
+        else:
+            # Try to load from the specified config file
+            config = get_large_config(args.config)
+    except ImportError as e:
+        print(f"Error importing configuration: {e}")
+        print("Available configurations:")
+        print("  - config/train_smollm3_openhermes_fr_a100_large.py (Large batch, 1.3 passes)")
+        print("  - config/train_smollm3_openhermes_fr_a100_multiple_passes.py (Multiple passes, 4 epochs)")
+        return 1
+    # Override experiment name if provided
+    if args.experiment_name:
+        config.experiment_name = args.experiment_name
+    # Create output directory
+    os.makedirs(args.output_dir, exist_ok=True)
+    # Print configuration summary
+    print(f"\n{'='*60}")
+    print("EXPERIMENT CONFIGURATION")
+    print(f"{'='*60}")
+    print(f"Config file: {args.config}")
+    print(f"Experiment name: {config.experiment_name}")
+    print(f"Output directory: {args.output_dir}")
+    print(f"Model: {config.model_name}")
+    print(f"Batch size: {config.batch_size}")
+    print(f"Gradient accumulation: {config.gradient_accumulation_steps}")
+    print(f"Effective batch size: {config.batch_size * config.gradient_accumulation_steps}")
+    print(f"Learning rate: {config.learning_rate}")
+    print(f"Max iterations: {config.max_iters}")
+    print(f"Max sequence length: {config.max_seq_length}")
+    print(f"Mixed precision: {'bf16' if config.bf16 else 'fp16'}")
+    print(f"Dataset: {config.dataset_name}")
+    print(f"{'='*60}\n")
+    if args.dry_run:
+        print("DRY RUN - Configuration printed above. Use without --dry-run to start training.")
+        return 0
+    # Import and run training
+    try:
+        from train import main as train_main
+        # Set up training arguments
+        train_args = [
+            "--config", args.config,
+            "--output-dir", args.output_dir,
+        ]
+        if args.resume:
+            train_args.extend(["--resume", args.resume])
+        # Override sys.argv for the training script
+        original_argv = sys.argv
+        sys.argv = ["train.py"] + train_args
+        try:
+            # Run training
+            train_main()
+        finally:
+            # Restore original argv even if training raises
+            sys.argv = original_argv
+    except ImportError as e:
+        print(f"Error importing training module: {e}")
+        print("Make sure train.py is available in the current directory.")
+        return 1
+    except Exception as e:
+        print(f"Error during training: {e}")
+        return 1
+    return 0
+if __name__ == "__main__":
+    sys.exit(main())

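The summary printed by the script reports the effective batch size as `batch_size * gradient_accumulation_steps`. The same arithmetic gives the number of passes over the data, which is where the "about 1.3 passes" figure for the large-batch configuration comes from. The sketch below assumes a single GPU and a dataset of roughly 800,000 examples (the guide says "800k+"); both are approximations, and the function names are illustrative.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Number of examples consumed per optimizer step."""
    return batch_size * grad_accum_steps * num_gpus


def dataset_passes(
    max_iters: int,
    batch_size: int,
    grad_accum_steps: int,
    dataset_size: int,
    num_gpus: int = 1,
) -> float:
    """Approximate number of passes (epochs) made over the dataset."""
    examples_seen = max_iters * effective_batch_size(batch_size, grad_accum_steps, num_gpus)
    return examples_seen / dataset_size


if __name__ == "__main__":
    # Large-batch configuration: 8 per device x 16 accumulation steps, 8,000 steps.
    print(effective_batch_size(8, 16))            # 128
    print(dataset_passes(8000, 8, 16, 800_000))   # 1.28
```
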
test_monitoring.py ADDED

	@@ -0,0 +1,181 @@

+#!/usr/bin/env python3
+"""
+Quick Start Script for Trackio Integration
+Tests the monitoring functionality without full training
+"""
+import os
+import json
+import logging
+from datetime import datetime
+from monitoring import SmolLM3Monitor
+def setup_logging():
+    """Setup logging configuration"""
+    logging.basicConfig(
+        level=logging.INFO,
+        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+    )
+    return logging.getLogger(__name__)
+def test_trackio_integration():
+    """Test Trackio integration with sample data"""
+    logger = setup_logging()
+    print("🚀 Testing Trackio Integration")
+    print("=" * 40)
+    # Get Trackio URL from user or environment
+    trackio_url = os.getenv('TRACKIO_URL')
+    if not trackio_url:
+        trackio_url = input("Enter your Trackio Space URL (or press Enter to skip): ").strip()
+        if not trackio_url:
+            print("⚠️  No Trackio URL provided. Running in local mode only.")
+            trackio_url = None
+    # Initialize monitor
+    experiment_name = f"test_experiment_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
+    monitor = SmolLM3Monitor(
+        experiment_name=experiment_name,
+        trackio_url=trackio_url,
+        enable_tracking=trackio_url is not None,
+        log_artifacts=True,
+        log_metrics=True,
+        log_config=True
+    )
+    print(f"✅ Monitor initialized for experiment: {experiment_name}")
+    # Test configuration logging
+    sample_config = {
+        "model_name": "HuggingFaceTB/SmolLM3-3B",
+        "batch_size": 4,
+        "learning_rate": 2e-5,
+        "max_iters": 1000,
+        "max_seq_length": 4096,
+        "test_mode": True
+    }
+    print("📝 Logging configuration...")
+    monitor.log_config(sample_config)
+    # Test metrics logging
+    print("📊 Logging sample metrics...")
+    for step in range(0, 100, 10):
+        metrics = {
+            "loss": 2.0 - (step * 0.015),  # Simulate decreasing loss
+            "accuracy": 0.5 + (step * 0.004),  # Simulate increasing accuracy
+            "learning_rate": 2e-5,
+            "step": step
+        }
+        monitor.log_metrics(metrics, step=step)
+        print(f"   Step {step}: loss={metrics['loss']:.3f}, accuracy={metrics['accuracy']:.3f}")
+    # Test system metrics
+    print("💻 Logging system metrics...")
+    monitor.log_system_metrics(step=50)
+    # Test evaluation results
+    print("📈 Logging evaluation results...")
+    eval_results = {
+        "eval_loss": 1.2,
+        "eval_accuracy": 0.85,
+        "perplexity": 3.3,
+        "bleu_score": 0.72
+    }
+    monitor.log_evaluation_results(eval_results, step=100)
+    # Test training summary
+    print("📋 Logging training summary...")
+    summary = {
+        "final_loss": 0.5,
+        "final_accuracy": 0.89,
+        "total_steps": 100,
+        "training_time_hours": 2.5,
+        "model_size_gb": 6.2,
+        "test_mode": True
+    }
+    monitor.log_training_summary(summary)
+    # Close monitoring
+    monitor.close()
+    print("✅ Trackio integration test completed!")
+    if trackio_url:
+        experiment_url = monitor.get_experiment_url()
+        if experiment_url:
+            print(f"🌐 View your experiment at: {experiment_url}")
+    return True
+def test_local_monitoring():
+    """Test local monitoring without Trackio"""
+    logger = setup_logging()
+    print("🔧 Testing Local Monitoring")
+    print("=" * 30)
+    # Initialize monitor without Trackio
+    experiment_name = f"local_test_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
+    monitor = SmolLM3Monitor(
+        experiment_name=experiment_name,
+        enable_tracking=False,  # Disable Trackio
+        log_artifacts=True,
+        log_metrics=True,
+        log_config=True
+    )
+    print(f"✅ Local monitor initialized for experiment: {experiment_name}")
+    # Test local logging
+    sample_config = {
+        "model_name": "HuggingFaceTB/SmolLM3-3B",
+        "batch_size": 4,
+        "learning_rate": 2e-5,
+        "local_test": True
+    }
+    print("📝 Logging configuration locally...")
+    monitor.log_config(sample_config)
+    # Test local metrics
+    print("📊 Logging sample metrics locally...")
+    for step in range(0, 50, 10):
+        metrics = {
+            "loss": 1.8 - (step * 0.02),
+            "accuracy": 0.6 + (step * 0.005),
+            "step": step
+        }
+        monitor.log_metrics(metrics, step=step)
+        print(f"   Step {step}: loss={metrics['loss']:.3f}, accuracy={metrics['accuracy']:.3f}")
+    print("✅ Local monitoring test completed!")
+    return True
+def main():
+    """Main function"""
+    print("Trackio Integration Quick Start")
+    print("=" * 40)
+    # Test local monitoring first
+    test_local_monitoring()
+    print()
+    # Test Trackio integration if available
+    try:
+        test_trackio_integration()
+    except Exception as e:
+        print(f"❌ Trackio integration test failed: {e}")
+        print("💡 Make sure you have a valid Trackio Space URL")
+    print("\n🎉 Quick start completed!")
+    print("\nNext steps:")
+    print("1. Deploy Trackio to Hugging Face Spaces (see DEPLOYMENT_GUIDE.md)")
+    print("2. Update your training script with Trackio integration")
+    print("3. Run your first monitored training session")
+if __name__ == "__main__":
+    main()

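The sample evaluation metrics in the test are internally consistent: for a causal language model, perplexity is the exponential of the mean cross-entropy loss (measured in nats), so an `eval_loss` of 1.2 corresponds to a perplexity of about 3.3. A short sketch, with an illustrative helper name:

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Perplexity is exp(mean token-level cross-entropy in nats)."""
    return math.exp(mean_cross_entropy)


if __name__ == "__main__":
    print(round(perplexity_from_loss(1.2), 2))  # 3.32
```
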
train.py CHANGED

@@ -76,6 +76,16 @@ def parse_args():
     parser.add_argument('--logging_steps', type=int, default=10,
                        help='Log every N steps')
+    # Trackio monitoring arguments
+    parser.add_argument('--enable_tracking', action=argparse.BooleanOptionalAction, default=True,
+                       help='Enable Trackio experiment tracking (use --no-enable_tracking to disable; Python 3.9+)')
+    parser.add_argument('--trackio_url', type=str, default=None,
+                       help='Trackio server URL')
+    parser.add_argument('--trackio_token', type=str, default=None,
+                       help='Trackio authentication token')
+    parser.add_argument('--experiment_name', type=str, default=None,
+                       help='Custom experiment name for tracking')
     return parser.parse_args()
 def main():
@@ -99,14 +109,22 @@ def main():
     if args.gradient_accumulation_steps is not None:
         config.gradient_accumulation_steps = args.gradient_accumulation_steps
+    # Override Trackio configuration
+    if args.enable_tracking is not None:
+        config.enable_tracking = args.enable_tracking
+    if args.trackio_url is not None:
+        config.trackio_url = args.trackio_url
+    if args.trackio_token is not None:
+        config.trackio_token = args.trackio_token
+    if args.experiment_name is not None:
+        config.experiment_name = args.experiment_name
     # Setup paths
-    dataset_path = os.path.join('/input', args.dataset_dir)
     output_path = args.out_dir
     # Ensure output directory exists
     os.makedirs(output_path, exist_ok=True)
-    logger.info(f"Dataset path: {dataset_path}")
     logger.info(f"Output path: {output_path}")
     # Initialize model
@@ -116,11 +134,23 @@ def main():
         config=config
     )
-    # Load dataset
+    # Determine dataset path
+    if hasattr(config, 'dataset_name') and config.dataset_name:
+        # Use Hugging Face dataset
+        dataset_path = config.dataset_name
+        logger.info(f"Using Hugging Face dataset: {dataset_path}")
+    else:
+        # Use local dataset
+        dataset_path = os.path.join('/input', args.dataset_dir)
+        logger.info(f"Using local dataset: {dataset_path}")
+    # Load dataset with filtering options
     dataset = SmolLM3Dataset(
         data_path=dataset_path,
         tokenizer=model.tokenizer,
-        max_seq_length=args.max_seq_length
+        max_seq_length=args.max_seq_length,
+        filter_bad_entries=getattr(config, 'filter_bad_entries', False),
+        bad_entry_field=getattr(config, 'bad_entry_field', 'bad_entry')
     )
     # Initialize trainer

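A boolean command-line flag that defaults to on needs an explicit way to switch it off; combining `action='store_true'` with `default=True` does not provide one, because passing the flag and omitting it both yield `True`. One option, sketched below under the assumption that Python 3.9 or later is available, is `argparse.BooleanOptionalAction`, which registers a matching `--no-...` option automatically. The `build_parser` helper is illustrative only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with an on-by-default flag that can still be disabled."""
    parser = argparse.ArgumentParser(description="Tracking flag demo")
    parser.add_argument(
        "--enable_tracking",
        action=argparse.BooleanOptionalAction,  # also adds --no-enable_tracking
        default=True,
        help="Enable Trackio experiment tracking",
    )
    return parser


if __name__ == "__main__":
    parser = build_parser()
    print(parser.parse_args([]).enable_tracking)                        # True
    print(parser.parse_args(["--no-enable_tracking"]).enable_tracking)  # False
```
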
trainer.py CHANGED

@@ -11,6 +11,9 @@ from transformers import Trainer, TrainingArguments
 from trl import SFTTrainer
 import json
+# Import monitoring
+from monitoring import create_monitor_from_config
 logger = logging.getLogger(__name__)
 class SmolLM3Trainer:
@@ -32,6 +35,9 @@ class SmolLM3Trainer:
         self.init_from = init_from
         self.use_sft_trainer = use_sft_trainer
+        # Initialize monitoring
+        self.monitor = create_monitor_from_config(config)
         # Setup trainer
         self.trainer = self._setup_trainer()
@@ -55,6 +61,13 @@ class SmolLM3Trainer:
         # Get data collator
         data_collator = self.dataset.get_data_collator()
+        # Add monitoring callback
+        callbacks = []
+        if self.monitor and self.monitor.enable_tracking:
+            trackio_callback = self.monitor.create_monitoring_callback()
+            if trackio_callback:
+                callbacks.append(trackio_callback)
         if self.use_sft_trainer:
             # Use SFTTrainer for supervised fine-tuning
             trainer = SFTTrainer(
@@ -67,6 +80,7 @@ class SmolLM3Trainer:
                 dataset_text_field="text",
                 max_seq_length=self.config.max_seq_length,
                 packing=False,  # Disable packing for better control
+                callbacks=callbacks,
             )
         else:
             # Use standard Trainer
@@ -77,6 +91,7 @@ class SmolLM3Trainer:
                 train_dataset=train_dataset,
                 eval_dataset=eval_dataset,
                 data_collator=data_collator,
+                callbacks=callbacks,
             )
         return trainer
@@ -103,6 +118,17 @@ class SmolLM3Trainer:
         """Start training"""
         logger.info("Starting training")
+        # Log configuration to Trackio
+        if self.monitor and self.monitor.enable_tracking:
+            config_dict = {k: v for k, v in self.config.__dict__.items()
+                          if not k.startswith('_')}
+            self.monitor.log_config(config_dict)
+            # Log experiment URL
+            experiment_url = self.monitor.get_experiment_url()
+            if experiment_url:
+                logger.info(f"Trackio experiment URL: {experiment_url}")
         # Load checkpoint if resuming
         if self.init_from == "resume":
             checkpoint_path = "/input-checkpoint"
@@ -122,11 +148,26 @@ class SmolLM3Trainer:
             with open(os.path.join(self.output_dir, "train_results.json"), "w") as f:
                 json.dump(train_result.metrics, f, indent=2)
+            # Log training summary to Trackio
+            if self.monitor and self.monitor.enable_tracking:
+                summary = {
+                    'final_loss': train_result.metrics.get('train_loss', 0),
+                    'total_steps': train_result.global_step,  # step count, not runtime
+                    'training_time': train_result.metrics.get('train_runtime', 0),
+                    'output_dir': self.output_dir,
+                    'model_name': getattr(self.config, 'model_name', 'unknown'),
+                }
+                self.monitor.log_training_summary(summary)
+                self.monitor.close()
             logger.info("Training completed successfully!")
             logger.info(f"Training metrics: {train_result.metrics}")
         except Exception as e:
             logger.error(f"Training failed: {e}")
+            # Close monitoring on error
+            if self.monitor and self.monitor.enable_tracking:
+                self.monitor.close()
             raise
     def evaluate(self):
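
`Trainer.train()` in `transformers` returns a `TrainOutput` named tuple with `global_step`, `training_loss` and `metrics` fields. The number of optimizer steps lives in `global_step`, while `metrics['train_runtime']` is wall-clock time in seconds, so the two should not be used interchangeably. A sketch of building the summary dictionary from those fields follows. To stay runnable without `transformers` installed, it uses a stand-in named tuple with the same field names; `build_training_summary` is a hypothetical helper, not part of this commit.

```python
from typing import Any, Dict, NamedTuple


class TrainOutput(NamedTuple):
    """Stand-in mirroring the fields of transformers' TrainOutput."""
    global_step: int
    training_loss: float
    metrics: Dict[str, float]


def build_training_summary(
    train_result: TrainOutput, output_dir: str, model_name: str = "unknown"
) -> Dict[str, Any]:
    """Collect the values worth logging once training has finished."""
    metrics = train_result.metrics
    return {
        # Fall back to the tuple's own loss if the metrics dict lacks it.
        "final_loss": metrics.get("train_loss", train_result.training_loss),
        "total_steps": train_result.global_step,             # optimizer steps
        "training_time": metrics.get("train_runtime", 0.0),  # seconds
        "output_dir": output_dir,
        "model_name": model_name,
    }
```
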