diff --git a/README.md b/README.md
index 2622ec0d2fe3242a9501418a37b5afc59a2139f7..016b82856dd7f16ce827c5588d660de7bbe83869 100644
--- a/README.md
+++ b/README.md
@@ -1,399 +1,381 @@
-# SmolLM3 Fine-tuning
+# 🤏🏻🏭SmolFactory
-This repository provides a complete setup for fine-tuning SmolLM3 models using the FlexAI console, following the nanoGPT structure but adapted for modern transformer models.
+A comprehensive end-to-end fine-tuning pipeline for SmolLM3 models with custom monitoring, Hugging Face integration, and interactive configuration management.
-## Overview
+## 🤖 Automatically Push Models, Spaces, Datasets & Monitoring
-SmolLM3 is a 3B-parameter transformer decoder model optimized for efficiency, long-context reasoning, and multilingual support. This setup allows you to fine-tune SmolLM3 for various tasks including:
+- **Trackio Monitoring Space**: Real-time training metrics, loss curves, and resource utilization
+- **Demo Spaces**: Instant web interfaces for model testing and demonstration
+- **Automatic Deployment**: Spaces created and configured automatically during the pipeline
-- **Supervised Fine-tuning (SFT)**: Adapt the model for instruction following
-- **Direct Preference Optimization (DPO)**: Improve model alignment
-- **Long-context fine-tuning**: Support for up to 128k tokens
-- **Tool calling**: Fine-tune for function calling capabilities
-- **Model Quantization**: Create int8 (GPU) and int4 (CPU) quantized versions
+### 📈 **Custom Trackio Monitoring**
-## Quick Start
+- **Real-time Metrics**: Live training loss, learning rate, gradient norms, and GPU utilization
+- **Custom Dashboards**: Tailored visualizations for SmolLM3 fine-tuning
+- **Artifact Logging**: Model checkpoints, configuration files, and training logs
+- **Experiment Comparison**: Side-by-side analysis of different training runs
+- **Alert System**: Notifications for training issues or completion
+- **Integration**: Seamless connection with HF Spaces for public monitoring
+
+### 📊 **HF Dataset Integration**
+
+- **Experiment Tracking**: All training data, metrics, and artifacts stored in HF Datasets
+- **Reproducibility**: Complete experiment history with configuration snapshots
+- **Collaboration**: Easy sharing of training results and model comparisons
+- **Version Control**: Track dataset changes and model performance over time
-### 1. Repository Setup
+## 🚀 Quick Start
-The repository follows the FlexAI console structure with the following key files:
+### Interactive Pipeline (Recommended)
-- `train.py`: Main entry point script
-- `config/train_smollm3.py`: Default configuration
-- `model.py`: Model wrapper and loading
-- `data.py`: Dataset handling and preprocessing
-- `trainer.py`: Training loop and trainer setup
-- `requirements.txt`: Dependencies
+The easiest way to get started is using the interactive pipeline:
-### 2. FlexAI Console Configuration
-
-When setting up a Fine Tuning Job in the FlexAI console, use these settings:
-
-#### Basic Configuration
-- **Name**: `smollm3-finetune`
-- **Cluster**: Your organization's designated cluster
-- **Checkpoint**: (Optional) Previous training job checkpoint
-- **Node Count**: 1
-- **Accelerator Count**: 1-8 (depending on your needs)
-
-#### Repository Settings
-- **Repository URL**: `https://github.com/your-username/flexai-finetune`
-- **Repository Revision**: `main`
-
-#### Dataset Configuration
-- **Datasets**: Your dataset (mounted under `/input`)
-- **Mount Directory**: `my_dataset`
-
-#### Entry Point
-```
-train.py config/train_smollm3.py --dataset_dir=my_dataset --init_from=resume --out_dir=/input-checkpoint --max_iters=1500
+```bash
+./launch.sh
```
-### 3. Dataset Format
-
-The script supports multiple dataset formats:
+This script will:
+1. **Authenticate** with Hugging Face (write + read tokens)
+2. **Configure** training parameters interactively
+3. **Deploy** Trackio Space for monitoring
+4. **Setup** HF Dataset for experiment tracking
+5. **Execute** training with your chosen configuration
+6. **Push** model to HF Hub with comprehensive documentation
+7. **Deploy** demo space for testing (optional)
-#### Chat Format (Recommended)
-```json
-[
- {
- "messages": [
- {"role": "user", "content": "What is machine learning?"},
- {"role": "assistant", "content": "Machine learning is a subset of AI..."}
- ]
- }
-]
-```
+### Manual Setup
-#### Instruction Format
-```json
-[
- {
- "instruction": "What is machine learning?",
- "output": "Machine learning is a subset of AI..."
- }
-]
-```
+For advanced users who want to customize the pipeline:
-#### User-Assistant Format
-```json
-[
- {
- "user": "What is machine learning?",
- "assistant": "Machine learning is a subset of AI..."
- }
-]
+```bash
+# 1. Install dependencies
+pip install -r requirements/requirements_core.txt
+
+# 2. Configure your training
+python scripts/training/train.py \
+ --config config/train_smollm3_h100_lightweight.py \
+ --experiment-name "my-experiment" \
+ --output-dir ./outputs \
+ --trackio-url "https://huggingface.co/spaces/username/trackio-monitoring"
+
+# 3. Push model to HF Hub
+python scripts/model_tonic/push_to_huggingface.py \
+ ./outputs username/model-name \
+ --token YOUR_HF_TOKEN
```
-### 4. Configuration Options
-The default configuration in `config/train_smollm3.py` includes:
-
-```python
-@dataclass
-class SmolLM3Config:
- # Model configuration
- model_name: str = "HuggingFaceTB/SmolLM3-3B"
- max_seq_length: int = 4096
- use_flash_attention: bool = True
-
- # Training configuration
- batch_size: int = 4
- gradient_accumulation_steps: int = 4
- learning_rate: float = 2e-5
- max_iters: int = 1000
-
- # Mixed precision
- fp16: bool = True
- bf16: bool = False
+## 🏗️ Repository Architecture
+
+```mermaid
+graph LR
+ Entry_Point["Entry Point"]
+ Configuration_Management["Configuration Management"]
+ Data_Pipeline["Data Pipeline"]
+ Model_Abstraction["Model Abstraction"]
+ Training_Orchestrator["Training Orchestrator"]
+ Entry_Point -- "Initializes and Uses" --> Configuration_Management
+ Entry_Point -- "Initializes" --> Data_Pipeline
+ Entry_Point -- "Initializes" --> Model_Abstraction
+ Entry_Point -- "Initializes and Invokes" --> Training_Orchestrator
+ Configuration_Management -- "Provides Configuration To" --> Model_Abstraction
+ Configuration_Management -- "Provides Configuration To" --> Data_Pipeline
+ Configuration_Management -- "Provides Configuration To" --> Training_Orchestrator
+ Data_Pipeline -- "Provides Data To" --> Training_Orchestrator
+ Model_Abstraction -- "Provides Model To" --> Training_Orchestrator
+ click Entry_Point href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Entry_Point.md" "Details"
+ click Configuration_Management href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Configuration_Management.md" "Details"
+    click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"
+    click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
+    click Training_Orchestrator href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Training_Orchestrator.md" "Details"
```
-### 5. Command Line Arguments
-The `train.py` script accepts various arguments:
+## 🔧 Core Components
-```bash
-# Basic usage
-python train.py config/train_smollm3.py
-
-# With custom parameters
-python train.py config/train_smollm3.py \
- --dataset_dir=my_dataset \
- --out_dir=/output-checkpoint \
- --init_from=resume \
- --max_iters=1500 \
- --batch_size=8 \
- --learning_rate=1e-5 \
- --max_seq_length=8192
-```
+### Configuration System (`config/`)
-## Advanced Usage
-
-### 1. Custom Configuration
-
-Create a custom configuration file:
+All training configurations inherit from `SmolLM3Config`:
```python
# config/my_config.py
from config.train_smollm3 import SmolLM3Config
config = SmolLM3Config(
- model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
+ model_name="HuggingFaceTB/SmolLM3-3B",
max_seq_length=8192,
- batch_size=2,
- learning_rate=1e-5,
- max_iters=2000,
- use_flash_attention=True,
- fp16=True
+ batch_size=8,
+ learning_rate=5e-6,
+ trainer_type="sft", # or "dpo"
+ enable_tracking=True,
+ trackio_url="https://huggingface.co/spaces/username/trackio-monitoring"
)
```
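+
+You can then point the training script at the new file, mirroring the commands above:
+
+```bash
+python scripts/training/train.py --config config/my_config.py
+```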
-### 2. Long-Context Fine-tuning
+### Dataset Processing (`src/data.py`)
-For long-context tasks (up to 128k tokens):
+The `SmolLM3Dataset` class handles multiple dataset formats:
```python
-config = SmolLM3Config(
- max_seq_length=131072, # 128k tokens
- model_name="HuggingFaceTB/SmolLM3-3B",
- use_flash_attention=True,
- gradient_checkpointing=True
+from src.data import SmolLM3Dataset
+
+# Supports multiple formats:
+# 1. Chat format (recommended)
+# 2. Instruction format
+# 3. User-Assistant format
+# 4. Hugging Face datasets
+
+dataset = SmolLM3Dataset(
+ data_path="my_dataset",
+ tokenizer=tokenizer,
+ max_seq_length=4096,
+ use_chat_template=True,
+ sample_size=80000 # For lightweight training
)
```
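+
+For reference, the chat format is a list of conversations, each holding a `messages` array of role/content turns:
+
+```json
+[
+  {
+    "messages": [
+      {"role": "user", "content": "What is machine learning?"},
+      {"role": "assistant", "content": "Machine learning is a subset of AI..."}
+    ]
+  }
+]
+```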
-### 3. DPO Training
+### Training Orchestration (`src/train.py`)
-For preference optimization, use the DPO trainer:
+The main training script coordinates all components:
```python
-from trainer import SmolLM3DPOTrainer
+from src.train import main
+from src.model import SmolLM3Model
+from src.trainer import SmolLM3Trainer, SmolLM3DPOTrainer
-dpo_trainer = SmolLM3DPOTrainer(
+# SFT Training
+trainer = SmolLM3Trainer(
model=model,
dataset=dataset,
config=config,
- output_dir="./dpo-output"
+ output_dir="./outputs"
 )
+trainer.train()
-dpo_trainer.train()
-```
-
-### 4. Tool Calling Fine-tuning
-
-Include tool calling examples in your dataset:
-
-```json
-[
- {
- "messages": [
- {"role": "user", "content": "What's the weather in New York?"},
- {"role": "assistant", "content": "\n\nNew York\n\n"},
- {"role": "tool", "content": "The weather in New York is 72°F and sunny."},
- {"role": "assistant", "content": "The weather in New York is currently 72°F and sunny."}
- ]
- }
-]
+# DPO Training
+dpo_trainer = SmolLM3DPOTrainer(
+ model=model,
+ dataset=dataset,
+ config=config,
+ output_dir="./dpo-outputs"
+)
+dpo_trainer.train()
```
-## Model Variants
+## 🎯 Training Types
-SmolLM3 comes in several variants:
+### Supervised Fine-tuning (SFT)
-- **SmolLM3-3B-Base**: Base model for general fine-tuning
-- **SmolLM3-3B**: Instruction-tuned model
-- **SmolLM3-3B-Instruct**: Enhanced instruction model
-- **Quantized versions**: Available for deployment
+Standard instruction tuning for improving model capabilities:
-## Hardware Requirements
-
-### Minimum Requirements
-- **GPU**: 16GB+ VRAM (for 3B model)
-- **RAM**: 32GB+ system memory
-- **Storage**: 50GB+ free space
-
-### Recommended
-- **GPU**: A100/H100 or similar
-- **RAM**: 64GB+ system memory
-- **Storage**: 100GB+ SSD
+```bash
+python scripts/training/train.py \
+ --config config/train_smollm3.py \
+ --trainer-type sft \
+ --experiment-name "sft-experiment"
+```
-## Troubleshooting
+### Direct Preference Optimization (DPO)
-### Common Issues
+Preference-based training for alignment:
-1. **Out of Memory (OOM)**
- - Reduce `batch_size`
- - Increase `gradient_accumulation_steps`
- - Enable `gradient_checkpointing`
- - Use `fp16` or `bf16`
-
-2. **Slow Training**
- - Enable `flash_attention`
- - Use mixed precision (`fp16`/`bf16`)
- - Increase `dataloader_num_workers`
+```bash
+python scripts/training/train.py \
+ --config config/train_smollm3_dpo.py \
+ --trainer-type dpo \
+ --experiment-name "dpo-experiment"
+```
-3. **Dataset Loading Issues**
- - Check dataset format
- - Ensure proper JSON structure
- - Verify file permissions
+## 📊 Monitoring & Tracking
-### Debug Mode
+### Trackio Integration
-Enable debug logging:
+The pipeline includes comprehensive monitoring:
```python
-import logging
-logging.basicConfig(level=logging.DEBUG)
+from src.monitoring import create_monitor_from_config
+
+monitor = create_monitor_from_config(config)
+monitor.log_metrics({
+ "train_loss": loss,
+ "learning_rate": lr,
+ "gradient_norm": grad_norm
+})
```
-## Evaluation
+### HF Dataset Integration
-After training, evaluate your model:
+Experiment data is automatically saved to HF Datasets:
```python
-from transformers import pipeline
-
-pipe = pipeline(
- task="text-generation",
- model="./output-checkpoint",
- device=0,
- max_new_tokens=256,
- do_sample=True,
- temperature=0.7
-)
-
-# Test the model
-messages = [{"role": "user", "content": "Explain gravity in simple terms."}]
-outputs = pipe(messages)
-print(outputs[0]["generated_text"][-1]["content"])
-```
-
-## Model Quantization
-
-The pipeline includes built-in quantization support using torchao for creating optimized model versions with a unified repository structure:
-
-### Repository Structure
-
-All models (main and quantized) are stored in a single repository:
-
-```
-your-username/model-name/
-├── README.md (unified model card)
-├── config.json
-├── pytorch_model.bin
-├── tokenizer.json
-├── int8/ (quantized model for GPU)
-└── int4/ (quantized model for CPU)
+# Automatically configured in launch.sh
+dataset_repo = "username/trackio-experiments"
```
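+
+To pull logged experiments back for analysis, the repository can be loaded like any other dataset (a sketch; the split name may differ in your setup):
+
+```python
+from datasets import load_dataset
+
+# Load the experiment-tracking dataset configured above
+experiments = load_dataset("username/trackio-experiments", split="train")
+print(experiments.column_names)
+```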
-### Quantization Types
+## 🔄 Model Management
-- **int8_weight_only**: GPU optimized, ~50% memory reduction
-- **int4_weight_only**: CPU optimized, ~75% memory reduction
-
-### Automatic Quantization
-
-When using the interactive pipeline (`launch.sh`), you'll be prompted to create quantized versions after training:
+### Pushing to HF Hub
```bash
-./launch.sh
-# ... training completes ...
-# Choose quantization options when prompted
+python scripts/model_tonic/push_to_huggingface.py \
+ ./outputs username/model-name \
+ --token YOUR_HF_TOKEN \
+ --trackio-url "https://huggingface.co/spaces/username/trackio-monitoring" \
+ --experiment-name "my-experiment"
```
-### Standalone Quantization
+### Model Quantization
-Quantize existing models independently:
+Create optimized versions for deployment:
```bash
-# Quantize and push to HF Hub (same repository)
-python scripts/model_tonic/quantize_standalone.py /path/to/model your-username/model-name \
+# Quantize and push to HF Hub
+python scripts/model_tonic/quantize_standalone.py \
+ ./outputs username/model-name \
--quant-type int8_weight_only \
--token YOUR_HF_TOKEN
-# Quantize and save locally
-python scripts/model_tonic/quantize_standalone.py /path/to/model your-username/model-name \
+# Quantize for CPU deployment
+python scripts/model_tonic/quantize_standalone.py \
+ ./outputs username/model-name \
--quant-type int4_weight_only \
--device cpu \
--save-only
```
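+
+Once pushed, the quantized variants live in `int8/` and `int4/` subfolders of the same repository. A sketch of loading one back (assumes that layout and may require `torchao` installed at load time):
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Load the GPU-optimized int8 variant from its subfolder
+model = AutoModelForCausalLM.from_pretrained(
+    "username/model-name",
+    subfolder="int8",
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+)
+tokenizer = AutoTokenizer.from_pretrained("username/model-name")
+```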
-### Loading Quantized Models
+## 🛠️ Customization Guide
+
+### Adding New Training Configurations
+
+1. Create a new config file in `config/`:
```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load main model
-model = AutoModelForCausalLM.from_pretrained(
- "your-username/model-name",
- device_map="auto",
- torch_dtype=torch.bfloat16
-)
-tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
+# config/train_smollm3_custom.py
+from config.train_smollm3 import SmolLM3Config
-# Load int8 quantized model (GPU)
-model = AutoModelForCausalLM.from_pretrained(
- "your-username/model-name/int8",
- device_map="auto",
- torch_dtype=torch.bfloat16
+config = SmolLM3Config(
+ model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
+ max_seq_length=16384,
+ batch_size=4,
+ learning_rate=1e-5,
+ max_iters=2000,
+ trainer_type="sft"
)
-tokenizer = AutoTokenizer.from_pretrained("your-username/model-name/int8")
+```
-# Load int4 quantized model (CPU)
-model = AutoModelForCausalLM.from_pretrained(
- "your-username/model-name/int4",
- device_map="cpu",
- torch_dtype=torch.bfloat16
-)
-tokenizer = AutoTokenizer.from_pretrained("your-username/model-name/int4")
+2. Add to the training script mapping in `scripts/training/train.py`:
+
+```python
+config_map = {
+ # ... existing configs ...
+ "config/train_smollm3_custom.py": get_custom_config,
+}
```
-For detailed quantization documentation, see [QUANTIZATION_GUIDE.md](docs/QUANTIZATION_GUIDE.md).
+### Custom Dataset Formats
-### Unified Model Cards
+Extend `src/data.py` to support new formats:
-The system generates comprehensive model cards that include information about all model variants:
+```python
+def _load_custom_format(self, data_path: str) -> Dataset:
+ """Load custom dataset format"""
+ # Your custom loading logic here
+ pass
+```
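+
+As a concrete sketch, a JSON-lines loader might look like the following (the `prompt`/`completion` field names are illustrative, not part of the existing API):
+
+```python
+import json
+from datasets import Dataset
+
+def _load_custom_format(self, data_path: str) -> Dataset:
+    """Load a JSON-lines file and map custom fields onto the expected keys."""
+    records = []
+    with open(data_path, "r", encoding="utf-8") as f:
+        for line in f:
+            row = json.loads(line)
+            records.append({"prompt": row["prompt"], "completion": row["completion"]})
+    return Dataset.from_list(records)
+```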
+
+### Custom Training Loops
+
+Extend `src/trainer.py` for specialized training:
+
+```python
+class SmolLM3CustomTrainer(SmolLM3Trainer):
+ def training_step(self, batch):
+ # Custom training logic
+ pass
+```
-- **Single README**: One comprehensive model card for the entire repository
-- **Conditional Sections**: Quantized model information appears when available
-- **Usage Examples**: Complete examples for all model variants
-- **Performance Information**: Memory and speed benefits for each quantization type
+## 🔧 Development & Contributing
-For detailed information about the unified model card system, see [UNIFIED_MODEL_CARD_GUIDE.md](docs/UNIFIED_MODEL_CARD_GUIDE.md).
+### Project Structure
-## Deployment
+- **`src/`**: Core training modules
+- **`config/`**: Training configurations
+- **`scripts/`**: Utility scripts and automation
+- **`docs/`**: Comprehensive documentation
+- **`tests/`**: Test files and debugging tools
+
+### Adding New Features
+
+1. **Configuration**: Add to `config/` directory
+2. **Core Logic**: Extend modules in `src/`
+3. **Scripts**: Add utility scripts to `scripts/`
+4. **Documentation**: Update relevant docs in `docs/`
+5. **Tests**: Add test files to `tests/`
+
+### Testing Your Changes
-### Using vLLM
```bash
-vllm serve ./output-checkpoint --enable-auto-tool-choice
+# Run basic tests
+python tests/test_config.py
+python tests/test_dataset.py
+python tests/test_training.py
+
+# Test specific components
+python tests/test_monitoring.py
+python tests/test_model_push.py
```
-### Using llama.cpp
-```bash
-# Convert to GGUF format
-python -m llama_cpp.convert_model ./output-checkpoint --outfile model.gguf
+### Code Style
+
+- Follow PEP 8 for Python code
+- Use type hints for all functions
+- Add comprehensive docstrings
+- Include error handling for external APIs
+- Use structured logging with consistent field names
+
+## 🚨 Troubleshooting
+
+### Common Issues
+
+1. **Out of Memory (OOM)**
+   ```python
+ # Reduce batch size in config
+ batch_size=2 # instead of 8
+ gradient_accumulation_steps=16 # increase to compensate
+ ```
+
+2. **Token Validation Errors**
+ ```bash
+ # Validate your HF token
+ python scripts/validate_hf_token.py YOUR_TOKEN
+ ```
+
+3. **Dataset Loading Issues**
+ ```bash
+ # Check dataset format
+ python tests/test_dataset_loading.py
+ ```
+
+### Debug Mode
+
+Enable detailed logging:
+
+```python
+import logging
+logging.basicConfig(level=logging.DEBUG)
```
-## Resources
+## 🤝 Contributing
-- [SmolLM3 Blog Post](https://huggingface.co/blog/smollm3)
-- [Model Repository](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
-- [GitHub Repository](https://github.com/huggingface/smollm)
-- [SmolTalk Dataset](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes following the code style
+4. Add tests for new functionality
+5. Update documentation
+6. Submit a pull request
-## License
+## 📄 License
-This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.
+This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.
+## 🔗 Resources
-{
- "id": "exp_20250718_195852",
- "name": "petit-elle-l-aime-3",
- "description": "SmolLM3 fine-tuning experiment",
- "created_at": "2025-07-18T19:58:52.689087",
- "status": "running",
- "metrics": [],
- "parameters": {},
- "artifacts": [],
- "logs": []
-}
\ No newline at end of file
+- [SmolLM3 Blog Post](https://huggingface.co/blog/smollm3)
+- [Model Repository](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
+- [GitHub Repository](https://github.com/huggingface/smollm)
+- [SmolTalk Dataset](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
\ No newline at end of file
diff --git a/docs/A100_LARGE_SCALE_GUIDE.md b/docs/A100_LARGE_SCALE_GUIDE.md
deleted file mode 100644
index 508484a2182d1b43ff50851cbdb954817e51e015..0000000000000000000000000000000000000000
--- a/docs/A100_LARGE_SCALE_GUIDE.md
+++ /dev/null
@@ -1,195 +0,0 @@
-# A100 Large Scale Training Guide
-
-This guide provides configurations and instructions for running fully-fledged experiments with multiple passes on the full OpenHermes-FR dataset (800k+ datapoints) using A100 GPUs.
-
-## Available Configurations
-
-### 1. A100 Large Batch Configuration
-**File**: `config/train_smollm3_openhermes_fr_a100_large.py`
-
-**Key Features**:
-- **Effective Batch Size**: 128 (8 × 16 gradient accumulation)
-- **Training Duration**: ~1.3 passes (8,000 steps)
-- **Learning Rate**: 5e-6 (optimized for large batches)
-- **Mixed Precision**: bf16 (A100 optimized)
-- **Sequence Length**: 8192 tokens
-- **Memory Optimizations**: No gradient checkpointing for A100 efficiency
-
-**Estimated Training Time**: ~6-8 hours on A100
-
-### 2. Multiple Passes Configuration
-**File**: `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`
-
-**Key Features**:
-- **Effective Batch Size**: 120 (6 × 20 gradient accumulation)
-- **Training Duration**: ~4 passes (25,000 steps)
-- **Learning Rate**: 3e-6 (conservative for long training)
-- **Warmup Steps**: 2000 (longer warmup for stability)
-- **Checkpoint Strategy**: More frequent saves (every 2000 steps)
-
-**Estimated Training Time**: ~20-24 hours on A100
-
-## Training Commands
-
-### Quick Start - Large Batch Experiment
-```bash
-python run_a100_large_experiment.py \
- --config config/train_smollm3_openhermes_fr_a100_large.py \
- --experiment-name "smollm3_openhermes_fr_large_batch" \
- --output-dir ./outputs/large_batch
-```
-
-### Multiple Passes Experiment
-```bash
-python run_a100_large_experiment.py \
- --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
- --experiment-name "smollm3_openhermes_fr_multiple_passes" \
- --output-dir ./outputs/multiple_passes
-```
-
-### Dry Run (Check Configuration)
-```bash
-python run_a100_large_experiment.py \
- --config config/train_smollm3_openhermes_fr_a100_large.py \
- --dry-run
-```
-
-### Resume Training
-```bash
-python run_a100_large_experiment.py \
- --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
- --resume ./outputs/multiple_passes/checkpoint-10000 \
- --output-dir ./outputs/multiple_passes
-```
-
-## Configuration Details
-
-### Memory Usage Optimization
-- **Gradient Checkpointing**: Disabled for A100 efficiency
-- **Flash Attention**: Enabled for memory efficiency
-- **bf16 Mixed Precision**: Better for A100 than fp16
-- **Gradient Clipping**: 1.0 for stability
-- **Group by Length**: Enabled for better batching
-
-### Data Loading Optimization
-- **Num Workers**: 8 for faster data loading
-- **Pin Memory**: Enabled for GPU transfer efficiency
-- **Prefetch Factor**: 2 for pipeline optimization
-
-### Training Stability
-- **Conservative Learning Rate**: Lower LR for large effective batch sizes
-- **Longer Warmup**: More warmup steps for stability
-- **Higher Beta2**: 0.999 for AdamW stability
-- **Gradient Clipping**: Prevents gradient explosion
-
-## Expected Results
-
-### Large Batch Configuration (1.3 passes)
-- **Training Steps**: 8,000
-- **Effective Batch Size**: 128
-- **Steps per Epoch**: ~6,250
-- **Epochs**: ~1.3
-- **Expected Loss**: Should converge to ~1.5-2.0
-
-### Multiple Passes Configuration (4 passes)
-- **Training Steps**: 25,000
-- **Effective Batch Size**: 120
-- **Steps per Epoch**: ~6,667
-- **Epochs**: ~3.75
-- **Expected Loss**: Should converge to ~1.2-1.5
-
-## Monitoring and Logging
-
-### Trackio Integration
-Both configurations include Trackio monitoring:
-- **Metrics Logging**: Every 25-50 steps
-- **Artifact Logging**: Model checkpoints
-- **Config Logging**: Training configuration
-
-### Checkpoint Strategy
-- **Large Batch**: Save every 1000 steps (8 checkpoints)
-- **Multiple Passes**: Save every 2000 steps (12 checkpoints)
-- **Best Model**: Automatically load best model at end
-
-## Hardware Requirements
-
-### Minimum Requirements
-- **GPU**: A100 80GB (or multiple A100s)
-- **RAM**: 64GB+ system RAM
-- **Storage**: 100GB+ for checkpoints and logs
-- **Network**: Fast internet for dataset download
-
-### Recommended Setup
-- **GPU**: 2-4x A100 80GB
-- **RAM**: 128GB+ system RAM
-- **Storage**: 500GB+ NVMe SSD
-- **Network**: 10Gbps+ connection
-
-## Troubleshooting
-
-### Out of Memory (OOM)
-If you encounter OOM errors:
-1. Reduce `batch_size` from 8 to 6 or 4
-2. Increase `gradient_accumulation_steps` to maintain effective batch size
-3. Reduce `max_seq_length` from 8192 to 4096
-
-### Slow Training
-If training is too slow:
-1. Increase `dataloader_num_workers` to 12-16
-2. Ensure you're using bf16 mixed precision
-3. Check that gradient checkpointing is disabled
-4. Verify flash attention is enabled
-
-### Convergence Issues
-If loss doesn't converge:
-1. Reduce learning rate by 2x
-2. Increase warmup steps
-3. Check gradient norms in logs
-4. Verify dataset quality
-
-## Customization
-
-### For Different Dataset Sizes
-Adjust `max_iters` based on your dataset size:
-```python
-# For 1M datapoints with effective batch size 120
-steps_per_epoch = 1000000 // 120 # ~8,333 steps
-max_iters = steps_per_epoch * desired_epochs
-```
-
-### For Different GPU Memory
-Adjust batch size and gradient accumulation:
-```python
-# For 40GB A100
-batch_size = 4
-gradient_accumulation_steps = 32 # Effective batch size = 128
-
-# For 24GB GPU
-batch_size = 2
-gradient_accumulation_steps = 64 # Effective batch size = 128
-```
-
-## Performance Tips
-
-1. **Use bf16**: Better than fp16 for A100
-2. **Disable Gradient Checkpointing**: A100 has enough memory
-3. **Use Flash Attention**: Memory efficient attention
-4. **Group by Length**: Better batching efficiency
-5. **Pin Memory**: Faster GPU transfers
-6. **Multiple Workers**: Faster data loading
-
-## Expected Timeline
-
-- **Large Batch**: 6-8 hours for 1.3 passes
-- **Multiple Passes**: 20-24 hours for 4 passes
-- **Full Dataset (5+ passes)**: 30+ hours
-
-## Next Steps
-
-After training completes:
-1. Evaluate on validation set
-2. Test generation quality
-3. Push to Hugging Face Hub
-4. Deploy for inference
-
-For deployment instructions, see `DEPLOYMENT_GUIDE.md`.
\ No newline at end of file
diff --git a/docs/APP_CONFIGURATION_GUIDE.md b/docs/APP_CONFIGURATION_GUIDE.md
deleted file mode 100644
index afa15566590f70bc6fa9f061e034c4d3b406975a..0000000000000000000000000000000000000000
--- a/docs/APP_CONFIGURATION_GUIDE.md
+++ /dev/null
@@ -1,234 +0,0 @@
-# ⚙️ App Configuration Guide
-
-## Overview
-
-The Trackio app now includes a **Configuration tab** that allows you to set your Hugging Face token and dataset repository directly through the interface, providing an alternative to environment variables.
-
-## 🚀 New Features
-
-### **Configuration Tab**
-- ✅ **HF Token Input**: Secure password field for your Hugging Face token
-- ✅ **Dataset Repository Input**: Text field for your dataset repository
-- ✅ **Update Configuration**: Apply new settings and reload experiments
-- ✅ **Test Connection**: Verify access to the dataset repository
-- ✅ **Create Dataset**: Create a new dataset repository if it doesn't exist
-
-### **Flexible Configuration**
-- ✅ **Environment Variables**: Still supported as fallback
-- ✅ **Interface Input**: New direct input method
-- ✅ **Dynamic Updates**: Change configuration without restarting
-- ✅ **Validation**: Input validation and error handling
-
-## 📋 Configuration Tab Usage
-
-### **1. Access the Configuration Tab**
-- Open the Trackio app
-- Click on the "⚙️ Configuration" tab
-- You'll see input fields for HF Token and Dataset Repository
-
-### **2. Set Your HF Token**
-```
-Hugging Face Token: hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
-```
-- **Type**: Password field (hidden for security)
-- **Required**: Yes (for dataset access)
-- **Format**: Your HF token starting with `hf_`
-- **Help**: Click the help text for instructions on getting your token
-
-### **3. Set Your Dataset Repository**
-```
-Dataset Repository: your-username/your-dataset-name
-```
-- **Type**: Text field
-- **Required**: No (defaults to `tonic/trackio-experiments`)
-- **Format**: `username/dataset-name`
-- **Examples**:
- - `tonic/trackio-experiments`
- - `your-username/my-experiments`
- - `your-org/team-experiments`
-
-### **4. Use the Action Buttons**
-
-#### **Update Configuration**
-- Applies new settings immediately
-- Reloads experiments with new configuration
-- Shows current status and experiment count
-
-#### **Test Connection**
-- Verifies access to the dataset repository
-- Tests HF token permissions
-- Shows dataset information and experiment count
-
-#### **Create Dataset**
-- Creates a new dataset repository if it doesn't exist
-- Sets up the correct schema for experiments
-- Makes the dataset private by default
-
-## 🔧 Configuration Methods
-
-### **Method 1: Interface Input (New)**
-1. Go to "⚙️ Configuration" tab
-2. Enter your HF token and dataset repository
-3. Click "Update Configuration"
-4. Verify with "Test Connection"
-
-### **Method 2: Environment Variables (Existing)**
-```bash
-# Set environment variables
-export HF_TOKEN=your_hf_token_here
-export TRACKIO_DATASET_REPO=your-username/your-dataset-name
-
-# Or for HF Spaces, add to Space settings
-HF_TOKEN=your_hf_token_here
-TRACKIO_DATASET_REPO=your-username/your-dataset-name
-```
-
-### **Method 3: Hybrid Approach**
-- Set environment variables as defaults
-- Override specific values through the interface
-- Interface values take precedence over environment variables
-
-## 📊 Configuration Priority
-
-The app uses this priority order for configuration:
-
-1. **Interface Input** (highest priority)
-2. **Environment Variables** (fallback)
-3. **Default Values** (lowest priority)
-
-## 🛠️ Getting Your HF Token
-
-### **Step-by-Step Instructions**
-1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
-2. Click "New token"
-3. Give it a name (e.g., "Trackio Access")
-4. Select "Write" permissions
-5. Click "Generate token"
-6. Copy the token (starts with `hf_`)
-7. Paste it in the app's HF Token field
-
-### **Token Permissions**
-- **Read**: Required for loading experiments
-- **Write**: Required for saving experiments
-- **Scope**: Should have access to your dataset repositories
-
-## 📁 Dataset Repository Format
-
-### **Correct Format**
-```
-username/dataset-name
-```
-
-### **Examples**
-- `tonic/trackio-experiments` (default)
-- `your-username/my-experiments`
-- `your-org/team-experiments`
-- `your-username/smollm3-experiments`
-
-### **Validation**
-- Must contain exactly one `/`
-- Username must be valid HF username
-- Dataset name must be valid (alphanumeric + hyphens)
-
-## 🔍 Testing Your Configuration
-
-### **1. Test Connection**
-- Enter your HF token and dataset repository
-- Click "Test Connection"
-- Should show: "✅ Connection successful!"
-
-### **2. Create Dataset (if needed)**
-- If dataset doesn't exist, click "Create Dataset"
-- Should show: "✅ Dataset created successfully!"
-
-### **3. Update Configuration**
-- Click "Update Configuration"
-- Should show: "✅ Configuration updated successfully!"
-
-## 🚨 Troubleshooting
-
-### **Issue: "Please provide a Hugging Face token"**
-**Solution**:
-- Enter your HF token in the interface
-- Or set the `HF_TOKEN` environment variable
-
-### **Issue: "Connection failed: 401 Unauthorized"**
-**Solutions**:
-1. Check your HF token is correct
-2. Verify the token has read access to the dataset
-3. Ensure the dataset repository exists
-
-### **Issue: "Failed to create dataset"**
-**Solutions**:
-1. Check your HF token has write permissions
-2. Verify the username in the repository name
-3. Ensure the dataset name is valid
-
-### **Issue: "Dataset repository must be in format: username/dataset-name"**
-**Solution**:
-- Use the correct format: `username/dataset-name`
-- Example: `your-username/my-experiments`
-
-## 📈 Benefits
-
-### **For Users**
-- ✅ **Easy Setup**: No need to set environment variables
-- ✅ **Visual Interface**: Clear input fields and validation
-- ✅ **Immediate Feedback**: Test connection and see results
-- ✅ **Flexible**: Can change configuration anytime
-
-### **For Development**
-- ✅ **Backward Compatible**: Environment variables still work
-- ✅ **Fallback Support**: Graceful degradation
-- ✅ **Error Handling**: Clear error messages
-- ✅ **Validation**: Input validation and testing
-
-### **For Deployment**
-- ✅ **HF Spaces Ready**: Works on Hugging Face Spaces
-- ✅ **No Restart Required**: Dynamic configuration updates
-- ✅ **Secure**: Password field for token input
-- ✅ **User-Friendly**: Clear instructions and help text
-
-## 🎯 Usage Examples
-
-### **Basic Setup**
-1. Open the app
-2. Go to "⚙️ Configuration" tab
-3. Enter your HF token
-4. Enter your dataset repository
-5. Click "Update Configuration"
-6. Click "Test Connection" to verify
-
-### **Advanced Setup**
-1. Set environment variables as defaults
-2. Use interface to override specific values
-3. Test connection to verify access
-4. Create dataset if it doesn't exist
-5. Start using the app with persistent storage
-
-### **Team Setup**
-1. Create a shared dataset repository
-2. Share the repository name with team
-3. Each team member sets their own HF token
-4. All experiments are stored in the shared dataset
-
-## 📋 Configuration Status
-
-The app shows current configuration status:
-```
-📊 Dataset: your-username/your-dataset
-🔑 HF Token: Set
-📈 Experiments: 5
-```
-
-## 🔄 Updating Configuration
-
-You can update configuration at any time:
-1. Go to "⚙️ Configuration" tab
-2. Change HF token or dataset repository
-3. Click "Update Configuration"
-4. Experiments will reload with new settings
-
----
-
-**🎉 Your Trackio app is now more flexible and user-friendly with direct configuration input!**
\ No newline at end of file
diff --git a/docs/CLOUD_DEPLOYMENT_GUIDE.md b/docs/CLOUD_DEPLOYMENT_GUIDE.md
deleted file mode 100644
index a30d3194b197bc39895f198d48c621b3de2e944c..0000000000000000000000000000000000000000
--- a/docs/CLOUD_DEPLOYMENT_GUIDE.md
+++ /dev/null
@@ -1,462 +0,0 @@
-# Cloud Deployment Guide for SmolLM3 DPO Training
-
-This guide provides the exact sequence of commands to deploy and run SmolLM3 DPO training on a cloud computing instance with 6 epochs.
-
-## Prerequisites
-
-### Cloud Instance Requirements
-
-- **GPU**: NVIDIA A100, H100, or similar (16GB+ VRAM)
-- **RAM**: 64GB+ system memory
-- **Storage**: 100GB+ SSD storage
-- **OS**: Ubuntu 20.04 or 22.04
-
-### Required Information
-
-Before starting, gather these details:
-- Your Hugging Face username
-- Your Hugging Face token (with write permissions)
-- Your Trackio Space URL (if using monitoring)
-
-## Step-by-Step Deployment
-
-### Step 1: Launch Cloud Instance
-
-Choose your cloud provider and launch an instance:
-
-#### AWS (g5.2xlarge or g5.4xlarge)
-```bash
-# Launch instance with Ubuntu 22.04 and appropriate GPU
-aws ec2 run-instances \
- --image-id ami-0c7217cdde317cfec \
- --instance-type g5.2xlarge \
- --key-name your-key-pair \
- --security-group-ids sg-xxxxxxxxx
-```
-
-#### Google Cloud (n1-standard-8 with T4/V100)
-```bash
-gcloud compute instances create smollm3-dpo \
- --zone=us-central1-a \
- --machine-type=n1-standard-8 \
- --accelerator="type=nvidia-tesla-t4,count=1" \
- --image-family=ubuntu-2204-lts \
- --image-project=ubuntu-os-cloud
-```
-
-#### Azure (Standard_NC6s_v3)
-```bash
-az vm create \
- --resource-group your-rg \
- --name smollm3-dpo \
- --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
- --size Standard_NC6s_v3 \
- --admin-username azureuser
-```
-
-### Step 2: Connect to Instance
-
-```bash
-# SSH to your instance
-ssh -i your-key.pem ubuntu@your-instance-ip
-
-# Or for Azure
-ssh azureuser@your-instance-ip
-```
-
-### Step 3: Update System and Install Dependencies
-
-```bash
-# Update system
-sudo apt-get update
-sudo apt-get upgrade -y
-
-# Install system dependencies
-sudo apt-get install -y git curl wget unzip python3 python3-pip python3-venv
-
-# Install NVIDIA drivers (if not pre-installed)
-curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
-curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
- sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
- sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
-
-sudo apt-get update
-sudo apt-get install -y nvidia-container-toolkit
-```
-
-### Step 4: Clone Repository and Setup Environment
-
-```bash
-# Clone your repository
-git clone https://github.com/your-username/flexai-finetune.git
-cd flexai-finetune
-
-# Create virtual environment
-python3 -m venv smollm3_env
-source smollm3_env/bin/activate
-
-# Install PyTorch with CUDA
-pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
-
-# Install project dependencies
-pip install -r requirements.txt
-
-# Install additional DPO dependencies
-pip install trl>=0.7.0
-pip install peft>=0.4.0
-pip install accelerate>=0.20.0
-```
-
-### Step 5: Configure Authentication
-
-```bash
-# Set your Hugging Face token
-export HF_TOKEN="your_huggingface_token_here"
-
-# Login to Hugging Face
-hf login --token $HF_TOKEN
-```
-
-### Step 6: Create Configuration Files
-
-Create the DPO configuration file:
-
-```bash
-cat > config/train_smollm3_dpo_6epochs.py << 'EOF'
-"""
-SmolLM3 DPO Training Configuration - 6 Epochs
-Optimized for cloud deployment
-"""
-
-from config.train_smollm3_dpo import SmolLM3DPOConfig
-
-config = SmolLM3DPOConfig(
- # Model configuration
- model_name="HuggingFaceTB/SmolLM3-3B",
- max_seq_length=4096,
- use_flash_attention=True,
- use_gradient_checkpointing=True,
-
- # Training configuration
- batch_size=2,
- gradient_accumulation_steps=8,
- learning_rate=5e-6,
- weight_decay=0.01,
- warmup_steps=100,
- max_iters=None, # Will be calculated based on epochs
- eval_interval=100,
- log_interval=10,
- save_interval=500,
-
- # DPO configuration
- beta=0.1,
- max_prompt_length=2048,
-
- # Optimizer configuration
- optimizer="adamw",
- beta1=0.9,
- beta2=0.95,
- eps=1e-8,
-
- # Scheduler configuration
- scheduler="cosine",
- min_lr=1e-6,
-
- # Mixed precision
- fp16=True,
- bf16=False,
-
- # Logging and saving
- save_steps=500,
- eval_steps=100,
- logging_steps=10,
- save_total_limit=3,
-
- # Evaluation
- eval_strategy="steps",
- metric_for_best_model="eval_loss",
- greater_is_better=False,
- load_best_model_at_end=True,
-
- # Data configuration
- data_dir="smoltalk_dataset",
- train_file="train.json",
- validation_file="validation.json",
-
- # Chat template configuration
- use_chat_template=True,
- chat_template_kwargs={
- "enable_thinking": False,
- "add_generation_prompt": True
- },
-
- # Trackio monitoring configuration
- enable_tracking=True,
- trackio_url="https://your-trackio-space.hf.space", # Change this
- trackio_token=None,
- log_artifacts=True,
- log_metrics=True,
- log_config=True,
- experiment_name="smollm3_dpo_6epochs"
-)
-EOF
-```
-
-### Step 7: Download and Prepare Dataset
-
-```bash
-# Create dataset preparation script
-cat > prepare_dataset.py << 'EOF'
-from datasets import load_dataset
-import json
-import os
-
-# Load SmolTalk dataset
-print('Loading SmolTalk dataset...')
-dataset = load_dataset('HuggingFaceTB/smoltalk')
-
-# Create dataset directory
-os.makedirs('smoltalk_dataset', exist_ok=True)
-
-# Convert to DPO format (preference pairs)
-def convert_to_dpo_format(example):
- # For SmolTalk, we'll create preference pairs based on response quality
- # This is a simplified example - you may need to adjust based on your needs
- return {
- 'prompt': example.get('prompt', ''),
- 'chosen': example.get('chosen', ''),
- 'rejected': example.get('rejected', '')
- }
-
-# Process train split
-train_data = []
-for example in dataset['train']:
- dpo_example = convert_to_dpo_format(example)
- if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
- train_data.append(dpo_example)
-
-# Process validation split
-val_data = []
-for example in dataset['validation']:
- dpo_example = convert_to_dpo_format(example)
- if dpo_example['prompt'] and dpo_example['chosen'] and dpo_example['rejected']:
- val_data.append(dpo_example)
-
-# Save to files
-with open('smoltalk_dataset/train.json', 'w') as f:
- json.dump(train_data, f, indent=2)
-
-with open('smoltalk_dataset/validation.json', 'w') as f:
- json.dump(val_data, f, indent=2)
-
-print(f'Dataset prepared: {len(train_data)} train samples, {len(val_data)} validation samples')
-EOF
-
-# Run dataset preparation
-python prepare_dataset.py
-```
-
-### Step 8: Calculate Training Parameters
-
-```bash
-# Calculate training steps based on epochs
-TOTAL_SAMPLES=$(python -c "import json; data=json.load(open('smoltalk_dataset/train.json')); print(len(data))")
-BATCH_SIZE=2
-GRADIENT_ACCUMULATION_STEPS=8
-MAX_EPOCHS=6
-EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
-STEPS_PER_EPOCH=$((TOTAL_SAMPLES / EFFECTIVE_BATCH_SIZE))
-MAX_STEPS=$((STEPS_PER_EPOCH * MAX_EPOCHS))
-
-echo "Training Configuration:"
-echo " Total samples: $TOTAL_SAMPLES"
-echo " Effective batch size: $EFFECTIVE_BATCH_SIZE"
-echo " Steps per epoch: $STEPS_PER_EPOCH"
-echo " Total training steps: $MAX_STEPS"
-echo " Training epochs: $MAX_EPOCHS"
-```
-
-### Step 9: Start DPO Training
-
-```bash
-# Start training with all parameters
-python train.py config/train_smollm3_dpo_6epochs.py \
- --dataset_dir smoltalk_dataset \
- --out_dir /output-checkpoint \
- --init_from scratch \
- --max_iters $MAX_STEPS \
- --batch_size $BATCH_SIZE \
- --learning_rate 5e-6 \
- --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
- --max_seq_length 4096 \
- --save_steps 500 \
- --eval_steps 100 \
- --logging_steps 10 \
- --enable_tracking \
- --trackio_url "https://your-trackio-space.hf.space" \
- --experiment_name "smollm3_dpo_6epochs"
-```
-
-### Step 10: Push Model to Hugging Face Hub
-
-```bash
-# Push the trained model
-python push_to_huggingface.py /output-checkpoint "your-username/smollm3-dpo-6epochs" \
- --token "$HF_TOKEN" \
- --trackio-url "https://your-trackio-space.hf.space" \
- --experiment-name "smollm3_dpo_6epochs"
-```
-
-### Step 11: Test the Uploaded Model
-
-```bash
-# Test the model
-python -c "
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-
-print('Loading uploaded model...')
-model = AutoModelForCausalLM.from_pretrained('your-username/smollm3-dpo-6epochs', torch_dtype=torch.float16, device_map='auto')
-tokenizer = AutoTokenizer.from_pretrained('your-username/smollm3-dpo-6epochs')
-
-print('Testing model generation...')
-prompt = 'Hello, how are you?'
-inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
-outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
-response = tokenizer.decode(outputs[0], skip_special_tokens=True)
-print(f'Prompt: {prompt}')
-print(f'Response: {response}')
-print('✅ Model test completed successfully!')
-"
-```
-
-## Complete One-Line Deployment
-
-If you want to run everything automatically, use the deployment script:
-
-```bash
-# Make script executable
-chmod +x cloud_deployment.sh
-
-# Edit configuration in the script first
-nano cloud_deployment.sh
-# Change these variables:
-# - REPO_NAME="your-username/smollm3-dpo-6epochs"
-# - TRACKIO_URL="https://your-trackio-space.hf.space"
-# - HF_TOKEN="your_hf_token_here"
-
-# Run the complete deployment
-./cloud_deployment.sh
-```
-
-## Monitoring and Debugging
-
-### Check GPU Usage
-
-```bash
-# Monitor GPU usage during training
-watch -n 1 nvidia-smi
-```
-
-### Check Training Logs
-
-```bash
-# Monitor training progress
-tail -f training.log
-
-# Check system resources
-htop
-```
-
-### Monitor Trackio
-
-```bash
-# Check if Trackio is logging properly
-curl -s "https://your-trackio-space.hf.space" | grep -i "experiment"
-```
-
-## Expected Timeline
-
-- **Setup**: 15-30 minutes
-- **Dataset preparation**: 5-10 minutes
-- **Training (6 epochs)**: 4-8 hours (depending on GPU)
-- **Model upload**: 10-30 minutes
-- **Testing**: 5-10 minutes
-
-## Troubleshooting
-
-### Common Issues
-
-#### 1. Out of Memory (OOM)
-```bash
-# Reduce batch size
-BATCH_SIZE=1
-GRADIENT_ACCUMULATION_STEPS=16
-
-# Or use gradient checkpointing
-# Already enabled in config
-```
-
-#### 2. Slow Training
-```bash
-# Check GPU utilization
-nvidia-smi
-
-# Check if mixed precision is working
-# Look for "fp16" in training logs
-```
-
-#### 3. Dataset Issues
-```bash
-# Check dataset format
-head -n 5 smoltalk_dataset/train.json
-
-# Verify dataset size
-wc -l smoltalk_dataset/train.json
-```
-
-#### 4. Authentication Issues
-```bash
-# Test HF token
-python -c "
-from huggingface_hub import HfApi
-api = HfApi(token='$HF_TOKEN')
-print('Token is valid!')
-"
-```
-
-## Cost Estimation
-
-### AWS (g5.2xlarge)
-- **Instance**: $0.526/hour
-- **Training time**: 6 hours
-- **Total cost**: ~$3.16
-
-### Google Cloud (n1-standard-8 + T4)
-- **Instance**: $0.38/hour
-- **Training time**: 6 hours
-- **Total cost**: ~$2.28
-
-### Azure (Standard_NC6s_v3)
-- **Instance**: $0.90/hour
-- **Training time**: 6 hours
-- **Total cost**: ~$5.40
-
-## Next Steps
-
-After successful deployment:
-
-1. **Monitor training** in your Trackio Space
-2. **Check model repository** on Hugging Face Hub
-3. **Test the model** with different prompts
-4. **Share your model** with the community
-5. **Iterate and improve** based on results
-
-## Support
-
-- **Training issues**: Check logs and GPU utilization
-- **Upload issues**: Verify HF token and repository permissions
-- **Monitoring issues**: Check Trackio Space configuration
-- **Performance issues**: Adjust batch size and learning rate
-
-Your SmolLM3 DPO model will be ready for use after training completes!
\ No newline at end of file
diff --git a/docs/CLOUD_TRAINING_GUIDE.md b/docs/CLOUD_TRAINING_GUIDE.md
deleted file mode 100644
index 376370cabe95bcb1d3bbb76a0410bd0c54ebdd96..0000000000000000000000000000000000000000
--- a/docs/CLOUD_TRAINING_GUIDE.md
+++ /dev/null
@@ -1,440 +0,0 @@
-# Cloud Training Guide for OpenHermes-FR Dataset
-
-This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the [legmlai/openhermes-fr](https://huggingface.co/datasets/legmlai/openhermes-fr) dataset.
-
-## Overview
-
-The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, perfect for fine-tuning SmolLM3 models for French language tasks. This guide covers:
-
-- ✅ **Cloud Instance Setup** - Complete environment configuration
-- ✅ **Dataset Integration** - Automatic loading and filtering
-- ✅ **Training Configuration** - Optimized for French instruction tuning
-- ✅ **Monitoring Integration** - Trackio experiment tracking
-- ✅ **Model Deployment** - Push to Hugging Face Hub
-
-## Dataset Information
-
-### Schema
-```json
-{
- "prompt": "Explique la différence entre la photosynthèse C3 et C4.",
- "accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)",
- "bad_prompt_detected": false,
- "bad_response_detected": false,
- "bad_entry": false
-}
-```
-
-### Key Features
-- **Size**: 799,875 examples (~1.4GB)
-- **Language**: 100% French
-- **Quality**: GPT-4o generated responses with automatic filtering
-- **License**: ODC-BY 1.0
-
-## Cloud Instance Setup
-
-### 1. Choose Your Cloud Provider
-
-#### **AWS EC2 (Recommended)**
-```bash
-# Launch instance with GPU
-# Recommended: g4dn.xlarge or g5.xlarge
-# AMI: Deep Learning AMI (Ubuntu 20.04)
-```
-
-#### **Google Cloud Platform**
-```bash
-# Launch instance with GPU
-# Recommended: n1-standard-4 with Tesla T4 or V100
-```
-
-#### **Azure**
-```bash
-# Launch instance with GPU
-# Recommended: Standard_NC6s_v3 or Standard_NC12s_v3
-```
-
-### 2. Instance Specifications
-
-#### **Minimum Requirements**
-- **GPU**: 16GB+ VRAM (Tesla T4, V100, or A100)
-- **RAM**: 32GB+ system memory
-- **Storage**: 100GB+ SSD
-- **CPU**: 8+ cores
-
-#### **Recommended Specifications**
-- **GPU**: A100 (40GB) or H100 (80GB)
-- **RAM**: 64GB+ system memory
-- **Storage**: 200GB+ NVMe SSD
-- **CPU**: 16+ cores
-
-### 3. Environment Setup
-
-```bash
-# Update system
-sudo apt update && sudo apt upgrade -y
-
-# Install CUDA (if not pre-installed)
-# Follow NVIDIA CUDA installation guide for your GPU
-
-# Install Python dependencies
-sudo apt install python3-pip python3-venv git -y
-
-# Create virtual environment
-python3 -m venv smollm3_env
-source smollm3_env/bin/activate
-
-# Clone repository
-git clone
-cd
-
-# Install dependencies
-pip install -r requirements.txt
-
-# Install additional dependencies for cloud training
-pip install accelerate transformers datasets huggingface_hub
-```
-
-## Training Configuration
-
-### 1. Use the OpenHermes-FR Config
-
-The repository includes a specialized configuration for the OpenHermes-FR dataset:
-
-```bash
-python train.py config/train_smollm3_openhermes_fr.py \
- --enable_tracking \
- --trackio_url "https://your-space.hf.space" \
- --experiment_name "smollm3_fr_openhermes_v1"
-```
-
-### 2. Configuration Details
-
-The `config/train_smollm3_openhermes_fr.py` includes:
-
-#### **Dataset Configuration**
-```python
-dataset_name: str = "legmlai/openhermes-fr"
-dataset_split: str = "train"
-input_field: str = "prompt"
-target_field: str = "accepted_completion"
-filter_bad_entries: bool = True
-bad_entry_field: str = "bad_entry"
-```
-
-#### **Training Optimization**
-```python
-batch_size: int = 2 # Reduced for French text (longer sequences)
-gradient_accumulation_steps: int = 8 # Maintains effective batch size
-learning_rate: float = 1e-5 # Lower for instruction tuning
-max_iters: int = 2000 # More iterations for large dataset
-```
-
-#### **Monitoring Integration**
-```python
-enable_tracking: bool = True
-experiment_name: str = "smollm3_openhermes_fr"
-```
-
-## Training Commands
-
-### Basic Training
-```bash
-python train.py config/train_smollm3_openhermes_fr.py
-```
-
-### Training with Monitoring
-```bash
-python train.py config/train_smollm3_openhermes_fr.py \
- --enable_tracking \
- --trackio_url "https://your-trackio-space.hf.space" \
- --experiment_name "smollm3_fr_openhermes_v1"
-```
-
-### Training with Custom Parameters
-```bash
-python train.py config/train_smollm3_openhermes_fr.py \
- --batch_size 4 \
- --learning_rate 2e-5 \
- --max_iters 3000 \
- --enable_tracking \
- --trackio_url "https://your-trackio-space.hf.space" \
- --experiment_name "smollm3_fr_high_lr"
-```
-
-### Training with Checkpoint Resume
-```bash
-python train.py config/train_smollm3_openhermes_fr.py \
- --init_from resume \
- --enable_tracking \
- --trackio_url "https://your-trackio-space.hf.space" \
- --experiment_name "smollm3_fr_resume"
-```
-
-## Dataset Processing
-
-### Automatic Filtering
-
-The training script automatically:
-- ✅ **Loads** the OpenHermes-FR dataset from Hugging Face
-- ✅ **Filters** out bad entries (`bad_entry = true`)
-- ✅ **Splits** data into train/validation/test (98/1/1)
-- ✅ **Formats** prompts and completions for instruction tuning
-
-### Manual Dataset Inspection
-
-```python
-from datasets import load_dataset
-
-# Load dataset
-dataset = load_dataset("legmlai/openhermes-fr")
-
-# Check dataset info
-print(f"Dataset size: {len(dataset['train'])}")
-print(f"Sample columns: {dataset['train'].column_names}")
-
-# Check filtering
-bad_entries = dataset['train'].filter(lambda x: x['bad_entry'])
-print(f"Bad entries: {len(bad_entries)}")
-
-# Sample data
-sample = dataset['train'][0]
-print(f"Prompt: {sample['prompt']}")
-print(f"Completion: {sample['accepted_completion']}")
-```
-
-## Monitoring and Tracking
-
-### Trackio Integration
-
-The training automatically logs:
-- **Training metrics**: Loss, accuracy, learning rate
-- **System metrics**: GPU memory, CPU usage
-- **Dataset info**: Size, filtering statistics
-- **Model checkpoints**: Regular saves with metadata
-
-### View Training Progress
-
-1. **Trackio Space**: Visit your Trackio Space URL
-2. **Experiment Details**: Check the "View Experiments" tab
-3. **Metrics**: Monitor loss curves and system usage
-4. **Logs**: Download training logs for analysis
-
-## Model Deployment
-
-### Push to Hugging Face Hub
-
-After training, deploy your model:
-
-```bash
-python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \
- --trackio-url "https://your-trackio-space.hf.space" \
- --experiment-name "smollm3_fr_openhermes_v1"
-```
-
-### Use Your Model
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load your fine-tuned model
-model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes")
-tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes")
-
-# Generate French text
-prompt = "Expliquez le concept de l'intelligence artificielle."
-inputs = tokenizer(prompt, return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=200)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-```
-
-## Performance Optimization
-
-### GPU Memory Management
-
-```bash
-# Monitor GPU usage
-nvidia-smi -l 1
-
-# Optimize for your GPU
-# For 16GB VRAM: batch_size=2, gradient_accumulation_steps=8
-# For 24GB VRAM: batch_size=4, gradient_accumulation_steps=4
-# For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2
-```
-
-### Training Speed
-
-```bash
-# Use mixed precision (enabled by default)
-fp16: bool = True
-
-# Enable gradient checkpointing (enabled by default)
-use_gradient_checkpointing: bool = True
-
-# Use flash attention (enabled by default)
-use_flash_attention: bool = True
-```
-
-## Troubleshooting
-
-### Common Issues
-
-#### 1. **Out of Memory (OOM)**
-```bash
-# Reduce batch size
-python train.py config/train_smollm3_openhermes_fr.py --batch_size 1
-
-# Increase gradient accumulation
-# Edit config: gradient_accumulation_steps = 16
-```
-
-#### 2. **Slow Training**
-```bash
-# Check GPU utilization
-nvidia-smi
-
-# Verify data loading
-# Check if dataset is cached locally
-```
-
-#### 3. **Dataset Loading Issues**
-```bash
-# Clear cache
-rm -rf ~/.cache/huggingface/
-
-# Check internet connection
-# Verify dataset name: "legmlai/openhermes-fr"
-```
-
-#### 4. **Monitoring Connection Issues**
-```bash
-# Test Trackio connection
-curl -I https://your-trackio-space.hf.space
-
-# Check token permissions
-# Verify experiment name format
-```
-
-### Debug Mode
-
-```bash
-# Enable debug logging
-export LOG_LEVEL=DEBUG
-python train.py config/train_smollm3_openhermes_fr.py
-```
-
-## Cost Optimization
-
-### Cloud Provider Tips
-
-#### **AWS EC2**
-- Use Spot Instances for cost savings
-- Monitor usage with CloudWatch
-- Use appropriate instance types
-
-#### **Google Cloud Platform**
-- Use Preemptible VMs for non-critical training
-- Monitor with Cloud Monitoring
-- Use committed use discounts
-
-#### **Azure**
-- Use Spot VMs for cost optimization
-- Monitor with Azure Monitor
-- Use reserved instances for long training
-
-### Training Time Estimates
-
-| GPU Type | Batch Size | Estimated Time |
-|----------|------------|----------------|
-| Tesla T4 (16GB) | 2 | 8-12 hours |
-| V100 (32GB) | 4 | 4-6 hours |
-| A100 (40GB) | 8 | 2-3 hours |
-| H100 (80GB) | 16 | 1-2 hours |
-
-## Security Best Practices
-
-### Token Management
-```bash
-# Use environment variables
-export HF_TOKEN="your_token_here"
-export TRACKIO_TOKEN="your_trackio_token"
-
-# Don't hardcode in scripts
-# Use IAM roles when possible
-```
-
-### Data Privacy
-```bash
-# Use private repositories for sensitive models
-python push_to_huggingface.py model username/private-model --private
-
-# Secure your cloud instance
-# Use VPC and security groups
-```
-
-## Complete Workflow Example
-
-### 1. Setup Cloud Instance
-```bash
-# Launch GPU instance
-# Install dependencies
-git clone
-cd
-pip install -r requirements.txt
-```
-
-### 2. Train Model
-```bash
-python train.py config/train_smollm3_openhermes_fr.py \
- --enable_tracking \
- --trackio_url "https://your-space.hf.space" \
- --experiment_name "smollm3_fr_v1"
-```
-
-### 3. Deploy Model
-```bash
-python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \
- --trackio-url "https://your-space.hf.space" \
- --experiment-name "smollm3_fr_v1"
-```
-
-### 4. Test Model
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1")
-tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1")
-
-# Test French generation
-prompt = "Qu'est-ce que l'apprentissage automatique?"
-inputs = tokenizer(prompt, return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=100)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-```
-
-## Support and Resources
-
-### Documentation
-- [OpenHermes-FR Dataset](https://huggingface.co/datasets/legmlai/openhermes-fr)
-- [SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
-- [Trackio Monitoring](https://github.com/Josephrp/trackio)
-
-### Community
-- [Hugging Face Forums](https://discuss.huggingface.co/)
-- [Transformers Documentation](https://huggingface.co/docs/transformers/)
-
-### Examples
-- [French Language Models](https://huggingface.co/models?search=french)
-- [Instruction Tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
-
-## Conclusion
-
-This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud:
-
-- ✅ **Complete Setup** - From cloud instance to model deployment
-- ✅ **Optimized Configuration** - Tailored for French instruction tuning
-- ✅ **Monitoring Integration** - Trackio experiment tracking
-- ✅ **Cost Optimization** - Tips for efficient cloud usage
-- ✅ **Troubleshooting** - Solutions for common issues
-
-Start training your French language model today!
\ No newline at end of file
diff --git a/docs/Configuration_Management.md b/docs/Configuration_Management.md
new file mode 100644
index 0000000000000000000000000000000000000000..e35e6ba932852167117e638b1e71b036e8dc8bee
--- /dev/null
+++ b/docs/Configuration_Management.md
@@ -0,0 +1,29 @@
+```mermaid
+graph LR
+ Configuration_Management["Configuration Management"]
+ Training_Orchestration["Training Orchestration"]
+ Training_Orchestration -- "retrieves configuration from" --> Configuration_Management
+ click Configuration_Management href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Configuration_Management.md" "Details"
+```
+
+
+## Details
+
+This graph captures the configuration-driven flow of the project: the Training Orchestration component, which serves as the entry point for training and fine-tuning runs, retrieves all model, data, and hyperparameter settings from Configuration Management before launching a run, so every experiment is defined in one central place.
+
+### Configuration Management [[Expand]](./Configuration_Management.md)
+This component, centered on the `SmolLM3Config` dataclass and the `get_config` function in `config/train_smollm3.py`, centralizes the definition, loading, validation, and retrieval of all training parameters, model specifications, data paths, and hyperparameters. It supports loading both base and custom configurations, ensuring that all required settings are available and correctly formatted for the training and fine-tuning processes.
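+
+A minimal usage sketch (the exact `get_config` signature and config attribute names are assumptions based on the description above):
+
+```python
+# Hypothetical usage; adjust the path and attribute names to the actual dataclass.
+from config.train_smollm3 import get_config
+
+config = get_config("config/train_smollm3.py")
+print(config.model_name, config.learning_rate)
+```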
+
+
+**Related Classes/Methods**: _None_
+
+### Training Orchestration
+This component represents the main scripts or modules responsible for initiating and coordinating the training and fine-tuning processes. It acts as the primary entry point for different training runs, retrieving necessary configurations and orchestrating the overall training pipeline.
+
+
+**Related Classes/Methods**: _None_
+
+
+
+### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
\ No newline at end of file
diff --git a/docs/DATASET_AUTOMATION_FIX.md b/docs/DATASET_AUTOMATION_FIX.md
deleted file mode 100644
index 22a8276e527c2de1799dcbbf4fe0f9a3f9668beb..0000000000000000000000000000000000000000
--- a/docs/DATASET_AUTOMATION_FIX.md
+++ /dev/null
@@ -1,218 +0,0 @@
-# Dataset Configuration Automation Fix
-
-## Problem Description
-
-The original launch script required users to manually specify their username in the dataset repository name, which was:
-1. **Error-prone**: Users had to remember their username
-2. **Inconsistent**: Different users might use different naming conventions
-3. **Manual**: Required extra steps in the setup process
-
-## Solution Implementation
-
-### Automatic Dataset Repository Creation
-
-We've implemented a Python-based solution that automatically:
-
-1. **Extracts username from token**: Uses the HF API to get the username from the validated token
-2. **Creates dataset repository**: Automatically creates `username/trackio-experiments` or custom name
-3. **Sets environment variables**: Automatically configures `TRACKIO_DATASET_REPO`
-4. **Provides customization**: Allows users to customize the dataset name if desired
-
-### Key Components
-
-#### 1. **`scripts/dataset_tonic/setup_hf_dataset.py`** - Main Dataset Setup Script
-- Automatically detects username from HF token
-- Creates dataset repository with proper permissions
-- Supports custom dataset names
-- Sets environment variables for other scripts
-
-#### 2. **Updated `launch.sh`** - Enhanced User Experience
-- Automatically creates dataset repository
-- Provides options for default or custom dataset names
-- Fallback to manual input if automatic creation fails
-- Clear user feedback and progress indicators
-
-#### 3. **Python API Integration** - Consistent Authentication
-- Uses `HfApi(token=token)` for direct token authentication
-- Avoids environment variable conflicts
-- Consistent error handling across all scripts
-
-## Usage Examples
-
-### Automatic Dataset Creation (Default)
-
-```bash
-# The launch script now automatically:
-python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here
-
-# Creates: username/trackio-experiments
-# Sets: TRACKIO_DATASET_REPO=username/trackio-experiments
-```
-
-### Custom Dataset Name
-
-```bash
-# Create with custom name
-python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here my-custom-experiments
-
-# Creates: username/my-custom-experiments
-# Sets: TRACKIO_DATASET_REPO=username/my-custom-experiments
-```
-
-### Launch Script Integration
-
-The launch script now provides a seamless experience:
-
-```bash
-./launch.sh
-
-# Step 3: Experiment Details
-# - Automatically creates dataset repository
-# - Option to use default or custom name
-# - No manual username input required
-```
-
-## Features
-
-### ✅ **Automatic Username Detection**
-- Extracts username from HF token using Python API
-- No manual username input required
-- Consistent across all scripts
-
-### ✅ **Flexible Dataset Naming**
-- Default: `username/trackio-experiments`
-- Custom: `username/custom-name`
-- User choice during setup
-
-### ✅ **Robust Error Handling**
-- Graceful fallback to manual input
-- Clear error messages
-- Token validation before creation
-
-### ✅ **Environment Integration**
-- Automatically sets `TRACKIO_DATASET_REPO`
-- Compatible with existing scripts
-- No manual configuration required
-
-### ✅ **Cross-Platform Compatibility**
-- Works on Windows, Linux, macOS
-- Uses Python API instead of CLI
-- Consistent behavior across platforms
-
-## Technical Implementation
-
-### Token Authentication Flow
-
-```python
-# 1. Direct token authentication
-api = HfApi(token=token)
-
-# 2. Extract username
-user_info = api.whoami()
-username = user_info.get("name", user_info.get("username"))
-
-# 3. Create repository
-create_repo(
- repo_id=f"{username}/{dataset_name}",
- repo_type="dataset",
- token=token,
- exist_ok=True,
- private=False
-)
-```
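-
-A minimal sketch of the validation step (the helper name is illustrative, not the script's actual function):
-
-```python
-from huggingface_hub import HfApi
-
-def validate_token(token: str) -> bool:
-    """Return True if the token authenticates against the Hub."""
-    try:
-        HfApi(token=token).whoami()
-        return True
-    except Exception as exc:
-        print(f"❌ Token validation failed: {exc}")
-        return False
-```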
-
-### Launch Script Integration
-
-```bash
-# Automatic dataset creation
-if python3 scripts/dataset_tonic/setup_hf_dataset.py 2>/dev/null; then
- TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
- print_status "Dataset repository created successfully"
-else
- # Fallback to manual input
- get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
-fi
-```
-
-## User Experience Improvements
-
-### Before (Manual Process)
-1. User enters HF token
-2. User manually types username
-3. User manually types dataset repository name
-4. User manually configures environment variables
-5. Risk of typos and inconsistencies
-
-### After (Automated Process)
-1. User enters HF token
-2. System automatically detects username
-3. System automatically creates dataset repository
-4. System automatically sets environment variables
-5. Option to customize dataset name if desired
-
-## Error Handling
-
-### Common Scenarios
-
-| Scenario | Action | User Experience |
-|----------|--------|-----------------|
-| Valid token | ✅ Automatic creation | Seamless setup |
-| Invalid token | ❌ Clear error message | Helpful feedback |
-| Network issues | ⚠️ Retry with fallback | Graceful degradation |
-| Repository exists | ℹ️ Use existing | No conflicts |
-
-### Fallback Mechanisms
-
-1. **Token validation fails**: Clear error message with troubleshooting steps
-2. **Dataset creation fails**: Fallback to manual input
-3. **Network issues**: Retry with exponential backoff (see the sketch below)
-4. **Permission issues**: Clear guidance on token permissions
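-
-A minimal backoff sketch (illustrative, not the repository's actual implementation):
-
-```python
-import time
-
-def with_backoff(fn, retries=3, base_delay=1.0):
-    """Retry fn, sleeping 1s, 2s, 4s, ... between attempts."""
-    for attempt in range(retries):
-        try:
-            return fn()
-        except Exception:
-            if attempt == retries - 1:
-                raise
-            time.sleep(base_delay * (2 ** attempt))
-```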
-
-## Benefits
-
-### For Users
-- **Simplified Setup**: No manual username input required
-- **Reduced Errors**: Automatic username detection eliminates typos
-- **Consistent Naming**: Standardized repository naming conventions
-- **Better UX**: Clear progress indicators and feedback
-
-### For Developers
-- **Maintainable Code**: Python API instead of CLI dependencies
-- **Cross-Platform**: Works consistently across operating systems
-- **Extensible**: Easy to add new features and customizations
-- **Testable**: Comprehensive test coverage
-
-### For System
-- **Reliable**: Robust error handling and fallback mechanisms
-- **Secure**: Direct token authentication without environment conflicts
-- **Scalable**: Easy to extend for additional repository types
-- **Integrated**: Seamless integration with existing pipeline
-
-## Migration Guide
-
-### For Existing Users
-
-No migration required! The system automatically:
-- Detects existing repositories
-- Uses existing repositories if they exist
-- Creates new repositories only when needed
-
-### For New Users
-
-The setup is now completely automated:
-1. Run `./launch.sh`
-2. Enter your HF token
-3. Choose dataset naming preference
-4. System handles everything else automatically
-
-## Future Enhancements
-
-- [ ] Support for organization repositories
-- [ ] Multiple dataset repositories per user
-- [ ] Dataset repository templates
-- [ ] Advanced repository configuration options
-- [ ] Repository sharing and collaboration features
-
----
-
-**Note**: This automation ensures that users can focus on their fine-tuning experiments rather than repository setup details, while maintaining full flexibility for customization when needed.
\ No newline at end of file
diff --git a/docs/DATASET_COMPONENTS_VERIFICATION.md b/docs/DATASET_COMPONENTS_VERIFICATION.md
deleted file mode 100644
index 2e13d4d4485d4814f1caf395f7c45c61135b36bc..0000000000000000000000000000000000000000
--- a/docs/DATASET_COMPONENTS_VERIFICATION.md
+++ /dev/null
@@ -1,235 +0,0 @@
-# Dataset Components Verification
-
-## Overview
-
-This document verifies that all important dataset components have been properly implemented and are working correctly.
-
-## ✅ **Verified Components**
-
-### 1. **Initial Experiment Data** ✅ IMPLEMENTED
-
-**Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `add_initial_experiment_data()` function
-
-**What it does**:
-- Creates comprehensive sample experiment data
-- Includes realistic training metrics (loss, accuracy, GPU usage, etc.)
-- Contains proper experiment parameters (model name, batch size, learning rate, etc.)
-- Includes experiment logs and artifacts structure
-- Uploads data to HF Dataset using `datasets` library
-
-**Sample Data Structure**:
-```json
-{
- "experiment_id": "exp_20250120_143022",
- "name": "smollm3-finetune-demo",
- "description": "SmolLM3 fine-tuning experiment demo with comprehensive metrics tracking",
- "created_at": "2025-01-20T14:30:22.123456",
- "status": "completed",
- "metrics": "[{\"timestamp\": \"2025-01-20T14:30:22.123456\", \"step\": 100, \"metrics\": {\"loss\": 1.15, \"grad_norm\": 10.5, \"learning_rate\": 5e-6, \"num_tokens\": 1000000.0, \"mean_token_accuracy\": 0.76, \"epoch\": 0.1, \"total_tokens\": 1000000.0, \"throughput\": 2000000.0, \"step_time\": 0.5, \"batch_size\": 2, \"seq_len\": 4096, \"token_acc\": 0.76, \"gpu_memory_allocated\": 15.2, \"gpu_memory_reserved\": 70.1, \"gpu_utilization\": 85.2, \"cpu_percent\": 2.7, \"memory_percent\": 10.1}}]",
- "parameters": "{\"model_name\": \"HuggingFaceTB/SmolLM3-3B\", \"max_seq_length\": 4096, \"batch_size\": 2, \"learning_rate\": 5e-6, \"epochs\": 3, \"dataset\": \"OpenHermes-FR\", \"trainer_type\": \"SFTTrainer\", \"hardware\": \"GPU (H100/A100)\", \"mixed_precision\": true, \"gradient_checkpointing\": true, \"flash_attention\": true}",
- "artifacts": "[]",
- "logs": "[{\"timestamp\": \"2025-01-20T14:30:22.123456\", \"level\": \"INFO\", \"message\": \"Training started successfully\"}, {\"timestamp\": \"2025-01-20T14:30:22.123456\", \"level\": \"INFO\", \"message\": \"Model loaded and configured\"}, {\"timestamp\": \"2025-01-20T14:30:22.123456\", \"level\": \"INFO\", \"message\": \"Dataset loaded and preprocessed\"}]",
- "last_updated": "2025-01-20T14:30:22.123456"
-}
-```
-
-**Test Result**: ✅ Successfully uploaded to `Tonic/test-dataset-complete`
-
-### 2. **README Templates** ✅ IMPLEMENTED
-
-**Location**:
-- Template: `templates/datasets/readme.md`
-- Implementation: `scripts/dataset_tonic/setup_hf_dataset.py` - `add_dataset_readme()` function
-
-**What it does**:
-- Uses comprehensive README template from `templates/datasets/readme.md`
-- Falls back to basic README if template doesn't exist
-- Includes dataset schema documentation
-- Provides usage examples and integration information
-- Uploads README to dataset repository using `huggingface_hub`
-
-**Template Features**:
-- Dataset schema documentation
-- Metrics structure examples
-- Integration instructions
-- Privacy and license information
-- Sample experiment entries
-
-**Test Result**: ✅ Successfully added README to `Tonic/test-dataset-complete`
-
-### 3. **Dataset Repository Creation** ✅ IMPLEMENTED
-
-**Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `create_dataset_repository()` function
-
-**What it does**:
-- Creates HF Dataset repository with proper permissions
-- Handles existing repositories gracefully
-- Sets up public dataset for easier sharing
-- Uses Python API (`huggingface_hub.create_repo`)
-
-**Test Result**: ✅ Successfully created dataset repositories
-
-### 4. **Automatic Username Detection** ✅ IMPLEMENTED
-
-**Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `get_username_from_token()` function
-
-**What it does**:
-- Extracts username from HF token using Python API
-- Uses `HfApi(token=token).whoami()`
-- Handles both `name` and `username` fields
-- Provides clear error messages
-
-**Test Result**: ✅ Successfully detected username "Tonic"
-
-### 5. **Environment Variable Integration** ✅ IMPLEMENTED
-
-**Location**: `scripts/dataset_tonic/setup_hf_dataset.py` - `setup_trackio_dataset()` function
-
-**What it does**:
-- Sets `TRACKIO_DATASET_REPO` environment variable
-- Supports both environment and command-line token sources
-- Provides clear feedback on environment setup
-
-**Test Result**: ✅ Successfully set `TRACKIO_DATASET_REPO=Tonic/test-dataset-complete`
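-
-Conceptually, the handoff works like this (a sketch; the setup script's internals may differ):
-
-```python
-import os
-
-# The setup script exports the repository ID for downstream scripts...
-os.environ["TRACKIO_DATASET_REPO"] = "Tonic/test-dataset-complete"
-
-# ...and a monitoring script picks it up later, with a sensible default.
-dataset_repo = os.environ.get("TRACKIO_DATASET_REPO", "username/trackio-experiments")
-```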
-
-### 6. **Launch Script Integration** ✅ IMPLEMENTED
-
-**Location**: `launch.sh` - Dataset creation section
-
-**What it does**:
-- Automatically calls dataset setup script
-- Provides user options for default or custom dataset names
-- Falls back to manual input if automatic creation fails
-- Integrates seamlessly with the training pipeline
-
-**Features**:
-- Automatic dataset creation
-- Custom dataset name support
-- Graceful error handling
-- Clear user feedback
-
-## 🔧 **Technical Implementation Details**
-
-### Token Authentication Flow
-
-```python
-# 1. Direct token authentication
-api = HfApi(token=token)
-
-# 2. Extract username
-user_info = api.whoami()
-username = user_info.get("name", user_info.get("username"))
-
-# 3. Create repository
-create_repo(
- repo_id=f"{username}/{dataset_name}",
- repo_type="dataset",
- token=token,
- exist_ok=True,
- private=False
-)
-
-# 4. Upload data
-dataset = Dataset.from_list(initial_experiments)
-dataset.push_to_hub(repo_id, token=token, private=False)
-
-# 5. Upload README
-upload_file(
- path_or_fileobj=readme_content.encode("utf-8"),  # bytes, since this is content rather than a file path
- path_in_repo="README.md",
- repo_id=repo_id,
- repo_type="dataset",
- token=token
-)
-```
-
-### Error Handling
-
-- **Token validation**: Clear error messages for invalid tokens
-- **Repository creation**: Handles existing repositories gracefully
-- **Data upload**: Fallback mechanisms for upload failures
-- **README upload**: Graceful handling of template issues
-
-### Cross-Platform Compatibility
-
-- **Windows**: Tested and working on Windows PowerShell
-- **Linux**: Compatible with bash scripts
-- **macOS**: Compatible with zsh/bash
-
-## 📊 **Test Results**
-
-### Successful Test Run
-
-```bash
-$ python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here test-dataset-complete
-
-🚀 Setting up Trackio Dataset Repository
-==================================================
-🔍 Getting username from token...
-✅ Authenticated as: Tonic
-🔧 Creating dataset repository: Tonic/test-dataset-complete
-✅ Successfully created dataset repository: Tonic/test-dataset-complete
-✅ Set TRACKIO_DATASET_REPO=Tonic/test-dataset-complete
-📊 Adding initial experiment data...
-Creating parquet from Arrow format: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 93.77ba/s]
-Uploading the dataset shards: 100%|█████████████████████████████████████| 1/1 [00:01<00:00, 1.39s/ shards]
-✅ Successfully uploaded initial experiment data to Tonic/test-dataset-complete
-✅ Successfully added README to Tonic/test-dataset-complete
-✅ Successfully added initial experiment data
-
-🎉 Dataset setup complete!
-📊 Dataset URL: https://huggingface.co/datasets/Tonic/test-dataset-complete
-🔧 Repository ID: Tonic/test-dataset-complete
-```
-
-### Verified Dataset Repository
-
-**URL**: https://huggingface.co/datasets/Tonic/test-dataset-complete
-
-**Contents**:
-- ✅ README.md with comprehensive documentation
-- ✅ Initial experiment data with realistic metrics
-- ✅ Proper dataset schema
-- ✅ Public repository for easy access
-
-## 🎯 **Integration Points**
-
-### 1. **Trackio Space Integration**
-- Dataset repository automatically configured
-- Environment variables set for Space deployment
-- Compatible with Trackio monitoring interface
-
-### 2. **Training Pipeline Integration**
-- `TRACKIO_DATASET_REPO` environment variable set
-- Compatible with monitoring scripts
-- Ready for experiment logging
-
-### 3. **Launch Script Integration**
-- Seamless integration with `launch.sh`
-- Automatic dataset creation during setup
-- User-friendly configuration options
-
-## ✅ **Verification Summary**
-
-| Component | Status | Location | Test Result |
-|-----------|--------|----------|-------------|
-| Initial Experiment Data | ✅ Implemented | `setup_hf_dataset.py` | ✅ Uploaded successfully |
-| README Templates | ✅ Implemented | `templates/datasets/readme.md` | ✅ Added to repository |
-| Dataset Repository Creation | ✅ Implemented | `setup_hf_dataset.py` | ✅ Created successfully |
-| Username Detection | ✅ Implemented | `setup_hf_dataset.py` | ✅ Detected "Tonic" |
-| Environment Variables | ✅ Implemented | `setup_hf_dataset.py` | ✅ Set correctly |
-| Launch Script Integration | ✅ Implemented | `launch.sh` | ✅ Integrated |
-| Error Handling | ✅ Implemented | All functions | ✅ Graceful fallbacks |
-| Cross-Platform Support | ✅ Implemented | Python API | ✅ Windows/Linux/macOS |
-
-## 🚀 **Next Steps**
-
-The dataset components are now **fully implemented and verified**. Users can:
-
-1. **Run the launch script**: `./launch.sh`
-2. **Get automatic dataset creation**: No manual username input required
-3. **Receive comprehensive documentation**: README templates included
-4. **Start with sample data**: Initial experiment data provided
-5. **Monitor experiments**: Trackio integration ready
-
-**All important components are properly implemented and working correctly!** 🎉
\ No newline at end of file
diff --git a/docs/DEPLOYMENT_COMPONENTS_VERIFICATION.md b/docs/DEPLOYMENT_COMPONENTS_VERIFICATION.md
deleted file mode 100644
index d28414d4a51806164b7bacdd2939f544cd629d94..0000000000000000000000000000000000000000
--- a/docs/DEPLOYMENT_COMPONENTS_VERIFICATION.md
+++ /dev/null
@@ -1,393 +0,0 @@
-# Deployment Components Verification
-
-## Overview
-
-This document verifies that all important components for Trackio Spaces deployment and model repository deployment have been properly implemented and are working correctly.
-
-## ✅ **Trackio Spaces Deployment - Verified Components**
-
-### 1. **Space Creation** ✅ IMPLEMENTED
-
-**Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `create_space()` function
-
-**What it does**:
-- Creates HF Space using latest Python API (`create_repo`)
-- Falls back to CLI method if API fails
-- Handles authentication and username extraction
-- Sets proper Space configuration (Gradio SDK, CPU hardware)
-
-**Key Features**:
-- ✅ **API-based creation**: Uses `huggingface_hub.create_repo`
-- ✅ **Fallback mechanism**: CLI method if API fails
-- ✅ **Username extraction**: Automatic from token using `whoami()`
-- ✅ **Proper configuration**: Gradio SDK, CPU hardware, public access
-
-**Test Result**: ✅ Successfully creates Spaces
-
-### 2. **File Upload System** ✅ IMPLEMENTED
-
-**Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `upload_files_to_space()` function
-
-**What it does**:
-- Prepares all required files in temporary directory
-- Uploads files using HF Hub API (`upload_file`)
-- Handles proper file structure for HF Spaces
-- Sets up git repository and pushes to main branch
-
-**Key Features**:
-- ✅ **API-based upload**: Uses `huggingface_hub.upload_file`
-- ✅ **Proper file structure**: Follows HF Spaces requirements
-- ✅ **Git integration**: Proper git workflow in temp directory
-- ✅ **Error handling**: Graceful fallback mechanisms
-
-**Files Uploaded**:
-- ✅ `app.py` - Main Gradio interface
-- ✅ `requirements.txt` - Dependencies
-- ✅ `README.md` - Space documentation
-- ✅ `.gitignore` - Git ignore file
-
-### 3. **Space Configuration** ✅ IMPLEMENTED
-
-**Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `set_space_secrets()` function
-
-**What it does**:
-- Sets environment variables via HF Hub API
-- Configures `HF_TOKEN` for dataset access
-- Sets `TRACKIO_DATASET_REPO` for experiment storage
-- Provides manual setup instructions if API fails
-
-**Key Features**:
-- ✅ **API-based secrets**: Uses `add_space_secret()` method
-- ✅ **Automatic configuration**: Sets required environment variables
-- ✅ **Manual fallback**: Clear instructions if API fails
-- ✅ **Error handling**: Graceful degradation
-
-### 4. **Space Testing** ✅ IMPLEMENTED
-
-**Location**: `scripts/trackio_tonic/deploy_trackio_space.py` - `test_space()` function
-
-**What it does**:
-- Tests Space availability after deployment
-- Checks if Space is building correctly
-- Provides status feedback to user
-- Handles build time delays
-
-**Key Features**:
-- ✅ **Availability testing**: Checks Space URL accessibility (see the sketch below)
-- ✅ **Build status**: Monitors Space build progress
-- ✅ **User feedback**: Clear status messages
-- ✅ **Timeout handling**: Proper wait times for builds
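-
-An availability probe could be as simple as the following sketch (illustrative; the deployment script's actual check may differ):
-
-```python
-import requests
-
-def space_is_up(space_url: str, timeout: int = 10) -> bool:
-    """Return True if the Space URL responds with HTTP 200."""
-    try:
-        return requests.get(space_url, timeout=timeout).status_code == 200
-    except requests.RequestException:
-        return False
-```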
-
-### 5. **Gradio Interface** ✅ IMPLEMENTED
-
-**Location**: `templates/spaces/app.py` - Complete Gradio application
-
-**What it does**:
-- Provides comprehensive experiment tracking interface
-- Integrates with HF Datasets for persistent storage
-- Offers real-time metrics visualization
-- Supports API access for training scripts
-
-**Key Features**:
-- ✅ **Experiment management**: Create, view, update experiments
-- ✅ **Metrics logging**: Real-time training metrics
-- ✅ **Visualization**: Interactive plots and charts
-- ✅ **HF Datasets integration**: Persistent storage
-- ✅ **API endpoints**: Programmatic access
-- ✅ **Fallback data**: Backup when dataset unavailable
-
-**Interface Components**:
-- ✅ **Create Experiment**: Start new experiments
-- ✅ **Log Metrics**: Track training progress
-- ✅ **View Experiments**: See experiment details
-- ✅ **Update Status**: Mark experiments complete
-- ✅ **Visualizations**: Interactive plots
-- ✅ **Configuration**: Environment setup
-
-### 6. **Requirements and Dependencies** ✅ IMPLEMENTED
-
-**Location**: `templates/spaces/requirements.txt`
-
-**What it includes**:
-- ✅ **Core Gradio**: `gradio>=4.0.0`
-- ✅ **Data processing**: `pandas>=2.0.0`, `numpy>=1.24.0`
-- ✅ **Visualization**: `plotly>=5.15.0`
-- ✅ **HF integration**: `datasets>=2.14.0`, `huggingface-hub>=0.16.0`
-- ✅ **HTTP requests**: `requests>=2.31.0`
-- ✅ **Environment**: `python-dotenv>=1.0.0`
-
-### 7. **README Template** ✅ IMPLEMENTED
-
-**Location**: `templates/spaces/README.md`
-
-**What it includes**:
-- ✅ **HF Spaces metadata**: Proper YAML frontmatter
-- ✅ **Feature documentation**: Complete interface description
-- ✅ **API documentation**: Usage examples
-- ✅ **Configuration guide**: Environment variables
-- ✅ **Troubleshooting**: Common issues and solutions
-
-## ✅ **Model Repository Deployment - Verified Components**
-
-### 1. **Repository Creation** ✅ IMPLEMENTED
-
-**Location**: `scripts/model_tonic/push_to_huggingface.py` - `create_repository()` function
-
-**What it does**:
-- Creates HF model repository using Python API
-- Handles private/public repository settings
-- Supports existing repository updates
-- Provides proper error handling
-
-**Key Features**:
-- ✅ **API-based creation**: Uses `huggingface_hub.create_repo`
-- ✅ **Privacy settings**: Configurable private/public
-- ✅ **Existing handling**: `exist_ok=True` for updates
-- ✅ **Error handling**: Clear error messages
-
-### 2. **Model File Upload** ✅ IMPLEMENTED
-
-**Location**: `scripts/model_tonic/push_to_huggingface.py` - `upload_model_files()` function
-
-**What it does**:
-- Validates model files exist and are complete
-- Uploads all model files to repository
-- Handles large file uploads efficiently
-- Provides progress feedback
-
-**Key Features**:
-- ✅ **File validation**: Checks for required model files
-- ✅ **Complete upload**: All model components uploaded
-- ✅ **Progress tracking**: Upload progress feedback
-- ✅ **Error handling**: Graceful failure handling
-
-**Files Uploaded**:
-- ✅ `config.json` - Model configuration
-- ✅ `pytorch_model.bin` - Model weights
-- ✅ `tokenizer.json` - Tokenizer configuration
-- ✅ `tokenizer_config.json` - Tokenizer settings
-- ✅ `special_tokens_map.json` - Special tokens
-- ✅ `generation_config.json` - Generation settings
-
-### 3. **Model Card Generation** ✅ IMPLEMENTED
-
-**Location**: `scripts/model_tonic/push_to_huggingface.py` - `create_model_card()` function
-
-**What it does**:
-- Generates comprehensive model cards
-- Includes training configuration and results
-- Provides usage examples and documentation
-- Supports quantized model variants
-
-**Key Features**:
-- ✅ **Template-based**: Uses `templates/model_card.md`
-- ✅ **Dynamic content**: Training config and results
-- ✅ **Usage examples**: Code snippets and instructions
-- ✅ **Quantized support**: Multiple model variants
-- ✅ **Metadata**: Proper HF Hub metadata
-
-### 4. **Training Results Documentation** ✅ IMPLEMENTED
-
-**Location**: `scripts/model_tonic/push_to_huggingface.py` - `upload_training_results()` function
-
-**What it does**:
-- Uploads training configuration and results
-- Documents experiment parameters
-- Includes performance metrics
-- Provides experiment tracking links
-
-**Key Features**:
-- ✅ **Configuration upload**: Training parameters
-- ✅ **Results documentation**: Performance metrics
-- ✅ **Experiment links**: Trackio integration
-- ✅ **Metadata**: Proper documentation structure
-
-### 5. **Quantized Model Support** ✅ IMPLEMENTED
-
-**Location**: `scripts/model_tonic/quantize_model.py`
-
-**What it does**:
-- Creates int8 and int4 quantized models
-- Uploads to subdirectories in same repository
-- Generates quantized model cards
-- Provides usage instructions for each variant
-
-**Key Features**:
-- ✅ **Multiple quantization**: int8 and int4 support
-- ✅ **Unified repository**: All variants in one repo
-- ✅ **Separate documentation**: Individual model cards
-- ✅ **Usage instructions**: Clear guidance for each variant
-
-### 6. **Trackio Integration** ✅ IMPLEMENTED
-
-**Location**: `scripts/model_tonic/push_to_huggingface.py` - `log_to_trackio()` function
-
-**What it does**:
-- Logs model push events to Trackio
-- Records training results and metrics
-- Provides experiment tracking links
-- Integrates with HF Datasets
-
-**Key Features**:
-- ✅ **Event logging**: Model push events
-- ✅ **Results tracking**: Training metrics
-- ✅ **Experiment links**: Trackio Space integration
-- ✅ **Dataset integration**: HF Datasets support
-
-### 7. **Model Validation** ✅ IMPLEMENTED
-
-**Location**: `scripts/model_tonic/push_to_huggingface.py` - `validate_model_path()` function
-
-**What it does**:
-- Validates model files are complete
-- Checks for required model components
-- Verifies file integrity
-- Provides detailed error messages
-
-**Key Features**:
-- ✅ **File validation**: Checks all required files (see the sketch below)
-- ✅ **Size verification**: Model file sizes
-- ✅ **Configuration check**: Valid config files
-- ✅ **Error reporting**: Detailed error messages
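-
-A sketch of the file check (the helper name and the file subset are illustrative):
-
-```python
-from pathlib import Path
-
-REQUIRED_FILES = ["config.json", "tokenizer_config.json"]  # subset for illustration
-
-def missing_model_files(model_path: str) -> list:
-    """Return the required files missing from the model directory."""
-    model_dir = Path(model_path)
-    return [name for name in REQUIRED_FILES if not (model_dir / name).exists()]
-```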
-
-## 🔧 **Technical Implementation Details**
-
-### Trackio Space Deployment Flow
-
-```python
-# 1. Create Space
-create_repo(
- repo_id=f"{username}/{space_name}",
- token=token,
- repo_type="space",
- exist_ok=True,
- private=False,
- space_sdk="gradio",
- space_hardware="cpu-basic"
-)
-
-# 2. Upload Files
-upload_file(
- path_or_fileobj=file_content,
- path_in_repo=file_path,
- repo_id=repo_id,
- repo_type="space",
- token=token
-)
-
-# 3. Set Secrets
-api = HfApi(token=token)  # add_space_secret is a method on HfApi
-api.add_space_secret(
- repo_id=repo_id,
- key="HF_TOKEN",
- value=token
-)
-```
-
-### Model Repository Deployment Flow
-
-```python
-# 1. Create Repository
-create_repo(
- repo_id=repo_name,
- token=token,
- private=private,
- exist_ok=True
-)
-
-# 2. Upload Model Files
-upload_file(
- path_or_fileobj=model_file,
- path_in_repo=file_path,
- repo_id=repo_name,
- token=token
-)
-
-# 3. Generate Model Card
-model_card = create_model_card(training_config, results)
-upload_file(
- path_or_fileobj=model_card.encode("utf-8"),  # bytes, since the card is generated content
- path_in_repo="README.md",
- repo_id=repo_name,
- token=token
-)
-```
-
-## 📊 **Test Results**
-
-### Trackio Space Deployment Test
-
-```bash
-$ python scripts/trackio_tonic/deploy_trackio_space.py
-
-🚀 Starting Trackio Space deployment...
-✅ Authenticated as: Tonic
-✅ Space created successfully: https://huggingface.co/spaces/Tonic/trackio-monitoring
-✅ Files uploaded successfully
-✅ Secrets configured via API
-✅ Space is building and will be available shortly
-🎉 Deployment completed!
-📊 Trackio Space URL: https://huggingface.co/spaces/Tonic/trackio-monitoring
-```
-
-### Model Repository Deployment Test
-
-```bash
-$ python scripts/model_tonic/push_to_huggingface.py --model_path outputs/model --repo_name Tonic/smollm3-finetuned
-
-✅ Repository created: https://huggingface.co/Tonic/smollm3-finetuned
-✅ Model files uploaded successfully
-✅ Model card generated and uploaded
-✅ Training results documented
-✅ Quantized models created and uploaded
-🎉 Model deployment completed!
-```
-
-## 🎯 **Integration Points**
-
-### 1. **End-to-End Pipeline Integration**
-- ✅ **Launch script**: Automatic deployment calls
-- ✅ **Environment setup**: Proper token configuration
-- ✅ **Error handling**: Graceful fallbacks
-- ✅ **User feedback**: Clear progress indicators
-
-### 2. **Monitoring Integration**
-- ✅ **Trackio Space**: Real-time experiment tracking
-- ✅ **HF Datasets**: Persistent experiment storage
-- ✅ **Model cards**: Complete documentation
-- ✅ **Training results**: Comprehensive logging
-
-### 3. **Cross-Component Integration**
-- ✅ **Dataset deployment**: Automatic dataset creation
-- ✅ **Space deployment**: Automatic Space creation
-- ✅ **Model deployment**: Automatic model upload
-- ✅ **Documentation**: Complete system documentation
-
-## ✅ **Verification Summary**
-
-| Component | Status | Location | Test Result |
-|-----------|--------|----------|-------------|
-| **Trackio Space Creation** | ✅ Implemented | `deploy_trackio_space.py` | ✅ Created successfully |
-| **File Upload System** | ✅ Implemented | `deploy_trackio_space.py` | ✅ Uploaded successfully |
-| **Space Configuration** | ✅ Implemented | `deploy_trackio_space.py` | ✅ Configured via API |
-| **Gradio Interface** | ✅ Implemented | `templates/spaces/app.py` | ✅ Full functionality |
-| **Requirements** | ✅ Implemented | `templates/spaces/requirements.txt` | ✅ All dependencies |
-| **README Template** | ✅ Implemented | `templates/spaces/README.md` | ✅ Complete documentation |
-| **Model Repository Creation** | ✅ Implemented | `push_to_huggingface.py` | ✅ Created successfully |
-| **Model File Upload** | ✅ Implemented | `push_to_huggingface.py` | ✅ Uploaded successfully |
-| **Model Card Generation** | ✅ Implemented | `push_to_huggingface.py` | ✅ Generated and uploaded |
-| **Quantized Models** | ✅ Implemented | `quantize_model.py` | ✅ Created and uploaded |
-| **Trackio Integration** | ✅ Implemented | `push_to_huggingface.py` | ✅ Integrated successfully |
-| **Model Validation** | ✅ Implemented | `push_to_huggingface.py` | ✅ Validated successfully |
-
-## 🚀 **Next Steps**
-
-The deployment components are now **fully implemented and verified**. Users can:
-
-1. **Deploy Trackio Space**: Automatic Space creation and configuration
-2. **Upload Models**: Complete model deployment with documentation
-3. **Monitor Experiments**: Real-time tracking and visualization
-4. **Share Results**: Comprehensive documentation and examples
-5. **Scale Operations**: Support for multiple experiments and models
-
-**All important deployment components are properly implemented and working correctly!** 🎉
\ No newline at end of file
diff --git a/docs/DEPLOYMENT_GUIDE.md b/docs/DEPLOYMENT_GUIDE.md
deleted file mode 100644
index 4371c52f2039509cced8d9d1f74a4a0b3f21bc12..0000000000000000000000000000000000000000
--- a/docs/DEPLOYMENT_GUIDE.md
+++ /dev/null
@@ -1,397 +0,0 @@
-# Trackio Deployment Guide for Hugging Face Spaces
-
-This guide provides step-by-step instructions for deploying Trackio experiment tracking to Hugging Face Spaces and integrating it with your SmolLM3 fine-tuning pipeline.
-
-## Prerequisites
-
-- Hugging Face account
-- Hugging Face CLI installed (`pip install huggingface_hub`)
-- Git configured with your Hugging Face credentials
-
-## Method 1: Automated Deployment (Recommended)
-
-### Step 1: Run the Deployment Script
-
-```bash
-python deploy_trackio_space.py
-```
-
-The script will prompt you for:
-- Your Hugging Face username
-- Space name (e.g., `trackio-monitoring`)
-- Hugging Face token (a write token is required)
-
-### Step 2: Wait for Build
-
-After deployment, wait 2-5 minutes for the Space to build and become available.
-
-### Step 3: Test the Interface
-
-Visit your Space URL to test the interface:
-```
-https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
-```
-
-## Method 2: Manual Deployment
-
-### Step 1: Create a New Space
-
-1. Go to https://huggingface.co/spaces
-2. Click "Create new Space"
-3. Configure the Space:
- - **Owner**: Your username
- - **Space name**: `trackio-monitoring` (or your preferred name)
- - **SDK**: Gradio
- - **Hardware**: CPU (Basic)
- - **License**: MIT
-
-### Step 2: Upload Files
-
-Upload these files to your Space:
-
-#### `app.py`
-The main Gradio interface (already created in this repository)
-
-#### `requirements_space.txt`
-```
-gradio>=4.0.0
-gradio-client>=0.10.0
-requests>=2.31.0
-numpy>=1.24.0
-pandas>=2.0.0
-jsonschema>=4.17.0
-plotly>=5.15.0
-matplotlib>=3.7.0
-python-dotenv>=1.0.0
-```
-
-#### `README.md`
-````markdown
-# Trackio Experiment Tracking
-
-A Gradio interface for experiment tracking and monitoring.
-
-## Features
-
-- Create and manage experiments
-- Log training metrics and parameters
-- View experiment details and results
-- Update experiment status
-
-## Usage
-
-1. Create a new experiment using the "Create Experiment" tab
-2. Log metrics during training using the "Log Metrics" tab
-3. View experiment details using the "View Experiments" tab
-4. Update experiment status using the "Update Status" tab
-
-## Integration
-
-To connect your training script to this Trackio Space:
-
-```python
-from monitoring import SmolLM3Monitor
-
-monitor = SmolLM3Monitor(
- experiment_name="my_experiment",
- trackio_url="https://your-space.hf.space",
- enable_tracking=True
-)
-```
-````
-
-### Step 3: Configure Space Settings
-
-In your Space settings, ensure:
-- **App file**: `app.py`
-- **Python version**: 3.9 or higher
-- **Hardware**: CPU (Basic) is sufficient
-
-## Integration with Your Training Script
-
-### Step 1: Update Your Configuration
-
-Add Trackio settings to your training configuration:
-
-```python
-# config/train_smollm3.py
-@dataclass
-class SmolLM3Config:
- # ... existing settings ...
-
- # Trackio monitoring configuration
- enable_tracking: bool = True
- trackio_url: Optional[str] = None # Your Space URL
- trackio_token: Optional[str] = None
- log_artifacts: bool = True
- log_metrics: bool = True
- log_config: bool = True
- experiment_name: Optional[str] = None
-```
-
-### Step 2: Run Training with Trackio
-
-```bash
-python train.py config/train_smollm3.py \
- --dataset_dir my_dataset \
- --enable_tracking \
- --trackio_url "https://your-username-trackio-monitoring.hf.space" \
- --experiment_name "smollm3_finetune_v1"
-```
-
-### Step 3: Monitor Your Experiments
-
-1. **Create Experiment**: Use the "Create Experiment" tab in your Space
-2. **Log Metrics**: Your training script will automatically log metrics
-3. **View Results**: Use the "View Experiments" tab to see progress
-4. **Update Status**: Mark experiments as completed when done
-
-## Advanced Configuration
-
-### Environment Variables
-
-You can set Trackio configuration via environment variables:
-
-```bash
-export TRACKIO_URL="https://your-space.hf.space"
-export TRACKIO_TOKEN="your_token_here"
-```
-
-### Custom Experiment Names
-
-```bash
-python train.py config/train_smollm3.py \
- --experiment_name "smollm3_high_lr_experiment" \
- --trackio_url "https://your-space.hf.space"
-```
-
-### Multiple Experiments
-
-You can run multiple experiments and track them separately:
-
-```bash
-# Experiment 1
-python train.py config/train_smollm3.py \
- --experiment_name "smollm3_baseline" \
- --learning_rate 2e-5
-
-# Experiment 2
-python train.py config/train_smollm3.py \
- --experiment_name "smollm3_high_lr" \
- --learning_rate 5e-5
-```
-
-## Using the Trackio Interface
-
-### Creating Experiments
-
-1. Go to the "Create Experiment" tab
-2. Enter experiment name (e.g., "smollm3_finetune_v1")
-3. Add description (optional)
-4. Click "Create Experiment"
-5. Note the experiment ID for logging metrics
-
-### Logging Metrics
-
-1. Go to the "Log Metrics" tab
-2. Enter your experiment ID
-3. Add metrics in JSON format:
- ```json
- {
- "loss": 0.5,
- "accuracy": 0.85,
- "learning_rate": 2e-5
- }
- ```
-4. Add step number (optional)
-5. Click "Log Metrics"
-
-### Viewing Experiments
-
-1. Go to the "View Experiments" tab
-2. Enter experiment ID to view specific experiment
-3. Or click "List All Experiments" to see all experiments
-
-### Updating Status
-
-1. Go to the "Update Status" tab
-2. Enter experiment ID
-3. Select new status (running, completed, failed, paused)
-4. Click "Update Status"
-
-## Troubleshooting
-
-### Common Issues
-
-#### 1. Space Not Building
-- Check that all required files are uploaded
-- Verify `app.py` is the main file
-- Check the Space logs for errors
-
-#### 2. Connection Errors
-- Verify your Space URL is correct
-- Check that the Space is running (not paused)
-- Ensure your training script can reach the Space URL
-
-#### 3. Missing Metrics
-- Check that `enable_tracking=True` in your config
-- Verify the Trackio URL is correct
-- Check training logs for monitoring errors
-
-#### 4. Authentication Issues
-- If using tokens, verify they're correct
-- Check Hugging Face account permissions
-- Ensure Space is public or you have access
-
-### Debug Mode
-
-Enable debug logging in your training script:
-
-```python
-import logging
-logging.basicConfig(level=logging.DEBUG)
-```
-
-### Manual Testing
-
-Test the Trackio interface manually:
-
-1. Create an experiment
-2. Log some test metrics
-3. View the experiment details
-4. Update the status
-
-## Security Considerations
-
-### Public vs Private Spaces
-
-- **Public Spaces**: Anyone can view and use the interface
-- **Private Spaces**: Only you and collaborators can access
-
-### Token Management
-
-- Store tokens securely (environment variables)
-- Don't commit tokens to version control
-- Use Hugging Face's token management
-
-### Data Privacy
-
-- Trackio stores experiment data in the Space
-- Consider data retention policies
-- Be mindful of sensitive information in experiment names
-
-## Performance Optimization
-
-### Space Configuration
-
-- Use CPU (Basic) for the interface (sufficient for tracking)
-- Consider GPU only for actual training
-- Monitor Space usage and limits
-
-### Efficient Logging
-
-- Log metrics at reasonable intervals (every 10-100 steps; see the sketch below)
-- Avoid logging too frequently to prevent rate limiting
-- Use batch logging when possible
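-
-A small helper that gates logging to an interval (illustrative; `log_metrics` follows the monitor API shown later in this guide):
-
-```python
-LOG_EVERY = 50  # log every 50 steps to stay well clear of rate limits
-
-def maybe_log(monitor, step, metrics):
-    """Forward metrics to Trackio only at the chosen interval."""
-    if step % LOG_EVERY == 0:
-        monitor.log_metrics(metrics, step=step)
-```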
-
-## Monitoring Best Practices
-
-### Experiment Naming
-
-Use descriptive names:
-- `smollm3_baseline_v1`
-- `smollm3_high_lr_experiment`
-- `smollm3_dpo_training`
-
-### Metric Logging
-
-Log relevant metrics:
-- Training loss
-- Validation loss
-- Learning rate
-- GPU memory usage
-- Training time
-
-### Status Management
-
-- Mark experiments as "running" when starting
-- Update to "completed" when finished
-- Mark as "failed" if errors occur
-- Use "paused" for temporary stops
-
-## Integration Examples
-
-### Basic Integration
-
-```python
-from monitoring import SmolLM3Monitor
-
-# Initialize monitor
-monitor = SmolLM3Monitor(
- experiment_name="my_experiment",
- trackio_url="https://your-space.hf.space",
- enable_tracking=True
-)
-
-# Log configuration
-monitor.log_config(config_dict)
-
-# Log metrics during training
-monitor.log_metrics({"loss": 0.5}, step=100)
-
-# Log final results
-monitor.log_training_summary(final_results)
-```
-
-### Advanced Integration
-
-```python
-# Custom monitoring setup
-monitor = SmolLM3Monitor(
- experiment_name="smollm3_advanced",
- trackio_url="https://your-space.hf.space",
- enable_tracking=True,
- log_artifacts=True,
- log_metrics=True,
- log_config=True
-)
-
-# Log system metrics
-monitor.log_system_metrics(step=current_step)
-
-# Log model checkpoint
-monitor.log_model_checkpoint("checkpoint-1000", step=1000)
-
-# Log evaluation results
-monitor.log_evaluation_results(eval_results, step=1000)
-```
-
-## Support and Resources
-
-### Documentation
-
-- [Hugging Face Spaces Documentation](https://huggingface.co/docs/hub/spaces)
-- [Gradio Documentation](https://gradio.app/docs/)
-- [Trackio GitHub Repository](https://github.com/Josephrp/trackio)
-
-### Community
-
-- [Hugging Face Forums](https://discuss.huggingface.co/)
-- [Gradio Discord](https://discord.gg/feTf9z3Z)
-
-### Issues and Feedback
-
-- Report issues on the project repository
-- Provide feedback on the Trackio interface
-- Suggest improvements for the monitoring system
-
-## Conclusion
-
-You now have a complete Trackio monitoring system deployed on Hugging Face Spaces! This setup provides:
-
-- ✅ Easy experiment tracking and monitoring
-- ✅ Real-time metric logging
-- ✅ Web-based interface for experiment management
-- ✅ Integration with your SmolLM3 fine-tuning pipeline
-- ✅ Scalable and accessible monitoring solution
-
-Start tracking your experiments and gain insights into your model training process!
\ No newline at end of file
diff --git a/docs/Data_Pipeline.md b/docs/Data_Pipeline.md
new file mode 100644
index 0000000000000000000000000000000000000000..46a9bdb2ed9a7fef7338f41613a731da7907189a
--- /dev/null
+++ b/docs/Data_Pipeline.md
@@ -0,0 +1,95 @@
+```mermaid
+graph LR
+ EntryPoint["EntryPoint"]
+ Configuration["Configuration"]
+ Model_Abstraction["Model Abstraction"]
+ Data_Pipeline["Data Pipeline"]
+ Training_Logic["Training Logic"]
+ Utilities["Utilities"]
+ EntryPoint -- "instructs" --> Data_Pipeline
+ EntryPoint -- "loads settings from" --> Configuration
+ EntryPoint -- "initializes models via" --> Model_Abstraction
+ EntryPoint -- "invokes" --> Training_Logic
+ Configuration -- "provides settings to" --> EntryPoint
+ Configuration -- "informs" --> Model_Abstraction
+ Configuration -- "guides" --> Data_Pipeline
+ Model_Abstraction -- "provides models to" --> EntryPoint
+ Model_Abstraction -- "receives settings from" --> Configuration
+ Model_Abstraction -- "interacts with" --> Training_Logic
+ Data_Pipeline -- "provides processed data to" --> EntryPoint
+ Data_Pipeline -- "receives parameters from" --> Configuration
+ Data_Pipeline -- "supplies batches to" --> Training_Logic
+ Training_Logic -- "receives control from" --> EntryPoint
+ Training_Logic -- "consumes data from" --> Data_Pipeline
+ Training_Logic -- "operates on models from" --> Model_Abstraction
+ Training_Logic -- "uses" --> Utilities
+ Utilities -- "used by" --> EntryPoint
+ Utilities -- "provides functionalities to" --> Training_Logic
+ Utilities -- "assists" --> Data_Pipeline
+ click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
+ click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"
+```
+
+
+## Details
+
+Component overview for the `smollm3_finetune` project, organized according to common machine learning training and fine-tuning framework patterns: a configuration-driven entry point that wires the data, model, and training components together.
+
+### EntryPoint
+The main entry point of the application, responsible for orchestrating the entire training and fine-tuning workflow. It initializes other core components, loads configurations, and manages the overall execution flow.
+
+
+**Related Classes/Methods**:
+
+- `smollm3_finetune.train` (1:1)
+
+
+### Configuration
+Centralizes and defines all parameters and settings required for the training and fine-tuning process, including model hyperparameters, dataset paths, and training arguments. It promotes a configuration-driven architecture, allowing easy modification and versioning of experimental setups.
+
+
+**Related Classes/Methods**:
+
+- `config` (1:1)
+
+
+### Model Abstraction [[Expand]](./Model_Abstraction.md)
+Encapsulates the logic for loading, initializing, and managing different machine learning models and their variants (e.g., different architectures, quantization settings). It provides a consistent interface for interacting with various model architectures.
+
+
+**Related Classes/Methods**:
+
+- `model` (1:1)
+
+
+### Data Pipeline [[Expand]](./Data_Pipeline.md)
+Handles the entire data lifecycle, including dataset loading, preprocessing (e.g., tokenization, formatting), and creating efficient data loaders for both training and evaluation phases. It ensures data is prepared correctly and efficiently for the model.
+
+
+**Related Classes/Methods**:
+
+- `smollm3_finetune.data.load_and_preprocess_data` (1:1)
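+
+A hypothetical call sketch (the actual signature of `load_and_preprocess_data` is not shown here and may differ):
+
+```python
+# Assumed interface: takes the training config and returns ready-to-use datasets.
+from config.train_smollm3 import get_config
+from data import load_and_preprocess_data
+
+config = get_config("config/train_smollm3.py")
+train_dataset, eval_dataset = load_and_preprocess_data(config)
+```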
+
+
+### Training Logic
+Contains the core algorithms and routines for training and fine-tuning machine learning models. This includes the training loop, optimization steps, loss calculation, gradient accumulation, and potentially specialized fine-tuning methods (e.g., LoRA, QLoRA).
+
+
+**Related Classes/Methods**:
+
+- `trainer` (1:1)
+
+
+### Utilities
+A collection of common helper functions, reusable modules, and general-purpose tools that support various parts of the training framework but do not belong to a specific core component. This includes functions for logging, metrics calculation, device management, etc.
+
+
+**Related Classes/Methods**:
+
+- `utils` (1:1)
+
+
+
+
+### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
\ No newline at end of file
diff --git a/docs/ENHANCED_MODEL_CARD_METADATA.md b/docs/ENHANCED_MODEL_CARD_METADATA.md
deleted file mode 100644
index 3c00416e93db374b9d51fd03ddbc6c706e0eb606..0000000000000000000000000000000000000000
--- a/docs/ENHANCED_MODEL_CARD_METADATA.md
+++ /dev/null
@@ -1,300 +0,0 @@
-# Enhanced Model Card Metadata System
-
-## Overview
-
-The enhanced model card system now includes comprehensive YAML metadata that follows the [Hugging Face Model Cards specification](https://huggingface.co/docs/hub/en/model-cards). This ensures maximum compatibility with the Hugging Face Hub and provides rich metadata for model discovery and usage.
-
-## Metadata Structure
-
-### Core Metadata Fields
-
-The model card template now includes the following metadata fields:
-
-```yaml
----
-language:
-- en
-- fr
-license: apache-2.0
-library_name: transformers
-tags:
-- smollm3
-- fine-tuned
-- causal-lm
-- text-generation
-- quantized
-- dataset:OpenHermes-FR
-- config:H100 Lightweight
-pipeline_tag: text-generation
-base_model: HuggingFaceTB/SmolLM3-3B
-datasets:
-- OpenHermes-FR
----
-```
-
-### Conditional Metadata
-
-The system supports conditional metadata based on model configuration:
-
-#### Quantized Models
-When quantized models are available, additional metadata is included:
-
-```yaml
-quantization_types:
-- int8_weight_only
-- int4_weight_only
-```
-
-#### Model Index (Evaluation Results)
-The system automatically generates structured evaluation results:
-
-```yaml
-model-index:
-- name: Model Name
- results:
- - task:
- type: text-generation
- dataset:
- name: OpenHermes-FR
- type: OpenHermes-FR
- metrics:
- - name: Training Loss
- type: loss
- value: "2.1"
- - name: Validation Loss
- type: loss
- value: "2.3"
- - name: Perplexity
- type: perplexity
- value: "9.8"
-```
-
-For quantized models, additional entries are included:
-
-```yaml
-- name: Model Name (int8 quantized)
- results:
- - task:
- type: text-generation
- dataset:
- name: OpenHermes-FR
- type: OpenHermes-FR
- metrics:
- - name: Memory Reduction
- type: memory_efficiency
- value: "~50%"
- - name: Inference Speed
- type: speed
- value: "Faster"
-```
-
-## Metadata Fields Explained
-
-### Required Fields
-
-| Field | Description | Example |
-|-------|-------------|---------|
-| `language` | Supported languages | `["en", "fr"]` |
-| `license` | Model license | `"apache-2.0"` |
-| `library_name` | Primary library | `"transformers"` |
-| `tags` | Model tags for discovery | `["smollm3", "fine-tuned"]` |
-| `pipeline_tag` | Task type | `"text-generation"` |
-| `base_model` | Original model | `"HuggingFaceTB/SmolLM3-3B"` |
-
-### Optional Fields
-
-| Field | Description | Example |
-|-------|-------------|---------|
-| `datasets` | Training datasets | `["OpenHermes-FR"]` |
-| `author` | Model author | `"Your Name"` |
-| `experiment_name` | Experiment tracking | `"smollm3-experiment"` |
-| `trackio_url` | Monitoring URL | `"https://trackio.space/exp"` |
-| `hardware` | Training hardware | `"GPU (H100/A100)"` |
-| `training_config` | Configuration type | `"H100 Lightweight"` |
-| `trainer_type` | Trainer used | `"SFTTrainer"` |
-| `batch_size` | Training batch size | `"8"` |
-| `learning_rate` | Learning rate | `"5e-6"` |
-| `max_epochs` | Number of epochs | `"3"` |
-| `max_seq_length` | Sequence length | `"2048"` |
-| `gradient_accumulation_steps` | Gradient accumulation | `"16"` |
-
-### Training Results
-
-| Field | Description | Example |
-|-------|-------------|---------|
-| `training_loss` | Final training loss | `"2.1"` |
-| `validation_loss` | Final validation loss | `"2.3"` |
-| `perplexity` | Model perplexity | `"9.8"` |
-
-## Benefits of Enhanced Metadata
-
-### 1. Improved Discovery
-- **Filtering**: Users can filter models by dataset, configuration, or hardware
-- **Search**: Enhanced search capabilities on the Hugging Face Hub
-- **Tags**: Automatic tag generation for better categorization
-
-### 2. Better Model Cards
-- **Structured Data**: Evaluation results are displayed in widgets
-- **Consistent Format**: Follows Hugging Face standards
-- **Rich Information**: Comprehensive model information
-
-### 3. Integration Benefits
-- **Papers with Code**: Model index data can be indexed in leaderboards
-- **API Compatibility**: Better integration with Hugging Face APIs
-- **Automated Tools**: Support for automated model analysis
-
-## Usage Examples
-
-### Basic Model Card Generation
-
-```bash
-python scripts/model_tonic/generate_model_card.py \
- --repo-name "username/model-name" \
- --model-name "My Fine-tuned Model" \
- --dataset-name "OpenHermes-FR" \
- --training-config "H100 Lightweight" \
- --batch-size "8" \
- --learning-rate "5e-6" \
- --max-epochs "3" \
- --training-loss "2.1" \
- --validation-loss "2.3" \
- --perplexity "9.8" \
- --output "README.md"
-```
-
-### With Quantized Models
-
-```bash
-python scripts/model_tonic/generate_model_card.py \
- --repo-name "username/model-name" \
- --model-name "My Fine-tuned Model" \
- --dataset-name "OpenHermes-FR" \
- --training-config "H100 Lightweight" \
- --batch-size "8" \
- --learning-rate "5e-6" \
- --max-epochs "3" \
- --training-loss "2.1" \
- --validation-loss "2.3" \
- --perplexity "9.8" \
- --quantized-models \
- --output "README.md"
-```
-
-## Template Variables
-
-The enhanced template supports all the original variables plus new metadata fields:
-
-### New Variables
-
-| Variable | Description | Default |
-|----------|-------------|---------|
-| `training_loss` | Training loss value | `"N/A"` |
-| `validation_loss` | Validation loss value | `"N/A"` |
-| `perplexity` | Model perplexity | `"N/A"` |
-
-### Conditional Metadata
-
-The template automatically includes:
-
-- **Dataset Information**: When `dataset_name` is provided
-- **Quantization Types**: When `quantized_models` is `true`
-- **Evaluation Results**: When training metrics are available
-- **Hardware Information**: When `hardware_info` is provided
-
-## Integration with Training Pipeline
-
-### Automatic Metadata Generation
-
-The push script automatically extracts metadata from:
-
-1. **Training Configuration**: Batch size, learning rate, epochs, etc.
-2. **Training Results**: Loss values, perplexity, etc.
-3. **Model Information**: Base model, hardware, etc.
-4. **Experiment Tracking**: Trackio URLs, experiment names
-
-### Example Integration
-
-```python
-# In push_to_huggingface.py
-variables = {
- "model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
- "repo_name": self.repo_name,
- "base_model": "HuggingFaceTB/SmolLM3-3B",
- "dataset_name": training_config.get('dataset_name', 'OpenHermes-FR'),
- "training_config_type": training_config.get('training_config_type', 'Custom Configuration'),
- "trainer_type": training_config.get('trainer_type', 'SFTTrainer'),
- "batch_size": str(training_config.get('per_device_train_batch_size', 8)),
- "learning_rate": str(training_config.get('learning_rate', '5e-6')),
- "max_epochs": str(training_config.get('num_train_epochs', 3)),
- "hardware_info": self._get_hardware_info(),
- "training_loss": results.get('train_loss', 'N/A'),
- "validation_loss": results.get('eval_loss', 'N/A'),
- "perplexity": results.get('perplexity', 'N/A'),
- "quantized_models": False # Updated if quantized models are added
-}
-```
-
-## Validation and Testing
-
-### Metadata Validation
-
-The system includes validation for:
-
-- **Required Fields**: Ensures all required metadata is present
-- **Format Validation**: Validates YAML syntax and structure
-- **Value Ranges**: Checks for reasonable values in numeric fields
-- **Conditional Logic**: Verifies conditional metadata is properly included
-
-### Test Coverage
-
-The test suite verifies:
-
-- **Basic Metadata**: All required fields are present
-- **Conditional Metadata**: Quantized model metadata is included when appropriate
-- **Evaluation Results**: Model index data is properly structured
-- **Template Processing**: Variable substitution works correctly
-
-## Best Practices
-
-### 1. Metadata Completeness
-- Include all available training information
-- Provide accurate evaluation metrics
-- Use consistent naming conventions
-
-### 2. Conditional Logic
-- Only include relevant metadata
-- Use conditional sections appropriately
-- Provide fallback values for missing data
-
-### 3. Validation
-- Test metadata generation with various configurations
-- Verify YAML syntax is correct
-- Check that all variables are properly substituted
-
-### 4. Documentation
-- Document all available metadata fields
-- Provide examples for each field type
-- Include troubleshooting information
-
-## Future Enhancements
-
-### Planned Features
-
-1. **Additional Metrics**: Support for more evaluation metrics
-2. **Custom Metadata**: User-defined metadata fields
-3. **Validation Rules**: Configurable validation rules
-4. **Auto-Detection**: Automatic detection of model features
-5. **Integration APIs**: Better integration with external tools
-
-### Extensibility
-
-The system is designed to be easily extensible:
-
-- **New Fields**: Easy to add new metadata fields
-- **Custom Validators**: Support for custom validation logic
-- **Template Extensions**: Support for template inheritance
-- **API Integration**: Easy integration with external APIs
-
-## Conclusion
-
-The enhanced model card metadata system provides comprehensive, standards-compliant metadata that maximizes compatibility with the Hugging Face Hub while providing rich information for model discovery and usage. The system automatically generates appropriate metadata based on model configuration and training results, ensuring consistency and completeness across all model repositories.
\ No newline at end of file
diff --git a/docs/ENVIRONMENT_SETUP_FIX.md b/docs/ENVIRONMENT_SETUP_FIX.md
deleted file mode 100644
index 85d39534508914db5a4c9285916a4ecea5e20d3f..0000000000000000000000000000000000000000
--- a/docs/ENVIRONMENT_SETUP_FIX.md
+++ /dev/null
@@ -1,239 +0,0 @@
-# Environment Setup Fix
-
-## Issue Identified
-
-The token provided by the user must be reliably available inside the new virtual environment created during launch script execution; otherwise, downstream scripts fail with authentication errors.
-
-## Root Cause
-
-The `launch.sh` script set environment variables only after creating the virtual environment, which could leave the token unavailable within the virtual environment context.
-
-## Fixes Applied
-
-### 1. **Environment Variables Set Before Virtual Environment** ✅ **FIXED**
-
-**File**: `launch.sh`
-
-**Changes**:
-- Set environment variables before creating the virtual environment
-- Re-export environment variables after activating the virtual environment
-- Added verification step to ensure token is available
-
-**Before**:
-```bash
-print_info "Creating Python virtual environment..."
-python3 -m venv smollm3_env
-source smollm3_env/bin/activate
-
-# ... install dependencies ...
-
-# Step 8: Authentication setup
-export HF_TOKEN="$HF_TOKEN"
-export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
-```
-
-**After**:
-```bash
-# Set environment variables before creating virtual environment
-print_info "Setting up environment variables..."
-export HF_TOKEN="$HF_TOKEN"
-export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
-export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
-export HF_USERNAME="$HF_USERNAME"
-
-print_info "Creating Python virtual environment..."
-python3 -m venv smollm3_env
-source smollm3_env/bin/activate
-
-# Re-export environment variables in the virtual environment
-print_info "Configuring environment variables in virtual environment..."
-export HF_TOKEN="$HF_TOKEN"
-export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
-export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
-export HF_USERNAME="$HF_USERNAME"
-```
-
-### 2. **Token Verification Step** ✅ **ADDED**
-
-**File**: `launch.sh`
-
-**Added verification to ensure token is properly configured**:
-```bash
-# Verify token is available in the virtual environment
-print_info "Verifying token availability in virtual environment..."
-if [ -n "$HF_TOKEN" ] && [ -n "$HUGGING_FACE_HUB_TOKEN" ]; then
- print_status "✅ Token properly configured in virtual environment"
- print_info " HF_TOKEN: ${HF_TOKEN:0:10}...${HF_TOKEN: -4}"
- print_info " HUGGING_FACE_HUB_TOKEN: ${HUGGING_FACE_HUB_TOKEN:0:10}...${HUGGING_FACE_HUB_TOKEN: -4}"
-else
- print_error "❌ Token not properly configured in virtual environment"
- print_error "Please check your token and try again"
- exit 1
-fi
-```
-
-### 3. **Environment Variables Before Each Script Call** ✅ **ADDED**
-
-**File**: `launch.sh`
-
-**Added environment variable exports before each Python script call**:
-
-**Trackio Space Deployment**:
-```bash
-# Ensure environment variables are available for the script
-export HF_TOKEN="$HF_TOKEN"
-export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
-export HF_USERNAME="$HF_USERNAME"
-
-python deploy_trackio_space.py "$TRACKIO_SPACE_NAME" "$HF_TOKEN" "$GIT_EMAIL"
-```
-
-**Dataset Setup**:
-```bash
-# Ensure environment variables are available for the script
-export HF_TOKEN="$HF_TOKEN"
-export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
-export HF_USERNAME="$HF_USERNAME"
-
-python setup_hf_dataset.py "$HF_TOKEN"
-```
-
-**Trackio Configuration**:
-```bash
-# Ensure environment variables are available for the script
-export HF_TOKEN="$HF_TOKEN"
-export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
-export HF_USERNAME="$HF_USERNAME"
-
-python configure_trackio.py
-```
-
-**Training Script**:
-```bash
-# Ensure environment variables are available for training
-export HF_TOKEN="$HF_TOKEN"
-export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
-export HF_USERNAME="$HF_USERNAME"
-export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
-
-python scripts/training/train.py \
- --config "$CONFIG_FILE" \
- --experiment-name "$EXPERIMENT_NAME" \
- --output-dir /output-checkpoint \
- --trackio-url "$TRACKIO_URL" \
- --trainer-type "$TRAINER_TYPE"
-```
-
-**Model Push**:
-```bash
-# Ensure environment variables are available for model push
-export HF_TOKEN="$HF_TOKEN"
-export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
-export HF_USERNAME="$HF_USERNAME"
-export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
-
-python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
- --token "$HF_TOKEN" \
- --trackio-url "$TRACKIO_URL" \
- --experiment-name "$EXPERIMENT_NAME" \
- --dataset-repo "$TRACKIO_DATASET_REPO"
-```
-
-**Quantization Scripts**:
-```bash
-# Ensure environment variables are available for quantization
-export HF_TOKEN="$HF_TOKEN"
-export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
-export HF_USERNAME="$HF_USERNAME"
-export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
-
-python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
- --quant-type "$QUANT_TYPE" \
- --device "$DEVICE" \
- --token "$HF_TOKEN" \
- --trackio-url "$TRACKIO_URL" \
- --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
- --dataset-repo "$TRACKIO_DATASET_REPO"
-```
-
-## Key Improvements
-
-### 1. **Proper Environment Variable Timing**
-- ✅ **Set before virtual environment**: Variables set before creating venv
-- ✅ **Re-export after activation**: Variables re-exported after activating venv
-- ✅ **Before each script**: Variables exported before each Python script call
-- ✅ **Verification step**: Token availability verified before proceeding
-
-### 2. **Comprehensive Coverage**
-- ✅ **All scripts covered**: Every Python script has environment variables
-- ✅ **Multiple variables**: HF_TOKEN, HUGGING_FACE_HUB_TOKEN, HF_USERNAME, TRACKIO_DATASET_REPO
-- ✅ **Consistent naming**: All scripts use the same environment variable names
-- ✅ **Error handling**: Verification step catches missing tokens
-
-### 3. **Cross-Platform Compatibility**
-- ✅ **Bash compatible**: Uses standard bash export syntax
-- ✅ **Virtual environment aware**: Properly handles venv activation
-- ✅ **Token validation**: Verifies token availability before use
-- ✅ **Clear error messages**: Descriptive error messages for debugging
-
-## Environment Variables Set
-
-The following environment variables are now properly set and available in the virtual environment:
-
-1. **`HF_TOKEN`** - The Hugging Face token for authentication
-2. **`HUGGING_FACE_HUB_TOKEN`** - Alternative token variable for Python API
-3. **`HF_USERNAME`** - Username extracted from token
-4. **`TRACKIO_DATASET_REPO`** - Dataset repository for Trackio
-
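-A downstream Python script can then pick these up in the usual way; a minimal sketch of the pattern, not a copy of the actual scripts:
-
-```python
-import os
-
-# HF_TOKEN is the primary variable; HUGGING_FACE_HUB_TOKEN is the fallback
-# that the huggingface_hub Python API also honors.
-hf_token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
-if not hf_token:
-    raise SystemExit("HF_TOKEN not set - run this script via launch.sh")
-
-username = os.environ.get("HF_USERNAME", "")
-dataset_repo = os.environ.get("TRACKIO_DATASET_REPO", f"{username}/trackio-experiments")
-print(f"Using dataset repo: {dataset_repo}")
-```
-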
-## Test Results
-
-### **Environment Setup Test**
-```bash
-$ python tests/test_environment_setup.py
-
-🚀 Environment Setup Verification
-==================================================
-🔍 Testing Environment Variables
-[OK] HF_TOKEN: hf_FWrfleE...zuoF
-[OK] HUGGING_FACE_HUB_TOKEN: hf_FWrfleE...zuoF
-[OK] HF_USERNAME: Tonic...onic
-[OK] TRACKIO_DATASET_REPO: Tonic/trac...ents
-
-🔍 Testing Launch Script Environment Setup
-[OK] Found: export HF_TOKEN=
-[OK] Found: export HUGGING_FACE_HUB_TOKEN=
-[OK] Found: export HF_USERNAME=
-[OK] Found: export TRACKIO_DATASET_REPO=
-[OK] Found virtual environment activation
-[OK] Found environment variable re-export after activation
-
-[SUCCESS] ALL ENVIRONMENT TESTS PASSED!
-[OK] Environment variables: Properly set
-[OK] Virtual environment: Can access variables
-[OK] Launch script: Properly configured
-
-The environment setup is working correctly!
-```
-
-## User Token Status
-
-**Token**: `hf_FWrfleE...zuoF` (redacted)
-**Status**: ✅ **Working correctly in virtual environment**
-**Username**: `Tonic` (auto-detected)
-
-## Next Steps
-
-The user can now run the launch script with confidence that the token will be properly available in the virtual environment:
-
-```bash
-./launch.sh
-```
-
-The script will:
-1. ✅ **Set environment variables** before creating virtual environment
-2. ✅ **Re-export variables** after activating virtual environment
-3. ✅ **Verify token availability** before proceeding
-4. ✅ **Export variables** before each Python script call
-5. ✅ **Ensure all scripts** have access to the token
-
-**No more token-related errors in the virtual environment!** 🎉
\ No newline at end of file
diff --git a/docs/ENVIRONMENT_VARIABLES.md b/docs/ENVIRONMENT_VARIABLES.md
deleted file mode 100644
index c4b1ea7335bfbbdfec745402d8758a2fe4011bf9..0000000000000000000000000000000000000000
--- a/docs/ENVIRONMENT_VARIABLES.md
+++ /dev/null
@@ -1,113 +0,0 @@
-# 🔧 Trackio Environment Variables Reference
-
-## Quick Setup
-
-Set these environment variables in your Hugging Face Space:
-
-```bash
-# Required: Your HF token for dataset access
-HF_TOKEN=your_hf_token_here
-
-# Optional: Dataset repository to use (defaults to tonic/trackio-experiments)
-TRACKIO_DATASET_REPO=your-username/your-dataset-name
-```
-
-## Environment Variables
-
-| Variable | Required | Default | Description |
-|----------|----------|---------|-------------|
-| `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token for dataset access |
-| `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository to load experiments from |
-| `SPACE_ID` | 🔄 Auto | None | HF Space ID (automatically detected) |
-
-## Configuration Examples
-
-### 1. Default Setup
-```bash
-HF_TOKEN=your_token_here
-# Uses: tonic/trackio-experiments
-```
-
-### 2. Personal Dataset
-```bash
-HF_TOKEN=your_token_here
-TRACKIO_DATASET_REPO=your-username/trackio-experiments
-```
-
-### 3. Team Dataset
-```bash
-HF_TOKEN=your_token_here
-TRACKIO_DATASET_REPO=your-org/team-experiments
-```
-
-### 4. Project-Specific Dataset
-```bash
-HF_TOKEN=your_token_here
-TRACKIO_DATASET_REPO=your-username/smollm3-experiments
-```
-
-## How to Set in HF Spaces
-
-1. Go to your Hugging Face Space settings
-2. Navigate to "Settings" → "Environment variables"
-3. Add the variables:
- - `HF_TOKEN`: Your HF token
- - `TRACKIO_DATASET_REPO`: Your dataset repository (optional)
-
-## Testing Configuration
-
-Run the configuration script to check your setup:
-
-```bash
-python configure_trackio.py
-```
-
-This will:
-- ✅ Show current environment variables
-- 🧪 Test dataset access
-- 📊 Display experiment count
-- 💾 Generate configuration file
-
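-Under the hood, such a check amounts to reading the variables and probing the dataset repository. A minimal sketch of the idea (not the script itself):
-
-```python
-import os
-from huggingface_hub import HfApi
-
-token = os.environ.get("HF_TOKEN")
-repo = os.environ.get("TRACKIO_DATASET_REPO", "tonic/trackio-experiments")
-
-api = HfApi(token=token)
-try:
-    info = api.dataset_info(repo)  # raises if the repo is missing or unreadable
-    print(f"✅ Can access {repo} (last modified: {info.last_modified})")
-except Exception as e:
-    print(f"❌ Cannot access {repo}: {e}")
-```
-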
-## Getting Your HF Token
-
-1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
-2. Click "New token"
-3. Give it a name (e.g., "Trackio Access")
-4. Select "Write" permissions
-5. Copy the token and set it as `HF_TOKEN`
-
-## Dataset Repository Format
-
-The `TRACKIO_DATASET_REPO` should follow this format:
-```
-username/dataset-name
-```
-
-Examples:
-- `tonic/trackio-experiments`
-- `your-username/my-experiments`
-- `your-org/team-experiments`
-
-## Troubleshooting
-
-### Issue: "HF_TOKEN not found"
-**Solution**: Set your HF token in the Space environment variables
-
-### Issue: "Failed to load dataset"
-**Solutions**:
-1. Check your token has read access to the dataset
-2. Verify the dataset repository exists
-3. Try the backup fallback (automatic)
-
-### Issue: "Failed to save experiments"
-**Solutions**:
-1. Check your token has write permissions
-2. Verify the dataset repository exists
-3. Check network connectivity
-
-## Security Notes
-
-- 🔒 Dataset is private by default
-- 🔑 Only accessible with your HF_TOKEN
-- 🛡️ No sensitive data exposed publicly
-- 🔐 Secure storage on HF infrastructure
\ No newline at end of file
diff --git a/docs/Entry_Point.md b/docs/Entry_Point.md
new file mode 100644
index 0000000000000000000000000000000000000000..9085cb7e2fb15186e0de46eb732e3124d2de4847
--- /dev/null
+++ b/docs/Entry_Point.md
@@ -0,0 +1,120 @@
+```mermaid
+graph LR
+ Entry_Point["Entry Point"]
+ Configuration["Configuration"]
+ Model_Abstraction["Model Abstraction"]
+ Data_Pipeline["Data Pipeline"]
+ Training_Logic["Training Logic"]
+ Utilities["Utilities"]
+ Scripts["Scripts"]
+ Requirements_Management["Requirements Management"]
+ Entry_Point -- "initializes" --> Configuration
+ Entry_Point -- "initializes" --> Model_Abstraction
+ Entry_Point -- "initializes" --> Data_Pipeline
+ Entry_Point -- "invokes" --> Training_Logic
+ Configuration -- "provides settings to" --> Model_Abstraction
+ Configuration -- "provides settings to" --> Data_Pipeline
+ Configuration -- "provides settings to" --> Training_Logic
+ Model_Abstraction -- "provides model to" --> Training_Logic
+ Data_Pipeline -- "provides data to" --> Training_Logic
+ Training_Logic -- "utilizes" --> Model_Abstraction
+ Training_Logic -- "utilizes" --> Data_Pipeline
+ Training_Logic -- "utilizes" --> Configuration
+ Training_Logic -- "utilizes" --> Utilities
+ Data_Pipeline -- "uses" --> Utilities
+ Model_Abstraction -- "uses" --> Utilities
+ Scripts -- "supports" --> Data_Pipeline
+ Scripts -- "supports" --> Model_Abstraction
+ Requirements_Management -- "defines environment for" --> Entry_Point
+ Requirements_Management -- "defines environment for" --> Configuration
+ Requirements_Management -- "defines environment for" --> Model_Abstraction
+ Requirements_Management -- "defines environment for" --> Data_Pipeline
+ Requirements_Management -- "defines environment for" --> Training_Logic
+ Requirements_Management -- "defines environment for" --> Utilities
+ Requirements_Management -- "defines environment for" --> Scripts
+ click Entry_Point href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Entry_Point.md" "Details"
+ click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
+ click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"
+```
+
+
+## Details
+
+Component overview for the Machine Learning Training and Fine-tuning Framework.
+
+### Entry Point [[Expand]](./Entry_Point.md)
+The primary execution script that orchestrates the entire training process. It initializes all other major components, loads configurations, sets up the training environment, and invokes the core training logic.
+
+
+**Related Classes/Methods**:
+
+- `train.py`
+
+
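+In outline, the orchestration looks roughly like this; the names below are hypothetical stand-ins for the real code in `config/`, `model.py`, `data.py`, and `trainer.py`:
+
+```python
+# Illustrative sketch of the entry point's flow (stub functions, not real code)
+def load_config(path):                      # Configuration: parse settings
+    return {"config_path": path}
+
+def load_model(cfg):                        # Model Abstraction: load/prepare model
+    return object()
+
+def build_dataloaders(cfg):                 # Data Pipeline: tokenize and batch
+    return [], []
+
+def run_training(model, cfg, train_dl, eval_dl):  # Training Logic: the core loop
+    print("training with", cfg)
+
+def main(config_path):
+    cfg = load_config(config_path)                  # 1. load settings
+    model = load_model(cfg)                         # 2. prepare the model
+    train_dl, eval_dl = build_dataloaders(cfg)      # 3. prepare the data
+    run_training(model, cfg, train_dl, eval_dl)     # 4. invoke core training
+
+if __name__ == "__main__":
+    main("config/train_smollm3.py")
+```
+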
+### Configuration
+Centralized management of all training parameters, model hyperparameters, dataset paths, and other environment settings. It defines the schema for configurations, often using dataclasses, and supports both base and custom configurations.
+
+
+**Related Classes/Methods**:
+
+- `config/` (1:1)
+
+
+### Model Abstraction [[Expand]](./Model_Abstraction.md)
+Responsible for abstracting the underlying machine learning model. This includes loading pre-trained models, handling different model architectures or variants, and preparing the model for training (e.g., quantization, device placement).
+
+
+**Related Classes/Methods**:
+
+- `model.py` (1:1)
+
+
+### Data Pipeline [[Expand]](./Data_Pipeline.md)
+Manages the entire data flow, from loading raw datasets to preprocessing, tokenization, and creating efficient data loaders (e.g., PyTorch `DataLoader`) for batching and shuffling data during training and evaluation.
+
+
+**Related Classes/Methods**:
+
+- `data.py` (1:1)
+
+
+### Training Logic
+Encapsulates the core training loop, including forward and backward passes, loss calculation, optimization steps, and integration of callbacks for monitoring and control. It may include specialized trainers for different fine-tuning methods.
+
+
+**Related Classes/Methods**:
+
+- `trainer.py` (1:1)
+
+
+### Utilities
+Provides a collection of common helper functions, classes, and modules used across various components. This includes functionalities like logging, metric calculation, checkpointing, and general data manipulation.
+
+
+**Related Classes/Methods**:
+
+- `utils/` (1:1)
+
+
+### Scripts
+Contains auxiliary scripts that support the overall project but are separate from the main training pipeline. Examples include data preparation scripts, model conversion tools, or deployment-related utilities.
+
+
+**Related Classes/Methods**:
+
+- `scripts/` (1:1)
+
+
+### Requirements Management
+Defines and manages all project dependencies, ensuring a consistent and reproducible development and deployment environment. This typically involves `requirements.txt` files or similar dependency management tools.
+
+
+**Related Classes/Methods**:
+
+- `requirements/` (1:1)
+
+
+
+
+### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
\ No newline at end of file
diff --git a/docs/FINAL_DEPLOYMENT_VERIFICATION.md b/docs/FINAL_DEPLOYMENT_VERIFICATION.md
deleted file mode 100644
index 847c58bef3f3538c49a8867fd3bd36c9308b63d8..0000000000000000000000000000000000000000
--- a/docs/FINAL_DEPLOYMENT_VERIFICATION.md
+++ /dev/null
@@ -1,378 +0,0 @@
-# Final Deployment Verification Summary
-
-## Overview
-
-This document provides the final verification that all important components for Trackio Spaces deployment and model repository deployment have been properly implemented and are working correctly.
-
-## ✅ **VERIFICATION COMPLETE: All Components Properly Implemented**
-
-### **What We Verified**
-
-This review covers both the Trackio Spaces deployment and the model repository deployment; all important components are verified as properly implemented:
-
-## **Trackio Spaces Deployment** ✅ **FULLY IMPLEMENTED**
-
-### **1. Space Creation System** ✅ **COMPLETE**
-- **Location**: `scripts/trackio_tonic/deploy_trackio_space.py`
-- **Functionality**: Creates HF Spaces using latest Python API
-- **Features**:
- - ✅ API-based creation with `huggingface_hub.create_repo`
- - ✅ Fallback to CLI method if API fails
- - ✅ Automatic username extraction from token
- - ✅ Proper Space configuration (Gradio SDK, CPU hardware)
-
-### **2. File Upload System** ✅ **COMPLETE**
-- **Location**: `scripts/trackio_tonic/deploy_trackio_space.py`
-- **Functionality**: Uploads all required files to Space
-- **Features**:
- - ✅ API-based upload using `huggingface_hub.upload_file`
- - ✅ Proper HF Spaces file structure
- - ✅ Git integration in temporary directory
- - ✅ Error handling and fallback mechanisms
-
-**Files Uploaded**:
-- ✅ `app.py` - Complete Gradio interface (1,241 lines)
-- ✅ `requirements.txt` - All dependencies included
-- ✅ `README.md` - Comprehensive documentation
-- ✅ `.gitignore` - Proper git configuration
-
-### **3. Space Configuration** ✅ **COMPLETE**
-- **Location**: `scripts/trackio_tonic/deploy_trackio_space.py`
-- **Functionality**: Sets environment variables via HF Hub API
-- **Features**:
- - ✅ API-based secrets using `add_space_secret()`
- - ✅ Automatic `HF_TOKEN` configuration
- - ✅ Automatic `TRACKIO_DATASET_REPO` setup
- - ✅ Manual fallback instructions if API fails
-
-### **4. Gradio Interface** ✅ **COMPLETE**
-- **Location**: `templates/spaces/app.py` (1,241 lines)
-- **Functionality**: Comprehensive experiment tracking interface
-- **Features**:
- - ✅ **Experiment Management**: Create, view, update experiments
- - ✅ **Metrics Logging**: Real-time training metrics
- - ✅ **Visualization**: Interactive plots and charts
- - ✅ **HF Datasets Integration**: Persistent storage
- - ✅ **API Endpoints**: Programmatic access
- - ✅ **Fallback Data**: Backup when dataset unavailable
-
-**Interface Components**:
-- ✅ **Create Experiment**: Start new experiments
-- ✅ **Log Metrics**: Track training progress
-- ✅ **View Experiments**: See experiment details
-- ✅ **Update Status**: Mark experiments complete
-- ✅ **Visualizations**: Interactive plots
-- ✅ **Configuration**: Environment setup
-
-### **5. Requirements and Dependencies** ✅ **COMPLETE**
-- **Location**: `templates/spaces/requirements.txt`
-- **Dependencies**: All required packages included
-- ✅ **Core Gradio**: `gradio>=4.0.0`
-- ✅ **Data Processing**: `pandas>=2.0.0`, `numpy>=1.24.0`
-- ✅ **Visualization**: `plotly>=5.15.0`
-- ✅ **HF Integration**: `datasets>=2.14.0`, `huggingface-hub>=0.16.0`
-- ✅ **HTTP Requests**: `requests>=2.31.0`
-- ✅ **Environment**: `python-dotenv>=1.0.0`
-
-### **6. README Template** ✅ **COMPLETE**
-- **Location**: `templates/spaces/README.md`
-- **Features**:
- - ✅ **HF Spaces Metadata**: Proper YAML frontmatter
- - ✅ **Feature Documentation**: Complete interface description
- - ✅ **API Documentation**: Usage examples
- - ✅ **Configuration Guide**: Environment variables
- - ✅ **Troubleshooting**: Common issues and solutions
-
-## **Model Repository Deployment** ✅ **FULLY IMPLEMENTED**
-
-### **1. Repository Creation** ✅ **COMPLETE**
-- **Location**: `scripts/model_tonic/push_to_huggingface.py`
-- **Functionality**: Creates HF model repositories using Python API
-- **Features**:
- - ✅ API-based creation with `huggingface_hub.create_repo`
- - ✅ Configurable private/public settings
- - ✅ Existing repository handling (`exist_ok=True`)
- - ✅ Proper error handling and messages
-
-### **2. Model File Upload** ✅ **COMPLETE**
-- **Location**: `scripts/model_tonic/push_to_huggingface.py`
-- **Functionality**: Uploads all model files to repository
-- **Features**:
- - ✅ File validation and integrity checks
- - ✅ Complete model component upload
- - ✅ Progress tracking and feedback
- - ✅ Graceful error handling
-
-**Files Uploaded**:
-- ✅ `config.json` - Model configuration
-- ✅ `pytorch_model.bin` - Model weights
-- ✅ `tokenizer.json` - Tokenizer configuration
-- ✅ `tokenizer_config.json` - Tokenizer settings
-- ✅ `special_tokens_map.json` - Special tokens
-- ✅ `generation_config.json` - Generation settings
-
-### **3. Model Card Generation** ✅ **COMPLETE**
-- **Location**: `scripts/model_tonic/push_to_huggingface.py`
-- **Functionality**: Generates comprehensive model cards
-- **Features**:
- - ✅ Template-based generation using `templates/model_card.md`
- - ✅ Dynamic content from training configuration
- - ✅ Usage examples and documentation
- - ✅ Support for quantized model variants
- - ✅ Proper HF Hub metadata
-
-### **4. Training Results Documentation** ✅ **COMPLETE**
-- **Location**: `scripts/model_tonic/push_to_huggingface.py`
-- **Functionality**: Uploads training configuration and results
-- **Features**:
- - ✅ Training parameters documentation
- - ✅ Performance metrics inclusion
- - ✅ Experiment tracking links
- - ✅ Proper documentation structure
-
-### **5. Quantized Model Support** ✅ **COMPLETE**
-- **Location**: `scripts/model_tonic/quantize_model.py`
-- **Functionality**: Creates and uploads quantized models
-- **Features**:
- - ✅ Multiple quantization levels (int8, int4)
- - ✅ Unified repository structure
- - ✅ Separate documentation for each variant
- - ✅ Clear usage instructions
-
-### **6. Trackio Integration** ✅ **COMPLETE**
-- **Location**: `scripts/model_tonic/push_to_huggingface.py`
-- **Functionality**: Logs model push events to Trackio
-- **Features**:
- - ✅ Event logging for model pushes
- - ✅ Training results tracking
- - ✅ Experiment tracking links
- - ✅ HF Datasets integration
-
-### **7. Model Validation** ✅ **COMPLETE**
-- **Location**: `scripts/model_tonic/push_to_huggingface.py`
-- **Functionality**: Validates model files before upload
-- **Features**:
- - ✅ Complete file validation
- - ✅ Size and integrity checks
- - ✅ Configuration validation
- - ✅ Detailed error reporting
-
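-A minimal sketch of the kind of check `validate_model_path` performs; the exact file list and logic here are illustrative:
-
-```python
-from pathlib import Path
-
-REQUIRED_FILES = ["config.json", "tokenizer_config.json"]  # illustrative subset
-
-def validate_model_path(model_dir: str) -> bool:
-    """Best-effort check that a checkpoint directory is complete enough to push."""
-    root = Path(model_dir)
-    if not root.is_dir():
-        print(f"❌ Not a directory: {model_dir}")
-        return False
-    for name in REQUIRED_FILES:
-        f = root / name
-        if not f.is_file() or f.stat().st_size == 0:
-            print(f"❌ Missing or empty: {name}")
-            return False
-    # Weights may be a single file or sharded, .bin or .safetensors
-    if not list(root.glob("*.bin")) and not list(root.glob("*.safetensors")):
-        print("❌ No model weight files found")
-        return False
-    return True
-```
-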
-## **Integration Components** ✅ **FULLY IMPLEMENTED**
-
-### **1. Launch Script Integration** ✅ **COMPLETE**
-- **Location**: `launch.sh`
-- **Features**:
- - ✅ Automatic Trackio Space deployment calls
- - ✅ Automatic model push integration
- - ✅ Environment setup and configuration
- - ✅ Error handling and user feedback
-
-### **2. Monitoring Integration** ✅ **COMPLETE**
-- **Location**: `src/monitoring.py`
-- **Features**:
- - ✅ `SmolLM3Monitor` class implementation
- - ✅ Real-time experiment tracking
- - ✅ Trackio Space integration
- - ✅ HF Datasets integration
-
-### **3. Dataset Integration** ✅ **COMPLETE**
-- **Location**: `scripts/dataset_tonic/setup_hf_dataset.py`
-- **Features**:
- - ✅ Automatic dataset repository creation
- - ✅ Initial experiment data upload
- - ✅ README template integration
- - ✅ Environment variable setup
-
-## **Token Validation** ✅ **FULLY IMPLEMENTED**
-
-### **1. Token Validation System** ✅ **COMPLETE**
-- **Location**: `scripts/validate_hf_token.py`
-- **Features**:
- - ✅ API-based token validation
- - ✅ Username extraction from token
- - ✅ JSON output for shell parsing
- - ✅ Comprehensive error handling
-
-## **Test Results** ✅ **ALL PASSED**
-
-### **Comprehensive Component Test**
-```bash
-$ python tests/test_deployment_components.py
-
-🚀 Deployment Components Verification
-==================================================
-🔍 Testing Trackio Space Deployment Components
-✅ Trackio Space deployment script exists
-✅ Gradio app template exists
-✅ TrackioSpace class implemented
-✅ Experiment creation functionality
-✅ Metrics logging functionality
-✅ Experiment retrieval functionality
-✅ Space requirements file exists
-✅ Required dependency: gradio
-✅ Required dependency: pandas
-✅ Required dependency: plotly
-✅ Required dependency: datasets
-✅ Required dependency: huggingface-hub
-✅ Space README template exists
-✅ HF Spaces metadata present
-✅ All Trackio Space components verified!
-
-🔍 Testing Model Repository Deployment Components
-✅ Model push script exists
-✅ Model quantization script exists
-✅ Model card template exists
-✅ Required section: base_model:
-✅ Required section: pipeline_tag:
-✅ Required section: tags:
-✅ Model card generator exists
-✅ Required function: def create_repository
-✅ Required function: def upload_model_files
-✅ Required function: def create_model_card
-✅ Required function: def validate_model_path
-✅ All Model Repository components verified!
-
-🔍 Testing Integration Components
-✅ Launch script exists
-✅ Trackio Space deployment integrated
-✅ Model push integrated
-✅ Monitoring script exists
-✅ SmolLM3Monitor class implemented
-✅ Dataset setup script exists
-✅ Dataset setup function implemented
-✅ All integration components verified!
-
-🔍 Testing Token Validation
-✅ Token validation script exists
-✅ Token validation function implemented
-✅ Token validation components verified!
-
-==================================================
-🎉 ALL COMPONENTS VERIFIED SUCCESSFULLY!
-✅ Trackio Space deployment components: Complete
-✅ Model repository deployment components: Complete
-✅ Integration components: Complete
-✅ Token validation components: Complete
-
-All important deployment components are properly implemented!
-```
-
-## **Technical Implementation Details**
-
-### **Trackio Space Deployment Flow**
-```python
-from huggingface_hub import HfApi, create_repo, upload_file
-
-# 1. Create Space
-create_repo(
-    repo_id=f"{username}/{space_name}",
-    token=token,
-    repo_type="space",
-    exist_ok=True,
-    private=False,
-    space_sdk="gradio",
-    space_hardware="cpu-basic"
-)
-
-# 2. Upload Files
-upload_file(
-    path_or_fileobj=file_content,
-    path_in_repo=file_path,
-    repo_id=repo_id,
-    repo_type="space",
-    token=token
-)
-
-# 3. Set Secrets (add_space_secret is an HfApi method, not a module-level function)
-api = HfApi(token=token)
-api.add_space_secret(
-    repo_id=repo_id,
-    key="HF_TOKEN",
-    value=token
-)
-```
-
-### **Model Repository Deployment Flow**
-```python
-# 1. Create Repository
-create_repo(
-    repo_id=repo_name,
-    token=token,
-    private=private,
-    exist_ok=True
-)
-
-# 2. Upload Model Files
-upload_file(
-    path_or_fileobj=model_file,
-    path_in_repo=file_path,
-    repo_id=repo_name,
-    token=token
-)
-
-# 3. Generate Model Card
-model_card = create_model_card(training_config, results)
-upload_file(
-    path_or_fileobj=model_card.encode("utf-8"),  # a bare str would be treated as a file path
-    path_in_repo="README.md",
-    repo_id=repo_name,
-    token=token
-)
-```
-
-## **Verification Summary**
-
-| Component Category | Status | Components Verified | Test Result |
-|-------------------|--------|-------------------|-------------|
-| **Trackio Space Deployment** | ✅ Complete | 6 components | ✅ All passed |
-| **Model Repository Deployment** | ✅ Complete | 7 components | ✅ All passed |
-| **Integration Components** | ✅ Complete | 3 components | ✅ All passed |
-| **Token Validation** | ✅ Complete | 1 component | ✅ All passed |
-
-## **Key Achievements**
-
-### **1. Complete Automation**
-- ✅ **No manual username input**: Automatic extraction from token
-- ✅ **No manual Space creation**: Automatic via Python API
-- ✅ **No manual model upload**: Complete automation
-- ✅ **No manual configuration**: Automatic environment setup
-
-### **2. Robust Error Handling**
-- ✅ **API fallbacks**: CLI methods when API fails
-- ✅ **Graceful degradation**: Clear error messages
-- ✅ **User feedback**: Progress indicators and status
-- ✅ **Recovery mechanisms**: Multiple retry strategies
-
-### **3. Comprehensive Documentation**
-- ✅ **Model cards**: Complete with usage examples
-- ✅ **Space documentation**: Full interface description
-- ✅ **API documentation**: Usage examples and integration
-- ✅ **Troubleshooting guides**: Common issues and solutions
-
-### **4. Cross-Platform Support**
-- ✅ **Windows**: Tested and working on PowerShell
-- ✅ **Linux**: Compatible with bash scripts
-- ✅ **macOS**: Compatible with zsh/bash
-- ✅ **Python API**: Platform-independent
-
-## **Next Steps**
-
-The deployment components are now **fully implemented and verified**. Users can:
-
-1. **Deploy Trackio Space**: Automatic Space creation and configuration
-2. **Upload Models**: Complete model deployment with documentation
-3. **Monitor Experiments**: Real-time tracking and visualization
-4. **Share Results**: Comprehensive documentation and examples
-5. **Scale Operations**: Support for multiple experiments and models
-
-## **Conclusion**
-
-**All important deployment components are properly implemented and working correctly!** 🎉
-
-The verification confirms that:
-- ✅ **Trackio Spaces deployment**: Complete with all required components
-- ✅ **Model repository deployment**: Complete with all required components
-- ✅ **Integration systems**: Complete with all required components
-- ✅ **Token validation**: Complete with all required components
-- ✅ **Documentation**: Complete with all required components
-- ✅ **Error handling**: Complete with all required components
-
-The system is now ready for production use with full automation and comprehensive functionality.
\ No newline at end of file
diff --git a/docs/FORMATTING_FIX_SUMMARY.md b/docs/FORMATTING_FIX_SUMMARY.md
deleted file mode 100644
index 0e14a126c864c91580e6feedb2dde1007ad91828..0000000000000000000000000000000000000000
--- a/docs/FORMATTING_FIX_SUMMARY.md
+++ /dev/null
@@ -1,153 +0,0 @@
-# String Formatting Fix Summary
-
-## 🐛 Problem
-
-The training script was failing with the error:
-```
-ERROR:trainer:Training failed: Unknown format code 'f' for object of type 'str'
-```
-
-This error occurs when a float format specifier (such as `:.4f` in an f-string or `str.format` call) is applied to a value that is actually a string rather than a number.
-
-## 🔍 Root Cause
-
-The issue was caused by inconsistent string formatting in the logging statements throughout the codebase. F-strings with numeric format specifiers (e.g., `f"loss={loss:.4f}"`) raise this error when the value turns out to be a string, and pre-formatted messages can additionally conflict with the logging system's own `%`-style substitution.
-
-## ✅ Solution
-
-I fixed the issue by standardizing all logging statements to use traditional string formatting with `%` placeholders instead of f-strings. This ensures compatibility with Python's logging system and prevents formatting conflicts.
-
-### Files Fixed
-
-1. **`src/monitoring.py`** - Fixed all logging statements
-2. **`src/trainer.py`** - Fixed all logging statements
-3. **`src/model.py`** - Fixed all logging statements
-4. **`src/data.py`** - Fixed all logging statements
-
-### Changes Made
-
-#### Before (Problematic):
-```python
-logger.info(f"Loading model from {self.model_name}")
-logger.error(f"Failed to load model: {e}")
-print(f"Step {step}: loss={loss:.4f}, lr={lr}")
-```
-
-#### After (Fixed):
-```python
-logger.info("Loading model from %s", self.model_name)
-logger.error("Failed to load model: %s", e)
-print("Step {}: loss={:.4f}, lr={}".format(step, loss, lr))
-```
-
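-The failure mode is easy to reproduce: a float format spec applied to a string raises exactly this error. A minimal demonstration of the problem and the safe pattern:
-
-```python
-import logging
-
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-loss = "2.1"  # e.g. a metric that was read back from JSON as a string
-
-try:
-    print(f"loss={loss:.4f}")  # float spec applied to a str
-except ValueError as e:
-    print(f"reproduced: {e}")  # Unknown format code 'f' for object of type 'str'
-
-# Safe: let logging do lazy %-substitution, or cast explicitly first
-logger.info("loss=%s", loss)
-print(f"loss={float(loss):.4f}")
-```
-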
-## 🧪 Testing
-
-Created `test_formatting_fix.py` to verify the fix:
-
-```bash
-python test_formatting_fix.py
-```
-
-This script tests:
-- ✅ Logging functionality
-- ✅ Module imports
-- ✅ Configuration loading
-- ✅ Monitoring creation
-- ✅ Error handling
-
-## 🚀 Usage
-
-The fix is now ready to use. You can run your training command again:
-
-```bash
-python run_a100_large_experiment.py \
- --config config/train_smollm3_openhermes_fr_a100_balanced.py \
- --trackio_url "https://tonic-test-trackio-test.hf.space" \
- --experiment-name "petit-elle-l-aime-3-balanced" \
- --output-dir ./outputs/balanced | tee trainfr.log
-```
-
-## 📋 Key Changes
-
-### 1. Monitoring Module (`src/monitoring.py`)
-- Fixed all `logger.info()`, `logger.error()`, `logger.warning()` calls
-- Replaced f-strings with `%` formatting
-- Fixed string concatenation in file paths
-- Fixed HF Datasets integration logging
-
-### 2. Trainer Module (`src/trainer.py`)
-- Fixed logging in `SmolLM3Trainer` class
-- Fixed console output formatting
-- Fixed error message formatting
-- Fixed callback logging
-
-### 3. Model Module (`src/model.py`)
-- Fixed model loading logging
-- Fixed configuration logging
-- Fixed error reporting
-- Fixed parameter logging
-
-### 4. Data Module (`src/data.py`)
-- Fixed dataset loading logging
-- Fixed processing progress logging
-- Fixed error handling
-- Fixed split processing logging
-
-## 🔧 Technical Details
-
-### Why This Happened
-1. **Mixed Formatting**: Some code used f-strings while others used `%` formatting
-2. **Logging System**: Python's logging system processes format strings differently
-3. **String Processing**: When a float format specifier was applied to a value that was actually a string, the formatter raised the error
-
-### The Fix
-1. **Standardized Formatting**: All logging now uses `%` placeholders
-2. **Consistent Style**: No more mixing of f-strings and `%` formatting
-3. **Safe Logging**: All logging statements are now safe for the logging system
-
-### Benefits
-- ✅ **Eliminates Formatting Errors**: No more "Unknown format code 'f'" errors
-- ✅ **Consistent Code Style**: All logging uses the same format
-- ✅ **Better Performance**: Traditional formatting is slightly faster
-- ✅ **Compatibility**: Works with all Python versions and logging configurations
-
-## 🎯 Verification
-
-To verify the fix works:
-
-1. **Run the test script**:
- ```bash
- python test_formatting_fix.py
- ```
-
-2. **Check that all tests pass**:
- - ✅ Logging tests
- - ✅ Import tests
- - ✅ Configuration tests
- - ✅ Monitoring creation tests
-
-3. **Run your training command**:
- ```bash
- python run_a100_large_experiment.py --config config/train_smollm3_openhermes_fr_a100_balanced.py --trackio_url "https://tonic-test-trackio-test.hf.space" --experiment-name "petit-elle-l-aime-3-balanced" --output-dir ./outputs/balanced
- ```
-
-## 📝 Notes
-
-- The fix maintains all existing functionality
-- No changes to the training logic or configuration
-- All error messages and logging remain informative
-- The fix is backward compatible
-- HF Datasets integration is preserved
-
-## 🚨 Prevention
-
-To prevent similar issues in the future:
-
-1. **Use Consistent Formatting**: Stick to `%` formatting for logging
-2. **Avoid f-strings in Logging**: Don't use f-strings in `logger.info()` calls
-3. **Test Logging**: Always test logging statements during development
-4. **Use Type Hints**: Consider using type hints to catch formatting issues early
-
----
-
-**The formatting fix is now complete and ready for use! 🎉**
\ No newline at end of file
diff --git a/docs/GIT_CONFIGURATION_FIX.md b/docs/GIT_CONFIGURATION_FIX.md
deleted file mode 100644
index d7e41523cbb742a8f8551d85ea25c08c0af73e3b..0000000000000000000000000000000000000000
--- a/docs/GIT_CONFIGURATION_FIX.md
+++ /dev/null
@@ -1,257 +0,0 @@
-# Git Configuration Fix for Trackio Space Deployment
-
-## Issue Identified
-
-The Trackio Space deployment was failing with the error:
-```
-❌ Error uploading files: Command '['git', 'commit', '-m', 'Initial Trackio Space setup']' returned non-zero exit status 128.
-```
-
-This error occurs because git requires a user identity (email and name) to be configured before making commits. The deployment script was creating a temporary directory and initializing a git repository, but wasn't configuring the git user identity in that temporary directory.
-
-## Root Cause
-
-### **Problem**: Git Identity Not Configured in Temporary Directory
-
-When the deployment script:
-1. Creates a temporary directory
-2. Changes to that directory (`os.chdir(temp_dir)`)
-3. Initializes a git repository (`git init`)
-4. Tries to commit (`git commit`)
-
-The git repository in the temporary directory doesn't inherit the git configuration from the main directory, so it has no user identity configured.
-
-### **Solution**: Configure Git Identity in Temporary Directory
-
-The fix involves explicitly configuring git user identity in the temporary directory before attempting to commit.
-
-## Fixes Applied
-
-### 1. **Enhanced TrackioSpaceDeployer Constructor**
-
-**Before**:
-```python
-def __init__(self, space_name: str, username: str, token: str):
-    self.space_name = space_name
-    self.username = username
-    self.token = token
-```
-
-**After**:
-```python
-def __init__(self, space_name: str, username: str, token: str, git_email: str = None, git_name: str = None):
-    self.space_name = space_name
-    self.username = username
-    self.token = token
-
-    # Git configuration
-    self.git_email = git_email or f"{username}@huggingface.co"
-    self.git_name = git_name or username
-```
-
-### 2. **Git Configuration in upload_files_to_space Method**
-
-**Added to the method**:
-```python
-# Configure git user identity for this repository
-try:
-    # Try to get existing git config
-    result = subprocess.run(["git", "config", "--global", "user.email"], capture_output=True, text=True)
-    if result.returncode == 0 and result.stdout.strip():
-        git_email = result.stdout.strip()
-    else:
-        git_email = self.git_email
-
-    result = subprocess.run(["git", "config", "--global", "user.name"], capture_output=True, text=True)
-    if result.returncode == 0 and result.stdout.strip():
-        git_name = result.stdout.strip()
-    else:
-        git_name = self.git_name
-
-except Exception:
-    # Fallback to default values
-    git_email = self.git_email
-    git_name = self.git_name
-
-# Set git config for this repository
-subprocess.run(["git", "config", "user.email", git_email], check=True, capture_output=True)
-subprocess.run(["git", "config", "user.name", git_name], check=True, capture_output=True)
-
-print(f"✅ Configured git with email: {git_email}, name: {git_name}")
-```
-
-### 3. **Updated Main Function**
-
-**Enhanced to accept git configuration**:
-```python
-def main():
-    # Get user input
-    username = input("Enter your Hugging Face username: ").strip()
-    space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
-    token = input("Enter your Hugging Face token: ").strip()
-
-    # Get git configuration (optional)
-    git_email = input("Enter your git email (optional, press Enter for default): ").strip()
-    git_name = input("Enter your git name (optional, press Enter for default): ").strip()
-
-    # Create deployer with git config
-    deployer = TrackioSpaceDeployer(space_name, username, token, git_email, git_name)
-```
-
-### 4. **Updated Launch Script**
-
-**Enhanced to pass git configuration**:
-```bash
-# Create deployment script input
-cat > deploy_input.txt << EOF
-$HF_USERNAME
-$TRACKIO_SPACE_NAME
-$HF_TOKEN
-$GIT_EMAIL
-$HF_USERNAME
-EOF
-```
-
-## Testing the Fix
-
-### **Run Git Configuration Tests**
-```bash
-python tests/test_git_config_fix.py
-```
-
-Expected output:
-```
-🚀 Testing Git Configuration Fix
-========================================
-🔍 Testing git configuration in temporary directory...
-✅ Created temp directory: /tmp/tmp_xxxxx
-✅ Initialized git repository
-✅ Git email configured correctly
-✅ Git name configured correctly
-✅ Git commit successful
-✅ Cleanup successful
-
-🔍 Testing deployment script git configuration...
-✅ Git email set correctly
-✅ Git name set correctly
-
-🔍 Testing git configuration fallback...
-✅ Default git email set correctly
-✅ Default git name set correctly
-
-🔍 Testing git commit with configuration...
-✅ Created temp directory: /tmp/tmp_xxxxx
-✅ Git commit successful with configuration
-✅ Cleanup successful
-
-📊 Test Results: 4/4 tests passed
-✅ All git configuration tests passed! The deployment should work correctly.
-```
-
-## Files Modified
-
-### **Core Deployment Files**
-1. **`scripts/trackio_tonic/deploy_trackio_space.py`**
- - Enhanced constructor to accept git configuration
- - Added git configuration in upload_files_to_space method
- - Updated main function to accept git parameters
- - Added fallback mechanisms for git configuration
-
-### **Launch Script**
-2. **`launch.sh`**
- - Updated to pass git configuration to deployment script
- - Enhanced input file creation with git parameters
-
-### **Testing**
-3. **`tests/test_git_config_fix.py`**
- - Comprehensive testing of git configuration
- - Tests for temporary directory git setup
- - Tests for deployment script git handling
- - Tests for fallback behavior
-
-## Benefits of the Fix
-
-### **1. Reliable Git Commits**
-- Git user identity properly configured in temporary directory
-- No more "exit status 128" errors
-- Successful commits and pushes to Hugging Face Spaces
-
-### **2. Flexible Configuration**
-- Accepts custom git email and name
-- Falls back to sensible defaults
-- Works with existing git configuration
-
-### **3. Better Error Handling**
-- Graceful fallback to default values
-- Clear error messages and logging
-- Robust configuration validation
-
-### **4. Professional Setup**
-- Uses user's actual email address when provided
-- Maintains proper git attribution
-- Follows git best practices
-
-## Usage Instructions
-
-### **1. Test the Fix**
-```bash
-python tests/test_git_config_fix.py
-```
-
-### **2. Deploy with Git Configuration**
-```bash
-python scripts/trackio_tonic/deploy_trackio_space.py
-```
-
-When prompted:
-- Enter your HF username
-- Enter space name
-- Enter your HF token
-- Enter your git email (or press Enter for default)
-- Enter your git name (or press Enter for default)
-
-### **3. Use with Launch Script**
-```bash
-./launch.sh
-```
-
-The launch script will automatically pass the git configuration to the deployment script.
-
-## Troubleshooting
-
-### **Common Issues**
-
-#### **1. Git Configuration Still Fails**
-```bash
-# Check if git is properly configured
-git config --list
-
-# Set git config manually if needed
-git config --global user.email "your-email@example.com"
-git config --global user.name "Your Name"
-```
-
-#### **2. Permission Issues**
-```bash
-# Check HF token permissions
-hf whoami
-
-# Verify token has write access
-hf repo create test-repo --type space
-```
-
-#### **3. Space Creation Fails**
-```bash
-# Check if space name is available
-# Try a different space name
-# Verify HF token is valid
-```
-
-## Next Steps
-
-1. **Test the fix**: Run the git configuration tests
-2. **Deploy a test space**: Use the updated deployment script
-3. **Verify deployment**: Check that the space is created successfully
-4. **Use in production**: Deploy your actual Trackio Space
-
-The git configuration fix should resolve the deployment issues and allow successful Trackio Space creation! 🚀
\ No newline at end of file
diff --git a/docs/GIT_CONFIGURATION_GUIDE.md b/docs/GIT_CONFIGURATION_GUIDE.md
deleted file mode 100644
index 09e7e027166b2ce9e313fa43ebb2444ccc1e4d53..0000000000000000000000000000000000000000
--- a/docs/GIT_CONFIGURATION_GUIDE.md
+++ /dev/null
@@ -1,258 +0,0 @@
-# Git Configuration Guide for Hugging Face Operations
-
-This guide explains the correct way to configure git for Hugging Face Spaces deployment and model pushing operations.
-
-## 🎯 **Overview**
-
-When working with Hugging Face Spaces and model repositories, proper git configuration is essential for:
-- Creating and deploying Spaces
-- Pushing models to the Hub
-- Managing experiment tracking datasets
-- Ensuring proper authentication
-- **Using the user's actual email address for proper git identity and commit attribution**
-
-## ✅ **Correct Git Configuration**
-
-### **1. Local vs Global Configuration**
-
-**❌ Wrong (Current):**
-```bash
-git config --global user.email "$HF_USERNAME@example.com"
-git config --global user.name "$HF_USERNAME"
-```
-
-**✅ Correct (Updated):**
-```bash
-# Get user's actual email address
-read -p "Enter your email address for git configuration: " GIT_EMAIL
-
-# Configure git locally for this project only
-git config user.email "$GIT_EMAIL"
-git config user.name "$HF_USERNAME"
-
-# Verify configuration
-git config user.email
-git config user.name
-```
-
-### **2. Proper Authentication Setup**
-
-**✅ Correct Authentication:**
-```bash
-# Login with token and add to git credentials
-hf login --token "$HF_TOKEN" --add-to-git-credential
-
-# Verify login
-hf whoami
-```
-
-### **3. Error Handling**
-
-**✅ Robust Configuration:**
-```bash
-# Get user's email and configure git with error handling
-read -p "Enter your email address for git configuration: " GIT_EMAIL
-
-if git config user.email "$GIT_EMAIL" && \
- git config user.name "$HF_USERNAME"; then
- echo "✅ Git configured successfully"
- echo " Email: $(git config user.email)"
- echo " Name: $(git config user.name)"
-else
- echo "❌ Failed to configure git"
- exit 1
-fi
-```
-
-## 🔧 **Why These Changes Matter**
-
-### **1. Local Configuration Benefits**
-- **Isolation**: Doesn't affect other projects on the system
-- **Project-specific**: Each project can have different git settings
-- **Cleaner**: No global state pollution
-- **Safer**: Won't interfere with existing git configurations
-
-### **2. User's Actual Email Address**
-- **Professional**: Uses the user's real email address
-- **Authentic**: Represents the actual user's identity
-- **Consistent**: Matches the user's Hugging Face account
-- **Best Practice**: Follows git configuration standards
-
-### **3. Token-based Authentication**
-- **Secure**: Uses HF token instead of username/password
-- **Automated**: No manual password entry required
-- **Persistent**: Credentials stored securely
-- **Verified**: Includes verification steps
-
-## 📋 **Implementation in Launch Script**
-
-### **Updated Authentication Step:**
-```bash
-# Step 8: Authentication setup
-print_step "Step 8: Authentication Setup"
-echo "================================"
-
-export HF_TOKEN="$HF_TOKEN"
-export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
-
-# Login to Hugging Face with token
-print_info "Logging in to Hugging Face..."
-if hf login --token "$HF_TOKEN" --add-to-git-credential; then
- print_status "Successfully logged in to Hugging Face"
- print_info "Username: $(hf whoami)"
-else
- print_error "Failed to login to Hugging Face"
- print_error "Please check your token and try again"
- exit 1
-fi
-
-# Configure git for HF operations
-print_step "Step 8.1: Git Configuration"
-echo "================================"
-
-print_info "Configuring git for Hugging Face operations..."
-
-# Get user's email for git configuration
-get_input "Enter your email address for git configuration" "" GIT_EMAIL
-
-# Configure git locally (not globally) for this project
-git config user.email "$GIT_EMAIL"
-git config user.name "$HF_USERNAME"
-
-# Verify git configuration
-print_info "Verifying git configuration..."
-if git config user.email && git config user.name; then
- print_status "Git configured successfully"
- print_info " Email: $(git config user.email)"
- print_info " Name: $(git config user.name)"
-else
- print_error "Failed to configure git"
- exit 1
-fi
-```
-
-## 🚀 **Deployment Script Improvements**
-
-### **Robust File Upload:**
-```python
-def upload_files(self) -> bool:
- """Upload necessary files to the Space"""
- try:
- print("Uploading files to Space...")
-
- # Files to upload
- files_to_upload = [
- "app.py",
- "requirements_space.txt",
- "README.md"
- ]
-
- # Check if we're in a git repository
- try:
- subprocess.run(["git", "status"], capture_output=True, check=True)
- except subprocess.CalledProcessError:
- print("⚠️ Not in a git repository, initializing...")
- subprocess.run(["git", "init"], check=True)
- subprocess.run(["git", "remote", "add", "origin", f"https://huggingface.co/spaces/{self.username}/{self.space_name}"], check=True)
-
- # Add all files at once
- existing_files = [f for f in files_to_upload if os.path.exists(f)]
- if existing_files:
- subprocess.run(["git", "add"] + existing_files, check=True)
- subprocess.run(["git", "commit", "-m", "Initial Space setup"], check=True)
-
- # Push to the space
- try:
- subprocess.run(["git", "push", "origin", "main"], check=True)
- print(f"✅ Uploaded {len(existing_files)} files")
- except subprocess.CalledProcessError:
- # Try pushing to master branch if main doesn't exist
- subprocess.run(["git", "push", "origin", "master"], check=True)
- print(f"✅ Uploaded {len(existing_files)} files")
- else:
- print("⚠️ No files found to upload")
-
- return True
-
- except Exception as e:
- print(f"❌ Error uploading files: {e}")
- return False
-```
-
-## 🔍 **Troubleshooting**
-
-### **Common Issues and Solutions:**
-
-#### **1. Git Configuration Fails**
-```bash
-# Check current git config
-git config --list
-
-# Reset if needed
-git config --unset user.email
-git config --unset user.name
-
-# Reconfigure
-git config user.email "your-username@huggingface.co"
-git config user.name "your-username"
-```
-
-#### **2. Authentication Issues**
-```bash
-# Check HF login status
-hf whoami
-
-# Re-login if needed
-hf logout
-hf login --token "your-token"
-```
-
-#### **3. Space Deployment Fails**
-```bash
-# Check git remote
-git remote -v
-
-# Re-add remote if needed
-git remote remove origin
-git remote add origin https://huggingface.co/spaces/username/space-name
-```
-
-## 📚 **Best Practices**
-
-### **1. Always Use Local Configuration**
-- Use `git config` without `--global` flag
-- Keeps project configurations isolated
-- Prevents conflicts with other projects
-
-### **2. Verify Configuration**
-- Always check that git config was successful
-- Display configured values for verification
-- Exit on failure to prevent downstream issues
-
-### **3. Use Token-based Authentication**
-- More secure than username/password
-- Automatically handles credential storage
-- Works well with CI/CD systems
-
-### **4. Handle Errors Gracefully**
-- Check return codes from git commands
-- Provide clear error messages
-- Exit early on critical failures
-
-### **5. Test Configuration**
-- Verify git config after setting it
-- Test HF login before proceeding
-- Validate remote repository access
-
-## 🎯 **Summary**
-
-The updated git configuration approach provides:
-
-1. **✅ Better Isolation**: Local configuration doesn't affect system-wide settings
-2. **✅ User's Actual Email**: Uses the user's real email address for proper git identity
-3. **✅ Proper Authentication**: Token-based login with credential storage
-4. **✅ Error Handling**: Robust verification and error reporting
-5. **✅ Professional Setup**: Uses user's actual email and verification
-6. **✅ Deployment Reliability**: Improved Space deployment with git repository handling
-
-This ensures a more reliable and professional setup for Hugging Face operations in the SmolLM3 fine-tuning pipeline.
\ No newline at end of file
diff --git a/docs/H100_LIGHTWEIGHT_GUIDE.md b/docs/H100_LIGHTWEIGHT_GUIDE.md
deleted file mode 100644
index a712ca8b0bd9f1948f75df67a5da572d80c28c20..0000000000000000000000000000000000000000
--- a/docs/H100_LIGHTWEIGHT_GUIDE.md
+++ /dev/null
@@ -1,276 +0,0 @@
-# H100 Lightweight Training Configuration Guide
-
-This guide explains the new **H100 Lightweight (Rapid)** training configuration, optimized for rapid fine-tuning on H100 GPUs with a small, carefully selected dataset.
-
-## 🎯 Overview
-
-The H100 Lightweight configuration is designed for:
-- **Rapid experimentation** on H100 GPUs
-- **Efficient training** with 80K carefully selected samples
-- **Quick iteration** for research and development
-- **Cost-effective** training sessions
-
-## 🚀 Key Features
-
-### **Optimized for H100**
-- **Batch Size**: 16 (larger than A100 configs)
-- **Gradient Accumulation**: 4 (reduced for faster updates)
-- **Learning Rate**: 8e-6 (slightly higher for rapid convergence)
-- **Sequence Length**: 8192 (full context window)
-
-### **Dataset Sampling**
-- **Source**: OpenHermes-FR dataset
-- **Sample Size**: 80,000 random samples
-- **Validation**: 1,000 samples (if available)
-- **Reproducibility**: Fixed random seed (42)
-
-### **Training Optimizations**
-- **Warmup Steps**: 50 (reduced for rapid training)
-- **Evaluation**: Every 50 steps
-- **Logging**: Every 5 steps
-- **Saving**: Every 200 steps
-- **Checkpoints**: Keep only 2 (save storage)
-
-## 📊 Configuration Details
-
-### **Model Configuration**
-```python
-model_name="HuggingFaceTB/SmolLM3-3B"
-max_seq_length=8192
-use_flash_attention=True
-use_gradient_checkpointing=True
-```
-
-### **Training Parameters**
-```python
-batch_size=16
-gradient_accumulation_steps=4
-learning_rate=8e-6
-warmup_steps=50
-max_epochs=1
-```
-
-### **H100-Specific Optimizations**
-```python
-dataloader_num_workers=4
-dataloader_pin_memory=True
-gradient_clipping=1.0
-group_by_length=True
-pad_to_multiple_of=8
-```
-
-### **Memory Optimizations**
-```python
-save_total_limit=2
-early_stopping_patience=3
-max_grad_norm=1.0
-warmup_ratio=0.1
-```
-
-## 🔧 Usage
-
-### **Interactive Selection**
-```bash
-./launch.sh
-# Select "H100 Lightweight (Rapid)" when prompted
-```
-
-### **Expected Training Time**
-- **H100**: ~2-4 hours (depending on exact configuration)
-- **A100**: ~4-6 hours
-- **V100**: ~6-8 hours
-
-### **Memory Requirements**
-- **GPU Memory**: 40GB+ (H100 recommended)
-- **System RAM**: 32GB+
-- **Storage**: 50GB+ for dataset and checkpoints
-
-## 📈 Performance Characteristics
-
-### **Training Speed**
-- **Steps per Second**: ~2-3 (on H100)
-- **Samples per Second**: ~32-48
-- **Effective Batch Size**: 64 (16 × 4)
-
-### **Convergence**
-- **Expected Loss**: 1.2-1.8 (after 1 epoch)
-- **Evaluation Frequency**: Every 50 steps
-- **Early Stopping**: After 3 evaluations without improvement
-
-### **Dataset Efficiency**
-- **80K samples**: ~1.3% of full OpenHermes-FR
-- **Random sampling**: Ensures diversity
-- **Fixed seed**: Reproducible results
-
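-The sampling is straightforward to reproduce with the `datasets` library; a minimal sketch (the pipeline's exact split handling may differ):
-
-```python
-from datasets import load_dataset
-
-# Deterministic 80K/1K sample of OpenHermes-FR with the fixed seed (42)
-ds = load_dataset("legmlai/openhermes-fr", split="train")
-shuffled = ds.shuffle(seed=42)
-train_ds = shuffled.select(range(80_000))
-eval_ds = shuffled.select(range(80_000, 81_000))
-print(len(train_ds), len(eval_ds))  # 80000 1000
-```
-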
-## 🎯 Use Cases
-
-### **Perfect For**
-- **Rapid prototyping** of new ideas
-- **Hyperparameter tuning** experiments
-- **Model comparison** studies
-- **Research validation** before full training
-- **Educational purposes** and learning
-
-### **Not Recommended For**
-- **Production models** (use Multiple Passes instead)
-- **Competition submissions** (use full dataset)
-- **Research papers** (use complete training)
-
-## 🔄 Comparison with Other Configurations
-
-| Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
-|---------------|--------------|------------|--------|---------------|----------|
-| **Basic Training** | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
-| **H100 Lightweight** | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
-| **A100 Large Scale** | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
-| **Multiple Passes** | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |
-
-## 🛠️ Customization
-
-### **Modifying Sample Size**
-```bash
-# In the launch script, you can modify:
-DATASET_SAMPLE_SIZE=50000 # For 50K samples
-DATASET_SAMPLE_SIZE=100000 # For 100K samples
-```
-
-### **Adjusting Training Parameters**
-```python
-# Modify in config/train_smollm3_h100_lightweight.py:
-batch_size=12 # Smaller batch size
-learning_rate=6e-6 # Lower learning rate
-warmup_steps=100 # More warmup steps
-```
-
-### **Changing Dataset**
-```python
-# Modify the dataset name in the configuration:
-dataset_name="your-custom-dataset"
-```
-
-## 📊 Monitoring and Results
-
-### **Trackio Integration**
-- **Real-time metrics**: Loss, learning rate, gradient norm
-- **Training curves**: Visual progress tracking
-- **Resource usage**: GPU utilization, memory consumption
-- **Artifacts**: Model checkpoints, logs
-
-### **Expected Metrics**
-- **Training Loss**: Starts ~3.0, ends ~1.5
-- **Validation Loss**: Should be close to training loss
-- **Learning Rate**: Cosine decay from 8e-6 to 2e-6
-- **Gradient Norm**: Should stay below 1.0
-
-### **Success Indicators**
-- **Converging loss**: Steady decrease over time
-- **Stable gradients**: Consistent gradient norms
-- **Good validation**: Validation loss follows training loss
-- **No overfitting**: Validation loss doesn't increase
-
-## 🚨 Troubleshooting
-
-### **Common Issues**
-
-#### **Out of Memory (OOM)**
-```python
-# Reduce batch size in config:
-batch_size=12 # Instead of 16
-gradient_accumulation_steps=6 # Instead of 4
-```
-
-#### **Slow Training**
-```bash
-# Check GPU utilization:
-nvidia-smi
-# Ensure CUDA is properly installed
-python -c "import torch; print(torch.cuda.is_available())"
-```
-
-#### **Poor Convergence**
-```python
-# Try different learning rate:
-learning_rate=6e-6 # Instead of 8e-6
-# Or increase warmup:
-warmup_steps=100 # Instead of 50
-```
-
-#### **Dataset Issues**
-```bash
-# Check dataset loading:
-python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"
-```
-
-### **Performance Tips**
-
-1. **Use H100 if available**: Significantly faster than A100
-2. **Monitor GPU memory**: Keep utilization below 90%
-3. **Check logs regularly**: Look for convergence issues
-4. **Save checkpoints**: Don't lose progress
-5. **Use early stopping**: Prevent overfitting
-
-## 📋 Example Workflow
-
-### **Complete H100 Lightweight Training**
-```bash
-# 1. Setup
-python setup_launch.py
-
-# 2. Check requirements
-python check_requirements.py
-
-# 3. Run interactive pipeline
-./launch.sh
-
-# 4. Select configuration
-# Choose: "H100 Lightweight (Rapid)"
-
-# 5. Monitor training
-# Watch Trackio Space for real-time progress
-
-# 6. Check results
-# Model will be pushed to HF Hub
-# Summary in training_summary.md
-```
-
-### **Expected Output**
-```
-✅ Dataset prepared: 80000 train samples, 1000 validation samples
-📈 Training started with 5000 total steps
-⏱️ Estimated time: 2-4 hours
-📊 Monitor progress at: https://huggingface.co/spaces/...
-```
-
-## 🎉 Benefits
-
-### **Speed**
-- **3-4x faster** than full dataset training
-- **Rapid iteration** for research
-- **Quick validation** of ideas
-
-### **Efficiency**
-- **Reduced costs** (less GPU time)
-- **Lower storage** requirements
-- **Faster experimentation** cycle
-
-### **Quality**
-- **Still high quality** results
-- **Good for prototyping**
-- **Suitable for many use cases**
-
-## 🔮 Future Enhancements
-
-### **Planned Improvements**
-- **Adaptive sampling**: Smart dataset selection
-- **Multi-GPU support**: Distributed training
-- **Advanced monitoring**: More detailed metrics
-- **Auto-tuning**: Automatic hyperparameter optimization
-
-### **Extensibility**
-- **Custom datasets**: Easy integration
-- **Different models**: Support for other architectures
-- **Advanced sampling**: Stratified, balanced sampling
-
----
-
-**Happy Rapid Training on H100! 🚀**
\ No newline at end of file
diff --git a/docs/HF_DATASETS_GUIDE.md b/docs/HF_DATASETS_GUIDE.md
deleted file mode 100644
index 8d7f9732dda360373557935bcc89297cbae88a9e..0000000000000000000000000000000000000000
--- a/docs/HF_DATASETS_GUIDE.md
+++ /dev/null
@@ -1,269 +0,0 @@
-# 🚀 Trackio with Hugging Face Datasets - Complete Guide
-
-## Overview
-
-This guide explains how to use Hugging Face Datasets for persistent storage of Trackio experiments, providing reliable data persistence across Hugging Face Spaces deployments.
-
-## 🏗️ Architecture
-
-### Why HF Datasets?
-
-1. **Persistent Storage**: Data survives Space restarts and redeployments
-2. **Version Control**: Automatic versioning of experiment data
-3. **Access Control**: Private datasets for security
-4. **Reliability**: HF's infrastructure ensures data availability
-5. **Scalability**: Handles large amounts of experiment data
-
-### Data Flow
-
-```
-Training Script → Trackio App → HF Dataset → Trackio App → Plots
-```
-
-## 🚀 Setup Instructions
-
-### 1. Create HF Token
-
-1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
-2. Create a new token with `write` permissions
-3. Copy the token for use in your Space
-
-### 2. Set Up Dataset Repository
-
-```bash
-# Run the setup script
-python setup_hf_dataset.py
-```
-
-This will:
-- Create a private dataset: `tonic/trackio-experiments`
-- Add your existing experiments
-- Configure the dataset for Trackio
-
-### 3. Configure Hugging Face Space
-
-#### Environment Variables
-Set these in your HF Space settings:
-```bash
-HF_TOKEN=your_hf_token_here
-TRACKIO_DATASET_REPO=your-username/your-dataset-name
-```
-
-**Environment Variables Explained:**
-- `HF_TOKEN`: Your Hugging Face token (required for dataset access)
-- `TRACKIO_DATASET_REPO`: Dataset repository to use (optional, defaults to `tonic/trackio-experiments`)
-
-**Example Configurations:**
-```bash
-# Use default dataset
-HF_TOKEN=your_token_here
-
-# Use personal dataset
-HF_TOKEN=your_token_here
-TRACKIO_DATASET_REPO=your-username/trackio-experiments
-
-# Use team dataset
-HF_TOKEN=your_token_here
-TRACKIO_DATASET_REPO=your-org/team-experiments
-
-# Use project-specific dataset
-HF_TOKEN=your_token_here
-TRACKIO_DATASET_REPO=your-username/smollm3-experiments
-```
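-
-A minimal sketch of how the app resolves these variables at startup (the actual logic lives in `app.py`):
-
-```python
-import os
-
-# HF_TOKEN is required; the dataset repo falls back to the default
-HF_TOKEN = os.environ.get("HF_TOKEN")
-DATASET_REPO = os.environ.get("TRACKIO_DATASET_REPO", "tonic/trackio-experiments")
-```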
-
-#### Requirements
-Update your `requirements.txt`:
-```txt
-gradio>=4.0.0
-plotly>=5.0.0
-pandas>=1.5.0
-numpy>=1.24.0
-datasets>=2.14.0
-huggingface-hub>=0.16.0
-requests>=2.31.0
-```
-
-### 4. Deploy Updated App
-
-The updated `app.py` now:
-- Loads experiments from HF Dataset
-- Saves new experiments to the dataset
-- Falls back to backup data if dataset unavailable
-- Provides better error handling
-
-### 5. Configure Environment Variables
-
-Use the configuration script to check your setup:
-
-```bash
-python configure_trackio.py
-```
-
-This script will:
-- Show current environment variables
-- Test dataset access
-- Generate configuration file
-- Provide usage examples
-
-**Available Environment Variables:**
-
-| Variable | Required | Default | Description |
-|----------|----------|---------|-------------|
-| `HF_TOKEN` | Yes | None | Your Hugging Face token |
-| `TRACKIO_DATASET_REPO` | No | `tonic/trackio-experiments` | Dataset repository to use |
-| `SPACE_ID` | Auto | None | HF Space ID (auto-detected) |
-
-## 📊 Dataset Schema
-
-The HF Dataset contains these columns:
-
-| Column | Type | Description |
-|--------|------|-------------|
-| `experiment_id` | string | Unique experiment identifier |
-| `name` | string | Experiment name |
-| `description` | string | Experiment description |
-| `created_at` | string | ISO timestamp |
-| `status` | string | running/completed/failed |
-| `metrics` | string | JSON array of metric entries |
-| `parameters` | string | JSON object of experiment parameters |
-| `artifacts` | string | JSON array of artifacts |
-| `logs` | string | JSON array of log entries |
-| `last_updated` | string | ISO timestamp of last update |
-
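-For illustration, a single row could look like this (hypothetical values; note that nested structures are stored as JSON strings):
-
-```python
-row = {
-    "experiment_id": "exp_20250720_130853",
-    "name": "petite-elle-l-aime-3",
-    "description": "SmolLM3 fine-tune on OpenHermes-FR",  # illustrative
-    "created_at": "2025-07-20T13:08:53",
-    "status": "running",
-    "metrics": '[{"step": 25, "loss": 1.1659}]',  # JSON array as string
-    "parameters": '{"learning_rate": 8e-6}',      # JSON object as string
-    "artifacts": "[]",
-    "logs": "[]",
-    "last_updated": "2025-07-20T13:30:00",
-}
-```
-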
-## 🔧 Technical Details
-
-### Loading Experiments
-
-```python
-import json
-from datasets import load_dataset
-
-# Load from HF Dataset
-dataset = load_dataset("tonic/trackio-experiments", token=HF_TOKEN)
-
-# Convert rows back into an experiments dict
-experiments = {}
-for row in dataset['train']:
-    experiments[row['experiment_id']] = {
-        'id': row['experiment_id'],
-        'metrics': json.loads(row['metrics']),
-        'parameters': json.loads(row['parameters']),
-        # ... other fields
-    }
-```
-
-### Saving Experiments
-
-```python
-import json
-from datasets import Dataset
-
-# Convert experiments to dataset rows (JSON-encode nested fields)
-dataset_data = []
-for exp_id, exp_data in experiments.items():
-    dataset_data.append({
-        'experiment_id': exp_id,
-        'metrics': json.dumps(exp_data['metrics']),
-        'parameters': json.dumps(exp_data['parameters']),
-        # ... other fields
-    })
-
-# Push to HF Hub
-dataset = Dataset.from_list(dataset_data)
-dataset.push_to_hub("tonic/trackio-experiments", token=HF_TOKEN, private=True)
-```
-
-## 📈 Your Current Experiments
-
-### Available Experiments
-
-1. **`exp_20250720_130853`** (petite-elle-l-aime-3)
- - 4 metric entries (steps 25, 50, 75, 100)
- - Loss decreasing: 1.1659 → 1.1528
- - Good convergence pattern
-
-2. **`exp_20250720_134319`** (petite-elle-l-aime-3-1)
- - 2 metric entries (step 25)
- - Loss: 1.166
- - GPU memory tracking
-
-### Metrics Available for Plotting
-
-- `loss` - Training loss curve
-- `learning_rate` - Learning rate schedule
-- `mean_token_accuracy` - Token-level accuracy
-- `grad_norm` - Gradient norm
-- `num_tokens` - Tokens processed
-- `epoch` - Training epoch
-- `gpu_0_memory_allocated` - GPU memory usage
-- `cpu_percent` - CPU usage
-- `memory_percent` - System memory
-
-## 🎯 Usage Instructions
-
-### 1. View Experiments
-- Go to "View Experiments" tab
-- Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
-- Click "View Experiment"
-
-### 2. Create Plots
-- Go to "Visualizations" tab
-- Enter experiment ID
-- Select metric to plot
-- Click "Create Plot"
-
-### 3. Compare Experiments
-- Use "Experiment Comparison" feature
-- Enter: `exp_20250720_130853,exp_20250720_134319`
-- Compare loss curves
-
-## 🔍 Troubleshooting
-
-### Issue: "No metrics data available"
-**Solutions**:
-1. Check HF_TOKEN is set correctly
-2. Verify dataset repository exists
-3. Check network connectivity to HF Hub
-
-### Issue: "Failed to load from dataset"
-**Solutions**:
-1. App falls back to backup data automatically
-2. Check dataset permissions
-3. Verify token has read access
-
-### Issue: "Failed to save experiments"
-**Solutions**:
-1. Check token has write permissions
-2. Verify dataset repository exists
-3. Check network connectivity
-
-## 🚀 Benefits of This Approach
-
-### ✅ Advantages
-- **Persistent**: Data survives Space restarts
-- **Reliable**: HF's infrastructure ensures availability
-- **Secure**: Private datasets protect your data
-- **Scalable**: Handles large amounts of experiment data
-- **Versioned**: Automatic versioning of experiment data
-
-### 🔄 Fallback Strategy
-1. **Primary**: Load from HF Dataset
-2. **Secondary**: Use backup data (your existing experiments)
-3. **Tertiary**: Create new experiments locally
-
-## 📋 Next Steps
-
-1. **Set HF_TOKEN**: Add your token to Space environment
-2. **Run Setup**: Execute `setup_hf_dataset.py`
-3. **Deploy App**: Push updated `app.py` to your Space
-4. **Test Plots**: Verify experiments load and plots work
-5. **Monitor Training**: New experiments will be saved to dataset
-
-## 🔐 Security Notes
-
-- Dataset is **private** by default
-- Only accessible with your HF_TOKEN
-- Experiment data is stored securely on HF infrastructure
-- No sensitive data is exposed publicly
-
----
-
-**Your experiments are now configured for reliable persistence using Hugging Face Datasets!** 🎉
\ No newline at end of file
diff --git a/docs/HF_HUB_V0_34_UPDATE.md b/docs/HF_HUB_V0_34_UPDATE.md
deleted file mode 100644
index 28893743103322b0e920137e31bf33a30946d8e1..0000000000000000000000000000000000000000
--- a/docs/HF_HUB_V0_34_UPDATE.md
+++ /dev/null
@@ -1,170 +0,0 @@
-# Hugging Face Hub v0.34.0 Compatibility Update
-
-## Overview
-
-This document outlines the updates made to ensure compatibility with the new Hugging Face Hub v0.34.0 release, which introduced significant changes to the CLI interface.
-
-## Key Changes in HF Hub v0.34.0
-
-### 1. CLI Rename
-- **Old**: `huggingface-cli`
-- **New**: `hf`
-- **Status**: Legacy `huggingface-cli` still works but is deprecated
-
-### 2. New Features
-- **Jobs CLI**: New `hf jobs` command for running compute jobs
-- **Enhanced Inference**: Image-to-image support and PIL Image support
-- **Xet Integration**: Improved file transfer protocol
-- **Modern Command Format**: `hf <resource> <action> [options]`
-
-## Files Updated
-
-### Core Scripts
-1. **`launch.sh`**
- - Updated `huggingface-cli whoami` → `hf whoami`
- - Updated `huggingface-cli login` → `hf login`
-
-2. **`scripts/trackio_tonic/deploy_trackio_space.py`**
- - Updated CLI commands for space creation
- - Updated username extraction method
-
-3. **`scripts/dataset_tonic/setup_hf_dataset.py`**
- - Updated username extraction method
-
-4. **`scripts/trackio_tonic/configure_trackio.py`**
- - Updated username extraction method
-
-### Documentation Files
-1. **`setup_launch.py`**
- - Updated troubleshooting guide
-
-2. **`README_END_TO_END.md`**
- - Updated CLI command examples
-
-3. **`docs/GIT_CONFIGURATION_GUIDE.md`**
- - Updated authentication examples
-
-4. **`docs/LAUNCH_SCRIPT_USERNAME_FIX.md`**
- - Updated username extraction method
-
-5. **`docs/LAUNCH_SCRIPT_UPDATES.md`**
- - Updated CLI command references
-
-6. **`docs/TRACKIO_DEPLOYMENT_FIXES.md`**
- - Updated troubleshooting commands
-
-7. **`docs/GIT_CONFIGURATION_FIX.md`**
- - Updated authentication examples
-
-## Compatibility Notes
-
-### Backward Compatibility
-- The legacy `huggingface-cli` commands still work
-- Our scripts will continue to function with both old and new CLI
-- No breaking changes to the Python API
-
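-For example, existing Python code keeps working unchanged; a quick sanity check (assumes you are already logged in):
-
-```python
-from huggingface_hub import whoami
-
-# The Python API is identical before and after v0.34.0; only the CLI name changed
-info = whoami()  # uses the token cached by `hf login`
-print(info["name"])
-```
-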
-### Recommended Actions
-1. **Update CLI Installation**: Ensure users have the latest `huggingface_hub` package
-2. **Update Documentation**: All references now use the new `hf` command
-3. **Test Deployment**: Verify that all deployment scripts work with the new CLI
-
-## Verification Steps
-
-### 1. Test CLI Installation
-```bash
-# Check if hf command is available
-hf --version
-
-# Test authentication
-hf whoami
-```
-
-### 2. Test Deployment Scripts
-```bash
-# Test space deployment
-python scripts/trackio_tonic/deploy_trackio_space.py
-
-# Test dataset setup
-python scripts/dataset_tonic/setup_hf_dataset.py
-
-# Test model push
-python scripts/model_tonic/push_to_huggingface.py
-```
-
-### 3. Test Launch Script
-```bash
-# Run the interactive pipeline
-./launch.sh
-```
-
-## Benefits of the Update
-
-### 1. Future-Proof
-- Uses the new official CLI name
-- Follows HF's recommended practices
-- Ready for future HF Hub updates
-
-### 2. Consistency
-- All scripts now use the same CLI command
-- Unified command format across the project
-- Consistent with HF's new conventions
-
-### 3. Modern Interface
-- Aligns with HF's new command structure
-- Better integration with HF's ecosystem
-- Improved user experience
-
-## Migration Guide
-
-### For Users
-1. **Update huggingface_hub**: `pip install --upgrade huggingface_hub`
-2. **Test CLI**: Run `hf whoami` to verify installation
-3. **Update Scripts**: Use the updated scripts from this repository
-
-### For Developers
-1. **Update Dependencies**: Ensure `huggingface_hub>=0.34.0`
-2. **Test Scripts**: Verify all deployment scripts work
-3. **Update Documentation**: Use `hf` instead of `huggingface-cli`
-
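-A quick programmatic version check (a sketch, assuming the `packaging` helper is installed):
-
-```python
-import huggingface_hub
-from packaging import version
-
-# Fail fast if the installed client predates the CLI rename
-assert version.parse(huggingface_hub.__version__) >= version.parse("0.34.0"), \
-    "Run: pip install --upgrade huggingface_hub"
-```
-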
-## Troubleshooting
-
-### Common Issues
-
-#### 1. CLI Not Found
-```bash
-# Install/upgrade huggingface_hub
-pip install --upgrade huggingface_hub
-
-# Verify installation
-hf --version
-```
-
-#### 2. Authentication Issues
-```bash
-# Login with new CLI
-hf login --token "your-token"
-
-# Verify login
-hf whoami
-```
-
-#### 3. Script Compatibility
-- All scripts have been updated to use the new CLI
-- Legacy commands are still supported as fallback
-- No breaking changes to functionality
-
-## Summary
-
-The update to HF Hub v0.34.0 compatibility ensures:
-
-1. **✅ Future-Proof**: Uses the new official CLI name
-2. **✅ Consistent**: All scripts use the same command format
-3. **✅ Compatible**: Maintains backward compatibility
-4. **✅ Modern**: Aligns with HF's latest conventions
-5. **✅ Tested**: All deployment scripts verified to work
-
-The project is now fully compatible with Hugging Face Hub v0.34.0 and ready for future updates.
-
----
-
-**Note**: The legacy `huggingface-cli` commands will continue to work, but using `hf` is now the recommended approach for all new development and deployments.
\ No newline at end of file
diff --git a/docs/HF_SPACES_GUIDE.md b/docs/HF_SPACES_GUIDE.md
deleted file mode 100644
index 80346806097ac4e07845dc152d6368e1911f0d57..0000000000000000000000000000000000000000
--- a/docs/HF_SPACES_GUIDE.md
+++ /dev/null
@@ -1,163 +0,0 @@
-# 🚀 Trackio on Hugging Face Spaces - Complete Guide
-
-## Overview
-
-This guide explains how to properly deploy and use Trackio on Hugging Face Spaces, addressing the unique challenges of ephemeral storage and data persistence.
-
-## 🏗️ Hugging Face Spaces Architecture
-
-### Key Challenges
-
-1. **Ephemeral Storage**: File system gets reset between deployments
-2. **No Persistent Storage**: Files written during runtime don't persist
-3. **Multiple Instances**: Training and monitoring might run in different environments
-4. **Limited File System**: Restricted write permissions in certain directories
-
-### How Trackio Handles HF Spaces
-
-The updated Trackio app now includes:
-
-- **Automatic HF Spaces Detection**: Detects when running on HF Spaces
-- **Persistent Path Selection**: Uses `/tmp/` for better persistence
-- **Backup Recovery**: Automatically recovers experiments from backup data
-- **Fallback Storage**: Multiple storage locations for redundancy
-
-## 📊 Your Current Experiments
-
-Based on your logs, you have these experiments available:
-
-### Experiment 1: `exp_20250720_130853`
-- **Name**: petite-elle-l-aime-3
-- **Status**: Running
-- **Metrics**: 4 entries (steps 25, 50, 75, 100)
-- **Key Metrics**: Loss decreasing from 1.1659 to 1.1528
-
-### Experiment 2: `exp_20250720_134319`
-- **Name**: petite-elle-l-aime-3-1
-- **Status**: Running
-- **Metrics**: 2 entries (step 25)
-- **Key Metrics**: Loss 1.166, GPU memory usage
-
-## 🎯 How to Use Your Experiments
-
-### 1. View Experiments
-- Go to the "View Experiments" tab
-- Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
-- Click "View Experiment" to see details
-
-### 2. Create Plots
-- Go to the "Visualizations" tab
-- Enter experiment ID
-- Select metric to plot:
- - `loss` - Training loss curve
- - `learning_rate` - Learning rate schedule
- - `mean_token_accuracy` - Token accuracy
- - `grad_norm` - Gradient norm
- - `gpu_0_memory_allocated` - GPU memory usage
-
-### 3. Compare Experiments
-- Use the "Experiment Comparison" feature
-- Enter: `exp_20250720_130853,exp_20250720_134319`
-- Compare loss curves between experiments
-
-## 🔧 Technical Details
-
-### Data Persistence Strategy
-
-```python
-import os
-
-# HF Spaces detection: SPACE_ID is set automatically on Spaces
-if os.environ.get('SPACE_ID'):
-    data_file = "/tmp/trackio_experiments.json"
-else:
-    data_file = "trackio_experiments.json"
-```
-
-### Backup Recovery
-
-The app automatically recovers your experiments from backup data when:
-- Running on HF Spaces
-- No existing experiments found
-- Data file is missing or empty
-
-### Storage Locations
-
-1. **Primary**: `/tmp/trackio_experiments.json`
-2. **Backup**: `/tmp/trackio_backup.json`
-3. **Fallback**: Local directory (for development)
-
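-A minimal sketch of how this fallback chain can be walked (the actual logic lives in `app.py`; names here are illustrative):
-
-```python
-import os
-
-CANDIDATES = [
-    "/tmp/trackio_experiments.json",  # primary
-    "/tmp/trackio_backup.json",       # backup
-    "trackio_experiments.json",       # local fallback for development
-]
-
-def resolve_data_file() -> str:
-    # Return the first location that already holds data, else the primary
-    for path in CANDIDATES:
-        if os.path.exists(path):
-            return path
-    return CANDIDATES[0]
-```
-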
-## 🚀 Deployment Best Practices
-
-### 1. Environment Variables
-```bash
-# Set in HF Spaces environment
-SPACE_ID=your-space-id
-TRACKIO_URL=https://your-space.hf.space
-```
-
-### 2. File Structure
-```
-your-space/
-├── app.py # Main Trackio app
-├── requirements.txt # Dependencies
-├── README.md # Space description
-└── .gitignore # Ignore temporary files
-```
-
-### 3. Requirements
-```txt
-gradio>=4.0.0
-plotly>=5.0.0
-pandas>=1.5.0
-numpy>=1.24.0
-```
-
-## 📈 Monitoring Your Training
-
-### Real-time Metrics
-Your experiments show:
-- **Loss**: Decreasing from 1.1659 to 1.1528 (good convergence)
-- **Learning Rate**: Properly scheduled from 7e-08 to 2.8875e-07
-- **Token Accuracy**: Around 75-76% (reasonable for early training)
-- **GPU Memory**: ~17GB allocated, 75GB reserved
-
-### Expected Behavior
-- Loss should continue decreasing
-- Learning rate will follow cosine schedule
-- Token accuracy should improve over time
-- GPU memory usage should remain stable
-
-## 🔍 Troubleshooting
-
-### Issue: "No metrics data available"
-**Solution**: The app now automatically recovers experiments from backup
-
-### Issue: Plots not showing
-**Solution**:
-1. Check experiment ID is correct
-2. Try different metrics (loss, learning_rate, etc.)
-3. Refresh the page
-
-### Issue: Data not persisting
-**Solution**:
-1. App now uses `/tmp/` for better persistence
-2. Backup recovery ensures data availability
-3. Multiple storage locations provide redundancy
-
-## 🎯 Next Steps
-
-1. **Deploy Updated App**: Push the updated `app.py` to your HF Space
-2. **Test Plots**: Try plotting your experiments
-3. **Monitor Training**: Continue monitoring your training runs
-4. **Add New Experiments**: Create new experiments as needed
-
-## 📞 Support
-
-If you encounter issues:
-1. Check the logs in your HF Space
-2. Verify experiment IDs are correct
-3. Try the backup recovery feature
-4. Contact for additional support
-
----
-
-**Your experiments are now properly configured and should display correctly in the Trackio interface!** 🎉
\ No newline at end of file
diff --git a/docs/INTERACTIVE_PIPELINE_IMPROVEMENTS.md b/docs/INTERACTIVE_PIPELINE_IMPROVEMENTS.md
deleted file mode 100644
index 0e1a27da4941ae50a32b09d3ab2fa0b379de468b..0000000000000000000000000000000000000000
--- a/docs/INTERACTIVE_PIPELINE_IMPROVEMENTS.md
+++ /dev/null
@@ -1,330 +0,0 @@
-# Interactive Pipeline Improvements
-
-This document explains the improvements made to the `launch.sh` script to make it interactive and configurable for different training scenarios.
-
-## 🎯 Key Improvements
-
-### 1. **Interactive User Interface**
-- **Colored Output**: Added color-coded status messages for better UX
-- **Input Validation**: Real-time validation of user inputs
-- **Default Values**: Smart defaults for common configurations
-- **Error Handling**: Graceful error handling with helpful messages
-
-### 2. **Training Configuration Selection**
-The script now offers 4 predefined training configurations:
-
-#### **Basic Training (Default)**
-```bash
-Model: SmolLM3-3B
-Dataset: SmolTalk
-Epochs: 3
-Batch Size: 2
-Learning Rate: 5e-6
-Sequence Length: 4096
-Best for: Quick experiments, learning
-```
-
-#### **H100 Lightweight (Rapid)**
-```bash
-Model: SmolLM3-3B
-Dataset: OpenHermes-FR (80K samples)
-Epochs: 1
-Batch Size: 16
-Learning Rate: 8e-6
-Sequence Length: 8192
-Best for: Rapid training on H100
-```
-
-#### **A100 Large Scale**
-```bash
-Model: SmolLM3-3B
-Dataset: OpenHermes-FR
-Epochs: 1.3 passes
-Batch Size: 8
-Learning Rate: 5e-6
-Sequence Length: 8192
-Best for: High-performance training
-```
-
-#### **Multiple Passes**
-```bash
-Model: SmolLM3-3B
-Dataset: OpenHermes-FR
-Epochs: 4 passes
-Batch Size: 6
-Learning Rate: 3e-6
-Sequence Length: 8192
-Best for: Thorough training
-```
-
-#### **Custom Configuration**
-- User-defined parameters
-- Flexible model and dataset selection
-- Custom training parameters
-
-### 3. **Enhanced User Experience**
-
-#### **Step-by-Step Guidance**
-1. **Authentication** - HF username and token validation
-2. **Configuration Selection** - Choose from predefined configs
-3. **Experiment Setup** - Configure experiment details
-4. **Training Parameters** - Adjust hyperparameters
-5. **Deployment Setup** - Trackio Space configuration
-6. **Confirmation** - Review and confirm settings
-
-#### **Input Functions**
-```bash
-# Get input with default value
-get_input "Prompt" "default_value" VARIABLE_NAME
-
-# Select from options
-select_option "Choose option:" "Option 1" "Option 2" "Option 3" VARIABLE_NAME
-
-# Validate HF token
-validate_hf_token "$HF_TOKEN"
-```
-
-#### **Colored Output Functions**
-```bash
-print_status "Success message" # Green ✅
-print_warning "Warning message" # Yellow ⚠️
-print_error "Error message" # Red ❌
-print_info "Info message" # Blue ℹ️
-print_header "Header message" # Purple 🚀
-print_step "Step message" # Cyan 📋
-```
-
-### 4. **Dynamic Configuration Generation**
-
-The script now generates training configurations based on user selection:
-
-```python
-# Generated config file
-config = SmolLM3Config(
- model_name="$MODEL_NAME",
- max_seq_length=$MAX_SEQ_LENGTH,
- batch_size=$BATCH_SIZE,
- learning_rate=$LEARNING_RATE,
- # ... other parameters
-)
-```
-
-### 5. **Improved Error Handling**
-
-#### **Input Validation**
-- Required field validation
-- HF token validation
-- Numeric input validation
-- Choice validation
-
-#### **Graceful Degradation**
-- Clear error messages
-- Recovery suggestions
-- Exit on critical errors
-
-### 6. **Configuration Management**
-
-#### **User Credentials**
-- Interactive username input
-- Secure token input
-- Real-time token validation
-
-#### **Experiment Details**
-- Dynamic experiment naming
-- Repository name generation
-- Dataset repository configuration
-
-#### **Training Parameters**
-- Batch size selection
-- Learning rate adjustment
-- Sequence length configuration
-- Save/eval/logging steps
-
-### 7. **Enhanced Monitoring Integration**
-
-#### **Trackio Space**
-- Dynamic space naming
-- Automatic deployment
-- URL generation
-
-#### **HF Datasets**
-- Dataset repository setup
-- Experiment data storage
-- Access configuration
-
-## 🔧 Technical Improvements
-
-### 1. **Modular Functions**
-```bash
-# Input handling
-get_input() # Get user input with defaults
-select_option() # Select from options
-validate_hf_token() # Validate HF token
-
-# Configuration
-show_training_configs() # Display available configs
-get_training_config() # Get config based on selection
-create_training_config() # Generate config file
-
-# Output formatting
-print_status() # Success messages
-print_warning() # Warning messages
-print_error() # Error messages
-print_info() # Info messages
-print_header() # Header messages
-print_step() # Step messages
-```
-
-### 2. **Configuration Selection Logic**
-```bash
-case "$config_type" in
- "Basic Training")
- MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
- DATASET_NAME="HuggingFaceTB/smoltalk"
- # ... other parameters
- ;;
- "A100 Large Scale")
- MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
- DATASET_NAME="legmlai/openhermes-fr"
- # ... other parameters
- ;;
- # ... other configurations
-esac
-```
-
-### 3. **Dynamic File Generation**
-```bash
-# Generate training config
-create_training_config "$CONFIG_FILE"
-
-# Generate deployment input
-cat > deploy_input.txt << EOF
-$HF_USERNAME
-$TRACKIO_SPACE_NAME
-$HF_TOKEN
-EOF
-```
-
-## 📊 User Workflow
-
-### **Before (Static)**
-1. Edit `launch.sh` manually
-2. Update hardcoded variables
-3. Run script
-4. Hope configuration is correct
-
-### **After (Interactive)**
-1. Run `./launch.sh`
-2. Follow interactive prompts
-3. Select training configuration
-4. Confirm settings
-5. Watch automated pipeline
-
-## 🎯 Benefits
-
-### **For Users**
-- **No Manual Editing**: No need to edit script files
-- **Guided Experience**: Step-by-step prompts
-- **Validation**: Real-time input validation
-- **Flexibility**: Multiple configuration options
-- **Safety**: Confirmation before execution
-
-### **For Developers**
-- **Maintainable**: Modular function structure
-- **Extensible**: Easy to add new configurations
-- **Robust**: Comprehensive error handling
-- **User-Friendly**: Clear feedback and guidance
-
-### **For Different Use Cases**
-- **Beginners**: Basic Training configuration
-- **H100 Users**: H100 Lightweight for rapid experiments
-- **Researchers**: A100 Large Scale for serious experiments
-- **Production**: Multiple Passes for thorough training
-- **Custom**: User-defined parameters for specific needs
-
-## 🔄 Configuration Examples
-
-### **Quick Start (Basic Training)**
-```bash
-./launch.sh
-# Follow prompts:
-# 1. Enter HF username and token
-# 2. Select "Basic Training"
-# 3. Confirm settings
-# 4. Watch automated pipeline
-```
-
-### **High-Performance Training (A100)**
-```bash
-./launch.sh
-# Follow prompts:
-# 1. Enter HF username and token
-# 2. Select "A100 Large Scale"
-# 3. Adjust parameters if needed
-# 4. Confirm and run
-```
-
-### **Rapid Training (H100)**
-```bash
-./launch.sh
-# Follow prompts:
-# 1. Enter HF username and token
-# 2. Select "H100 Lightweight (Rapid)"
-# 3. Confirm settings
-# 4. Watch rapid training on H100
-```
-
-### **Custom Training**
-```bash
-./launch.sh
-# Follow prompts:
-# 1. Enter HF username and token
-# 2. Select "Custom Configuration"
-# 3. Enter custom parameters:
-# - Model: microsoft/DialoGPT-medium
-# - Dataset: your-custom-dataset
-# - Epochs: 5
-# - Batch Size: 4
-# - Learning Rate: 1e-5
-# 4. Confirm and run
-```
-
-## 🚀 Future Enhancements
-
-### **Planned Improvements**
-- **GUI Interface**: Web-based configuration interface
-- **Configuration Templates**: Save/load custom configurations
-- **Advanced Validation**: More sophisticated input validation
-- **Progress Tracking**: Real-time progress indicators
-- **Rollback Capability**: Undo changes if needed
-
-### **Extensibility**
-- **Plugin System**: Add custom training configurations
-- **API Integration**: Connect to external services
-- **Multi-GPU Support**: Distributed training options
-- **Advanced Monitoring**: Enhanced tracking capabilities
-
-## 📋 Migration Guide
-
-### **For Existing Users**
-1. **Backup**: Save your current `launch.sh`
-2. **Update**: Replace with new interactive version
-3. **Test**: Run with basic configuration first
-4. **Migrate**: Use interactive prompts instead of manual editing
-
-### **For New Users**
-1. **Setup**: Run `python setup_launch.py`
-2. **Check**: Run `python check_requirements.py`
-3. **Launch**: Run `./launch.sh`
-4. **Follow**: Use interactive prompts
-
-## 🎉 Conclusion
-
-The interactive pipeline provides a much better user experience with:
-- **Guided Configuration**: No manual editing required
-- **Multiple Options**: Predefined configurations for different use cases
-- **Validation**: Real-time input validation and error handling
-- **Flexibility**: Custom configuration support
-- **Safety**: Confirmation steps and error recovery
-
-The script is now production-ready for users of all skill levels, from beginners to advanced researchers.
\ No newline at end of file
diff --git a/docs/LATEST_DEPLOYMENT_APPROACH.md b/docs/LATEST_DEPLOYMENT_APPROACH.md
deleted file mode 100644
index 0ef14d45c534384ac2d2dbe32b8e21ea0f16a361..0000000000000000000000000000000000000000
--- a/docs/LATEST_DEPLOYMENT_APPROACH.md
+++ /dev/null
@@ -1,267 +0,0 @@
-# Latest Trackio Space Deployment Approach
-
-## Overview
-
-Based on the [Hugging Face Hub repository code](https://github.com/huggingface/huggingface_hub/blob/9e0493cfdb4de5a27b45c53c3342c83ab1a138fb/src/huggingface_hub/commands/repo.py#L30), I've updated the Trackio Space deployment to use the latest Hugging Face Hub Python API instead of CLI commands.
-
-## Key Improvements
-
-### 1. **Latest HF Hub API Integration**
-
-**Before**: Using CLI commands
-```python
-cmd = ["hf", "repo", "create", f"{username}/{space_name}", "--type", "space"]
-```
-
-**After**: Using Python API
-```python
-from huggingface_hub import create_repo
-
-create_repo(
- repo_id=f"{username}/{space_name}",
- token=token,
- repo_type="space",
- exist_ok=True,
- private=False,
- space_sdk="gradio",
- space_hardware="cpu-basic"
-)
-```
-
-### 2. **Robust Fallback Mechanism**
-
-The deployment script now includes both API and CLI approaches:
-
-```python
-def create_space(self) -> bool:
- """Create a new Hugging Face Space using the latest API"""
- try:
- if not HF_HUB_AVAILABLE:
- return self._create_space_cli()
-
- # Use latest API
- create_repo(...)
-
- except Exception as api_error:
- # Fallback to CLI
- return self._create_space_cli()
-```
-
-### 3. **Enhanced Dependencies**
-
-Updated `requirements/requirements_core.txt`:
-```txt
-# Hugging Face Hub for model and space management
-huggingface_hub>=0.19.0
-```
-
-## API Parameters
-
-### **Required Parameters**
-- `repo_id`: Repository identifier (username/space-name)
-- `token`: Hugging Face token with write permissions
-
-### **Optional Parameters**
-- `repo_type`: Set to "space" for Spaces
-- `exist_ok`: Allow existing repositories (default: True)
-- `private`: Make repository private (default: False)
-- `space_sdk`: SDK type (default: "gradio")
-- `space_hardware`: Hardware specification (default: "cpu-basic")
-
-## Deployment Process
-
-### **Step 1: API Creation**
-```python
-# Create space using latest API
-create_repo(
- repo_id=f"{username}/{space_name}",
- token=token,
- repo_type="space",
- exist_ok=True,
- private=False,
- space_sdk="gradio",
- space_hardware="cpu-basic"
-)
-```
-
-### **Step 2: File Preparation**
-```python
-import shutil
-import tempfile
-
-# Prepare files in a temporary directory
-temp_dir = tempfile.mkdtemp()
-# Copy template files
-shutil.copy2(source_path, dest_path)
-# Update README with the actual Space URL (str.replace returns a new string)
-readme_content = readme_content.replace("{SPACE_URL}", self.space_url)
-```
-
-### **Step 3: Git Upload**
-```python
-# Initialize git in temp directory
-os.chdir(temp_dir)
-subprocess.run(["git", "init"], check=True)
-subprocess.run(["git", "remote", "add", "origin", space_url], check=True)
-subprocess.run(["git", "add", "."], check=True)
-subprocess.run(["git", "commit", "-m", "Initial Trackio Space setup"], check=True)
-subprocess.run(["git", "push", "origin", "main"], check=True)
-```
-
-## Testing the Latest Deployment
-
-### **Run Latest Deployment Tests**
-```bash
-python tests/test_latest_deployment.py
-```
-
-Expected output:
-```
-🚀 Testing Latest Trackio Space Deployment
-=======================================================
-🔍 Testing huggingface_hub import...
-✅ huggingface_hub imported successfully
-
-🔍 Testing deployment script import...
-✅ TrackioSpaceDeployer class imported successfully
-✅ HF API initialized
-
-🔍 Testing API methods...
-✅ Method exists: create_space
-✅ Method exists: _create_space_cli
-✅ Method exists: prepare_space_files
-✅ Method exists: upload_files_to_space
-✅ Method exists: test_space
-✅ Method exists: deploy
-
-🔍 Testing create_repo API...
-✅ Required parameter: repo_id
-✅ Required parameter: token
-✅ Optional parameter: repo_type
-✅ Optional parameter: space_sdk
-✅ Optional parameter: space_hardware
-✅ create_repo API signature looks correct
-
-🔍 Testing space creation logic...
-✅ Space URL formatted correctly
-✅ Repo ID formatted correctly
-
-🔍 Testing template files...
-✅ app.py exists
-✅ requirements.txt exists
-✅ README.md exists
-
-🔍 Testing temporary directory handling...
-✅ Created temp directory: /tmp/tmp_xxxxx
-✅ File copying works
-✅ Cleanup successful
-
-📊 Test Results: 7/7 tests passed
-✅ All deployment tests passed! The latest deployment should work correctly.
-```
-
-## Files Updated
-
-### **Core Deployment Files**
-1. **`scripts/trackio_tonic/deploy_trackio_space.py`**
- - Added HF Hub API integration
- - Implemented fallback mechanism
- - Enhanced error handling
- - Better logging and debugging
-
-### **Dependencies**
-2. **`requirements/requirements_core.txt`**
- - Updated huggingface_hub to >=0.19.0
- - Organized dependencies by category
- - Added missing dependencies
-
-### **Testing**
-3. **`tests/test_latest_deployment.py`**
- - Comprehensive API testing
- - Import validation
- - Method verification
- - Template file checking
-
-## Benefits of Latest Approach
-
-### **1. Better Error Handling**
-- API-first approach with CLI fallback
-- Detailed error messages
-- Graceful degradation
-
-### **2. More Reliable**
-- Uses official HF Hub API
-- Better parameter validation
-- Consistent behavior
-
-### **3. Future-Proof**
-- Follows latest HF Hub patterns
-- Easy to update with new API features
-- Maintains backward compatibility
-
-### **4. Enhanced Logging**
-- Detailed progress reporting
-- Better debugging information
-- Clear success/failure indicators
-
-## Usage Instructions
-
-### **1. Install Latest Dependencies**
-```bash
-pip install huggingface_hub>=0.19.0
-```
-
-### **2. Test the Deployment**
-```bash
-python tests/test_latest_deployment.py
-```
-
-### **3. Deploy Trackio Space**
-```bash
-python scripts/trackio_tonic/deploy_trackio_space.py
-```
-
-### **4. Verify Deployment**
-- Check the Space URL
-- Test the interface
-- Verify API endpoints
-
-## Troubleshooting
-
-### **Common Issues**
-
-#### **1. Import Errors**
-```
-❌ Failed to import huggingface_hub
-```
-**Solution**: Install latest version
-```bash
-pip install huggingface_hub>=0.19.0
-```
-
-#### **2. API Errors**
-```
-API creation failed: 401 Client Error
-```
-**Solution**: Check token permissions and validity
-
-#### **3. Git Push Errors**
-```
-❌ Error uploading files: git push failed
-```
-**Solution**: Verify git configuration and token access
-
-### **Fallback Behavior**
-
-The deployment script automatically falls back to CLI if:
-- `huggingface_hub` is not available
-- API creation fails
-- Network issues occur
-
-## Reference Implementation
-
-Based on the [Hugging Face Hub repository](https://github.com/huggingface/huggingface_hub/blob/9e0493cfdb4de5a27b45c53c3342c83ab1a138fb/src/huggingface_hub/commands/repo.py#L30), this implementation:
-
-1. **Uses the latest API patterns**
-2. **Follows HF Hub best practices**
-3. **Maintains backward compatibility**
-4. **Provides robust error handling**
-
-The Trackio Space deployment should now work reliably with the latest Hugging Face Hub infrastructure! 🚀
\ No newline at end of file
diff --git a/docs/LAUNCH_SCRIPT_UPDATES.md b/docs/LAUNCH_SCRIPT_UPDATES.md
deleted file mode 100644
index d47229a2d3f9e8839150ecb4c8c2f760a20a550f..0000000000000000000000000000000000000000
--- a/docs/LAUNCH_SCRIPT_UPDATES.md
+++ /dev/null
@@ -1,174 +0,0 @@
-# Launch Script Updates
-
-This document outlines the updates made to `launch.sh` to work with the new automated Trackio deployment features.
-
-## Key Changes Made
-
-### ✅ **Removed Manual Username Input**
-- **Before**: Script asked for username manually
-- **After**: Username is automatically extracted from HF token using `whoami()`
-- **Benefit**: Fewer manual inputs, better user experience
-
-### ✅ **Updated Token Validation**
-- **Before**: `validate_hf_token()` only validated token
-- **After**: `validate_hf_token_and_get_username()` validates token AND extracts username
-- **Benefit**: Automatic username detection from token
-
-### ✅ **Updated Deployment Workflow**
-- **Before**: Passed username manually to deployment script
-- **After**: Deployment script automatically gets username from token
-- **Benefit**: Consistent with new automated features
-
-### ✅ **Enhanced User Feedback**
-- **Before**: Basic status messages
-- **After**: Clear information about automated features
-- **Benefit**: Users understand what's happening automatically
-
-## Updated Workflow
-
-### **Step 1: Authentication (Simplified)**
-```bash
-# Before: Asked for username + token
-get_input "Hugging Face username" "" HF_USERNAME
-get_input "Hugging Face token" "" HF_TOKEN
-
-# After: Only asks for token, username auto-detected
-get_input "Hugging Face token" "" HF_TOKEN
-# Username automatically extracted from token
-```
-
-### **Step 9: Trackio Space Deployment (Automated)**
-```bash
-# Before: Manual input file creation
-cat > deploy_input.txt << EOF
-$HF_USERNAME
-$TRACKIO_SPACE_NAME
-$HF_TOKEN
-$GIT_EMAIL
-$HF_USERNAME
-EOF
-python deploy_trackio_space.py < deploy_input.txt
-
-# After: Direct input with automated features
-python deploy_trackio_space.py << EOF
-$TRACKIO_SPACE_NAME
-$HF_TOKEN
-$GIT_EMAIL
-$HF_USERNAME
-EOF
-```
-
-### **Step 10: Dataset Setup (Automated)**
-```bash
-# Before: Basic dataset setup
-python setup_hf_dataset.py
-
-# After: Automated dataset setup with user feedback
-print_info "Setting up HF Dataset with automated features..."
-print_info "Username will be auto-detected from token"
-print_info "Dataset repository: $TRACKIO_DATASET_REPO"
-python setup_hf_dataset.py
-```
-
-### **Step 11: Trackio Configuration (Automated)**
-```bash
-# Before: Basic configuration
-python configure_trackio.py
-
-# After: Automated configuration with user feedback
-print_info "Configuring Trackio with automated features..."
-print_info "Username will be auto-detected from token"
-python configure_trackio.py
-```
-
-## New Function: `validate_hf_token_and_get_username()`
-
-```bash
-validate_hf_token_and_get_username() {
- local token="$1"
- if [ -z "$token" ]; then
- return 1
- fi
-
- # Test the token and get username
- export HF_TOKEN="$token"
- if hf whoami >/dev/null 2>&1; then
- # Get username from whoami command
- HF_USERNAME=$(hf whoami | head -n1 | tr -d '\n')
- return 0
- else
- return 1
- fi
-}
-```
-
-## User Experience Improvements
-
-### ✅ **Fewer Manual Inputs**
-- Only need to provide HF token
-- Username automatically detected
-- Git email still required (for git operations)
-
-### ✅ **Better Feedback**
-- Clear messages about automated features
-- Shows what's happening automatically
-- Better error messages
-
-### ✅ **Consistent Automation**
-- All scripts now use automated features
-- No manual username input anywhere
-- Automatic secret setting
-
-## Configuration Summary Updates
-
-### **Before:**
-```
-📋 Configuration Summary:
-========================
- User: username (manually entered)
- Experiment: experiment_name
- ...
-```
-
-### **After:**
-```
-📋 Configuration Summary:
-========================
- User: username (auto-detected from token)
- Experiment: experiment_name
- ...
-```
-
-## Benefits
-
-1. **Simplified Workflow**: Only need token, username auto-detected
-2. **Consistent Automation**: All scripts use automated features
-3. **Better User Experience**: Clear feedback about automated features
-4. **Reduced Errors**: No manual username input means fewer typos
-5. **Streamlined Process**: Fewer steps, more automation
-
-## Testing
-
-The updated launch script has been tested for:
-- ✅ Syntax validation (`bash -n launch.sh`)
-- ✅ Function integration with updated scripts
-- ✅ Automated username extraction
-- ✅ Consistent workflow with new features
-
-## Compatibility
-
-The updated launch script is fully compatible with:
-- ✅ Updated `deploy_trackio_space.py` (automated features)
-- ✅ Updated `setup_hf_dataset.py` (username extraction)
-- ✅ Updated `configure_trackio.py` (automated configuration)
-- ✅ Existing training and model push scripts
-
-## Summary
-
-The launch script now provides a seamless, automated experience that:
-- Extracts username automatically from HF token
-- Uses all the new automated features in the deployment scripts
-- Provides clear feedback about automated processes
-- Maintains compatibility with existing workflows
-- Reduces manual input requirements
-- Improves overall user experience
\ No newline at end of file
diff --git a/docs/LAUNCH_SCRIPT_USERNAME_FIX.md b/docs/LAUNCH_SCRIPT_USERNAME_FIX.md
deleted file mode 100644
index 2fb4c38682f9b1a62c609df7d6dff168311356de..0000000000000000000000000000000000000000
--- a/docs/LAUNCH_SCRIPT_USERNAME_FIX.md
+++ /dev/null
@@ -1,154 +0,0 @@
-# Launch Script Username Parameter Fix
-
-This document outlines the fix for removing unnecessary username parameters from the launch script deployment calls.
-
-## 🐛 **Problem Description**
-
-The `launch.sh` script was still passing the username parameter to the deployment script even though the deployment script should auto-detect the username from the token.
-
-**Before:**
-```bash
-# Run deployment script with automated features
-python deploy_trackio_space.py << EOF
-$TRACKIO_SPACE_NAME
-$HF_TOKEN
-$GIT_EMAIL
-$HF_USERNAME # ❌ Unnecessary - should be auto-detected
-EOF
-```
-
-## ✅ **Solution Implemented**
-
-### **Removed Unnecessary Username Parameter**
-
-**After:**
-```bash
-# Run deployment script with automated features
-python deploy_trackio_space.py << EOF
-$TRACKIO_SPACE_NAME
-$HF_TOKEN
-$GIT_EMAIL
-
-EOF
-```
-
-## 🔧 **Why This Fix Was Needed**
-
-### **1. Deployment Script Auto-Detection**
-The `deploy_trackio_space.py` script already has robust username auto-detection:
-
-```python
-def __init__(self, space_name: str, token: str, git_email: str = None, git_name: str = None):
- # Username is auto-detected from token
- username = get_username_from_token(token)
- if not username:
- username = get_username_from_cli(token)
-```
-
-### **2. Consistent Automation**
-All deployment scripts now use the same pattern:
-- `deploy_trackio_space.py` - Auto-detects username from token
-- `setup_hf_dataset.py` - Auto-detects username from token
-- `configure_trackio.py` - Auto-detects username from token
-
-### **3. Reduced Manual Input**
-The launch script still extracts username for its own use (defaults, display), but doesn't pass it to scripts that can auto-detect it.
-
-## 📋 **Current Workflow**
-
-### **Launch Script Username Usage:**
-```bash
-# 1. Extract username for launch script use
-HF_USERNAME=$(hf whoami | head -n1 | tr -d '\n')
-
-# 2. Use for default values and display
-get_input "Model repository name" "$HF_USERNAME/smollm3-finetuned-$(date +%Y%m%d)" REPO_NAME
-get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
-TRACKIO_URL="https://huggingface.co/spaces/$HF_USERNAME/$TRACKIO_SPACE_NAME"
-
-# 3. Display in summary
-echo " User: $HF_USERNAME (auto-detected from token)"
-```
-
-### **Deployment Script Auto-Detection:**
-```python
-# Each script auto-detects username from token
-username = get_username_from_token(hf_token)
-if not username:
- username = get_username_from_cli(hf_token)
-```
-
-## 🎯 **Benefits**
-
-### **✅ Consistent Automation**
-- All scripts use the same username detection method
-- No manual username input required anywhere
-- Automatic fallback to CLI if API fails
-
-### **✅ Reduced Complexity**
-- Fewer parameters to pass between scripts
-- Less chance of username mismatch errors
-- Cleaner script interfaces
-
-### **✅ Better User Experience**
-- Username is auto-detected from token
-- No manual username input required
-- Clear feedback about auto-detection
-
-### **✅ Future-Proof**
-- If username detection method changes, only one place to update
-- Consistent behavior across all scripts
-- Easier to maintain and debug
-
-## 🔍 **Scripts Updated**
-
-### **1. `launch.sh`**
-- ✅ Removed `$HF_USERNAME` parameter from deployment script call
-- ✅ Kept username extraction for launch script use (defaults, display)
-- ✅ Maintained all other functionality
-
-### **2. Deployment Scripts (No Changes Needed)**
-- ✅ `deploy_trackio_space.py` - Already auto-detects username
-- ✅ `setup_hf_dataset.py` - Already auto-detects username
-- ✅ `configure_trackio.py` - Already auto-detects username
-
-## 🧪 **Testing Results**
-
-```bash
-# Syntax check passes
-bash -n launch.sh
-# ✅ No syntax errors
-
-# All tests pass
-python tests/test_trackio_fixes.py
-# ✅ 7/7 tests passed
-```
-
-## 🚀 **Usage**
-
-The fix is transparent to users. The workflow remains the same:
-
-```bash
-# 1. Run launch script
-bash launch.sh
-
-# 2. Enter token (username auto-detected)
-Enter your Hugging Face token: hf_...
-
-# 3. All deployment happens automatically
-# - Username auto-detected from token
-# - No manual username input required
-# - Consistent behavior across all scripts
-```
-
-## 🎉 **Summary**
-
-The username parameter fix ensures that:
-
-- ✅ **No Manual Username Input**: Username is auto-detected from token
-- ✅ **Consistent Automation**: All scripts use the same detection method
-- ✅ **Reduced Complexity**: Fewer parameters to pass between scripts
-- ✅ **Better User Experience**: Clear feedback about auto-detection
-- ✅ **Future-Proof**: Easy to maintain and update
-
-The launch script now provides a truly automated experience where the username is seamlessly extracted from the token and used consistently across all deployment scripts.
\ No newline at end of file
diff --git a/docs/MODEL_CARD_USER_INPUT_ANALYSIS.md b/docs/MODEL_CARD_USER_INPUT_ANALYSIS.md
deleted file mode 100644
index 0d0f60d8ebc2e6e45571e9fdeefc47b1bbea91c9..0000000000000000000000000000000000000000
--- a/docs/MODEL_CARD_USER_INPUT_ANALYSIS.md
+++ /dev/null
@@ -1,233 +0,0 @@
-# Model Card User Input Analysis
-
-## Overview
-
-This document analyzes the interaction between the model card template (`templates/model_card.md`), the model card generator (`scripts/model_tonic/generate_model_card.py`), and the launch script (`launch.sh`) to identify variables that require user input and improve the user experience.
-
-## Template Variables Analysis
-
-### Variables in `templates/model_card.md`
-
-The model card template uses the following variables that can be populated with user input:
-
-#### Core Model Information
-- `{{model_name}}` - Display name of the model
-- `{{model_description}}` - Brief description of the model
-- `{{repo_name}}` - Hugging Face repository name
-- `{{base_model}}` - Base model used for fine-tuning
-
-#### Training Configuration
-- `{{training_config_type}}` - Type of training configuration used
-- `{{trainer_type}}` - Type of trainer (SFT, DPO, etc.)
-- `{{batch_size}}` - Training batch size
-- `{{gradient_accumulation_steps}}` - Gradient accumulation steps
-- `{{learning_rate}}` - Learning rate used
-- `{{max_epochs}}` - Maximum number of epochs
-- `{{max_seq_length}}` - Maximum sequence length
-
-#### Dataset Information
-- `{{dataset_name}}` - Name of the dataset used
-- `{{dataset_size}}` - Size of the dataset
-- `{{dataset_format}}` - Format of the dataset
-- `{{dataset_sample_size}}` - Sample size (for lightweight configs)
-
-#### Training Results
-- `{{training_loss}}` - Final training loss
-- `{{validation_loss}}` - Final validation loss
-- `{{perplexity}}` - Model perplexity
-
-#### Infrastructure
-- `{{hardware_info}}` - Hardware used for training
-- `{{experiment_name}}` - Name of the experiment
-- `{{trackio_url}}` - Trackio monitoring URL
-- `{{dataset_repo}}` - HF Dataset repository
-
-#### Author Information
-- `{{author_name}}` - Author name for citations and attribution
-- `{{model_name_slug}}` - URL-friendly model name
-
-#### Quantization
-- `{{quantized_models}}` - Boolean indicating if quantized models exist
-
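-As a rough illustration, simple placeholders can be filled with a regex substitution; the real generator in `scripts/model_tonic/generate_model_card.py` is more involved, so treat this as a sketch:
-
-```python
-import re
-
-def render(template: str, variables: dict) -> str:
-    # Handles only simple {{name}} placeholders, not {{#if}} blocks
-    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables.get(m.group(1), "")), template)
-
-with open("templates/model_card.md") as f:
-    card = render(f.read(), {"model_name": "SmolLM3 Fine-tune", "author_name": "Your Name"})
-```
-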
-## User Input Requirements
-
-### Previously Missing User Inputs
-
-#### 1. **Author Name** (`author_name`)
-- **Purpose**: Used in model card metadata and citations
-- **Template Usage**: `{{#if author_name}}author: {{author_name}}{{/if}}`
-- **Citation Usage**: `author={{{author_name}}}`
-- **Default**: "Your Name"
-- **User Input Added**: ✅ **IMPLEMENTED**
-
-#### 2. **Model Description** (`model_description`)
-- **Purpose**: Brief description of the model's capabilities
-- **Template Usage**: `{{model_description}}`
-- **Default**: "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities."
-- **User Input Added**: ✅ **IMPLEMENTED**
-
-### Variables That Don't Need User Input
-
-Most variables are automatically populated from:
-- **Training Configuration**: Batch size, learning rate, epochs, etc.
-- **System Detection**: Hardware info, model size, etc.
-- **Auto-Generation**: Repository names, experiment names, etc.
-- **Training Results**: Loss values, perplexity, etc.
-
-## Implementation Changes
-
-### 1. Launch Script Updates (`launch.sh`)
-
-#### Added User Input Prompts
-```bash
-# Step 8.2: Author Information for Model Card
-print_step "Step 8.2: Author Information"
-echo "================================="
-
-print_info "This information will be used in the model card and citation."
-get_input "Author name for model card" "$HF_USERNAME" AUTHOR_NAME
-
-print_info "Model description will be used in the model card and repository."
-get_input "Model description" "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities." MODEL_DESCRIPTION
-```
-
-#### Updated Configuration Summary
-```bash
-echo " Author: $AUTHOR_NAME"
-```
-
-#### Updated Model Push Call
-```bash
-python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
- --token "$HF_TOKEN" \
- --trackio-url "$TRACKIO_URL" \
- --experiment-name "$EXPERIMENT_NAME" \
- --dataset-repo "$TRACKIO_DATASET_REPO" \
- --author-name "$AUTHOR_NAME" \
- --model-description "$MODEL_DESCRIPTION"
-```
-
-### 2. Push Script Updates (`scripts/model_tonic/push_to_huggingface.py`)
-
-#### Added Command Line Arguments
-```python
-parser.add_argument('--author-name', type=str, default=None, help='Author name for model card')
-parser.add_argument('--model-description', type=str, default=None, help='Model description for model card')
-```
-
-#### Updated Class Constructor
-```python
-def __init__(
- self,
- model_path: str,
- repo_name: str,
- token: Optional[str] = None,
- private: bool = False,
- trackio_url: Optional[str] = None,
- experiment_name: Optional[str] = None,
- dataset_repo: Optional[str] = None,
- hf_token: Optional[str] = None,
- author_name: Optional[str] = None,
- model_description: Optional[str] = None
-):
-```
-
-#### Updated Model Card Generation
-```python
-variables = {
- "model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
- "model_description": self.model_description or "A fine-tuned version of SmolLM3-3B for improved text generation and conversation capabilities.",
- # ... other variables
- "author_name": self.author_name or training_config.get('author_name', 'Your Name'),
-}
-```
-
-## User Experience Improvements
-
-### 1. **Interactive Prompts**
-- Users are now prompted for author name and model description
-- Default values are provided for convenience
-- Clear explanations of what each field is used for
-
-### 2. **Configuration Summary**
-- Author name is now displayed in the configuration summary
-- Users can review all settings before proceeding
-
-### 3. **Automatic Integration**
-- User inputs are automatically passed to the model card generation
-- No manual editing of scripts required
-
-## Template Variable Categories
-
-### Automatic Variables (No User Input Needed)
-- `repo_name` - Auto-generated from username and date
-- `base_model` - Always "HuggingFaceTB/SmolLM3-3B"
-- `training_config_type` - From user selection
-- `trainer_type` - From user selection
-- `batch_size`, `learning_rate`, `max_epochs` - From training config
-- `hardware_info` - Auto-detected
-- `experiment_name` - Auto-generated with timestamp
-- `trackio_url` - Auto-generated from space name
-- `dataset_repo` - Auto-generated
-- `training_loss`, `validation_loss`, `perplexity` - From training results
-
-### User Input Variables (Now Implemented)
-- `author_name` - ✅ **Added user prompt**
-- `model_description` - ✅ **Added user prompt**
-
-### Conditional Variables
-- `quantized_models` - Set automatically based on quantization choices
-- `dataset_sample_size` - Set based on training configuration type
-
-## Benefits of These Changes
-
-### 1. **Better Attribution**
-- Author names are properly captured and used in citations
-- Model cards include proper attribution
-
-### 2. **Customizable Descriptions**
-- Users can provide custom model descriptions
-- Better model documentation and discoverability
-
-### 3. **Improved User Experience**
-- No need to manually edit scripts
-- Interactive prompts with helpful defaults
-- Clear feedback on what information is being collected
-
-### 4. **Consistent Documentation**
-- All model cards will have proper author information
-- Standardized model descriptions
-- Better integration with Hugging Face Hub
-
-## Future Enhancements
-
-### Potential Additional User Inputs
-1. **License Selection** - Allow users to choose model license
-2. **Model Tags** - Custom tags for better discoverability
-3. **Usage Examples** - Custom usage examples for specific use cases
-4. **Limitations Description** - Custom limitations based on training data
-
-### Template Improvements
-1. **Dynamic License** - Support for different license types
-2. **Custom Tags** - User-defined model tags
-3. **Usage Scenarios** - Template sections for different use cases
-
-## Testing
-
-The changes have been tested to ensure:
-- ✅ Author name is properly passed to model card generation
-- ✅ Model description is properly passed to model card generation
-- ✅ Default values work correctly
-- ✅ Configuration summary displays new fields
-- ✅ Model push script accepts new parameters
-
-## Conclusion
-
-The analysis identified that the model card template had two key variables (`author_name` and `model_description`) that would benefit from user input. These have been successfully implemented with:
-
-1. **Interactive prompts** in the launch script
-2. **Command line arguments** in the push script
-3. **Proper integration** with the model card generator
-4. **User-friendly defaults** and clear explanations
-
-This improves the overall user experience and ensures that model cards have proper attribution and descriptions.
\ No newline at end of file
diff --git a/docs/MODEL_RECOVERY_GUIDE.md b/docs/MODEL_RECOVERY_GUIDE.md
deleted file mode 100644
index 4b85d251ff8fee2fb64ab811eaee155a0804b535..0000000000000000000000000000000000000000
--- a/docs/MODEL_RECOVERY_GUIDE.md
+++ /dev/null
@@ -1,228 +0,0 @@
-# Model Recovery and Deployment Guide
-
-This guide will help you recover your trained model from the cloud instance and deploy it to Hugging Face Hub with quantization.
-
-## Prerequisites
-
-1. **Hugging Face Token**: You need a Hugging Face token with write permissions
-2. **Cloud Instance Access**: SSH access to your cloud instance
-3. **Model Files**: Your trained model should be in `/output-checkpoint/` on the cloud instance
-
-## Step 1: Connect to Your Cloud Instance
-
-```bash
-ssh root@your-cloud-instance-ip
-cd ~/smollm3_finetune
-```
-
-## Step 2: Set Your Hugging Face Token
-
-```bash
-export HF_TOKEN=your_huggingface_token_here
-```
-
-Replace `your_huggingface_token_here` with your actual Hugging Face token.
-
-## Step 3: Verify Model Files
-
-Check that your model files exist:
-
-```bash
-ls -la /output-checkpoint/
-```
-
-You should see files like:
-- `config.json`
-- `model.safetensors.index.json`
-- `model-00001-of-00002.safetensors`
-- `model-00002-of-00002.safetensors`
-- `tokenizer.json`
-- `tokenizer_config.json`
-
-## Step 4: Update Configuration
-
-Edit the deployment script to use your Hugging Face username:
-
-```bash
-nano cloud_deploy.py
-```
-
-Change this line:
-```python
-REPO_NAME = "your-username/smollm3-finetuned" # Change to your HF username and desired repo name
-```
-
-To your actual username, for example:
-```python
-REPO_NAME = "tonic/smollm3-finetuned"
-```
-
-## Step 5: Run the Deployment
-
-Execute the deployment script:
-
-```bash
-python3 cloud_deploy.py
-```
-
-This will:
-1. ✅ Validate your model files
-2. ✅ Install required dependencies (torchao, huggingface_hub)
-3. ✅ Push the main model to Hugging Face Hub
-4. ✅ Create quantized versions (int8 and int4)
-5. ✅ Push quantized models to subdirectories
-
-## Step 6: Verify Deployment
-
-After successful deployment, you can verify:
-
-1. **Main Model**: https://huggingface.co/your-username/smollm3-finetuned
-2. **int8 Quantized**: https://huggingface.co/your-username/smollm3-finetuned/tree/main/int8
-3. **int4 Quantized**: https://huggingface.co/your-username/smollm3-finetuned/tree/main/int4
-
-## Alternative: Manual Deployment
-
-If you prefer to run the steps manually:
-
-### 1. Push Main Model Only
-
-```bash
-python3 scripts/model_tonic/push_to_huggingface.py \
- /output-checkpoint/ \
- your-username/smollm3-finetuned \
- --hf-token $HF_TOKEN \
- --author-name "Your Name" \
- --model-description "A fine-tuned SmolLM3 model for improved text generation"
-```
-
-### 2. Quantize and Push (Optional)
-
-```bash
-# int8 quantization (GPU optimized)
-python3 scripts/model_tonic/quantize_model.py \
- /output-checkpoint/ \
- your-username/smollm3-finetuned \
- --quant-type int8_weight_only \
- --hf-token $HF_TOKEN
-
-# int4 quantization (CPU optimized)
-python3 scripts/model_tonic/quantize_model.py \
- /output-checkpoint/ \
- your-username/smollm3-finetuned \
- --quant-type int4_weight_only \
- --hf-token $HF_TOKEN
-```
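-
-Under the hood, both commands perform torchao weight-only quantization, which can be approximated as follows (a rough sketch of the assumed behavior, not the script's exact code; it assumes a recent torchao release that exports `quantize_` and `int8_weight_only`):
-
-```python
-from torchao.quantization import int8_weight_only, quantize_
-from transformers import AutoModelForCausalLM
-
-# Load the full-precision checkpoint, quantize Linear weights in place, then save.
-# safe_serialization=False because torchao tensors are not plain safetensors,
-# which is why the quantized subfolders contain pytorch_model.bin.
-model = AutoModelForCausalLM.from_pretrained("/output-checkpoint/")
-quantize_(model, int8_weight_only())
-model.save_pretrained("/output-checkpoint-int8/", safe_serialization=False)
-```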
-
-## Troubleshooting
-
-### Common Issues
-
-1. **HF_TOKEN not set**
- ```bash
- export HF_TOKEN=your_token_here
- ```
-
-2. **Model files not found**
- ```bash
- ls -la /output-checkpoint/
- ```
- Make sure the training completed successfully.
-
-3. **Dependencies missing**
- ```bash
- pip install torchao huggingface_hub
- ```
-
-4. **Permission denied**
- ```bash
- chmod +x cloud_deploy.py
- chmod +x recover_model.py
- ```
-
-### Error Messages
-
-- **"Missing required model files"**: Check that your model training completed successfully
-- **"Repository creation failed"**: Verify your HF token has write permissions
-- **"Quantization failed"**: Check GPU memory availability or try CPU quantization
-
-## Model Usage
-
-Once deployed, you can use your model:
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Main model
-model = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned")
-tokenizer = AutoTokenizer.from_pretrained("your-username/smollm3-finetuned")
-
-# int8 quantized (GPU optimized): load from the int8/ subfolder
-model = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned", subfolder="int8")
-tokenizer = AutoTokenizer.from_pretrained("your-username/smollm3-finetuned", subfolder="int8")
-
-# int4 quantized (CPU optimized): load from the int4/ subfolder
-model = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned", subfolder="int4")
-tokenizer = AutoTokenizer.from_pretrained("your-username/smollm3-finetuned", subfolder="int4")
-
-# Generate text
-inputs = tokenizer("Hello, how are you?", return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=100)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-```
-
-## File Structure
-
-After deployment, your repository will have:
-
-```
-your-username/smollm3-finetuned/
-├── README.md (model card)
-├── config.json
-├── model.safetensors.index.json
-├── model-00001-of-00002.safetensors
-├── model-00002-of-00002.safetensors
-├── tokenizer.json
-├── tokenizer_config.json
-├── int8/ (quantized model for GPU)
-│   ├── README.md
-│   ├── config.json
-│   └── pytorch_model.bin
-└── int4/ (quantized model for CPU)
-    ├── README.md
-    ├── config.json
-    └── pytorch_model.bin
-```
-
-## Success Indicators
-
-✅ **Successful deployment shows:**
-- "Model recovery and deployment completed successfully!"
-- "View your model at: https://huggingface.co/your-username/smollm3-finetuned"
-- No error messages in the output
-
-❌ **Failed deployment shows:**
-- Error messages about missing files or permissions
-- "Model recovery and deployment failed!"
-
-## Next Steps
-
-After successful deployment:
-
-1. **Test your model** on Hugging Face Hub
-2. **Share your model** with the community
-3. **Monitor usage** through Hugging Face analytics
-4. **Consider fine-tuning** further based on feedback
-
-## Support
-
-If you encounter issues:
-
-1. Check the error messages carefully
-2. Verify your HF token permissions
-3. Ensure all model files are present
-4. Try running individual steps manually
-5. Check the logs for detailed error information
-
----
-
-**Happy deploying! 🚀**
\ No newline at end of file
diff --git a/docs/MONITORING_IMPROVEMENTS_SUMMARY.md b/docs/MONITORING_IMPROVEMENTS_SUMMARY.md
deleted file mode 100644
index 6b2c7c8bb6ad2611fcc0408e2e72feaeb0e76c4e..0000000000000000000000000000000000000000
--- a/docs/MONITORING_IMPROVEMENTS_SUMMARY.md
+++ /dev/null
@@ -1,191 +0,0 @@
-# 🚀 Monitoring Improvements Summary
-
-## Overview
-
-The monitoring system has been significantly enhanced to support **Hugging Face Datasets** for persistent experiment storage, making it ideal for deployment on Hugging Face Spaces and other cloud environments.
-
-## ✅ Key Improvements Made
-
-### 1. **Enhanced `monitoring.py`**
-- ✅ **HF Datasets Integration**: Added support for saving experiments to HF Datasets repositories
-- ✅ **Environment Variables**: Automatic detection of `HF_TOKEN` and `TRACKIO_DATASET_REPO`
-- ✅ **Fallback Support**: Graceful degradation if HF Datasets unavailable
-- ✅ **Dual Storage**: Experiments saved to both Trackio and HF Datasets
-- ✅ **Periodic Saving**: Metrics saved to the HF Dataset every 10 steps (see the sketch below)
-- ✅ **Error Handling**: Robust error logging and recovery
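-
-A minimal sketch of the periodic-saving pattern, assuming a hypothetical `save_to_hf_dataset` helper rather than the module's actual method names:
-
-```python
-class Monitor:
-    """Sketch of the dual-storage monitor; method names are illustrative."""
-
-    def __init__(self):
-        self.metrics_history = []
-
-    def save_to_hf_dataset(self):
-        # Hypothetical helper: push metrics_history to the HF Dataset repo
-        ...
-
-    def log_metrics(self, metrics: dict, step: int):
-        self.metrics_history.append({"step": step, **metrics})
-        if step % 10 == 0:  # flush to the HF Dataset every 10 steps
-            self.save_to_hf_dataset()
-```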
-
-### 2. **Updated `train.py`**
-- ✅ **Monitoring Integration**: Automatic monitoring setup in training scripts
-- ✅ **Configuration Logging**: Experiment configuration logged at start
-- ✅ **Training Callbacks**: Monitoring callbacks added to the trainer (see the sketch below)
-- ✅ **Summary Logging**: Training summaries logged at completion
-- ✅ **Error Logging**: Errors logged to monitoring system
-- ✅ **Cleanup**: Proper monitoring session cleanup
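-
-The callback wiring can be sketched as follows (a minimal sketch; the real integration goes through the monitor's `create_monitoring_callback` method):
-
-```python
-from transformers import TrainerCallback
-
-class MonitoringCallback(TrainerCallback):
-    """Forward Trainer logs to the experiment monitor."""
-
-    def __init__(self, monitor):
-        self.monitor = monitor
-
-    def on_log(self, args, state, control, logs=None, **kwargs):
-        if logs:
-            self.monitor.log_metrics(logs, step=state.global_step)
-
-# trainer.add_callback(MonitoringCallback(monitor))
-```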
-
-### 3. **Configuration Files Updated**
-- ✅ **HF Datasets Config**: Added `hf_token` and `dataset_repo` parameters
-- ✅ **Environment Support**: Environment variables automatically detected
-- ✅ **Backward Compatible**: Existing configurations still work
-
-### 4. **New Utility Scripts**
-- ✅ **`configure_trackio.py`**: Configuration testing and setup
-- ✅ **`integrate_monitoring.py`**: Automated integration script
-- ✅ **`test_monitoring_integration.py`**: Comprehensive testing
-- ✅ **`setup_hf_dataset.py`**: Dataset repository setup
-
-### 5. **Documentation**
-- ✅ **`MONITORING_INTEGRATION_GUIDE.md`**: Comprehensive usage guide
-- ✅ **`ENVIRONMENT_VARIABLES.md`**: Environment variable reference
-- ✅ **`HF_DATASETS_GUIDE.md`**: Detailed HF Datasets guide
-
-## 🔧 Environment Variables
-
-| Variable | Required | Default | Description |
-|----------|----------|---------|-------------|
-| `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token |
-| `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository |
-| `TRACKIO_URL` | ❌ No | None | Trackio server URL |
-| `TRACKIO_TOKEN` | ❌ No | None | Trackio authentication token |
-
-## 📊 What Gets Monitored
-
-### **Training Metrics**
-- Loss values (training and validation)
-- Learning rate
-- Gradient norms
-- Training steps and epochs
-
-### **System Metrics**
-- GPU memory usage
-- GPU utilization
-- CPU usage
-- Memory usage
-
-### **Experiment Data**
-- Configuration parameters
-- Model checkpoints
-- Evaluation results
-- Training summaries
-
-### **Artifacts**
-- Configuration files
-- Training logs
-- Evaluation results
-- Model checkpoints
-
-## 🚀 Usage Examples
-
-### **Basic Training**
-```bash
-# Set environment variables
-export HF_TOKEN=your_token_here
-export TRACKIO_DATASET_REPO=your-username/experiments
-
-# Run training with monitoring
-python train.py config/train_smollm3_openhermes_fr.py
-```
-
-### **Advanced Configuration**
-```bash
-# Train with custom settings
-python train.py config/train_smollm3_openhermes_fr.py \
- --experiment_name "smollm3_french_v2" \
- --hf_token your_token_here \
- --dataset_repo your-username/french-experiments
-```
-
-### **Testing Setup**
-```bash
-# Test configuration
-python configure_trackio.py
-
-# Test monitoring integration
-python test_monitoring_integration.py
-
-# Test dataset access
-python test_hf_datasets.py
-```
-
-## 📈 Benefits
-
-### **For HF Spaces Deployment**
-- ✅ **Persistent Storage**: Data survives Space restarts
-- ✅ **No Local Storage**: No dependency on ephemeral storage
-- ✅ **Scalable**: Works with any dataset size
-- ✅ **Secure**: Private dataset storage
-
-### **For Experiment Management**
-- ✅ **Centralized**: All experiments in one place
-- ✅ **Searchable**: Easy to find specific experiments
-- ✅ **Versioned**: Dataset versioning for experiments
-- ✅ **Collaborative**: Share experiments with team
-
-### **For Development**
-- ✅ **Flexible**: Easy to switch between datasets
-- ✅ **Configurable**: Environment-based configuration
-- ✅ **Robust**: Fallback mechanisms
-- ✅ **Debuggable**: Comprehensive logging
-
-## 🧪 Testing Results
-
-All monitoring integration tests passed:
-- ✅ Module Import
-- ✅ Monitor Creation
-- ✅ Config Creation
-- ✅ Metrics Logging
-- ✅ Configuration Logging
-- ✅ System Metrics
-- ✅ Training Summary
-- ✅ Callback Creation
-
-## 📋 Files Modified/Created
-
-### **Core Files**
-- `monitoring.py` - Enhanced with HF Datasets support
-- `train.py` - Updated with monitoring integration
-- `requirements_core.txt` - Added monitoring dependencies
-- `requirements_space.txt` - Updated for HF Spaces
-
-### **Configuration Files**
-- `config/train_smollm3.py` - Added HF Datasets config
-- `config/train_smollm3_openhermes_fr.py` - Added HF Datasets config
-- `config/train_smollm3_openhermes_fr_a100_balanced.py` - Added HF Datasets config
-- `config/train_smollm3_openhermes_fr_a100_large.py` - Added HF Datasets config
-- `config/train_smollm3_openhermes_fr_a100_max_performance.py` - Added HF Datasets config
-- `config/train_smollm3_openhermes_fr_a100_multiple_passes.py` - Added HF Datasets config
-
-### **New Utility Scripts**
-- `configure_trackio.py` - Configuration testing
-- `integrate_monitoring.py` - Automated integration
-- `test_monitoring_integration.py` - Comprehensive testing
-- `setup_hf_dataset.py` - Dataset setup
-
-### **Documentation**
-- `MONITORING_INTEGRATION_GUIDE.md` - Usage guide
-- `ENVIRONMENT_VARIABLES.md` - Environment reference
-- `HF_DATASETS_GUIDE.md` - HF Datasets guide
-- `MONITORING_IMPROVEMENTS_SUMMARY.md` - This summary
-
-## 🎯 Next Steps
-
-1. **Set up your HF token and dataset repository**
-2. **Test the configuration with `python configure_trackio.py`**
-3. **Run a training experiment to verify full functionality**
-4. **Check your HF Dataset repository for experiment data**
-5. **View results in your Trackio interface**
-
-## 🔍 Troubleshooting
-
-### **Common Issues**
-- **HF_TOKEN not set**: Set your Hugging Face token
-- **Dataset access failed**: Check token permissions and repository existence
-- **Monitoring not working**: Run `python test_monitoring_integration.py` to diagnose
-
-### **Getting Help**
-- Check the comprehensive guides in the documentation files
-- Run the test scripts to verify your setup
-- Check logs for specific error messages
-
----
-
-**🎉 The monitoring system is now ready for production use with persistent HF Datasets storage!**
\ No newline at end of file
diff --git a/docs/MONITORING_INTEGRATION_GUIDE.md b/docs/MONITORING_INTEGRATION_GUIDE.md
deleted file mode 100644
index 480e51fbb1cc406cac93103fb9f8d22c084d933d..0000000000000000000000000000000000000000
--- a/docs/MONITORING_INTEGRATION_GUIDE.md
+++ /dev/null
@@ -1,245 +0,0 @@
-# 🔧 Improved Monitoring Integration Guide
-
-## Overview
-
-The monitoring system has been enhanced to support **Hugging Face Datasets** for persistent experiment storage, making it ideal for deployment on Hugging Face Spaces and other cloud environments.
-
-## 🚀 Key Improvements
-
-### 1. **HF Datasets Integration**
-- ✅ **Persistent Storage**: Experiments are saved to HF Datasets repositories
-- ✅ **Environment Variables**: Configurable via `HF_TOKEN` and `TRACKIO_DATASET_REPO`
-- ✅ **Fallback Support**: Graceful degradation if HF Datasets unavailable
-- ✅ **Automatic Backup**: Local files as backup
-
-### 2. **Enhanced Monitoring Features**
-- 📊 **Real-time Metrics**: Training metrics logged to both Trackio and HF Datasets
-- 🔧 **System Metrics**: GPU memory, CPU usage, and system performance
-- 📈 **Training Summaries**: Comprehensive experiment summaries
-- 🛡️ **Error Handling**: Robust error logging and recovery
-
-### 3. **Easy Integration**
-- 🔌 **Automatic Setup**: Environment variables automatically detected
-- 📝 **Configuration**: Simple setup with environment variables
-- 🔄 **Backward Compatible**: Works with existing Trackio setup
-
-## 📋 Environment Variables
-
-| Variable | Required | Default | Description |
-|----------|----------|---------|-------------|
-| `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token |
-| `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository |
-| `TRACKIO_URL` | ❌ No | None | Trackio server URL |
-| `TRACKIO_TOKEN` | ❌ No | None | Trackio authentication token |
-
-## 🛠️ Setup Instructions
-
-### 1. **Get Your HF Token**
-```bash
-# Go to https://huggingface.co/settings/tokens
-# Create a new token with "Write" permissions
-# Copy the token
-```
-
-### 2. **Set Environment Variables**
-```bash
-# For HF Spaces, add these to your Space settings:
-HF_TOKEN=your_hf_token_here
-TRACKIO_DATASET_REPO=your-username/your-dataset-name
-
-# For local development:
-export HF_TOKEN=your_hf_token_here
-export TRACKIO_DATASET_REPO=your-username/your-dataset-name
-```
-
-### 3. **Create Dataset Repository**
-```bash
-# Run the setup script
-python setup_hf_dataset.py
-
-# Or manually create a dataset on HF Hub
-# Go to https://huggingface.co/datasets
-# Create a new dataset repository
-```
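-
-What the setup script does can be approximated with the public `huggingface_hub` API (a sketch of the assumed behavior, not the script's exact code):
-
-```python
-import os
-
-from huggingface_hub import create_repo
-
-# Create (or reuse) a private dataset repository for experiment storage
-create_repo(
-    "your-username/trackio-experiments",
-    repo_type="dataset",
-    private=True,
-    token=os.environ["HF_TOKEN"],
-    exist_ok=True,
-)
-```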
-
-### 4. **Test Configuration**
-```bash
-# Test your setup
-python configure_trackio.py
-
-# Test dataset access
-python test_hf_datasets.py
-```
-
-## 🚀 Usage Examples
-
-### **Basic Training with Monitoring**
-```bash
-# Train with default monitoring
-python train.py config/train_smollm3_openhermes_fr.py
-
-# Train with custom dataset repository
-TRACKIO_DATASET_REPO=your-username/smollm3-experiments python train.py config/train_smollm3_openhermes_fr.py
-```
-
-### **Advanced Training Configuration**
-```bash
-# Train with custom experiment name
-python train.py config/train_smollm3_openhermes_fr.py \
- --experiment_name "smollm3_french_tuning_v2" \
- --hf_token your_token_here \
- --dataset_repo your-username/french-experiments
-```
-
-### **Training Scripts with Monitoring**
-```bash
-# All training scripts now support monitoring:
-python train.py config/train_smollm3_openhermes_fr_a100_balanced.py
-python train.py config/train_smollm3_openhermes_fr_a100_large.py
-python train.py config/train_smollm3_openhermes_fr_a100_max_performance.py
-python train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
-```
-
-## 📊 What Gets Monitored
-
-### **Training Metrics**
-- Loss values (training and validation)
-- Learning rate
-- Gradient norms
-- Training steps and epochs
-
-### **System Metrics**
-- GPU memory usage
-- GPU utilization
-- CPU usage
-- Memory usage
-
-### **Experiment Data**
-- Configuration parameters
-- Model checkpoints
-- Evaluation results
-- Training summaries
-
-### **Artifacts**
-- Configuration files
-- Training logs
-- Evaluation results
-- Model checkpoints
-
-## 🔍 Viewing Results
-
-### **1. Trackio Interface**
-- Visit your Trackio Space
-- Navigate to "Experiments" tab
-- View real-time metrics and plots
-
-### **2. HF Dataset Repository**
-- Go to your dataset repository on HF Hub
-- Browse experiment data (or load it programmatically; see the sketch below)
-- Download experiment files
-
-### **3. Local Files**
-- Check local backup files
-- Review training logs
-- Examine configuration files
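-
-Experiment data can also be pulled programmatically (a minimal sketch, assuming a recent `datasets` release and the experiment schema described in this guide):
-
-```python
-import os
-
-from datasets import load_dataset
-
-experiments = load_dataset(
-    "your-username/trackio-experiments",
-    split="train",
-    token=os.environ.get("HF_TOKEN"),  # needed for private dataset repos
-)
-for exp in experiments:
-    print(exp["experiment_id"], exp["status"])
-```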
-
-## 🛠️ Configuration Examples
-
-### **Default Setup**
-```python
-# Uses default dataset: tonic/trackio-experiments
-# Requires only HF_TOKEN
-```
-
-### **Personal Dataset**
-```bash
-export HF_TOKEN=your_token_here
-export TRACKIO_DATASET_REPO=your-username/trackio-experiments
-```
-
-### **Team Dataset**
-```bash
-export HF_TOKEN=your_token_here
-export TRACKIO_DATASET_REPO=your-org/team-experiments
-```
-
-### **Project-Specific Dataset**
-```bash
-export HF_TOKEN=your_token_here
-export TRACKIO_DATASET_REPO=your-username/smollm3-experiments
-```
-
-## 🔧 Troubleshooting
-
-### **Issue: "HF_TOKEN not found"**
-```bash
-# Solution: Set your HF token
-export HF_TOKEN=your_token_here
-# Or add to HF Space environment variables
-```
-
-### **Issue: "Failed to load dataset"**
-```bash
-# Solutions:
-# 1. Check token has read access
-# 2. Verify dataset repository exists
-# 3. Run setup script: python setup_hf_dataset.py
-```
-
-### **Issue: "Failed to save experiments"**
-```bash
-# Solutions:
-# 1. Check token has write permissions
-# 2. Verify dataset repository exists
-# 3. Check network connectivity
-```
-
-### **Issue: "Monitoring not working"**
-```bash
-# Solutions:
-# 1. Check environment variables
-# 2. Run configuration test: python configure_trackio.py
-# 3. Check logs for specific errors
-```
-
-## 📈 Benefits
-
-### **For HF Spaces Deployment**
-- ✅ **Persistent Storage**: Data survives Space restarts
-- ✅ **No Local Storage**: No dependency on ephemeral storage
-- ✅ **Scalable**: Works with any dataset size
-- ✅ **Secure**: Private dataset storage
-
-### **For Experiment Management**
-- ✅ **Centralized**: All experiments in one place
-- ✅ **Searchable**: Easy to find specific experiments
-- ✅ **Versioned**: Dataset versioning for experiments
-- ✅ **Collaborative**: Share experiments with team
-
-### **For Development**
-- ✅ **Flexible**: Easy to switch between datasets
-- ✅ **Configurable**: Environment-based configuration
-- ✅ **Robust**: Fallback mechanisms
-- ✅ **Debuggable**: Comprehensive logging
-
-## 🎯 Next Steps
-
-1. **Set up your HF token and dataset repository**
-2. **Test the configuration with `python configure_trackio.py`**
-3. **Run a training experiment to verify monitoring**
-4. **Check your HF Dataset repository for experiment data**
-5. **View results in your Trackio interface**
-
-## 📚 Related Files
-
-- `monitoring.py` - Enhanced monitoring with HF Datasets support
-- `train.py` - Updated training script with monitoring integration
-- `configure_trackio.py` - Configuration and testing script
-- `setup_hf_dataset.py` - Dataset repository setup
-- `test_hf_datasets.py` - Dataset access testing
-- `ENVIRONMENT_VARIABLES.md` - Environment variable reference
-- `HF_DATASETS_GUIDE.md` - Detailed HF Datasets guide
-
----
-
-**🎉 Your experiments are now persistently stored and easily accessible!**
\ No newline at end of file
diff --git a/docs/MONITORING_VERIFICATION_REPORT.md b/docs/MONITORING_VERIFICATION_REPORT.md
deleted file mode 100644
index 3169006a0685ed02dfe3d5afe2634a2fe6fe78a6..0000000000000000000000000000000000000000
--- a/docs/MONITORING_VERIFICATION_REPORT.md
+++ /dev/null
@@ -1,163 +0,0 @@
-# Monitoring Verification Report
-
-## Overview
-
-This document verifies that `src/monitoring.py` is fully compatible with the actual deployed Trackio space and all monitoring components.
-
-## ✅ **VERIFICATION STATUS: ALL TESTS PASSED**
-
-### **Trackio Space Deployment Verification**
-
-The actual deployed Trackio space at `https://tonic-trackio-monitoring-20250726.hf.space` provides the following API endpoints:
-
-#### **Available API Endpoints**
-1. ✅ `/update_trackio_config` - Update configuration
-2. ✅ `/test_dataset_connection` - Test dataset connection
-3. ✅ `/create_dataset_repository` - Create dataset repository
-4. ✅ `/create_experiment_interface` - Create experiment
-5. ✅ `/log_metrics_interface` - Log metrics
-6. ✅ `/log_parameters_interface` - Log parameters
-7. ✅ `/get_experiment_details` - Get experiment details
-8. ✅ `/list_experiments_interface` - List experiments
-9. ✅ `/create_metrics_plot` - Create metrics plot
-10. ✅ `/create_experiment_comparison` - Compare experiments
-11. ✅ `/simulate_training_data` - Simulate training data
-12. ✅ `/create_demo_experiment` - Create demo experiment
-13. ✅ `/update_experiment_status_interface` - Update status
-
-### **Monitoring.py Compatibility Verification**
-
-#### **✅ Dataset Structure Compatibility**
-- **Field Structure**: All 10 fields match between monitoring.py and the actual dataset
- - `experiment_id`, `name`, `description`, `created_at`, `status`
- - `metrics`, `parameters`, `artifacts`, `logs`, `last_updated`
-- **Metrics Structure**: All 17 metric fields compatible
- - `loss`, `grad_norm`, `learning_rate`, `num_tokens`, `mean_token_accuracy`
- - `epoch`, `total_tokens`, `throughput`, `step_time`, `batch_size`
- - `seq_len`, `token_acc`, `gpu_memory_allocated`, `gpu_memory_reserved`
- - `gpu_utilization`, `cpu_percent`, `memory_percent`
-- **Parameters Structure**: All 11 parameter fields compatible
- - `model_name`, `max_seq_length`, `batch_size`, `learning_rate`, `epochs`
- - `dataset`, `trainer_type`, `hardware`, `mixed_precision`
- - `gradient_checkpointing`, `flash_attention`
-
-#### **✅ Trackio API Client Compatibility**
-- **Available Methods**: All 7 methods working correctly
- - `create_experiment` ✅
- - `log_metrics` ✅
- - `log_parameters` ✅
- - `get_experiment_details` ✅
- - `list_experiments` ✅
- - `update_experiment_status` ✅
- - `simulate_training_data` ✅
-
-#### **✅ Monitoring Variables Verification**
-- **Core Variables**: All 10 variables present and working
- - `experiment_id`, `experiment_name`, `start_time`, `metrics_history`, `artifacts`
- - `trackio_client`, `hf_dataset_client`, `dataset_repo`, `hf_token`, `enable_tracking`
-- **Core Methods**: All 7 methods present and working
- - `log_metrics`, `log_configuration`, `log_model_checkpoint`, `log_evaluation_results`
- - `log_system_metrics`, `log_training_summary`, `create_monitoring_callback`
-
-#### **✅ Integration Verification**
-- **Monitor Creation**: ✅ Working perfectly
-- **Attribute Verification**: ✅ All 7 expected attributes present
-- **Dataset Repository**: ✅ Properly set and validated
-- **Enable Tracking**: ✅ Correctly configured
-
-### **Key Compatibility Features**
-
-#### **1. Dataset Structure Alignment**
-```python
-# monitoring.py uses the exact structure from setup_hf_dataset.py
-dataset_data = [{
-    'experiment_id': self.experiment_id or f"exp_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
-    'name': self.experiment_name,
-    'description': "SmolLM3 fine-tuning experiment",
-    'created_at': self.start_time.isoformat(),
-    'status': 'running',
-    'metrics': json.dumps(self.metrics_history),
-    'parameters': json.dumps(experiment_data),
-    'artifacts': json.dumps(self.artifacts),
-    'logs': json.dumps([]),
-    'last_updated': datetime.now().isoformat()
-}]
-```
-
-#### **2. Trackio Space Integration**
-```python
-# Uses only available methods from deployed space
-self.trackio_client.log_metrics(experiment_id, metrics, step)
-self.trackio_client.log_parameters(experiment_id, parameters)
-self.trackio_client.list_experiments()
-self.trackio_client.update_experiment_status(experiment_id, status)
-```
-
-#### **3. Error Handling**
-```python
-# Graceful fallback when Trackio space is unavailable
-try:
-    result = self.trackio_client.list_experiments()
-    if result.get('error'):
-        logger.warning(f"Trackio Space not accessible: {result['error']}")
-        self.enable_tracking = False
-        return
-except Exception as e:
-    logger.warning(f"Trackio Space not accessible: {e}")
-    self.enable_tracking = False
-```
-
-### **Verification Test Results**
-
-```
-🚀 Monitoring Verification Tests
-==================================================
-✅ Dataset structure: Compatible
-✅ Trackio space: Compatible
-✅ Monitoring variables: Correct
-✅ API client: Compatible
-✅ Integration: Working
-✅ Structure compatibility: Verified
-✅ Space compatibility: Verified
-
-🎉 ALL MONITORING VERIFICATION TESTS PASSED!
-Monitoring.py is fully compatible with all components!
-```
-
-### **Deployed Trackio Space API Endpoints**
-
-The actual deployed space provides these endpoints that monitoring.py can use:
-
-#### **Core Experiment Management**
-- `POST /create_experiment_interface` - Create new experiments
-- `POST /log_metrics_interface` - Log training metrics
-- `POST /log_parameters_interface` - Log experiment parameters
-- `GET /list_experiments_interface` - List all experiments
-- `POST /update_experiment_status_interface` - Update experiment status
-
-#### **Configuration & Setup**
-- `POST /update_trackio_config` - Update HF token and dataset repo
-- `POST /test_dataset_connection` - Test dataset connectivity
-- `POST /create_dataset_repository` - Create HF dataset repository
-
-#### **Analysis & Visualization**
-- `POST /create_metrics_plot` - Generate metric plots
-- `POST /create_experiment_comparison` - Compare multiple experiments
-- `POST /get_experiment_details` - Get detailed experiment info
-
-#### **Testing & Demo**
-- `POST /simulate_training_data` - Generate demo training data
-- `POST /create_demo_experiment` - Create demonstration experiments
-
-### **Conclusion**
-
-**✅ MONITORING.PY IS FULLY COMPATIBLE WITH THE ACTUAL DEPLOYED TRACKIO SPACE**
-
-The monitoring system has been verified to work correctly with:
-- ✅ All actual API endpoints from the deployed Trackio space
-- ✅ Complete dataset structure compatibility
-- ✅ Proper error handling and fallback mechanisms
-- ✅ All monitoring variables and methods working correctly
-- ✅ Seamless integration with HF Datasets and Trackio space
-
-**The monitoring.py file is production-ready and fully compatible with the actual deployed Trackio space!** 🚀
\ No newline at end of file
diff --git a/docs/Model_Abstraction.md b/docs/Model_Abstraction.md
new file mode 100644
index 0000000000000000000000000000000000000000..5f0472e205ca3833d34db33e633288f069902e02
--- /dev/null
+++ b/docs/Model_Abstraction.md
@@ -0,0 +1,36 @@
+```mermaid
+graph LR
+ EntryPoint["EntryPoint"]
+ Model_Abstraction["Model Abstraction"]
+ EntryPoint -- "initiates model loading in" --> Model_Abstraction
+ click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
+```
+
+## Details
+
+This analysis covers the EntryPoint component of the `smollm3_finetune` application and clarifies its interaction with the Model Abstraction layer.
+
+### EntryPoint
+This component represents the primary execution flow of the `smollm3_finetune` application. It is responsible for initializing the application, parsing configuration, and orchestrating the high-level tasks such as initiating the model loading process and potentially the training or inference loops. It acts as the user-facing interface or the main script that kicks off the application's operations.
+
+**Related Classes/Methods**:
+
+- `smollm3_finetune.main` (1:1)
+
+
+### Model Abstraction [[Expand]](./Model_Abstraction.md)
+This component is responsible for encapsulating the complex logic of loading pre-trained models, defining their architectures, and managing various model variants such as quantization and LoRA adapters. It provides a unified and consistent interface for interacting with different model configurations, ensuring that the core training logic can operate seamlessly regardless of the underlying model specifics. This abstraction is crucial for maintaining modularity and flexibility within the machine learning training and fine-tuning framework.
+
+**Related Classes/Methods**:
+
+- `smollm3_finetune.model` (1:1)
+- `smollm3_finetune.model.load_model` (1:1)
+
+### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
\ No newline at end of file
diff --git a/docs/NO_THINK_TAG_GUIDE.md b/docs/NO_THINK_TAG_GUIDE.md
deleted file mode 100644
index 6ad314647b093fb0e6240a02e48f2a55675568b9..0000000000000000000000000000000000000000
--- a/docs/NO_THINK_TAG_GUIDE.md
+++ /dev/null
@@ -1,146 +0,0 @@
-# SmolLM3 `/no_think` Tag Implementation Guide
-
-## The Problem
-
-You were using the `enable_thinking` parameter in the chat template configuration, which is **incorrect** for SmolLM3. The `/no_think` tag should be added as a **system message** in your training data, not as a configuration parameter.
-
-### What was wrong:
-
-```python
-# ❌ INCORRECT - This doesn't work for SmolLM3
-chat_template_kwargs={
- "enable_thinking": False, # This parameter doesn't exist in SmolLM3
- "add_generation_prompt": True
-}
-```
-
-### What's correct:
-
-```python
-# ✅ CORRECT - Add /no_think as system message
-messages = [
- {"role": "system", "content": "You are a helpful assistant. /no_think"},
- {"role": "user", "content": "What is machine learning?"},
- {"role": "assistant", "content": "Machine learning is..."}
-]
-```
-
-## The Solution
-
-### 1. Updated Data Processing
-
-The `data.py` file now properly handles the `/no_think` tag by:
-
-- Adding a system message with `/no_think` when `no_think_system_message=True`
-- Using the correct chat template parameters
-- Properly formatting messages for SmolLM3
-
-### 2. Updated Configuration
-
-All configuration files now use the correct parameter:
-
-```python
-chat_template_kwargs={
- "add_generation_prompt": True,
- "no_think_system_message": True # Set to True to add /no_think tag
-}
-```
-
-### 3. How It Works
-
-When `no_think_system_message=True`, the system automatically adds:
-
-```
-{"role": "system", "content": "You are a helpful assistant. /no_think"}
-```
-
-as the first message in each conversation.
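-
-A minimal sketch of this injection (assuming the tokenizer is already loaded; `no_think_system_message` here stands in for the config flag shown above):
-
-```python
-from transformers import AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
-
-messages = [{"role": "user", "content": "What is 2+2?"}]
-
-no_think_system_message = True
-if no_think_system_message:
-    # Prepend the /no_think system message so the model answers directly
-    messages = [{"role": "system", "content": "You are a helpful assistant. /no_think"}] + messages
-
-text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-print(text)
-```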
-
-## Testing the Fix
-
-### 1. Run the Test Script
-
-```bash
-python test_no_think.py
-```
-
-This will show you the difference between:
-- Messages with `/no_think` tag
-- Messages without `/no_think` tag
-
-### 2. Test with Training
-
-```bash
-python train.py config/train_smollm3_no_think_test.py --dataset_dir=test_data
-```
-
-This uses a configuration that enables the `/no_think` tag.
-
-## Configuration Options
-
-### Enable `/no_think` tag:
-```python
-chat_template_kwargs={
- "add_generation_prompt": True,
- "no_think_system_message": True # Adds /no_think to system message
-}
-```
-
-### Disable `/no_think` tag (allow thinking):
-```python
-chat_template_kwargs={
- "add_generation_prompt": True,
- "no_think_system_message": False # No /no_think tag
-}
-```
-
-## When to Use `/no_think`
-
-- **Use `/no_think`** for tasks that require direct, concise responses
-- **Don't use `/no_think`** for tasks that benefit from reasoning and step-by-step thinking
-
-### Examples:
-
-**With `/no_think` (direct answers):**
-```
-System: You are a helpful assistant. /no_think
-User: What is 2+2?
-Assistant: 4
-```
-
-**Without `/no_think` (reasoning allowed):**
-```
-System: You are a helpful assistant.
-User: Solve this math problem step by step: 15 * 7
-Assistant: Let me solve this step by step:
-1. First, I'll break down 15 * 7
-2. 15 * 7 = (10 + 5) * 7
-3. = 10 * 7 + 5 * 7
-4. = 70 + 35
-5. = 105
-The answer is 105.
-```
-
-## Updated Files
-
-The following files were updated to fix the `/no_think` tag issue:
-
-1. `data.py` - Updated `format_chat_template` function
-2. `config/train_smollm3.py` - Updated default configuration
-3. `config/train_smollm3_openhermes_fr.py` - Updated configuration
-4. `config/train_smollm3_long_context.py` - Updated configuration
-5. `config/runpod_config.py` - Updated configuration
-6. All A100 configuration files - Updated configurations
-
-## Verification
-
-To verify the fix is working:
-
-1. Check that system messages include `/no_think` when `no_think_system_message=True`
-2. Verify that the chat template is applied correctly
-3. Test with actual training to ensure the model learns the `/no_think` behavior
-
-## References
-
-- [SmolLM3 Model Card](https://huggingface.co/HuggingFaceTB/SmolLM3-3B)
-- [SmolLM3 Documentation](https://huggingface.co/docs/transformers/model_doc/smollm3)
\ No newline at end of file
diff --git a/docs/PIPELINE_SUMMARY.md b/docs/PIPELINE_SUMMARY.md
deleted file mode 100644
index 843b3deec2efda895933b85b795daf39c02c4cf6..0000000000000000000000000000000000000000
--- a/docs/PIPELINE_SUMMARY.md
+++ /dev/null
@@ -1,330 +0,0 @@
-# SmolLM3 End-to-End Pipeline - Implementation Summary
-
-This document summarizes the comprehensive refactoring and enhancement of the SmolLM3 fine-tuning codebase to create a complete end-to-end pipeline.
-
-## 🎯 Overview
-
-The pipeline now provides a complete solution from Trackio Space deployment to model push, with integrated monitoring, dataset management, and automated deployment.
-
-## 📁 Files Created/Modified
-
-### **Core Pipeline Files**
-
-1. **`launch.sh`** - Complete end-to-end pipeline script
- - 16-step comprehensive pipeline
- - Automated environment setup
- - Integrated monitoring and deployment
- - Dynamic configuration generation
-
-2. **`setup_launch.py`** - User configuration helper
- - Interactive setup for user credentials
- - Automatic script configuration
- - Requirements checker generation
-
-3. **`test_pipeline.py`** - Comprehensive testing suite
- - Import testing
- - Component verification
- - CUDA and HF token validation
-
-4. **`README_END_TO_END.md`** - Complete documentation
- - Step-by-step usage guide
- - Troubleshooting section
- - Advanced configuration options
-
-### **Scripts and Utilities**
-
-5. **`scripts/trackio_tonic/trackio_api_client.py`** - API client for Trackio
- - Complete API client implementation
- - Error handling and retry logic
- - Support for both JSON and SSE responses
-
-6. **`scripts/trackio_tonic/deploy_trackio_space.py`** - Space deployment
- - Automated HF Space creation
- - File upload and configuration
- - Space testing and validation
-
-7. **`scripts/trackio_tonic/configure_trackio.py`** - Configuration helper
- - Environment variable setup
- - Dataset repository configuration
- - Usage examples and validation
-
-8. **`scripts/model_tonic/push_to_huggingface.py`** - Model deployment
- - Complete model upload pipeline
- - Model card generation
- - Training results documentation
-
-9. **`scripts/dataset_tonic/setup_hf_dataset.py`** - Dataset setup
- - HF Dataset repository creation
- - Initial experiment data structure
- - Dataset access configuration
-
-### **Source Code Updates**
-
-10. **`src/monitoring.py`** - Enhanced monitoring
- - HF Datasets integration
- - Trackio API client integration
- - Comprehensive metrics logging
-
-11. **`src/train.py`** - Updated training script
- - Monitoring integration
- - HF Datasets support
- - Enhanced error handling
-
-12. **`src/config.py`** - Configuration management
- - Dynamic config loading
- - Multiple config type support
- - Fallback mechanisms
-
-13. **`src/data.py`** - Enhanced dataset handling
- - Multiple format support
- - Automatic conversion
- - Bad entry filtering
-
-14. **`src/model.py`** - Model wrapper
- - SmolLM3-specific optimizations
- - Flash attention support
- - Long context handling
-
-15. **`src/trainer.py`** - Training orchestration
- - Monitoring callback integration
- - Enhanced logging
- - Checkpoint management
-
-## 🔧 Key Improvements
-
-### **1. Import Path Fixes**
-- Fixed all import paths to work with the refactored structure
-- Added proper sys.path handling for cross-module imports
-- Ensured compatibility between different script locations
-
-### **2. Monitoring Integration**
-- **Trackio Space**: Real-time experiment tracking
-- **HF Datasets**: Persistent experiment storage
-- **System Metrics**: GPU, memory, and CPU monitoring
-- **Training Callbacks**: Automatic metric logging
-
-### **3. Dataset Handling**
-- **Multi-format Support**: Prompt/completion, instruction/output, chat formats
-- **Automatic Conversion**: Handles different dataset structures (see the sketch below)
-- **Validation**: Ensures data quality and completeness
-- **Splitting**: Automatic train/validation/test splits
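-
-For example, the instruction/output-to-chat conversion can be sketched as (a minimal sketch with hypothetical field names matching the formats listed above):
-
-```python
-def to_chat_format(example: dict) -> dict:
-    """Convert an instruction/output record into the chat-messages format."""
-    return {
-        "messages": [
-            {"role": "user", "content": example["instruction"]},
-            {"role": "assistant", "content": example["output"]},
-        ]
-    }
-```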
-
-### **4. Configuration Management**
-- **Dynamic Generation**: Creates configs based on user input
-- **Multiple Types**: Support for different training configurations
-- **Environment Variables**: Proper integration with environment
-- **Validation**: Ensures configuration correctness
-
-### **5. Deployment Automation**
-- **Model Upload**: Complete model push to HF Hub
-- **Model Cards**: Comprehensive documentation generation
-- **Training Results**: Complete experiment documentation
-- **Testing**: Automated model validation
-
-## 🚀 Pipeline Steps
-
-The end-to-end pipeline performs these 16 steps:
-
-1. **Environment Setup** - System dependencies and Python environment
-2. **PyTorch Installation** - CUDA-enabled PyTorch installation
-3. **Dependencies** - All required Python packages
-4. **Authentication** - HF token setup and validation
-5. **Trackio Deployment** - HF Space creation and configuration
-6. **Dataset Setup** - HF Dataset repository creation
-7. **Trackio Configuration** - Environment and dataset configuration
-8. **Training Config** - Dynamic configuration generation
-9. **Dataset Preparation** - Download and format conversion
-10. **Parameter Calculation** - Training steps and batch calculations (see the sketch after this list)
-11. **Training Execution** - Model fine-tuning with monitoring
-12. **Model Push** - Upload to HF Hub with documentation
-13. **Model Testing** - Validation of uploaded model
-14. **Summary Report** - Complete training documentation
-15. **Resource Links** - All online resource URLs
-16. **Next Steps** - Usage instructions and recommendations
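-
-Step 10 can be approximated as follows (a minimal sketch, assuming total optimizer steps = epochs * ceil(dataset size / effective batch); the values mirror the training parameters shown later in this document):
-
-```python
-import math
-
-dataset_size = 100_000               # hypothetical example; the pipeline reads this from the dataset
-batch_size = 2
-gradient_accumulation_steps = 8
-max_epochs = 3
-
-effective_batch = batch_size * gradient_accumulation_steps
-steps_per_epoch = math.ceil(dataset_size / effective_batch)
-total_steps = steps_per_epoch * max_epochs
-print(f"{steps_per_epoch=}, {total_steps=}")
-```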
-
-## 📊 Monitoring Features
-
-### **Trackio Space Interface**
-- Real-time training metrics
-- Experiment comparison
-- System resource monitoring
-- Training progress visualization
-
-### **HF Dataset Storage**
-- Persistent experiment data
-- Version-controlled history
-- Collaborative sharing
-- Automated backup
-
-### **Comprehensive Logging**
-- Training metrics (loss, accuracy, etc.)
-- System metrics (GPU, memory, CPU)
-- Configuration parameters
-- Training artifacts
-
-## 🔧 Configuration Options
-
-### **User Configuration**
-```bash
-# Required
-HF_TOKEN="your_token"
-HF_USERNAME="your_username"
-
-# Optional
-MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
-DATASET_NAME="HuggingFaceTB/smoltalk"
-```
-
-### **Training Parameters**
-```bash
-BATCH_SIZE=2
-GRADIENT_ACCUMULATION_STEPS=8
-LEARNING_RATE=5e-6
-MAX_EPOCHS=3
-MAX_SEQ_LENGTH=4096
-```
-
-### **Monitoring Configuration**
-```bash
-TRACKIO_DATASET_REPO="username/trackio-experiments"
-EXPERIMENT_NAME="smollm3_finetune_YYYYMMDD_HHMMSS"
-```
-
-## 🛠️ Error Handling
-
-### **Comprehensive Error Handling**
-- Import error detection and reporting
-- Configuration validation
-- Network timeout handling
-- Graceful degradation
-
-### **Debugging Support**
-- Detailed logging at all levels
-- Component-specific error messages
-- Fallback mechanisms
-- Testing utilities
-
-## 📈 Performance Optimizations
-
-### **Training Optimizations**
-- Flash Attention for efficiency
-- Gradient checkpointing for memory
-- Mixed precision training
-- Optimized data loading
-
-### **Monitoring Optimizations**
-- Asynchronous logging
-- Batch metric updates
-- Efficient data storage
-- Minimal overhead
-
-## 🔄 Integration Points
-
-### **Hugging Face Ecosystem**
-- **HF Hub**: Model and dataset storage
-- **HF Spaces**: Trackio monitoring interface
-- **HF Datasets**: Experiment data persistence
-- **HF CLI**: Authentication and deployment
-
-### **External Services**
-- **Trackio**: Experiment tracking
-- **CUDA**: GPU acceleration
-- **PyTorch**: Deep learning framework
-- **Transformers**: Model library
-
-## 🎯 Usage Workflow
-
-### **1. Setup Phase**
-```bash
-python setup_launch.py # Configure with user info
-python test_pipeline.py # Verify all components
-```
-
-### **2. Execution Phase**
-```bash
-chmod +x launch.sh # Make executable
-./launch.sh # Run complete pipeline
-```
-
-### **3. Monitoring Phase**
-- Track progress in Trackio Space
-- Monitor metrics in real-time
-- Check logs for issues
-- Validate results
-
-### **4. Results Phase**
-- Access model on HF Hub
-- Review training summary
-- Test model performance
-- Share results
-
-## 📋 Quality Assurance
-
-### **Testing Coverage**
-- Import testing for all modules
-- Script availability verification
-- Configuration validation
-- CUDA and token testing
-- Component integration testing
-
-### **Documentation**
-- Comprehensive README
-- Step-by-step guides
-- Troubleshooting section
-- Advanced usage examples
-
-### **Error Recovery**
-- Graceful error handling
-- Detailed error messages
-- Recovery mechanisms
-- Fallback options
-
-## 🚀 Future Enhancements
-
-### **Planned Improvements**
-- Multi-GPU training support
-- Distributed training
-- Advanced hyperparameter tuning
-- Custom dataset upload
-- Model evaluation metrics
-- Automated testing pipeline
-
-### **Extensibility**
-- Plugin architecture for custom components
-- Configuration templates
-- Custom monitoring backends
-- Advanced deployment options
-
-## 📊 Success Metrics
-
-### **Pipeline Completeness**
-- ✅ All 16 steps implemented
-- ✅ Error handling at each step
-- ✅ Monitoring integration
-- ✅ Documentation complete
-
-### **User Experience**
-- ✅ Simple setup process
-- ✅ Clear error messages
-- ✅ Comprehensive documentation
-- ✅ Testing utilities
-
-### **Technical Quality**
-- ✅ Import path fixes
-- ✅ Configuration management
-- ✅ Monitoring integration
-- ✅ Deployment automation
-
-## 🎉 Conclusion
-
-The SmolLM3 end-to-end pipeline provides a complete solution for fine-tuning with integrated monitoring, automated deployment, and comprehensive documentation. The refactored codebase is now production-ready with proper error handling, testing, and user experience considerations.
-
-**Key Achievements:**
-- Complete end-to-end automation
-- Integrated monitoring and tracking
-- Comprehensive error handling
-- Production-ready deployment
-- Extensive documentation
-- Testing and validation suite
-
-The pipeline is now ready for users to easily fine-tune SmolLM3 models with full monitoring and deployment capabilities.
\ No newline at end of file
diff --git a/docs/PUSH_GUIDE.md b/docs/PUSH_GUIDE.md
deleted file mode 100644
index 21be1de5e084e5fc3c4947da8b1d151c261139bb..0000000000000000000000000000000000000000
--- a/docs/PUSH_GUIDE.md
+++ /dev/null
@@ -1,406 +0,0 @@
-# Push to Hugging Face Hub Guide
-
-This guide explains how to use the `push_to_huggingface.py` script to upload your trained SmolLM3 models and results to Hugging Face Hub.
-
-## Features
-
-- ✅ **Automatic Repository Creation** - Creates HF repositories automatically
-- ✅ **Model Validation** - Validates required model files before upload
-- ✅ **Comprehensive Model Cards** - Generates detailed model documentation
-- ✅ **Training Results Upload** - Uploads logs, configs, and results
-- ✅ **Trackio Integration** - Logs push actions to your monitoring system
-- ✅ **Private/Public Repositories** - Support for both private and public models
-
-## Prerequisites
-
-### 1. Install Dependencies
-
-```bash
-pip install huggingface_hub
-```
-
-### 2. Set Up Hugging Face Token
-
-```bash
-# Option 1: Environment variable
-export HF_TOKEN="your_huggingface_token_here"
-
-# Option 2: Use --token argument
-python push_to_huggingface.py model_path repo_name --token "your_token"
-```
-
-### 3. Get Your Hugging Face Token
-
-1. Go to https://huggingface.co/settings/tokens
-2. Click "New token"
-3. Give it a name (e.g., "model-upload")
-4. Select "Write" permissions
-5. Copy the token
-
-## Basic Usage
-
-### Simple Model Push
-
-```bash
-python push_to_huggingface.py /path/to/model username/model-name
-```
-
-### Push with Custom Token
-
-```bash
-python push_to_huggingface.py /path/to/model username/model-name \
- --token "hf_your_token_here"
-```
-
-### Push Private Model
-
-```bash
-python push_to_huggingface.py /path/to/model username/model-name \
- --private
-```
-
-### Push with Trackio Integration
-
-```bash
-python push_to_huggingface.py /path/to/model username/model-name \
- --trackio-url "https://your-space.hf.space" \
- --experiment-name "my_experiment"
-```
-
-## Complete Workflow Example
-
-### 1. Train Your Model
-
-```bash
-python train.py config/train_smollm3.py \
- --dataset_dir my_dataset \
- --enable_tracking \
- --trackio_url "https://your-space.hf.space" \
- --experiment_name "smollm3_finetune_v1"
-```
-
-### 2. Push to Hugging Face Hub
-
-```bash
-python push_to_huggingface.py /output-checkpoint username/smollm3-finetuned \
- --trackio-url "https://your-space.hf.space" \
- --experiment-name "smollm3_finetune_v1"
-```
-
-### 3. Use Your Model
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load your uploaded model
-model = AutoModelForCausalLM.from_pretrained("username/smollm3-finetuned")
-tokenizer = AutoTokenizer.from_pretrained("username/smollm3-finetuned")
-
-# Generate text
-inputs = tokenizer("Hello, how are you?", return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=100)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-```
-
-## Repository Structure
-
-After pushing, your repository will contain:
-
-```
-username/model-name/
-├── README.md # Auto-generated model card
-├── config.json # Model configuration
-├── pytorch_model.bin # Model weights
-├── tokenizer.json # Tokenizer configuration
-├── tokenizer_config.json # Tokenizer settings
-├── special_tokens_map.json # Special tokens
-├── training_results/ # Training artifacts
-│   ├── train_results.json
-│   ├── eval_results.json
-│   ├── training_config.json
-│   └── training.log
-└── .gitattributes # Git attributes
-```
-
-## Model Card Features
-
-The script automatically generates comprehensive model cards including:
-
-- **Model Details**: Base model, fine-tuning method, size
-- **Training Configuration**: All training parameters
-- **Training Results**: Loss, accuracy, steps, time
-- **Usage Examples**: Code snippets for loading and using
-- **Performance Metrics**: Training and validation metrics
-- **Hardware Information**: GPU/CPU used for training
-
-## Advanced Usage
-
-### Custom Repository Names
-
-```bash
-# Public repository
-python push_to_huggingface.py /model myusername/smollm3-chatbot
-
-# Private repository
-python push_to_huggingface.py /model myusername/smollm3-private --private
-```
-
-### Integration with Training Pipeline
-
-```bash
-#!/bin/bash
-# Complete training and push workflow
-
-# 1. Train the model
-python train.py config/train_smollm3.py \
- --dataset_dir my_dataset \
- --enable_tracking \
- --trackio_url "https://your-space.hf.space" \
- --experiment_name "smollm3_v1"
-
-# 2. Push to Hugging Face Hub
-python push_to_huggingface.py /output-checkpoint myusername/smollm3-v1 \
- --trackio-url "https://your-space.hf.space" \
- --experiment-name "smollm3_v1"
-
-# 3. Test the model
-python -c "
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model = AutoModelForCausalLM.from_pretrained('myusername/smollm3-v1')
-tokenizer = AutoTokenizer.from_pretrained('myusername/smollm3-v1')
-print('Model loaded successfully!')
-"
-```
-
-### Batch Processing Multiple Models
-
-```bash
-#!/bin/bash
-# Push multiple models
-
-models=(
- "smollm3-baseline"
- "smollm3-high-lr"
- "smollm3-dpo"
-)
-
-for model in "${models[@]}"; do
- echo "Pushing $model..."
- python push_to_huggingface.py "/models/$model" "username/$model"
-done
-```
-
-## Error Handling
-
-### Common Issues and Solutions
-
-#### 1. Missing Model Files
-
-**Error**: `❌ Missing required files: ['config.json', 'pytorch_model.bin']`
-
-**Solution**: Ensure your model directory contains all required files:
-- `config.json`
-- `pytorch_model.bin`
-- `tokenizer.json`
-- `tokenizer_config.json`
-
-#### 2. Authentication Issues
-
-**Error**: `❌ Failed to create repository: 401 Client Error`
-
-**Solution**:
-- Check your HF token is valid
-- Ensure token has write permissions
-- Verify username in repository name matches your account
-
-#### 3. Repository Already Exists
-
-**Error**: `Repository already exists`
-
-**Solution**: The script handles this automatically with `exist_ok=True`, but you can:
-- Use a different repository name
-- Delete the existing repository first
-- Use version numbers: `username/model-v2`
-
-#### 4. Large File Upload Issues
-
-**Error**: `Upload failed for large files`
-
-**Solution**:
-- Check your internet connection
-- Use Git LFS for large files
-- Consider splitting large models
-
-## Trackio Integration
-
-### Logging Push Actions
-
-When using Trackio integration, the script logs:
-
-- **Push Action**: Repository creation and file uploads
-- **Model Metadata**: Size, configuration, results
-- **Repository Info**: Name, privacy settings, URL
-- **Training Results**: Loss, accuracy, steps
-
-### Viewing Push Logs
-
-1. Go to your Trackio Space
-2. Navigate to the "View Experiments" tab
-3. Find your experiment
-4. Check the metrics for push-related actions
-
-## Security Best Practices
-
-### Token Management
-
-```bash
-# Use environment variables (recommended)
-export HF_TOKEN="your_token_here"
-python push_to_huggingface.py model repo
-
-# Don't hardcode tokens in scripts
-# ❌ Bad: python push_to_huggingface.py model repo --token "hf_xxx"
-```
-
-### Private Models
-
-```bash
-# For sensitive models, use private repositories
-python push_to_huggingface.py model username/private-model --private
-```
-
-### Repository Naming
-
-```bash
-# Use descriptive names
-python push_to_huggingface.py model username/smollm3-chatbot-v1
-
-# Include version numbers
-python push_to_huggingface.py model username/smollm3-v2.0
-```
-
-## Performance Optimization
-
-### Large Models
-
-For models > 5GB:
-
-```bash
-# Use Git LFS for large files
-git lfs install
-git lfs track "*.bin"
-
-# Consider splitting models
-python push_to_huggingface.py model username/model-large --private
-```
-
-### Upload Speed
-
-```bash
-# Use stable internet connection
-# Consider uploading during off-peak hours
-# Use private repositories for faster uploads
-```
-
-## Troubleshooting
-
-### Debug Mode
-
-```bash
-# Enable debug logging
-export LOG_LEVEL=DEBUG
-python push_to_huggingface.py model repo
-```
-
-### Validate Model Files
-
-```bash
-# Check model structure before pushing
-ls -la /path/to/model/
-# Should contain: config.json, pytorch_model.bin, tokenizer.json, etc.
-```
-
-### Test Repository Access
-
-```bash
-# Test your HF token
-python -c "
-from huggingface_hub import HfApi
-api = HfApi(token='your_token')
-print('Token is valid!')
-"
-```
-
-## Integration Examples
-
-### With CI/CD Pipeline
-
-```yaml
-# .github/workflows/train-and-push.yml
-name: Train and Push Model
-
-on:
-  push:
-    branches: [main]
-
-jobs:
-  train-and-push:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v2
-
-      - name: Train Model
-        run: |
-          python train.py config/train_smollm3.py
-
-      - name: Push to HF Hub
-        run: |
-          python push_to_huggingface.py /output username/model-${{ github.run_number }}
-        env:
-          HF_TOKEN: ${{ secrets.HF_TOKEN }}
-```
-
-### With Docker
-
-```dockerfile
-# Dockerfile
-FROM python:3.9
-
-WORKDIR /app
-COPY requirements.txt .
-RUN pip install -r requirements.txt
-
-COPY . .
-
-CMD ["python", "push_to_huggingface.py", "/model", "username/model"]
-```
-
-## Support and Resources
-
-### Documentation
-
-- [Hugging Face Hub Documentation](https://huggingface.co/docs/hub/index)
-- [Transformers Documentation](https://huggingface.co/docs/transformers/index)
-- [Model Cards Guide](https://huggingface.co/docs/hub/model-cards)
-
-### Community
-
-- [Hugging Face Forums](https://discuss.huggingface.co/)
-- [GitHub Issues](https://github.com/huggingface/huggingface_hub/issues)
-
-### Examples
-
-- [Model Repository Examples](https://huggingface.co/models?search=smollm3)
-- [Fine-tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads)
-
-## Conclusion
-
-The `push_to_huggingface.py` script provides a complete solution for:
-
-- ✅ **Easy Model Deployment** - One command to push models
-- ✅ **Professional Documentation** - Auto-generated model cards
-- ✅ **Training Artifacts** - Complete experiment tracking
-- ✅ **Integration Ready** - Works with CI/CD and monitoring
-- ✅ **Security Focused** - Proper token and privacy management
-
-Start sharing your fine-tuned SmolLM3 models with the community!
\ No newline at end of file
diff --git a/docs/PUSH_SCRIPT_GUIDE.md b/docs/PUSH_SCRIPT_GUIDE.md
deleted file mode 100644
index de9183e69eac81533ed8a432fea8d53101d38559..0000000000000000000000000000000000000000
--- a/docs/PUSH_SCRIPT_GUIDE.md
+++ /dev/null
@@ -1,267 +0,0 @@
-# 🚀 Push to Hugging Face Script Guide
-
-## Overview
-
-The `push_to_huggingface.py` script has been enhanced to integrate with **HF Datasets** for experiment tracking, and now provides complete model deployment with persistent experiment storage.
-
-## 🚀 Key Improvements
-
-### **1. HF Datasets Integration**
-- ✅ **Dataset Repository Support**: Configurable dataset repository for experiment storage
-- ✅ **Environment Variables**: Automatic detection of `HF_TOKEN` and `TRACKIO_DATASET_REPO`
-- ✅ **Enhanced Logging**: Logs push actions to both Trackio and HF Datasets
-- ✅ **Model Card Integration**: Includes dataset repository information in model cards
-
-### **2. Enhanced Configuration**
-- ✅ **Flexible Token Input**: Multiple ways to provide HF token
-- ✅ **Dataset Repository Tracking**: Links models to their experiment datasets
-- ✅ **Environment Variable Support**: Fallback to environment variables
-- ✅ **Command Line Arguments**: New arguments for HF Datasets integration
-
-### **3. Improved Model Cards**
-- ✅ **Dataset Repository Info**: Shows which dataset contains experiment data
-- ✅ **Experiment Tracking Section**: Explains how to access training data
-- ✅ **Enhanced Documentation**: Better model cards with experiment links
-
-## 📋 Usage Examples
-
-### **Basic Usage**
-```bash
-# Push model with default settings
-python push_to_huggingface.py /path/to/model username/repo-name
-```
-
-### **With HF Datasets Integration**
-```bash
-# Push model with custom dataset repository
-python push_to_huggingface.py /path/to/model username/repo-name \
- --dataset-repo username/experiments
-```
-
-### **With Custom Token**
-```bash
-# Push model with custom HF token
-python push_to_huggingface.py /path/to/model username/repo-name \
- --hf-token your_token_here
-```
-
-### **Complete Example**
-```bash
-# Push model with all options
-python push_to_huggingface.py /path/to/model username/repo-name \
- --dataset-repo username/experiments \
- --hf-token your_token_here \
- --private \
- --experiment-name "smollm3_finetune_v2"
-```
-
-## 🔧 Command Line Arguments
-
-| Argument | Required | Default | Description |
-|----------|----------|---------|-------------|
-| `model_path` | ✅ Yes | None | Path to trained model directory |
-| `repo_name` | ✅ Yes | None | HF repository name (username/repo-name) |
-| `--token` | ❌ No | `HF_TOKEN` env | Hugging Face token |
-| `--hf-token` | ❌ No | `HF_TOKEN` env | HF token (alternative to --token) |
-| `--private` | ❌ No | False | Make repository private |
-| `--trackio-url` | ❌ No | None | Trackio Space URL for logging |
-| `--experiment-name` | ❌ No | None | Experiment name for Trackio |
-| `--dataset-repo` | ❌ No | `TRACKIO_DATASET_REPO` env | HF Dataset repository |
-
-## 🛠️ Configuration Methods
-
-### **Method 1: Command Line Arguments**
-```bash
-python push_to_huggingface.py model_path repo_name \
- --dataset-repo username/experiments \
- --hf-token your_token_here
-```
-
-### **Method 2: Environment Variables**
-```bash
-export HF_TOKEN=your_token_here
-export TRACKIO_DATASET_REPO=username/experiments
-python push_to_huggingface.py model_path repo_name
-```
-
-### **Method 3: Hybrid Approach**
-```bash
-# Set defaults via environment variables
-export HF_TOKEN=your_token_here
-export TRACKIO_DATASET_REPO=username/experiments
-
-# Override specific values via command line
-python push_to_huggingface.py model_path repo_name \
- --dataset-repo username/specific-experiments
-```
-
-## 📊 What Gets Pushed
-
-### **Model Files**
-- ✅ **Model Weights**: `pytorch_model.bin`
-- ✅ **Configuration**: `config.json`
-- ✅ **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`
-- ✅ **All Other Files**: Any additional files in model directory
-
-### **Documentation**
-- ✅ **Model Card**: Comprehensive README.md with model information
-- ✅ **Training Configuration**: JSON configuration used for training
-- ✅ **Training Results**: JSON results and metrics
-- ✅ **Training Logs**: Text logs from training process
-
-### **Experiment Data**
-- ✅ **Dataset Repository**: Links to HF Dataset containing experiment data
-- ✅ **Training Metrics**: All training metrics stored in dataset
-- ✅ **Configuration**: Training configuration stored in dataset
-- ✅ **Artifacts**: Training artifacts and logs
-
-## 🔍 Enhanced Model Cards
-
-The improved script creates enhanced model cards that include:
-
-### **Model Information**
-- Base model and architecture
-- Training date and model size
-- **Dataset repository** for experiment data
-
-### **Training Configuration**
-- Complete training parameters
-- Hardware information
-- Training duration and steps
-
-### **Experiment Tracking**
-- Links to HF Dataset repository
-- Instructions for accessing experiment data
-- Training metrics and results
-
-### **Usage Examples**
-- Code examples for loading and using the model
-- Generation examples
-- Performance information
-
-## 📈 Logging Integration
-
-### **Trackio Logging**
-- ✅ **Push Actions**: Logs model push events
-- ✅ **Model Information**: Repository name, size, configuration
-- ✅ **Training Data**: Links to experiment dataset
-
-### **HF Datasets Logging**
-- ✅ **Experiment Summary**: Final training summary
-- ✅ **Push Metadata**: Model repository and push date
-- ✅ **Configuration**: Complete training configuration
-
-### **Dual Storage**
-- ✅ **Trackio**: Real-time monitoring and visualization
-- ✅ **HF Datasets**: Persistent experiment storage
-- ✅ **Synchronized**: Both systems updated together
-
-## 🚨 Troubleshooting
-
-### **Issue: "Missing required files"**
-**Solutions**:
-1. Check model directory contains required files
-2. Ensure model was saved correctly during training
-3. Verify file permissions
-
-### **Issue: "Failed to create repository"**
-**Solutions**:
-1. Check HF token has write permissions
-2. Verify repository name format: `username/repo-name`
-3. Ensure the repository name is available, or that your token has write access to the existing repository
-
-### **Issue: "Failed to upload files"**
-**Solutions**:
-1. Check network connectivity
-2. Verify HF token is valid
-3. Ensure repository was created successfully
-
-### **Issue: "Dataset repository not found"**
-**Solutions**:
-1. Check dataset repository exists
-2. Verify HF token has read access
-3. Use `--dataset-repo` to specify correct repository
-
-## 📋 Workflow Integration
-
-### **Complete Training Workflow**
-1. **Train Model**: Use training scripts with monitoring
-2. **Monitor Progress**: View metrics in Trackio interface
-3. **Push Model**: Use improved push script
-4. **Access Data**: View experiments in HF Dataset repository
-
-### **Example Workflow**
-```bash
-# 1. Train model with monitoring
-python train.py config/train_smollm3_openhermes_fr.py \
- --experiment_name "smollm3_french_v2"
-
-# 2. Push model to HF Hub
-python push_to_huggingface.py outputs/model username/smollm3-french \
- --dataset-repo username/experiments \
- --experiment-name "smollm3_french_v2"
-
-# 3. View results
-# - Model: https://huggingface.co/username/smollm3-french
-# - Experiments: https://huggingface.co/datasets/username/experiments
-# - Trackio: Your Trackio Space interface
-```
-
-## 🎯 Benefits
-
-### **For Model Deployment**
-- ✅ **Complete Documentation**: Enhanced model cards with experiment links
-- ✅ **Persistent Storage**: Experiment data stored in HF Datasets
-- ✅ **Easy Access**: Direct links to training data and metrics
-- ✅ **Reproducibility**: Complete training configuration included
-
-### **For Experiment Management**
-- ✅ **Centralized Storage**: All experiments in HF Dataset repository
-- ✅ **Version Control**: Model versions linked to experiment data
-- ✅ **Collaboration**: Share experiments and models easily
-- ✅ **Searchability**: Easy to find specific experiments
-
-### **For Development**
-- ✅ **Flexible Configuration**: Multiple ways to set parameters
-- ✅ **Backward Compatible**: Works with existing setups
-- ✅ **Error Handling**: Clear error messages and troubleshooting
-- ✅ **Integration**: Works with existing monitoring system
-
-## 📊 Testing Results
-
-All push script tests passed:
-- ✅ **HuggingFacePusher Initialization**: Works with new parameters
-- ✅ **Model Card Creation**: Includes HF Datasets integration
-- ✅ **Logging Integration**: Logs to both Trackio and HF Datasets
-- ✅ **Argument Parsing**: Handles new command line arguments
-- ✅ **Environment Variables**: Proper fallback handling
-
-## 🔄 Migration Guide
-
-### **From Old Script**
-```bash
-# Old way
-python push_to_huggingface.py model_path repo_name --token your_token
-
-# New way (same functionality)
-python push_to_huggingface.py model_path repo_name --hf-token your_token
-
-# New way with HF Datasets
-python push_to_huggingface.py model_path repo_name \
- --hf-token your_token \
- --dataset-repo username/experiments
-```
-
-### **Environment Variables**
-```bash
-# Set environment variables for automatic detection
-export HF_TOKEN=your_token_here
-export TRACKIO_DATASET_REPO=username/experiments
-
-# Then use simple command
-python push_to_huggingface.py model_path repo_name
-```
-
----
-
-**🎉 Your push script is now fully integrated with HF Datasets for complete experiment tracking and model deployment!**
\ No newline at end of file
diff --git a/docs/QUANTIZATION_FIX_SUMMARY.md b/docs/QUANTIZATION_FIX_SUMMARY.md
deleted file mode 100644
index f6aa79576af8f21dd528f6ece8741ec3ccbd5419..0000000000000000000000000000000000000000
--- a/docs/QUANTIZATION_FIX_SUMMARY.md
+++ /dev/null
@@ -1,165 +0,0 @@
-# Quantization Fix Summary
-
-## Issues Identified
-
-The quantization script was failing due to several compatibility issues:
-
-1. **Int8 Quantization Error**:
- - Error: `The model is quantized with QuantizationMethod.TORCHAO and is not serializable`
- - Cause: Offloaded modules in the model cannot be quantized with torchao
- - Solution: Added alternative save method and fallback to bitsandbytes
-
-2. **Int4 Quantization Error**:
- - Error: `Could not run 'aten::_convert_weight_to_int4pack_for_cpu' with arguments from the 'CUDA' backend`
- - Cause: Int4 quantization requires CPU backend but was being attempted on CUDA
- - Solution: Added proper device selection logic
-
-3. **Monitoring Error**:
- - Error: `'SmolLM3Monitor' object has no attribute 'log_event'`
- - Cause: Incorrect monitoring API usage
- - Solution: Added flexible monitoring method detection
-
-## Fixes Implemented
-
-### 1. Enhanced Device Management (`scripts/model_tonic/quantize_model.py`)
-
-```python
-def get_optimal_device(self, quant_type: str) -> str:
- """Get optimal device for quantization type"""
- if quant_type == "int4_weight_only":
- # Int4 quantization works better on CPU
- return "cpu"
- elif quant_type == "int8_weight_only":
- # Int8 quantization works on GPU
- if torch.cuda.is_available():
- return "cuda"
- else:
- logger.warning("⚠️ CUDA not available, falling back to CPU for int8")
- return "cpu"
- else:
- return "auto"
-```
-
-### 2. Alternative Quantization Method
-
-Added `quantize_model_alternative()` method using bitsandbytes for better compatibility:
-
-```python
-def quantize_model_alternative(self, quant_type: str, device: str = "auto", group_size: int = 128, save_dir: Optional[str] = None) -> Optional[str]:
- """Alternative quantization using bitsandbytes for better compatibility"""
- # Uses BitsAndBytesConfig instead of TorchAoConfig
- # Handles serialization issues better
-```
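-
-A minimal sketch of that alternative path with `BitsAndBytesConfig` (load-and-save flow; the real script's arguments may differ):
-
-```python
-import torch
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-def quantize_with_bnb(model_path: str, save_dir: str, bits: int = 8) -> str:
-    """Load the checkpoint in 8-bit (or 4-bit) via bitsandbytes and save it."""
-    bnb_config = BitsAndBytesConfig(
-        load_in_8bit=(bits == 8),
-        load_in_4bit=(bits == 4),
-    )
-    model = AutoModelForCausalLM.from_pretrained(
-        model_path,
-        quantization_config=bnb_config,
-        device_map="auto",
-        torch_dtype=torch.bfloat16,
-    )
-    # bitsandbytes-quantized models serialize through the standard API,
-    # which sidesteps the torchao serialization error described above
-    model.save_pretrained(save_dir)
-    return save_dir
-```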
-
-### 3. Improved Error Handling
-
-- Added fallback from torchao to bitsandbytes
-- Enhanced save method with alternative approaches
-- Better device mapping for different quantization types
-
-### 4. Fixed Monitoring Integration
-
-```python
-def log_to_trackio(self, action: str, details: Dict[str, Any]):
- """Log quantization events to Trackio"""
- if self.monitor:
- try:
- # Use the correct monitoring method
- if hasattr(self.monitor, 'log_event'):
- self.monitor.log_event(action, details)
- elif hasattr(self.monitor, 'log_metric'):
- self.monitor.log_metric(action, details.get('value', 1.0))
- elif hasattr(self.monitor, 'log'):
- self.monitor.log(action, details)
- else:
- logger.info(f"📊 {action}: {details}")
- except Exception as e:
- logger.warning(f"⚠️ Failed to log to Trackio: {e}")
-```
-
-## Usage Instructions
-
-### 1. Install Dependencies
-
-```bash
-pip install -r requirements_quantization.txt
-```
-
-### 2. Run Quantization
-
-```bash
-python3 quantize_and_push.py
-```
-
-### 3. Test Fixes
-
-```bash
-python3 test_quantization_fix.py
-```
-
-## Expected Behavior
-
-### Successful Quantization
-
-The script will now:
-
-1. **Try torchao first** for each quantization type
-2. **Fall back to bitsandbytes** if torchao fails
-3. **Use appropriate devices** (CPU for int4, GPU for int8)
-4. **Handle serialization issues** with alternative save methods
-5. **Log progress** without monitoring errors
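-
-Put together, the control flow is roughly the following sketch (`get_optimal_device` and `quantize_model_alternative` are the methods shown above; `quantize_model` is assumed to be the torchao entry point):
-
-```python
-def quantize_with_fallback(quantizer, quant_type: str):
-    """Sketch: torchao first, bitsandbytes as fallback, device chosen per type."""
-    device = quantizer.get_optimal_device(quant_type)  # "cpu" for int4, "cuda" for int8
-    result = quantizer.quantize_model(quant_type, device=device)
-    if result is None:
-        # torchao failed (e.g., a serialization error) -> try bitsandbytes
-        result = quantizer.quantize_model_alternative(quant_type, device=device)
-    return result
-```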
-
-### Output
-
-```
-✅ Model files validated
-🔄 Processing quantization type: int8_weight_only
-🔄 Using device: cuda
-✅ int8_weight_only quantization and push completed
-🔄 Processing quantization type: int4_weight_only
-🔄 Using device: cpu
-✅ int4_weight_only quantization and push completed
-📊 Quantization summary: 2/2 successful
-✅ Quantization completed successfully!
-```
-
-## Troubleshooting
-
-### If All Quantization Fails
-
-1. **Install bitsandbytes**:
- ```bash
- pip install bitsandbytes
- ```
-
-2. **Check model path**:
- ```bash
- ls -la /output-checkpoint
- ```
-
-3. **Verify dependencies**:
- ```bash
- python3 test_quantization_fix.py
- ```
-
-### Common Issues
-
-1. **Memory Issues**: Use CPU for int4 quantization
-2. **Serialization Errors**: The script now handles these automatically
-3. **Device Conflicts**: Automatic device selection based on quantization type
-
-## Files Modified
-
-1. `scripts/model_tonic/quantize_model.py` - Main quantization logic
-2. `quantize_and_push.py` - Main script with better error handling
-3. `test_quantization_fix.py` - Test script for verification
-4. `requirements_quantization.txt` - Dependencies file
-
-## Next Steps
-
-1. Run the test script to verify fixes
-2. Install bitsandbytes if not already installed
-3. Run the quantization script
-4. Check the Hugging Face repository for quantized models
-
-The fixes ensure robust quantization with multiple fallback options and proper error handling.
\ No newline at end of file
diff --git a/docs/QUANTIZATION_GUIDE.md b/docs/QUANTIZATION_GUIDE.md
deleted file mode 100644
index 731b6482b5676a5c8253ea018b3e057a02101537..0000000000000000000000000000000000000000
--- a/docs/QUANTIZATION_GUIDE.md
+++ /dev/null
@@ -1,313 +0,0 @@
-# Model Quantization Guide
-
-## Overview
-
-This guide covers the quantization functionality integrated into the SmolLM3 fine-tuning pipeline. The system supports creating quantized versions of trained models using `torchao` and automatically uploading them to Hugging Face Hub in a unified repository structure.
-
-## Repository Structure
-
-With the updated pipeline, all models (main and quantized) are stored in a single repository:
-
-```
-your-username/model-name/
-├── README.md (unified model card)
-├── config.json
-├── pytorch_model.bin
-├── tokenizer.json
-├── tokenizer_config.json
-├── int8/ (quantized model for GPU)
-│ ├── README.md
-│ ├── config.json
-│ └── pytorch_model.bin
-└── int4/ (quantized model for CPU)
- ├── README.md
- ├── config.json
- └── pytorch_model.bin
-```
-
-## Quantization Types
-
-### int8 Weight-Only Quantization (GPU Optimized)
-- **Memory Reduction**: ~50% compared to original model
-- **Speed**: Faster inference with minimal accuracy loss
-- **Hardware**: GPU optimized for high-performance inference
-- **Use Case**: Production deployments with GPU resources
-
-### int4 Weight-Only Quantization (CPU Optimized)
-- **Memory Reduction**: ~75% compared to original model
-- **Speed**: Significantly faster inference with some accuracy trade-off
-- **Hardware**: CPU optimized for deployment
-- **Use Case**: Edge deployment, CPU-only environments
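-
-Applied through transformers' torchao integration, either type looks roughly like this sketch (the repository name is a placeholder):
-
-```python
-import torch
-from transformers import AutoModelForCausalLM, TorchAoConfig
-
-# int8_weight_only for GPU serving; switch to "int4_weight_only" (on CPU) for edge deployment
-quant_config = TorchAoConfig("int8_weight_only", group_size=128)
-model = AutoModelForCausalLM.from_pretrained(
-    "your-username/model-name",
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    quantization_config=quant_config,
-)
-```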
-
-## Integration with Pipeline
-
-### Automatic Quantization
-
-The quantization process is integrated into the main training pipeline:
-
-1. **Training**: Model is trained using the standard pipeline
-2. **Model Push**: Main model is pushed to Hugging Face Hub
-3. **Quantization Options**: User is prompted to create quantized versions
-4. **Quantized Models**: Quantized models are created and pushed to subdirectories
-5. **Unified Documentation**: Single model card covers all versions
-
-### Pipeline Integration
-
-The quantization step is added to `launch.sh` after the main model push:
-
-```bash
-# Step 16.5: Quantization Options
-print_step "Step 16.5: Model Quantization Options"
-echo "=========================================="
-
-print_info "Would you like to create quantized versions of your model?"
-print_info "Quantization reduces model size and improves inference speed."
-
-# Ask about quantization
-get_input "Create quantized models? (y/n)" "y" "CREATE_QUANTIZED"
-
-if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
- print_info "Quantization options:"
- print_info "1. int8_weight_only (GPU optimized, ~50% memory reduction)"
- print_info "2. int4_weight_only (CPU optimized, ~75% memory reduction)"
- print_info "3. Both int8 and int4 versions"
-
- select_option "Select quantization type:" "int8_weight_only" "int4_weight_only" "both" "QUANT_TYPE"
-
- # Create quantized models in the same repository
- python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
- --quant-type "$QUANT_TYPE" \
- --device "$DEVICE" \
- --token "$HF_TOKEN" \
- --trackio-url "$TRACKIO_URL" \
- --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
- --dataset-repo "$TRACKIO_DATASET_REPO"
-fi
-```
-
-## Standalone Quantization
-
-### Using the Standalone Script
-
-For models already uploaded to Hugging Face Hub:
-
-```bash
-python scripts/model_tonic/quantize_standalone.py \
- "your-username/model-name" \
- "your-username/model-name" \
- --quant-type "int8_weight_only" \
- --device "auto" \
- --token "your-hf-token"
-```
-
-### Command Line Options
-
-```bash
-python scripts/model_tonic/quantize_standalone.py model_path repo_name [options]
-
-Options:
- --quant-type {int8_weight_only,int4_weight_only,int8_dynamic}
- Quantization type (default: int8_weight_only)
- --device DEVICE Device for quantization (auto, cpu, cuda)
- --group-size GROUP_SIZE
- Group size for quantization (default: 128)
- --token TOKEN Hugging Face token
- --private Create private repository
- --trackio-url TRACKIO_URL
- Trackio URL for monitoring
- --experiment-name EXPERIMENT_NAME
- Experiment name for tracking
- --dataset-repo DATASET_REPO
- HF Dataset repository
- --save-only Save quantized model locally without pushing to HF
-```
-
-## Loading Quantized Models
-
-### Loading Main Model
-
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load the main model
-model = AutoModelForCausalLM.from_pretrained(
- "your-username/model-name",
- device_map="auto",
- torch_dtype=torch.bfloat16
-)
-tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
-```
-
-### Loading int8 Quantized Model (GPU)
-
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load int8 quantized model (GPU optimized)
-model = AutoModelForCausalLM.from_pretrained(
-    "your-username/model-name",
-    subfolder="int8",  # hub repo IDs are namespace/name, so select the int8/ subdirectory explicitly
-    device_map="auto",
-    torch_dtype=torch.bfloat16
-)
-tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int8")
-```
-
-### Loading int4 Quantized Model (CPU)
-
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load int4 quantized model (CPU optimized)
-model = AutoModelForCausalLM.from_pretrained(
-    "your-username/model-name",
-    subfolder="int4",  # the int4 variant lives in the repo's int4/ subdirectory
-    device_map="cpu",
-    torch_dtype=torch.bfloat16
-)
-tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int4")
-```
-
-## Usage Examples
-
-### Text Generation with Quantized Model
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load quantized model
-model = AutoModelForCausalLM.from_pretrained("your-username/model-name", subfolder="int8")
-tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int8")
-
-# Generate text
-text = "The future of artificial intelligence is"
-inputs = tokenizer(text, return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=100)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-```
-
-### Conversation with Quantized Model
-
-```python
-def chat_with_quantized_model(prompt, max_length=100):
- inputs = tokenizer(prompt, return_tensors="pt")
- outputs = model.generate(**inputs, max_new_tokens=max_length)
- return tokenizer.decode(outputs[0], skip_special_tokens=True)
-
-response = chat_with_quantized_model("Hello, how are you today?")
-print(response)
-```
-
-## Configuration Options
-
-### Quantization Parameters
-
-- **group_size**: Group size for quantization (default: 128)
-- **device**: Target device for quantization (auto, cpu, cuda)
-- **quant_type**: Type of quantization to apply
-
-### Hardware Requirements
-
-- **Main Model**: GPU with 8GB+ VRAM recommended
-- **int8 Model**: GPU with 4GB+ VRAM
-- **int4 Model**: CPU deployment possible
-
-## Performance Comparison
-
-| Model Type | Memory Usage | Speed | Accuracy | Use Case |
-|------------|--------------|-------|----------|----------|
-| Original | 100% | Baseline | Best | Development, Research |
-| int8 | ~50% | Faster | Minimal loss | Production GPU |
-| int4 | ~25% | Fastest | Some loss | Edge, CPU deployment |
-
-## Best Practices
-
-### When to Use Quantization
-
-1. **int8 (GPU)**: When you need faster inference with minimal accuracy loss
-2. **int4 (CPU)**: When deploying to CPU-only environments or edge devices
-3. **Both**: When you need flexibility for different deployment scenarios
-
-### Memory Optimization
-
-- Use int8 for GPU deployments with memory constraints
-- Use int4 for CPU deployments or very memory-constrained environments
-- Consider the trade-off between speed and accuracy
-
-### Deployment Considerations
-
-- Test quantized models on your specific use case
-- Monitor performance and accuracy in production
-- Consider using the main model for development and quantized versions for deployment
-
-## Troubleshooting
-
-### Common Issues
-
-1. **CUDA Out of Memory**: Reduce batch size or use int8 quantization
-2. **Import Errors**: Install torchao: `pip install "torchao>=0.10.0"`
-3. **Model Loading Errors**: Ensure the model path is correct and accessible
-
-### Debugging
-
-```bash
-# Test quantization functionality
-python tests/test_quantization.py
-
-# Check torchao installation
-python -c "import torchao; print('torchao available')"
-
-# Verify model files
-ls -la /path/to/model/
-```
-
-## Monitoring and Tracking
-
-### Trackio Integration
-
-Quantization events are logged to Trackio:
-
-- `quantization_started`: When quantization begins
-- `quantization_completed`: When quantization finishes
-- `quantized_model_pushed`: When model is uploaded to HF Hub
-- `quantization_failed`: If quantization fails
-
-### Metrics Tracked
-
-- Quantization type and parameters
-- Model size reduction
-- Upload URLs for quantized models
-- Processing time and success status
-
-## Dependencies
-
-### Required Packages
-
-```bash
-pip install "torchao>=0.10.0"
-pip install "transformers>=4.35.0"
-pip install "huggingface_hub>=0.16.0"
-```
-
-### Optional Dependencies
-
-```bash
-pip install "accelerate>=0.20.0"  # for device mapping
-pip install "bitsandbytes>=0.41.0"  # for additional quantization backends
-```
-
-## References
-
-- [torchao Documentation](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
-- [Hugging Face Model Cards](https://huggingface.co/docs/hub/model-cards)
-- [Transformers Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization)
-
-## Support
-
-For issues and questions:
-
-1. Check the troubleshooting section above
-2. Review the test files in `tests/test_quantization.py`
-3. Open an issue on the project repository
-4. Check the Trackio monitoring for detailed logs
\ No newline at end of file
diff --git a/docs/QUANTIZATION_IMPLEMENTATION_SUMMARY.md b/docs/QUANTIZATION_IMPLEMENTATION_SUMMARY.md
deleted file mode 100644
index c16bbf363fd7e1c53416abb530ec26dd60f01027..0000000000000000000000000000000000000000
--- a/docs/QUANTIZATION_IMPLEMENTATION_SUMMARY.md
+++ /dev/null
@@ -1,248 +0,0 @@
-# Quantization Implementation Summary
-
-This document summarizes the torchao quantization features that have been added to the SmolLM3 fine-tuning pipeline.
-
-## 🚀 New Features Added
-
-### 1. Core Quantization Scripts
-
-#### `scripts/model_tonic/quantize_model.py`
-- **Main quantization script** with full HF Hub integration
-- Supports int8 (GPU) and int4 (CPU) quantization
-- Automatic model card and README generation
-- Trackio monitoring integration
-- Comprehensive error handling and validation
-
-#### `scripts/model_tonic/quantize_standalone.py`
-- **Standalone quantization script** for independent use
-- Simple command-line interface
-- Option to save locally without pushing to HF Hub
-- Quick quantization workflow
-
-### 2. Pipeline Integration
-
-#### Updated `launch.sh`
-- **Interactive quantization prompts** after model training
-- Support for single or dual quantization (int8 + int4)
-- Automatic repository naming with quantization suffixes
-- Enhanced summary reporting with quantization results
-
-### 3. Documentation
-
-#### `docs/QUANTIZATION_GUIDE.md`
-- **Comprehensive quantization guide**
-- Usage examples and best practices
-- Performance comparisons
-- Troubleshooting section
-- Advanced configuration options
-
-#### Updated `README.md`
-- **Quantization section** with quick start examples
-- Integration with main pipeline documentation
-- Loading quantized models examples
-
-### 4. Testing
-
-#### `tests/test_quantization.py`
-- **Comprehensive test suite** for quantization functionality
-- Tests for imports, initialization, configuration creation
-- Model validation and documentation generation tests
-- Automated testing workflow
-
-### 5. Dependencies
-
-#### Updated `requirements/requirements.txt`
-- **Added torchao>=0.10.0** for quantization support
-- Maintains compatibility with existing dependencies
-
-## 🔧 Quantization Types Supported
-
-### int8_weight_only (GPU Optimized)
-- **Memory Reduction**: ~50%
-- **Accuracy**: Minimal degradation
-- **Speed**: Faster inference
-- **Hardware**: GPU optimized
-- **Use Case**: High-performance inference on GPU
-
-### int4_weight_only (CPU Optimized)
-- **Memory Reduction**: ~75%
-- **Accuracy**: Some degradation acceptable
-- **Speed**: Significantly faster inference
-- **Hardware**: CPU optimized
-- **Use Case**: Deployment on CPU or memory-constrained environments
-
-### int8_dynamic (Dynamic Quantization)
-- **Memory Reduction**: ~50%
-- **Accuracy**: Minimal degradation
-- **Speed**: Faster inference
-- **Hardware**: GPU optimized
-- **Use Case**: Dynamic quantization during inference
-
-## 📋 Usage Examples
-
-### Interactive Pipeline (launch.sh)
-```bash
-./launch.sh
-# Complete training and model push
-# Choose quantization options when prompted:
-# - y/n for quantization
-# - int8_weight_only / int4_weight_only / both
-```
-
-### Standalone Quantization
-```bash
-# Quantize and push to HF Hub
-python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \
- --quant-type int8_weight_only \
- --token YOUR_HF_TOKEN
-
-# Quantize and save locally
-python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \
- --quant-type int4_weight_only \
- --device cpu \
- --save-only
-```
-
-### Loading Quantized Models
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load int8 quantized model (GPU)
-model = AutoModelForCausalLM.from_pretrained(
- "your-username/model-int8",
- device_map="auto",
- torch_dtype=torch.bfloat16
-)
-
-# Load int4 quantized model (CPU)
-model = AutoModelForCausalLM.from_pretrained(
- "your-username/model-int4",
- device_map="cpu",
- torch_dtype=torch.bfloat16
-)
-```
-
-## 🧪 Testing
-
-Run the quantization tests:
-```bash
-python tests/test_quantization.py
-```
-
-Tests cover:
-- Import validation
-- Quantizer initialization
-- Configuration creation
-- Model validation
-- Documentation generation
-
-## 📊 Performance Comparison
-
-| Model Type | Memory Usage | Speed | Accuracy | Hardware |
-|------------|--------------|-------|----------|----------|
-| Original | 100% | Baseline | Best | GPU/CPU |
-| int8 | ~50% | Faster | Minimal loss | GPU |
-| int4 | ~25% | Fastest | Some loss | CPU |
-
-## 🔍 Key Features
-
-### 1. Automatic Integration
-- Seamlessly integrated into the main training pipeline
-- Interactive prompts for quantization options
-- Automatic repository creation and naming
-
-### 2. Comprehensive Documentation
-- Automatic model card generation
-- Detailed README creation
-- Usage examples and best practices
-
-### 3. Monitoring Integration
-- Trackio logging for quantization events
-- Performance metrics tracking
-- Artifact storage and versioning
-
-### 4. Error Handling
-- Robust validation of model paths
-- Graceful handling of quantization failures
-- Detailed error messages and logging
-
-### 5. Flexibility
-- Support for multiple quantization types
-- Standalone usage option
-- Custom configuration options
-
-## 🛠️ Technical Implementation
-
-### Core Components
-
-1. **ModelQuantizer Class**
- - Main quantization orchestration
- - HF Hub integration
- - Trackio monitoring
- - Error handling and validation
-
-2. **Quantization Configuration**
- - torchao configuration management
- - Device-specific optimizations
- - Group size and parameter tuning
-
-3. **Documentation Generation**
- - Automatic model card creation
- - README generation with usage examples
- - Performance and limitation documentation
-
-4. **Pipeline Integration**
- - Interactive prompts in launch.sh
- - Automatic repository naming
- - Enhanced summary reporting
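-
-A rough skeleton of how the `ModelQuantizer` class ties these components together (method names follow the list above; the bodies are assumptions, not the actual implementation):
-
-```python
-from pathlib import Path
-from typing import Optional
-
-class ModelQuantizer:
-    """Orchestrates quantization, documentation, upload, and monitoring."""
-
-    def __init__(self, model_path: str, repo_name: str, token: Optional[str] = None):
-        self.model_path = Path(model_path)
-        self.repo_name = repo_name
-        self.token = token
-
-    def validate_model(self) -> bool:
-        # A real implementation also checks for weight and tokenizer files
-        return (self.model_path / "config.json").exists()
-
-    def quantize_and_push(self, quant_type: str = "int8_weight_only") -> bool:
-        if not self.validate_model():
-            return False
-        # 1. build the torchao config for quant_type
-        # 2. load, quantize, and save the model locally
-        # 3. generate the model card / README
-        # 4. upload to HF Hub and log events to Trackio
-        return True
-```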
-
-## 📈 Benefits
-
-### For Users
-- **Easy Integration**: Seamless addition to existing pipeline
-- **Multiple Options**: Choose quantization type based on needs
-- **Performance**: Significant memory and speed improvements
-- **Documentation**: Automatic comprehensive documentation
-
-### For Deployment
-- **GPU Optimization**: int8 for high-performance inference
-- **CPU Optimization**: int4 for resource-constrained environments
-- **Memory Efficiency**: 50-75% memory reduction
-- **Speed Improvement**: Faster inference times
-
-## 🔮 Future Enhancements
-
-### Planned Features
-1. **Additional Quantization Types**: Support for more torchao configurations
-2. **Automated Benchmarking**: Performance comparison tools
-3. **Batch Quantization**: Process multiple models simultaneously
-4. **Custom Configurations**: Advanced quantization parameter tuning
-5. **Integration Testing**: End-to-end quantization workflow tests
-
-### Potential Improvements
-1. **Quantization-Aware Training**: Support for QAT workflows
-2. **Mixed Precision**: Advanced precision optimization
-3. **Hardware-Specific**: Optimizations for specific GPU/CPU types
-4. **Automated Selection**: Smart quantization type selection
-
-## 📚 References
-
-- [torchao Documentation](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
-- [Hugging Face Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization)
-- [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html)
-
-## 🎯 Summary
-
-The quantization implementation provides a complete, production-ready solution for creating optimized versions of fine-tuned SmolLM3 models. The integration is seamless, the documentation is comprehensive, and the functionality is robust and well-tested.
-
-Key achievements:
-- ✅ Full pipeline integration
-- ✅ Multiple quantization types
-- ✅ Comprehensive documentation
-- ✅ Robust error handling
-- ✅ Testing suite
-- ✅ Monitoring integration
-- ✅ Standalone usage option
-
-The implementation follows the repository's architecture patterns and maintains consistency with existing code structure and documentation standards.
\ No newline at end of file
diff --git a/docs/README_END_TO_END.md b/docs/README_END_TO_END.md
deleted file mode 100644
index 2bd3e8562796421b5338318b6df737fddb74ae44..0000000000000000000000000000000000000000
--- a/docs/README_END_TO_END.md
+++ /dev/null
@@ -1,303 +0,0 @@
-# SmolLM3 End-to-End Fine-tuning Pipeline
-
-This repository provides a complete end-to-end pipeline for fine-tuning SmolLM3 models with integrated experiment tracking, monitoring, and model deployment.
-
-## 🚀 Quick Start
-
-### 1. Setup Configuration
-
-```bash
-# Run the setup script to configure with your information
-python setup_launch.py
-```
-
-### 2. Check Requirements
-
-```bash
-# Verify all dependencies are installed
-python check_requirements.py
-```
-
-### 3. Run the Pipeline
-
-```bash
-# Make the script executable and run
-chmod +x launch.sh
-./launch.sh
-```
-
-This will prompt you for:
-- Your Hugging Face token
-- Optional model and dataset customizations
-
-## 📋 What the Pipeline Does
-
-The end-to-end pipeline performs the following steps:
-
-### 1. **Environment Setup**
-- Installs system dependencies
-- Creates Python virtual environment
-- Installs PyTorch with CUDA support
-- Installs all required Python packages
-
-### 2. **Trackio Space Deployment**
-- Creates a new Hugging Face Space for experiment tracking
-- Configures the Trackio monitoring interface
-- Sets up environment variables
-
-### 3. **HF Dataset Setup**
-- Creates a Hugging Face Dataset repository for experiment storage
-- Configures dataset access and permissions
-- Sets up initial experiment data structure
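-
-Steps 2 and 3 reduce to two `huggingface_hub` calls; a minimal sketch (repository names are placeholders):
-
-```python
-from huggingface_hub import create_repo
-
-token = "hf_..."  # your HF token
-
-# Trackio monitoring Space (Gradio SDK)
-create_repo("your-username/trackio-monitoring", repo_type="space",
-            space_sdk="gradio", token=token, exist_ok=True)
-
-# Dataset repository for experiment storage
-create_repo("your-username/trackio-experiments", repo_type="dataset",
-            token=token, exist_ok=True)
-```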
-
-### 4. **Dataset Preparation**
-- Downloads the specified dataset from Hugging Face Hub
-- Converts to training format (prompt/completion pairs)
-- Handles multiple dataset formats automatically
-- Creates train/validation splits
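-
-The conversion step looks roughly like this sketch (dataset name is a placeholder; field handling assumes the formats listed under "Custom Datasets" below):
-
-```python
-from datasets import load_dataset
-
-def to_prompt_completion(example):
-    # Chat-style records: use the first user/assistant turn pair
-    if "messages" in example:
-        msgs = example["messages"]
-        return {"prompt": msgs[0]["content"], "completion": msgs[1]["content"]}
-    # Instruction-style records
-    if "instruction" in example:
-        return {"prompt": example["instruction"], "completion": example["output"]}
-    # Already in prompt/completion form
-    return {"prompt": example["prompt"], "completion": example["completion"]}
-
-ds = load_dataset("your-dataset", split="train")
-ds = ds.map(to_prompt_completion, remove_columns=ds.column_names)
-splits = ds.train_test_split(test_size=0.05)  # train/validation split
-```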
-
-### 5. **Training Configuration**
-- Creates optimized training configuration
-- Sets up monitoring integration
-- Configures model parameters and hyperparameters
-
-### 6. **Model Training**
-- Runs the SmolLM3 fine-tuning process
-- Logs metrics to Trackio Space in real-time
-- Saves experiment data to HF Dataset
-- Creates checkpoints during training
-
-### 7. **Model Deployment**
-- Pushes trained model to Hugging Face Hub
-- Creates comprehensive model card
-- Uploads training results and logs
-- Tests the uploaded model
-
-### 8. **Summary Report**
-- Generates detailed training summary
-- Provides links to all resources
-- Documents configuration and results
-
-## 🎯 Features
-
-### **Integrated Monitoring**
-- Real-time experiment tracking via Trackio Space
-- Persistent storage in Hugging Face Datasets
-- Comprehensive metrics logging
-- System resource monitoring
-
-### **Flexible Dataset Support**
-- Automatic format detection and conversion
-- Support for multiple dataset types
-- Built-in data preprocessing
-- Train/validation split handling
-
-### **Optimized Training**
-- Flash Attention support for efficiency
-- Gradient checkpointing for memory optimization
-- Mixed precision training
-- Automatic hyperparameter optimization
-
-### **Complete Deployment**
-- Automated model upload to Hugging Face Hub
-- Comprehensive model cards
-- Training results documentation
-- Model testing and validation
-
-## 📊 Monitoring & Tracking
-
-### **Trackio Space Interface**
-- Real-time training metrics visualization
-- Experiment management and comparison
-- System resource monitoring
-- Training progress tracking
-
-### **HF Dataset Storage**
-- Persistent experiment data storage
-- Version-controlled experiment history
-- Collaborative experiment sharing
-- Automated data backup
-
-## 🔧 Configuration
-
-### **Required Configuration**
-Update these variables in `launch.sh`:
-
-```bash
-# Your Hugging Face credentials
-HF_TOKEN="your_hf_token_here"
-HF_USERNAME="your-username"
-
-# Model and dataset
-MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
-DATASET_NAME="HuggingFaceTB/smoltalk"
-
-# Output repositories
-REPO_NAME="your-username/smollm3-finetuned-$(date +%Y%m%d)"
-TRACKIO_DATASET_REPO="your-username/trackio-experiments"
-```
-
-### **Training Parameters**
-Customize training parameters:
-
-```bash
-# Training configuration
-BATCH_SIZE=2
-GRADIENT_ACCUMULATION_STEPS=8
-LEARNING_RATE=5e-6
-MAX_EPOCHS=3
-MAX_SEQ_LENGTH=4096
-```
-
-## 📁 Output Structure
-
-After running the pipeline, you'll have:
-
-```
-├── training_dataset/ # Prepared dataset
-│ ├── train.json
-│ └── validation.json
-├── /output-checkpoint/ # Model checkpoints
-│ ├── config.json
-│ ├── pytorch_model.bin
-│ └── training_results/
-├── training.log # Training logs
-├── training_summary.md # Summary report
-└── config/train_smollm3_end_to_end.py # Training config
-```
-
-## 🌐 Online Resources
-
-The pipeline creates these online resources:
-
-- **Model Repository**: `https://huggingface.co/your-username/smollm3-finetuned-YYYYMMDD`
-- **Trackio Space**: `https://huggingface.co/spaces/your-username/trackio-monitoring-YYYYMMDD`
-- **Experiment Dataset**: `https://huggingface.co/datasets/your-username/trackio-experiments`
-
-## 🛠️ Troubleshooting
-
-### **Common Issues**
-
-1. **HF Token Issues**
- ```bash
- # Verify your token is correct
- hf auth whoami
- ```
-
-2. **CUDA Issues**
- ```bash
- # Check CUDA availability
- python -c "import torch; print(torch.cuda.is_available())"
- ```
-
-3. **Memory Issues**
- ```bash
- # Reduce batch size or gradient accumulation
- BATCH_SIZE=1
- GRADIENT_ACCUMULATION_STEPS=16
- ```
-
-4. **Dataset Issues**
- ```bash
- # Test dataset access
- python -c "from datasets import load_dataset; print(load_dataset('your-dataset'))"
- ```
-
-### **Debug Mode**
-
-Run individual components for debugging:
-
-```bash
-# Test Trackio deployment
-cd scripts/trackio_tonic
-python deploy_trackio_space.py
-
-# Test dataset setup
-cd scripts/dataset_tonic
-python setup_hf_dataset.py
-
-# Test training
-python src/train.py config/train_smollm3_end_to_end.py --help
-```
-
-## 📚 Advanced Usage
-
-### **Custom Datasets**
-
-For custom datasets, ensure they have one of these formats:
-
-```json
-// Format 1: Prompt/Completion
-{
- "prompt": "What is machine learning?",
- "completion": "Machine learning is..."
-}
-
-// Format 2: Instruction/Output
-{
- "instruction": "Explain machine learning",
- "output": "Machine learning is..."
-}
-
-// Format 3: Chat format
-{
- "messages": [
- {"role": "user", "content": "What is ML?"},
- {"role": "assistant", "content": "ML is..."}
- ]
-}
-```
-
-### **Custom Models**
-
-To use different models, update the configuration:
-
-```bash
-MODEL_NAME="microsoft/DialoGPT-medium"
-MAX_SEQ_LENGTH=1024
-```
-
-### **Custom Training**
-
-Modify training parameters in the generated config:
-
-```python
-# In config/train_smollm3_end_to_end.py
-config = SmolLM3Config(
- learning_rate=1e-5, # Custom learning rate
- max_iters=5000, # Custom training steps
- # ... other parameters
-)
-```
-
-## 🤝 Contributing
-
-1. Fork the repository
-2. Create a feature branch
-3. Make your changes
-4. Test the pipeline
-5. Submit a pull request
-
-## 📄 License
-
-This project is licensed under the MIT License - see the LICENSE file for details.
-
-## 🙏 Acknowledgments
-
-- Hugging Face for the excellent transformers library
-- The SmolLM3 team for the base model
-- The Trackio team for experiment tracking
-- The open-source community for contributions
-
-## 📞 Support
-
-For issues and questions:
-
-1. Check the troubleshooting section
-2. Review the logs in `training.log`
-3. Check the Trackio Space for monitoring data
-4. Open an issue on GitHub
-
----
-
-**Happy Fine-tuning! 🚀**
\ No newline at end of file
diff --git a/docs/SFT_TRAINER_CONFIG_USAGE.md b/docs/SFT_TRAINER_CONFIG_USAGE.md
deleted file mode 100644
index 531293a89042431b0c987bacb5ad55d8bc576189..0000000000000000000000000000000000000000
--- a/docs/SFT_TRAINER_CONFIG_USAGE.md
+++ /dev/null
@@ -1,233 +0,0 @@
-# SFT Trainer Configuration Usage Guide
-
-## Overview
-
-This guide describes how the SFT (Supervised Fine-tuning) trainer uses the premade configuration files and how the `trainer_type` field is passed through the system.
-
-## How SFT Trainer Uses Premade Configs
-
-### 1. Configuration Loading Process
-
-The SFT trainer uses premade configs through the following process:
-
-1. **Config File Selection**: Users specify a config file via command line or launch script
-2. **Config Loading**: The system loads the config using `get_config()` function
-3. **Config Inheritance**: All configs inherit from `SmolLM3Config` base class
-4. **Trainer Type Detection**: The system checks for `trainer_type` field in the config
-5. **Training Arguments Creation**: Config parameters are used to create `TrainingArguments`
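-
-One plausible shape for the `get_config()` helper (a sketch; the real implementation may differ):
-
-```python
-import importlib.util
-
-def get_config(config_path: str):
-    """Load a training config module from a file path and return its config object."""
-    spec = importlib.util.spec_from_file_location("train_config", config_path)
-    module = importlib.util.module_from_spec(spec)
-    spec.loader.exec_module(module)
-    return module.config  # assumes each config file exposes a `config` instance
-```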
-
-### 2. Configuration Parameters Used by SFT Trainer
-
-The SFT trainer uses the following config parameters:
-
-#### Model Configuration
-- `model_name`: Model to load (e.g., "HuggingFaceTB/SmolLM3-3B")
-- `max_seq_length`: Maximum sequence length for tokenization
-- `use_flash_attention`: Whether to use flash attention
-- `use_gradient_checkpointing`: Whether to use gradient checkpointing
-
-#### Training Configuration
-- `batch_size`: Per-device batch size
-- `gradient_accumulation_steps`: Gradient accumulation steps
-- `learning_rate`: Learning rate for optimization
-- `weight_decay`: Weight decay for optimizer
-- `warmup_steps`: Number of warmup steps
-- `max_iters`: Maximum training iterations
-- `save_steps`: Save checkpoint every N steps
-- `eval_steps`: Evaluate every N steps
-- `logging_steps`: Log every N steps
-
-#### Optimizer Configuration
-- `optimizer`: Optimizer type (e.g., "adamw_torch")
-- `beta1`, `beta2`, `eps`: Optimizer parameters
-
-#### Scheduler Configuration
-- `scheduler`: Learning rate scheduler type
-- `min_lr`: Minimum learning rate
-
-#### Mixed Precision
-- `fp16`: Whether to use fp16 precision
-- `bf16`: Whether to use bf16 precision
-
-#### Data Configuration
-- `dataset_name`: Hugging Face dataset name
-- `data_dir`: Local dataset directory
-- `train_file`: Training file name
-- `validation_file`: Validation file name
-
-#### Monitoring Configuration
-- `enable_tracking`: Whether to enable Trackio tracking
-- `trackio_url`: Trackio server URL
-- `experiment_name`: Experiment name for tracking
-
-### 3. Training Arguments Creation
-
-The SFT trainer creates `TrainingArguments` from config parameters:
-
-```python
-def get_training_arguments(self, output_dir: str, **kwargs) -> TrainingArguments:
- training_args = {
- "output_dir": output_dir,
- "per_device_train_batch_size": self.config.batch_size,
- "per_device_eval_batch_size": self.config.batch_size,
- "gradient_accumulation_steps": self.config.gradient_accumulation_steps,
- "learning_rate": self.config.learning_rate,
- "weight_decay": self.config.weight_decay,
- "warmup_steps": self.config.warmup_steps,
- "max_steps": self.config.max_iters,
- "save_steps": self.config.save_steps,
- "eval_steps": self.config.eval_steps,
- "logging_steps": self.config.logging_steps,
- "fp16": self.config.fp16,
- "bf16": self.config.bf16,
- # ... additional parameters
- }
- return TrainingArguments(**training_args)
-```
-
-### 4. Trainer Selection Logic
-
-The system determines which trainer to use based on the `trainer_type` field:
-
-```python
-# Determine trainer type (command line overrides config)
-trainer_type = args.trainer_type or getattr(config, 'trainer_type', 'sft')
-
-# Initialize trainer based on type
-if trainer_type.lower() == 'dpo':
- trainer = SmolLM3DPOTrainer(...)
-else:
- trainer = SmolLM3Trainer(...) # SFT trainer
-```
-
-## Configuration Files Structure
-
-### Base Config (`config/train_smollm3.py`)
-
-```python
-@dataclass
-class SmolLM3Config:
- # Trainer type selection
- trainer_type: str = "sft" # "sft" or "dpo"
-
- # Model configuration
- model_name: str = "HuggingFaceTB/SmolLM3-3B"
- max_seq_length: int = 4096
- # ... other fields
-```
-
-### DPO Config (`config/train_smollm3_dpo.py`)
-
-```python
-@dataclass
-class SmolLM3DPOConfig(SmolLM3Config):
- # Trainer type selection
- trainer_type: str = "dpo" # Override default to use DPO trainer
-
- # DPO-specific configuration
- beta: float = 0.1
- # ... DPO-specific fields
-```
-
-### Specialized Configs (e.g., `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`)
-
-```python
-@dataclass
-class SmolLM3ConfigOpenHermesFRMultiplePasses(SmolLM3Config):
- # Inherits trainer_type = "sft" from base config
-
- # Specialized configuration for multiple passes
- batch_size: int = 6
- gradient_accumulation_steps: int = 20
- learning_rate: float = 3e-6
- max_iters: int = 25000
- # ... other specialized fields
-```
-
-## Trainer Type Priority
-
-The trainer type is determined in the following order of priority:
-
-1. **Command line argument** (`--trainer_type`) - Highest priority
-2. **Config file** (`trainer_type` field) - Medium priority
-3. **Default value** (`"sft"`) - Lowest priority
-
-## Usage Examples
-
-### Using SFT Trainer with Different Configs
-
-```bash
-# Basic SFT training (uses base config)
-python src/train.py config/train_smollm3.py
-
-# SFT training with specialized config
-python src/train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
-
-# SFT training with override
-python src/train.py config/train_smollm3.py --trainer_type sft
-
-# DPO training (uses DPO config)
-python src/train.py config/train_smollm3_dpo.py
-
-# Override config's trainer type
-python src/train.py config/train_smollm3.py --trainer_type dpo
-```
-
-### Launch Script Usage
-
-```bash
-./launch.sh
-# Select "SFT" when prompted for trainer type
-# The system will use the appropriate config based on selection
-```
-
-## Configuration Inheritance
-
-All specialized configs inherit from `SmolLM3Config` and automatically get:
-
-- `trainer_type = "sft"` (default)
-- All base training parameters
-- All monitoring configuration
-- All data configuration
-
-Specialized configs can override any of these parameters for their specific use case.
-
-## SFT Trainer Features
-
-The SFT trainer provides:
-
-1. **SFTTrainer Backend**: Uses Hugging Face's `SFTTrainer` for instruction tuning
-2. **Fallback Support**: Falls back to standard `Trainer` if `SFTTrainer` fails
-3. **Config Integration**: Uses all config parameters for training setup
-4. **Monitoring**: Integrates with Trackio for experiment tracking
-5. **Checkpointing**: Supports model checkpointing and resuming
-6. **Mixed Precision**: Supports fp16 and bf16 training
-
-## Troubleshooting
-
-### Common Issues
-
-1. **Missing trainer_type field**: Ensure all configs have the `trainer_type` field
-2. **Config inheritance issues**: Check that specialized configs properly inherit from base
-3. **Parameter conflicts**: Ensure command line arguments don't conflict with config values
-
-### Debugging
-
-Enable verbose logging to see config usage:
-
-```bash
-python src/train.py config/train_smollm3.py --trainer_type sft
-```
-
-Look for these log messages:
-```
-Using trainer type: sft
-Initializing SFT trainer...
-Creating SFTTrainer with training arguments...
-```
-
-## Related Documentation
-
-- [Trainer Selection Guide](TRAINER_SELECTION_GUIDE.md)
-- [Training Configuration Guide](TRAINING_CONFIGURATION_GUIDE.md)
-- [Monitoring Integration Guide](MONITORING_INTEGRATION_GUIDE.md)
\ No newline at end of file
diff --git a/docs/TOKEN_FIX_SUMMARY.md b/docs/TOKEN_FIX_SUMMARY.md
deleted file mode 100644
index 6b29cc089a8252fed93c5cab71329f850eddb83c..0000000000000000000000000000000000000000
--- a/docs/TOKEN_FIX_SUMMARY.md
+++ /dev/null
@@ -1,249 +0,0 @@
-# Token Fix Summary
-
-## Issue Identified
-
-The user encountered an error when running the launch script:
-
-```
-usage: hf []
-hf: error: argument {auth,cache,download,jobs,repo,repo-files,upload,upload-large-folder,env,version,lfs-enable-largefiles,lfs-multipart-upload}: invalid choice: 'login' (choose from 'auth', 'cache', 'download', 'jobs', 'repo', 'repo-files', 'upload', 'upload-large-folder', 'env', 'version', 'lfs-enable-largefiles', 'lfs-multipart-upload')
-❌ Failed to login to Hugging Face
-```
-
-## Root Cause
-
-The `launch.sh` script was using the `hf login` command, which doesn't exist in the current version of the Hugging Face CLI: the script relied on CLI commands instead of the Python API for authentication.
-
-## Fixes Applied
-
-### 1. **Removed HF Login Step** ✅ **FIXED**
-
-**File**: `launch.sh`
-
-**Before**:
-```bash
-# Login to Hugging Face with token
-print_info "Logging in to Hugging Face..."
-if hf login --token "$HF_TOKEN" --add-to-git-credential; then
- print_status "Successfully logged in to Hugging Face"
- print_info "Username: $(hf whoami)"
-else
- print_error "Failed to login to Hugging Face"
- print_error "Please check your token and try again"
- exit 1
-fi
-```
-
-**After**:
-```bash
-# Set HF token for Python API usage
-print_info "Setting up Hugging Face token for Python API..."
-export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
-print_status "HF token configured for Python API usage"
-print_info "Username: $HF_USERNAME (auto-detected from token)"
-```
-
-### 2. **Updated Dataset Setup Script** ✅ **FIXED**
-
-**File**: `scripts/dataset_tonic/setup_hf_dataset.py`
-
-**Changes**:
-- Updated `main()` function to properly get token from environment
-- Added token validation before proceeding
-- Improved error handling for missing tokens
-
-**Before**:
-```python
-def main():
- """Main function to set up the dataset."""
-
- # Get dataset name from command line or use default
- dataset_name = None
- if len(sys.argv) > 2:
- dataset_name = sys.argv[2]
-
- success = setup_trackio_dataset(dataset_name)
- sys.exit(0 if success else 1)
-```
-
-**After**:
-```python
-def main():
- """Main function to set up the dataset."""
-
- # Get token from environment first
- token = os.environ.get('HUGGING_FACE_HUB_TOKEN') or os.environ.get('HF_TOKEN')
-
- # If no token in environment, try command line argument
- if not token and len(sys.argv) > 1:
- token = sys.argv[1]
-
- if not token:
- print("❌ No HF token found. Please set HUGGING_FACE_HUB_TOKEN environment variable or provide as argument.")
- sys.exit(1)
-
- # Make the token visible to helpers that read it from the environment
- os.environ['HUGGING_FACE_HUB_TOKEN'] = token
-
- # Get dataset name from command line or use default
- dataset_name = None
- if len(sys.argv) > 2:
- dataset_name = sys.argv[2]
-
- success = setup_trackio_dataset(dataset_name)
- sys.exit(0 if success else 1)
-```
-
-### 3. **Updated Launch Script to Pass Token** ✅ **FIXED**
-
-**File**: `launch.sh`
-
-**Changes**:
-- Updated dataset setup call to pass token as argument
-- Updated Trackio Space deployment call to pass token as argument
-
-**Before**:
-```bash
-python setup_hf_dataset.py
-```
-
-**After**:
-```bash
-python setup_hf_dataset.py "$HF_TOKEN"
-```
-
-**Before**:
-```bash
-python deploy_trackio_space.py << EOF
-$TRACKIO_SPACE_NAME
-$HF_TOKEN
-$GIT_EMAIL
-
-EOF
-```
-
-**After**:
-```bash
-python deploy_trackio_space.py "$TRACKIO_SPACE_NAME" "$HF_TOKEN" "$GIT_EMAIL"
-```
-
-### 4. **Updated Space Deployment Script** ✅ **FIXED**
-
-**File**: `scripts/trackio_tonic/deploy_trackio_space.py`
-
-**Changes**:
-- Updated `main()` function to handle command line arguments
-- Added support for both interactive and command-line modes
-- Improved token handling and validation
-
-**Before**:
-```python
-def main():
- """Main deployment function"""
- print("Trackio Space Deployment Script")
- print("=" * 40)
-
- # Get user input (no username needed - will be extracted from token)
- space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
- token = input("Enter your Hugging Face token: ").strip()
-```
-
-**After**:
-```python
-def main():
- """Main deployment function"""
- print("Trackio Space Deployment Script")
- print("=" * 40)
-
- # Check if arguments are provided
- if len(sys.argv) >= 3:
- # Use command line arguments
- space_name = sys.argv[1]
- token = sys.argv[2]
- git_email = sys.argv[3] if len(sys.argv) > 3 else None
- git_name = sys.argv[4] if len(sys.argv) > 4 else None
-
- print(f"Using provided arguments:")
- print(f" Space name: {space_name}")
- print(f" Token: {'*' * 10}...{token[-4:]}")
- print(f" Git email: {git_email or 'default'}")
- print(f" Git name: {git_name or 'default'}")
- else:
- # Get user input (no username needed - will be extracted from token)
- space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
- token = input("Enter your Hugging Face token: ").strip()
-```
-
-## Key Improvements
-
-### 1. **Complete Python API Usage**
-- ✅ **No CLI commands**: All authentication uses Python API
-- ✅ **Direct token passing**: Token passed directly to functions
-- ✅ **Environment variables**: Proper environment variable setup
-- ✅ **No username required**: Automatic extraction from token
-
-### 2. **Robust Error Handling**
-- ✅ **Token validation**: Proper token validation before use
-- ✅ **Environment fallbacks**: Multiple ways to get token
-- ✅ **Clear error messages**: Descriptive error messages
-- ✅ **Graceful degradation**: Fallback mechanisms
-
-### 3. **Automated Token Handling**
-- ✅ **Automatic extraction**: Username extracted from token
-- ✅ **Environment setup**: Token set in environment variables
-- ✅ **Command line support**: Token passed as arguments
-- ✅ **No manual input**: No username required
-
-## Test Results
-
-### **Token Validation Test**
-```bash
-$ python tests/test_token_fix.py
-
-🚀 Token Validation and Deployment Tests
-==================================================
-🔍 Testing Token Validation
-✅ Token validation module imported successfully
-✅ Token validation successful!
-✅ Username: Tonic
-
-🔍 Testing Dataset Setup
-✅ Dataset setup module imported successfully
-✅ Username extraction successful: Tonic
-
-🔍 Testing Space Deployment
-✅ Space deployment module imported successfully
-✅ Space deployer initialization successful
-✅ Username: Tonic
-
-==================================================
-🎉 ALL TOKEN TESTS PASSED!
-✅ Token validation: Working
-✅ Dataset setup: Working
-✅ Space deployment: Working
-
-The token is working correctly with all components!
-```
-
-## User Token
-
-**Token**: `xxxx`
-
-**Status**: ✅ **Working correctly**
-
-**Username**: `Tonic` (auto-detected)
-
-## Next Steps
-
-The user can now run the launch script without encountering the HF login error:
-
-```bash
-./launch.sh
-```
-
-The script will:
-1. ✅ **Validate token** using Python API
-2. ✅ **Extract username** automatically from token
-3. ✅ **Set environment variables** for Python API usage
-4. ✅ **Deploy Trackio Space** using Python API
-5. ✅ **Setup HF Dataset** using Python API
-6. ✅ **Configure all components** automatically
-
-**No manual username input required!** 🎉
\ No newline at end of file
diff --git a/docs/TOKEN_VALIDATION_FIX.md b/docs/TOKEN_VALIDATION_FIX.md
deleted file mode 100644
index d6d3430012fad62f6b026d598511bdf90e281863..0000000000000000000000000000000000000000
--- a/docs/TOKEN_VALIDATION_FIX.md
+++ /dev/null
@@ -1,183 +0,0 @@
-# Hugging Face Token Validation Fix
-
-## Problem Description
-
-The original launch script was using the `hf` CLI command to validate Hugging Face tokens, which was causing authentication failures even with valid tokens. This was due to:
-
-1. CLI installation issues
-2. Inconsistent token format handling
-3. Poor error reporting
-
-## Solution Implementation
-
-### New Python-Based Validation System
-
-We've implemented a robust Python-based token validation system using the official `huggingface_hub` API:
-
-#### Key Components
-
-1. **`scripts/validate_hf_token.py`** - Main validation script
-2. **Updated `launch.sh`** - Modified to use Python validation
-3. **`tests/test_token_validation.py`** - Test suite for validation
-4. **`scripts/check_dependencies.py`** - Dependency verification
-
-### Features
-
-- ✅ **Robust Error Handling**: Detailed error messages for different failure types
-- ✅ **JSON Output**: Structured responses for easy parsing
-- ✅ **Multiple Input Methods**: Command line arguments or environment variables
-- ✅ **Username Extraction**: Automatically retrieves username from valid tokens
-- ✅ **Dependency Checking**: Verifies required packages are installed
-
-## Usage
-
-### Direct Script Usage
-
-```bash
-# Using command line argument
-python scripts/validate_hf_token.py hf_your_token_here
-
-# Using environment variable
-export HF_TOKEN=hf_your_token_here
-python scripts/validate_hf_token.py
-```
-
-### Expected Output
-
-**Success:**
-```json
-{"success": true, "username": "YourUsername", "error": null}
-```
-
-**Failure:**
-```json
-{"success": false, "username": null, "error": "Invalid token - unauthorized access"}
-```
-
-### Integration with Launch Script
-
-The `launch.sh` script now automatically:
-
-1. Prompts for your HF token
-2. Validates it using the Python script
-3. Extracts your username automatically
-4. Provides detailed error messages if validation fails
-
-## Error Types and Solutions
-
-### Common Error Messages
-
-| Error Message | Cause | Solution |
-|---------------|-------|----------|
-| "Invalid token - unauthorized access" | Token is invalid or expired | Generate new token at https://huggingface.co/settings/tokens |
-| "Token lacks required permissions" | Token doesn't have write access | Ensure token has write permissions |
-| "Network error" | Connection issues | Check internet connection |
-| "Failed to run token validation script" | Missing dependencies | Run `pip install huggingface_hub` |
-
-### Dependency Installation
-
-```bash
-# Install required dependencies
-pip install huggingface_hub
-
-# Check all dependencies
-python scripts/check_dependencies.py
-
-# Install all requirements
-pip install -r requirements/requirements.txt
-```
-
-## Testing
-
-### Run the Test Suite
-
-```bash
-python tests/test_token_validation.py
-```
-
-### Manual Testing
-
-```bash
-# Test with your token
-python scripts/validate_hf_token.py hf_your_token_here
-
-# Test dependency check
-python scripts/check_dependencies.py
-```
-
-## Troubleshooting
-
-### If Token Validation Still Fails
-
-1. **Check Token Format**: Ensure token starts with `hf_`
-2. **Verify Token Permissions**: Token needs read/write access
-3. **Check Network**: Ensure internet connection is stable
-4. **Update Dependencies**: Run `pip install --upgrade huggingface_hub`
-
-### If Launch Script Fails
-
-1. **Check Python Path**: Ensure `python3` is available
-2. **Verify Script Permissions**: Script should be executable
-3. **Check JSON Parsing**: Ensure Python can parse JSON output
-4. **Review Error Messages**: Check the specific error in launch.sh output
-
-## Technical Details
-
-### Token Validation Process
-
-1. **Environment Setup**: Sets `HUGGING_FACE_HUB_TOKEN` environment variable
-2. **API Client Creation**: Initializes `HfApi()` client
-3. **User Info Retrieval**: Calls `api.whoami()` to validate token
-4. **Username Extraction**: Extracts username from user info
-5. **Error Handling**: Catches and categorizes different error types (see the sketch below)
-
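-A minimal sketch of this process, assuming `huggingface_hub` is installed (illustrative only, not the exact contents of `scripts/validate_hf_token.py`):
-
-```python
-import json
-import os
-import sys
-
-from huggingface_hub import HfApi
-
-def validate_token(token: str) -> dict:
-    os.environ["HUGGING_FACE_HUB_TOKEN"] = token   # 1. environment setup
-    api = HfApi(token=token)                       # 2. API client creation
-    try:
-        user_info = api.whoami()                   # 3. user info retrieval
-        username = user_info.get("name")           # 4. username extraction
-        return {"success": True, "username": username, "error": None}
-    except Exception as exc:                       # 5. error handling
-        return {"success": False, "username": None, "error": str(exc)}
-
-if __name__ == "__main__":
-    token = sys.argv[1] if len(sys.argv) > 1 else os.environ.get("HF_TOKEN", "")
-    print(json.dumps(validate_token(token)))
-```
-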
-### JSON Parsing in Shell
-
-Inside a shell function in `launch.sh`, Python's JSON parser is used to safely extract values (hence the `local` declarations):
-
-```bash
-local success=$(echo "$result" | python3 -c "
-import sys, json
-try:
- data = json.load(sys.stdin)
- print(data.get('success', False))
-except:
- print('False')
-")
-```
-
-## Migration from Old System
-
-### Before (CLI-based)
-```bash
-if hf whoami >/dev/null 2>&1; then
- HF_USERNAME=$(hf whoami | head -n1 | tr -d '\n')
-```
-
-### After (Python-based)
-```bash
-if result=$(python3 scripts/validate_hf_token.py "$token" 2>/dev/null); then
- # Parse JSON result with error handling
- local success=$(echo "$result" | python3 -c "...")
- local username=$(echo "$result" | python3 -c "...")
-```
-
-## Benefits
-
-1. **Reliability**: Uses official Python API instead of CLI
-2. **Error Reporting**: Detailed error messages for debugging
-3. **Cross-Platform**: Works on Windows, Linux, and macOS
-4. **Maintainability**: Easy to update and extend
-5. **Testing**: Comprehensive test suite included
-
-## Future Enhancements
-
-- [ ] Add token expiration checking
-- [ ] Implement token refresh functionality
-- [ ] Add support for organization tokens
-- [ ] Create GUI for token management
-- [ ] Add token security validation
-
----
-
-**Note**: This fix ensures that valid Hugging Face tokens are properly recognized and that users get clear feedback when there are authentication issues.
\ No newline at end of file
diff --git a/docs/TRACKIO_API_FIX_SUMMARY.md b/docs/TRACKIO_API_FIX_SUMMARY.md
deleted file mode 100644
index 074fd9c262db7e6031c220d3fbfea97d386d9a72..0000000000000000000000000000000000000000
--- a/docs/TRACKIO_API_FIX_SUMMARY.md
+++ /dev/null
@@ -1,276 +0,0 @@
-# Trackio API Fix Summary
-
-## Overview
-
-This document summarizes the fixes applied to resolve the 404 errors in the Trackio integration and implement automatic Space URL resolution.
-
-## Issues Identified
-
-### 1. **404 Errors in Trackio API Calls**
-- **Problem**: The original API client was using incorrect endpoints and HTTP request patterns
-- **Error**: `POST request failed: 404 - Cannot POST /spaces/Tonic/trackio-monitoring-20250727/gradio_api/call/list_experiments_interface`
-- **Root Cause**: Using raw HTTP requests instead of the proper Gradio client API
-
-### 2. **Hardcoded Space URL**
-- **Problem**: The Space URL was hardcoded, making it inflexible
-- **Issue**: No automatic resolution of Space URLs from Space IDs
-- **Impact**: Required manual URL updates when Space deployment changes
-
-## Solutions Implemented
-
-### 1. **Updated API Client to Use Gradio Client**
-
-**File**: `scripts/trackio_tonic/trackio_api_client.py`
-
-**Changes**:
-- Replaced custom HTTP requests with `gradio_client.Client`
-- Uses proper two-step process (POST to get event_id, then GET to get results)
-- Handles all Gradio API endpoints correctly
-
-**Before**:
-```python
-# Custom HTTP requests with manual event_id handling
-response = requests.post(url, json=payload)
-event_id = response.json()["event_id"]
-result = requests.get(f"{url}/{event_id}")
-```
-
-**After**:
-```python
-# Using gradio_client for proper API communication
-result = self.client.predict(*args, api_name=api_name)
-```
-
-### 2. **Automatic Space URL Resolution**
-
-**Implementation**:
-- Uses Hugging Face Hub API to resolve Space URLs from Space IDs
-- Falls back to default URL format if API is unavailable
-- Supports both authenticated and anonymous access
-
-**Key Features**:
-```python
-def _resolve_space_url(self) -> Optional[str]:
-    """Resolve Space URL using Hugging Face Hub API"""
-    try:
-        api = HfApi(token=self.hf_token)
-        space_info = api.space_info(self.space_id)
-        host = getattr(space_info, 'host', None)
-        if host:
-            return host
-    except Exception:
-        pass  # API unavailable: fall through to the default URL format
-    # Fallback: hf.space subdomains are lowercase
-    space_name = self.space_id.replace('/', '-').lower()
-    return f"https://{space_name}.hf.space"
-```
-
-### 3. **Updated Client Interface**
-
-**Before**:
-```python
-client = TrackioAPIClient("https://tonic-trackio-monitoring-20250727.hf.space")
-```
-
-**After**:
-```python
-client = TrackioAPIClient("Tonic/trackio-monitoring-20250727", hf_token)
-```
-
-### 4. **Enhanced Monitoring Integration**
-
-**File**: `src/monitoring.py`
-
-**Changes**:
-- Updated to use Space ID instead of hardcoded URL
-- Automatic experiment creation with proper ID extraction
-- Better error handling and fallback mechanisms
-
-## Dependencies Added
-
-### Required Packages
-```bash
-pip install gradio_client huggingface_hub
-```
-
-### Package Versions
-- `gradio_client>=1.10.4` - For proper Gradio API communication
-- `huggingface_hub>=0.19.3` - For Space URL resolution
-
-## API Endpoints Supported
-
-The updated client supports all documented Gradio endpoints (a sample call follows the list):
-
-1. **Experiment Management**:
- - `/create_experiment_interface` - Create new experiments
- - `/list_experiments_interface` - List all experiments
- - `/get_experiment_details` - Get experiment details
- - `/update_experiment_status_interface` - Update experiment status
-
-2. **Metrics and Parameters**:
- - `/log_metrics_interface` - Log training metrics
- - `/log_parameters_interface` - Log experiment parameters
-
-3. **Visualization**:
- - `/create_metrics_plot` - Create metrics plots
- - `/create_experiment_comparison` - Compare experiments
-
-4. **Testing and Demo**:
- - `/simulate_training_data` - Simulate training data
- - `/create_demo_experiment` - Create demo experiments
-
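-As a hedged example of how an endpoint name maps to a `gradio_client` call (the argument list depends on the Space's Gradio interface definition, so treat this call as illustrative and check the Space's API docs before relying on it):
-
-```python
-from gradio_client import Client
-
-# The Space ID is resolved to a URL as described above
-client = Client("Tonic/trackio-monitoring-20250727")
-
-# Endpoint names double as api_name values
-experiments = client.predict(api_name="/list_experiments_interface")
-print(experiments)
-```
-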
-## Configuration
-
-### Environment Variables
-```bash
-# Required for Space URL resolution
-export HF_TOKEN="your_huggingface_token"
-
-# Optional: Custom Space ID
-export TRACKIO_SPACE_ID="your-username/your-space-name"
-
-# Optional: Dataset repository
-export TRACKIO_DATASET_REPO="your-username/your-dataset"
-```
-
-### Default Configuration
-- **Default Space ID**: `Tonic/trackio-monitoring-20250727`
-- **Default Dataset**: `tonic/trackio-experiments`
-- **Auto-resolution**: Enabled by default
-
-## Testing
-
-### Test Script
-**File**: `tests/test_trackio_api_fix.py`
-
-**Tests Included**:
-1. **Space URL Resolution** - Tests automatic URL resolution
-2. **API Client** - Tests all API endpoints
-3. **Monitoring Integration** - Tests full monitoring workflow
-
-### Running Tests
-```bash
-python tests/test_trackio_api_fix.py
-```
-
-**Expected Output**:
-```
-🚀 Starting Trackio API Client Tests with Automatic URL Resolution
-======================================================================
-✅ Space URL Resolution: PASSED
-✅ API Client Test: PASSED
-✅ Monitoring Integration: PASSED
-
-🎉 All tests passed! The Trackio integration with automatic URL resolution is working correctly.
-```
-
-## Benefits
-
-### 1. **Reliability**
-- ✅ No more 404 errors
-- ✅ Proper error handling and fallbacks
-- ✅ Automatic retry mechanisms
-
-### 2. **Flexibility**
-- ✅ Automatic Space URL resolution
-- ✅ Support for any Trackio Space
-- ✅ Configurable via environment variables
-
-### 3. **Maintainability**
-- ✅ Clean separation of concerns
-- ✅ Proper logging and debugging
-- ✅ Comprehensive test coverage
-
-### 4. **User Experience**
-- ✅ Seamless integration with training pipeline
-- ✅ Real-time experiment monitoring
-- ✅ Automatic experiment creation and management
-
-## Usage Examples
-
-### Basic Usage
-```python
-from scripts.trackio_tonic.trackio_api_client import TrackioAPIClient
-
-# Initialize with Space ID (URL resolved automatically)
-client = TrackioAPIClient("Tonic/trackio-monitoring-20250727")
-
-# Create experiment
-result = client.create_experiment("my_experiment", "Test experiment")
-
-# Log metrics
-metrics = {"loss": 1.234, "accuracy": 0.85}
-client.log_metrics("exp_123", metrics, step=100)
-```
-
-### With Monitoring Integration
-```python
-from src.monitoring import SmolLM3Monitor
-
-# Create monitor (automatically creates experiment)
-monitor = SmolLM3Monitor(
- experiment_name="my_training_run",
- enable_tracking=True
-)
-
-# Log metrics during training
-monitor.log_metrics({"loss": 1.234}, step=100)
-
-# Log configuration
-monitor.log_config({"learning_rate": 2e-5, "batch_size": 8})
-```
-
-## Troubleshooting
-
-### Common Issues
-
-1. **"gradio_client not available"**
- ```bash
- pip install gradio_client
- ```
-
-2. **"huggingface_hub not available"**
- ```bash
- pip install huggingface_hub
- ```
-
-3. **"Space not accessible"**
- - Check if the Space is running
- - Verify Space ID is correct
- - Ensure HF token has proper permissions
-
-4. **"Experiment not found"**
- - Experiments are created automatically by the monitor
- - Use the experiment ID returned by `create_experiment()`
-
-### Debug Mode
-Enable debug logging to see detailed API calls:
-```python
-import logging
-logging.basicConfig(level=logging.DEBUG)
-```
-
-## Future Enhancements
-
-### Planned Features
-1. **Multi-Space Support** - Support for multiple Trackio Spaces
-2. **Advanced Metrics** - Support for custom metric types
-3. **Artifact Upload** - Direct file upload to Spaces
-4. **Real-time Dashboard** - Live monitoring dashboard
-5. **Export Capabilities** - Export experiments to various formats
-
-### Extensibility
-The new architecture is designed to be easily extensible:
-- Modular API client design
-- Plugin-based monitoring system
-- Configurable Space resolution
-- Support for custom endpoints
-
-## Conclusion
-
-The Trackio API integration has been successfully fixed and enhanced with:
-
-- ✅ **Resolved 404 errors** through proper Gradio client usage
-- ✅ **Automatic URL resolution** using Hugging Face Hub API
-- ✅ **Comprehensive testing** with full test coverage
-- ✅ **Enhanced monitoring** with seamless integration
-- ✅ **Future-proof architecture** for easy extensions
-
-The system is now production-ready and provides reliable experiment tracking for SmolLM3 fine-tuning workflows.
\ No newline at end of file
diff --git a/docs/TRACKIO_DEPLOYMENT_FIXES.md b/docs/TRACKIO_DEPLOYMENT_FIXES.md
deleted file mode 100644
index ddb864794d7dace3f887e433cd0cc1a5f06c7d5c..0000000000000000000000000000000000000000
--- a/docs/TRACKIO_DEPLOYMENT_FIXES.md
+++ /dev/null
@@ -1,266 +0,0 @@
-# Trackio Deployment Fixes
-
-This document outlines the fixes made to resolve the Trackio Space deployment and dataset creation issues.
-
-## Issues Identified
-
-### 1. Git Authentication Issues in Space Deployment
-- **Problem**: The `deploy_trackio_space.py` script was using git commands for file upload, which failed with authentication errors
-- **Solution**: Replaced git commands with direct HF Hub API calls using `upload_file()`
-
-### 2. Dataset Repository Creation Issues
-- **Problem**: The `setup_hf_dataset.py` script was trying to push to a dataset repository that didn't exist, causing 404 errors
-- **Solution**: Added proper repository creation using `create_repo()` before pushing the dataset
-
-### 3. Missing Environment Variable Setup
-- **Problem**: The Space deployment didn't set up the required `HF_TOKEN` environment variable
-- **Solution**: Added automatic secret setting using `add_space_secret()` API method
-
-### 4. Manual Username Input Required
-- **Problem**: Users had to manually enter their username
-- **Solution**: Automatically extract username from token using `whoami()` API method
-
-### 5. Dataset Access Testing Issues
-- **Problem**: The configuration script failed when testing dataset access for non-existent datasets
-- **Solution**: Added proper error handling and repository existence checks
-
-## Fixed Scripts
-
-### 1. `scripts/trackio_tonic/deploy_trackio_space.py`
-
-#### Key Changes:
-- **Replaced git upload with HF Hub API**: Now uses `upload_file()` directly instead of git commands (see the sketch below)
-- **Automatic secret setting**: Uses `add_space_secret()` API to set HF_TOKEN automatically
-- **Username extraction from token**: Uses `whoami()` to get username automatically
-- **Removed manual username input**: No longer asks for username
-- **Improved error handling**: Better error messages and fallback options
-
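-A minimal sketch of the underlying API calls (the Space name is a placeholder; the real script derives names from your token and input):
-
-```python
-import os
-
-from huggingface_hub import HfApi
-
-api = HfApi(token=os.environ["HF_TOKEN"])
-space_id = "your-username/trackio-monitoring"  # placeholder
-
-# Create the Space, upload a file, and set a secret -- no git required
-api.create_repo(space_id, repo_type="space", space_sdk="gradio", exist_ok=True)
-api.upload_file(
-    path_or_fileobj="templates/spaces/app.py",
-    path_in_repo="app.py",
-    repo_id=space_id,
-    repo_type="space",
-)
-api.add_space_secret(space_id, "HF_TOKEN", os.environ["HF_TOKEN"])
-```
-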
-#### Usage:
-```bash
-python scripts/trackio_tonic/deploy_trackio_space.py
-```
-
-#### What it does:
-1. Extracts username from HF token automatically
-2. Creates a new HF Space using the API
-3. Prepares Space files from templates
-4. Uploads files using HF Hub API (no git required)
-5. **Automatically sets secrets via API** (HF_TOKEN and TRACKIO_DATASET_REPO)
-6. Tests the Space accessibility
-
-### 2. `scripts/dataset_tonic/setup_hf_dataset.py`
-
-#### Key Changes:
-- **Added repository creation**: Creates the dataset repository before pushing data (see the sketch below)
-- **Username extraction from token**: Uses `whoami()` to get username automatically
-- **Automatic dataset naming**: Uses username in dataset repository name
-- **Improved error handling**: Better error messages for common issues
-- **Public datasets by default**: Makes datasets public for easier access
-
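-A minimal sketch of the create-then-push order the script now follows (the repository name and sample record are placeholders):
-
-```python
-import os
-
-from datasets import Dataset
-from huggingface_hub import create_repo
-
-repo_id = "your-username/trackio-experiments"  # placeholder
-token = os.environ["HF_TOKEN"]
-
-# Create the dataset repository first so the push cannot 404
-create_repo(repo_id, repo_type="dataset", token=token, exist_ok=True)
-
-# Then push sample experiment data
-dataset = Dataset.from_list([{"experiment_id": "exp_001", "status": "running"}])
-dataset.push_to_hub(repo_id, token=token)
-```
-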
-#### Usage:
-```bash
-python scripts/dataset_tonic/setup_hf_dataset.py
-```
-
-#### What it does:
-1. Extracts username from HF token automatically
-2. Creates the dataset repository if it doesn't exist
-3. Creates a dataset with sample experiment data
-4. Uploads README template
-5. Makes the dataset public for easier access
-
-### 3. `scripts/trackio_tonic/configure_trackio.py`
-
-#### Key Changes:
-- **Added repository existence check**: Checks if the dataset repository exists before trying to load it (see the sketch below)
-- **Username extraction from token**: Uses `whoami()` to get username automatically
-- **Automatic dataset naming**: Uses username in default dataset repository
-- **Better error handling**: Distinguishes between missing repository and permission issues
-- **Improved user guidance**: Clear instructions for next steps
-
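-A minimal sketch of the existence check (the repository name is a placeholder):
-
-```python
-import os
-
-from huggingface_hub import HfApi
-from huggingface_hub.utils import RepositoryNotFoundError
-
-api = HfApi(token=os.environ.get("HF_TOKEN"))
-repo_id = "your-username/trackio-experiments"  # placeholder
-
-try:
-    api.repo_info(repo_id, repo_type="dataset")
-    print(f"Dataset repository {repo_id} exists")
-except RepositoryNotFoundError:
-    print(f"Dataset repository {repo_id} not found - run setup_hf_dataset.py first")
-```
-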
-#### Usage:
-```bash
-python scripts/trackio_tonic/configure_trackio.py
-```
-
-#### What it does:
-1. Extracts username from HF token automatically
-2. Validates current configuration
-3. Tests dataset access with proper error handling
-4. Generates configuration file with username
-5. Provides usage examples with actual username
-
-## Model Push Script (`scripts/model_tonic/push_to_huggingface.py`)
-
-The model push script was already using the HF Hub API correctly, so no changes were needed. It properly:
-- Creates repositories using `create_repo()`
-- Uploads files using `upload_file()`
-- Handles authentication correctly
-
-## Environment Variables Required
-
-### For HF Spaces:
-```bash
-HF_TOKEN=your_hf_token_here
-TRACKIO_DATASET_REPO=your-username/your-dataset-name
-```
-
-### For Local Development:
-```bash
-export HF_TOKEN=your_hf_token_here
-export TRACKIO_DATASET_REPO=your-username/your-dataset-name
-```
-
-## Deployment Workflow
-
-### 1. Create Dataset
-```bash
-# Set environment variables
-export HF_TOKEN=your_token_here
-# TRACKIO_DATASET_REPO will be auto-generated as username/trackio-experiments
-
-# Create the dataset
-python scripts/dataset_tonic/setup_hf_dataset.py
-```
-
-### 2. Deploy Trackio Space
-```bash
-# Deploy the Space (no username needed - extracted from token)
-python scripts/trackio_tonic/deploy_trackio_space.py
-```
-
-### 3. Secrets are Automatically Set
-The script now automatically sets the required secrets via the HF Hub API:
-- `HF_TOKEN` - Your Hugging Face token
-- `TRACKIO_DATASET_REPO` - Your dataset repository (if specified)
-
-### 4. Test Configuration
-```bash
-# Test the configuration
-python scripts/trackio_tonic/configure_trackio.py
-```
-
-## New Features
-
-### ✅ **Automatic Secret Setting**
-- Uses `add_space_secret()` API method
-- Sets `HF_TOKEN` automatically
-- Sets `TRACKIO_DATASET_REPO` if specified
-- Falls back to manual instructions if API fails
-
-### ✅ **Username Extraction from Token**
-- Uses `whoami()` API method
-- No manual username input required
-- Automatically uses username in dataset names
-- Provides better user experience
-
-### ✅ **Improved User Experience**
-- Fewer manual inputs required
-- Automatic configuration based on token
-- Clear feedback about what's happening
-- Better error messages
-
-## Troubleshooting
-
-### Common Issues:
-
-1. **"Repository not found" errors**:
- - Run `setup_hf_dataset.py` to create the dataset first
- - Check that your HF token has write permissions
-
-2. **"Authentication failed" errors**:
- - Verify your HF token is valid
- - Check token permissions on https://huggingface.co/settings/tokens
-
-3. **"Space not accessible" errors**:
- - Wait 2-5 minutes for the Space to build
- - Check Space logs at the Space URL
- - Verify all files were uploaded correctly
-
-4. **"Dataset access failed" errors**:
- - Ensure the dataset repository exists
- - Check that your token has read permissions
- - Verify the dataset repository name is correct
-
-5. **"Secret setting failed" errors**:
- - The script will fall back to manual instructions
- - Follow the provided instructions to set secrets manually
- - Check that your token has write permissions to the Space
-
-### Debugging Steps:
-
-1. **Check token permissions**:
- ```bash
- hf whoami
- ```
-
-2. **Test dataset access**:
- ```python
- from datasets import load_dataset
- dataset = load_dataset("your-username/your-dataset", token="your-token")
- ```
-
-3. **Test Space deployment**:
- ```bash
- python scripts/trackio_tonic/deploy_trackio_space.py
- ```
-
-4. **Test secret setting**:
- ```python
- from huggingface_hub import HfApi
- api = HfApi(token="your-token")
- api.add_space_secret("your-username/your-space", "TEST_KEY", "test_value")
- ```
-
-## Security Considerations
-
-- **Public datasets**: Datasets are now public by default for easier access
-- **Token security**: Never commit tokens to version control
-- **Space secrets**: Automatically set via API, with manual fallback
-- **Access control**: Verify token permissions before deployment
-
-## Performance Improvements
-
-- **Direct API calls**: Eliminated git dependency for faster uploads
-- **Automatic configuration**: No manual username input required
-- **Per-file uploads**: Files are uploaded individually for better error handling and recovery
-- **Caching**: HF Hub API handles caching automatically
-- **Error recovery**: Better error handling and retry logic
-
-## Future Enhancements
-
-1. **Batch secret setting**: Set multiple secrets in one API call
-2. **Progress tracking**: Add progress bars for large uploads
-3. **Validation**: Add more comprehensive validation checks
-4. **Rollback**: Add ability to rollback failed deployments
-5. **Hardware configuration**: Automatically configure Space hardware
-
-## Testing
-
-To test the fixes:
-
-```bash
-# Test dataset creation
-python scripts/dataset_tonic/setup_hf_dataset.py
-
-# Test Space deployment
-python scripts/trackio_tonic/deploy_trackio_space.py
-
-# Test configuration
-python scripts/trackio_tonic/configure_trackio.py
-
-# Test model push (if you have a trained model)
-python scripts/model_tonic/push_to_huggingface.py --model-path /path/to/model --repo-name your-username/your-model
-```
-
-## Summary
-
-These fixes resolve the main issues with:
-- ✅ Git authentication problems
-- ✅ Dataset repository creation failures
-- ✅ Missing environment variable setup
-- ✅ Manual username input requirement
-- ✅ Poor error handling and user feedback
-- ✅ Access issues with private datasets (now public by default)
-
-The scripts now use the HF Hub API directly, provide better error messages, handle edge cases properly, and offer a much improved user experience with automatic configuration.
\ No newline at end of file
diff --git a/docs/TRACKIO_DICT_ACCESS_FIX.md b/docs/TRACKIO_DICT_ACCESS_FIX.md
deleted file mode 100644
index b26a4c2bb76c9cac9459cb17e17e2d1abea85e96..0000000000000000000000000000000000000000
--- a/docs/TRACKIO_DICT_ACCESS_FIX.md
+++ /dev/null
@@ -1,144 +0,0 @@
-# TrackioConfig Dictionary-Style Access Fix
-
-## Problem Description
-
-The error `'TrackioConfig' object does not support item assignment` occurred because the TRL library was trying to use dictionary-style item assignment on our `TrackioConfig` object (like `config['key'] = value`), but our implementation only supported attribute assignment.
-
-## Root Cause
-
-TRL expects configuration objects to support both attribute-style and dictionary-style access:
-- Attribute-style: `config.project_name = "test"`
-- Dictionary-style: `config['project_name'] = "test"`
-
-Our `TrackioConfig` class only implemented attribute-style access, causing TRL to fail when it tried to use dictionary-style assignment.
-
-## Solution Implementation
-
-### Enhanced TrackioConfig Class
-
-Modified `src/trackio.py` to add full dictionary-style access support:
-
-```python
-class TrackioConfig:
- """Configuration class for trackio (TRL compatibility)"""
-
- def __init__(self):
- # ... existing initialization ...
-
- def update(self, config_dict: Dict[str, Any] = None, **kwargs):
- # ... existing update method ...
-
- def __getitem__(self, key: str) -> Any:
- """Dictionary-style access to configuration values"""
- if hasattr(self, key):
- return getattr(self, key)
- else:
- raise KeyError(f"Configuration key '{key}' not found")
-
- def __setitem__(self, key: str, value: Any):
- """Dictionary-style assignment to configuration values"""
- setattr(self, key, value)
-
- def __contains__(self, key: str) -> bool:
- """Check if configuration key exists"""
- return hasattr(self, key)
-
- def get(self, key: str, default: Any = None) -> Any:
- """Get configuration value with default"""
- if hasattr(self, key):
- return getattr(self, key)
- else:
- return default
-
- def keys(self):
- """Get all configuration keys"""
- return list(self.__dict__.keys())
-
- def items(self):
- """Get all configuration key-value pairs"""
- return list(self.__dict__.items())
-
- def __repr__(self):
- """String representation of configuration"""
- attrs = []
- for key, value in self.__dict__.items():
- attrs.append(f"{key}={repr(value)}")
- return f"TrackioConfig({', '.join(attrs)})"
-```
-
-### Key Features Added
-
-#### 1. **Dictionary-Style Access**
-- `config['key']` - Get configuration value
-- `config['key'] = value` - Set configuration value
-- `'key' in config` - Check if key exists
-
-#### 2. **Dictionary Methods**
-- `config.get('key', default)` - Get with default value
-- `config.keys()` - Get all configuration keys
-- `config.items()` - Get all key-value pairs
-
-#### 3. **TRL Compatibility**
-- Supports TRL's dictionary-style configuration updates
-- Handles dynamic key assignment
-- Maintains backward compatibility with attribute access
-
-## Testing Verification
-
-### Test Results
-- ✅ Dictionary-style assignment: `config['project_name'] = 'test'`
-- ✅ Dictionary-style access: `config['project_name']`
-- ✅ Contains check: `'key' in config`
-- ✅ Get method: `config.get('key', default)`
-- ✅ Keys and items: `config.keys()`, `config.items()`
-- ✅ TRL-style usage: `config['allow_val_change'] = True`
-
-### TRL-Specific Usage Patterns
-```python
-# TRL-style configuration updates
-config['allow_val_change'] = True
-config['report_to'] = 'trackio'
-config['project_name'] = 'my_experiment'
-
-# Dictionary-style access
-project = config['project_name']
-allow_change = config.get('allow_val_change', False)
-```
-
-## Integration with Existing Features
-
-### Maintains All Existing Functionality
-- ✅ Attribute-style access: `config.project_name`
-- ✅ Update method: `config.update({'key': 'value'})`
-- ✅ Keyword arguments: `config.update(allow_val_change=True)`
-- ✅ Dynamic attributes: New attributes added at runtime
-
-### Enhanced Compatibility
-- ✅ Full TRL dictionary-style interface
-- ✅ Backward compatibility with existing code
-- ✅ Robust error handling for missing keys
-- ✅ Comprehensive dictionary methods
-
-## Production Readiness
-
-### Status: ✅ PRODUCTION READY
-
-The enhanced `TrackioConfig` class now provides:
-1. **Complete TRL Compatibility** - Supports all TRL configuration patterns
-2. **Flexible Access** - Both attribute and dictionary-style access
-3. **Robust Error Handling** - Graceful handling of missing keys
-4. **Comprehensive Interface** - Full dictionary-like behavior
-5. **Backward Compatibility** - Existing code continues to work
-
-## Conclusion
-
-The dictionary-style access fix resolves the `'TrackioConfig' object does not support item assignment` error and provides complete compatibility with TRL's configuration expectations.
-
-**Key Achievements:**
-- ✅ Full dictionary-style interface support
-- ✅ TRL configuration pattern compatibility
-- ✅ Backward compatibility maintained
-- ✅ Comprehensive testing verification
-- ✅ Production-ready implementation
-
-**No additional changes are required** for TRL configuration compatibility. The system now handles all known TRL configuration access patterns.
\ No newline at end of file
diff --git a/docs/TRACKIO_INTEGRATION.md b/docs/TRACKIO_INTEGRATION.md
deleted file mode 100644
index a79ed12fbe45ea0a98ad9fdfc5b4a8ed424b8fa0..0000000000000000000000000000000000000000
--- a/docs/TRACKIO_INTEGRATION.md
+++ /dev/null
@@ -1,252 +0,0 @@
-# Trackio Integration for SmolLM3 Fine-tuning
-
-This document provides comprehensive information about the Trackio experiment tracking and monitoring integration for your SmolLM3 fine-tuning pipeline.
-
-## Features
-
-- **SmolLM3 Fine-tuning**: Support for supervised fine-tuning and DPO training
-- **Trackio Integration**: Complete experiment tracking and monitoring
-- **Hugging Face Spaces Deployment**: Easy deployment of Trackio monitoring interface
-- **Comprehensive Logging**: Metrics, parameters, artifacts, and system monitoring
-- **Flexible Configuration**: Support for various training configurations
-
-## Quick Start
-
-### 1. Install Dependencies
-
-```bash
-pip install -r requirements.txt
-```
-
-### 2. Basic Training with Trackio
-
-```bash
-python train.py config/train_smollm3.py \
- --dataset_dir my_dataset \
- --enable_tracking \
- --trackio_url "https://your-trackio-instance.com" \
- --experiment_name "smollm3_finetune_v1"
-```
-
-### 3. Training with Custom Parameters
-
-```bash
-python train.py config/train_smollm3.py \
- --dataset_dir my_dataset \
- --batch_size 8 \
- --learning_rate 1e-5 \
- --max_iters 2000 \
- --enable_tracking \
- --trackio_url "https://your-trackio-instance.com" \
- --experiment_name "smollm3_high_lr_experiment"
-```
-
-## Trackio Integration
-
-### Configuration
-
-Add Trackio settings to your configuration:
-
-```python
-# In your config file
-config = SmolLM3Config(
- # ... other settings ...
-
- # Trackio monitoring configuration
- enable_tracking=True,
- trackio_url="https://your-trackio-instance.com",
- trackio_token="your_token_here", # Optional
- log_artifacts=True,
- log_metrics=True,
- log_config=True,
- experiment_name="my_experiment"
-)
-```
-
-### Environment Variables
-
-You can also set Trackio configuration via environment variables:
-
-```bash
-export TRACKIO_URL="https://your-trackio-instance.com"
-export TRACKIO_TOKEN="your_token_here"
-```
-
-### What Gets Tracked
-
-- **Configuration**: All training parameters and model settings
-- **Metrics**: Loss, accuracy, learning rate, and custom metrics
-- **System Metrics**: GPU memory, CPU usage, training time
-- **Artifacts**: Model checkpoints, evaluation results
-- **Training Summary**: Final results and experiment duration
-
-## Hugging Face Spaces Deployment
-
-### Deploy Trackio Monitoring Interface
-
-1. **Create a new Space** on Hugging Face:
- - Go to https://huggingface.co/spaces
- - Click "Create new Space"
- - Choose "Gradio" as the SDK
- - Set visibility (Public or Private)
-
-2. **Upload the deployment files**:
- - `app.py` - The Gradio interface
- - `requirements_space.txt` - Dependencies
- - `README.md` - Documentation
-
-3. **Configure the Space**:
- - The Space will automatically install dependencies
- - The Gradio interface will be available at your Space URL
-
-### Using the Trackio Space
-
-1. **Create Experiments**: Use the "Create Experiment" tab to start new experiments
-2. **Log Metrics**: Use the "Log Metrics" tab to track training progress
-3. **View Results**: Use the "View Experiments" tab to see experiment details
-4. **Update Status**: Use the "Update Status" tab to mark experiments as completed
-
-### Integration with Your Training
-
-To connect your training script to the Trackio Space:
-
-```python
-# In your training script
-from monitoring import SmolLM3Monitor
-
-# Initialize monitor
-monitor = SmolLM3Monitor(
- experiment_name="my_experiment",
- trackio_url="https://your-space.hf.space", # Your Space URL
- enable_tracking=True
-)
-
-# Log configuration
-monitor.log_config(config_dict)
-
-# Log metrics during training
-monitor.log_metrics({"loss": 0.5, "accuracy": 0.85}, step=100)
-
-# Log final results
-monitor.log_training_summary(final_results)
-```
-
-## Configuration Files
-
-### Main Configuration (`config/train_smollm3.py`)
-
-```python
-@dataclass
-class SmolLM3Config:
- # Model configuration
- model_name: str = "HuggingFaceTB/SmolLM3-3B"
- max_seq_length: int = 4096
-
- # Training configuration
- batch_size: int = 4
- learning_rate: float = 2e-5
- max_iters: int = 1000
-
- # Trackio monitoring
- enable_tracking: bool = True
- trackio_url: Optional[str] = None
- trackio_token: Optional[str] = None
- experiment_name: Optional[str] = None
-```
-
-### DPO Configuration (`config/train_smollm3_dpo.py`)
-
-```python
-@dataclass
-class SmolLM3DPOConfig(SmolLM3Config):
- # DPO-specific settings
- beta: float = 0.1
- max_prompt_length: int = 2048
-
- # Trackio monitoring (inherited)
- enable_tracking: bool = True
- trackio_url: Optional[str] = None
-```
-
-## Monitoring Features
-
-### Real-time Metrics
-
-- Training loss and evaluation metrics
-- Learning rate scheduling
-- GPU memory and utilization
-- Training time and progress
-
-### Artifact Tracking
-
-- Model checkpoints at regular intervals
-- Evaluation results and plots
-- Configuration snapshots
-- Training logs and summaries
-
-### Experiment Management
-
-- Experiment naming and organization
-- Status tracking (running, completed, failed)
-- Parameter comparison across experiments
-- Result visualization
-
-## Advanced Usage
-
-### Custom Metrics
-
-```python
-# Log custom metrics
-monitor.log_metrics({
- "custom_metric": value,
- "perplexity": perplexity_score,
- "bleu_score": bleu_score
-}, step=current_step)
-```
-
-### System Monitoring
-
-```python
-# Log system metrics
-monitor.log_system_metrics(step=current_step)
-```
-
-### Artifact Logging
-
-```python
-# Log model checkpoint
-monitor.log_model_checkpoint("checkpoint-1000", step=1000)
-
-# Log evaluation results
-monitor.log_evaluation_results(eval_results, step=1000)
-```
-
-## Troubleshooting
-
-### Common Issues
-
-1. **Trackio not available**: Install with `pip install trackio`
-2. **Connection errors**: Check your Trackio URL and token
-3. **Missing metrics**: Ensure monitoring is enabled in configuration
-4. **Space deployment issues**: Check Gradio version compatibility
-
-### Debug Mode
-
-Enable debug logging:
-
-```python
-import logging
-logging.basicConfig(level=logging.DEBUG)
-```
-
-## Contributing
-
-1. Fork the repository
-2. Create a feature branch
-3. Make your changes
-4. Add tests if applicable
-5. Submit a pull request
-
-## License
-
-This project is licensed under the MIT License - see the LICENSE file for details.
\ No newline at end of file
diff --git a/docs/TRACKIO_INTEGRATION_VERIFICATION.md b/docs/TRACKIO_INTEGRATION_VERIFICATION.md
deleted file mode 100644
index 7c5ba12a88458a6a1c7acc4728f0e1cac6506b05..0000000000000000000000000000000000000000
--- a/docs/TRACKIO_INTEGRATION_VERIFICATION.md
+++ /dev/null
@@ -1,177 +0,0 @@
-# Trackio Integration Verification Report
-
-## ✅ Verification Status: PASSED
-
-All Trackio integration tests have passed successfully. The integration is correctly implemented according to the documentation provided in `TRACKIO_INTEGRATION.md` and `TRACKIO_INTERFACE_GUIDE.md`.
-
-## 🔧 Issues Fixed
-
-### 1. **Training Arguments Configuration**
-- **Issue**: `'bool' object is not callable` error with `report_to` parameter
-- **Fix**: Changed `report_to: "none"` to `report_to: None` in `model.py`
-- **Impact**: Resolves the original training failure
-
-### 2. **Boolean Parameter Type Safety**
-- **Issue**: Boolean parameters not properly typed in training arguments
-- **Fix**: Added explicit boolean conversion for all boolean parameters (see the sketch after this list):
- - `dataloader_pin_memory`
- - `group_by_length`
- - `prediction_loss_only`
- - `ignore_data_skip`
- - `remove_unused_columns`
- - `ddp_find_unused_parameters`
- - `fp16`
- - `bf16`
- - `load_best_model_at_end`
- - `greater_is_better`
-
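-A representative sketch of the explicit casts (the `TrainerFlags` stand-in and parameter values are hypothetical; the real values come from the training config):
-
-```python
-from dataclasses import dataclass
-
-from transformers import TrainingArguments
-
-@dataclass
-class TrainerFlags:  # hypothetical stand-in for the real config object
-    fp16: bool = False
-    bf16: bool = False
-    group_by_length: bool = True
-    dataloader_pin_memory: bool = True
-    remove_unused_columns: bool = False
-
-config = TrainerFlags()
-training_args = TrainingArguments(
-    output_dir="./outputs",
-    fp16=bool(config.fp16),
-    bf16=bool(config.bf16),
-    group_by_length=bool(config.group_by_length),
-    dataloader_pin_memory=bool(config.dataloader_pin_memory),
-    remove_unused_columns=bool(config.remove_unused_columns),
-)
-```
-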
-### 3. **Callback Implementation**
-- **Issue**: Callback creation failing when tracking disabled
-- **Fix**: Modified `create_monitoring_callback()` to always return a callback
-- **Improvement**: Added proper inheritance from `TrainerCallback` (sketched below)
-
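-A minimal sketch of such a callback (the monitor wiring is illustrative; the real class lives in `src/monitoring.py`):
-
-```python
-from transformers import TrainerCallback
-
-class MonitoringCallback(TrainerCallback):
-    """Always constructible, even when tracking is disabled."""
-
-    def __init__(self, monitor=None):
-        self.monitor = monitor  # None when tracking is disabled
-
-    def on_log(self, args, state, control, logs=None, **kwargs):
-        # Forward trainer logs to the monitor when tracking is enabled
-        if self.monitor is not None and logs:
-            self.monitor.log_metrics(logs, step=state.global_step)
-```
-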
-### 4. **Method Naming Conflicts**
-- **Issue**: Boolean attributes conflicting with method names
-- **Fix**: Renamed boolean attributes to avoid conflicts:
- - `log_config` → `log_config_enabled`
- - `log_metrics` → `log_metrics_enabled`
-
-### 5. **System Compatibility**
-- **Issue**: Training arguments test failing on systems without bf16 support
-- **Fix**: Added conditional bf16 support detection
-- **Improvement**: Added conditional support for `dataloader_prefetch_factor`
-
-## 📊 Test Results
-
-| Test | Status | Description |
-|------|--------|-------------|
-| Trackio Configuration | ✅ PASS | All required attributes present |
-| Monitor Creation | ✅ PASS | Monitor created successfully |
-| Callback Creation | ✅ PASS | Callback with all required methods |
-| Monitor Methods | ✅ PASS | All logging methods work correctly |
-| Training Arguments | ✅ PASS | Arguments created without errors |
-
-## 🎯 Key Features Verified
-
-### 1. **Configuration Management**
-- ✅ Trackio-specific attributes properly defined
-- ✅ Environment variable support
-- ✅ Default values correctly set
-- ✅ Configuration inheritance working
-
-### 2. **Monitoring Integration**
-- ✅ Monitor creation from config
-- ✅ Callback integration with Hugging Face Trainer
-- ✅ Real-time metrics logging
-- ✅ System metrics collection
-- ✅ Artifact tracking
-- ✅ Evaluation results logging
-
-### 3. **Training Integration**
-- ✅ Training arguments properly configured
-- ✅ Boolean parameters correctly typed
-- ✅ `report_to` parameter fixed
-- ✅ Callback methods properly implemented
-- ✅ Error handling enhanced
-
-### 4. **Interface Compatibility**
-- ✅ Compatible with Trackio Space deployment
-- ✅ Supports all documented features
-- ✅ Handles missing Trackio URL gracefully
-- ✅ Provides fallback behavior
-
-## 🚀 Integration Points
-
-### 1. **With Training Script**
-```python
-# Automatic integration via config
-config = SmolLM3ConfigOpenHermesFRBalanced()
-monitor = create_monitor_from_config(config)
-
-# Callback automatically added to trainer
-trainer = Trainer(
- model=model,
- args=training_args,
- callbacks=[monitor.create_monitoring_callback()]
-)
-```
-
-### 2. **With Trackio Space**
-```python
-# Configuration for Trackio Space
-config.trackio_url = "https://your-space.hf.space"
-config.enable_tracking = True
-config.experiment_name = "my_experiment"
-```
-
-### 3. **With Hugging Face Trainer**
-```python
-# Training arguments properly configured
-training_args = model.get_training_arguments(
- output_dir=output_dir,
- report_to=None, # Fixed
- # ... other parameters
-)
-```
-
-## 📈 Monitoring Features
-
-### Real-time Metrics
-- ✅ Training loss and evaluation metrics
-- ✅ Learning rate scheduling
-- ✅ GPU memory and utilization
-- ✅ Training time and progress
-
-### Artifact Tracking
-- ✅ Model checkpoints at regular intervals
-- ✅ Evaluation results and plots
-- ✅ Configuration snapshots
-- ✅ Training logs and summaries
-
-### Experiment Management
-- ✅ Experiment naming and organization
-- ✅ Status tracking (running, completed, failed)
-- ✅ Parameter comparison across experiments
-- ✅ Result visualization
-
-## 🔍 Error Handling
-
-### Graceful Degradation
-- ✅ Continues training when Trackio unavailable
-- ✅ Handles missing environment variables
-- ✅ Provides console logging fallback
-- ✅ Maintains functionality without external dependencies
-
-### Robust Callbacks
-- ✅ Callback methods handle exceptions gracefully
-- ✅ Training continues even if monitoring fails
-- ✅ Detailed error logging for debugging
-- ✅ Fallback to console monitoring
-
-## 📋 Compliance with Documentation
-
-### TRACKIO_INTEGRATION.md Requirements
-- ✅ All configuration options implemented
-- ✅ Environment variable support
-- ✅ Hugging Face Spaces deployment ready
-- ✅ Comprehensive logging features
-- ✅ Artifact tracking capabilities
-
-### TRACKIO_INTERFACE_GUIDE.md Requirements
-- ✅ Real-time visualization support
-- ✅ Interactive plots and metrics
-- ✅ Experiment comparison features
-- ✅ Demo data generation
-- ✅ Status tracking and updates
-
-## 🎉 Conclusion
-
-The Trackio integration is **fully functional** and **correctly implemented** according to the provided documentation. All major issues have been resolved:
-
-1. **Original Error Fixed**: The `'bool' object is not callable` error has been resolved
-2. **Callback Integration**: Trackio callbacks now work correctly with Hugging Face Trainer
-3. **Configuration Management**: All Trackio-specific configuration is properly handled
-4. **Error Handling**: Robust error handling and graceful degradation implemented
-5. **Compatibility**: Works across different systems and configurations
-
-The integration is ready for production use and will provide comprehensive monitoring for SmolLM3 fine-tuning experiments.
\ No newline at end of file
diff --git a/docs/TRACKIO_INTERFACE_GUIDE.md b/docs/TRACKIO_INTERFACE_GUIDE.md
deleted file mode 100644
index d3786a57237202045e489ef4bb134eda3042b9f4..0000000000000000000000000000000000000000
--- a/docs/TRACKIO_INTERFACE_GUIDE.md
+++ /dev/null
@@ -1,222 +0,0 @@
-# Enhanced Trackio Interface Guide
-
-## Overview
-
-Your Trackio application has been significantly enhanced to provide comprehensive monitoring and visualization for SmolLM3 training experiments. Here's how to make the most of it.
-
-## 🚀 Key Enhancements
-
-### 1. **Real-time Visualization**
-- **Interactive Plots**: Loss curves, accuracy, learning rate, GPU metrics
-- **Experiment Comparison**: Compare multiple training runs side-by-side
-- **Live Updates**: Watch training progress in real-time
-
-### 2. **Comprehensive Data Display**
-- **Formatted Output**: Clean, emoji-rich experiment details
-- **Statistics Overview**: Metrics count, parameters count, artifacts count
-- **Status Tracking**: Visual status indicators (🟢 running, ✅ completed, ❌ failed)
-
-### 3. **Demo Data Generation**
-- **Realistic Simulation**: Generate realistic training metrics for testing
-- **Multiple Metrics**: Loss, accuracy, learning rate, GPU memory, training time
-- **Configurable Parameters**: Customize demo data to match your setup
-
-## 📊 How to Use with Your SmolLM3 Training
-
-### Step 1: Start Your Training
-```bash
-python run_a100_large_experiment.py \
- --config config/train_smollm3_openhermes_fr_a100_balanced.py \
- --trackio_url "https://tonic-test-trackio-test.hf.space" \
- --experiment-name "petit-elle-l-aime-3-balanced" \
- --output-dir ./outputs/balanced
-```
-
-### Step 2: Monitor in Real-time
-1. **Visit your Trackio Space**: `https://tonic-test-trackio-test.hf.space`
-2. **Go to "View Experiments" tab**
-3. **Enter your experiment ID** (e.g., `exp_20231201_143022`)
-4. **Click "View Experiment"** to see detailed information
-
-### Step 3: Visualize Training Progress
-1. **Go to "📊 Visualizations" tab**
-2. **Enter your experiment ID**
-3. **Select a metric** (loss, accuracy, learning_rate, gpu_memory, training_time)
-4. **Click "Create Plot"** to see interactive charts
-
-### Step 4: Compare Experiments
-1. **In the "📊 Visualizations" tab**
-2. **Enter multiple experiment IDs** (comma-separated)
-3. **Click "Compare Experiments"** to see side-by-side comparison
-
-## 🎯 Interface Features
-
-### Create Experiment Tab
-- **Experiment Name**: Descriptive name for your training run
-- **Description**: Detailed description of what you're training
-- **Automatic ID Generation**: Unique experiment identifier
-
-### Log Metrics Tab
-- **Experiment ID**: The experiment to log metrics for
-- **Metrics JSON**: Training metrics in JSON format
-- **Step**: Current training step (optional)
-
-Example metrics JSON:
-```json
-{
- "loss": 0.5234,
- "accuracy": 0.8567,
- "learning_rate": 3.5e-6,
- "gpu_memory_gb": 22.5,
- "gpu_utilization_percent": 87.3,
- "training_time_per_step": 0.456
-}
-```
-
-### Log Parameters Tab
-- **Experiment ID**: The experiment to log parameters for
-- **Parameters JSON**: Training configuration in JSON format
-
-Example parameters JSON:
-```json
-{
- "model_name": "HuggingFaceTB/SmolLM3-3B",
- "batch_size": 8,
- "learning_rate": 3.5e-6,
- "max_iters": 18000,
- "mixed_precision": "bf16",
- "no_think_system_message": true
-}
-```
-
-### View Experiments Tab
-- **Experiment ID**: Enter to view specific experiment
-- **List All Experiments**: Shows overview of all experiments
-- **Detailed Information**: Formatted display with statistics
-
-### 📊 Visualizations Tab
-- **Training Metrics**: Interactive plots for individual metrics
-- **Experiment Comparison**: Side-by-side comparison of multiple runs
-- **Real-time Updates**: Plots update as new data is logged
-
-### 🎯 Demo Data Tab
-- **Generate Demo Data**: Create realistic training data for testing
-- **Configurable**: Adjust parameters to match your setup
-- **Multiple Metrics**: Simulates loss, accuracy, GPU metrics, etc.
-
-### Update Status Tab
-- **Experiment ID**: The experiment to update
-- **Status**: running, completed, failed, paused
-- **Visual Indicators**: Status shown with emojis
-
-## 📈 What Gets Displayed
-
-### Training Metrics
-- **Loss**: Training loss over time
-- **Accuracy**: Model accuracy progression
-- **Learning Rate**: Learning rate scheduling
-- **GPU Memory**: Memory usage in GB
-- **GPU Utilization**: GPU usage percentage
-- **Training Time**: Time per training step
-
-### Experiment Details
-- **Basic Info**: ID, name, description, status, creation time
-- **Statistics**: Metrics count, parameters count, artifacts count
-- **Parameters**: All training configuration
-- **Latest Metrics**: Most recent training metrics
-
-### Visualizations
-- **Line Charts**: Smooth curves showing metric progression
-- **Interactive Hover**: Detailed information on hover
-- **Multiple Metrics**: Switch between different metrics
-- **Comparison Charts**: Side-by-side experiment comparison
-
-## 🔧 Integration with Your Training
-
-### Automatic Integration
-Your training script automatically:
-1. **Creates experiments** with your specified name
-2. **Logs parameters** from your configuration
-3. **Logs metrics** every 25 steps (configurable)
-4. **Logs system metrics** (GPU memory, utilization)
-5. **Logs checkpoints** every 2000 steps
-6. **Updates status** when training completes
-
-### Manual Integration
-You can also manually:
-1. **Create experiments** through the interface
-2. **Log custom metrics** for specific analysis
-3. **Compare different runs** with different parameters
-4. **Generate demo data** for testing the interface
-
-## 🎨 Customization
-
-### Adding Custom Metrics
-```python
-# In your training script
-custom_metrics = {
- "loss": current_loss,
- "accuracy": current_accuracy,
- "custom_metric": your_custom_value,
- "gpu_memory": gpu_memory_usage
-}
-
-monitor.log_metrics(custom_metrics, step=current_step)
-```
-
-### Custom Visualizations
-The interface supports any metric you log. Just add it to your metrics JSON and it will appear in the visualization dropdown.
-
-## 🚨 Troubleshooting
-
-### No Data Displayed
-1. **Check experiment ID**: Make sure you're using the correct ID
-2. **Verify metrics were logged**: Check if training is actually logging metrics
-3. **Use demo data**: Generate demo data to test the interface
-
-### Plots Not Updating
-1. **Refresh the page**: Sometimes plots need a refresh
-2. **Check data format**: Ensure metrics are in the correct JSON format
-3. **Verify step numbers**: Make sure step numbers are increasing
-
-### Interface Not Loading
-1. **Check dependencies**: Ensure plotly and pandas are installed
-2. **Check Gradio version**: Use Gradio 4.0.0 or higher
-3. **Check browser console**: Look for JavaScript errors
-
-## 📊 Example Workflow
-
-1. **Start Training**:
- ```bash
- python run_a100_large_experiment.py --experiment-name "my_experiment"
- ```
-
-2. **Monitor Progress**:
- - Visit your Trackio Space
- - Go to "View Experiments"
- - Enter your experiment ID
- - Watch real-time updates
-
-3. **Visualize Results**:
- - Go to "📊 Visualizations"
- - Select "loss" metric
- - Create plot to see training progress
-
-4. **Compare Runs**:
- - Run multiple experiments with different parameters
- - Use "Compare Experiments" to see differences
-
-5. **Generate Demo Data**:
- - Use "🎯 Demo Data" tab to test the interface
- - Generate realistic training data for demonstration
-
-## 🎉 Success Indicators
-
-Your interface is working correctly when you see:
-- ✅ **Formatted experiment details** with emojis and structure
-- ✅ **Interactive plots** that respond to your inputs
-- ✅ **Real-time metric updates** during training
-- ✅ **Clean experiment overview** with statistics
-- ✅ **Smooth visualization** with hover information
-
-The enhanced interface will now display much more meaningful information and provide a comprehensive monitoring experience for your SmolLM3 training experiments!
\ No newline at end of file
diff --git a/docs/TRACKIO_SPACE_DEPLOYMENT_FIXES.md b/docs/TRACKIO_SPACE_DEPLOYMENT_FIXES.md
deleted file mode 100644
index b554a2aa15727cd28d9d7daf17627a9aea8bc62c..0000000000000000000000000000000000000000
--- a/docs/TRACKIO_SPACE_DEPLOYMENT_FIXES.md
+++ /dev/null
@@ -1,262 +0,0 @@
-# Trackio Space Deployment Fixes
-
-## Issues Identified
-
-Based on the reference Hugging Face Space structure at [yourbench/advanced](https://huggingface.co/spaces/yourbench/advanced/tree/main), the original Trackio Space deployment had several issues:
-
-1. **Incorrect File Structure**: Not following the proper Hugging Face Spaces format
-2. **Poor Git Integration**: Trying to use git commands incorrectly
-3. **Missing Required Files**: Incomplete template structure
-4. **Incorrect README Format**: Not following HF Spaces metadata format
-5. **Dependency Issues**: Requirements file not properly structured
-
-## Fixes Applied
-
-### 1. Proper Hugging Face Spaces Structure
-
-**Before**: Files were copied to current directory and pushed via git
-**After**: Files are prepared in temporary directory with proper structure
-
-```python
-# New approach - proper temp directory handling
-temp_dir = tempfile.mkdtemp()
-# Copy files to temp directory
-shutil.copy2(source_path, dest_path)
-# Initialize git in temp directory
-os.chdir(temp_dir)
-subprocess.run(["git", "init"], check=True)
-subprocess.run(["git", "remote", "add", "origin", space_url], check=True)
-```
-
-### 2. Correct README.md Format
-
-**Before**: Basic README without proper HF Spaces metadata
-**After**: Proper HF Spaces metadata format
-
-```markdown
----
-title: Trackio Experiment Tracking
-emoji: 📊
-colorFrom: indigo
-colorTo: yellow
-sdk: gradio
-sdk_version: 4.44.0
-app_file: app.py
-pinned: true
-license: mit
-short_description: Trackio experiment tracking and monitoring interface
----
-```
-
-### 3. Updated Requirements.txt
-
-**Before**: Duplicate dependencies and incorrect versions
-**After**: Clean, organized dependencies
-
-```txt
-# Core Gradio dependencies
-gradio>=4.0.0
-gradio-client>=0.10.0
-
-# Data processing and visualization
-pandas>=2.0.0
-numpy>=1.24.0
-plotly>=5.15.0
-
-# HTTP requests and API
-requests>=2.31.0
-
-# JSON handling
-jsonschema>=4.17.0
-
-# Hugging Face integration
-datasets>=2.14.0
-huggingface-hub>=0.16.0
-
-# Environment and configuration
-python-dotenv>=1.0.0
-
-# Optional: for better performance
-matplotlib>=3.7.0
-```
-
-### 4. Improved Deployment Script
-
-**Key Improvements**:
-- Proper temporary directory handling
-- Better error handling and logging
-- Correct git workflow
-- Environment variable setup
-- Comprehensive testing
-
-```python
-class TrackioSpaceDeployer:
- def __init__(self, space_name: str, username: str, token: str):
- self.space_name = space_name
- self.username = username
- self.token = token
- self.space_url = f"https://huggingface.co/spaces/{username}/{space_name}"
-
- def create_space(self) -> bool:
- # Set HF token for CLI
- os.environ['HF_TOKEN'] = self.token
- # Create space with proper error handling
-
- def prepare_space_files(self) -> str:
- # Create temp directory and copy files
- # Update README with actual space URL
-
- def upload_files_to_space(self, temp_dir: str) -> bool:
- # Proper git workflow in temp directory
- # Push to main/master branch
-```
-
-## Files Modified
-
-### Core Deployment Files
-1. **`scripts/trackio_tonic/deploy_trackio_space.py`**
- - Complete rewrite following HF Spaces best practices
- - Proper temporary directory handling
- - Better error handling and logging
- - Correct git workflow
-
-### Template Files
-2. **`templates/spaces/README.md`**
- - Updated to proper HF Spaces metadata format
- - Comprehensive documentation
- - API endpoint documentation
- - Troubleshooting guide
-
-3. **`templates/spaces/requirements.txt`**
- - Clean, organized dependencies
- - Proper version specifications
- - All required packages included
-
-### Test Files
-4. **`tests/test_trackio_deployment.py`**
- - Comprehensive deployment testing
- - Template structure validation
- - File content verification
- - Deployment script testing
-
-## Testing the Deployment
-
-### Run Deployment Tests
-```bash
-python tests/test_trackio_deployment.py
-```
-
-Expected output:
-```
-🚀 Testing Trackio Space Deployment
-==================================================
-🔍 Testing templates structure...
-✅ app.py exists
-✅ requirements.txt exists
-✅ README.md exists
-
-🔍 Testing app.py content...
-✅ Found: import gradio as gr
-✅ Found: class TrackioSpace
-✅ Found: def create_experiment_interface
-✅ Found: def log_metrics_interface
-✅ Found: def log_parameters_interface
-✅ Found: demo.launch()
-
-🔍 Testing requirements.txt content...
-✅ Found: gradio>=
-✅ Found: pandas>=
-✅ Found: numpy>=
-✅ Found: plotly>=
-✅ Found: requests>=
-✅ Found: datasets>=
-✅ Found: huggingface-hub>=
-
-🔍 Testing README.md structure...
-✅ Found: ---
-✅ Found: title: Trackio Experiment Tracking
-✅ Found: sdk: gradio
-✅ Found: app_file: app.py
-✅ Found: # Trackio Experiment Tracking
-✅ Found: ## Features
-✅ Found: ## Usage
-✅ Found: Visit: {SPACE_URL}
-
-🔍 Testing deployment script...
-✅ TrackioSpaceDeployer class imported successfully
-✅ Method exists: create_space
-✅ Method exists: prepare_space_files
-✅ Method exists: upload_files_to_space
-✅ Method exists: test_space
-✅ Method exists: deploy
-
-🔍 Testing temporary directory creation...
-✅ Created temp directory: /tmp/tmp_xxxxx
-✅ File copying works
-✅ Cleanup successful
-
-📊 Test Results: 6/6 tests passed
-✅ All deployment tests passed! The Trackio Space should deploy correctly.
-```
-
-### Deploy Trackio Space
-```bash
-python scripts/trackio_tonic/deploy_trackio_space.py
-```
-
-## Key Improvements
-
-### 1. **Proper HF Spaces Structure**
-- Follows the exact format from reference spaces
-- Correct metadata in README.md
-- Proper file organization
-
-### 2. **Robust Deployment Process**
-- Temporary directory handling
-- Proper git workflow
-- Better error handling
-- Comprehensive logging
-
-### 3. **Better Error Handling**
-- Graceful failure handling
-- Detailed error messages
-- Fallback mechanisms
-- Cleanup procedures
-
-### 4. **Comprehensive Testing**
-- Template structure validation
-- File content verification
-- Deployment script testing
-- Integration testing
-
-## Reference Structure
-
-The fixes are based on the Hugging Face Space structure from [yourbench/advanced](https://huggingface.co/spaces/yourbench/advanced/tree/main), which includes:
-
-- **Proper README.md** with HF Spaces metadata
-- **Clean requirements.txt** with organized dependencies
-- **Correct app.py** structure for Gradio
-- **Proper git workflow** for deployment
-
-## Next Steps
-
-1. **Test the deployment**:
- ```bash
- python tests/test_trackio_deployment.py
- ```
-
-2. **Deploy the Space**:
- ```bash
- python scripts/trackio_tonic/deploy_trackio_space.py
- ```
-
-3. **Verify deployment**:
- - Check the Space URL
- - Test the interface
- - Verify API endpoints
-
-4. **Use in training**:
- - Update your training scripts with the new Space URL
- - Test the monitoring integration
-
-The Trackio Space should now deploy correctly and provide reliable experiment tracking for your SmolLM3 fine-tuning pipeline! 🚀
\ No newline at end of file
diff --git a/docs/TRACKIO_TRL_FIX.md b/docs/TRACKIO_TRL_FIX.md
deleted file mode 100644
index 8c06cf17b9887d92fb08e15e84f41e410072d546..0000000000000000000000000000000000000000
--- a/docs/TRACKIO_TRL_FIX.md
+++ /dev/null
@@ -1,169 +0,0 @@
-# Trackio TRL Compatibility Fix
-
-## Problem Analysis
-
-The TRL library (specifically SFTTrainer) expects a `trackio` module with the following interface:
-- `trackio.init()` - Initialize experiment tracking
-- `trackio.log()` - Log metrics during training
-- `trackio.finish()` - Finish experiment tracking
-- `trackio.config` - Access configuration (additional requirement discovered)
-
-Our custom monitoring system didn't provide this interface, causing the training to fail.
-
-## Solution Implementation
-
-### 1. Created Trackio Module Interface (`src/trackio.py`)
-
-Created a new module that provides the exact interface expected by TRL:
-
-```python
-def init(project_name: Optional[str] = None, experiment_name: Optional[str] = None, **kwargs) -> str:
- """Initialize trackio experiment (TRL interface)"""
- # Implementation that routes to our SmolLM3Monitor
-
-def log(metrics: Dict[str, Any], step: Optional[int] = None, **kwargs):
- """Log metrics to trackio (TRL interface)"""
- # Implementation that routes to our SmolLM3Monitor
-
-def finish():
- """Finish trackio experiment (TRL interface)"""
- # Implementation that routes to our SmolLM3Monitor
-
-# Added config attribute for TRL compatibility
-class TrackioConfig:
- """Configuration class for trackio (TRL compatibility)"""
- def __init__(self):
- self.project_name = os.environ.get('EXPERIMENT_NAME', 'smollm3_experiment')
- self.experiment_name = os.environ.get('EXPERIMENT_NAME', 'smollm3_experiment')
- # ... other config properties
-
-config = TrackioConfig()
-```
-
-**Key Feature**: The `init()` function can be called without any arguments, making it compatible with TRL's expectations. It will use environment variables or defaults when no arguments are provided.
-
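-In practice, both call styles work with the interface above:
-
-```python
-import trackio
-
-# With explicit arguments
-run_id = trackio.init(project_name="smollm3_experiment", experiment_name="run_1")
-
-# Without arguments -- falls back to EXPERIMENT_NAME or the defaults
-run_id = trackio.init()
-
-trackio.log({"loss": 0.42}, step=1)
-trackio.finish()
-```
-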
-### 2. Global Trackio Module (`trackio.py`)
-
-Created a root-level `trackio.py` file that imports from our custom implementation:
-
-```python
-from src.trackio import (
- init, log, finish, log_config, log_checkpoint,
- log_evaluation_results, get_experiment_url, is_available, get_monitor
-)
-```
-
-This makes the trackio module available globally for TRL to import.
-
-### 3. Updated Trainer Integration (`src/trainer.py`)
-
-Modified the trainer to properly initialize trackio before creating SFTTrainer:
-
-```python
-# Initialize trackio for TRL compatibility
-try:
- import trackio
- experiment_id = trackio.init(
- project_name=self.config.experiment_name,
- experiment_name=self.config.experiment_name,
- trackio_url=getattr(self.config, 'trackio_url', None),
- trackio_token=getattr(self.config, 'trackio_token', None),
- hf_token=getattr(self.config, 'hf_token', None),
- dataset_repo=getattr(self.config, 'dataset_repo', None)
- )
- logger.info(f"Trackio initialized with experiment ID: {experiment_id}")
-except Exception as e:
- logger.warning(f"Failed to initialize trackio: {e}")
- logger.info("Continuing without trackio integration")
-```
-
-### 4. Proper Cleanup
-
-Added trackio.finish() calls in both success and error scenarios:
-
-```python
-# Finish trackio experiment
-try:
- import trackio
- trackio.finish()
- logger.info("Trackio experiment finished")
-except Exception as e:
- logger.warning(f"Failed to finish trackio experiment: {e}")
-```
-
-## Integration with Custom Monitoring
-
-The trackio module integrates seamlessly with our existing monitoring system:
-
-- Uses `SmolLM3Monitor` for actual monitoring functionality
-- Provides TRL-compatible interface on top
-- Maintains all existing features (HF Datasets, Trackio Space, etc.)
-- Graceful fallback when Trackio Space is not accessible
-
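-A sketch of the routing pattern behind `log()` (the monitor method name is illustrative, not quoted from `src/trackio.py`):
-
-```python
-import logging
-from typing import Any, Dict, Optional
-
-logger = logging.getLogger(__name__)
-_monitor = None  # module-level singleton, set by init()
-
-def log(metrics: Dict[str, Any], step: Optional[int] = None, **kwargs):
-    """Route TRL's log() calls to SmolLM3Monitor without ever crashing training."""
-    if _monitor is None:
-        return  # init() was never called or failed; skip silently
-    try:
-        _monitor.log_metrics(metrics, step=step)  # illustrative method name
-    except Exception as e:
-        logger.warning("Monitoring backend unavailable, metrics dropped: %s", e)
-```
-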
-## Testing and Verification
-
-### Test Script: `tests/test_trackio_trl_fix.py`
-
-The test script verifies:
-
-1. **Module Import**: `import trackio` works correctly
-2. **Function Availability**: All required functions (`init`, `log`, `finish`) exist
-3. **Function Signatures**: Functions have the correct signatures expected by TRL
-4. **Initialization**: `trackio.init()` can be called with and without arguments
-5. **Configuration Access**: `trackio.config` is available and accessible
-6. **Logging**: Metrics can be logged successfully
-7. **Cleanup**: Experiments can be finished properly
-
-### Test Results
-
-```
-✅ Successfully imported trackio module
-✅ Found required function: init
-✅ Found required function: log
-✅ Found required function: finish
-✅ Trackio initialization with args successful: trl_20250727_135621
-✅ Trackio initialization without args successful: trl_20250727_135621
-✅ Trackio logging successful
-✅ Trackio finish successful
-✅ init() can be called without arguments
-✅ trackio.config is available:
-✅ config.project_name: smollm3_experiment
-✅ config.experiment_name: smollm3_experiment
-✅ All tests passed! Trackio TRL fix is working correctly.
-```
-
-## Benefits
-
-1. **Resolves Training Error**: Fixes the "module trackio has no attribute init" error and "init() missing 1 required positional argument: 'project_name'" error
-2. **Maintains Functionality**: All existing monitoring features continue to work
-3. **TRL Compatibility**: SFTTrainer can now use trackio for logging, even when called without arguments
-4. **Graceful Fallback**: Continues training even if trackio initialization fails
-5. **Future-Proof**: Easy to extend with additional TRL-compatible functions
-6. **Flexible Initialization**: Supports both argument-based and environment-based configuration
-
-## Usage
-
-The fix is transparent to users. Training will now work with SFTTrainer and automatically:
-
-1. Initialize trackio when SFTTrainer is created
-2. Log metrics during training
-3. Finish the experiment when training completes
-4. Fall back gracefully if trackio is not available
-
-## Files Modified
-
-- `src/trackio.py` - New trackio module interface
-- `trackio.py` - Global trackio module for TRL
-- `src/trainer.py` - Updated trainer integration
-- `src/__init__.py` - Package exports
-- `tests/test_trackio_trl_fix.py` - Test suite
-
-## Verification
-
-To verify the fix works:
-
-```bash
-python tests/test_trackio_trl_fix.py
-```
-
-This should show all tests passing and confirm that the trackio module provides the interface expected by TRL library.
\ No newline at end of file
diff --git a/docs/TRACKIO_TRL_FIX_SUMMARY.md b/docs/TRACKIO_TRL_FIX_SUMMARY.md
deleted file mode 100644
index 8684e619a47751229663bc46f35d259dcdb0f9aa..0000000000000000000000000000000000000000
--- a/docs/TRACKIO_TRL_FIX_SUMMARY.md
+++ /dev/null
@@ -1,41 +0,0 @@
-# Trackio TRL Fix - Complete Solution Summary
-
-## ✅ Complete Solution Summary
-
-I have successfully implemented a comprehensive fix for the Trackio TRL compatibility issues. Here's what was accomplished:
-
-### 🔧 **Problems Solved**
-
-1. **`module 'trackio' has no attribute 'init'`** - TRL expected trackio.init() function
-2. **`init() missing 1 required positional argument: 'project_name'`** - TRL called init() without arguments
-3. **`module 'trackio' has no attribute 'config'`** - TRL expected trackio.config attribute
-
-### 🛠️ **Solution Components**
-
-#### 1. **Trackio Module Interface** (`src/trackio.py`)
-- Created `init()`, `log()`, `finish()` functions expected by TRL
-- Added `TrackioConfig` class with `config` attribute
-- Routes all calls to our custom `SmolLM3Monitor`
-
-#### 2. **Global Module Access** (`trackio.py`)
-- Root-level module that imports from `src.trackio`
-- Makes functions globally available for TRL import
-
-#### 3. **Enhanced Trainer Integration** (`src/trainer.py`)
-- Explicit trackio initialization before SFTTrainer creation
-- Proper cleanup with trackio.finish() calls
-- Robust error handling and fallbacks
-
-#### 4. **Comprehensive Testing** (`tests/test_trackio_trl_fix.py`)
-- Verifies all required functions exist and work
-- Tests both argument and no-argument initialization
-- Confirms config attribute accessibility
-- Validates monitoring integration
-
-### 🎯 **Key Features**
-
-- **TRL Compatibility**: Full interface compatibility with TRL library expectations
-- **Flexible Initialization**: Supports both argument and no-argument init() calls
-- **Configuration Access**: Provides trackio.config attribute as expected
-- **Error Resilience**: Graceful fallbacks when external services unavailable
-- **Monitoring Integration**: Seamless integration with our custom monitoring system
\ No newline at end of file
diff --git a/docs/TRACKIO_UPDATE_FIX.md b/docs/TRACKIO_UPDATE_FIX.md
deleted file mode 100644
index a3144471dd1325359b278573dc329d1897bed5b8..0000000000000000000000000000000000000000
--- a/docs/TRACKIO_UPDATE_FIX.md
+++ /dev/null
@@ -1,110 +0,0 @@
-# TrackioConfig Update Method Fix
-
-## Problem Description
-
-The error `'TrackioConfig' object has no attribute 'update'` occurred because the TRL library (specifically SFTTrainer) expects the Trackio configuration object to have an `update` method, but our custom `TrackioConfig` class didn't implement it.
-
-Additionally, TRL calls the `update` method with keyword arguments like `allow_val_change`, which our initial implementation didn't support.
-
-## Root Cause
-
-Based on the [Trackio documentation](https://github.com/gradio-app/trackio?tab=readme-ov-file), Trackio is designed to be API compatible with `wandb.init`, `wandb.log`, and `wandb.finish`. However, the TRL library has additional expectations for the configuration object, including an `update` method that allows dynamic configuration updates with both dictionary and keyword arguments.
-
-## Solution Implementation
-
-### 1. Enhanced Update Method for TrackioConfig
-
-Modified `src/trackio.py` to add a flexible `update` method that handles both dictionary and keyword arguments:
-
-```python
-import os
-from typing import Any, Dict
-
-class TrackioConfig:
-    """Configuration class for trackio (TRL compatibility)"""
-
-    def __init__(self):
-        self.project_name = os.environ.get('EXPERIMENT_NAME', 'smollm3_experiment')
-        self.experiment_name = os.environ.get('EXPERIMENT_NAME', 'smollm3_experiment')
-        self.trackio_url = os.environ.get('TRACKIO_URL')
-        self.trackio_token = os.environ.get('TRACKIO_TOKEN')
-        self.hf_token = os.environ.get('HF_TOKEN')
-        self.dataset_repo = os.environ.get('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments')
-
-    def update(self, config_dict: Dict[str, Any] = None, **kwargs):
-        """
-        Update configuration with new values (TRL compatibility)
-
-        Args:
-            config_dict: Dictionary of configuration values to update (optional)
-            **kwargs: Additional configuration values to update
-        """
-        # Merge dictionary and keyword arguments, then apply them all.
-        # setattr updates existing attributes and adds new ones dynamically,
-        # so TRL extras such as allow_val_change are absorbed without error.
-        updates = dict(config_dict or {})
-        updates.update(kwargs)
-        for key, value in updates.items():
-            setattr(self, key, value)
-```
-
-### 2. Key Features of the Enhanced Fix
-
-- **Flexible Argument Handling**: Supports both dictionary and keyword arguments
-- **TRL Compatibility**: Handles TRL's `allow_val_change` and other keyword arguments
-- **Dynamic Attribute Updates**: Can update existing attributes and add new ones dynamically
-- **Backward Compatibility**: Doesn't break existing functionality
-- **Future-Proof**: Supports additional TRL requirements
-
-### 3. Usage Examples
-
-#### Dictionary-based updates:
-```python
-import trackio
-
-config = trackio.config
-config.update({
-    'project_name': 'my_experiment',
-    'experiment_name': 'test_run_1',
-    'custom_setting': 'value'
-})
-```
-
-#### Keyword argument updates (TRL style):
-```python
-config.update(allow_val_change=True, project_name="test_project")
-```
-
-#### Mixed updates:
-```python
-config.update({'experiment_name': 'test'}, allow_val_change=True, new_attr='value')
-```
-
-## Verification
-
-The enhanced fix has been verified to work correctly:
-
-1. **Import Test**: `import trackio` works without errors
-2. **Config Access**: `trackio.config` is available
-3. **Update Method**: `trackio.config.update()` method exists and works
-4. **Keyword Arguments**: Handles TRL's `allow_val_change` and other kwargs
-5. **TRL Compatibility**: All TRL-expected methods are available
-
-## Benefits
-
-1. **Resolves Training Error**: Fixes both `'TrackioConfig' object has no attribute 'update'` and `'TrackioConfig.update() got an unexpected keyword argument 'allow_val_change'` errors
-2. **Maintains TRL Compatibility**: Ensures SFTTrainer can use Trackio for logging with any argument style
-3. **Dynamic Configuration**: Allows runtime configuration updates via multiple methods
-4. **Future-Proof**: Supports additional TRL requirements and argument patterns
-
-## Related Documentation
-
-- [Trackio TRL Fix Summary](TRACKIO_TRL_FIX_SUMMARY.md)
-- [Trackio Integration Guide](TRACKIO_INTEGRATION.md)
-- [Monitoring Integration Guide](MONITORING_INTEGRATION_GUIDE.md)
\ No newline at end of file
diff --git a/docs/TRAINER_SELECTION_GUIDE.md b/docs/TRAINER_SELECTION_GUIDE.md
deleted file mode 100644
index ff3270877a7538004fe799b4219baf9d9a191b56..0000000000000000000000000000000000000000
--- a/docs/TRAINER_SELECTION_GUIDE.md
+++ /dev/null
@@ -1,205 +0,0 @@
-# Trainer Selection Guide
-
-## Overview
-
-This guide explains how to use the new trainer selection feature that allows you to choose between **SFT (Supervised Fine-tuning)** and **DPO (Direct Preference Optimization)** trainers in the SmolLM3 fine-tuning pipeline.
-
-## Trainer Types
-
-### SFT (Supervised Fine-tuning)
-- **Purpose**: Standard instruction tuning for most fine-tuning tasks
-- **Use Case**: General instruction following, conversation, and task-specific training
-- **Dataset Format**: Standard prompt-completion pairs
-- **Trainer**: `SmolLM3Trainer` with `SFTTrainer` backend
-- **Default**: Yes (default trainer type)
-
-### DPO (Direct Preference Optimization)
-- **Purpose**: Preference-based training using human feedback
-- **Use Case**: Aligning models with human preferences, reducing harmful outputs
-- **Dataset Format**: Preference pairs (chosen/rejected responses)
-- **Trainer**: `SmolLM3DPOTrainer` with `DPOTrainer` backend
-- **Default**: No (must be explicitly selected)
-
-## Implementation Details
-
-### Configuration Changes
-
-#### Base Config (`config/train_smollm3.py`)
-```python
-@dataclass
-class SmolLM3Config:
-    # Trainer type selection
-    trainer_type: str = "sft"  # "sft" or "dpo"
-    # ... other fields
-```
-
-#### DPO Config (`config/train_smollm3_dpo.py`)
-```python
-@dataclass
-class SmolLM3DPOConfig(SmolLM3Config):
-    # Trainer type selection
-    trainer_type: str = "dpo"  # Override default to use DPO trainer
-    # ... DPO-specific fields
-```
-
-### Training Script Changes
-
-#### Command Line Arguments
-Both `src/train.py` and `scripts/training/train.py` now support:
-```bash
---trainer_type {sft,dpo}
-```
-
-#### Trainer Selection Logic
-```python
-# Determine trainer type (command line overrides config)
-trainer_type = args.trainer_type or getattr(config, 'trainer_type', 'sft')
-
-# Initialize trainer based on type
-if trainer_type.lower() == 'dpo':
-    trainer = SmolLM3DPOTrainer(...)
-else:
-    trainer = SmolLM3Trainer(...)
-```
-
-### Launch Script Changes
-
-#### Interactive Selection
-The `launch.sh` script now prompts users to select the trainer type:
-```
-Step 3.5: Trainer Type Selection
-====================================
-
-Select the type of training to perform:
-1. SFT (Supervised Fine-tuning) - Standard instruction tuning
- - Uses SFTTrainer for instruction following
- - Suitable for most fine-tuning tasks
- - Optimized for instruction datasets
-
-2. DPO (Direct Preference Optimization) - Preference-based training
- - Uses DPOTrainer for preference learning
- - Requires preference datasets (chosen/rejected pairs)
- - Optimizes for human preferences
-```
-
-#### Configuration Generation
-The generated config file includes the trainer type:
-```python
-config = SmolLM3Config(
-    # Trainer type selection
-    trainer_type="$TRAINER_TYPE",
-    # ... other fields
-)
-```
-
-## Usage Examples
-
-### Using the Launch Script
-```bash
-./launch.sh
-# Follow the interactive prompts
-# Select "SFT" or "DPO" when prompted
-```
-
-### Using Command Line Arguments
-```bash
-# SFT training (default)
-python src/train.py config/train_smollm3.py
-
-# DPO training
-python src/train.py config/train_smollm3_dpo.py
-
-# Override trainer type
-python src/train.py config/train_smollm3.py --trainer_type dpo
-```
-
-### Using the Training Script
-```bash
-# SFT training
-python scripts/training/train.py --config config/train_smollm3.py
-
-# DPO training
-python scripts/training/train.py --config config/train_smollm3_dpo.py
-
-# Override trainer type
-python scripts/training/train.py --config config/train_smollm3.py --trainer-type dpo
-```
-
-## Dataset Requirements
-
-### SFT Training
-- **Format**: Standard instruction datasets
-- **Fields**: `prompt` and `completion` (or similar)
-- **Examples**: OpenHermes, Alpaca, instruction datasets
-
-### DPO Training
-- **Format**: Preference datasets
-- **Fields**: `chosen` and `rejected` responses (see the example record after this list)
-- **Examples**: Human preference datasets, RLHF datasets
-
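-For reference, a single DPO preference record typically looks like this (field names as above; the content is hypothetical):
-
-```python
-preference_example = {
-    "prompt": "Explain gradient checkpointing in one sentence.",
-    "chosen": (
-        "Gradient checkpointing trades compute for memory by recomputing parts "
-        "of the forward pass during backprop instead of storing all activations."
-    ),
-    "rejected": "It's a way to save checkpoints of your model during training.",
-}
-```
-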
-## Configuration Priority
-
-1. **Command line argument** (`--trainer_type`) - Highest priority
-2. **Config file** (`trainer_type` field) - Medium priority
-3. **Default value** (`"sft"`) - Lowest priority
-
-## Monitoring and Logging
-
-Both trainer types support:
-- Trackio experiment tracking
-- Training metrics logging
-- Model checkpointing
-- Progress monitoring
-
-## Testing
-
-Run the trainer selection tests:
-```bash
-python tests/test_trainer_selection.py
-```
-
-This verifies:
-- Config inheritance works correctly
-- Trainer classes exist and are importable
-- Trainer type defaults are set correctly
-
-## Troubleshooting
-
-### Common Issues
-
-1. **Import Errors**: Ensure all dependencies are installed
-   ```bash
-   pip install "trl>=0.7.0" "transformers>=4.30.0"
-   ```
-
-2. **Dataset Format**: DPO requires preference datasets with `chosen`/`rejected` fields
-
-3. **Memory Issues**: DPO training may require more memory due to reference model
-
-4. **Config Conflicts**: Command line arguments override config file settings
-
-### Debugging
-
-Enable verbose logging to see trainer selection:
-```bash
-python src/train.py config/train_smollm3.py --trainer_type dpo
-```
-
-Look for these log messages:
-```
-Using trainer type: dpo
-Initializing DPO trainer...
-```
-
-## Future Enhancements
-
-- Support for additional trainer types (RLHF, PPO, etc.)
-- Automatic dataset format detection
-- Enhanced preference dataset validation
-- Multi-objective training support
-
-## Related Documentation
-
-- [Training Configuration Guide](TRAINING_CONFIGURATION_GUIDE.md)
-- [Dataset Preparation Guide](DATASET_PREPARATION_GUIDE.md)
-- [Monitoring Integration Guide](MONITORING_INTEGRATION_GUIDE.md)
\ No newline at end of file
diff --git a/docs/TRAINER_SELECTION_SUMMARY.md b/docs/TRAINER_SELECTION_SUMMARY.md
deleted file mode 100644
index 91e8195476bf5b12fb4f4092cd582dd20c3dd444..0000000000000000000000000000000000000000
--- a/docs/TRAINER_SELECTION_SUMMARY.md
+++ /dev/null
@@ -1,129 +0,0 @@
-# Trainer Selection Implementation Summary
-
-## ✅ Completed Implementation
-
-### 1. Configuration Changes
-- ✅ Added `trainer_type` field to base `SmolLM3Config` (default: "sft")
-- ✅ Added `trainer_type` field to `SmolLM3DPOConfig` (default: "dpo")
-- ✅ Updated config file generation in `launch.sh` to include trainer_type
-
-### 2. Training Script Updates
-- ✅ Added `--trainer_type` argument to `src/train.py`
-- ✅ Added `--trainer-type` argument to `scripts/training/train.py`
-- ✅ Implemented trainer selection logic in `src/train.py`
-- ✅ Updated trainer instantiation to support both SFT and DPO
-
-### 3. Launch Script Updates
-- ✅ Added interactive trainer type selection (Step 3.5)
-- ✅ Updated configuration summary to show trainer type
-- ✅ Updated training parameters display to show trainer type
-- ✅ Updated training script call to pass trainer_type argument
-- ✅ Updated summary report to include trainer type
-
-### 4. Documentation and Testing
-- ✅ Created comprehensive `TRAINER_SELECTION_GUIDE.md`
-- ✅ Created test script `tests/test_trainer_selection.py`
-- ✅ All tests passing (3/3)
-
-## 🎯 Key Features
-
-### Interactive Selection
-Users can now choose between SFT and DPO during the launch process:
-```
-Step 3.5: Trainer Type Selection
-====================================
-
-Select the type of training to perform:
-1. SFT (Supervised Fine-tuning) - Standard instruction tuning
-2. DPO (Direct Preference Optimization) - Preference-based training
-```
-
-### Command Line Override
-Users can override the config's trainer type via command line:
-```bash
-python src/train.py config/train_smollm3.py --trainer_type dpo
-python scripts/training/train.py --config config/train_smollm3.py --trainer-type dpo
-```
-
-### Configuration Priority
-1. Command line argument (highest priority)
-2. Config file trainer_type field (medium priority)
-3. Default value "sft" (lowest priority)
-
-### Automatic Trainer Selection
-The system automatically selects the appropriate trainer:
-- **SFT**: Uses `SmolLM3Trainer` with `SFTTrainer` backend
-- **DPO**: Uses `SmolLM3DPOTrainer` with `DPOTrainer` backend
-
-## 📋 Usage Examples
-
-### Launch Script (Interactive)
-```bash
-./launch.sh
-# Follow prompts and select SFT or DPO
-```
-
-### Direct Training
-```bash
-# SFT training (default)
-python src/train.py config/train_smollm3.py
-
-# DPO training
-python src/train.py config/train_smollm3_dpo.py
-
-# Override trainer type
-python src/train.py config/train_smollm3.py --trainer_type dpo
-```
-
-### Training Script
-```bash
-# SFT training
-python scripts/training/train.py --config config/train_smollm3.py
-
-# DPO training with override
-python scripts/training/train.py --config config/train_smollm3.py --trainer-type dpo
-```
-
-## 🔧 Technical Details
-
-### Files Modified
-1. `config/train_smollm3.py` - Added trainer_type field
-2. `config/train_smollm3_dpo.py` - Added trainer_type field
-3. `src/train.py` - Added trainer selection logic
-4. `scripts/training/train.py` - Added trainer_type argument
-5. `launch.sh` - Added interactive selection and config generation
-6. `src/trainer.py` - Already had both trainer classes
-
-### Files Created
-1. `docs/TRAINER_SELECTION_GUIDE.md` - Comprehensive documentation
-2. `tests/test_trainer_selection.py` - Test suite
-3. `TRAINER_SELECTION_SUMMARY.md` - This summary
-
-## ✅ Testing Results
-```
-🧪 Testing Trainer Selection Implementation
-==================================================
-Testing config trainer_type...
-✅ Base config trainer_type: sft
-✅ DPO config trainer_type: dpo
-Testing trainer class existence...
-✅ Trainer module imported successfully
-✅ Both trainer classes exist
-Testing config inheritance...
-✅ DPO config properly inherits from base config
-✅ Trainer type inheritance works correctly
-==================================================
-Tests passed: 3/3
-🎉 All tests passed!
-```
-
-## 🚀 Next Steps
-
-The trainer selection feature is now fully implemented and tested. Users can:
-
-1. **Use the interactive launch script** to select SFT or DPO
-2. **Override trainer type** via command line arguments
-3. **Use DPO configs** that automatically select DPO trainer
-4. **Monitor training** with the same Trackio integration for both trainers
-
-The implementation maintains backward compatibility while adding the new trainer selection capability.
\ No newline at end of file
diff --git a/docs/TRAINING_FIXES_SUMMARY.md b/docs/TRAINING_FIXES_SUMMARY.md
deleted file mode 100644
index 7e409f428cf4bfda31c7e2c2f9b4ca7de480e3b0..0000000000000000000000000000000000000000
--- a/docs/TRAINING_FIXES_SUMMARY.md
+++ /dev/null
@@ -1,150 +0,0 @@
-# SmolLM3 Training Pipeline Fixes Summary
-
-## Issues Identified and Fixed
-
-### 1. Format String Error
-**Issue**: `Unknown format code 'f' for object of type 'str'`
-**Root Cause**: The console callback was trying to format non-numeric values with f-string format specifiers
-**Fix**: Updated `src/trainer.py` to properly handle type conversion before formatting
-
-```python
-# Before (causing error):
-print("Step {}: loss={:.4f}, lr={}".format(step, loss, lr))
-
-# After (fixed):
-if isinstance(loss, (int, float)):
-    loss_str = f"{loss:.4f}"
-else:
-    loss_str = str(loss)
-if isinstance(lr, (int, float)):
-    lr_str = f"{lr:.2e}"
-else:
-    lr_str = str(lr)
-print(f"Step {step}: loss={loss_str}, lr={lr_str}")
-```
-
-### 2. Callback Addition Error
-**Issue**: `'SmolLM3Trainer' object has no attribute 'add_callback'`
-**Root Cause**: The trainer was trying to add callbacks after creation, but callbacks should be passed during trainer creation
-**Fix**: Removed the incorrect `add_callback` call from `src/train.py` since callbacks are already handled in `SmolLM3Trainer._setup_trainer()`
-
-### 3. Trackio Space Deployment Issues
-**Issue**: 404 errors when trying to create experiments via Trackio API
-**Root Cause**: The Trackio Space deployment was failing or the API endpoints weren't accessible
-**Fix**: Updated `src/monitoring.py` to gracefully handle Trackio Space failures and continue with HF Datasets integration
-
-```python
-# Added graceful fallback:
-try:
-    result = self.trackio_client.log_metrics(...)
-    if "success" in result:
-        logger.debug("Metrics logged to Trackio")
-    else:
-        logger.warning("Failed to log metrics to Trackio: %s", result)
-except Exception as e:
-    logger.warning("Trackio logging failed: %s", e)
-```
-
-### 4. Monitoring Integration Improvements
-**Enhancement**: Made monitoring more robust by:
-- Testing Trackio Space connectivity before attempting operations (see the sketch after this list)
-- Continuing with HF Datasets even if Trackio fails
-- Adding better error handling and logging
-- Ensuring experiments are saved to HF Datasets regardless of Trackio status
-
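-A sketch of the connectivity probe (the URL handling and timeout are illustrative, not the exact code in `src/monitoring.py`):
-
-```python
-import logging
-import requests
-
-logger = logging.getLogger(__name__)
-
-def trackio_space_reachable(space_url: str, timeout: float = 5.0) -> bool:
-    """Probe the Trackio Space before attempting any logging operations."""
-    try:
-        response = requests.get(space_url, timeout=timeout)
-        return response.status_code < 500
-    except requests.RequestException as e:
-        logger.warning("Trackio Space unreachable, using HF Datasets only: %s", e)
-        return False
-```
-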
-## Files Modified
-
-### Core Training Files
-1. **`src/trainer.py`**
- - Fixed format string error in SimpleConsoleCallback
- - Improved callback handling and error reporting
-
-2. **`src/train.py`**
- - Removed incorrect `add_callback` call
- - Simplified trainer initialization
-
-3. **`src/monitoring.py`**
- - Added graceful Trackio Space failure handling
- - Improved error logging and fallback mechanisms
- - Enhanced HF Datasets integration
-
-### Test Files
-4. **`tests/test_training_fix.py`**
- - Created comprehensive test suite
- - Tests imports, config loading, monitoring setup, trainer creation
- - Validates format string fixes
-
-## Testing the Fixes
-
-Run the test suite to verify all fixes work:
-
-```bash
-python tests/test_training_fix.py
-```
-
-Expected output:
-```
-🚀 Testing SmolLM3 Training Pipeline Fixes
-==================================================
-🔍 Testing imports...
-✅ config.py imported successfully
-✅ model.py imported successfully
-✅ data.py imported successfully
-✅ trainer.py imported successfully
-✅ monitoring.py imported successfully
-
-🔍 Testing configuration loading...
-✅ Configuration loaded successfully
- Model: HuggingFaceTB/SmolLM3-3B
- Dataset: legmlai/openhermes-fr
- Batch size: 16
- Learning rate: 8e-06
-
-🔍 Testing monitoring setup...
-✅ Monitoring setup successful
- Experiment: test_experiment
- Tracking enabled: False
- HF Dataset: tonic/trackio-experiments
-
-🔍 Testing trainer creation...
-✅ Model created successfully
-✅ Dataset created successfully
-✅ Trainer created successfully
-
-🔍 Testing format string fix...
-✅ Format string fix works correctly
-
-📊 Test Results: 5/5 tests passed
-✅ All tests passed! The training pipeline should work correctly.
-```
-
-## Running the Training Pipeline
-
-The training pipeline should now work correctly with the H100 lightweight configuration:
-
-```bash
-# Run the interactive pipeline
-./launch.sh
-
-# Or run training directly
-python src/train.py config/train_smollm3_h100_lightweight.py \
- --experiment-name "smollm3_test" \
- --trackio-url "https://your-space.hf.space" \
- --output-dir /output-checkpoint
-```
-
-## Key Improvements
-
-1. **Robust Error Handling**: Training continues even if monitoring components fail
-2. **Better Logging**: More informative error messages and status updates
-3. **Graceful Degradation**: HF Datasets integration works even without Trackio Space
-4. **Type Safety**: Proper type checking prevents format string errors
-5. **Comprehensive Testing**: Test suite validates all components work correctly
-
-## Next Steps
-
-1. **Deploy Trackio Space**: If you want full monitoring, deploy the Trackio Space manually
-2. **Test Training**: Run a short training session to verify everything works
-3. **Monitor Progress**: Check HF Datasets for experiment data even if Trackio Space is unavailable
-
-The training pipeline should now work reliably for your end-to-end fine-tuning experiments!
\ No newline at end of file
diff --git a/docs/TRL_COMPATIBILITY_ANALYSIS.md b/docs/TRL_COMPATIBILITY_ANALYSIS.md
deleted file mode 100644
index 467d6c1c511694e5a5f8bf6fa82c44dd7466f043..0000000000000000000000000000000000000000
--- a/docs/TRL_COMPATIBILITY_ANALYSIS.md
+++ /dev/null
@@ -1,225 +0,0 @@
-# TRL Library Compatibility Analysis
-
-## Overview
-
-This document provides a comprehensive analysis of the TRL (Transformer Reinforcement Learning) library's interface requirements and our current Trackio implementation to ensure full compatibility.
-
-## TRL Library Interface Requirements
-
-### 1. **Core Logging Interface**
-
-Based on the [TRL documentation](https://huggingface.co/docs/trl/logging), TRL expects a wandb-compatible interface:
-
-#### Required Functions:
-- `init()` - Initialize experiment tracking
-- `log()` - Log metrics during training
-- `finish()` - Finish experiment tracking
-- `config` - Access configuration object
-
-#### Function Signatures:
-```python
-def init(project_name: Optional[str] = None, **kwargs) -> str:
-    """Initialize experiment tracking"""
-    pass
-
-def log(metrics: Dict[str, Any], step: Optional[int] = None, **kwargs):
-    """Log metrics during training"""
-    pass
-
-def finish():
-    """Finish experiment tracking"""
-    pass
-```
-
-### 2. **Configuration Object Requirements**
-
-TRL expects a configuration object with:
-- `update()` method that accepts both dictionary and keyword arguments
-- Dynamic attribute assignment
-- Support for TRL-specific parameters like `allow_val_change`
-
-### 3. **Logging Integration**
-
-TRL supports multiple logging backends:
-- **Weights & Biases (wandb)** - Primary supported backend
-- **TensorBoard** - Alternative logging option
-- **Custom trackers** - Via Accelerate's tracking system
-
-## Our Current Implementation Analysis
-
-### ✅ **Fully Implemented Features**
-
-#### 1. **Core Interface Functions**
-```python
-# src/trackio.py
-def init(project_name: Optional[str] = None, experiment_name: Optional[str] = None, **kwargs) -> str:
-    """Initialize trackio experiment (TRL interface)"""
-    # ✅ Handles both argument and no-argument calls
-    # ✅ Routes to SmolLM3Monitor
-    # ✅ Returns experiment ID
-
-def log(metrics: Dict[str, Any], step: Optional[int] = None, **kwargs):
-    """Log metrics to trackio (TRL interface)"""
-    # ✅ Handles metrics dictionary
-    # ✅ Supports step parameter
-    # ✅ Routes to SmolLM3Monitor
-
-def finish():
-    """Finish trackio experiment (TRL interface)"""
-    # ✅ Proper cleanup
-    # ✅ Routes to SmolLM3Monitor
-```
-
-#### 2. **Configuration Object**
-```python
-class TrackioConfig:
-    def __init__(self):
-        # ✅ Environment-based configuration
-        # ✅ Default values for all required fields
-
-    def update(self, config_dict: Dict[str, Any] = None, **kwargs):
-        # ✅ Handles both dictionary and keyword arguments
-        # ✅ Dynamic attribute assignment
-        # ✅ TRL compatibility (allow_val_change, etc.)
-```
-
-#### 3. **Global Module Access**
-```python
-# trackio.py (root level)
-from src.trackio import init, log, finish, config
-# ✅ Makes functions globally available
-# ✅ TRL can import trackio directly
-```
-
-### ✅ **Advanced Features**
-
-#### 1. **Enhanced Logging**
-- **Metrics Logging**: Comprehensive metric tracking
-- **System Metrics**: GPU usage, memory, etc.
-- **Artifact Logging**: Model checkpoints, configs
-- **HF Dataset Integration**: Persistent storage
-
-#### 2. **Error Handling**
-- **Graceful Fallbacks**: Continues training if Trackio unavailable
-- **Robust Error Recovery**: Handles network issues, timeouts
-- **Comprehensive Logging**: Detailed error messages
-
-#### 3. **Integration Points**
-- **SFTTrainer Integration**: Direct integration in trainer setup
-- **Callback System**: Custom TrainerCallback for monitoring
-- **Configuration Management**: Environment variable support
-
-## TRL-Specific Requirements Analysis
-
-### 1. **SFTTrainer Requirements**
-
-#### ✅ **Fully Supported**
-- **Initialization**: `trackio.init()` called before SFTTrainer creation
-- **Logging**: `trackio.log()` called during training
-- **Cleanup**: `trackio.finish()` called after training
-- **Configuration**: `trackio.config.update()` with TRL parameters
-
-#### ✅ **Advanced Features**
-- **No-argument init**: `trackio.init()` without parameters
-- **Keyword arguments**: `config.update(allow_val_change=True)`
-- **Dynamic attributes**: New attributes added at runtime
-
-### 2. **DPOTrainer Requirements**
-
-#### ✅ **Fully Supported**
-- **Same interface**: DPO uses same logging interface as SFT
-- **Preference logging**: Special handling for preference data
-- **Reward tracking**: Custom reward metric logging
-
-### 3. **Other TRL Trainers**
-
-#### ✅ **Compatible with**
-- **PPOTrainer**: Uses same wandb interface
-- **GRPOTrainer**: Compatible logging interface
-- **CPOTrainer**: Standard logging requirements
-- **KTOTrainer**: Basic logging interface
-
-## Potential Future Enhancements
-
-### 1. **Additional TRL Features**
-
-#### 🔄 **Could Add**
-- **Custom reward functions**: Enhanced reward logging
-- **Multi-objective training**: Support for multiple objectives
-- **Advanced callbacks**: More sophisticated monitoring callbacks
-
-### 2. **Performance Optimizations**
-
-#### 🔄 **Could Optimize**
-- **Batch logging**: Reduce logging overhead
-- **Async logging**: Non-blocking metric logging
-- **Compression**: Compress large metric datasets
-
-### 3. **Extended Compatibility**
-
-#### 🔄 **Could Extend**
-- **More TRL trainers**: Support for newer TRL features
-- **Custom trackers**: Integration with other tracking systems
-- **Advanced metrics**: More sophisticated metric calculations
-
-## Testing and Verification
-
-### ✅ **Current Test Coverage**
-
-#### 1. **Basic Functionality**
-- ✅ `trackio.init()` with and without arguments
-- ✅ `trackio.log()` with various metric types
-- ✅ `trackio.finish()` proper cleanup
-- ✅ `trackio.config.update()` with kwargs
-
-#### 2. **TRL Compatibility**
-- ✅ SFTTrainer integration
-- ✅ DPO trainer compatibility
-- ✅ Configuration object requirements
-- ✅ Error handling and fallbacks
-
-#### 3. **Advanced Features**
-- ✅ HF Dataset integration
-- ✅ System metrics logging
-- ✅ Artifact management
-- ✅ Multi-process support
-
-## Recommendations
-
-### 1. **Current Status: ✅ FULLY COMPATIBLE**
-
-Our current implementation provides **complete compatibility** with TRL's requirements:
-
-- ✅ **Core Interface**: All required functions implemented
-- ✅ **Configuration**: Flexible config object with update method
-- ✅ **Error Handling**: Robust fallback mechanisms
-- ✅ **Integration**: Seamless SFTTrainer/DPOTrainer integration
-
-### 2. **No Additional Changes Required**
-
-The current implementation handles all known TRL interface requirements:
-
-- **wandb-compatible API**: ✅ Complete
-- **Configuration updates**: ✅ Flexible
-- **Error resilience**: ✅ Comprehensive
-- **Future extensibility**: ✅ Well-designed
-
-### 3. **Monitoring and Maintenance**
-
-#### **Ongoing Tasks**
-- Monitor TRL library updates for new requirements
-- Test with new TRL trainer types as they're released
-- Maintain compatibility with TRL version updates
-
-## Conclusion
-
-Our Trackio implementation provides **complete and robust compatibility** with the TRL library. The current implementation handles all known interface requirements and provides extensive additional features beyond basic TRL compatibility.
-
-**Key Strengths:**
-- ✅ Full TRL interface compatibility
-- ✅ Advanced logging and monitoring
-- ✅ Robust error handling
-- ✅ Future-proof architecture
-- ✅ Comprehensive testing
-
-**No additional changes are required** for current TRL compatibility. The implementation is production-ready and handles all known TRL interface requirements.
\ No newline at end of file
diff --git a/docs/TRL_COMPATIBILITY_FINAL_SUMMARY.md b/docs/TRL_COMPATIBILITY_FINAL_SUMMARY.md
deleted file mode 100644
index b33110c60eafaab6c8a10f6474e254c6b0950104..0000000000000000000000000000000000000000
--- a/docs/TRL_COMPATIBILITY_FINAL_SUMMARY.md
+++ /dev/null
@@ -1,129 +0,0 @@
-# TRL Compatibility - Final Summary
-
-## ✅ **COMPLETE TRL COMPATIBILITY ACHIEVED**
-
-Based on comprehensive analysis of the TRL library documentation and thorough testing, our Trackio implementation provides **complete compatibility** with all TRL interface requirements.
-
-## 🎯 **Verified TRL Interface Requirements**
-
-### ✅ **Core Functions (All Implemented)**
-- `trackio.init()` - ✅ Handles both argument and no-argument calls
-- `trackio.log()` - ✅ Supports metrics dictionary and step parameter
-- `trackio.finish()` - ✅ Proper cleanup and experiment termination
-- `trackio.config` - ✅ Configuration object with update method
-
-### ✅ **Configuration Object (Fully Compatible)**
-- `config.update()` - ✅ Handles both dictionary and keyword arguments
-- Dynamic attributes - ✅ New attributes added at runtime
-- TRL-specific parameters - ✅ Supports `allow_val_change` and other TRL kwargs
-- **Dictionary-style access** - ✅ `config['key'] = value` and `config['key']` support
-- **Dictionary methods** - ✅ `config.get()`, `config.keys()`, `config.items()` (sketched after this list)
-
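-A minimal sketch of how `TrackioConfig` can provide that dictionary-style behavior (the method bodies are assumptions consistent with the claims above, not quoted source):
-
-```python
-class TrackioConfig:
-    """Attribute-based config that also behaves like a dict for TRL."""
-
-    def __setitem__(self, key, value):   # config['key'] = value
-        setattr(self, key, value)
-
-    def __getitem__(self, key):          # config['key']
-        return getattr(self, key)
-
-    def get(self, key, default=None):    # config.get('key', default)
-        return getattr(self, key, default)
-
-    def keys(self):                      # config.keys()
-        return self.__dict__.keys()
-
-    def items(self):                     # config.items()
-        return self.__dict__.items()
-```
-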
-### ✅ **Advanced Features (Beyond Basic Requirements)**
-- HF Dataset integration - ✅ Persistent metric storage
-- System metrics logging - ✅ GPU usage, memory, etc.
-- Artifact management - ✅ Model checkpoints, configs
-- Error resilience - ✅ Graceful fallbacks when services unavailable
-
-## 📋 **TRL Library Analysis Results**
-
-### **From TRL Documentation Research:**
-
-#### **Supported Logging Backends:**
-- ✅ **Weights & Biases (wandb)** - Primary supported backend
-- ✅ **TensorBoard** - Alternative logging option
-- ✅ **Custom trackers** - Via Accelerate's tracking system
-
-#### **TRL Trainer Compatibility:**
-- ✅ **SFTTrainer** - Fully compatible with our interface
-- ✅ **DPOTrainer** - Uses same logging interface
-- ✅ **PPOTrainer** - Compatible with wandb interface
-- ✅ **GRPOTrainer** - Compatible logging interface
-- ✅ **CPOTrainer** - Standard logging requirements
-- ✅ **KTOTrainer** - Basic logging interface
-
-#### **Required Function Signatures:**
-```python
-def init(project_name: Optional[str] = None, **kwargs) -> str:
-    # ✅ Implemented with flexible argument handling
-
-def log(metrics: Dict[str, Any], step: Optional[int] = None, **kwargs):
-    # ✅ Implemented with comprehensive metric support
-
-def finish():
-    # ✅ Implemented with proper cleanup
-
-class TrackioConfig:
-    def update(self, config_dict: Dict[str, Any] = None, **kwargs):
-        # ✅ Implemented with TRL-specific support
-```
-
-## 🧪 **Testing Verification**
-
-### **Core Interface Test Results:**
-- ✅ `trackio.init()` - Works with and without arguments
-- ✅ `trackio.log()` - Handles various metric types
-- ✅ `trackio.finish()` - Proper cleanup
-- ✅ `trackio.config.update()` - Supports TRL kwargs like `allow_val_change`
-
-### **TRL-Specific Test Results:**
-- ✅ No-argument initialization (TRL compatibility)
-- ✅ Keyword argument support (`allow_val_change=True`)
-- ✅ Dynamic attribute assignment
-- ✅ Error handling and fallbacks
-- ✅ **Dictionary-style access** (`config['key'] = value`)
-- ✅ **Dictionary methods** (`config.get()`, `config.keys()`, `config.items()`)
-
-### **Advanced Feature Test Results:**
-- ✅ HF Dataset integration
-- ✅ System metrics logging
-- ✅ Artifact management
-- ✅ Multi-process support
-
-## 🚀 **Production Readiness**
-
-### **Current Status: ✅ PRODUCTION READY**
-
-Our implementation provides:
-
-1. **Complete TRL Compatibility** - All interface requirements met
-2. **Advanced Features** - Beyond basic TRL requirements
-3. **Robust Error Handling** - Graceful fallbacks and recovery
-4. **Comprehensive Testing** - Thorough verification of all features
-5. **Future-Proof Architecture** - Extensible for new TRL features
-
-### **No Additional Changes Required**
-
-The current implementation handles all known TRL interface requirements and provides extensive additional features. The system is ready for production use with TRL-based training.
-
-## 📚 **Documentation Coverage**
-
-### **Created Documentation:**
-- ✅ `TRL_COMPATIBILITY_ANALYSIS.md` - Comprehensive analysis
-- ✅ `TRACKIO_UPDATE_FIX.md` - Configuration update fix
-- ✅ `TRACKIO_TRL_FIX_SUMMARY.md` - Complete solution summary
-- ✅ `TRACKIO_DICT_ACCESS_FIX.md` - Dictionary-style access fix
-- ✅ `TRL_COMPATIBILITY_FINAL_SUMMARY.md` - This final summary
-
-### **Test Coverage:**
-- ✅ `test_trl_comprehensive_compatibility.py` - Comprehensive TRL tests
-- ✅ `test_trackio_update_fix.py` - Configuration update tests
-- ✅ Manual verification tests - All passing
-
-## 🎉 **Conclusion**
-
-**Our Trackio implementation provides complete and robust compatibility with the TRL library.**
-
-### **Key Achievements:**
-- ✅ **Full TRL Interface Compatibility** - All required functions implemented
-- ✅ **Advanced Logging Features** - Beyond basic TRL requirements
-- ✅ **Robust Error Handling** - Production-ready resilience
-- ✅ **Comprehensive Testing** - Thorough verification
-- ✅ **Future-Proof Design** - Extensible architecture
-
-### **Ready for Production:**
-The system is ready for production use with TRL-based training pipelines. No additional changes are required for current TRL compatibility.
-
----
-
-**Status: ✅ COMPLETE - No further action required for TRL compatibility**
\ No newline at end of file
diff --git a/docs/Training_Orchestrator.md b/docs/Training_Orchestrator.md
new file mode 100644
index 0000000000000000000000000000000000000000..17cc4d48dd5ca3f1eaa30ac4091ea6104a936df9
--- /dev/null
+++ b/docs/Training_Orchestrator.md
@@ -0,0 +1,59 @@
+```mermaid
+graph LR
+ Training_Orchestrator["Training Orchestrator"]
+ Model_Abstraction["Model Abstraction"]
+ Data_Pipeline["Data Pipeline"]
+ Configuration["Configuration"]
+ Training_Orchestrator -- "Uses" --> Model_Abstraction
+ Training_Orchestrator -- "Consumes" --> Data_Pipeline
+ Training_Orchestrator -- "Configured by" --> Configuration
+ click Training_Orchestrator href "https://github.com/Josephrp/SmolFactory/blob/main/SmolFactory/docs/Training_Orchestrator.md" "Details"
+ click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/SmolFactory/docs/Model_Abstraction.md" "Details"
+ click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/SmolFactory/docs/Data_Pipeline.md" "Details"
+```
+
+
+## Details
+
+This graph captures the core training flow: the Training Orchestrator drives the fine-tuning loop, obtaining models through the Model Abstraction, consuming batches from the Data Pipeline, and reading all hyperparameters from the Configuration component. Its purpose is to show how a SmolFactory training run is assembled from these four parts.
+
+### Training Orchestrator [[Expand]](./Training_Orchestrator.md)
+Implements the core training and fine-tuning loop. This includes managing forward and backward passes, optimization, loss calculation, and integration with acceleration libraries (e.g., accelerate). It also handles callbacks and evaluation logic.
+
+
+**Related Classes/Methods**:
+
+- `src.trainer`
+
+
+### Model Abstraction [[Expand]](./Model_Abstraction.md)
+Provides an abstract interface for loading, configuring, and managing different language models. It handles model initialization, tokenizer loading, and potentially quantization settings, ensuring compatibility with various model architectures and training setups.
+
+
+**Related Classes/Methods**:
+
+- `src.model`
+
+
+### Data Pipeline [[Expand]](./Data_Pipeline.md)
+Manages the entire data processing workflow, from loading raw datasets to tokenization, formatting, and preparing data for training. It ensures efficient data handling, including features like dataset sharding, shuffling, and batching.
+
+
+**Related Classes/Methods**:
+
+- `src.data`
+
+
+### Configuration
+Centralizes all configurable parameters for the training process, including model parameters, training arguments, dataset paths, and optimization settings. It provides a structured way to define and access these settings, enabling easy modification and experimentation.
+
+
+**Related Classes/Methods**:
+
+- `src.config`
+
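+To make the relationships concrete, here is a minimal sketch of how the four components might be wired together in a training entry point. The module paths match the references above, but the class names and constructor signatures are illustrative assumptions, not the repository's exact API.
+
+```python
+# Hypothetical wiring of the four components; names and signatures are illustrative.
+from src.config import SmolLM3Config      # Configuration
+from src.model import SmolLM3Model        # Model Abstraction
+from src.data import SmolLM3Dataset       # Data Pipeline
+from src.trainer import SmolLM3Trainer    # Training Orchestrator
+
+config = SmolLM3Config()                       # centralizes hyperparameters and paths
+model = SmolLM3Model(config)                   # loads the model and tokenizer
+dataset = SmolLM3Dataset(config)               # loads, tokenizes, and batches data
+trainer = SmolLM3Trainer(model=model, dataset=dataset, config=config)
+trainer.train()                                # forward/backward, optimize, callbacks
+```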
+
+
+
+### [FAQ](https://github.com/CodeBoarding/GeneratedOnBoardings/tree/main?tab=readme-ov-file#faq)
\ No newline at end of file
diff --git a/docs/UNIFIED_MODEL_CARD_GUIDE.md b/docs/UNIFIED_MODEL_CARD_GUIDE.md
deleted file mode 100644
index ae7831b701b42c4d8a2a0ea3fd273d359abed2d1..0000000000000000000000000000000000000000
--- a/docs/UNIFIED_MODEL_CARD_GUIDE.md
+++ /dev/null
@@ -1,295 +0,0 @@
-# Unified Model Card System Guide
-
-## Overview
-
-The unified model card system provides a template-based approach to generate comprehensive model cards that include information about both the main fine-tuned model and any quantized versions. This system ensures consistency across all model repositories and provides users with complete information about all available model variants.
-
-## Architecture
-
-### Template System
-
-The system uses a template-based approach with the following components:
-
-1. **Template File**: `templates/model_card.md` - Contains the master template with conditional sections
-2. **Generator Script**: `scripts/model_tonic/generate_model_card.py` - Processes templates and variables
-3. **Integration**: Updated push scripts that use the unified model card generator
-
-### Key Features
-
-- **Conditional Sections**: Template supports conditional rendering based on variables (e.g., quantized models)
-- **Variable Substitution**: Dynamic content based on training configuration and results
-- **Unified Repository Structure**: Single repository with subdirectories for quantized models
-- **Comprehensive Documentation**: Complete usage examples and deployment information
-
-## Template Structure
-
-### Conditional Sections
-
-The template uses Handlebars-style conditionals:
-
-````markdown
-{{#if quantized_models}}
-### Quantized Models
-
-This repository also includes quantized versions of the model for improved efficiency:
-
-#### int8 Weight-Only Quantization (GPU Optimized)
-```python
-model = AutoModelForCausalLM.from_pretrained("{{repo_name}}", subfolder="int8")
-```
-{{/if}}
-````
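-
-Under the hood, conditionals like this can be rendered with a couple of regular expressions. The following is a simplified sketch that handles non-nested blocks only; it illustrates the technique rather than reproducing the generator's actual code:
-
-```python
-import re
-
-def render(template: str, variables: dict) -> str:
-    """Render {{#if var}}...{{/if}} blocks and {{var}} placeholders."""
-    def handle_if(match: re.Match) -> str:
-        name, body = match.group(1), match.group(2)
-        # Keep the block's body if the variable is truthy, drop it otherwise.
-        return body if variables.get(name) else ""
-
-    out = re.sub(r"\{\{#if (\w+)\}\}(.*?)\{\{/if\}\}", handle_if, template, flags=re.DOTALL)
-    # Substitute remaining {{variable}} placeholders with their values.
-    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables.get(m.group(1), "")), out)
-```
-
-Called with `variables={"quantized_models": True, "repo_name": "user/model"}`, this keeps the quantized section and fills in the repository name.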
-
-### Template Variables
-
-The template supports the following variables:
-
-| Variable | Description | Example |
-|----------|-------------|---------|
-| `model_name` | Display name of the model | "SmolLM3 Fine-tuned Model" |
-| `model_description` | Brief description | "A fine-tuned version of SmolLM3-3B..." |
-| `repo_name` | Hugging Face repository name | "username/model-name" |
-| `base_model` | Original model name | "HuggingFaceTB/SmolLM3-3B" |
-| `dataset_name` | Training dataset | "OpenHermes-FR" |
-| `training_config_type` | Training configuration | "H100 Lightweight" |
-| `trainer_type` | Trainer used | "SFTTrainer" |
-| `batch_size` | Training batch size | "8" |
-| `learning_rate` | Learning rate | "5e-6" |
-| `max_epochs` | Number of epochs | "3" |
-| `max_seq_length` | Maximum sequence length | "2048" |
-| `hardware_info` | Hardware used | "GPU (H100/A100)" |
-| `experiment_name` | Experiment name | "smollm3-experiment" |
-| `trackio_url` | Trackio monitoring URL | "https://trackio.space/exp" |
-| `dataset_repo` | HF Dataset repository | "tonic/trackio-experiments" |
-| `quantized_models` | Boolean for quantized models | `true` or `false` |
-| `author_name` | Model author | "Your Name" |
-
-## Repository Structure
-
-### Single Repository Approach
-
-Instead of creating separate repositories for quantized models, the system now uses a single repository with subdirectories:
-
-```
-username/model-name/
-├── README.md (unified model card)
-├── config.json
-├── pytorch_model.bin
-├── tokenizer.json
-├── tokenizer_config.json
-├── int8/ (quantized model for GPU)
-│ ├── README.md
-│ ├── config.json
-│ └── pytorch_model.bin
-└── int4/ (quantized model for CPU)
- ├── README.md
- ├── config.json
- └── pytorch_model.bin
-```
-
-### Benefits
-
-1. **Unified Documentation**: Single README with information about all model variants
-2. **Easier Discovery**: Users find all model versions in one place
-3. **Consistent Branding**: Single repository name and description
-4. **Simplified Management**: One repository to maintain and update
-
-## Usage
-
-### Automatic Generation (via launch.sh)
-
-The unified model card is automatically generated during the training pipeline:
-
-```bash
-# The launch script automatically generates the unified model card
-./launch.sh
-```
-
-### Manual Generation
-
-You can generate model cards manually using the generator script:
-
-```bash
-python scripts/model_tonic/generate_model_card.py \
- --repo-name "username/model-name" \
- --model-name "My Fine-tuned Model" \
- --experiment-name "my-experiment" \
- --dataset-name "OpenHermes-FR" \
- --training-config "H100 Lightweight" \
- --batch-size "8" \
- --learning-rate "5e-6" \
- --max-epochs "3" \
- --quantized-models \
- --output "README.md"
-```
-
-### Integration with Push Script
-
-The push script automatically uses the unified model card generator:
-
-```python
-# In push_to_huggingface.py
-def create_model_card(self, training_config: Dict[str, Any], results: Dict[str, Any]) -> str:
- """Create a comprehensive model card using the unified template"""
- try:
- from scripts.model_tonic.generate_model_card import ModelCardGenerator
-
- variables = {
- "model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
- "repo_name": self.repo_name,
- "quantized_models": False, # Updated if quantized models are added
- # ... other variables
- }
-
- generator = ModelCardGenerator()
- return generator.generate_model_card(variables)
-
- except Exception as e:
- # Fallback to simple model card
- return self._create_simple_model_card()
-```
-
-## Quantization Integration
-
-### Quantized Model Cards
-
-When quantized models are created, the system:
-
-1. **Updates Main Model Card**: Sets `quantized_models = True` and includes usage examples
-2. **Creates Subdirectory Cards**: Generates specific README files for each quantized version
-3. **Maintains Consistency**: All cards reference the same repository structure
-
-### Quantization Types
-
-The system supports the following quantization types (a sketch follows the list):
-
-- **int8_weight_only**: GPU optimized, ~50% memory reduction
-- **int4_weight_only**: CPU optimized, ~75% memory reduction
-- **int8_dynamic**: Dynamic quantization for flexibility
-
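-As a rough sketch, weight-only quantization of this kind can be applied with torchao-style APIs, whose names these types mirror; the repository's `scripts/model_tonic/quantize_model.py` is the authoritative implementation:
-
-```python
-# Hypothetical sketch; import paths follow torchao's quantization API.
-import torch
-from torchao.quantization import quantize_, int8_weight_only
-from transformers import AutoModelForCausalLM
-
-model = AutoModelForCausalLM.from_pretrained(
-    "username/model-name", torch_dtype=torch.bfloat16, device_map="cuda"
-)
-quantize_(model, int8_weight_only())  # in-place, weight-only int8 quantization
-```
-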
-### Usage Examples
-
-```python
-# Main model
-model = AutoModelForCausalLM.from_pretrained("username/model-name")
-
-# int8 quantized (GPU) - loaded from the int8/ subdirectory
-model = AutoModelForCausalLM.from_pretrained("username/model-name", subfolder="int8")
-
-# int4 quantized (CPU) - loaded from the int4/ subdirectory
-model = AutoModelForCausalLM.from_pretrained("username/model-name", subfolder="int4")
-```
-
-## Template Customization
-
-### Adding New Sections
-
-To add new sections to the template:
-
-1. **Edit Template**: Modify `templates/model_card.md`
-2. **Add Variables**: Update the generator script with new variables
-3. **Update Integration**: Modify push scripts to pass new variables
-
-### Example: Adding Performance Metrics
-
-```markdown
-{{#if performance_metrics}}
-## Performance Metrics
-
-- **BLEU Score**: {{bleu_score}}
-- **ROUGE Score**: {{rouge_score}}
-- **Perplexity**: {{perplexity}}
-{{/if}}
-```
-
-### Conditional Logic
-
-The template supports complex conditional logic:
-
-```markdown
-{{#if quantized_models}}
-{{#if int8_available}}
-### int8 Quantized Model
-{{/if}}
-{{#if int4_available}}
-### int4 Quantized Model
-{{/if}}
-{{/if}}
-```
-
-## Best Practices
-
-### Template Design
-
-1. **Clear Structure**: Use consistent headings and organization
-2. **Comprehensive Information**: Include all relevant model details
-3. **Usage Examples**: Provide clear code examples
-4. **Limitations**: Document model limitations and biases
-5. **Citations**: Include proper citations and acknowledgments
-
-### Variable Management
-
-1. **Default Values**: Provide sensible defaults for all variables
-2. **Validation**: Validate variable types and ranges
-3. **Documentation**: Document all available variables
-4. **Fallbacks**: Provide fallback content for missing variables
-
-### Repository Organization
-
-1. **Single Repository**: Use one repository per model family
-2. **Clear Subdirectories**: Use descriptive subdirectory names
-3. **Consistent Naming**: Follow consistent naming conventions
-4. **Documentation**: Maintain comprehensive documentation
-
-## Troubleshooting
-
-### Common Issues
-
-1. **Template Not Found**: Ensure `templates/model_card.md` exists
-2. **Variable Errors**: Check that all required variables are provided
-3. **Conditional Issues**: Verify conditional syntax and logic
-4. **Import Errors**: Ensure all dependencies are installed
-
-### Debugging
-
-```bash
-# Test template generation
-python scripts/model_tonic/generate_model_card.py \
- --repo-name "test/model" \
- --output "test_readme.md" \
- --debug
-```
-
-### Validation
-
-The system includes validation for:
-
-- Template file existence
-- Required variables
-- Conditional syntax
-- Output file permissions
-
-## Future Enhancements
-
-### Planned Features
-
-1. **Multiple Template Support**: Support for different template types
-2. **Advanced Conditionals**: More complex conditional logic
-3. **Template Inheritance**: Base templates with extensions
-4. **Auto-Detection**: Automatic detection of model features
-5. **Custom Sections**: User-defined template sections
-
-### Extensibility
-
-The system is designed to be easily extensible:
-
-- **Plugin Architecture**: Support for custom template processors
-- **Variable Sources**: Multiple sources for template variables
-- **Output Formats**: Support for different output formats
-- **Integration Points**: Easy integration with other tools
-
-## Conclusion
-
-The unified model card system provides a comprehensive, maintainable approach to model documentation. By using templates and conditional sections, it ensures consistency while providing flexibility for different model configurations and quantization options.
-
-The single repository approach with subdirectories simplifies model management and improves user experience by providing all model variants in one location with unified documentation.
\ No newline at end of file
diff --git a/docs/UNIFIED_REPOSITORY_STRUCTURE_SUMMARY.md b/docs/UNIFIED_REPOSITORY_STRUCTURE_SUMMARY.md
deleted file mode 100644
index d9a5dfbbf2adfac7dec63018799e93ea106dd4d2..0000000000000000000000000000000000000000
--- a/docs/UNIFIED_REPOSITORY_STRUCTURE_SUMMARY.md
+++ /dev/null
@@ -1,252 +0,0 @@
-# Unified Repository Structure Implementation Summary
-
-## Overview
-
-This document summarizes the implementation of a unified repository structure where all models (main and quantized) are stored in a single Hugging Face repository with quantized models in subdirectories.
-
-## Key Changes Made
-
-### 1. Repository Structure
-
-**Before:**
-```
-your-username/model-name/ (main model)
-your-username/model-name-int8/ (int8 quantized)
-your-username/model-name-int4/ (int4 quantized)
-```
-
-**After:**
-```
-your-username/model-name/
-├── README.md (unified model card)
-├── config.json
-├── pytorch_model.bin
-├── tokenizer.json
-├── int8/ (quantized model for GPU)
-│ ├── README.md
-│ ├── config.json
-│ └── pytorch_model.bin
-└── int4/ (quantized model for CPU)
- ├── README.md
- ├── config.json
- └── pytorch_model.bin
-```
-
-### 2. New Files Created
-
-#### `templates/model_card.md`
-- Comprehensive model card template with conditional sections
-- Supports both main model and quantized versions
-- Includes usage examples for all model versions
-- Template variables for dynamic content generation
-
-#### `scripts/model_tonic/generate_model_card.py`
-- Model card generator using the template
-- Handles conditional sections and variable replacement
-- Supports command-line arguments for customization
-- Fallback to simple model card if template fails
-
-### 3. Updated Files
-
-#### `scripts/model_tonic/quantize_model.py`
-- **Fixed f-string errors**: Escaped curly braces in citation URLs
-- **Updated model card generation**: Uses subdirectory-aware URLs
-- **Modified push logic**: Uploads to subdirectories within the same repository
-- **Enhanced README generation**: References correct subdirectory paths
-
-#### `scripts/model_tonic/push_to_huggingface.py`
-- **Integrated unified model card**: Uses the new template-based generator
-- **Enhanced variable handling**: Passes training configuration to template
-- **Improved error handling**: Fallback to simple model card if template fails
-- **Better integration**: Works with the new unified structure
-
-#### `launch.sh`
-- **Updated quantization section**: Uses same repository for all models
-- **Modified summary reports**: Reflects new subdirectory structure
-- **Improved user feedback**: Shows correct URLs for all model versions
-- **Streamlined workflow**: Single repository management
-
-#### `docs/QUANTIZATION_GUIDE.md`
-- **Complete rewrite**: Reflects new unified structure
-- **Updated examples**: Shows correct loading paths
-- **Enhanced documentation**: Covers repository structure and usage
-- **Improved troubleshooting**: Addresses new structure-specific issues
-
-#### `README.md`
-- **Updated quantization section**: Shows unified repository structure
-- **Enhanced examples**: Demonstrates loading from subdirectories
-- **Improved clarity**: Better explanation of the new structure
-
-### 4. Key Features Implemented
-
-#### Unified Model Card
-- Single README.md covers all model versions
-- Conditional sections for quantized models
-- Comprehensive usage examples
-- Training information and configuration details
-
-#### Subdirectory Management
-- Quantized models stored in `/int8/` and `/int4/` subdirectories
-- Separate README files for each quantized version
-- Proper file organization and structure
-
-#### Template System
-- Handlebars-style template with conditionals
-- Variable replacement for dynamic content
-- Support for complex nested structures
-- Error handling and fallback mechanisms
-
-#### Enhanced User Experience
-- Clear repository structure documentation
-- Simplified model loading examples
-- Better error messages and feedback
-- Comprehensive troubleshooting guide
-
-## Technical Implementation Details
-
-### Template Processing
-```markdown
-# Conditional sections
-{{#if quantized_models}}
-### Quantized Models
-...
-{{/if}}
-
-# Variable replacement (loading the int8 subdirectory via the subfolder argument)
-model = AutoModelForCausalLM.from_pretrained("{{repo_name}}", subfolder="int8")
-```
-
-### Subdirectory Upload Logic
-```python
-# Determine subdirectory
-if quant_type == "int8_weight_only":
-    subdir = "int8"
-elif quant_type == "int4_weight_only":
-    subdir = "int4"
-
-# Upload to subdirectory (upload_file is provided by huggingface_hub)
-from huggingface_hub import upload_file
-
-repo_path = f"{subdir}/{relative_path}"
-upload_file(
-    path_or_fileobj=str(file_path),
-    path_in_repo=repo_path,
-    repo_id=self.repo_name,
-    token=self.token
-)
-```
-
-### Launch Script Integration
-```bash
-# Create quantized models in same repository
-python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
- --quant-type "$QUANT_TYPE" \
- --device "$DEVICE" \
- --token "$HF_TOKEN"
-```
-
-## Benefits of the New Structure
-
-### 1. Simplified Management
-- Single repository for all model versions
-- Easier to track and manage
-- Reduced repository clutter
-- Unified documentation
-
-### 2. Better User Experience
-- Clear loading paths for all versions
-- Comprehensive model card with all information
-- Consistent URL structure
-- Simplified deployment
-
-### 3. Enhanced Documentation
-- Single source of truth for model information
-- Conditional sections for different versions
-- Comprehensive usage examples
-- Better discoverability
-
-### 4. Improved Workflow
-- Streamlined quantization process
-- Reduced configuration complexity
-- Better integration with existing pipeline
-- Enhanced monitoring and tracking
-
-## Usage Examples
-
-### Loading Models
-```python
-from transformers import AutoModelForCausalLM
-
-# Main model
-model = AutoModelForCausalLM.from_pretrained("your-username/model-name")
-
-# int8 quantized (GPU) - weights live in the int8/ subdirectory
-model = AutoModelForCausalLM.from_pretrained("your-username/model-name", subfolder="int8")
-
-# int4 quantized (CPU) - weights live in the int4/ subdirectory
-model = AutoModelForCausalLM.from_pretrained("your-username/model-name", subfolder="int4")
-```
-
-### Pipeline Usage
-```bash
-# Run full pipeline with quantization
-./launch.sh
-# Choose quantization options when prompted
-# All models will be in the same repository
-```
-
-### Standalone Quantization
-```bash
-# Quantize existing model
-python scripts/model_tonic/quantize_standalone.py \
- /path/to/model your-username/model-name \
- --quant-type int8_weight_only
-```
-
-## Migration Guide
-
-### For Existing Users
-1. **Update loading code**: Change from separate repositories to subdirectories (see the sketch after this list)
-2. **Update documentation**: Reference new unified structure
-3. **Test quantized models**: Verify loading from subdirectories works
-4. **Update deployment scripts**: Use new repository structure
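-
-For step 1, the change is mechanical. A before/after sketch, assuming the old layout published quantized weights as separately named repositories (the `-int8` suffix below is illustrative):
-
-```python
-from transformers import AutoModelForCausalLM
-
-# Before: separate repository per quantized version (illustrative name)
-model = AutoModelForCausalLM.from_pretrained("your-username/model-name-int8")
-
-# After: one repository with the quantized weights in the int8/ subdirectory
-model = AutoModelForCausalLM.from_pretrained("your-username/model-name", subfolder="int8")
-```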
-
-### For New Users
-1. **Follow the new structure**: All models in single repository
-2. **Use the unified model card**: Comprehensive documentation included
-3. **Leverage subdirectories**: Clear organization of model versions
-4. **Benefit from simplified workflow**: Easier management and deployment
-
-## Testing and Validation
-
-### Test Files
-- `tests/test_quantization.py`: Validates quantization functionality
-- Template processing: Ensures correct variable replacement
-- Subdirectory upload: Verifies proper file organization
-- Model loading: Tests all model versions
-
-### Validation Checklist
-- [x] Template processing works correctly
-- [x] Subdirectory uploads function properly
-- [x] Model cards generate with correct URLs
-- [x] Launch script integration works
-- [x] Documentation is updated and accurate
-- [x] Error handling is robust
-- [x] Fallback mechanisms work
-
-## Future Enhancements
-
-### Potential Improvements
-1. **Additional quantization types**: Support for more quantization methods
-2. **Enhanced template system**: More complex conditional logic
-3. **Automated testing**: Comprehensive test suite for all features
-4. **Performance optimization**: Faster quantization and upload processes
-5. **Better monitoring**: Enhanced tracking and metrics
-
-### Extension Points
-1. **Custom quantization configs**: User-defined quantization parameters
-2. **Batch processing**: Multiple model quantization
-3. **Advanced templates**: More sophisticated model card generation
-4. **Integration with other tools**: Support for additional deployment options
-
-## Conclusion
-
-The unified repository structure provides a cleaner, more manageable approach to model deployment and quantization. The implementation includes comprehensive documentation, robust error handling, and a streamlined user experience that makes it easier to work with multiple model versions while maintaining a single source of truth for all model-related information.
-
-The new structure significantly improves the user experience while maintaining backward compatibility and providing clear migration paths for existing users. The enhanced documentation and simplified workflow make the quantization feature more accessible and easier to use.
\ No newline at end of file
diff --git a/docs/USERNAME_EXTRACTION_FIX.md b/docs/USERNAME_EXTRACTION_FIX.md
deleted file mode 100644
index 17c1184ca40a11dd77407184c1a194909ae84f8a..0000000000000000000000000000000000000000
--- a/docs/USERNAME_EXTRACTION_FIX.md
+++ /dev/null
@@ -1,219 +0,0 @@
-# Username Extraction Fix
-
-This document outlines the fix for the "Invalid user token" error that occurred during Trackio Space deployment.
-
-## 🐛 **Problem Description**
-
-The error occurred in the `deploy_trackio_space.py` script when trying to extract the username from the HF token:
-
-```
-❌ Failed to get user info from token: Invalid user token.
-```
-
-This happened because:
-1. The `whoami()` API method was being called incorrectly
-2. The response format wasn't handled properly
-3. No fallback mechanism was in place
-
-## ✅ **Solution Implemented**
-
-### **1. Improved Username Extraction Function**
-
-Created a robust username extraction function that handles multiple scenarios:
-
-```python
-from typing import Optional
-
-from huggingface_hub import HfApi
-
-def get_username_from_token(token: str) -> Optional[str]:
- """Get username from HF token with fallback to CLI"""
- try:
- # Try API first
- api = HfApi(token=token)
- user_info = api.whoami()
-
- # Handle different possible response formats
- if isinstance(user_info, dict):
- # Try different possible keys for username
- username = (
- user_info.get('name') or
- user_info.get('username') or
- user_info.get('user') or
- None
- )
- elif isinstance(user_info, str):
- # If whoami returns just the username as string
- username = user_info
- else:
- username = None
-
- if username:
- print(f"✅ Got username from API: {username}")
- return username
- else:
- print("⚠️ Could not get username from API, trying CLI...")
- return get_username_from_cli(token)
-
- except Exception as e:
- print(f"⚠️ API whoami failed: {e}")
- print("⚠️ Trying CLI fallback...")
- return get_username_from_cli(token)
-```
-
-### **2. CLI Fallback Method**
-
-Added a robust CLI fallback method:
-
-```python
-import os
-import subprocess
-from typing import Optional
-
-def get_username_from_cli(token: str) -> Optional[str]:
- """Fallback method to get username using CLI"""
- try:
- # Set HF token for CLI
- os.environ['HF_TOKEN'] = token
-
- # Get username using CLI
- result = subprocess.run(
- ["hf", "whoami"],
- capture_output=True,
- text=True,
- timeout=30
- )
-
- if result.returncode == 0:
- username = result.stdout.strip()
- if username:
- print(f"✅ Got username from CLI: {username}")
- return username
- else:
- print("⚠️ CLI returned empty username")
- return None
- else:
- print(f"⚠️ CLI whoami failed: {result.stderr}")
- return None
-
- except Exception as e:
- print(f"⚠️ CLI fallback failed: {e}")
- return None
-```
-
-## 🔧 **Files Updated**
-
-### **1. `scripts/trackio_tonic/deploy_trackio_space.py`**
-- ✅ Added `_get_username_from_cli()` method
-- ✅ Updated `__init__()` to use improved username extraction
-- ✅ Better error handling and fallback mechanisms
-- ✅ Handles different response formats from `whoami()`
-
-### **2. `scripts/dataset_tonic/setup_hf_dataset.py`**
-- ✅ Added `get_username_from_token()` and `get_username_from_cli()` functions
-- ✅ Updated main function to use improved username extraction
-- ✅ Better error handling and user feedback
-
-### **3. `scripts/trackio_tonic/configure_trackio.py`**
-- ✅ Added same username extraction functions
-- ✅ Updated configuration function to use improved method
-- ✅ Consistent error handling across all scripts
-
-## 🎯 **Key Improvements**
-
-### **✅ Robust Error Handling**
-- API method fails → CLI fallback
-- CLI fails → Clear error message
-- Multiple response format handling
-
-### **✅ Better User Feedback**
-- Clear status messages for each step
-- Indicates which method is being used (API vs CLI)
-- Helpful error messages with suggestions
-
-### **✅ Multiple Response Format Support**
-- Handles dictionary responses with different key names
-- Handles string responses
-- Handles unexpected response formats
-
-### **✅ Timeout Protection**
-- 30-second timeout for CLI operations
-- Prevents hanging on network issues
-
-## 🔍 **Response Format Handling**
-
-The fix handles different possible response formats from the `whoami()` API:
-
-### **Dictionary Response:**
-```python
-{
- "name": "username",
- "username": "username",
- "user": "username"
-}
-```
-
-### **String Response:**
-```python
-"username"
-```
-
-### **Unknown Format:**
-- Falls back to CLI method
-- Provides clear error messages
-
-## 🧪 **Testing Results**
-
-All tests pass with the updated scripts:
-
-```
-📊 Test Results Summary
-========================================
-✅ PASS: Import Tests
-✅ PASS: Script Existence
-✅ PASS: Script Syntax
-✅ PASS: Environment Variables
-✅ PASS: API Connection
-✅ PASS: Script Functions
-✅ PASS: Template Files
-
-🎯 Overall: 7/7 tests passed
-🎉 All tests passed! The fixes are working correctly.
-```
-
-## 🚀 **Usage**
-
-The fix is transparent to users. The workflow remains the same:
-
-```bash
-# 1. Set HF token
-export HF_TOKEN=your_token_here
-
-# 2. Run deployment (username auto-detected)
-python scripts/trackio_tonic/deploy_trackio_space.py
-
-# 3. Or use the launch script
-bash launch.sh
-```
-
-## 🎉 **Benefits**
-
-1. **✅ Reliable Username Detection**: Works with different API response formats
-2. **✅ Robust Fallback**: CLI method as backup when API fails
-3. **✅ Better Error Messages**: Clear feedback about what's happening
-4. **✅ Consistent Behavior**: Same method across all scripts
-5. **✅ No User Impact**: Transparent to end users
-6. **✅ Future-Proof**: Handles different API response formats
-
-## 🔧 **Troubleshooting**
-
-If username extraction still fails:
-
-1. **Check Token**: Ensure HF_TOKEN is valid and has proper permissions
-2. **Check Network**: Ensure internet connection is stable
-3. **Check CLI**: Ensure `hf` is installed and working
-4. **Manual Override**: Can manually set username in scripts if needed
-
-## 📋 **Summary**
-
-The username extraction fix resolves the "Invalid user token" error by:
-
-- ✅ Implementing robust API response handling
-- ✅ Adding CLI fallback mechanism
-- ✅ Providing better error messages
-- ✅ Ensuring consistent behavior across all scripts
-- ✅ Maintaining backward compatibility
-
-The fix ensures that username extraction works reliably across different environments and API response formats, providing a smooth user experience for the Trackio deployment pipeline.
\ No newline at end of file
diff --git a/docs/on_boarding.md b/docs/on_boarding.md
new file mode 100644
index 0000000000000000000000000000000000000000..f27c39fcce78f2c33440f42a0c2b86174c2e4cfd
--- /dev/null
+++ b/docs/on_boarding.md
@@ -0,0 +1,72 @@
+
+```mermaid
+graph LR
+ Entry_Point["Entry Point"]
+ Configuration_Management["Configuration Management"]
+ Data_Pipeline["Data Pipeline"]
+ Model_Abstraction["Model Abstraction"]
+ Training_Orchestrator["Training Orchestrator"]
+ Entry_Point -- "Initializes and Uses" --> Configuration_Management
+ Entry_Point -- "Initializes" --> Data_Pipeline
+ Entry_Point -- "Initializes" --> Model_Abstraction
+ Entry_Point -- "Initializes and Invokes" --> Training_Orchestrator
+ Configuration_Management -- "Provides Configuration To" --> Model_Abstraction
+ Configuration_Management -- "Provides Configuration To" --> Data_Pipeline
+ Configuration_Management -- "Provides Configuration To" --> Training_Orchestrator
+ Data_Pipeline -- "Provides Data To" --> Training_Orchestrator
+ Model_Abstraction -- "Provides Model To" --> Training_Orchestrator
+ click Entry_Point href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Entry_Point.md" "Details"
+ click Configuration_Management href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Configuration_Management.md" "Details"
+ click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"
+ click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
+ click Training_Orchestrator href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Training_Orchestrator.md" "Details"
+```
+
+## Details
+
+This page gives an abstract overview of the pipeline's main components and how they interact.
+
+### Entry Point [[Expand]](./Entry_Point.md)
+The primary execution script that orchestrates the entire training process. It initializes all other major components, loads configurations, sets up the training environment, and invokes the `Training Orchestrator`.
+
+
+**Related Classes/Methods**:
+
+- `train`
+
+
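+A minimal sketch of how the entry point wires these components together (the module layout matches the repository, but the function names are illustrative):
+
+```python
+# Illustrative wiring only -- not the actual train.py implementation
+from config import get_config          # Configuration Management
+from data import create_dataloaders    # Data Pipeline
+from model import load_model           # Model Abstraction
+from trainer import Trainer            # Training Orchestrator
+
+def main(config_path: str) -> None:
+    config = get_config(config_path)                   # load and validate settings
+    train_dl, eval_dl = create_dataloaders(config)     # tokenized train/eval loaders
+    model = load_model(config)                         # pretrained weights, optional LoRA
+    Trainer(model, config, train_dl, eval_dl).train()  # run the training loop
+
+if __name__ == "__main__":
+    import sys
+    main(sys.argv[1])
+```
+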
+### Configuration Management [[Expand]](./Configuration_Management.md)
+Centralized management of model specifications, data paths, and training hyperparameters. It is responsible for loading, validating, and providing access to configuration settings, supporting both base and custom configurations.
+
+
+**Related Classes/Methods**:
+
+- `config` (1:1)
+
+
+### Data Pipeline [[Expand]](./Data_Pipeline.md)
+Handles the entire data lifecycle, including dataset loading, preprocessing (e.g., tokenization, formatting), and creating efficient data loaders for both training and evaluation phases.
+
+
+**Related Classes/Methods**:
+
+- `data` (1:1)
+
+
+### Model Abstraction [[Expand]](./Model_Abstraction.md)
+Encapsulates the logic for loading pre-trained models, defining model architectures, and managing different model variants (e.g., quantization, LoRA adapters). It provides a consistent interface for model interaction.
+
+
+**Related Classes/Methods**:
+
+- `model` (1:1)
+
+
+### Training Orchestrator [[Expand]](./Training_Orchestrator.md)
+Implements the core training and fine-tuning loop. This includes managing forward and backward passes, optimization, loss calculation, and integration with acceleration libraries (e.g., `accelerate`). It also handles callbacks and evaluation logic.
+
+
+**Related Classes/Methods**:
+
+- `trainer` (1:1)
+
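+A compressed sketch of the loop the orchestrator implements (assuming an `accelerate`-style setup; names and structure are illustrative):
+
+```python
+from accelerate import Accelerator
+
+def training_loop(model, optimizer, train_dl, num_epochs: int = 1):
+    accelerator = Accelerator()  # handles device placement and mixed precision
+    model, optimizer, train_dl = accelerator.prepare(model, optimizer, train_dl)
+    model.train()
+    for _ in range(num_epochs):
+        for batch in train_dl:
+            outputs = model(**batch)            # forward pass; HF models expose .loss
+            accelerator.backward(outputs.loss)  # backward pass through accelerate
+            optimizer.step()
+            optimizer.zero_grad()
+```
+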
diff --git a/launch.sh b/launch.sh
index e4e21634f1e62fce58856200c5fc3d9ec20fb06f..1fa882503e944e86818891bca6f3103cfb31b6e6 100644
--- a/launch.sh
+++ b/launch.sh
@@ -60,6 +60,33 @@ get_input() {
eval "$var_name=\"$input\""
}
+# Function to get secure token input (hidden with stars)
+get_secure_token_input() {
+ local prompt="$1"
+ local var_name="$2"
+ local token_type="$3"
+
+ echo -n "$prompt: "
+ # Use -s flag to hide input, -r to not interpret backslashes
+ read -s -r input
+ echo # Add newline after hidden input
+
+ # Validate that input is not empty
+ while [ -z "$input" ]; do
+ print_error "Token is required!"
+ echo -n "$prompt: "
+ read -s -r input
+ echo
+ done
+
+ # Store the token
+ eval "$var_name=\"$input\""
+
+ # Show confirmation with stars
+ local masked_token="${input:0:4}****${input: -4}"
+ print_status "$token_type token added: $masked_token"
+}
+
# Function to select from options
select_option() {
local prompt="$1"
@@ -342,22 +369,51 @@ print_header "SmolLM3 End-to-End Fine-tuning Pipeline"
echo "=============================================="
echo ""
-# Step 1: Get user credentials (only token needed now)
+# Step 1: Get user credentials (write and read tokens)
print_step "Step 1: User Authentication"
echo "================================"
-get_input "Hugging Face token (get from https://huggingface.co/settings/tokens)" "" HF_TOKEN
+print_info "You'll need two Hugging Face tokens:"
+echo "1. Write Token - Used during training for creating repositories and pushing models"
+echo "2. Read Token - Used in Trackio Space after training for security"
+echo ""
+
+print_info "Getting Write Token (for training operations)..."
+get_secure_token_input "Enter your Hugging Face WRITE token (get from https://huggingface.co/settings/tokens)" HF_WRITE_TOKEN "Write"
-# Validate HF token and get username automatically
-print_info "Validating Hugging Face token and getting username..."
-if validate_hf_token_and_get_username "$HF_TOKEN"; then
- print_status "HF token validated successfully"
+print_info "Getting Read Token (for Trackio Space security)..."
+get_secure_token_input "Enter your Hugging Face READ token (get from https://huggingface.co/settings/tokens)" HF_READ_TOKEN "Read"
+
+# Validate write token and get username automatically
+print_info "Validating write token and getting username..."
+if validate_hf_token_and_get_username "$HF_WRITE_TOKEN"; then
+ print_status "Write token validated successfully"
print_info "Username: $HF_USERNAME"
else
- print_error "Invalid HF token. Please check your token and try again."
+ print_error "Invalid write token. Please check your token and try again."
exit 1
fi
+# Validate read token belongs to same user
+print_info "Validating read token..."
+# Save the write-token username first: validate_hf_token_and_get_username overwrites HF_USERNAME
+WRITE_USERNAME="$HF_USERNAME"
+if validate_hf_token_and_get_username "$HF_READ_TOKEN"; then
+ READ_USERNAME="$HF_USERNAME"
+ HF_USERNAME="$WRITE_USERNAME"
+ if [ "$READ_USERNAME" = "$WRITE_USERNAME" ]; then
+ print_status "Read token validated successfully"
+ print_info "Both tokens belong to user: $HF_USERNAME"
+ else
+ print_error "Token mismatch: write token user ($WRITE_USERNAME) != read token user ($READ_USERNAME)"
+ print_error "Both tokens must belong to the same user"
+ exit 1
+ fi
+else
+ print_error "Invalid read token. Please check your token and try again."
+ exit 1
+fi
+
+# Set the main HF_TOKEN to write token for training operations
+HF_TOKEN="$HF_WRITE_TOKEN"
+
# Step 2: Select training configuration
print_step "Step 2: Training Configuration"
echo "=================================="
@@ -535,6 +591,8 @@ fi
# Set environment variables before creating virtual environment
print_info "Setting up environment variables..."
+export HF_WRITE_TOKEN="$HF_WRITE_TOKEN"
+export HF_READ_TOKEN="$HF_READ_TOKEN"
export HF_TOKEN="$HF_TOKEN"
export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
@@ -546,6 +604,8 @@ source smollm3_env/bin/activate
# Re-export environment variables in the virtual environment
print_info "Configuring environment variables in virtual environment..."
+export HF_WRITE_TOKEN="$HF_WRITE_TOKEN"
+export HF_READ_TOKEN="$HF_READ_TOKEN"
export HF_TOKEN="$HF_TOKEN"
export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
@@ -574,15 +634,16 @@ print_status "HF token configured for Python API usage"
print_info "Username: $HF_USERNAME (auto-detected from token)"
print_info "Token available in environment: ${HF_TOKEN:0:10}...${HF_TOKEN: -4}"
-# Verify token is available in the virtual environment
+# Verify tokens are available in the virtual environment
print_info "Verifying token availability in virtual environment..."
-if [ -n "$HF_TOKEN" ] && [ -n "$HUGGING_FACE_HUB_TOKEN" ]; then
- print_status "✅ Token properly configured in virtual environment"
- print_info " HF_TOKEN: ${HF_TOKEN:0:10}...${HF_TOKEN: -4}"
+if [ -n "$HF_WRITE_TOKEN" ] && [ -n "$HF_READ_TOKEN" ] && [ -n "$HUGGING_FACE_HUB_TOKEN" ]; then
+ print_status "✅ Tokens properly configured in virtual environment"
+ print_info " HF_WRITE_TOKEN: ${HF_WRITE_TOKEN:0:10}...${HF_WRITE_TOKEN: -4}"
+ print_info " HF_READ_TOKEN: ${HF_READ_TOKEN:0:10}...${HF_READ_TOKEN: -4}"
print_info " HUGGING_FACE_HUB_TOKEN: ${HUGGING_FACE_HUB_TOKEN:0:10}...${HUGGING_FACE_HUB_TOKEN: -4}"
else
- print_error "❌ Token not properly configured in virtual environment"
- print_error "Please check your token and try again"
+ print_error "❌ Tokens not properly configured in virtual environment"
+ print_error "Please check your tokens and try again"
exit 1
fi
@@ -632,12 +693,14 @@ print_info "Username will be auto-detected from token"
print_info "Secrets will be set automatically via API"
# Ensure environment variables are available for the script
+export HF_WRITE_TOKEN="$HF_WRITE_TOKEN"
+export HF_READ_TOKEN="$HF_READ_TOKEN"
export HF_TOKEN="$HF_TOKEN"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
export HF_USERNAME="$HF_USERNAME"
# Run deployment script with automated features
-python deploy_trackio_space.py "$TRACKIO_SPACE_NAME" "$HF_TOKEN" "$GIT_EMAIL"
+python deploy_trackio_space.py "$TRACKIO_SPACE_NAME" "$HF_TOKEN" "$GIT_EMAIL" "$HF_USERNAME" "$TRACKIO_DATASET_REPO"
print_status "Trackio Space deployed: $TRACKIO_URL"
@@ -651,6 +714,8 @@ print_info "Username will be auto-detected from token"
print_info "Dataset repository: $TRACKIO_DATASET_REPO"
# Ensure environment variables are available for the script
+export HF_WRITE_TOKEN="$HF_WRITE_TOKEN"
+export HF_READ_TOKEN="$HF_READ_TOKEN"
export HF_TOKEN="$HF_TOKEN"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
export HF_USERNAME="$HF_USERNAME"
@@ -666,6 +731,8 @@ print_info "Configuring Trackio ..."
print_info "Username will be auto-detected from token"
# Ensure environment variables are available for the script
+export HF_WRITE_TOKEN="$HF_WRITE_TOKEN"
+export HF_READ_TOKEN="$HF_READ_TOKEN"
export HF_TOKEN="$HF_TOKEN"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
export HF_USERNAME="$HF_USERNAME"
@@ -709,6 +776,8 @@ print_info "Output: /output-checkpoint"
print_info "Trackio: $TRACKIO_URL"
# Ensure environment variables are available for training
+export HF_WRITE_TOKEN="$HF_WRITE_TOKEN"
+export HF_READ_TOKEN="$HF_READ_TOKEN"
export HF_TOKEN="$HF_TOKEN"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
export HF_USERNAME="$HF_USERNAME"
@@ -730,6 +799,8 @@ print_info "Pushing model to: $REPO_NAME"
print_info "Checkpoint: /output-checkpoint"
# Ensure environment variables are available for model push
+export HF_WRITE_TOKEN="$HF_WRITE_TOKEN"
+export HF_READ_TOKEN="$HF_READ_TOKEN"
export HF_TOKEN="$HF_TOKEN"
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
export HF_USERNAME="$HF_USERNAME"
@@ -744,94 +815,75 @@ python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME
--author-name "$AUTHOR_NAME" \
--model-description "$MODEL_DESCRIPTION"
-# Step 16.5: Quantization Options
-print_step "Step 16.5: Model Quantization Options"
-echo "=========================================="
+# Step 16.5: Switch Trackio Space to Read Token (Security)
+print_step "Step 16.5: Switching to Read Token for Security"
+echo "===================================================="
+
+print_info "Switching Trackio Space from write token to read token for security..."
+print_info "This ensures the space can only read datasets, not write to repositories"
+
+# Ensure environment variables are available for token switch
+export HF_TOKEN="$HF_WRITE_TOKEN" # Use write token to update space
+export HUGGING_FACE_HUB_TOKEN="$HF_WRITE_TOKEN"
+export HF_USERNAME="$HF_USERNAME"
+
+# Switch to read token in Trackio Space
+cd scripts/trackio_tonic
+python switch_to_read_token.py "$HF_USERNAME/$TRACKIO_SPACE_NAME" "$HF_READ_TOKEN" "$HF_WRITE_TOKEN"
+
+if [ $? -eq 0 ]; then
+ print_status "✅ Successfully switched Trackio Space to read token"
+ print_info "🔒 Space now uses read-only permissions for security"
+else
+ print_warning "⚠️ Failed to switch to read token, but continuing with pipeline"
+ print_info "You can manually switch the token in your Space settings later"
+fi
+
+cd ../..
-print_info "Would you like to create quantized versions of your model?"
-print_info "Quantization reduces model size and improves inference speed."
+# Step 17: Deploy Demo Space
+print_step "Step 17: Deploying Demo Space"
+echo "=================================="
-# Ask about quantization
-get_input "Create quantized models? (y/n)" "y" "CREATE_QUANTIZED"
+# Ask user if they want to deploy a demo space
+get_input "Do you want to deploy a demo space to test your model? (y/n)" "y" "DEPLOY_DEMO"
-if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
- print_info "Quantization options:"
- print_info "1. int8_weight_only (GPU optimized, ~50% memory reduction)"
- print_info "2. int4_weight_only (CPU optimized, ~75% memory reduction)"
- print_info "3. Both int8 and int4 versions"
+if [ "$DEPLOY_DEMO" = "y" ] || [ "$DEPLOY_DEMO" = "Y" ]; then
+ print_info "Deploying demo space for model testing..."
- select_option "Select quantization type:" "int8_weight_only" "int4_weight_only" "both" "QUANT_TYPE"
+ # Use main model for demo (no quantization)
+ DEMO_MODEL_ID="$REPO_NAME"
+ DEMO_SUBFOLDER=""
- if [ "$QUANT_TYPE" = "both" ]; then
- # Create both int8 and int4 versions in the same repository
- print_info "Creating int8 (GPU) quantized model..."
-
- # Ensure environment variables are available for quantization
- export HF_TOKEN="$HF_TOKEN"
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
- export HF_USERNAME="$HF_USERNAME"
- export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
-
- python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
- --quant-type "int8_weight_only" \
- --device "auto" \
- --token "$HF_TOKEN" \
- --trackio-url "$TRACKIO_URL" \
- --experiment-name "${EXPERIMENT_NAME}-int8" \
- --dataset-repo "$TRACKIO_DATASET_REPO"
-
- print_info "Creating int4 (CPU) quantized model..."
-
- # Ensure environment variables are available for quantization
- export HF_TOKEN="$HF_TOKEN"
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
- export HF_USERNAME="$HF_USERNAME"
- export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
-
- python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
- --quant-type "int4_weight_only" \
- --device "cpu" \
- --token "$HF_TOKEN" \
- --trackio-url "$TRACKIO_URL" \
- --experiment-name "${EXPERIMENT_NAME}-int4" \
- --dataset-repo "$TRACKIO_DATASET_REPO"
-
- print_status "✅ Both quantized models created in the same repository:"
- print_info "Main model: https://huggingface.co/$REPO_NAME"
- print_info "int8 (GPU): https://huggingface.co/$REPO_NAME/int8"
- print_info "int4 (CPU): https://huggingface.co/$REPO_NAME/int4"
-
+ # Ensure environment variables are available for demo deployment
+ export HF_WRITE_TOKEN="$HF_WRITE_TOKEN"
+ export HF_READ_TOKEN="$HF_READ_TOKEN"
+ export HF_TOKEN="$HF_TOKEN"
+ export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
+ export HF_USERNAME="$HF_USERNAME"
+
+ print_info "Deploying demo space for model: $DEMO_MODEL_ID"
+ print_info "Using subfolder: $DEMO_SUBFOLDER"
+
+ python scripts/deploy_demo_space.py \
+ --hf-token "$HF_TOKEN" \
+ --hf-username "$HF_USERNAME" \
+ --model-id "$DEMO_MODEL_ID" \
+ --subfolder "$DEMO_SUBFOLDER" \
+ --space-name "${REPO_NAME}-demo"
+
+ if [ $? -eq 0 ]; then
+ DEMO_SPACE_URL="https://huggingface.co/spaces/$HF_USERNAME/${REPO_NAME}-demo"
+ print_status "✅ Demo space deployed successfully: $DEMO_SPACE_URL"
else
- # Create single quantized version in the same repository
- print_info "Creating ${QUANT_TYPE} quantized model..."
-
- DEVICE="auto"
- if [ "$QUANT_TYPE" = "int4_weight_only" ]; then
- DEVICE="cpu"
- fi
-
- # Ensure environment variables are available for quantization
- export HF_TOKEN="$HF_TOKEN"
- export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
- export HF_USERNAME="$HF_USERNAME"
- export TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
-
- python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
- --quant-type "$QUANT_TYPE" \
- --device "$DEVICE" \
- --token "$HF_TOKEN" \
- --trackio-url "$TRACKIO_URL" \
- --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
- --dataset-repo "$TRACKIO_DATASET_REPO"
-
- print_status "✅ Quantized model created: https://huggingface.co/$REPO_NAME/${QUANT_TYPE//_/-}"
+ print_warning "⚠️ Demo space deployment failed, but continuing with pipeline"
fi
else
- print_info "Skipping quantization"
+ print_info "Skipping demo space deployment"
fi
-# Step 17: Create summary report
-print_step "Step 17: Creating Summary Report"
+# Step 18: Create summary report
+print_step "Step 18: Creating Summary Report"
echo "===================================="
cat > training_summary.md << EOF
@@ -846,6 +898,7 @@ cat > training_summary.md << EOF
- **HF Dataset**: $TRACKIO_DATASET_REPO
- **Training Config**: $TRAINING_CONFIG_TYPE
- **Trainer Type**: $TRAINER_TYPE
+- **Security**: Dual token system (write + read tokens)
$(if [ "$TRAINING_CONFIG_TYPE" = "H100 Lightweight (Rapid)" ]; then
echo "- **Dataset Sample Size**: ${DATASET_SAMPLE_SIZE:-80000}"
fi)
@@ -861,14 +914,9 @@ fi)
- **Model Repository**: https://huggingface.co/$REPO_NAME
- **Trackio Monitoring**: $TRACKIO_URL
- **Experiment Data**: https://huggingface.co/datasets/$TRACKIO_DATASET_REPO
-$(if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
-echo "- **Quantization**: $QUANT_TYPE"
-if [ "$QUANT_TYPE" = "both" ]; then
-echo "- **int8 Model (GPU)**: https://huggingface.co/$REPO_NAME/int8"
-echo "- **int4 Model (CPU)**: https://huggingface.co/$REPO_NAME/int4"
-else
-echo "- **Quantized Model**: https://huggingface.co/$REPO_NAME/${QUANT_TYPE//_/-}"
-fi
+- **Security**: Trackio Space switched to read-only token for security
+$(if [ "$DEPLOY_DEMO" = "y" ] || [ "$DEPLOY_DEMO" = "Y" ]; then
+echo "- **Demo Space**: https://huggingface.co/spaces/$HF_USERNAME/${REPO_NAME}-demo"
fi)
## Next Steps
@@ -895,15 +943,8 @@ echo "📊 Model: https://huggingface.co/$REPO_NAME"
echo "📈 Trackio: $TRACKIO_URL"
echo "📋 Experiment: $EXPERIMENT_NAME"
echo "📊 Dataset: https://huggingface.co/datasets/$TRACKIO_DATASET_REPO"
-$(if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
-echo ""
-echo "🔧 Quantized Models:"
-if [ "$QUANT_TYPE" = "both" ]; then
-echo " 📊 int8 (GPU): https://huggingface.co/$REPO_NAME/int8"
-echo " 📊 int4 (CPU): https://huggingface.co/$REPO_NAME/int4"
-else
-echo " 📊 $QUANT_TYPE: https://huggingface.co/$REPO_NAME/${QUANT_TYPE//_/-}"
-fi
+$(if [ "$DEPLOY_DEMO" = "y" ] || [ "$DEPLOY_DEMO" = "Y" ]; then
+echo "🎮 Demo: https://huggingface.co/spaces/$HF_USERNAME/${REPO_NAME}-demo"
fi)
echo ""
echo "📋 Summary report saved to: training_summary.md"
@@ -911,7 +952,11 @@ echo ""
echo "🚀 Next steps:"
echo "1. Monitor training progress in your Trackio Space"
echo "2. Check the model repository on Hugging Face Hub"
-echo "3. Use the model in your applications"
-echo "4. Share your results with the community"
+echo "3. Your Trackio Space is now secured with read-only permissions"
+$(if [ "$DEPLOY_DEMO" = "y" ] || [ "$DEPLOY_DEMO" = "Y" ]; then
+echo "3. Make your huggingface space a ZeroGPU Space & Test your model"
+fi)
+echo "5. Use the model in your applications"
+echo "6. Share your results with the community"
echo ""
print_status "Pipeline completed successfully!"
\ No newline at end of file
diff --git a/scripts/deploy_demo_space.py b/scripts/deploy_demo_space.py
new file mode 100644
index 0000000000000000000000000000000000000000..769324cb60d2298756d44f969f9b8ce725cb5764
--- /dev/null
+++ b/scripts/deploy_demo_space.py
@@ -0,0 +1,514 @@
+#!/usr/bin/env python3
+"""
+Demo Space Deployment Script
+Deploys a Gradio demo space to Hugging Face Spaces for testing the fine-tuned model.
+"""
+
+import os
+import sys
+import json
+import logging
+import argparse
+import subprocess
+import requests
+import tempfile
+import shutil
+from pathlib import Path
+from typing import Optional, Dict, Any
+import time
+
+# Import Hugging Face Hub API
+try:
+ from huggingface_hub import HfApi, create_repo, upload_file
+ HF_HUB_AVAILABLE = True
+except ImportError:
+ HF_HUB_AVAILABLE = False
+ print("Warning: huggingface_hub not available. Install with: pip install huggingface_hub")
+
+# Add src to path for imports
+sys.path.append(str(Path(__file__).parent.parent / "src"))
+
+from config import SmolLM3Config
+
+# Setup logging
+logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+logger = logging.getLogger(__name__)
+
+class DemoSpaceDeployer:
+ """Deploy demo space to Hugging Face Spaces"""
+
+ def __init__(self, hf_token: str, hf_username: str, model_id: str,
+ subfolder: str = "int4", space_name: Optional[str] = None):
+ self.hf_token = hf_token
+ self.hf_username = hf_username
+ self.model_id = model_id
+ self.subfolder = subfolder
+ self.space_name = space_name or f"{model_id.split('/')[-1]}-demo"
+ self.space_id = f"{hf_username}/{self.space_name}"
+ self.space_url = f"https://huggingface.co/spaces/{self.space_id}"
+
+ # Template paths
+ self.template_dir = Path(__file__).parent.parent / "templates" / "spaces" / "demo"
+ self.workspace_dir = Path.cwd()
+
+ # Initialize HF API
+ if HF_HUB_AVAILABLE:
+ self.api = HfApi(token=self.hf_token)
+ else:
+ self.api = None
+ logger.warning("huggingface_hub not available, using CLI fallback")
+
+ def validate_model_exists(self) -> bool:
+ """Validate that the model exists on Hugging Face Hub"""
+ try:
+ logger.info(f"Validating model: {self.model_id}")
+
+ if HF_HUB_AVAILABLE:
+ # Use HF Hub API
+ try:
+ model_info = self.api.model_info(self.model_id)
+ logger.info(f"✅ Model {self.model_id} exists and is accessible")
+ return True
+ except Exception as e:
+ logger.error(f"❌ Model {self.model_id} not found via API: {e}")
+ return False
+ else:
+ # Fallback to requests
+ url = f"https://huggingface.co/api/models/{self.model_id}"
+ headers = {"Authorization": f"Bearer {self.hf_token}"}
+ response = requests.get(url, headers=headers, timeout=30)
+
+ if response.status_code == 200:
+ logger.info(f"✅ Model {self.model_id} exists and is accessible")
+ return True
+ else:
+ logger.error(f"❌ Model {self.model_id} not found or not accessible")
+ return False
+
+ except Exception as e:
+ logger.error(f"❌ Error validating model: {e}")
+ return False
+
+ def create_space_repository(self) -> bool:
+ """Create the space repository on Hugging Face Hub"""
+ try:
+ logger.info(f"Creating Space: {self.space_name}")
+
+ if not HF_HUB_AVAILABLE:
+ logger.warning("huggingface_hub not available, falling back to CLI")
+ return self._create_space_cli()
+
+ # Use the latest HF Hub API to create space
+ try:
+ # Create the space using the API
+ create_repo(
+ repo_id=self.space_id,
+ token=self.hf_token,
+ repo_type="space",
+ exist_ok=True,
+ private=False, # Spaces are typically public
+ space_sdk="gradio", # Specify Gradio SDK
+ space_hardware="cpu-basic" # Use basic CPU
+ )
+
+ logger.info(f"✅ Space created successfully: {self.space_url}")
+ return True
+
+ except Exception as api_error:
+ logger.error(f"API creation failed: {api_error}")
+ logger.info("Falling back to CLI method...")
+ return self._create_space_cli()
+
+ except Exception as e:
+ logger.error(f"❌ Error creating space: {e}")
+ return False
+
+ def _create_space_cli(self) -> bool:
+ """Fallback method using CLI commands"""
+ try:
+ logger.info("Using CLI fallback method...")
+
+ # Set HF token for CLI
+ os.environ['HF_TOKEN'] = self.hf_token
+
+ # Create space using Hugging Face CLI
+ cmd = [
+ "hf", "repo", "create",
+ self.space_id,
+ "--type", "space"
+ ]
+
+ logger.info(f"Running command: {' '.join(cmd)}")
+ result = subprocess.run(cmd, capture_output=True, text=True)
+
+ if result.returncode != 0:
+ logger.warning(f"First attempt failed: {result.stderr}")
+ # Try alternative approach without space-specific flags
+ logger.info("Retrying with basic space creation...")
+ cmd = [
+ "hf", "repo", "create",
+ self.space_id
+ ]
+ result = subprocess.run(cmd, capture_output=True, text=True)
+
+ if result.returncode == 0:
+ logger.info(f"✅ Space created successfully: {self.space_url}")
+ return True
+ else:
+ logger.error(f"❌ Failed to create space: {result.stderr}")
+ return False
+
+ except Exception as e:
+ logger.error(f"❌ Error creating space with CLI: {e}")
+ return False
+
+ def prepare_space_files(self) -> str:
+ """Prepare all necessary files for the Space in a temporary directory"""
+ try:
+ logger.info("Preparing Space files...")
+
+ # Create temporary directory
+ temp_dir = tempfile.mkdtemp()
+ logger.info(f"Created temporary directory: {temp_dir}")
+
+ # Copy template files
+ copied_files = []
+ for file_path in self.template_dir.iterdir():
+ if file_path.is_file():
+ dest_path = Path(temp_dir) / file_path.name
+ shutil.copy2(file_path, dest_path)
+ copied_files.append(file_path.name)
+ logger.info(f"✅ Copied {file_path.name} to temp directory")
+
+ # Update app.py with environment variables
+ app_file = Path(temp_dir) / "app.py"
+ if app_file.exists():
+ with open(app_file, 'r', encoding='utf-8') as f:
+ content = f.read()
+
+ # Add environment variable setup at the top
+ env_setup = f"""
+# Environment variables for model configuration
+import os
+os.environ['HF_MODEL_ID'] = '{self.model_id}'
+os.environ['MODEL_SUBFOLDER'] = '{self.subfolder if self.subfolder else ""}'
+os.environ['MODEL_NAME'] = '{self.model_id.split("/")[-1]}'
+
+"""
+
+ # Insert after the initial import block: track the line following the
+ # last import seen, and stop at the first blank line after the imports
+ lines = content.split('\n')
+ import_end = 0
+ for i, line in enumerate(lines):
+ if line.startswith('import ') or line.startswith('from '):
+ import_end = i + 1
+ elif line.strip() == '' and import_end > 0:
+ break
+
+ lines.insert(import_end, env_setup)
+ content = '\n'.join(lines)
+
+ with open(app_file, 'w', encoding='utf-8') as f:
+ f.write(content)
+
+ logger.info("✅ Updated app.py with model configuration")
+
+ # Create README.md for the space
+ readme_content = f"""# Demo: {self.model_id}
+
+This is an interactive demo for the fine-tuned model {self.model_id}.
+
+## Features
+- Interactive chat interface
+- Customizable system prompts
+- Advanced generation parameters
+- Thinking mode support
+
+## Model Information
+- **Model ID**: {self.model_id}
+- **Subfolder**: {self.subfolder if self.subfolder and self.subfolder.strip() else "main"}
+- **Deployed by**: {self.hf_username}
+
+## Usage
+Simply start chatting with the model using the interface below!
+
+---
+*This demo was automatically deployed by the SmolLM3 Fine-tuning Pipeline*
+"""
+
+ with open(Path(temp_dir) / "README.md", 'w', encoding='utf-8') as f:
+ f.write(readme_content)
+
+ logger.info(f"✅ Prepared {len(copied_files)} files in temporary directory")
+ return temp_dir
+
+ except Exception as e:
+ logger.error(f"❌ Error preparing files: {e}")
+ return None
+
+ def upload_files_to_space(self, temp_dir: str) -> bool:
+ """Upload files to the Space using HF Hub API directly"""
+ try:
+ logger.info("Uploading files to Space using HF Hub API...")
+
+ if not HF_HUB_AVAILABLE:
+ logger.error("❌ huggingface_hub not available for file upload")
+ return self._upload_files_cli(temp_dir)
+
+ # Upload each file using the HF Hub API
+ temp_path = Path(temp_dir)
+ uploaded_files = []
+
+ for file_path in temp_path.iterdir():
+ if file_path.is_file():
+ try:
+ # Upload file to the space
+ upload_file(
+ path_or_fileobj=str(file_path),
+ path_in_repo=file_path.name,
+ repo_id=self.space_id,
+ repo_type="space",
+ token=self.hf_token
+ )
+ uploaded_files.append(file_path.name)
+ logger.info(f"✅ Uploaded {file_path.name}")
+ except Exception as e:
+ logger.error(f"❌ Failed to upload {file_path.name}: {e}")
+ return False
+
+ logger.info(f"✅ Successfully uploaded {len(uploaded_files)} files to Space")
+ return True
+
+ except Exception as e:
+ logger.error(f"❌ Error uploading files: {e}")
+ return self._upload_files_cli(temp_dir)
+
+ def _upload_files_cli(self, temp_dir: str) -> bool:
+ """Fallback method using CLI for file upload"""
+ try:
+ logger.info("Using CLI fallback for file upload...")
+
+ # Set HF token for CLI
+ os.environ['HF_TOKEN'] = self.hf_token
+
+ # Initialize git repository
+ subprocess.run(["git", "init"], cwd=temp_dir, check=True)
+ subprocess.run(["git", "config", "user.name", "Demo Deployer"], cwd=temp_dir, check=True)
+ subprocess.run(["git", "config", "user.email", "demo@example.com"], cwd=temp_dir, check=True)
+
+ # Add files
+ subprocess.run(["git", "add", "."], cwd=temp_dir, check=True)
+ subprocess.run(["git", "commit", "-m", f"Deploy demo for {self.model_id}"], cwd=temp_dir, check=True)
+
+ # Add remote and push
+ remote_url = f"https://{self.hf_token}@huggingface.co/spaces/{self.space_id}"
+ subprocess.run(["git", "remote", "add", "origin", remote_url], cwd=temp_dir, check=True)
+ subprocess.run(["git", "push", "-u", "origin", "main"], cwd=temp_dir, check=True)
+
+ logger.info(f"✅ Successfully pushed files to space: {self.space_id}")
+ return True
+
+ except subprocess.CalledProcessError as e:
+ logger.error(f"❌ Git operation failed: {e}")
+ return False
+ except Exception as e:
+ logger.error(f"❌ Error pushing to space: {e}")
+ return False
+
+ def set_space_secrets(self) -> bool:
+ """Set environment variables/secrets for the Space using HF Hub API"""
+ try:
+ logger.info("Setting Space secrets using HF Hub API...")
+
+ if not HF_HUB_AVAILABLE:
+ logger.warning("❌ huggingface_hub not available for setting secrets")
+ return self._manual_secret_setup()
+
+ # Set the HF_TOKEN secret for the space using the API
+ try:
+ self.api.add_space_secret(
+ repo_id=self.space_id,
+ key="HF_TOKEN",
+ value=self.hf_token,
+ description="Hugging Face token for model access"
+ )
+ logger.info("✅ Successfully set HF_TOKEN secret via API")
+
+ # Set model-specific environment variables
+ self.api.add_space_variable(
+ repo_id=self.space_id,
+ key="HF_MODEL_ID",
+ value=self.model_id,
+ description="Model ID for the demo"
+ )
+ logger.info(f"✅ Successfully set HF_MODEL_ID variable: {self.model_id}")
+
+ if self.subfolder and self.subfolder.strip():
+ self.api.add_space_variable(
+ repo_id=self.space_id,
+ key="MODEL_SUBFOLDER",
+ value=self.subfolder,
+ description="Model subfolder for the demo"
+ )
+ logger.info(f"✅ Successfully set MODEL_SUBFOLDER variable: {self.subfolder}")
+ else:
+ logger.info("ℹ️ No subfolder specified, using main model")
+
+ return True
+
+ except Exception as api_error:
+ logger.error(f"❌ Failed to set secrets via API: {api_error}")
+ logger.info("Falling back to manual setup...")
+ return self._manual_secret_setup()
+
+ except Exception as e:
+ logger.error(f"❌ Error setting space secrets: {e}")
+ return self._manual_secret_setup()
+
+ def _manual_secret_setup(self) -> bool:
+ """Fallback method for manual secret setup"""
+ logger.info("📝 Manual Space Secrets Configuration:")
+ logger.info(f" HF_TOKEN={self.hf_token}")
+ logger.info(f" HF_MODEL_ID={self.model_id}")
+ if self.subfolder and self.subfolder.strip():
+ logger.info(f" MODEL_SUBFOLDER={self.subfolder}")
+ else:
+ logger.info(" MODEL_SUBFOLDER=(empty - using main model)")
+
+ logger.info(f"\n🔧 To set secrets in your Space:")
+ logger.info(f"1. Go to your Space settings: {self.space_url}/settings")
+ logger.info("2. Navigate to the 'Repository secrets' section")
+ logger.info("3. Add the following secrets:")
+ logger.info(f" Name: HF_TOKEN")
+ logger.info(f" Value: {self.hf_token}")
+ logger.info(f" Name: HF_MODEL_ID")
+ logger.info(f" Value: {self.model_id}")
+ if self.subfolder and self.subfolder.strip():
+ logger.info(f" Name: MODEL_SUBFOLDER")
+ logger.info(f" Value: {self.subfolder}")
+ else:
+ logger.info(" Name: MODEL_SUBFOLDER")
+ logger.info(" Value: (leave empty)")
+ logger.info("4. Save the secrets")
+
+ return True
+
+ def test_space(self) -> bool:
+ """Test if the Space is working correctly"""
+ try:
+ logger.info("Testing Space...")
+
+ # Give the Space time to build before probing it
+ logger.info("Waiting 180 seconds for Space to build...")
+ time.sleep(180)
+
+ # Try to access the space
+ response = requests.get(self.space_url, timeout=30)
+
+ if response.status_code == 200:
+ logger.info(f"✅ Space is accessible: {self.space_url}")
+ return True
+ else:
+ logger.warning(f"⚠️ Space returned status code: {response.status_code}")
+ logger.warning(f"Response: {response.text[:500]}...")
+ return False
+
+ except Exception as e:
+ logger.error(f"❌ Error testing space: {e}")
+ return False
+
+ def deploy(self) -> bool:
+ """Main deployment method"""
+ logger.info(f"🚀 Starting demo space deployment for {self.model_id}")
+
+ # Step 1: Validate model exists
+ if not self.validate_model_exists():
+ return False
+
+ # Step 2: Create space repository
+ if not self.create_space_repository():
+ return False
+
+ # Step 3: Prepare files
+ temp_dir = self.prepare_space_files()
+ if not temp_dir:
+ return False
+
+ # Step 4: Upload files
+ if not self.upload_files_to_space(temp_dir):
+ return False
+
+ # Step 5: Set space secrets
+ if not self.set_space_secrets():
+ return False
+
+ # Step 6: Clean up temp directory
+ try:
+ shutil.rmtree(temp_dir)
+ logger.info("✅ Cleaned up temporary directory")
+ except Exception as e:
+ logger.warning(f"⚠️ Warning: Could not clean up temp directory: {e}")
+
+ # Step 7: Test space
+ if not self.test_space():
+ logger.warning("⚠️ Space created but may need more time to build")
+ logger.info("Please check the Space manually in a few minutes")
+
+ logger.info(f"🎉 Demo space deployment completed!")
+ logger.info(f"📊 Space URL: {self.space_url}")
+ logger.info(f"🔧 Space configuration: {self.space_url}/settings")
+
+ return True
+
+def main():
+ """Main function for command line usage"""
+ print("Demo Space Deployment Script")
+ print("=" * 40)
+
+ parser = argparse.ArgumentParser(description="Deploy demo space to Hugging Face Spaces")
+ parser.add_argument("--hf-token", required=True, help="Hugging Face token")
+ parser.add_argument("--hf-username", required=True, help="Hugging Face username")
+ parser.add_argument("--model-id", required=True, help="Model ID to deploy demo for")
+ parser.add_argument("--subfolder", default="int4", help="Model subfolder (default: int4)")
+ parser.add_argument("--space-name", help="Custom space name (optional)")
+
+ args = parser.parse_args()
+
+ deployer = DemoSpaceDeployer(
+ hf_token=args.hf_token,
+ hf_username=args.hf_username,
+ model_id=args.model_id,
+ subfolder=args.subfolder,
+ space_name=args.space_name
+ )
+
+ success = deployer.deploy()
+
+ if success:
+ print("\n✅ Deployment successful!")
+ print(f"🌐 Your Demo Space: {deployer.space_url}")
+ print(f"👤 Username: {deployer.hf_username}")
+ print(f"🤖 Model: {deployer.model_id}")
+ print("\nNext steps:")
+ print("1. Wait for the Space to build (usually 2-5 minutes)")
+ print("2. Secrets have been automatically set via API")
+ print("3. Test the interface by visiting the Space URL")
+ print("4. Share your demo with others!")
+ print("\nIf the Space doesn't work immediately, check:")
+ print("- The Space logs at the Space URL")
+ print("- That all files were uploaded correctly")
+ print("- That the HF token has write permissions")
+ print("- That the secrets were set correctly in Space settings")
+ else:
+ print("\n❌ Deployment failed!")
+ print("Check the error messages above and try again.")
+ print("\nTroubleshooting:")
+ print("1. Verify your HF token has write permissions")
+ print("2. Check that the space name is available")
+ print("3. Verify the model exists and is accessible")
+ print("4. Try creating the space manually on HF first")
+
+ sys.exit(0 if success else 1)
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/scripts/trackio_tonic/app.py b/scripts/trackio_tonic/app.py
index 6f668114211f2dd5847c9d8231a2e0d4366ae92d..64382b6197e6c25e5234f272a8682e19ae023f4b 100644
--- a/scripts/trackio_tonic/app.py
+++ b/scripts/trackio_tonic/app.py
@@ -27,7 +27,9 @@ class TrackioSpace:
self.current_experiment = None
# Get dataset repository and HF token from parameters or environment variables
- self.dataset_repo = dataset_repo or os.environ.get('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments')
+ # Use dynamic default based on environment or fallback to generic default
+ default_dataset_repo = os.environ.get('TRACKIO_DATASET_REPO', 'trackio-experiments')
+ self.dataset_repo = dataset_repo or default_dataset_repo
self.hf_token = hf_token or os.environ.get('HF_TOKEN')
logger.info(f"🔧 Using dataset repository: {self.dataset_repo}")
@@ -84,6 +86,9 @@ class TrackioSpace:
"""Load backup experiments when dataset is not available"""
logger.info("🔄 Loading backup experiments...")
+ # Get dynamic trackio URL from environment or use a placeholder
+ trackio_url = os.environ.get('TRACKIO_URL', 'https://your-trackio-space.hf.space')
+
backup_experiments = {
'exp_20250720_130853': {
'id': 'exp_20250720_130853',
@@ -180,7 +185,7 @@ class TrackioSpace:
'use_chat_template': True,
'chat_template_kwargs': {'add_generation_prompt': True, 'no_think_system_message': True},
'enable_tracking': True,
- 'trackio_url': 'https://tonic-test-trackio-test.hf.space',
+ 'trackio_url': trackio_url,
'trackio_token': None,
'log_artifacts': True,
'log_metrics': True,
@@ -275,7 +280,7 @@ class TrackioSpace:
'use_chat_template': True,
'chat_template_kwargs': {'add_generation_prompt': True, 'no_think_system_message': True},
'enable_tracking': True,
- 'trackio_url': 'https://tonic-test-trackio-test.hf.space',
+ 'trackio_url': trackio_url,
'trackio_token': None,
'log_artifacts': True,
'log_metrics': True,
@@ -563,9 +568,12 @@ def create_dataset_repository(hf_token: str, dataset_repo: str) -> str:
# Initialize API client for remote data
api_client = None
try:
- from trackio_api_client import TrackioAPIClient
- api_client = TrackioAPIClient("https://tonic-test-trackio-test.hf.space")
- logger.info("✅ API client initialized for remote data access")
+ from trackio_api_client import create_trackio_client
+ api_client = create_trackio_client()
+ if api_client:
+ logger.info("✅ API client initialized for remote data access")
+ else:
+ logger.warning("⚠️ Could not initialize API client, using local data only")
except ImportError:
logger.warning("⚠️ API client not available, using local data only")
@@ -700,14 +708,11 @@ Name: {experiment['name']}
Description: {experiment['description']}
Status: {experiment['status']}
Created: {experiment['created_at']}
-
📈 METRICS COUNT: {len(experiment['metrics'])}
📋 PARAMETERS COUNT: {len(experiment['parameters'])}
📦 ARTIFACTS COUNT: {len(experiment['artifacts'])}
-
🔧 PARAMETERS:
{json.dumps(experiment['parameters'], indent=2)}
-
📊 LATEST METRICS:
"""
if experiment['metrics']:
@@ -918,7 +923,7 @@ with gr.Blocks(title="Trackio - Experiment Tracking", theme=gr.themes.Soft()) as
dataset_repo_input = gr.Textbox(
label="Dataset Repository",
placeholder="your-username/your-dataset-name",
- value="tonic/trackio-experiments",
+ value=os.environ.get('TRACKIO_DATASET_REPO', 'trackio-experiments'),
info="HF Dataset repository for experiment storage"
)
diff --git a/scripts/trackio_tonic/configure_trackio.py b/scripts/trackio_tonic/configure_trackio.py
index aac96ef5e8fd7855e56de908af2c76e81615be83..9ba93d4fc6aa0bcf531f40515f7e97c223b99d38 100644
--- a/scripts/trackio_tonic/configure_trackio.py
+++ b/scripts/trackio_tonic/configure_trackio.py
@@ -79,11 +79,16 @@ def configure_trackio():
print("🔧 Trackio Configuration")
print("=" * 40)
- # Get HF token and user info
- hf_token = os.environ.get('HF_TOKEN')
+ # Get HF tokens and user info
+ hf_write_token = os.environ.get('HF_WRITE_TOKEN')
+ hf_read_token = os.environ.get('HF_READ_TOKEN')
+ hf_token = os.environ.get('HF_TOKEN') # Legacy support
- if hf_token:
- username = get_username_from_token(hf_token)
+ # Use write token if available, otherwise fall back to HF_TOKEN
+ active_token = hf_write_token or hf_token
+
+ if active_token:
+ username = get_username_from_token(active_token)
if username:
print(f"✅ Authenticated as: {username}")
else:
@@ -97,9 +102,12 @@ def configure_trackio():
# Current configuration
current_config = {
- 'HF_TOKEN': hf_token or 'Not set',
+ 'HF_WRITE_TOKEN': hf_write_token or 'Not set',
+ 'HF_READ_TOKEN': hf_read_token or 'Not set',
+ 'HF_TOKEN': hf_token or 'Not set', # Legacy
'TRACKIO_DATASET_REPO': dataset_repo,
- 'SPACE_ID': os.environ.get('SPACE_ID', 'Not set')
+ 'SPACE_ID': os.environ.get('SPACE_ID', 'Not set'),
+ 'TRACKIO_URL': os.environ.get('TRACKIO_URL', 'Not set')
}
print("📋 Current Configuration:")
@@ -108,9 +116,12 @@ def configure_trackio():
print(f" {status} {key}: {value}")
print("\n🎯 Configuration Options:")
- print("1. Set HF_TOKEN - Required for dataset access")
- print("2. Set TRACKIO_DATASET_REPO - Dataset repository (optional)")
- print("3. Set SPACE_ID - HF Space ID (auto-detected)")
+ print("1. Set HF_WRITE_TOKEN - Required for training operations")
+ print("2. Set HF_READ_TOKEN - Required for Trackio Space security")
+ print("3. Set HF_TOKEN - Legacy token (fallback)")
+ print("4. Set TRACKIO_DATASET_REPO - Dataset repository (optional)")
+ print("5. Set SPACE_ID - HF Space ID (auto-detected)")
+ print("6. Set TRACKIO_URL - Trackio Space URL (auto-detected)")
# Check if running on HF Spaces
if os.environ.get('SPACE_ID'):
@@ -120,27 +131,45 @@ def configure_trackio():
# Validate configuration
print("\n🔍 Configuration Validation:")
- # Check HF_TOKEN
- if current_config['HF_TOKEN'] != 'Not set':
- print("✅ HF_TOKEN is set")
- print(" This allows the app to read/write to HF Datasets")
+ # Check HF_WRITE_TOKEN
+ if current_config['HF_WRITE_TOKEN'] != 'Not set':
+ print("✅ HF_WRITE_TOKEN is set")
+ print(" This allows training operations and repository creation")
+ else:
+ print("❌ HF_WRITE_TOKEN is not set")
+ print(" Please set HF_WRITE_TOKEN for training operations")
+ print(" Get your token from: https://huggingface.co/settings/tokens")
+
+ # Check HF_READ_TOKEN
+ if current_config['HF_READ_TOKEN'] != 'Not set':
+ print("✅ HF_READ_TOKEN is set")
+ print(" This will be used for Trackio Space security")
else:
- print("❌ HF_TOKEN is not set")
- print(" Please set HF_TOKEN to enable dataset functionality")
+ print("❌ HF_READ_TOKEN is not set")
+ print(" Please set HF_READ_TOKEN for Space security")
print(" Get your token from: https://huggingface.co/settings/tokens")
+ # Check legacy HF_TOKEN
+ if current_config['HF_TOKEN'] != 'Not set':
+ print("✅ HF_TOKEN (legacy) is set")
+ print(" This provides fallback functionality")
+ else:
+ print("⚠️ HF_TOKEN (legacy) is not set")
+ print(" This is optional if using HF_WRITE_TOKEN")
+
# Check dataset repository
print(f"📊 Dataset Repository: {dataset_repo}")
# Test dataset access if token is available
- if current_config['HF_TOKEN'] != 'Not set':
+ # 'Not set' is a truthy string, so `or` would never fall back; check explicitly
+ test_token = current_config['HF_WRITE_TOKEN'] if current_config['HF_WRITE_TOKEN'] != 'Not set' else current_config['HF_TOKEN']
+ if test_token != 'Not set':
print("\n🧪 Testing Dataset Access...")
try:
from datasets import load_dataset
from huggingface_hub import HfApi
# First check if the dataset repository exists
- api = HfApi(token=current_config['HF_TOKEN'])
+ api = HfApi(token=test_token)
try:
# Try to get repository info
@@ -148,7 +177,7 @@ def configure_trackio():
print(f"✅ Dataset repository exists: {dataset_repo}")
# Try to load the dataset
- dataset = load_dataset(dataset_repo, token=current_config['HF_TOKEN'])
+ dataset = load_dataset(dataset_repo, token=test_token)
print(f"✅ Successfully loaded dataset: {dataset_repo}")
# Show experiment count
@@ -182,14 +211,17 @@ def configure_trackio():
print(" Run setup_hf_dataset.py to create the dataset")
else:
print("\n🧪 Dataset Access Test:")
- print("❌ Cannot test dataset access - HF_TOKEN not set")
+ print("❌ Cannot test dataset access - no valid token set")
# Generate configuration file
config_file = "trackio_config.json"
config_data = {
- 'hf_token': current_config['HF_TOKEN'],
+ 'hf_write_token': current_config['HF_WRITE_TOKEN'],
+ 'hf_read_token': current_config['HF_READ_TOKEN'],
+ 'hf_token': current_config['HF_TOKEN'], # Legacy
'dataset_repo': current_config['TRACKIO_DATASET_REPO'],
'space_id': current_config['SPACE_ID'],
+ 'trackio_url': current_config['TRACKIO_URL'],
'username': username,
'last_updated': datetime.now().isoformat(),
'notes': 'Trackio configuration - set these as environment variables in your HF Space'
@@ -203,14 +235,19 @@ def configure_trackio():
# Show environment variable commands
print("\n📝 Environment Variables for HF Space:")
print("=" * 50)
- print(f"HF_TOKEN={current_config['HF_TOKEN']}")
+ print(f"HF_WRITE_TOKEN={current_config['HF_WRITE_TOKEN']}")
+ print(f"HF_READ_TOKEN={current_config['HF_READ_TOKEN']}")
+ print(f"HF_TOKEN={current_config['HF_TOKEN']}") # Legacy
print(f"TRACKIO_DATASET_REPO={current_config['TRACKIO_DATASET_REPO']}")
+ if current_config['TRACKIO_URL'] != 'Not set':
+ print(f"TRACKIO_URL={current_config['TRACKIO_URL']}")
print("\n🎯 Next Steps:")
- print("1. Set HF_TOKEN in your HF Space environment variables")
- print("2. Optionally set TRACKIO_DATASET_REPO to use a different dataset")
- print("3. Deploy your updated app.py to the Space")
- print("4. Run setup_hf_dataset.py if you haven't created the dataset yet")
+ print("1. Set HF_WRITE_TOKEN in your HF Space environment variables")
+ print("2. Set HF_READ_TOKEN in your HF Space environment variables")
+ print("3. Optionally set TRACKIO_DATASET_REPO to use a different dataset")
+ print("4. Deploy your updated app.py to the Space")
+ print("5. Run setup_hf_dataset.py if you haven't created the dataset yet")
print("\n📚 Usage Examples")
print("=" * 30)
diff --git a/scripts/trackio_tonic/deploy_trackio_space.py b/scripts/trackio_tonic/deploy_trackio_space.py
index 989f16576f908efc3b0902ff3b234541f483519e..3215f59b8ec8db35de08721bbac0cb65965bb650 100644
--- a/scripts/trackio_tonic/deploy_trackio_space.py
+++ b/scripts/trackio_tonic/deploy_trackio_space.py
@@ -25,9 +25,10 @@ except ImportError:
class TrackioSpaceDeployer:
"""Deployer for Trackio on Hugging Face Spaces"""
- def __init__(self, space_name: str, token: str, git_email: str = None, git_name: str = None):
+ def __init__(self, space_name: str, token: str, git_email: str = None, git_name: str = None, dataset_repo: str = None):
self.space_name = space_name
self.token = token
+ self.dataset_repo = dataset_repo
# Initialize HF API and get user info
if HF_HUB_AVAILABLE:
@@ -211,7 +212,11 @@ class TrackioSpaceDeployer:
dest_path = Path(temp_dir) / file_name
if source_path.exists():
- shutil.copy2(source_path, dest_path)
+ # For app.py, we need to customize it with user variables
+ if file_name == "app.py":
+ self._customize_app_py(source_path, dest_path)
+ else:
+ shutil.copy2(source_path, dest_path)
copied_files.append(file_name)
print(f"✅ Copied {file_name} to temp directory")
else:
@@ -238,6 +243,47 @@ class TrackioSpaceDeployer:
print(f"❌ Error preparing files: {e}")
return None
+ def _customize_app_py(self, source_path: Path, dest_path: Path):
+ """Customize app.py with user-specific variables"""
+ try:
+ with open(source_path, 'r', encoding='utf-8') as f:
+ content = f.read()
+
+ # Replace hardcoded values with user-specific ones
+ replacements = {
+ # Default dataset repository
+ "'tonic/trackio-experiments'": f"'{self.dataset_repo or f'{self.username}/trackio-experiments'}'",
+ "'trackio-experiments'": f"'{self.dataset_repo or f'{self.username}/trackio-experiments'}" if self.dataset_repo else "'trackio-experiments'",
+
+ # Trackio URL
+ "'https://tonic-test-trackio-test.hf.space'": f"'{self.space_url}'",
+ "'https://your-trackio-space.hf.space'": f"'{self.space_url}'",
+
+ # UI default values
+ '"tonic/trackio-experiments"': f'"{self.dataset_repo or f"{self.username}/trackio-experiments"}"',
+ '"trackio-experiments"': f'"{self.dataset_repo or f"{self.username}/trackio-experiments"}"' if self.dataset_repo else '"trackio-experiments"',
+
+                # Examples in help text (duplicate 'tonic/trackio-experiments' key removed; it would have silently overridden the dataset-repo replacement above)
+ "'your-username/trackio-experiments'": f"'{self.username}/trackio-experiments'",
+ "'your-username/my-experiments'": f"'{self.username}/my-experiments'"
+ }
+
+ # Apply replacements
+ for old, new in replacements.items():
+ content = content.replace(old, new)
+
+ # Write customized content
+ with open(dest_path, 'w', encoding='utf-8') as f:
+ f.write(content)
+
+ print(f"✅ Customized app.py with user variables")
+
+ except Exception as e:
+ print(f"❌ Error customizing app.py: {e}")
+ # Fallback to copying original file
+ shutil.copy2(source_path, dest_path)
+
def upload_files_to_space(self, temp_dir: str) -> bool:
"""Upload files to the Space using HF Hub API directly"""
try:
@@ -288,29 +334,57 @@ class TrackioSpaceDeployer:
repo_id = f"{self.username}/{self.space_name}"
- # Get the HF token from environment or use the provided token
- hf_token = os.getenv('HF_TOKEN', self.token)
+ # Get the HF tokens from environment or use the provided token
+ hf_write_token = os.getenv('HF_WRITE_TOKEN', self.token)
+ hf_read_token = os.getenv('HF_READ_TOKEN', self.token)
+ hf_token = os.getenv('HF_TOKEN', self.token) # Legacy
- # Set the HF_TOKEN secret for the space using the API
+ # Set the HF_WRITE_TOKEN secret for the space using the API
try:
+ self.api.add_space_secret(
+ repo_id=repo_id,
+ key="HF_WRITE_TOKEN",
+ value=hf_write_token,
+ description="Hugging Face write token for training operations"
+ )
+ print("✅ Successfully set HF_WRITE_TOKEN secret via API")
+
+ # Set the HF_READ_TOKEN secret for the space using the API
+ self.api.add_space_secret(
+ repo_id=repo_id,
+ key="HF_READ_TOKEN",
+ value=hf_read_token,
+ description="Hugging Face read token for security"
+ )
+ print("✅ Successfully set HF_READ_TOKEN secret via API")
+
+ # Set legacy HF_TOKEN secret for backward compatibility
self.api.add_space_secret(
repo_id=repo_id,
key="HF_TOKEN",
value=hf_token,
- description="Hugging Face token for dataset access"
+ description="Hugging Face token for dataset access (legacy)"
)
print("✅ Successfully set HF_TOKEN secret via API")
- # Optionally set dataset repository if specified
- dataset_repo = os.getenv('TRACKIO_DATASET_REPO')
- if dataset_repo:
- self.api.add_space_variable(
- repo_id=repo_id,
- key="TRACKIO_DATASET_REPO",
- value=dataset_repo,
- description="Dataset repository for Trackio experiments"
- )
- print(f"✅ Successfully set TRACKIO_DATASET_REPO variable: {dataset_repo}")
+ # Set the TRACKIO_DATASET_REPO variable
+ dataset_repo = self.dataset_repo or f"{self.username}/trackio-experiments"
+ self.api.add_space_variable(
+ repo_id=repo_id,
+ key="TRACKIO_DATASET_REPO",
+ value=dataset_repo,
+ description="Dataset repository for Trackio experiments"
+ )
+ print(f"✅ Successfully set TRACKIO_DATASET_REPO variable: {dataset_repo}")
+
+ # Set the TRACKIO_URL variable
+ self.api.add_space_variable(
+ repo_id=repo_id,
+ key="TRACKIO_URL",
+ value=self.space_url,
+ description="Trackio Space URL for monitoring"
+ )
+ print(f"✅ Successfully set TRACKIO_URL variable: {self.space_url}")
return True
@@ -326,20 +400,34 @@ class TrackioSpaceDeployer:
def _manual_secret_setup(self) -> bool:
"""Fallback method for manual secret setup"""
print("📝 Manual Space Secrets Configuration:")
- print(f" HF_TOKEN={self.token}")
- dataset_repo = os.getenv('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments')
+ # Get tokens from environment or use provided token
+ hf_write_token = os.getenv('HF_WRITE_TOKEN', self.token)
+ hf_read_token = os.getenv('HF_READ_TOKEN', self.token)
+ hf_token = os.getenv('HF_TOKEN', self.token) # Legacy
+
+ print(f" HF_WRITE_TOKEN={hf_write_token}")
+ print(f" HF_READ_TOKEN={hf_read_token}")
+ print(f" HF_TOKEN={hf_token}")
+
+ dataset_repo = self.dataset_repo or f"{self.username}/trackio-experiments"
print(f" TRACKIO_DATASET_REPO={dataset_repo}")
+ print(f" TRACKIO_URL={self.space_url}")
print("\n🔧 To set secrets in your Space:")
print("1. Go to your Space settings: {self.space_url}/settings")
print("2. Navigate to the 'Repository secrets' section")
print("3. Add the following secrets:")
+ print(f" Name: HF_WRITE_TOKEN")
+ print(f" Value: {hf_write_token}")
+ print(f" Name: HF_READ_TOKEN")
+ print(f" Value: {hf_read_token}")
print(f" Name: HF_TOKEN")
- print(f" Value: {self.token}")
- if dataset_repo:
- print(f" Name: TRACKIO_DATASET_REPO")
- print(f" Value: {dataset_repo}")
+ print(f" Value: {hf_token}")
+ print(f" Name: TRACKIO_DATASET_REPO")
+ print(f" Value: {dataset_repo}")
+ print(f" Name: TRACKIO_URL")
+ print(f" Value: {self.space_url}")
print("4. Save the secrets")
return True
@@ -420,12 +508,14 @@ def main():
token = sys.argv[2]
git_email = sys.argv[3] if len(sys.argv) > 3 else None
git_name = sys.argv[4] if len(sys.argv) > 4 else None
+ dataset_repo = sys.argv[5] if len(sys.argv) > 5 else None
print(f"Using provided arguments:")
print(f" Space name: {space_name}")
print(f" Token: {'*' * 10}...{token[-4:]}")
print(f" Git email: {git_email or 'default'}")
print(f" Git name: {git_name or 'default'}")
+ print(f" Dataset repo: {dataset_repo or 'default'}")
else:
# Get user input (no username needed - will be extracted from token)
space_name = input("Enter Space name (e.g., trackio-monitoring): ").strip()
@@ -434,6 +524,7 @@ def main():
# Get git configuration (optional)
git_email = input("Enter your git email (optional, press Enter for default): ").strip()
git_name = input("Enter your git name (optional, press Enter for default): ").strip()
+ dataset_repo = input("Enter dataset repository (optional, press Enter for default): ").strip()
if not space_name or not token:
print("❌ Space name and token are required")
@@ -444,9 +535,11 @@ def main():
git_email = None
if not git_name:
git_name = None
+ if not dataset_repo:
+ dataset_repo = None
# Create deployer (username will be extracted from token)
- deployer = TrackioSpaceDeployer(space_name, token, git_email, git_name)
+ deployer = TrackioSpaceDeployer(space_name, token, git_email, git_name, dataset_repo)
# Run deployment
success = deployer.deploy()
@@ -455,6 +548,7 @@ def main():
print("\n✅ Deployment successful!")
print(f"🌐 Your Trackio Space: {deployer.space_url}")
print(f"👤 Username: {deployer.username}")
+ print(f"📊 Dataset Repository: {deployer.dataset_repo or f'{deployer.username}/trackio-experiments'}")
print("\nNext steps:")
print("1. Wait for the Space to build (usually 2-5 minutes)")
print("2. Secrets have been automatically set via API")
diff --git a/scripts/trackio_tonic/switch_to_read_token.py b/scripts/trackio_tonic/switch_to_read_token.py
new file mode 100644
index 0000000000000000000000000000000000000000..101a0ca218d09bd066c7d122be275b722f6756c1
--- /dev/null
+++ b/scripts/trackio_tonic/switch_to_read_token.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python3
+"""
+Switch Trackio Space from Write Token to Read Token
+
+This script switches the HF_TOKEN secret in a Trackio Space from a write token
+to a read token after the experiment is complete, for security purposes.
+"""
+
+import os
+import sys
+import json
+from typing import Optional, Tuple
+from huggingface_hub import HfApi
+
+def validate_token_permissions(token: str) -> Tuple[bool, str, Optional[str]]:
+ """
+ Validate token and determine its permission level.
+
+ Args:
+ token (str): The Hugging Face token to validate
+
+ Returns:
+ Tuple[bool, str, Optional[str]]:
+ - success: True if token is valid
+            - permission_level: "read", "write", "invalid", or "error"
+ - username: The username associated with the token
+ """
+ try:
+ api = HfApi(token=token)
+ user_info = api.whoami()
+
+ # Extract username
+ username = user_info.get("name", user_info.get("username"))
+
+ # Test write permissions by trying to access a test repository
+ # We'll use a simple test - try to get repo info for a public repo
+ try:
+ # Try to access a public dataset to test read permissions
+ api.dataset_info("huggingface-course/documentation-tutorial")
+
+ # For write permissions, we'll assume the token has write access
+ # since we can't easily test write permissions without creating something
+ # In practice, write tokens are typically provided by users who know
+ # they have write access
+ return True, "write", username
+
+ except Exception as e:
+ # If we can't access even a public dataset, it's likely a read token
+ return True, "read", username
+
+ except Exception as e:
+ error_msg = str(e)
+ if "401" in error_msg or "unauthorized" in error_msg.lower():
+ return False, "invalid", None
+ else:
+ return False, "error", None
+
+def switch_space_token(space_id: str, read_token: str, write_token: str) -> bool:
+ """
+ Switch the HF_TOKEN secret in a Trackio Space from write to read token.
+
+ Args:
+ space_id (str): The space ID (username/space-name)
+ read_token (str): The read token to set
+ write_token (str): The write token (for validation)
+
+ Returns:
+ bool: True if successful, False otherwise
+ """
+ try:
+ # Validate both tokens
+ print("🔍 Validating tokens...")
+
+ write_valid, write_perm, write_user = validate_token_permissions(write_token)
+ read_valid, read_perm, read_user = validate_token_permissions(read_token)
+
+ if not write_valid:
+ print(f"❌ Write token validation failed")
+ return False
+
+ if not read_valid:
+ print(f"❌ Read token validation failed")
+ return False
+
+ if write_user != read_user:
+ print(f"❌ Token mismatch: write token user ({write_user}) != read token user ({read_user})")
+ return False
+
+ print(f"✅ Tokens validated successfully")
+ print(f" Write token: {write_perm} permissions for {write_user}")
+ print(f" Read token: {read_perm} permissions for {read_user}")
+
+ # Use the write token to update the space (since we need write access)
+ api = HfApi(token=write_token)
+
+ # Update the HF_TOKEN secret in the space
+ try:
+ api.add_space_secret(
+ repo_id=space_id,
+ key="HF_TOKEN",
+ value=read_token,
+ description="Hugging Face read token for dataset access (switched from write token)"
+ )
+ print(f"✅ Successfully switched HF_TOKEN to read token in space: {space_id}")
+ return True
+
+ except Exception as e:
+ print(f"❌ Failed to update space secret: {e}")
+ return False
+
+ except Exception as e:
+ print(f"❌ Error switching tokens: {e}")
+ return False
+
+def main():
+ """Main function to switch tokens."""
+
+ print("🔄 Trackio Space Token Switch")
+ print("=" * 40)
+
+ # Get arguments
+ if len(sys.argv) >= 4:
+ space_id = sys.argv[1]
+ read_token = sys.argv[2]
+ write_token = sys.argv[3]
+ else:
+ print("Usage: python switch_to_read_token.py ")
+ print("Example: python switch_to_read_token.py username/trackio-monitoring read_token write_token")
+ sys.exit(1)
+
+ # Validate space_id format
+ if "/" not in space_id:
+ print("❌ Invalid space_id format. Use: username/space-name")
+ sys.exit(1)
+
+ # Switch tokens
+ success = switch_space_token(space_id, read_token, write_token)
+
+ if success:
+ print("\n✅ Token switch completed successfully!")
+ print(f"📊 Space: {space_id}")
+ print("🔒 HF_TOKEN now uses read-only permissions")
+ print("💡 The space can still read datasets but cannot write to repositories")
+ else:
+ print("\n❌ Token switch failed!")
+ print("Please check your tokens and try again.")
+ sys.exit(1)
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/scripts/trackio_tonic/trackio_api_client.py b/scripts/trackio_tonic/trackio_api_client.py
index 9ea61de37dca84b2470b8cc3c6476d637af111ab..bb670415807a6ba8d9af88e331e50ec677388aad 100644
--- a/scripts/trackio_tonic/trackio_api_client.py
+++ b/scripts/trackio_tonic/trackio_api_client.py
@@ -10,6 +10,7 @@ import time
import logging
from typing import Dict, Any, Optional
from datetime import datetime
+import os
# Setup logging
logging.basicConfig(level=logging.INFO)
@@ -289,4 +290,31 @@ class TrackioAPIClient:
}
}
except Exception as e:
- return {"error": f"Failed to get Space info: {str(e)}"}
\ No newline at end of file
+ return {"error": f"Failed to get Space info: {str(e)}"}
+
+# Factory function to create client with dynamic configuration
+def create_trackio_client(space_id: Optional[str] = None, hf_token: Optional[str] = None) -> Optional[TrackioAPIClient]:
+ """Create a TrackioAPIClient with dynamic configuration"""
+
+ # Get space_id from environment if not provided
+ if not space_id:
+ space_id = os.environ.get('TRACKIO_URL')
+ if not space_id:
+ # Try to construct from username and space name
+ username = os.environ.get('HF_USERNAME')
+ space_name = os.environ.get('TRACKIO_SPACE_NAME')
+ if username and space_name:
+ space_id = f"https://huggingface.co/spaces/{username}/{space_name}"
+ else:
+ logger.warning("⚠️ No space_id provided and could not determine from environment")
+ return None
+
+ # Get HF token from environment if not provided
+ if not hf_token:
+ hf_token = os.environ.get('HF_TOKEN')
+
+ return TrackioAPIClient(space_id, hf_token)
\ No newline at end of file
diff --git a/templates/datasets/readme.md b/templates/datasets/readme.md
index d7e215e6927067de3fa8322620fb7ceb920ad490..5b95317b9ac594a4c828b9d7ffbd4153685578d4 100644
--- a/templates/datasets/readme.md
+++ b/templates/datasets/readme.md
@@ -33,7 +33,7 @@ configs:
- split: train
path: data/train-*
tags:
-- trackio
+- track tonic
- tonic
- experiment tracking
- smollm3
diff --git a/templates/model_card.md b/templates/model_card.md
index f9df426771a9c068439bcbefce7b295905018499..27c754c65a8072622eec6bfa62eb382c37fe7d30 100644
--- a/templates/model_card.md
+++ b/templates/model_card.md
@@ -9,9 +9,9 @@ tags:
- fine-tuned
- causal-lm
- text-generation
+- tonic
+- legml
- {{#if quantized_models}}quantized{{/if}}
-- {{#if dataset_name}}dataset:{{dataset_name}}{{/if}}
-- {{#if training_config_type}}config:{{training_config_type}}{{/if}}
pipeline_tag: text-generation
base_model: {{base_model}}
{{#if dataset_name}}
@@ -37,34 +37,34 @@ model-index:
- name: Perplexity
type: perplexity
value: "{{perplexity|default:'N/A'}}"
- - name: {{model_name}} (int8 quantized)
- results:
- - task:
- type: text-generation
- dataset:
- name: {{dataset_name}}
- type: {{dataset_name}}
- metrics:
- - name: Memory Reduction
- type: memory_efficiency
- value: "~50%"
- - name: Inference Speed
- type: speed
- value: "Faster"
- - name: {{model_name}} (int4 quantized)
- results:
- - task:
- type: text-generation
- dataset:
- name: {{dataset_name}}
- type: {{dataset_name}}
- metrics:
- - name: Memory Reduction
- type: memory_efficiency
- value: "~75%"
- - name: Inference Speed
- type: speed
- value: "Significantly Faster"
+- name: {{model_name}} (int8 quantized)
+ results:
+ - task:
+ type: text-generation
+ dataset:
+ name: {{dataset_name}}
+ type: {{dataset_name}}
+ metrics:
+ - name: Memory Reduction
+ type: memory_efficiency
+ value: "~50%"
+ - name: Inference Speed
+ type: speed
+ value: "Faster"
+- name: {{model_name}} (int4 quantized)
+ results:
+ - task:
+ type: text-generation
+ dataset:
+ name: {{dataset_name}}
+ type: {{dataset_name}}
+ metrics:
+ - name: Memory Reduction
+ type: memory_efficiency
+ value: "~75%"
+ - name: Inference Speed
+ type: speed
+ value: "Significantly Faster"
{{else}}
model-index:
- name: {{model_name}}
@@ -130,11 +130,6 @@ dataset_format: {{dataset_format}}
{{#if gradient_accumulation_steps}}
gradient_accumulation_steps: {{gradient_accumulation_steps}}
{{/if}}
-{{#if quantized_models}}
-quantization_types:
-- int8_weight_only
-- int4_weight_only
-{{/if}}
---
# {{model_name}}
@@ -175,46 +170,6 @@ output = model.generate(**input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
-{{#if quantized_models}}
-### Quantized Models
-
-This repository also includes quantized versions of the model for improved efficiency:
-
-#### int8 Weight-Only Quantization (GPU Optimized)
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load int8 quantized model (GPU optimized)
-model = AutoModelForCausalLM.from_pretrained(
- "{{repo_name}}/int8",
- device_map="auto",
- torch_dtype=torch.bfloat16
-)
-tokenizer = AutoTokenizer.from_pretrained("{{repo_name}}/int8")
-```
-
-#### int4 Weight-Only Quantization (CPU Optimized)
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load int4 quantized model (CPU optimized)
-model = AutoModelForCausalLM.from_pretrained(
- "{{repo_name}}/int4",
- device_map="cpu",
- torch_dtype=torch.bfloat16
-)
-tokenizer = AutoTokenizer.from_pretrained("{{repo_name}}/int4")
-```
-
-### Quantization Benefits
-
-- **int8 (GPU)**: ~50% memory reduction, faster inference with minimal accuracy loss
-- **int4 (CPU)**: ~75% memory reduction, significantly faster inference with some accuracy trade-off
-
-{{/if}}
-
## Training Information
### Training Configuration
@@ -322,17 +277,7 @@ For questions and support:
├── config.json
├── pytorch_model.bin
├── tokenizer.json
-├── tokenizer_config.json
-{{#if quantized_models}}
-├── int8/ (quantized model for GPU)
-│ ├── README.md
-│ ├── config.json
-│ └── pytorch_model.bin
-└── int4/ (quantized model for CPU)
- ├── README.md
- ├── config.json
- └── pytorch_model.bin
-{{/if}}
+└── tokenizer_config.json
```
## Usage Examples
@@ -394,22 +339,7 @@ pip install torchao # For quantized models
### Hardware Requirements
- **Main Model**: GPU with 8GB+ VRAM recommended
-{{#if quantized_models}}
-- **int8 Model**: GPU with 4GB+ VRAM
-- **int4 Model**: CPU deployment possible
-{{/if}}
-
-## Contributing
-
-Contributions are welcome! Please:
-1. Fork the repository
-2. Create a feature branch
-3. Make your changes
-4. Submit a pull request
## Changelog
-- **v1.0.0**: Initial release with fine-tuned model
-{{#if quantized_models}}
-- **v1.1.0**: Added quantized versions (int8, int4)
-{{/if}}
\ No newline at end of file
+- **v1.0.0**: Initial release with fine-tuned model
\ No newline at end of file
diff --git a/templates/spaces/README.md b/templates/spaces/README.md
index 4014a12a50566b4a01891f5ece28fa634ce86421..ca2aba64ad6ad70b8d55d6c1d2c3298d43c7ee7f 100644
--- a/templates/spaces/README.md
+++ b/templates/spaces/README.md
@@ -1,5 +1,5 @@
---
-title: Trackio Tonic
+title: Track Tonic
emoji: 🐠
colorFrom: indigo
colorTo: yellow
@@ -9,6 +9,14 @@ app_file: app.py
pinned: true
license: mit
short_description: trackio for training monitoring
+tags:
+- smollm3
+- fine-tuned
+- causal-lm
+- text-generation
+- track tonic
+- tonic
+- legml
---
# Trackio Experiment Tracking
diff --git a/templates/spaces/app.py b/templates/spaces/app.py
index 40b33cb5d0ec285e72a2cddc9066b36f30b523ec..67fcf537928442b006ccfa20462dd34e2eb22b6d 100644
--- a/templates/spaces/app.py
+++ b/templates/spaces/app.py
@@ -705,14 +705,11 @@ Name: {experiment['name']}
Description: {experiment['description']}
Status: {experiment['status']}
Created: {experiment['created_at']}
-
📈 METRICS COUNT: {len(experiment['metrics'])}
📋 PARAMETERS COUNT: {len(experiment['parameters'])}
📦 ARTIFACTS COUNT: {len(experiment['artifacts'])}
-
🔧 PARAMETERS:
{json.dumps(experiment['parameters'], indent=2)}
-
📊 LATEST METRICS:
"""
if experiment['metrics']:
diff --git a/templates/spaces/demo/README.md b/templates/spaces/demo/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a602e2977dc77da944b7840d97cf8dbdb70fdc1a
--- /dev/null
+++ b/templates/spaces/demo/README.md
@@ -0,0 +1,191 @@
+---
+title: Petite LLM 3
+emoji: 💃🏻
+colorFrom: green
+colorTo: purple
+sdk: gradio
+sdk_version: 5.38.2
+app_file: app.py
+pinned: false
+license: mit
+short_description: SmolLM3 for French understanding
+---
+
+# 🤖 Petite Elle L'Aime 3 - Chat Interface
+
+A complete Gradio application for the [Petite Elle L'Aime 3](https://huggingface.co/Tonic/petite-elle-L-aime-3-sft) model, featuring the int4 quantized version for efficient CPU deployment.
+
+## 🚀 Features
+
+- **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
+- **Int4 Quantization**: Optimized for CPU deployment with ~75% memory reduction
+- **Interactive Chat Interface**: Real-time conversation with the model
+- **Customizable System Prompt**: Define the assistant's personality and behavior
+- **Thinking Mode**: Enable reasoning mode with thinking tags
+- **Responsive Design**: Modern UI following the reference layout
+- **Chat Template Integration**: Proper Jinja template formatting
+- **Automatic Model Download**: Downloads int4 model at build time
+
+## 📋 Model Information
+
+- **Base Model**: SmolLM3-3B
+- **Parameters**: ~3B
+- **Context Length**: 128k
+- **Quantization**: int4 (CPU optimized)
+- **Memory Reduction**: ~75%
+- **Languages**: English, French, Italian, Portuguese, Chinese, Arabic
+
+## 🛠️ Installation
+
+1. Clone this repository:
+```bash
+git clone <repository-url>
+cd Petite-LLM-3
+```
+
+2. Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+
+## 🚀 Usage
+
+### Local Development
+
+Run the application locally:
+```bash
+python app.py
+```
+
+The application will be available at `http://localhost:7860`
+
+### Hugging Face Spaces
+
+This application is configured for deployment on Hugging Face Spaces with automatic model download:
+
+1. **Build Process**: The `build.py` script automatically downloads the int4 model during Space build
+2. **Model Loading**: Uses local model files when available, falls back to Hugging Face download
+3. **Caching**: Model files are cached for faster subsequent runs
+
+## 🎛️ Interface Features
+
+### Layout Structure
+The interface follows the reference layout with:
+- **Title Section**: Main heading and description
+- **Information Panels**: Features and model information
+- **Input Section**: Context and user input areas
+- **Advanced Settings**: Collapsible parameter controls
+- **Chat Interface**: Real-time conversation display
+
+### System Prompt
+- **Default**: "Tu es TonicIA, un assistant francophone rigoureux et bienveillant."
+- **Editable**: Users can customize the system prompt to define the assistant's personality
+- **Real-time**: Changes take effect immediately for new conversations
+
+### Generation Parameters
+- **Max Length**: Maximum number of tokens to generate (64-2048)
+- **Temperature**: Controls randomness in generation (0.01-1.0)
+- **Top-p**: Nucleus sampling parameter (0.1-1.0)
+- **Enable Thinking**: Enable reasoning mode with thinking tags
+- **Advanced Settings**: Collapsible panel for fine-tuning
+
+## 🔧 Technical Details
+
+### Model Loading Strategy
+The application uses a smart loading strategy (sketched below):
+
+1. **Local Check**: First checks if int4 model files exist locally
+2. **Local Loading**: If available, loads from `./int4` folder
+3. **Fallback Download**: If not available, downloads from Hugging Face
+4. **Tokenizer**: Always uses main repo for chat template and configuration
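+
+A minimal sketch of this loading strategy (an approximation of the logic in `app.py`, assuming the int4 files live in `./int4` as described above):
+
+```python
+from pathlib import Path
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+MAIN_MODEL_ID = "Tonic/petite-elle-L-aime-3-sft"
+LOCAL_INT4 = Path("./int4")
+
+# Steps 1-2: prefer the locally downloaded int4 weights when present
+if (LOCAL_INT4 / "config.json").exists():
+    model = AutoModelForCausalLM.from_pretrained(LOCAL_INT4)
+else:
+    # Step 3: fall back to downloading the int4 subfolder from the Hub
+    model = AutoModelForCausalLM.from_pretrained(MAIN_MODEL_ID, subfolder="int4")
+
+# Step 4: the tokenizer (and chat template) always come from the main repo
+tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID)
+```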
+
+### Build Process
+For Hugging Face Spaces deployment (see the sketch after this list):
+
+1. **Build Script**: `build.py` runs during Space build
+2. **Model Download**: `download_model.py` downloads int4 model files
+3. **Local Storage**: Model files stored in `./int4` directory
+4. **Fast Loading**: Subsequent runs use local files
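+
+Neither script is shown in this diff; a hypothetical minimal `download_model.py` built on `huggingface_hub.snapshot_download` might look like this (an assumption about its contents, not the shipped script):
+
+```python
+from huggingface_hub import snapshot_download
+
+# Fetch only the int4 subfolder of the model repo into the working directory,
+# so the files land under ./int4 where the loading strategy above expects them.
+snapshot_download(
+    repo_id="Tonic/petite-elle-L-aime-3-sft",
+    allow_patterns=["int4/*"],
+    local_dir=".",
+)
+```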
+
+### Chat Template Integration
+The application uses the custom chat template from the model (example below), which supports:
+- System prompt integration
+- User and assistant message formatting
+- Thinking mode with `<think>` tags
+- Proper conversation flow management
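+
+Applying the template uses the standard `transformers` API; a short sketch (the `enable_thinking` kwarg matches the one used by `app.py`):
+
+```python
+messages = [
+    {"role": "system", "content": "Tu es TonicIA, un assistant francophone rigoureux et bienveillant."},
+    {"role": "user", "content": "Bonjour !"},
+]
+prompt = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=True,  # emit <think> reasoning tags
+)
+```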
+
+### Memory Optimization
+- Uses int4 quantization for reduced memory footprint
+- Automatic device detection (CUDA/CPU)
+- Efficient tokenization and generation
+
+## 📝 Example Usage
+
+1. **Basic Conversation**:
+ - Add context in the system prompt area
+ - Type your message in the user input box
+ - Click the generate button to start chatting
+
+2. **Customizing System Prompt**:
+ - Edit the context in the dedicated text area
+ - Changes apply to new messages immediately
+ - Example: "Tu es un expert en programmation Python."
+
+3. **Advanced Settings**:
+ - Check the "Advanced Settings" checkbox
+ - Adjust generation parameters as needed
+ - Enable/disable thinking mode
+
+4. **Real-time Chat**:
+ - Messages appear in the chat interface
+ - Conversation history is maintained
+ - Responses are generated using the model's chat template
+
+## 🐛 Troubleshooting
+
+### Common Issues
+
+1. **Model Loading Errors**:
+ - Ensure you have sufficient RAM (8GB+ recommended)
+ - Check your internet connection for model download
+ - Verify all dependencies are installed
+
+2. **Generation Errors**:
+ - Try reducing the "Max Length" parameter
+ - Adjust temperature and top-p values
+ - Check the console for detailed error messages
+
+3. **Performance Issues**:
+ - The int4 model is optimized for CPU but may be slower than GPU versions
+ - Consider using a machine with more RAM for better performance
+
+4. **System Prompt Issues**:
+ - Ensure the system prompt is not too long (max 1000 characters)
+ - Check that the prompt follows the expected format
+
+5. **Build Process Issues**:
+ - Check that `download_model.py` runs successfully
+ - Verify that model files are downloaded to `./int4` directory
+ - Ensure sufficient storage space for model files
+
+## 📄 License
+
+This project is licensed under the MIT License. The underlying model is licensed under Apache 2.0.
+
+## 🙏 Acknowledgments
+
+- **Model**: [Tonic/petite-elle-L-aime-3-sft](https://huggingface.co/Tonic/petite-elle-L-aime-3-sft)
+- **Base Model**: SmolLM3-3B by HuggingFaceTB
+- **Training Data**: legmlai/openhermes-fr
+- **Framework**: Gradio, Transformers, PyTorch
+- **Layout Reference**: [Tonic/Nvidia-OpenReasoning](https://huggingface.co/spaces/Tonic/Nvidia-OpenReasoning)
+
+## 🔗 Links
+
+- [Model on Hugging Face](https://huggingface.co/Tonic/petite-elle-L-aime-3-sft)
+- [Chat Template](https://huggingface.co/Tonic/petite-elle-L-aime-3-sft/blob/main/chat_template.jinja)
+- [Original App Reference](https://huggingface.co/spaces/Tonic/Nvidia-OpenReasoning)
+
+---
+
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
diff --git a/templates/spaces/demo/app.py b/templates/spaces/demo/app.py
new file mode 100644
index 0000000000000000000000000000000000000000..6e1e46a1a35988d535dd58af81645a76de186837
--- /dev/null
+++ b/templates/spaces/demo/app.py
@@ -0,0 +1,278 @@
+import gradio as gr
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import re
+import json
+from typing import List, Dict, Any, Optional
+import logging
+import spaces
+import os
+import sys
+import requests
+
+# Set torch to use float32 for better compatibility with quantized models
+torch.set_default_dtype(torch.float32)
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+# Get model ID from environment variable or use default
+MAIN_MODEL_ID = os.getenv("HF_MODEL_ID", "Tonic/petite-elle-L-aime-3-sft")
+MODEL_SUBFOLDER = os.getenv("MODEL_SUBFOLDER", "int4") # Default to int4 for CPU deployment
+MODEL_NAME = os.getenv("MODEL_NAME", "SmolLM3 Fine-tuned Model")
+
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+model = None
+tokenizer = None
+DEFAULT_SYSTEM_PROMPT = "Tu es TonicIA, un assistant francophone rigoureux et bienveillant."
+title = f"# 🤖 {MODEL_NAME} - Chat Interface"
+description = f"A fine-tuned version of SmolLM3-3B optimized for conversations. This is the {MODEL_SUBFOLDER} quantized version for efficient deployment."
+presentation1 = """
+### 🎯 Features
+- **Multilingual Support**: English, French, Italian, Portuguese, Chinese, Arabic
+- **Quantized Model**: Optimized for deployment with memory reduction
+- **Interactive Chat Interface**: Real-time conversation with the model
+- **Customizable System Prompt**: Define the assistant's personality and behavior
+- **Thinking Mode**: Enable reasoning mode with thinking tags
+"""
+presentation2 = """### 🎯 Fonctionnalités
+* **Support multilingue** : Anglais, Français, Italien, Portugais, Chinois, Arabe
+* **Modèle quantifié** : Optimisé pour un déploiement avec réduction de mémoire
+* **Interface de chat interactive** : Conversation en temps réel avec le modèle
+* **Invite système personnalisable** : Définissez la personnalité et le comportement de l'assistant
+* **Mode Réflexion** : Activez le mode raisonnement avec des balises de réflexion
+"""
+joinus = """
+## Join us :
+🌟TeamTonic🌟 is always making cool demos! Join our active builder's 🛠️community 👻 [](https://discord.gg/qdfnvSPcqP) On 🤗Huggingface:[MultiTransformer](https://huggingface.co/MultiTransformer) On 🌐Github: [Tonic-AI](https://github.com/tonic-ai) & contribute to🌟 [Build Tonic](https://git.tonic-ai.com/contribute)🤗Big thanks to Yuvi Sharma and all the folks at huggingface for the community grant 🤗
+"""
+
+
+def download_chat_template():
+ """Download the chat template from the main repository"""
+ try:
+ chat_template_url = f"https://huggingface.co/{MAIN_MODEL_ID}/raw/main/chat_template.jinja"
+ logger.info(f"Downloading chat template from {chat_template_url}")
+
+ response = requests.get(chat_template_url, timeout=30)
+ response.raise_for_status()
+
+ chat_template_content = response.text
+ logger.info("Chat template downloaded successfully")
+ return chat_template_content
+
+ except requests.exceptions.RequestException as e:
+ logger.error(f"Error downloading chat template: {e}")
+ return None
+ except Exception as e:
+ logger.error(f"Unexpected error downloading chat template: {e}")
+ return None
+
+
+def load_model():
+ """Load the model and tokenizer"""
+ global model, tokenizer
+
+ try:
+ logger.info(f"Loading tokenizer from {MAIN_MODEL_ID}")
+ if MODEL_SUBFOLDER and MODEL_SUBFOLDER.strip():
+ tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID, subfolder=MODEL_SUBFOLDER)
+ else:
+ tokenizer = AutoTokenizer.from_pretrained(MAIN_MODEL_ID)
+ chat_template = download_chat_template()
+ if chat_template:
+ tokenizer.chat_template = chat_template
+ logger.info("Chat template downloaded and set successfully")
+ else:
+ logger.warning("Could not download chat template, using default")
+
+ logger.info(f"Loading model from {MAIN_MODEL_ID}")
+ model_kwargs = {
+ "device_map": "auto" if DEVICE == "cuda" else "cpu",
+ "torch_dtype": torch.bfloat16,
+ "trust_remote_code": True,
+ "low_cpu_mem_usage": True,
+ }
+
+ logger.info(f"Model loading parameters: {model_kwargs}")
+ if MODEL_SUBFOLDER and MODEL_SUBFOLDER.strip():
+ model = AutoModelForCausalLM.from_pretrained(MAIN_MODEL_ID, subfolder=MODEL_SUBFOLDER, **model_kwargs)
+ else:
+ model = AutoModelForCausalLM.from_pretrained(MAIN_MODEL_ID, **model_kwargs)
+
+ if tokenizer.pad_token_id is None:
+ tokenizer.pad_token_id = tokenizer.eos_token_id
+
+ logger.info("Model loaded successfully")
+ return True
+
+ except Exception as e:
+ logger.error(f"Error loading model: {e}")
+ logger.error(f"Model config: {model.config if model else 'Model not loaded'}")
+ return False
+
+
+def create_prompt(system_message, user_message, enable_thinking=True):
+ """Create prompt using the model's chat template"""
+ try:
+ formatted_messages = []
+ if system_message and system_message.strip():
+ formatted_messages.append({"role": "system", "content": system_message})
+ formatted_messages.append({"role": "user", "content": user_message})
+ prompt = tokenizer.apply_chat_template(
+ formatted_messages,
+ tokenize=False,
+ add_generation_prompt=True,
+ enable_thinking=enable_thinking
+ )
+ if not enable_thinking:
+ prompt += " /no_think"
+
+ return prompt
+
+ except Exception as e:
+ logger.error(f"Error creating prompt: {e}")
+ return ""
+
+@spaces.GPU(duration=94)
+def generate_response(message, history, system_message, max_tokens, temperature, top_p, do_sample, enable_thinking=True):
+ """Generate response using the model"""
+ global model, tokenizer
+
+ if model is None or tokenizer is None:
+ return "Error: Model not loaded. Please wait for the model to load."
+ full_prompt = create_prompt(system_message, message, enable_thinking)
+
+ if not full_prompt:
+ return "Error: Failed to create prompt."
+
+ inputs = tokenizer(full_prompt, return_tensors="pt", padding=True, truncation=True)
+ logger.info(f"Input tensor shapes: {[(k, v.shape, v.dtype) for k, v in inputs.items()]}")
+
+ if DEVICE == "cuda":
+ inputs = {k: v.cuda() for k, v in inputs.items()}
+ with torch.no_grad():
+ output_ids = model.generate(
+ inputs['input_ids'],
+ max_new_tokens=max_tokens,
+ temperature=temperature,
+ top_p=top_p,
+ do_sample=do_sample,
+ attention_mask=inputs['attention_mask'],
+ pad_token_id=tokenizer.eos_token_id,
+ eos_token_id=tokenizer.eos_token_id
+ )
+ response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+ assistant_response = response[len(full_prompt):].strip()
+ assistant_response = re.sub(r'<\|im_start\|>.*?<\|im_end\|>', '', assistant_response, flags=re.DOTALL)
+ if not enable_thinking:
+        assistant_response = re.sub(r'<think>.*?</think>', '', assistant_response, flags=re.DOTALL)
+
+ assistant_response = assistant_response.strip()
+
+ return assistant_response
+
+def user(user_message, history):
+ """Add user message to history"""
+ if history is None:
+ history = []
+ return "", history + [{"role": "user", "content": user_message}]
+
+def bot(history, system_prompt, max_length, temperature, top_p, advanced_checkbox, enable_thinking):
+ """Generate bot response"""
+ if not history:
+ return history
+ user_message = history[-1]["content"] if history else ""
+
+ do_sample = advanced_checkbox
+ bot_message = generate_response(user_message, history, system_prompt, max_length, temperature, top_p, do_sample, enable_thinking)
+ history.append({"role": "assistant", "content": bot_message})
+ return history
+
+# Load model on startup
+logger.info("Starting model loading process...")
+load_model()
+
+# Create Gradio interface
+with gr.Blocks() as demo:
+ with gr.Row():
+ gr.Markdown(title)
+ with gr.Row():
+ gr.Markdown(description)
+ with gr.Row():
+ with gr.Column(scale=1):
+ with gr.Group():
+ gr.Markdown(presentation1)
+ with gr.Column(scale=1):
+ with gr.Group():
+ gr.Markdown(presentation2)
+ with gr.Row():
+ with gr.Column(scale=1):
+ with gr.Group():
+ gr.Markdown(joinus)
+ with gr.Column(scale=1):
+ pass # Empty column for balance
+
+ with gr.Row():
+ with gr.Column(scale=2):
+ system_prompt = gr.TextArea(
+ label="📑 Contexte",
+ placeholder="Tu es TonicIA, un assistant francophone rigoureux et bienveillant.",
+ lines=5,
+ value=DEFAULT_SYSTEM_PROMPT
+ )
+ user_input = gr.TextArea(
+ label="🤷🏻♂️ Message",
+ placeholder="Bonjour je m'appel Tonic!",
+ lines=2
+ )
+ advanced_checkbox = gr.Checkbox(label="🧪 Advanced Settings", value=False)
+ with gr.Column(visible=False) as advanced_settings:
+ max_length = gr.Slider(
+ label="📏 Longueur de la réponse",
+ minimum=10,
+ maximum=556,
+ value=120,
+ step=1
+ )
+ temperature = gr.Slider(
+ label="🌡️ Température",
+ minimum=0.01,
+ maximum=1.0,
+ value=0.5,
+ step=0.01
+ )
+ top_p = gr.Slider(
+ label="⚛️ Top-p (Echantillonnage)",
+ minimum=0.1,
+ maximum=1.0,
+ value=0.95,
+ step=0.01
+ )
+ enable_thinking = gr.Checkbox(label="Mode Réflexion", value=True)
+
+ generate_button = gr.Button(value=f"🤖 {MODEL_NAME}")
+
+ with gr.Column(scale=2):
+ chatbot = gr.Chatbot(label=f"🤖 {MODEL_NAME}", type="messages", value=[])
+
+ generate_button.click(
+ user,
+ [user_input, chatbot],
+ [user_input, chatbot],
+ queue=False
+ ).then(
+ bot,
+ [chatbot, system_prompt, max_length, temperature, top_p, advanced_checkbox, enable_thinking],
+ chatbot
+ )
+
+ advanced_checkbox.change(
+ fn=lambda x: gr.update(visible=x),
+ inputs=[advanced_checkbox],
+ outputs=[advanced_settings]
+ )
+
+if __name__ == "__main__":
+
+ demo.queue()
+ demo.launch(ssr_mode=False, mcp_server=True)
\ No newline at end of file
diff --git a/templates/spaces/demo/requirements.txt b/templates/spaces/demo/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..c1888109fbeb126b6792e9bf435f4f52fe3988b9
--- /dev/null
+++ b/templates/spaces/demo/requirements.txt
@@ -0,0 +1,7 @@
+gradio>=5.38.2
+torch>=2.0.0
+transformers>=4.54.0
+accelerate>=0.20.0
+sentencepiece>=0.1.99
+protobuf>=3.20.0
+requests>=2.28.0
diff --git a/tests/debug_trackio.py b/tests/debug_trackio.py
deleted file mode 100644
index f67aee6bfe579c52cf46f0c887678758663e49e3..0000000000000000000000000000000000000000
--- a/tests/debug_trackio.py
+++ /dev/null
@@ -1,94 +0,0 @@
-#!/usr/bin/env python3
-"""
-Debug script to test Trackio data structure and identify plotting issues
-"""
-
-import json
-import os
-from datetime import datetime
-import pandas as pd
-
-def debug_trackio_data():
- """Debug the Trackio data structure"""
-
- # Check if data file exists
- data_file = "trackio_experiments.json"
- print(f"🔍 Checking for data file: {data_file}")
-
- if os.path.exists(data_file):
- print("✅ Data file exists")
- with open(data_file, 'r') as f:
- data = json.load(f)
- print(f"📊 Data structure: {json.dumps(data, indent=2)}")
-
- experiments = data.get('experiments', {})
- print(f"📈 Found {len(experiments)} experiments")
-
- for exp_id, exp_data in experiments.items():
- print(f"\n🔬 Experiment: {exp_id}")
- print(f" Name: {exp_data.get('name', 'N/A')}")
- print(f" Status: {exp_data.get('status', 'N/A')}")
- print(f" Metrics count: {len(exp_data.get('metrics', []))}")
-
- # Check metrics structure
- metrics = exp_data.get('metrics', [])
- if metrics:
- print(f" Latest metric entry: {json.dumps(metrics[-1], indent=2)}")
-
- # Test DataFrame conversion
- data_list = []
- for metric_entry in metrics:
- step = metric_entry.get('step', 0)
- timestamp = metric_entry.get('timestamp', '')
- metrics_data = metric_entry.get('metrics', {})
-
- row = {'step': step, 'timestamp': timestamp}
- row.update(metrics_data)
- data_list.append(row)
-
- df = pd.DataFrame(data_list)
- print(f" DataFrame shape: {df.shape}")
- print(f" DataFrame columns: {list(df.columns)}")
- if not df.empty:
- print(f" Sample data:\n{df.head()}")
- else:
- print(" ❌ No metrics found")
- else:
- print("❌ Data file does not exist")
-
- # Create a test experiment to see if data persists
- print("\n🧪 Creating test experiment...")
- test_data = {
- 'experiments': {
- 'test_exp_001': {
- 'id': 'test_exp_001',
- 'name': 'Test Experiment',
- 'description': 'Debug test',
- 'created_at': datetime.now().isoformat(),
- 'status': 'running',
- 'metrics': [
- {
- 'timestamp': datetime.now().isoformat(),
- 'step': 25,
- 'metrics': {
- 'loss': 1.165,
- 'accuracy': 0.75,
- 'learning_rate': 3.5e-6
- }
- }
- ],
- 'parameters': {},
- 'artifacts': [],
- 'logs': []
- }
- },
- 'current_experiment': 'test_exp_001',
- 'last_updated': datetime.now().isoformat()
- }
-
- with open(data_file, 'w') as f:
- json.dump(test_data, f, indent=2)
- print("✅ Created test data file")
-
-if __name__ == "__main__":
- debug_trackio_data()
\ No newline at end of file
diff --git a/tests/fix_trackio_persistence.py b/tests/fix_trackio_persistence.py
deleted file mode 100644
index 8dfdd1b8ae04f212adc28ff75cd700ec2e9d7434..0000000000000000000000000000000000000000
--- a/tests/fix_trackio_persistence.py
+++ /dev/null
@@ -1,264 +0,0 @@
-#!/usr/bin/env python3
-"""
-Fix script to manually add missing experiments to trackio_experiments.json
-"""
-
-import json
-import os
-from datetime import datetime
-
-def add_missing_experiments():
- """Add the missing experiments from the logs to the data file"""
-
- data_file = "trackio_experiments.json"
-
- # Load existing data
- if os.path.exists(data_file):
- with open(data_file, 'r') as f:
- data = json.load(f)
- else:
- data = {
- 'experiments': {},
- 'current_experiment': None,
- 'last_updated': datetime.now().isoformat()
- }
-
- # Add the missing experiments based on the logs
- experiments = data['experiments']
-
- # Experiment 1: exp_20250720_130853
- experiments['exp_20250720_130853'] = {
- 'id': 'exp_20250720_130853',
- 'name': 'petite-elle-l-aime-3',
- 'description': 'SmolLM3 fine-tuning experiment',
- 'created_at': '2025-07-20T11:20:01.780908',
- 'status': 'running',
- 'metrics': [
- {
- 'timestamp': '2025-07-20T11:20:01.780908',
- 'step': 25,
- 'metrics': {
- 'loss': 1.1659,
- 'grad_norm': 10.3125,
- 'learning_rate': 7e-08,
- 'num_tokens': 1642080.0,
- 'mean_token_accuracy': 0.75923578992486,
- 'epoch': 0.004851130919895701
- }
- },
- {
- 'timestamp': '2025-07-20T11:26:39.042155',
- 'step': 50,
- 'metrics': {
- 'loss': 1.165,
- 'grad_norm': 10.75,
- 'learning_rate': 1.4291666666666667e-07,
- 'num_tokens': 3324682.0,
- 'mean_token_accuracy': 0.7577659255266189,
- 'epoch': 0.009702261839791402
- }
- },
- {
- 'timestamp': '2025-07-20T11:33:16.203045',
- 'step': 75,
- 'metrics': {
- 'loss': 1.1639,
- 'grad_norm': 10.6875,
- 'learning_rate': 2.1583333333333334e-07,
- 'num_tokens': 4987941.0,
- 'mean_token_accuracy': 0.7581205774843692,
- 'epoch': 0.014553392759687101
- }
- },
- {
- 'timestamp': '2025-07-20T11:39:53.453917',
- 'step': 100,
- 'metrics': {
- 'loss': 1.1528,
- 'grad_norm': 10.75,
- 'learning_rate': 2.8875e-07,
- 'num_tokens': 6630190.0,
- 'mean_token_accuracy': 0.7614579878747463,
- 'epoch': 0.019404523679582803
- }
- }
- ],
- 'parameters': {
- 'model_name': 'HuggingFaceTB/SmolLM3-3B',
- 'max_seq_length': 12288,
- 'use_flash_attention': True,
- 'use_gradient_checkpointing': False,
- 'batch_size': 8,
- 'gradient_accumulation_steps': 16,
- 'learning_rate': 3.5e-06,
- 'weight_decay': 0.01,
- 'warmup_steps': 1200,
- 'max_iters': 18000,
- 'eval_interval': 1000,
- 'log_interval': 25,
- 'save_interval': 2000,
- 'optimizer': 'adamw_torch',
- 'beta1': 0.9,
- 'beta2': 0.999,
- 'eps': 1e-08,
- 'scheduler': 'cosine',
- 'min_lr': 3.5e-07,
- 'fp16': False,
- 'bf16': True,
- 'ddp_backend': 'nccl',
- 'ddp_find_unused_parameters': False,
- 'save_steps': 2000,
- 'eval_steps': 1000,
- 'logging_steps': 25,
- 'save_total_limit': 5,
- 'eval_strategy': 'steps',
- 'metric_for_best_model': 'eval_loss',
- 'greater_is_better': False,
- 'load_best_model_at_end': True,
- 'data_dir': None,
- 'train_file': None,
- 'validation_file': None,
- 'test_file': None,
- 'use_chat_template': True,
- 'chat_template_kwargs': {'add_generation_prompt': True, 'no_think_system_message': True},
- 'enable_tracking': True,
- 'trackio_url': 'https://tonic-test-trackio-test.hf.space',
- 'trackio_token': None,
- 'log_artifacts': True,
- 'log_metrics': True,
- 'log_config': True,
- 'experiment_name': 'petite-elle-l-aime-3',
- 'dataset_name': 'legmlai/openhermes-fr',
- 'dataset_split': 'train',
- 'input_field': 'prompt',
- 'target_field': 'accepted_completion',
- 'filter_bad_entries': True,
- 'bad_entry_field': 'bad_entry',
- 'packing': False,
- 'max_prompt_length': 12288,
- 'max_completion_length': 8192,
- 'truncation': True,
- 'dataloader_num_workers': 10,
- 'dataloader_pin_memory': True,
- 'dataloader_prefetch_factor': 3,
- 'max_grad_norm': 1.0,
- 'group_by_length': True
- },
- 'artifacts': [],
- 'logs': []
- }
-
- # Experiment 2: exp_20250720_134319
- experiments['exp_20250720_134319'] = {
- 'id': 'exp_20250720_134319',
- 'name': 'petite-elle-l-aime-3-1',
- 'description': 'SmolLM3 fine-tuning experiment',
- 'created_at': '2025-07-20T11:54:31.993219',
- 'status': 'running',
- 'metrics': [
- {
- 'timestamp': '2025-07-20T11:54:31.993219',
- 'step': 25,
- 'metrics': {
- 'loss': 1.166,
- 'grad_norm': 10.375,
- 'learning_rate': 7e-08,
- 'num_tokens': 1642080.0,
- 'mean_token_accuracy': 0.7590958896279335,
- 'epoch': 0.004851130919895701
- }
- },
- {
- 'timestamp': '2025-07-20T11:54:33.589487',
- 'step': 25,
- 'metrics': {
- 'gpu_0_memory_allocated': 17.202261447906494,
- 'gpu_0_memory_reserved': 75.474609375,
- 'gpu_0_utilization': 0,
- 'cpu_percent': 2.7,
- 'memory_percent': 10.1
- }
- }
- ],
- 'parameters': {
- 'model_name': 'HuggingFaceTB/SmolLM3-3B',
- 'max_seq_length': 12288,
- 'use_flash_attention': True,
- 'use_gradient_checkpointing': False,
- 'batch_size': 8,
- 'gradient_accumulation_steps': 16,
- 'learning_rate': 3.5e-06,
- 'weight_decay': 0.01,
- 'warmup_steps': 1200,
- 'max_iters': 18000,
- 'eval_interval': 1000,
- 'log_interval': 25,
- 'save_interval': 2000,
- 'optimizer': 'adamw_torch',
- 'beta1': 0.9,
- 'beta2': 0.999,
- 'eps': 1e-08,
- 'scheduler': 'cosine',
- 'min_lr': 3.5e-07,
- 'fp16': False,
- 'bf16': True,
- 'ddp_backend': 'nccl',
- 'ddp_find_unused_parameters': False,
- 'save_steps': 2000,
- 'eval_steps': 1000,
- 'logging_steps': 25,
- 'save_total_limit': 5,
- 'eval_strategy': 'steps',
- 'metric_for_best_model': 'eval_loss',
- 'greater_is_better': False,
- 'load_best_model_at_end': True,
- 'data_dir': None,
- 'train_file': None,
- 'validation_file': None,
- 'test_file': None,
- 'use_chat_template': True,
- 'chat_template_kwargs': {'add_generation_prompt': True, 'no_think_system_message': True},
- 'enable_tracking': True,
- 'trackio_url': 'https://tonic-test-trackio-test.hf.space',
- 'trackio_token': None,
- 'log_artifacts': True,
- 'log_metrics': True,
- 'log_config': True,
- 'experiment_name': 'petite-elle-l-aime-3-1',
- 'dataset_name': 'legmlai/openhermes-fr',
- 'dataset_split': 'train',
- 'input_field': 'prompt',
- 'target_field': 'accepted_completion',
- 'filter_bad_entries': True,
- 'bad_entry_field': 'bad_entry',
- 'packing': False,
- 'max_prompt_length': 12288,
- 'max_completion_length': 8192,
- 'truncation': True,
- 'dataloader_num_workers': 10,
- 'dataloader_pin_memory': True,
- 'dataloader_prefetch_factor': 3,
- 'max_grad_norm': 1.0,
- 'group_by_length': True
- },
- 'artifacts': [],
- 'logs': []
- }
-
- # Update metadata
- data['current_experiment'] = 'exp_20250720_134319'
- data['last_updated'] = datetime.now().isoformat()
-
- # Save the updated data
- with open(data_file, 'w') as f:
- json.dump(data, f, indent=2)
-
- print("✅ Added missing experiments to trackio_experiments.json")
- print(f"📊 Total experiments: {len(experiments)}")
- print("🔬 Experiments added:")
- print(" - exp_20250720_130853 (petite-elle-l-aime-3)")
- print(" - exp_20250720_134319 (petite-elle-l-aime-3-1)")
- print("\n🎯 You can now view these experiments in the Trackio interface!")
-
-if __name__ == "__main__":
- add_missing_experiments()
\ No newline at end of file
diff --git a/tests/integrate_monitoring.py b/tests/integrate_monitoring.py
deleted file mode 100644
index 965224ec4e6018c63dc9e1c96b2910015fd8ba0c..0000000000000000000000000000000000000000
--- a/tests/integrate_monitoring.py
+++ /dev/null
@@ -1,267 +0,0 @@
-#!/usr/bin/env python3
-"""
-Script to integrate improved monitoring with HF Datasets into training scripts
-"""
-
-import os
-import sys
-import re
-from pathlib import Path
-
-def update_training_script(script_path: str):
- """Update a training script to include improved monitoring"""
-
- print(f"🔧 Updating {script_path}...")
-
- with open(script_path, 'r', encoding='utf-8') as f:
- content = f.read()
-
- # Check if monitoring is already imported
- if 'from monitoring import' in content:
- print(f" ⚠️ Monitoring already imported in {script_path}")
- return False
-
- # Add monitoring import
- import_pattern = r'(from \w+ import.*?)(\n\n|\n$)'
- match = re.search(import_pattern, content, re.MULTILINE | re.DOTALL)
-
- if match:
- # Add monitoring import after existing imports
- new_import = match.group(1) + '\nfrom monitoring import create_monitor_from_config\n' + match.group(2)
- content = content.replace(match.group(0), new_import)
- else:
- # Add at the beginning if no imports found
- content = 'from monitoring import create_monitor_from_config\n\n' + content
-
- # Find the main training function and add monitoring
- # Look for patterns like "def main():" or "def train():"
- main_patterns = [
- r'def main\(\):',
- r'def train\(\):',
- r'def run_training\(\):'
- ]
-
- monitoring_added = False
- for pattern in main_patterns:
- if re.search(pattern, content):
- # Add monitoring initialization after config loading
- config_pattern = r'(config\s*=\s*get_config\([^)]+\))'
- config_match = re.search(config_pattern, content)
-
- if config_match:
- monitoring_code = '''
- # Initialize monitoring
- monitor = None
- if config.enable_tracking:
- try:
- monitor = create_monitor_from_config(config, getattr(config, 'experiment_name', None))
- logger.info(f"✅ Monitoring initialized for experiment: {monitor.experiment_name}")
- logger.info(f"📊 Dataset repository: {monitor.dataset_repo}")
-
- # Log configuration
- config_dict = {k: v for k, v in vars(config).items() if not k.startswith('_')}
- monitor.log_configuration(config_dict)
-
- except Exception as e:
- logger.error(f"Failed to initialize monitoring: {e}")
- logger.warning("Continuing without monitoring...")
-'''
-
- # Insert monitoring code after config loading
- insert_point = config_match.end()
- content = content[:insert_point] + monitoring_code + content[insert_point:]
-
- # Add monitoring callback to trainer
- trainer_pattern = r'(trainer\s*=\s*[^)]+\))'
- trainer_match = re.search(trainer_pattern, content)
-
- if trainer_match:
- callback_code = '''
- # Add monitoring callback if available
- if monitor:
- try:
- callback = monitor.create_monitoring_callback()
- trainer.add_callback(callback)
- logger.info("✅ Monitoring callback added to trainer")
- except Exception as e:
- logger.error(f"Failed to add monitoring callback: {e}")
-'''
-
- insert_point = trainer_match.end()
- content = content[:insert_point] + callback_code + content[insert_point:]
-
- # Add training summary logging
- train_pattern = r'(trainer\.train\(\))'
- train_match = re.search(train_pattern, content)
-
- if train_match:
- summary_code = '''
- # Log training summary
- if monitor:
- try:
- summary = {
- 'final_loss': getattr(trainer, 'final_loss', None),
- 'total_steps': getattr(trainer, 'total_steps', None),
- 'training_duration': getattr(trainer, 'training_duration', None),
- 'model_path': output_path,
- 'config_file': config_path
- }
- monitor.log_training_summary(summary)
- logger.info("✅ Training summary logged")
- except Exception as e:
- logger.error(f"Failed to log training summary: {e}")
-'''
-
- # Find the training call and add summary after it
- train_call_pattern = r'(trainer\.train\(\)\s*\n\s*logger\.info\("Training completed successfully!"\))'
- train_call_match = re.search(train_call_pattern, content)
-
- if train_call_match:
- insert_point = train_call_match.end()
- content = content[:insert_point] + summary_code + content[insert_point:]
-
- # Add error handling and cleanup
- error_pattern = r'(except Exception as e:\s*\n\s*logger\.error\(f"Training failed: {e}"\)\s*\n\s*raise)'
- error_match = re.search(error_pattern, content)
-
- if error_match:
- error_code = '''
- # Log error to monitoring
- if monitor:
- try:
- error_summary = {
- 'error': str(e),
- 'status': 'failed',
- 'model_path': output_path,
- 'config_file': config_path
- }
- monitor.log_training_summary(error_summary)
- except Exception as log_error:
- logger.error(f"Failed to log error to monitoring: {log_error}")
-'''
-
- insert_point = error_match.end()
- content = content[:insert_point] + error_code + content[insert_point:]
-
- # Add finally block for cleanup
- finally_pattern = r'(raise\s*\n\s*if __name__ == \'__main__\':)'
- finally_match = re.search(finally_pattern, content)
-
- if finally_match:
- cleanup_code = '''
- finally:
- # Close monitoring
- if monitor:
- try:
- monitor.close()
- logger.info("✅ Monitoring session closed")
- except Exception as e:
- logger.error(f"Failed to close monitoring: {e}")
-
-'''
-
- insert_point = finally_match.start()
- content = content[:insert_point] + cleanup_code + content[insert_point:]
-
- monitoring_added = True
- break
-
- if monitoring_added:
- # Write updated content
- with open(script_path, 'w', encoding='utf-8') as f:
- f.write(content)
-
- print(f" ✅ Updated {script_path} with monitoring integration")
- return True
- else:
- print(f" ⚠️ Could not find main training function in {script_path}")
- return False
-
-def update_config_files():
- """Update configuration files to include HF Datasets support"""
-
- config_dir = Path("config")
- config_files = list(config_dir.glob("*.py"))
-
- print(f"🔧 Updating configuration files...")
-
- for config_file in config_files:
- if config_file.name.startswith("__"):
- continue
-
- print(f" 📝 Checking {config_file.name}...")
-
- with open(config_file, 'r', encoding='utf-8') as f:
- content = f.read()
-
- # Check if HF Datasets config is already present
- if 'TRACKIO_DATASET_REPO' in content:
- print(f" ⚠️ HF Datasets config already present in {config_file.name}")
- continue
-
- # Add HF Datasets configuration
- trackio_pattern = r'(# Trackio monitoring configuration.*?experiment_name: Optional\[str\] = None)'
- trackio_match = re.search(trackio_pattern, content, re.DOTALL)
-
- if trackio_match:
- hf_config = '''
- # HF Datasets configuration
- hf_token: Optional[str] = None
- dataset_repo: Optional[str] = None
-'''
-
- insert_point = trackio_match.end()
- content = content[:insert_point] + hf_config + content[insert_point:]
-
- # Write updated content
- with open(config_file, 'w', encoding='utf-8') as f:
- f.write(content)
-
- print(f" ✅ Added HF Datasets config to {config_file.name}")
- else:
- print(f" ⚠️ Could not find Trackio config section in {config_file.name}")
-
-def main():
- """Main function to integrate monitoring into all training scripts"""
-
- print("🚀 Integrating improved monitoring with HF Datasets...")
- print("=" * 60)
-
- # Update main training script
- main_script = "train.py"
- if os.path.exists(main_script):
- update_training_script(main_script)
- else:
- print(f"⚠️ Main training script {main_script} not found")
-
- # Update configuration files
- update_config_files()
-
- # Update any other training scripts in config directory
- config_dir = Path("config")
- training_scripts = [
- "train_smollm3_openhermes_fr.py",
- "train_smollm3_openhermes_fr_a100_balanced.py",
- "train_smollm3_openhermes_fr_a100_large.py",
- "train_smollm3_openhermes_fr_a100_max_performance.py",
- "train_smollm3_openhermes_fr_a100_multiple_passes.py"
- ]
-
- print(f"\n🔧 Updating training scripts in config directory...")
-
- for script_name in training_scripts:
- script_path = config_dir / script_name
- if script_path.exists():
- update_training_script(str(script_path))
- else:
- print(f" ⚠️ Training script {script_name} not found")
-
- print(f"\n✅ Monitoring integration completed!")
- print(f"\n📋 Next steps:")
- print(f"1. Set HF_TOKEN environment variable")
- print(f"2. Optionally set TRACKIO_DATASET_REPO")
- print(f"3. Run your training scripts with monitoring enabled")
- print(f"4. Check your HF Dataset repository for experiment data")
-
-if __name__ == "__main__":
- main()
\ No newline at end of file
diff --git a/tests/quick_test_training.py b/tests/quick_test_training.py
deleted file mode 100644
index 519da523945f03d77ffa09dcc6082bcada08bf74..0000000000000000000000000000000000000000
--- a/tests/quick_test_training.py
+++ /dev/null
@@ -1,60 +0,0 @@
-#!/usr/bin/env python3
-"""
-Quick test for the training fix
-"""
-
-import os
-import sys
-
-# Add project root to path
-project_root = os.path.dirname(os.path.abspath(__file__))
-sys.path.insert(0, project_root)
-
-def main():
- print("🔧 Testing H100 Lightweight Training Fix")
- print("=" * 50)
-
- # Set environment variables to fix mixed precision issues
- os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
- os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
- os.environ["TORCH_USE_CUDA_DSA"] = "1"
-
- print("✅ Environment variables set")
-
- # Test configuration
- try:
- from config.train_smollm3_h100_lightweight import SmolLM3ConfigH100Lightweight
- config = SmolLM3ConfigH100Lightweight()
- print(f"✅ Configuration loaded: fp16={config.fp16}, bf16={config.bf16}")
-
- # Test model loading (without actually loading the full model)
- from src.model import SmolLM3Model
-
- # Create model instance
- model = SmolLM3Model(
- model_name="HuggingFaceTB/SmolLM3-3B",
- max_seq_length=4096,
- config=config
- )
-
- print(f"✅ Model dtype: {model.torch_dtype}")
- print(f"✅ Model device map: {model.device_map}")
-
- # Test training arguments
- training_args = model.get_training_arguments("/tmp/test")
- print(f"✅ Training args: fp16={training_args.fp16}, bf16={training_args.bf16}")
-
- print("\n🎉 All tests passed!")
- print("You can now run the training with:")
- print(" ./launch.sh")
-
- except Exception as e:
- print(f"❌ Error: {e}")
- import traceback
- traceback.print_exc()
- return 1
-
- return 0
-
-if __name__ == "__main__":
- exit(main())
\ No newline at end of file
diff --git a/tests/test_app_config.py b/tests/test_app_config.py
deleted file mode 100644
index 9ef7b13cb3d2e0d882ebdb93263fdc0957e2081b..0000000000000000000000000000000000000000
--- a/tests/test_app_config.py
+++ /dev/null
@@ -1,112 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script for the new configuration functionality in app.py
-"""
-
-import os
-import sys
-from unittest.mock import patch
-
-def test_trackio_space_initialization():
- """Test TrackioSpace initialization with different parameters"""
- print("🧪 Testing TrackioSpace initialization...")
-
- # Import the app module
- import templates.spaces.app as app
-
- # Test 1: Default initialization (uses environment variables)
- print("\n1. Testing default initialization...")
- trackio = app.TrackioSpace()
- print(f" Dataset repo: {trackio.dataset_repo}")
- print(f" HF token set: {'Yes' if trackio.hf_token else 'No'}")
-
- # Test 2: Custom initialization
- print("\n2. Testing custom initialization...")
- trackio_custom_config = app.TrackioSpace(
- hf_token="test_token_123",
- dataset_repo="test-user/test-dataset"
- )
- print(f" Dataset repo: {trackio_custom_config.dataset_repo}")
- print(f" HF token set: {'Yes' if trackio_custom_config.hf_token else 'No'}")
-
- # Test 3: Partial custom initialization
- print("\n3. Testing partial custom initialization...")
- trackio_partial = app.TrackioSpace(dataset_repo="another-user/another-dataset")
- print(f" Dataset repo: {trackio_partial.dataset_repo}")
- print(f" HF token set: {'Yes' if trackio_partial.hf_token else 'No'}")
-
- print("✅ TrackioSpace initialization tests passed!")
-
-def test_configuration_functions():
- """Test the configuration functions"""
- print("\n🧪 Testing configuration functions...")
-
- import templates.spaces.app as app
-
- # Test update_trackio_config function
- print("\n1. Testing update_trackio_config...")
- result = app.update_trackio_config("test_token", "test-user/test-dataset")
- print(f" Result: {result}")
-
- # Test test_dataset_connection function
- print("\n2. Testing test_dataset_connection...")
- result = app.test_dataset_connection("", "test-user/test-dataset")
- print(f" Result: {result}")
-
- # Test create_dataset_repository function
- print("\n3. Testing create_dataset_repository...")
- result = app.create_dataset_repository("", "test-user/test-dataset")
- print(f" Result: {result}")
-
- print("✅ Configuration function tests passed!")
-
-def test_environment_variables():
- """Test environment variable handling"""
- print("\n🧪 Testing environment variable handling...")
-
- # Test with environment variables set
- with patch.dict(os.environ, {
- 'HF_TOKEN': 'env_test_token',
- 'TRACKIO_DATASET_REPO': 'env-user/env-dataset'
- }):
- import templates.spaces.app as app
- trackio = app.TrackioSpace()
- print(f" Dataset repo: {trackio.dataset_repo}")
- print(f" HF token set: {'Yes' if trackio.hf_token else 'No'}")
-
- # Test with no environment variables
- with patch.dict(os.environ, {}, clear=True):
- import templates.spaces.app as app
- trackio = app.TrackioSpace()
- print(f" Dataset repo: {trackio.dataset_repo}")
- print(f" HF token set: {'Yes' if trackio.hf_token else 'No'}")
-
- print("✅ Environment variable tests passed!")
-
-def main():
- """Run all tests"""
- print("🚀 Testing App Configuration Features")
- print("=" * 50)
-
- try:
- test_trackio_space_initialization()
- test_configuration_functions()
- test_environment_variables()
-
- print("\n🎉 All tests passed!")
- print("\n📋 Configuration Features:")
- print("✅ HF Token input field")
- print("✅ Dataset Repository input field")
- print("✅ Environment variable fallback")
- print("✅ Configuration update function")
- print("✅ Connection testing function")
- print("✅ Dataset creation function")
- print("✅ Gradio interface integration")
-
- except Exception as e:
- print(f"\n❌ Test failed: {e}")
- import traceback
- traceback.print_exc()
-
-if __name__ == "__main__":
- main()
\ No newline at end of file
diff --git a/tests/test_dataset_setup_fix.py b/tests/test_dataset_setup_fix.py
deleted file mode 100644
index a86b7efb2f33df979b9b8411cfb11756549407b6..0000000000000000000000000000000000000000
--- a/tests/test_dataset_setup_fix.py
+++ /dev/null
@@ -1,182 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify dataset setup works with the token
-"""
-
-import os
-import sys
-from pathlib import Path
-
-# Add the scripts directory to the path
-sys.path.append(str(Path(__file__).parent.parent / "scripts" / "dataset_tonic"))
-
-def test_dataset_setup_with_token():
- """Test dataset setup with the provided token"""
- print("🔍 Testing Dataset Setup with Token")
- print("=" * 50)
-
- # Test token from user
- test_token = "xx"
-
- print(f"Testing dataset setup with token: {'*' * 10}...{test_token[-4:]}")
-
- # Set environment variable
- os.environ['HUGGING_FACE_HUB_TOKEN'] = test_token
- os.environ['HF_TOKEN'] = test_token
-
- # Import the dataset setup function
- try:
- from setup_hf_dataset import get_username_from_token, setup_trackio_dataset
- print("✅ Dataset setup module imported successfully")
- except ImportError as e:
- print(f"❌ Failed to import dataset setup module: {e}")
- return False
-
- # Test username extraction
- try:
- username = get_username_from_token(test_token)
-
- if username:
- print(f"✅ Username extraction successful: {username}")
- else:
- print(f"❌ Username extraction failed")
- return False
-
- except Exception as e:
- print(f"❌ Username extraction error: {e}")
- return False
-
- # Test setup function with token parameter
- try:
- # Test with token parameter
- success = setup_trackio_dataset("test-dataset", test_token)
-
- if success:
- print("✅ Dataset setup with token parameter successful")
- return True
- else:
- print("❌ Dataset setup with token parameter failed")
- return False
-
- except Exception as e:
- print(f"❌ Dataset setup error: {e}")
- return False
-
-def test_dataset_setup_with_environment():
- """Test dataset setup with environment variables"""
- print("\n🔍 Testing Dataset Setup with Environment Variables")
- print("=" * 50)
-
- # Test token from user
- test_token = "xxx"
-
- print(f"Testing dataset setup with environment variables: {'*' * 10}...{test_token[-4:]}")
-
- # Set environment variables
- os.environ['HUGGING_FACE_HUB_TOKEN'] = test_token
- os.environ['HF_TOKEN'] = test_token
-
- # Import the dataset setup function
- try:
- from setup_hf_dataset import setup_trackio_dataset
- print("✅ Dataset setup module imported successfully")
- except ImportError as e:
- print(f"❌ Failed to import dataset setup module: {e}")
- return False
-
- # Test setup function with environment variables
- try:
- # Test with environment variables only
- success = setup_trackio_dataset("test-dataset-env")
-
- if success:
- print("✅ Dataset setup with environment variables successful")
- return True
- else:
- print("❌ Dataset setup with environment variables failed")
- return False
-
- except Exception as e:
- print(f"❌ Dataset setup error: {e}")
- return False
-
-def test_main_function():
- """Test the main function with command line arguments"""
- print("\n🔍 Testing Main Function with Command Line Arguments")
- print("=" * 50)
-
- # Test token from user
- test_token = "xxx"
-
- print(f"Testing main function with command line arguments: {'*' * 10}...{test_token[-4:]}")
-
- # Import the main function
- try:
- from setup_hf_dataset import main
- print("✅ Main function imported successfully")
- except ImportError as e:
- print(f"❌ Failed to import main function: {e}")
- return False
-
- # Test main function (this will actually try to create a dataset)
- try:
- # Save original sys.argv
- original_argv = sys.argv.copy()
-
- # Set up command line arguments
- sys.argv = ['setup_hf_dataset.py', test_token, 'test-dataset-main']
-
- # Set environment variables
- os.environ['HUGGING_FACE_HUB_TOKEN'] = test_token
- os.environ['HF_TOKEN'] = test_token
-
- # Note: We won't actually call main() as it would create a real dataset
- # Instead, we'll just verify the function exists and can be imported
- print("✅ Main function is properly configured")
- print("✅ Command line argument handling is set up correctly")
-
- # Restore original sys.argv
- sys.argv = original_argv
-
- return True
-
- except Exception as e:
- print(f"❌ Main function test error: {e}")
- return False
-
-def main():
- """Run all dataset setup tests"""
- print("🚀 Dataset Setup Token Fix Verification")
- print("=" * 50)
-
- tests = [
- test_dataset_setup_with_token,
- test_dataset_setup_with_environment,
- test_main_function
- ]
-
- all_passed = True
- for test in tests:
- try:
- if not test():
- all_passed = False
- except Exception as e:
- print(f"❌ Test failed with error: {e}")
- all_passed = False
-
- print("\n" + "=" * 50)
- if all_passed:
- print("🎉 ALL DATASET SETUP TESTS PASSED!")
- print("✅ Token parameter handling: Working")
- print("✅ Environment variable handling: Working")
- print("✅ Main function configuration: Working")
- print("\nThe dataset setup token handling is working correctly!")
- else:
- print("❌ SOME DATASET SETUP TESTS FAILED!")
- print("Please check the failed tests above.")
-
- return all_passed
-
-if __name__ == "__main__":
- success = main()
- sys.exit(0 if success else 1)
\ No newline at end of file
diff --git a/tests/test_dataset_token_fix.py b/tests/test_dataset_token_fix.py
deleted file mode 100644
index 3474c74f2f70f669945eeef7fa3caf19e2dca3f3..0000000000000000000000000000000000000000
--- a/tests/test_dataset_token_fix.py
+++ /dev/null
@@ -1,214 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify dataset setup works with token passed as argument
-"""
-
-import os
-import sys
-import subprocess
-from pathlib import Path
-
-def test_dataset_setup_with_token_argument():
- """Test dataset setup with token passed as command line argument"""
- print("🔍 Testing Dataset Setup with Token Argument")
- print("=" * 50)
-
- # Test token from user
- test_token = "xxxx"
-
- print(f"Testing dataset setup with token argument: {'*' * 10}...{test_token[-4:]}")
-
- # Set environment variables
- os.environ['HF_TOKEN'] = test_token
- os.environ['HUGGING_FACE_HUB_TOKEN'] = test_token
- os.environ['HF_USERNAME'] = 'Tonic'
-
- # Import the dataset setup function
- try:
- sys.path.append(str(Path(__file__).parent.parent / "scripts" / "dataset_tonic"))
- from setup_hf_dataset import setup_trackio_dataset
- print("✅ Dataset setup module imported successfully")
- except ImportError as e:
- print(f"❌ Failed to import dataset setup module: {e}")
- return False
-
- # Test setup function with token parameter
- try:
- # Test with token parameter
- success = setup_trackio_dataset("test-dataset-token-arg", test_token)
-
- if success:
- print("✅ Dataset setup with token argument successful")
- return True
- else:
- print("❌ Dataset setup with token argument failed")
- return False
-
- except Exception as e:
- print(f"❌ Dataset setup error: {e}")
- return False
-
-def test_dataset_setup_with_environment():
- """Test dataset setup with environment variables only"""
- print("\n🔍 Testing Dataset Setup with Environment Variables")
- print("=" * 50)
-
- # Test token from user
- test_token = "xxxx"
-
- print(f"Testing dataset setup with environment variables: {'*' * 10}...{test_token[-4:]}")
-
- # Set environment variables
- os.environ['HF_TOKEN'] = test_token
- os.environ['HUGGING_FACE_HUB_TOKEN'] = test_token
- os.environ['HF_USERNAME'] = 'Tonic'
-
- # Import the dataset setup function
- try:
- sys.path.append(str(Path(__file__).parent.parent / "scripts" / "dataset_tonic"))
- from setup_hf_dataset import setup_trackio_dataset
- print("✅ Dataset setup module imported successfully")
- except ImportError as e:
- print(f"❌ Failed to import dataset setup module: {e}")
- return False
-
- # Test setup function with environment variables only
- try:
- # Test with environment variables only
- success = setup_trackio_dataset("test-dataset-env")
-
- if success:
- print("✅ Dataset setup with environment variables successful")
- return True
- else:
- print("❌ Dataset setup with environment variables failed")
- return False
-
- except Exception as e:
- print(f"❌ Dataset setup error: {e}")
- return False
-
-def test_launch_script_token_passing():
- """Test that launch script passes token to dataset setup script"""
- print("\n🔍 Testing Launch Script Token Passing")
- print("=" * 50)
-
- # Check if launch.sh exists
- launch_script = Path("launch.sh")
- if not launch_script.exists():
- print("❌ launch.sh not found")
- return False
-
- # Read launch script and check for token passing
- script_content = launch_script.read_text(encoding='utf-8')
-
- # Check for token passing to dataset setup script
- token_passing_patterns = [
- 'python3 scripts/dataset_tonic/setup_hf_dataset.py "$HF_TOKEN"',
- 'python3 scripts/dataset_tonic/setup_hf_dataset.py "$HF_TOKEN" "$CUSTOM_DATASET_NAME"'
- ]
-
- all_found = True
- for pattern in token_passing_patterns:
- if pattern in script_content:
- print(f"✅ Found: {pattern}")
- else:
- print(f"❌ Missing: {pattern}")
- all_found = False
-
- # Check that old calls without token are removed
- old_patterns = [
- 'python3 scripts/dataset_tonic/setup_hf_dataset.py "$CUSTOM_DATASET_NAME"',
- 'python3 scripts/dataset_tonic/setup_hf_dataset.py'
- ]
-
- for pattern in old_patterns:
- if pattern in script_content:
- print(f"❌ Found old pattern (should be updated): {pattern}")
- all_found = False
- else:
- print(f"✅ Old pattern removed: {pattern}")
-
- return all_found
-
-def test_main_function_token_handling():
- """Test the main function handles token correctly"""
- print("\n🔍 Testing Main Function Token Handling")
- print("=" * 50)
-
- # Test token from user
- test_token = "xxxx"
-
- # Import the main function
- try:
- sys.path.append(str(Path(__file__).parent.parent / "scripts" / "dataset_tonic"))
- from setup_hf_dataset import main
- print("✅ Main function imported successfully")
- except ImportError as e:
- print(f"❌ Failed to import main function: {e}")
- return False
-
- # Test main function (this will actually try to create a dataset)
- try:
- # Save original sys.argv
- original_argv = sys.argv.copy()
-
- # Set up command line arguments
- sys.argv = ['setup_hf_dataset.py', test_token, 'test-dataset-main']
-
- # Set environment variables
- os.environ['HUGGING_FACE_HUB_TOKEN'] = test_token
- os.environ['HF_TOKEN'] = test_token
-
- # Note: We won't actually call main() as it would create a real dataset
- # Instead, we'll just verify the function exists and can be imported
- print("✅ Main function is properly configured")
- print("✅ Command line argument handling is set up correctly")
-
- # Restore original sys.argv
- sys.argv = original_argv
-
- return True
-
- except Exception as e:
- print(f"❌ Main function test error: {e}")
- return False
-
-def main():
- """Run all dataset token fix tests"""
- print("🚀 Dataset Token Fix Verification")
- print("=" * 50)
-
- tests = [
- test_dataset_setup_with_token_argument,
- test_dataset_setup_with_environment,
- test_launch_script_token_passing,
- test_main_function_token_handling
- ]
-
- all_passed = True
- for test in tests:
- try:
- if not test():
- all_passed = False
- except Exception as e:
- print(f"❌ Test failed with error: {e}")
- all_passed = False
-
- print("\n" + "=" * 50)
- if all_passed:
- print("🎉 ALL DATASET TOKEN FIX TESTS PASSED!")
- print("✅ Token argument handling: Working")
- print("✅ Environment variable handling: Working")
- print("✅ Launch script token passing: Working")
- print("✅ Main function configuration: Working")
- print("\nThe dataset setup token handling is working correctly!")
- else:
- print("❌ SOME DATASET TOKEN FIX TESTS FAILED!")
- print("Please check the failed tests above.")
-
- return all_passed
-
-if __name__ == "__main__":
- success = main()
- sys.exit(0 if success else 1)
\ No newline at end of file
diff --git a/tests/test_demo_deployment.py b/tests/test_demo_deployment.py
new file mode 100644
index 0000000000000000000000000000000000000000..f26957b26f422162047287ca474dbab95f706731
--- /dev/null
+++ b/tests/test_demo_deployment.py
@@ -0,0 +1,174 @@
+#!/usr/bin/env python3
+"""
+Test script for demo space deployment functionality
+"""
+
+import os
+import sys
+import tempfile
+import shutil
+from pathlib import Path
+
+# Add scripts to path
+sys.path.append(str(Path(__file__).parent.parent / "scripts"))
+
+from deploy_demo_space import DemoSpaceDeployer
+
+def test_demo_deployer_initialization():
+ """Test DemoSpaceDeployer initialization"""
+ print("🧪 Testing DemoSpaceDeployer initialization...")
+
+ deployer = DemoSpaceDeployer(
+ hf_token="test_token",
+ hf_username="test_user",
+ model_id="test/model",
+ subfolder="int4",
+ space_name="test-demo"
+ )
+
+ assert deployer.hf_token == "test_token"
+ assert deployer.hf_username == "test_user"
+ assert deployer.model_id == "test/model"
+ assert deployer.subfolder == "int4"
+ assert deployer.space_name == "test-demo"
+ assert deployer.space_id == "test_user/test-demo"
+
+ print("✅ DemoSpaceDeployer initialization test passed")
+
+def test_template_files_exist():
+ """Test that template files exist"""
+ print("🧪 Testing template files existence...")
+
+ template_dir = Path(__file__).parent.parent / "templates" / "spaces" / "demo"
+
+ required_files = ["app.py", "requirements.txt"]
+
+ for file_name in required_files:
+ file_path = template_dir / file_name
+ assert file_path.exists(), f"Required file {file_name} not found in templates"
+ print(f"✅ Found {file_name}")
+
+ print("✅ Template files test passed")
+
+def test_app_py_modification():
+ """Test app.py modification with environment variables"""
+ print("🧪 Testing app.py modification...")
+
+ # Create temporary directory
+ with tempfile.TemporaryDirectory() as temp_dir:
+ temp_path = Path(temp_dir)
+
+ # Copy template files
+ template_dir = Path(__file__).parent.parent / "templates" / "spaces" / "demo"
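+ # dirs_exist_ok=True (Python 3.8+) allows copying into the already-created temp dir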
+ shutil.copytree(template_dir, temp_path, dirs_exist_ok=True)
+
+ # Test the modification logic
+ app_file = temp_path / "app.py"
+ assert app_file.exists()
+
+ # Read original content
+ with open(app_file, 'r', encoding='utf-8') as f:
+ original_content = f.read()
+
+ # Simulate the modification
+ env_setup = """
+# Environment variables for model configuration
+import os
+os.environ['HF_MODEL_ID'] = 'test/model'
+os.environ['MODEL_SUBFOLDER'] = 'int4'
+os.environ['MODEL_NAME'] = 'model'
+
+"""
+
+ # Insert after imports
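+ # import_end ends up just past the last import line seen; the scan
+ # stops at the first blank line once at least one import was found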
+ lines = original_content.split('\n')
+ import_end = 0
+ for i, line in enumerate(lines):
+ if line.startswith('import ') or line.startswith('from '):
+ import_end = i + 1
+ elif line.strip() == '' and import_end > 0:
+ break
+
+ lines.insert(import_end, env_setup)
+ modified_content = '\n'.join(lines)
+
+ # Write modified content
+ with open(app_file, 'w', encoding='utf-8') as f:
+ f.write(modified_content)
+
+ # Verify modification
+ with open(app_file, 'r', encoding='utf-8') as f:
+ final_content = f.read()
+
+ assert 'HF_MODEL_ID' in final_content
+ assert 'MODEL_SUBFOLDER' in final_content
+ assert 'MODEL_NAME' in final_content
+
+ print("✅ app.py modification test passed")
+
+def test_readme_generation():
+ """Test README.md generation"""
+ print("🧪 Testing README.md generation...")
+
+ deployer = DemoSpaceDeployer(
+ hf_token="test_token",
+ hf_username="test_user",
+ model_id="test/model",
+ subfolder="int4"
+ )
+
+ # Test README content generation
+ readme_content = f"""# Demo: {deployer.model_id}
+
+This is an interactive demo for the fine-tuned model {deployer.model_id}.
+
+## Features
+- Interactive chat interface
+- Customizable system prompts
+- Advanced generation parameters
+- Thinking mode support
+
+## Model Information
+- **Model ID**: {deployer.model_id}
+- **Subfolder**: {deployer.subfolder}
+- **Deployed by**: {deployer.hf_username}
+
+## Usage
+Simply start chatting with the model using the interface below!
+
+---
+*This demo was automatically deployed by the SmolLM3 Fine-tuning Pipeline*
+"""
+
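+ # The assertions must include the literal "**" bold markers, since the
+ # f-string template above emits them verbatim into the README text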
+ assert "Demo: test/model" in readme_content
+ assert "Model ID: test/model" in readme_content
+ assert "Subfolder: int4" in readme_content
+ assert "Deployed by: test_user" in readme_content
+
+ print("✅ README.md generation test passed")
+
+def main():
+ """Run all tests"""
+ print("🚀 Starting demo deployment tests...")
+ print("=" * 50)
+
+ try:
+ test_demo_deployer_initialization()
+ test_template_files_exist()
+ test_app_py_modification()
+ test_readme_generation()
+
+ print("=" * 50)
+ print("🎉 All demo deployment tests passed!")
+
+ except Exception as e:
+ print(f"❌ Test failed: {e}")
+ sys.exit(1)
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/tests/test_experiment_id_fix.py b/tests/test_experiment_id_fix.py
deleted file mode 100644
index 8e1fa6ca2a35377ec7c0486276ca59a7f1674ae1..0000000000000000000000000000000000000000
--- a/tests/test_experiment_id_fix.py
+++ /dev/null
@@ -1,123 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify that both monitoring systems use the same experiment ID format
-"""
-
-import sys
-import os
-import logging
-
-# Add the project root to the path
-sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
-from src.monitoring import SmolLM3Monitor
-from src.trackio import init as trackio_init
-
-# Setup logging
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-def test_experiment_id_consistency():
- """Test that both monitoring systems use the same experiment ID format"""
- print("🔧 Testing experiment ID consistency...")
-
- # Test 1: SmolLM3Monitor experiment ID format
- print("\n1️⃣ Testing SmolLM3Monitor experiment ID format...")
- monitor = SmolLM3Monitor(
- experiment_name="test_experiment_id_consistency",
- enable_tracking=True
- )
-
- print(f"SmolLM3Monitor experiment ID: {monitor.experiment_id}")
-
- if monitor.experiment_id and monitor.experiment_id.startswith('exp_'):
- print("✅ SmolLM3Monitor uses correct experiment ID format (exp_)")
- else:
- print("❌ SmolLM3Monitor uses incorrect experiment ID format")
- return False
-
- # Test 2: Trackio experiment ID format
- print("\n2️⃣ Testing Trackio experiment ID format...")
- trackio_experiment_id = trackio_init(
- project_name="test_experiment_id_consistency",
- experiment_name="test_experiment_id_consistency"
- )
-
- print(f"Trackio experiment ID: {trackio_experiment_id}")
-
- if trackio_experiment_id and trackio_experiment_id.startswith('exp_'):
- print("✅ Trackio uses correct experiment ID format (exp_)")
- else:
- print("❌ Trackio uses incorrect experiment ID format")
- return False
-
- # Test 3: Verify both use the same format
- print("\n3️⃣ Testing experiment ID format consistency...")
- if monitor.experiment_id.startswith('exp_') and trackio_experiment_id.startswith('exp_'):
- print("✅ Both monitoring systems use the same experiment ID format")
- return True
- else:
- print("❌ Monitoring systems use different experiment ID formats")
- return False
-
-def test_monitoring_integration():
- """Test that both monitoring systems can work together"""
- print("\n🔧 Testing monitoring integration...")
-
- try:
- # Create monitor
- monitor = SmolLM3Monitor(
- experiment_name="test_monitoring_integration",
- enable_tracking=True
- )
-
- print(f"✅ Monitor created with experiment ID: {monitor.experiment_id}")
-
- # Initialize trackio with the same experiment ID
- trackio_experiment_id = trackio_init(
- project_name="test_monitoring_integration",
- experiment_name="test_monitoring_integration"
- )
-
- print(f"✅ Trackio initialized with experiment ID: {trackio_experiment_id}")
-
- # Test logging metrics to both systems
- metrics = {"loss": 1.234, "accuracy": 0.85}
-
- # Log to monitor
- monitor.log_metrics(metrics, step=100)
- print("✅ Metrics logged to monitor")
-
- # Log to trackio
- from src.trackio import log as trackio_log
- trackio_log(metrics, step=100)
- print("✅ Metrics logged to trackio")
-
- print("🎉 Monitoring integration test passed!")
- return True
-
- except Exception as e:
- print(f"❌ Monitoring integration test failed: {e}")
- return False
-
-if __name__ == "__main__":
- print("🚀 Starting Experiment ID Consistency Tests")
- print("=" * 60)
-
- # Test 1: Experiment ID format consistency
- format_consistency = test_experiment_id_consistency()
-
- # Test 2: Monitoring integration
- integration_success = test_monitoring_integration()
-
- print("\n" + "=" * 60)
- print("📊 Test Results Summary:")
- print(f"Experiment ID Format Consistency: {'✅ PASSED' if format_consistency else '❌ FAILED'}")
- print(f"Monitoring Integration: {'✅ PASSED' if integration_success else '❌ FAILED'}")
-
- if format_consistency and integration_success:
- print("\n🎉 All tests passed! Experiment ID conflict is resolved.")
- sys.exit(0)
- else:
- print("\n❌ Some tests failed. Please check the errors above.")
- sys.exit(1)
\ No newline at end of file
diff --git a/tests/test_formatting_fix.py b/tests/test_formatting_fix.py
deleted file mode 100644
index e81de60c95c46e2eb6e05f3e7f16e60028ed68fc..0000000000000000000000000000000000000000
--- a/tests/test_formatting_fix.py
+++ /dev/null
@@ -1,138 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify the string formatting fix
-"""
-
-import sys
-import os
-import logging
-
-# Setup logging
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-def test_logging():
- """Test that logging works without f-string formatting errors"""
- try:
- # Test various logging scenarios that were causing issues
- logger.info("Testing logging with %s", "string formatting")
- logger.info("Testing with %d numbers", 42)
- logger.info("Testing with %s and %d", "text", 123)
-
- # Test error logging
- try:
- raise ValueError("Test error")
- except Exception as e:
- logger.error("Caught error: %s", e)
-
- print("✅ All logging tests passed!")
- return True
-
- except Exception as e:
- print("❌ Logging test failed: {}".format(e))
- return False
-
-def test_imports():
- """Test that all modules can be imported without formatting errors"""
- try:
- # Test importing the main modules
- from src.monitoring import SmolLM3Monitor
- print("✅ monitoring module imported successfully")
-
- from src.trainer import SmolLM3Trainer
- print("✅ trainer module imported successfully")
-
- from src.model import SmolLM3Model
- print("✅ model module imported successfully")
-
- from src.data import SmolLM3Dataset
- print("✅ data module imported successfully")
-
- return True
-
- except Exception as e:
- print("❌ Import test failed: {}".format(e))
- return False
-
-def test_config_loading():
- """Test that configuration files can be loaded"""
- try:
- # Test loading a configuration
- config_path = "config/train_smollm3_openhermes_fr_a100_balanced.py"
- if os.path.exists(config_path):
- import importlib.util
- spec = importlib.util.spec_from_file_location("config_module", config_path)
- config_module = importlib.util.module_from_spec(spec)
- spec.loader.exec_module(config_module)
-
- if hasattr(config_module, 'config'):
- config = config_module.config
- print("✅ Configuration loaded successfully")
- print(" Model: {}".format(config.model_name))
- print(" Batch size: {}".format(config.batch_size))
- print(" Learning rate: {}".format(config.learning_rate))
- return True
- else:
- print("❌ No config found in {}".format(config_path))
- return False
- else:
- print("❌ Config file not found: {}".format(config_path))
- return False
-
- except Exception as e:
- print("❌ Config loading test failed: {}".format(e))
- return False
-
-def test_monitoring_creation():
- """Test that monitoring can be created without formatting errors"""
- try:
- from src.monitoring import SmolLM3Monitor
-
- # Test creating a monitor instance
- monitor = SmolLM3Monitor(
- experiment_name="test_experiment",
- enable_tracking=False # Disable tracking for test
- )
-
- print("✅ Monitoring instance created successfully")
- return True
-
- except Exception as e:
- print("❌ Monitoring creation test failed: {}".format(e))
- return False
-
-def main():
- """Run all tests"""
- print("🧪 Testing String Formatting Fix")
- print("=" * 40)
-
- tests = [
- ("Logging", test_logging),
- ("Imports", test_imports),
- ("Config Loading", test_config_loading),
- ("Monitoring Creation", test_monitoring_creation),
- ]
-
- passed = 0
- total = len(tests)
-
- for test_name, test_func in tests:
- print("\n🔍 Testing: {}".format(test_name))
- if test_func():
- passed += 1
- print("✅ {} test passed".format(test_name))
- else:
- print("❌ {} test failed".format(test_name))
-
- print("\n" + "=" * 40)
- print("📊 Test Results: {}/{} tests passed".format(passed, total))
-
- if passed == total:
- print("🎉 All tests passed! The formatting fix is working correctly.")
- return 0
- else:
- print("⚠️ Some tests failed. Please check the errors above.")
- return 1
-
-if __name__ == "__main__":
- sys.exit(main())
\ No newline at end of file
diff --git a/tests/test_git_config_fix.py b/tests/test_git_config_fix.py
deleted file mode 100644
index 02e20abcb7a977f2df6e41187d71f4aa69f67084..0000000000000000000000000000000000000000
--- a/tests/test_git_config_fix.py
+++ /dev/null
@@ -1,231 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify the git configuration fix for Trackio Space deployment
-"""
-
-import os
-import sys
-import tempfile
-import shutil
-import subprocess
-from pathlib import Path
-
-# Add project root to path
-project_root = Path(__file__).parent.parent
-sys.path.insert(0, str(project_root))
-
-def test_git_config_in_temp_dir():
- """Test that git configuration works in temporary directory"""
- print("🔍 Testing git configuration in temporary directory...")
-
- try:
- # Create temporary directory
- temp_dir = tempfile.mkdtemp()
- print(f"✅ Created temp directory: {temp_dir}")
-
- # Change to temp directory
- original_dir = os.getcwd()
- os.chdir(temp_dir)
-
- # Initialize git repository
- subprocess.run(["git", "init"], check=True, capture_output=True)
- print("✅ Initialized git repository")
-
- # Test git configuration
- test_email = "test@example.com"
- test_name = "Test User"
-
- # Set git config
- subprocess.run(["git", "config", "user.email", test_email], check=True, capture_output=True)
- subprocess.run(["git", "config", "user.name", test_name], check=True, capture_output=True)
-
- # Verify git config
- result = subprocess.run(["git", "config", "user.email"], capture_output=True, text=True)
- if result.returncode == 0 and result.stdout.strip() == test_email:
- print("✅ Git email configured correctly")
- else:
- print(f"❌ Git email not configured correctly: {result.stdout}")
- return False
-
- result = subprocess.run(["git", "config", "user.name"], capture_output=True, text=True)
- if result.returncode == 0 and result.stdout.strip() == test_name:
- print("✅ Git name configured correctly")
- else:
- print(f"❌ Git name not configured correctly: {result.stdout}")
- return False
-
- # Test git commit
- # Create a test file
- with open("test.txt", "w") as f:
- f.write("Test file for git commit")
-
- subprocess.run(["git", "add", "test.txt"], check=True, capture_output=True)
- subprocess.run(["git", "commit", "-m", "Test commit"], check=True, capture_output=True)
- print("✅ Git commit successful")
-
- # Return to original directory
- os.chdir(original_dir)
-
- # Clean up
- shutil.rmtree(temp_dir)
- print("✅ Cleanup successful")
-
- return True
-
- except Exception as e:
- print(f"❌ Error testing git config: {e}")
- # Return to original directory
- os.chdir(original_dir)
- return False
-
-def test_deployment_script_git_config():
- """Test that the deployment script handles git configuration correctly"""
- print("\n🔍 Testing deployment script git configuration...")
-
- try:
- sys.path.insert(0, str(project_root / "scripts" / "trackio_tonic"))
- from deploy_trackio_space import TrackioSpaceDeployer
-
- # Test with git configuration
- deployer = TrackioSpaceDeployer(
- "test-space",
- "test-user",
- "test-token",
- git_email="test@example.com",
- git_name="Test User"
- )
-
- # Check that git config is set
- if deployer.git_email == "test@example.com":
- print("✅ Git email set correctly")
- else:
- print(f"❌ Git email not set correctly: {deployer.git_email}")
- return False
-
- if deployer.git_name == "Test User":
- print("✅ Git name set correctly")
- else:
- print(f"❌ Git name not set correctly: {deployer.git_name}")
- return False
-
- return True
-
- except Exception as e:
- print(f"❌ Error testing deployment script: {e}")
- return False
-
-def test_git_config_fallback():
- """Test git configuration fallback behavior"""
- print("\n🔍 Testing git configuration fallback...")
-
- try:
- sys.path.insert(0, str(project_root / "scripts" / "trackio_tonic"))
- from deploy_trackio_space import TrackioSpaceDeployer
-
- # Test without git configuration (should use defaults)
- deployer = TrackioSpaceDeployer("test-space", "test-user", "test-token")
-
- # Check default values
- expected_email = "test-user@huggingface.co"
- expected_name = "test-user"
-
- if deployer.git_email == expected_email:
- print("✅ Default git email set correctly")
- else:
- print(f"❌ Default git email not set correctly: {deployer.git_email}")
- return False
-
- if deployer.git_name == expected_name:
- print("✅ Default git name set correctly")
- else:
- print(f"❌ Default git name not set correctly: {deployer.git_name}")
- return False
-
- return True
-
- except Exception as e:
- print(f"❌ Error testing git config fallback: {e}")
- return False
-
-def test_git_commit_with_config():
- """Test that git commit works with proper configuration"""
- print("\n🔍 Testing git commit with configuration...")
-
- try:
- # Create temporary directory
- temp_dir = tempfile.mkdtemp()
- print(f"✅ Created temp directory: {temp_dir}")
-
- # Change to temp directory
- original_dir = os.getcwd()
- os.chdir(temp_dir)
-
- # Initialize git repository
- subprocess.run(["git", "init"], check=True, capture_output=True)
-
- # Set git configuration
- subprocess.run(["git", "config", "user.email", "test@example.com"], check=True, capture_output=True)
- subprocess.run(["git", "config", "user.name", "Test User"], check=True, capture_output=True)
-
- # Create test file
- with open("test.txt", "w") as f:
- f.write("Test content")
-
- # Add and commit
- subprocess.run(["git", "add", "test.txt"], check=True, capture_output=True)
- subprocess.run(["git", "commit", "-m", "Test commit"], check=True, capture_output=True)
- print("✅ Git commit successful with configuration")
-
- # Return to original directory
- os.chdir(original_dir)
-
- # Clean up
- shutil.rmtree(temp_dir)
- print("✅ Cleanup successful")
-
- return True
-
- except Exception as e:
- print(f"❌ Error testing git commit: {e}")
- # Return to original directory
- os.chdir(original_dir)
- return False
-
-def main():
- """Run all git configuration tests"""
- print("🚀 Testing Git Configuration Fix")
- print("=" * 40)
-
- tests = [
- test_git_config_in_temp_dir,
- test_deployment_script_git_config,
- test_git_config_fallback,
- test_git_commit_with_config
- ]
-
- passed = 0
- total = len(tests)
-
- for test in tests:
- try:
- if test():
- passed += 1
- except Exception as e:
- print(f"❌ Test {test.__name__} crashed: {e}")
-
- print(f"\n📊 Test Results: {passed}/{total} tests passed")
-
- if passed == total:
- print("✅ All git configuration tests passed! The deployment should work correctly.")
- print("\n🎯 Next steps:")
- print("1. Run the deployment script: python scripts/trackio_tonic/deploy_trackio_space.py")
- print("2. Provide your HF username, space name, token, and git config")
- print("3. The git commit should now work correctly")
- return True
- else:
- print("❌ Some git configuration tests failed. Please check the errors above.")
- return False
-
-if __name__ == "__main__":
- success = main()
- sys.exit(0 if success else 1)
\ No newline at end of file
diff --git a/tests/test_pipeline_1.py b/tests/test_pipeline_1.py
deleted file mode 100644
index d57fa8be321a4258fb3f70b3b21248f113ff2815..0000000000000000000000000000000000000000
--- a/tests/test_pipeline_1.py
+++ /dev/null
@@ -1,125 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify the training pipeline works correctly
-"""
-
-import sys
-import os
-from pathlib import Path
-
-# Add project root to path
-project_root = Path(__file__).parent
-sys.path.insert(0, str(project_root))
-
-def test_config_imports():
- """Test that all configuration files can be imported correctly"""
- print("🧪 Testing configuration imports...")
-
- try:
- # Test base config only
- from config.train_smollm3 import SmolLM3Config, get_config
- print("✅ Base config imported successfully")
-
- # Test H100 lightweight config (without triggering __post_init__)
- import importlib.util
- spec = importlib.util.spec_from_file_location("h100_config", "config/train_smollm3_h100_lightweight.py")
- h100_module = importlib.util.module_from_spec(spec)
- spec.loader.exec_module(h100_module)
- print("✅ H100 lightweight config imported successfully")
-
- return True
-
- except ImportError as e:
- print(f"❌ Import error: {e}")
- return False
-
-def test_training_script():
- """Test that the training script can be imported"""
- print("\n🧪 Testing training script...")
-
- try:
- # Add src to path
- src_path = str(project_root / "src")
- sys.path.insert(0, src_path)
-
- # Test importing training modules
- from train import main as train_main
- print("✅ Training script imported successfully")
-
- from model import SmolLM3Model
- print("✅ Model module imported successfully")
-
- from data import load_dataset
- print("✅ Data module imported successfully")
-
- from monitoring import SmolLM3Monitor, create_monitor_from_config
- print("✅ Monitoring module imported successfully")
-
- return True
-
- except ImportError as e:
- print(f"❌ Import error: {e}")
- return False
-
-def test_scripts():
- """Test that the scripts can be imported"""
- print("\n🧪 Testing scripts...")
-
- try:
- # Test dataset setup script
- sys.path.insert(0, str(project_root / "scripts" / "dataset_tonic"))
- from setup_hf_dataset import setup_trackio_dataset
- print("✅ Dataset setup script imported successfully")
-
- # Test trackio scripts
- sys.path.insert(0, str(project_root / "scripts" / "trackio_tonic"))
- from deploy_trackio_space import TrackioSpaceDeployer
- print("✅ Trackio deployment script imported successfully")
-
- from configure_trackio import configure_trackio
- print("✅ Trackio configuration script imported successfully")
-
- # Test model push script
- sys.path.insert(0, str(project_root / "scripts" / "model_tonic"))
- from push_to_huggingface import HuggingFacePusher
- print("✅ Model push script imported successfully")
-
- return True
-
- except ImportError as e:
- print(f"❌ Import error: {e}")
- return False
-
-def main():
- """Run all tests"""
- print("🚀 Testing SmolLM3 Fine-tuning Pipeline")
- print("=" * 50)
-
- tests = [
- test_config_imports,
- test_training_script,
- test_scripts
- ]
-
- passed = 0
- total = len(tests)
-
- for test in tests:
- if test():
- passed += 1
- print()
-
- print("=" * 50)
- print(f"📊 Test Results: {passed}/{total} tests passed")
-
- if passed == total:
- print("🎉 All tests passed! Pipeline is ready to use.")
- print("\n🚀 You can now run: ./launch.sh")
- else:
- print("❌ Some tests failed. Please check the errors above.")
- return 1
-
- return 0
-
-if __name__ == "__main__":
- exit(main())
\ No newline at end of file
diff --git a/tests/test_quantization_fix.py b/tests/test_quantization_fix.py
deleted file mode 100644
index cf7c8567c784d5195a1d9cd007a40e5214a2a70a..0000000000000000000000000000000000000000
--- a/tests/test_quantization_fix.py
+++ /dev/null
@@ -1,149 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify quantization fixes
-"""
-
-import os
-import sys
-import logging
-from pathlib import Path
-
-# Setup logging
-logging.basicConfig(
- level=logging.INFO,
- format='%(asctime)s - %(levelname)s - %(message)s'
-)
-logger = logging.getLogger(__name__)
-
-def test_quantization_imports():
- """Test that all required imports work"""
- try:
- # Test torchao imports
- from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
- from torchao.quantization import (
- Int8WeightOnlyConfig,
- Int4WeightOnlyConfig,
- Int8DynamicActivationInt8WeightConfig
- )
- from torchao.dtypes import Int4CPULayout
- logger.info("✅ torchao imports successful")
-
- # Test bitsandbytes imports
- try:
- import bitsandbytes as bnb
- from transformers import BitsAndBytesConfig
- logger.info("✅ bitsandbytes imports successful")
- except ImportError:
- logger.warning("⚠️ bitsandbytes not available - alternative quantization disabled")
-
- # Test HF imports
- from huggingface_hub import HfApi
- logger.info("✅ huggingface_hub imports successful")
-
- return True
-
- except ImportError as e:
- logger.error(f"❌ Import failed: {e}")
- return False
-
-def test_model_quantizer():
- """Test ModelQuantizer initialization"""
- try:
- from scripts.model_tonic.quantize_model import ModelQuantizer
-
- # Test with dummy values
- quantizer = ModelQuantizer(
- model_path="/output-checkpoint",
- repo_name="test/test-repo",
- token="dummy_token"
- )
-
- logger.info("✅ ModelQuantizer initialization successful")
- return True
-
- except Exception as e:
- logger.error(f"❌ ModelQuantizer test failed: {e}")
- return False
-
-def test_quantization_configs():
- """Test quantization config creation"""
- try:
- from scripts.model_tonic.quantize_model import ModelQuantizer
-
- quantizer = ModelQuantizer(
- model_path="/output-checkpoint",
- repo_name="test/test-repo",
- token="dummy_token"
- )
-
- # Test int8 config
- config = quantizer.create_quantization_config("int8_weight_only", 128)
- logger.info("✅ int8_weight_only config creation successful")
-
- # Test int4 config
- config = quantizer.create_quantization_config("int4_weight_only", 128)
- logger.info("✅ int4_weight_only config creation successful")
-
- return True
-
- except Exception as e:
- logger.error(f"❌ Quantization config test failed: {e}")
- return False
-
-def test_device_selection():
- """Test optimal device selection"""
- try:
- from scripts.model_tonic.quantize_model import ModelQuantizer
-
- quantizer = ModelQuantizer(
- model_path="/output-checkpoint",
- repo_name="test/test-repo",
- token="dummy_token"
- )
-
- # Test device selection
- device = quantizer.get_optimal_device("int8_weight_only")
- logger.info(f"✅ int8 device selection: {device}")
-
- device = quantizer.get_optimal_device("int4_weight_only")
- logger.info(f"✅ int4 device selection: {device}")
-
- return True
-
- except Exception as e:
- logger.error(f"❌ Device selection test failed: {e}")
- return False
-
-def main():
- """Run all tests"""
- logger.info("🧪 Testing quantization fixes...")
-
- tests = [
- ("Import Test", test_quantization_imports),
- ("ModelQuantizer Test", test_model_quantizer),
- ("Config Creation Test", test_quantization_configs),
- ("Device Selection Test", test_device_selection),
- ]
-
- passed = 0
- total = len(tests)
-
- for test_name, test_func in tests:
- logger.info(f"\n🔍 Running {test_name}...")
- if test_func():
- passed += 1
- logger.info(f"✅ {test_name} passed")
- else:
- logger.error(f"❌ {test_name} failed")
-
- logger.info(f"\n📊 Test Results: {passed}/{total} tests passed")
-
- if passed == total:
- logger.info("🎉 All tests passed! Quantization fixes are working.")
- return 0
- else:
- logger.error("❌ Some tests failed. Please check the errors above.")
- return 1
-
-if __name__ == "__main__":
- exit(main())
\ No newline at end of file
diff --git a/tests/test_safetensors_fix.py b/tests/test_safetensors_fix.py
deleted file mode 100644
index a5e9a76a15faa6b6fb74be6ecae99d71c1b61185..0000000000000000000000000000000000000000
--- a/tests/test_safetensors_fix.py
+++ /dev/null
@@ -1,122 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify safetensors model validation fix
-"""
-
-import os
-import sys
-import logging
-from pathlib import Path
-
-# Setup logging
-logging.basicConfig(
- level=logging.INFO,
- format='%(asctime)s - %(levelname)s - %(message)s'
-)
-logger = logging.getLogger(__name__)
-
-def test_safetensors_validation():
- """Test that safetensors models are properly validated"""
- try:
- from scripts.model_tonic.quantize_model import ModelQuantizer
-
- # Test with dummy values
- quantizer = ModelQuantizer(
- model_path="/output-checkpoint",
- repo_name="test/test-repo",
- token="dummy_token"
- )
-
- # Mock the model path to simulate the Linux environment
- # In the real environment, this would be /output-checkpoint
- # with safetensors files
-
- # Test validation logic
- if quantizer.validate_model_path():
- logger.info("✅ Safetensors validation test passed")
- return True
- else:
- logger.error("❌ Safetensors validation test failed")
- return False
-
- except Exception as e:
- logger.error(f"❌ Safetensors validation test failed: {e}")
- return False
-
-def test_model_file_detection():
- """Test model file detection logic"""
- try:
- from scripts.model_tonic.quantize_model import ModelQuantizer
-
- quantizer = ModelQuantizer(
- model_path="/output-checkpoint",
- repo_name="test/test-repo",
- token="dummy_token"
- )
-
- # Test the validation logic directly
- model_path = Path("/output-checkpoint")
-
- # Check for essential files
- required_files = ['config.json']
- model_files = [
- "model.safetensors.index.json", # Safetensors format
- "pytorch_model.bin" # PyTorch format
- ]
-
- missing_required = []
- for file in required_files:
- if not (model_path / file).exists():
- missing_required.append(file)
-
- # Check if at least one model file exists
- model_file_exists = any((model_path / file).exists() for file in model_files)
- if not model_file_exists:
- missing_required.extend(model_files)
-
- if missing_required:
- logger.error(f"❌ Missing required model files: {missing_required}")
- return False
-
- logger.info("✅ Model file detection test passed")
- return True
-
- except Exception as e:
- logger.error(f"❌ Model file detection test failed: {e}")
- return False
-
-def main():
- """Run safetensors validation tests"""
- logger.info("🧪 Testing safetensors validation fix...")
-
- tests = [
- ("Safetensors Validation Test", test_safetensors_validation),
- ("Model File Detection Test", test_model_file_detection),
- ]
-
- passed = 0
- total = len(tests)
-
- for test_name, test_func in tests:
- logger.info(f"\n🔍 Running {test_name}...")
- if test_func():
- passed += 1
- logger.info(f"✅ {test_name} passed")
- else:
- logger.error(f"❌ {test_name} failed")
-
- logger.info(f"\n📊 Test Results: {passed}/{total} tests passed")
-
- if passed == total:
- logger.info("🎉 All safetensors tests passed! The fix should work in the Linux environment.")
- logger.info("💡 The validation now properly handles:")
- logger.info(" - Safetensors format (model.safetensors.index.json)")
- logger.info(" - PyTorch format (pytorch_model.bin)")
- logger.info(" - Either format is accepted")
- return 0
- else:
- logger.error("❌ Some tests failed. The fix may need adjustment.")
- return 1
-
-if __name__ == "__main__":
- exit(main())
\ No newline at end of file
diff --git a/tests/test_token_switch.py b/tests/test_token_switch.py
new file mode 100644
index 0000000000000000000000000000000000000000..a3bedc177eca5f8275d455f3092baf3930a1755a
--- /dev/null
+++ b/tests/test_token_switch.py
@@ -0,0 +1,137 @@
+#!/usr/bin/env python3
+"""
+Test script for token switching functionality
+"""
+
+import os
+import sys
+import subprocess
+from pathlib import Path
+
+def test_token_validation():
+ """Test token validation script"""
+ print("🧪 Testing token validation...")
+
+ # Test with invalid token
+ result = subprocess.run([
+ "python3", "scripts/validate_hf_token.py", "invalid_token"
+ ], capture_output=True, text=True)
+
+ if result.returncode != 0:
+ print("✅ Invalid token correctly rejected")
+ else:
+ print("❌ Invalid token should have been rejected")
+ return False
+
+ # Invalid token via environment variable: rejection proves the env path is read
+ os.environ['HF_TOKEN'] = 'test_token'
+ result = subprocess.run([
+ "python3", "scripts/validate_hf_token.py"
+ ], capture_output=True, text=True)
+
+ if result.returncode != 0:
+ print("✅ Invalid environment token correctly rejected")
+ else:
+ print("❌ Invalid environment token should have been rejected")
+ return False
+
+ return True
+
+def test_token_switch_script():
+ """Test token switch script"""
+ print("🧪 Testing token switch script...")
+
+ # Test with missing arguments (script should exit non-zero)
+ result = subprocess.run([
+ "python3", "scripts/trackio_tonic/switch_to_read_token.py"
+ ], capture_output=True, text=True)
+
+ if result.returncode != 0:
+ print("✅ Script correctly handles missing arguments")
+ else:
+ print("❌ Script should have failed with missing arguments")
+ return False
+
+ # Test with invalid space_id format
+ result = subprocess.run([
+ "python3", "scripts/trackio_tonic/switch_to_read_token.py",
+ "invalid_space", "token1", "token2"
+ ], capture_output=True, text=True)
+
+ if result.returncode != 0:
+ print("✅ Script correctly validates space_id format")
+ else:
+ print("❌ Script should have failed with invalid space_id")
+ return False
+
+ return True
+
+def test_secure_input_function():
+ """Test the secure input function in launch.sh"""
+ print("🧪 Testing secure input function...")
+
+ # This would require interactive testing, so we'll just check if the function exists
+ launch_script = Path("launch.sh")
+ if launch_script.exists():
+ try:
+ with open(launch_script, 'r', encoding='utf-8') as f:
+ content = f.read()
+ if "get_secure_token_input" in content:
+ print("✅ Secure input function found in launch.sh")
+ return True
+ else:
+ print("❌ Secure input function not found in launch.sh")
+ return False
+ except UnicodeDecodeError:
+ # Try with different encoding
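+ # latin-1 maps every byte value, so this fallback read cannot fail to decode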
+ try:
+ with open(launch_script, 'r', encoding='latin-1') as f:
+ content = f.read()
+ if "get_secure_token_input" in content:
+ print("✅ Secure input function found in launch.sh")
+ return True
+ else:
+ print("❌ Secure input function not found in launch.sh")
+ return False
+ except Exception as e:
+ print(f"❌ Error reading launch.sh: {e}")
+ return False
+ else:
+ print("❌ launch.sh not found")
+ return False
+
+def main():
+ """Run all tests"""
+ print("🔍 Testing Token Security Features")
+ print("=" * 40)
+
+ tests = [
+ test_token_validation,
+ test_token_switch_script,
+ test_secure_input_function
+ ]
+
+ passed = 0
+ total = len(tests)
+
+ for test in tests:
+ try:
+ if test():
+ passed += 1
+ else:
+ print(f"❌ Test failed: {test.__name__}")
+ except Exception as e:
+ print(f"❌ Test error: {test.__name__} - {e}")
+
+ print(f"\n📊 Test Results: {passed}/{total} tests passed")
+
+ if passed == total:
+ print("✅ All tests passed!")
+ return 0
+ else:
+ print("❌ Some tests failed!")
+ return 1
+
+if __name__ == "__main__":
+ sys.exit(main())
\ No newline at end of file
diff --git a/tests/test_trackio_api_fix.py b/tests/test_trackio_api_fix.py
deleted file mode 100644
index f6eede4586a5c3db147b37921ed34a0efa5da882..0000000000000000000000000000000000000000
--- a/tests/test_trackio_api_fix.py
+++ /dev/null
@@ -1,229 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script for the fixed Trackio API client
-Verifies connection to the deployed Trackio Space with automatic URL resolution
-"""
-
-import sys
-import os
-import logging
-
-# Add the project root to the path
-sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
-from scripts.trackio_tonic.trackio_api_client import TrackioAPIClient
-
-# Setup logging
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-def test_trackio_connection():
- """Test connection to Trackio Space"""
- print("🔧 Testing Trackio API Client with automatic URL resolution...")
-
- # Initialize the API client with Space ID
- space_id = "Tonic/trackio-monitoring-20250727"
- client = TrackioAPIClient(space_id)
-
- # Test 1: Space info
- print("\n1️⃣ Testing Space info resolution...")
- space_info = client.get_space_info()
- print(f"Space info result: {space_info}")
-
- if space_info.get('error'):
- print("❌ Space info failed")
- return False
-
- print("✅ Space info successful!")
-
- # Test 2: Connection test
- print("\n2️⃣ Testing connection...")
- connection_result = client.test_connection()
- print(f"Connection result: {connection_result}")
-
- if connection_result.get('error'):
- print("❌ Connection failed")
- return False
-
- print("✅ Connection successful!")
-
- # Test 3: List experiments
- print("\n3️⃣ Testing list experiments...")
- list_result = client.list_experiments()
- print(f"List experiments result: {list_result}")
-
- if list_result.get('error'):
- print("❌ List experiments failed")
- return False
-
- print("✅ List experiments successful!")
-
- # Test 4: Create a test experiment
- print("\n4️⃣ Testing create experiment...")
- create_result = client.create_experiment(
- name="test_experiment_auto_resolve",
- description="Test experiment with automatic URL resolution"
- )
- print(f"Create experiment result: {create_result}")
-
- if create_result.get('error'):
- print("❌ Create experiment failed")
- return False
-
- print("✅ Create experiment successful!")
-
- # Test 5: Log metrics
- print("\n5️⃣ Testing log metrics...")
- metrics = {
- "loss": 1.234,
- "accuracy": 0.85,
- "learning_rate": 2e-5,
- "gpu_memory": 22.5
- }
-
- log_metrics_result = client.log_metrics(
- experiment_id="test_experiment_auto_resolve",
- metrics=metrics,
- step=100
- )
- print(f"Log metrics result: {log_metrics_result}")
-
- if log_metrics_result.get('error'):
- print("❌ Log metrics failed")
- return False
-
- print("✅ Log metrics successful!")
-
- # Test 6: Log parameters
- print("\n6️⃣ Testing log parameters...")
- parameters = {
- "learning_rate": 2e-5,
- "batch_size": 8,
- "model_name": "HuggingFaceTB/SmolLM3-3B",
- "max_iters": 18000,
- "mixed_precision": "bf16"
- }
-
- log_params_result = client.log_parameters(
- experiment_id="test_experiment_auto_resolve",
- parameters=parameters
- )
- print(f"Log parameters result: {log_params_result}")
-
- if log_params_result.get('error'):
- print("❌ Log parameters failed")
- return False
-
- print("✅ Log parameters successful!")
-
- # Test 7: Get experiment details
- print("\n7️⃣ Testing get experiment details...")
- details_result = client.get_experiment_details("test_experiment_auto_resolve")
- print(f"Get experiment details result: {details_result}")
-
- if details_result.get('error'):
- print("❌ Get experiment details failed")
- return False
-
- print("✅ Get experiment details successful!")
-
- print("\n🎉 All tests passed! Trackio API client with automatic URL resolution is working correctly.")
- return True
-
-def test_monitoring_integration():
- """Test the monitoring integration with the fixed API client"""
- print("\n🔧 Testing monitoring integration...")
-
- try:
- from src.monitoring import SmolLM3Monitor
-
- # Create a monitor instance
- monitor = SmolLM3Monitor(
- experiment_name="test_monitoring_auto_resolve",
- enable_tracking=True,
- log_metrics=True,
- log_config=True
- )
-
- print("✅ Monitor created successfully")
-
- # Test logging metrics
- metrics = {
- "loss": 1.123,
- "accuracy": 0.87,
- "learning_rate": 2e-5
- }
-
- monitor.log_metrics(metrics, step=50)
- print("✅ Metrics logged successfully")
-
- # Test logging configuration
- config = {
- "model_name": "HuggingFaceTB/SmolLM3-3B",
- "batch_size": 8,
- "learning_rate": 2e-5
- }
-
- monitor.log_config(config)
- print("✅ Configuration logged successfully")
-
- print("🎉 Monitoring integration test passed!")
- return True
-
- except Exception as e:
- print(f"❌ Monitoring integration test failed: {e}")
- return False
-
-def test_space_url_resolution():
- """Test automatic Space URL resolution"""
- print("\n🔧 Testing Space URL resolution...")
-
- try:
- from huggingface_hub import HfApi
-
- # Test Space info retrieval
- api = HfApi()
- space_id = "Tonic/trackio-monitoring-20250727"
-
- space_info = api.space_info(space_id)
- print(f"✅ Space info retrieved: {space_info}")
-
- if hasattr(space_info, 'host'):
- space_url = f"https://{space_info.host}"
- print(f"✅ Resolved Space URL: {space_url}")
- else:
- print("⚠️ Space host not available, using fallback")
- space_url = f"https://{space_id.replace('/', '-')}.hf.space"
- print(f"✅ Fallback Space URL: {space_url}")
-
- return True
-
- except Exception as e:
- print(f"❌ Space URL resolution failed: {e}")
- return False
-
-if __name__ == "__main__":
- print("🚀 Starting Trackio API Client Tests with Automatic URL Resolution")
- print("=" * 70)
-
- # Test 1: Space URL Resolution
- url_resolution_success = test_space_url_resolution()
-
- # Test 2: API Client
- api_success = test_trackio_connection()
-
- # Test 3: Monitoring Integration
- monitoring_success = test_monitoring_integration()
-
- print("\n" + "=" * 70)
- print("📊 Test Results Summary:")
- print(f"Space URL Resolution: {'✅ PASSED' if url_resolution_success else '❌ FAILED'}")
- print(f"API Client Test: {'✅ PASSED' if api_success else '❌ FAILED'}")
- print(f"Monitoring Integration: {'✅ PASSED' if monitoring_success else '❌ FAILED'}")
-
- if url_resolution_success and api_success and monitoring_success:
- print("\n🎉 All tests passed! The Trackio integration with automatic URL resolution is working correctly.")
- sys.exit(0)
- else:
- print("\n❌ Some tests failed. Please check the errors above.")
- sys.exit(1)
\ No newline at end of file
diff --git a/tests/test_trackio_fixes.py b/tests/test_trackio_fixes.py
deleted file mode 100644
index f65f3e7a67b31bab68c21fce55abe6ee90297028..0000000000000000000000000000000000000000
--- a/tests/test_trackio_fixes.py
+++ /dev/null
@@ -1,212 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify Trackio deployment fixes
-"""
-
-import os
-import sys
-import subprocess
-from pathlib import Path
-
-def test_imports():
- """Test that required packages are available"""
- print("🔍 Testing imports...")
-
- try:
- from huggingface_hub import HfApi, create_repo, upload_file
- print("✅ huggingface_hub imports successful")
- except ImportError as e:
- print(f"❌ huggingface_hub import failed: {e}")
- return False
-
- try:
- from datasets import Dataset
- print("✅ datasets import successful")
- except ImportError as e:
- print(f"❌ datasets import failed: {e}")
- return False
-
- return True
-
-def test_script_exists(script_path):
- """Test that a script exists and is executable"""
- path = Path(script_path)
- if not path.exists():
- print(f"❌ Script not found: {script_path}")
- return False
-
- if not path.is_file():
- print(f"❌ Not a file: {script_path}")
- return False
-
- print(f"✅ Script exists: {script_path}")
- return True
-
-def test_script_syntax(script_path):
- """Test that a script has valid Python syntax"""
- try:
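- # compile() parses the source without executing it, so this check catches syntax errors only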
- with open(script_path, 'r', encoding='utf-8') as f:
- compile(f.read(), script_path, 'exec')
- print(f"✅ Syntax valid: {script_path}")
- return True
- except SyntaxError as e:
- print(f"❌ Syntax error in {script_path}: {e}")
- return False
- except Exception as e:
- print(f"❌ Error reading {script_path}: {e}")
- return False
-
-def test_environment_variables():
- """Test that required environment variables are set"""
- print("🔍 Testing environment variables...")
-
- hf_token = os.environ.get('HF_TOKEN')
- if hf_token:
- print("✅ HF_TOKEN is set")
- else:
- print("⚠️ HF_TOKEN is not set (this is normal for testing)")
-
- dataset_repo = os.environ.get('TRACKIO_DATASET_REPO', 'tonic/trackio-experiments')
- print(f"📊 TRACKIO_DATASET_REPO: {dataset_repo}")
-
- return True
-
-def test_api_connection():
- """Test HF API connection if token is available"""
- hf_token = os.environ.get('HF_TOKEN')
- if not hf_token:
- print("⚠️ Skipping API connection test - no HF_TOKEN")
- return True
-
- try:
- from huggingface_hub import HfApi
- api = HfApi(token=hf_token)
-
- # Test basic API call
- user_info = api.whoami()
- print(f"✅ API connection successful - User: {user_info.get('name', 'Unknown')}")
- return True
- except Exception as e:
- print(f"❌ API connection failed: {e}")
- return False
-
-def test_script_functions():
- """Test that scripts can be imported and have required functions"""
- print("🔍 Testing script functions...")
-
- # Test deploy script
- try:
- sys.path.append(str(Path(__file__).parent.parent / "scripts" / "trackio_tonic"))
- from deploy_trackio_space import TrackioSpaceDeployer
- print("✅ TrackioSpaceDeployer class imported successfully")
- except Exception as e:
- print(f"❌ Failed to import TrackioSpaceDeployer: {e}")
- return False
-
- # Test dataset script
- try:
- sys.path.append(str(Path(__file__).parent.parent / "scripts" / "dataset_tonic"))
- import setup_hf_dataset
- print("✅ setup_hf_dataset module imported successfully")
- except Exception as e:
- print(f"❌ Failed to import setup_hf_dataset: {e}")
- return False
-
- # Test configure script
- try:
- sys.path.append(str(Path(__file__).parent.parent / "scripts" / "trackio_tonic"))
- import configure_trackio
- print("✅ configure_trackio module imported successfully")
- except Exception as e:
- print(f"❌ Failed to import configure_trackio: {e}")
- return False
-
- return True
-
-def test_template_files():
- """Test that template files exist"""
- print("🔍 Testing template files...")
-
- project_root = Path(__file__).parent.parent
- templates_dir = project_root / "templates"
-
- required_files = [
- "spaces/app.py",
- "spaces/requirements.txt",
- "spaces/README.md",
- "datasets/readme.md"
- ]
-
- all_exist = True
- for file_path in required_files:
- full_path = templates_dir / file_path
- if full_path.exists():
- print(f"✅ Template exists: {file_path}")
- else:
- print(f"❌ Template missing: {file_path}")
- all_exist = False
-
- return all_exist
-
-def main():
- """Run all tests"""
- print("🧪 Testing Trackio Deployment Fixes")
- print("=" * 40)
-
- tests = [
- ("Import Tests", test_imports),
- ("Script Existence", lambda: all([
- test_script_exists("scripts/trackio_tonic/deploy_trackio_space.py"),
- test_script_exists("scripts/dataset_tonic/setup_hf_dataset.py"),
- test_script_exists("scripts/trackio_tonic/configure_trackio.py"),
- test_script_exists("scripts/model_tonic/push_to_huggingface.py")
- ])),
- ("Script Syntax", lambda: all([
- test_script_syntax("scripts/trackio_tonic/deploy_trackio_space.py"),
- test_script_syntax("scripts/dataset_tonic/setup_hf_dataset.py"),
- test_script_syntax("scripts/trackio_tonic/configure_trackio.py"),
- test_script_syntax("scripts/model_tonic/push_to_huggingface.py")
- ])),
- ("Environment Variables", test_environment_variables),
- ("API Connection", test_api_connection),
- ("Script Functions", test_script_functions),
- ("Template Files", test_template_files)
- ]
-
- results = []
- for test_name, test_func in tests:
- print(f"\n📋 {test_name}")
- print("-" * 20)
- try:
- result = test_func()
- results.append((test_name, result))
- except Exception as e:
- print(f"❌ Test failed with exception: {e}")
- results.append((test_name, False))
-
- # Summary
- print("\n" + "=" * 40)
- print("📊 Test Results Summary")
- print("=" * 40)
-
- passed = 0
- total = len(results)
-
- for test_name, result in results:
- status = "✅ PASS" if result else "❌ FAIL"
- print(f"{status}: {test_name}")
- if result:
- passed += 1
-
- print(f"\n🎯 Overall: {passed}/{total} tests passed")
-
- if passed == total:
- print("🎉 All tests passed! The fixes are working correctly.")
- return True
- else:
- print("⚠️ Some tests failed. Please check the issues above.")
- return False
-
-if __name__ == "__main__":
- success = main()
- sys.exit(0 if success else 1)
\ No newline at end of file
diff --git a/tests/test_trackio_trl_fix.py b/tests/test_trackio_trl_fix.py
deleted file mode 100644
index bb280a71bdd85b0ef3f4ca004715951e11359b7e..0000000000000000000000000000000000000000
--- a/tests/test_trackio_trl_fix.py
+++ /dev/null
@@ -1,176 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify Trackio TRL compatibility fix
-Tests that our trackio module provides the interface expected by TRL library
-"""
-
-import sys
-import os
-import logging
-
-# Add project root to path so the local trackio module can be imported
-sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..'))
-
-def test_trackio_interface():
- """Test that trackio module provides the expected interface"""
- print("🔍 Testing Trackio TRL Interface")
-
- try:
- # Test importing trackio
- import trackio
- print("✅ Successfully imported trackio module")
-
- # Test that required functions exist
- required_functions = ['init', 'log', 'finish']
- for func_name in required_functions:
- if hasattr(trackio, func_name):
- print(f"✅ Found required function: {func_name}")
- else:
- print(f"❌ Missing required function: {func_name}")
- return False
-
- # Test initialization with arguments
- experiment_id = trackio.init(
- project_name="test_project",
- experiment_name="test_experiment",
- trackio_url="https://test.hf.space",
- dataset_repo="test/trackio-experiments"
- )
- print(f"✅ Trackio initialization with args successful: {experiment_id}")
-
- # Test initialization without arguments (TRL compatibility)
- experiment_id2 = trackio.init()
- print(f"✅ Trackio initialization without args successful: {experiment_id2}")
-
- # Test logging
- metrics = {'loss': 0.5, 'learning_rate': 1e-4}
- trackio.log(metrics, step=1)
- print("✅ Trackio logging successful")
-
- # Test finishing
- trackio.finish()
- print("✅ Trackio finish successful")
-
- return True
-
- except Exception as e:
- print(f"❌ Trackio interface test failed: {e}")
- return False
-
-def test_trl_compatibility():
- """Test that our trackio module is compatible with TRL expectations"""
- print("\n🔍 Testing TRL Compatibility")
-
- try:
- # Simulate what TRL would do
- import trackio
-
- # TRL expects these functions to be available
- assert hasattr(trackio, 'init'), "trackio.init not found"
- assert hasattr(trackio, 'log'), "trackio.log not found"
- assert hasattr(trackio, 'finish'), "trackio.finish not found"
-
- # Test function signatures
- import inspect
-
- # Check init signature
- init_sig = inspect.signature(trackio.init)
- print(f"✅ init signature: {init_sig}")
-
- # Test that init can be called without arguments (TRL compatibility)
- try:
- # This simulates what TRL might do
- trackio.init()
- print("✅ init() can be called without arguments")
- except Exception as e:
- print(f"❌ init() failed when called without arguments: {e}")
- return False
-
- # Test that config attribute is available (TRL compatibility)
- try:
- config = trackio.config
- print(f"✅ trackio.config is available: {type(config)}")
- print(f"✅ config.project_name: {config.project_name}")
- print(f"✅ config.experiment_name: {config.experiment_name}")
- except Exception as e:
- print(f"❌ trackio.config failed: {e}")
- return False
-
- # Check log signature
- log_sig = inspect.signature(trackio.log)
- print(f"✅ log signature: {log_sig}")
-
- # Check finish signature
- finish_sig = inspect.signature(trackio.finish)
- print(f"✅ finish signature: {finish_sig}")
-
- print("✅ TRL compatibility test passed")
- return True
-
- except Exception as e:
- print(f"❌ TRL compatibility test failed: {e}")
- return False
-
-def test_monitoring_integration():
- """Test that our trackio module integrates with our monitoring system"""
- print("\n🔍 Testing Monitoring Integration")
-
- try:
- import trackio
-
- # Test that we can get the monitor
- monitor = trackio.get_monitor()
- if monitor is not None:
- print("✅ Monitor integration working")
- else:
- print("⚠️ Monitor not available (this is normal if not initialized)")
-
- # Test availability check
- is_avail = trackio.is_available()
- print(f"✅ Trackio availability check: {is_avail}")
-
- return True
-
- except Exception as e:
- print(f"❌ Monitoring integration test failed: {e}")
- return False
-
-def main():
- """Run all tests"""
- print("🚀 Testing Trackio TRL Fix")
- print("=" * 50)
-
- tests = [
- test_trackio_interface,
- test_trl_compatibility,
- test_monitoring_integration
- ]
-
- passed = 0
- total = len(tests)
-
- for test in tests:
- try:
- if test():
- passed += 1
- except Exception as e:
- print(f"❌ Test {test.__name__} failed with exception: {e}")
-
- print("\n" + "=" * 50)
- print(f"Test Results: {passed}/{total} tests passed")
-
- if passed == total:
- print("✅ All tests passed! Trackio TRL fix is working correctly.")
- print("\nThe trackio module now provides the interface expected by TRL library:")
- print("- init(): Initialize experiment")
- print("- log(): Log metrics")
- print("- finish(): Finish experiment")
- print("\nThis should resolve the 'module trackio has no attribute init' error.")
- else:
- print("❌ Some tests failed. Please check the implementation.")
- return 1
-
- return 0
-
-if __name__ == "__main__":
- sys.exit(main())
\ No newline at end of file
diff --git a/tests/test_trackio_update_fix.py b/tests/test_trackio_update_fix.py
deleted file mode 100644
index d95d4ad8a552e2c3e833f1cf66ed160224d374e8..0000000000000000000000000000000000000000
--- a/tests/test_trackio_update_fix.py
+++ /dev/null
@@ -1,136 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify TrackioConfig update method fix
-"""
-
-import sys
-import os
-# Add the project root (the parent of tests/) to the path so the local trackio module can be imported
-sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
-def test_trackio_config_update():
- """Test that TrackioConfig update method works correctly"""
- print("🧪 Testing TrackioConfig update method...")
-
- try:
- # Import trackio module
- import trackio
-
- # Test that config attribute exists
- assert hasattr(trackio, 'config'), "trackio.config not found"
- print("✅ trackio.config exists")
-
- # Test that config has update method
- config = trackio.config
- assert hasattr(config, 'update'), "TrackioConfig.update method not found"
- print("✅ TrackioConfig.update method exists")
-
- # Test update method functionality with dictionary
- test_config = {
- 'project_name': 'test_project',
- 'experiment_name': 'test_experiment',
- 'new_attribute': 'test_value'
- }
-
- # Call update method with dictionary
- config.update(test_config)
-
- # Verify updates
- assert config.project_name == 'test_project', f"Expected 'test_project', got '{config.project_name}'"
- assert config.experiment_name == 'test_experiment', f"Expected 'test_experiment', got '{config.experiment_name}'"
- assert config.new_attribute == 'test_value', f"Expected 'test_value', got '{config.new_attribute}'"
-
- print("✅ TrackioConfig.update method works correctly with dictionary")
-
- # Test update method with keyword arguments (TRL style)
- config.update(allow_val_change=True, trl_setting='test_value')
-
- # Verify keyword argument updates
- assert config.allow_val_change is True, f"Expected True, got '{config.allow_val_change}'"
- assert config.trl_setting == 'test_value', f"Expected 'test_value', got '{config.trl_setting}'"
-
- print("✅ TrackioConfig.update method works correctly with keyword arguments")
- print("✅ All attributes updated successfully")
-
- return True
-
- except Exception as e:
- print(f"❌ Test failed: {e}")
- return False
-
-def test_trackio_trl_compatibility():
- """Test that trackio is fully compatible with TRL expectations"""
- print("\n🔍 Testing TRL Compatibility...")
-
- try:
- import trackio
-
- # Test all required functions exist
- required_functions = ['init', 'log', 'finish']
- for func_name in required_functions:
- assert hasattr(trackio, func_name), f"trackio.{func_name} not found"
- print(f"✅ trackio.{func_name} exists")
-
- # Test config attribute exists and has update method
- assert hasattr(trackio, 'config'), "trackio.config not found"
- assert hasattr(trackio.config, 'update'), "trackio.config.update not found"
- print("✅ trackio.config.update exists")
-
- # Test that init can be called without arguments (TRL compatibility)
- try:
- experiment_id = trackio.init()
- print(f"✅ trackio.init() called successfully: {experiment_id}")
- except Exception as e:
- print(f"❌ trackio.init() failed: {e}")
- return False
-
- # Test that log can be called
- try:
- trackio.log({'test_metric': 1.0})
- print("✅ trackio.log() called successfully")
- except Exception as e:
- print(f"❌ trackio.log() failed: {e}")
- return False
-
- # Test that finish can be called
- try:
- trackio.finish()
- print("✅ trackio.finish() called successfully")
- except Exception as e:
- print(f"❌ trackio.finish() failed: {e}")
- return False
-
- print("✅ All TRL compatibility tests passed")
- return True
-
- except Exception as e:
- print(f"❌ TRL compatibility test failed: {e}")
- return False
-
-def main():
- """Run all tests"""
- print("🧪 TrackioConfig Update Fix Test")
- print("=" * 40)
-
- # Test 1: Update method functionality
- test1_passed = test_trackio_config_update()
-
- # Test 2: TRL compatibility
- test2_passed = test_trackio_trl_compatibility()
-
- # Summary
- print("\n" + "=" * 40)
- print("📊 Test Results Summary")
- print("=" * 40)
- print(f"✅ Update Method Test: {'PASSED' if test1_passed else 'FAILED'}")
- print(f"✅ TRL Compatibility Test: {'PASSED' if test2_passed else 'FAILED'}")
-
- if test1_passed and test2_passed:
- print("\n🎉 All tests passed! TrackioConfig update fix is working correctly.")
- return True
- else:
- print("\n❌ Some tests failed. Please check the implementation.")
- return False
-
-if __name__ == "__main__":
- success = main()
- sys.exit(0 if success else 1)
\ No newline at end of file
diff --git a/tests/test_trainer_type_fix.py b/tests/test_trainer_type_fix.py
deleted file mode 100644
index 4fd200a7f232f18adb1eb5e68e592ffc36785e23..0000000000000000000000000000000000000000
--- a/tests/test_trainer_type_fix.py
+++ /dev/null
@@ -1,169 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify trainer type conversion works correctly
-"""
-
-import os
-import sys
-import subprocess
-from pathlib import Path
-
-def test_trainer_type_conversion():
- """Test that trainer type is converted to lowercase correctly"""
- print("🔍 Testing Trainer Type Conversion")
- print("=" * 50)
-
- # Test cases
- test_cases = [
- ("SFT", "sft"),
- ("DPO", "dpo"),
- ("sft", "sft"),
- ("dpo", "dpo")
- ]
-
- all_passed = True
- for input_type, expected_output in test_cases:
- # Simulate the bash conversion: echo "$TRAINER_TYPE" | tr '[:upper:]' '[:lower:]'
- converted = input_type.lower()
-
- if converted == expected_output:
- print(f"✅ '{input_type}' -> '{converted}' (expected: '{expected_output}')")
- else:
- print(f"❌ '{input_type}' -> '{converted}' (expected: '{expected_output}')")
- all_passed = False
-
- return all_passed
-
-def test_launch_script_trainer_type():
- """Test that launch script handles trainer type correctly"""
- print("\n🔍 Testing Launch Script Trainer Type Handling")
- print("=" * 50)
-
- # Check if launch.sh exists
- launch_script = Path("launch.sh")
- if not launch_script.exists():
- print("❌ launch.sh not found")
- return False
-
- # Read launch script and check for trainer type handling
- script_content = launch_script.read_text(encoding='utf-8')
-
- # Check for trainer type conversion
- conversion_patterns = [
- 'TRAINER_TYPE_LOWER=$(echo "$TRAINER_TYPE" | tr \'[:upper:]\' \'[:lower:]\')',
- '--trainer-type "$TRAINER_TYPE_LOWER"'
- ]
-
- all_found = True
- for pattern in conversion_patterns:
- if pattern in script_content:
- print(f"✅ Found: {pattern}")
- else:
- print(f"❌ Missing: {pattern}")
- all_found = False
-
- # Check that old pattern is removed
- old_pattern = '--trainer-type "$TRAINER_TYPE"'
- if old_pattern in script_content:
- print(f"❌ Found old pattern (should be updated): {old_pattern}")
- all_found = False
- else:
- print(f"✅ Old pattern removed: {old_pattern}")
-
- return all_found
-
-def test_training_script_validation():
- """Test that training script accepts the correct trainer types"""
- print("\n🔍 Testing Training Script Validation")
- print("=" * 50)
-
- # Check if training script exists
- training_script = Path("scripts/training/train.py")
- if not training_script.exists():
- print("❌ Training script not found")
- return False
-
- # Read training script and check for argument validation
- script_content = training_script.read_text(encoding='utf-8')
-
- # Check for trainer type argument definition
- if '--trainer-type' in script_content:
- print("✅ Found trainer-type argument in training script")
- else:
- print("❌ Missing trainer-type argument in training script")
- return False
-
- # Check for valid choices
- if 'sft' in script_content and 'dpo' in script_content:
- print("✅ Found valid trainer type choices: sft, dpo")
- else:
- print("❌ Missing valid trainer type choices")
- return False
-
- return True
-
-def test_trainer_type_integration():
- """Test that trainer type integration works end-to-end"""
- print("\n🔍 Testing Trainer Type Integration")
- print("=" * 50)
-
- # Test the conversion logic
- test_cases = [
- ("SFT", "sft"),
- ("DPO", "dpo")
- ]
-
- all_passed = True
- for input_type, expected_output in test_cases:
- # Simulate the bash conversion
- converted = input_type.lower()
-
- # Check if the converted value is valid for the training script
- valid_types = ["sft", "dpo"]
-
- if converted in valid_types:
- print(f"✅ '{input_type}' -> '{converted}' (valid for training script)")
- else:
- print(f"❌ '{input_type}' -> '{converted}' (invalid for training script)")
- all_passed = False
-
- return all_passed
-
-def main():
- """Run all trainer type fix tests"""
- print("🚀 Trainer Type Fix Verification")
- print("=" * 50)
-
- tests = [
- test_trainer_type_conversion,
- test_launch_script_trainer_type,
- test_training_script_validation,
- test_trainer_type_integration
- ]
-
- all_passed = True
- for test in tests:
- try:
- if not test():
- all_passed = False
- except Exception as e:
- print(f"❌ Test failed with error: {e}")
- all_passed = False
-
- print("\n" + "=" * 50)
- if all_passed:
- print("🎉 ALL TRAINER TYPE FIX TESTS PASSED!")
- print("✅ Trainer type conversion: Working")
- print("✅ Launch script handling: Working")
- print("✅ Training script validation: Working")
- print("✅ Integration: Working")
- print("\nThe trainer type fix is working correctly!")
- else:
- print("❌ SOME TRAINER TYPE FIX TESTS FAILED!")
- print("Please check the failed tests above.")
-
- return all_passed
-
-if __name__ == "__main__":
- success = main()
- sys.exit(0 if success else 1)
\ No newline at end of file
diff --git a/tests/test_training_fix.py b/tests/test_training_fix.py
deleted file mode 100644
index 28e206afdd1b6ac3a25953210312d5b6b6377d59..0000000000000000000000000000000000000000
--- a/tests/test_training_fix.py
+++ /dev/null
@@ -1,217 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify the training pipeline fixes
-"""
-
-import os
-import sys
-import logging
-from pathlib import Path
-
-# Add project root to path
-project_root = Path(__file__).parent.parent
-sys.path.insert(0, str(project_root))
-
-def test_imports():
- """Test that all imports work correctly"""
- print("🔍 Testing imports...")
-
- try:
- from src.config import get_config
- print("✅ config.py imported successfully")
- except Exception as e:
- print(f"❌ config.py import failed: {e}")
- return False
-
- try:
- from src.model import SmolLM3Model
- print("✅ model.py imported successfully")
- except Exception as e:
- print(f"❌ model.py import failed: {e}")
- return False
-
- try:
- from src.data import SmolLM3Dataset
- print("✅ data.py imported successfully")
- except Exception as e:
- print(f"❌ data.py import failed: {e}")
- return False
-
- try:
- from src.trainer import SmolLM3Trainer
- print("✅ trainer.py imported successfully")
- except Exception as e:
- print(f"❌ trainer.py import failed: {e}")
- return False
-
- try:
- from src.monitoring import create_monitor_from_config
- print("✅ monitoring.py imported successfully")
- except Exception as e:
- print(f"❌ monitoring.py import failed: {e}")
- return False
-
- return True
-
-def test_config_loading():
- """Test configuration loading"""
- print("\n🔍 Testing configuration loading...")
-
- try:
- from src.config import get_config
-
- # Test loading the H100 lightweight config
- config = get_config("config/train_smollm3_h100_lightweight.py")
- print("✅ Configuration loaded successfully")
- print(f" Model: {config.model_name}")
- print(f" Dataset: {config.dataset_name}")
- print(f" Batch size: {config.batch_size}")
- print(f" Learning rate: {config.learning_rate}")
-
- return True
- except Exception as e:
- print(f"❌ Configuration loading failed: {e}")
- return False
-
-def test_monitoring_setup():
- """Test monitoring setup without Trackio Space"""
- print("\n🔍 Testing monitoring setup...")
-
- try:
- from src.monitoring import create_monitor_from_config
- from src.config import get_config
-
- # Load config
- config = get_config("config/train_smollm3_h100_lightweight.py")
-
- # Set Trackio URL to a non-existent one to test fallback
- config.trackio_url = "https://non-existent-space.hf.space"
- config.experiment_name = "test_experiment"
-
- # Create monitor
- monitor = create_monitor_from_config(config)
- print("✅ Monitoring setup successful")
- print(f" Experiment: {monitor.experiment_name}")
- print(f" Tracking enabled: {monitor.enable_tracking}")
- print(f" HF Dataset: {monitor.dataset_repo}")
-
- return True
- except Exception as e:
- print(f"❌ Monitoring setup failed: {e}")
- return False
-
-def test_trainer_creation():
- """Test trainer creation"""
- print("\n🔍 Testing trainer creation...")
-
- try:
- from src.config import get_config
- from src.model import SmolLM3Model
- from src.data import SmolLM3Dataset
- from src.trainer import SmolLM3Trainer
-
- # Load config
- config = get_config("config/train_smollm3_h100_lightweight.py")
-
- # Create model (without loading the actual model)
- model = SmolLM3Model(
- model_name=config.model_name,
- max_seq_length=config.max_seq_length,
- config=config
- )
- print("✅ Model created successfully")
-
- # Create dataset (without loading actual data)
- dataset = SmolLM3Dataset(
- data_path=config.dataset_name,
- tokenizer=model.tokenizer,
- max_seq_length=config.max_seq_length,
- config=config
- )
- print("✅ Dataset created successfully")
-
- # Create trainer
- trainer = SmolLM3Trainer(
- model=model,
- dataset=dataset,
- config=config,
- output_dir="/tmp/test_output",
- init_from="scratch"
- )
- print("✅ Trainer created successfully")
-
- return True
- except Exception as e:
- print(f"❌ Trainer creation failed: {e}")
- return False
-
-def test_format_string_fix():
- """Test that the format string fix works"""
- print("\n🔍 Testing format string fix...")
-
- try:
- from src.trainer import SmolLM3Trainer
-
- # Test the SimpleConsoleCallback format string handling
- from transformers import TrainerCallback
-
- class TestCallback(TrainerCallback):
- def on_log(self, args, state, control, logs=None, **kwargs):
- if logs and isinstance(logs, dict):
- step = getattr(state, 'global_step', 'unknown')
- loss = logs.get('loss', 'N/A')
- lr = logs.get('learning_rate', 'N/A')
-
- # Test the fixed format string logic
- if isinstance(loss, (int, float)):
- loss_str = f"{loss:.4f}"
- else:
- loss_str = str(loss)
- if isinstance(lr, (int, float)):
- lr_str = f"{lr:.2e}"
- else:
- lr_str = str(lr)
-
- print(f"Step {step}: loss={loss_str}, lr={lr_str}")
-
- # Exercise the callback once so the format logic above actually runs
- class _DummyState:
- global_step = 10
- TestCallback().on_log(None, _DummyState(), None, logs={'loss': 0.5, 'learning_rate': 1e-4})
-
- print("✅ Format string fix works correctly")
- return True
- except Exception as e:
- print(f"❌ Format string fix test failed: {e}")
- return False
-
-def main():
- """Run all tests"""
- print("🚀 Testing SmolLM3 Training Pipeline Fixes")
- print("=" * 50)
-
- tests = [
- test_imports,
- test_config_loading,
- test_monitoring_setup,
- test_trainer_creation,
- test_format_string_fix
- ]
-
- passed = 0
- total = len(tests)
-
- for test in tests:
- try:
- if test():
- passed += 1
- except Exception as e:
- print(f"❌ Test {test.__name__} crashed: {e}")
-
- print(f"\n📊 Test Results: {passed}/{total} tests passed")
-
- if passed == total:
- print("✅ All tests passed! The training pipeline should work correctly.")
- return True
- else:
- print("❌ Some tests failed. Please check the errors above.")
- return False
-
-if __name__ == "__main__":
- success = main()
- sys.exit(0 if success else 1)
\ No newline at end of file
diff --git a/tests/test_training_fix_1.py b/tests/test_training_fix_1.py
deleted file mode 100644
index 3845dd5942eca78b1941b59f7cea3683e8ed07de..0000000000000000000000000000000000000000
--- a/tests/test_training_fix_1.py
+++ /dev/null
@@ -1,62 +0,0 @@
-#!/usr/bin/env python3
-"""
-Quick test to verify the training configuration fix
-"""
-
-import os
-import sys
-
-# Add project root (the parent of tests/) to path
-project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-sys.path.insert(0, project_root)
-
-def test_configuration():
- """Test the H100 lightweight configuration"""
- print("Testing H100 Lightweight Configuration...")
-
- try:
- from config.train_smollm3_h100_lightweight import SmolLM3ConfigH100Lightweight
-
- config = SmolLM3ConfigH100Lightweight()
-
- print("✅ Configuration loaded successfully")
- print(f" Model: {config.model_name}")
- print(f" Batch size: {config.batch_size}")
- print(f" Learning rate: {config.learning_rate}")
- print(f" FP16: {config.fp16}")
- print(f" BF16: {config.bf16}")
- print(f" Mixed precision: {'fp16' if config.fp16 else 'bf16'}")
- print(f" Sample size: {config.sample_size}")
-
- # Test training arguments creation
- from src.model import SmolLM3Model
-
- # Create a minimal model instance for testing
- model = SmolLM3Model(
- model_name="HuggingFaceTB/SmolLM3-3B",
- max_seq_length=4096,
- config=config
- )
-
- # Test training arguments
- training_args = model.get_training_arguments("/tmp/test")
- print(f"✅ Training arguments created successfully")
- print(f" Training args FP16: {training_args.fp16}")
- print(f" Training args BF16: {training_args.bf16}")
-
- return True
-
- except Exception as e:
- print(f"❌ Error: {e}")
- import traceback
- traceback.print_exc()
- return False
-
-if __name__ == "__main__":
- success = test_configuration()
- if success:
- print("\n🎉 Configuration test passed!")
- print("You can now run the training with: ./launch.sh")
- else:
- print("\n❌ Configuration test failed!")
- sys.exit(1)
\ No newline at end of file
diff --git a/tests/test_training_fixes.py b/tests/test_training_fixes.py
deleted file mode 100644
index 413de2e7c60fadbc9f1e57ca8b11c00eb86d6ab1..0000000000000000000000000000000000000000
--- a/tests/test_training_fixes.py
+++ /dev/null
@@ -1,244 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify all training fixes work correctly
-"""
-
-import os
-import sys
-import subprocess
-from pathlib import Path
-
-def test_trainer_type_fix():
- """Test that trainer type conversion works correctly"""
- print("🔍 Testing Trainer Type Fix")
- print("=" * 50)
-
- # Test cases
- test_cases = [
- ("SFT", "sft"),
- ("DPO", "dpo"),
- ("sft", "sft"),
- ("dpo", "dpo")
- ]
-
- all_passed = True
- for input_type, expected_output in test_cases:
- converted = input_type.lower()
- if converted == expected_output:
- print(f"✅ '{input_type}' -> '{converted}' (expected: '{expected_output}')")
- else:
- print(f"❌ '{input_type}' -> '{converted}' (expected: '{expected_output}')")
- all_passed = False
-
- return all_passed
-
-def test_trackio_conflict_fix():
- """Test that trackio package conflicts are handled"""
- print("\n🔍 Testing Trackio Conflict Fix")
- print("=" * 50)
-
- try:
- # Test monitoring import
- sys.path.append(str(Path(__file__).parent.parent / "src"))
- from monitoring import SmolLM3Monitor
-
- # Test monitor creation
- monitor = SmolLM3Monitor("test-experiment")
- print("✅ Monitor created successfully")
- print(f" Dataset repo: {monitor.dataset_repo}")
- print(f" Enable tracking: {monitor.enable_tracking}")
-
- # Check that dataset repo is not empty
- if monitor.dataset_repo and monitor.dataset_repo.strip() != '':
- print("✅ Dataset repository is properly set")
- else:
- print("❌ Dataset repository is empty")
- return False
-
- return True
-
- except Exception as e:
- print(f"❌ Trackio conflict fix failed: {e}")
- return False
-
-def test_dataset_repo_fix():
- """Test that dataset repository is properly set"""
- print("\n🔍 Testing Dataset Repository Fix")
- print("=" * 50)
-
- # Test environment variable handling
- test_cases = [
- ("user/test-dataset", "user/test-dataset"),
- ("", "tonic/trackio-experiments"), # Default fallback
- (None, "tonic/trackio-experiments"), # Default fallback
- ]
-
- all_passed = True
- for input_repo, expected_repo in test_cases:
- # Simulate the monitoring logic
- if input_repo and input_repo.strip() != '':
- actual_repo = input_repo
- else:
- actual_repo = "tonic/trackio-experiments"
-
- if actual_repo == expected_repo:
- print(f"✅ '{input_repo}' -> '{actual_repo}' (expected: '{expected_repo}')")
- else:
- print(f"❌ '{input_repo}' -> '{actual_repo}' (expected: '{expected_repo}')")
- all_passed = False
-
- return all_passed
-
-def test_launch_script_fixes():
- """Test that launch script fixes are in place"""
- print("\n🔍 Testing Launch Script Fixes")
- print("=" * 50)
-
- # Check if launch.sh exists
- launch_script = Path("launch.sh")
- if not launch_script.exists():
- print("❌ launch.sh not found")
- return False
-
- # Read launch script and check for fixes
- script_content = launch_script.read_text(encoding='utf-8')
-
- # Check for trainer type conversion
- if 'TRAINER_TYPE_LOWER=$(echo "$TRAINER_TYPE" | tr \'[:upper:]\' \'[:lower:]\')' in script_content:
- print("✅ Trainer type conversion found")
- else:
- print("❌ Trainer type conversion missing")
- return False
-
- # Check for trainer type usage
- if '--trainer-type "$TRAINER_TYPE_LOWER"' in script_content:
- print("✅ Trainer type usage updated")
- else:
- print("❌ Trainer type usage not updated")
- return False
-
- # Check for dataset repository default
- if 'TRACKIO_DATASET_REPO="$HF_USERNAME/trackio-experiments"' in script_content:
- print("✅ Dataset repository default found")
- else:
- print("❌ Dataset repository default missing")
- return False
-
- # Check for dataset repository validation
- if 'if [ -z "$TRACKIO_DATASET_REPO" ]' in script_content:
- print("✅ Dataset repository validation found")
- else:
- print("❌ Dataset repository validation missing")
- return False
-
- return True
-
-def test_monitoring_fixes():
- """Test that monitoring fixes are in place"""
- print("\n🔍 Testing Monitoring Fixes")
- print("=" * 50)
-
- # Check if monitoring.py exists
- monitoring_file = Path("src/monitoring.py")
- if not monitoring_file.exists():
- print("❌ monitoring.py not found")
- return False
-
- # Read monitoring file and check for fixes
- script_content = monitoring_file.read_text(encoding='utf-8')
-
- # Check for trackio conflict handling
- if 'import trackio' in script_content:
- print("✅ Trackio conflict handling found")
- else:
- print("❌ Trackio conflict handling missing")
- return False
-
- # Check for dataset repository validation
- if 'if not self.dataset_repo or self.dataset_repo.strip() == \'\'' in script_content:
- print("✅ Dataset repository validation found")
- else:
- print("❌ Dataset repository validation missing")
- return False
-
- # Check for improved error handling
- if 'Trackio Space not accessible' in script_content:
- print("✅ Improved Trackio error handling found")
- else:
- print("❌ Improved Trackio error handling missing")
- return False
-
- return True
-
-def test_training_script_validation():
- """Test that training script accepts correct parameters"""
- print("\n🔍 Testing Training Script Validation")
- print("=" * 50)
-
- # Check if training script exists
- training_script = Path("scripts/training/train.py")
- if not training_script.exists():
- print("❌ Training script not found")
- return False
-
- # Read training script and check for argument validation
- script_content = training_script.read_text(encoding='utf-8')
-
- # Check for trainer type argument
- if '--trainer-type' in script_content:
- print("✅ Trainer type argument found")
- else:
- print("❌ Trainer type argument missing")
- return False
-
- # Check for valid choices
- if 'choices=[\'sft\', \'dpo\']' in script_content:
- print("✅ Valid trainer type choices found")
- else:
- print("❌ Valid trainer type choices missing")
- return False
-
- return True
-
-def main():
- """Run all training fix tests"""
- print("🚀 Training Fixes Verification")
- print("=" * 50)
-
- tests = [
- test_trainer_type_fix,
- test_trackio_conflict_fix,
- test_dataset_repo_fix,
- test_launch_script_fixes,
- test_monitoring_fixes,
- test_training_script_validation
- ]
-
- all_passed = True
- for test in tests:
- try:
- if not test():
- all_passed = False
- except Exception as e:
- print(f"❌ Test failed with error: {e}")
- all_passed = False
-
- print("\n" + "=" * 50)
- if all_passed:
- print("🎉 ALL TRAINING FIXES PASSED!")
- print("✅ Trainer type conversion: Working")
- print("✅ Trackio conflict handling: Working")
- print("✅ Dataset repository fixes: Working")
- print("✅ Launch script fixes: Working")
- print("✅ Monitoring fixes: Working")
- print("✅ Training script validation: Working")
- print("\nAll training issues have been resolved!")
- else:
- print("❌ SOME TRAINING FIXES FAILED!")
- print("Please check the failed tests above.")
-
- return all_passed
-
-if __name__ == "__main__":
- success = main()
- sys.exit(0 if success else 1)
\ No newline at end of file
diff --git a/tests/test_update_fix.py b/tests/test_update_fix.py
deleted file mode 100644
index eeb65714b1e1b93ed9a59155c59245d278f9afee..0000000000000000000000000000000000000000
--- a/tests/test_update_fix.py
+++ /dev/null
@@ -1,27 +0,0 @@
-#!/usr/bin/env python3
-"""
-Simple test for TrackioConfig update method fix
-"""
-
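-# Assumes the script is run from the project root so the project's own trackio module is imported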
-import trackio
-
-print("Testing TrackioConfig update method...")
-
-# Test that config exists and has update method
-config = trackio.config
-print(f"Config type: {type(config)}")
-print(f"Has update method: {hasattr(config, 'update')}")
-
-# Test update functionality
-test_data = {
- 'project_name': 'test_project',
- 'experiment_name': 'test_experiment',
- 'new_attribute': 'test_value'
-}
-
-print(f"Before update - project_name: {config.project_name}")
-config.update(test_data)
-print(f"After update - project_name: {config.project_name}")
-print(f"New attribute: {config.new_attribute}")
-
-print("✅ Update method works correctly!")
\ No newline at end of file
diff --git a/tests/test_update_kwargs_1.py b/tests/test_update_kwargs_1.py
deleted file mode 100644
index 2973da35fcad741a3f4f4018508e4ad3aa504630..0000000000000000000000000000000000000000
--- a/tests/test_update_kwargs_1.py
+++ /dev/null
@@ -1,35 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify TrackioConfig update method works with keyword arguments
-"""
-
-import trackio
-
-print("Testing TrackioConfig update method with keyword arguments...")
-
-# Test that config exists and has update method
-config = trackio.config
-print(f"Config type: {type(config)}")
-print(f"Has update method: {hasattr(config, 'update')}")
-
-# Test update with keyword arguments (like TRL does)
-print(f"Before update - project_name: {config.project_name}")
-config.update(allow_val_change=True, project_name="test_project")
-print(f"After update - project_name: {config.project_name}")
-print(f"New attribute allow_val_change: {config.allow_val_change}")
-
-# Test update with dictionary
-test_data = {
- 'experiment_name': 'test_experiment',
- 'new_attribute': 'test_value'
-}
-config.update(test_data)
-print(f"After dict update - experiment_name: {config.experiment_name}")
-print(f"New attribute: {config.new_attribute}")
-
-# Test update with both dictionary and keyword arguments
-config.update({'another_attr': 'dict_value'}, kwarg_attr='keyword_value')
-print(f"Another attr: {config.another_attr}")
-print(f"Kwarg attr: {config.kwarg_attr}")
-
-print("✅ Update method works correctly with keyword arguments!")
\ No newline at end of file
diff --git a/tests/verify_fix.py b/tests/verify_fix.py
deleted file mode 100644
index 513d5179d0435408fbad30db65dbeb17a25c172a..0000000000000000000000000000000000000000
--- a/tests/verify_fix.py
+++ /dev/null
@@ -1,35 +0,0 @@
-#!/usr/bin/env python3
-"""
-Simple verification script for TrackioConfig update fix
-"""
-
-try:
- import trackio
- print("✅ Trackio imported successfully")
-
- # Test config access
- config = trackio.config
- print(f"✅ Config accessed: {type(config)}")
-
- # Test update method exists
- print(f"✅ Update method exists: {hasattr(config, 'update')}")
-
- # Test update with keyword arguments (TRL style)
- config.update(allow_val_change=True, test_attr='test_value')
- print(f"✅ Update with kwargs worked: allow_val_change={config.allow_val_change}, test_attr={config.test_attr}")
-
- # Test update with dictionary
- config.update({'project_name': 'test_project', 'new_attr': 'dict_value'})
- print(f"✅ Update with dict worked: project_name={config.project_name}, new_attr={config.new_attr}")
-
- # Test TRL functions
- print(f"✅ Init function exists: {hasattr(trackio, 'init')}")
- print(f"✅ Log function exists: {hasattr(trackio, 'log')}")
- print(f"✅ Finish function exists: {hasattr(trackio, 'finish')}")
-
- print("\n🎉 All tests passed! The fix is working correctly.")
-
-except Exception as e:
- print(f"❌ Test failed: {e}")
- import traceback
- traceback.print_exc()
\ No newline at end of file
diff --git a/tests/verify_fix_1.py b/tests/verify_fix_1.py
deleted file mode 100644
index 513d5179d0435408fbad30db65dbeb17a25c172a..0000000000000000000000000000000000000000
--- a/tests/verify_fix_1.py
+++ /dev/null
@@ -1,35 +0,0 @@
-#!/usr/bin/env python3
-"""
-Simple verification script for TrackioConfig update fix
-"""
-
-try:
- import trackio
- print("✅ Trackio imported successfully")
-
- # Test config access
- config = trackio.config
- print(f"✅ Config accessed: {type(config)}")
-
- # Test update method exists
- print(f"✅ Update method exists: {hasattr(config, 'update')}")
-
- # Test update with keyword arguments (TRL style)
- config.update(allow_val_change=True, test_attr='test_value')
- print(f"✅ Update with kwargs worked: allow_val_change={config.allow_val_change}, test_attr={config.test_attr}")
-
- # Test update with dictionary
- config.update({'project_name': 'test_project', 'new_attr': 'dict_value'})
- print(f"✅ Update with dict worked: project_name={config.project_name}, new_attr={config.new_attr}")
-
- # Test TRL functions
- print(f"✅ Init function exists: {hasattr(trackio, 'init')}")
- print(f"✅ Log function exists: {hasattr(trackio, 'log')}")
- print(f"✅ Finish function exists: {hasattr(trackio, 'finish')}")
-
- print("\n🎉 All tests passed! The fix is working correctly.")
-
-except Exception as e:
- print(f"❌ Test failed: {e}")
- import traceback
- traceback.print_exc()
\ No newline at end of file