Tonic commited on
Commit
8f6fe61
·
verified ·
1 Parent(s): 42f4411

adds more documentation

Browse files
Files changed (2) hide show
  1. .cursorrules +277 -0
  2. .gitignore +2 -1
.cursorrules ADDED
@@ -0,0 +1,277 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ description: SmolLM3 Fine-tuning Pipeline - Project Rules and Conventions
3
+ globs: ["**/*.py", "**/*.sh", "**/*.md", "**/*.json"]
4
+ alwaysApply: true
5
+ ---
6
+
7
+ # SmolLM3 Fine-tuning Pipeline Project Rules
8
+
9
+ ## Project Overview
10
+ This is a comprehensive end-to-end fine-tuning pipeline for SmolLM3 models with Trackio monitoring, Hugging Face integration, and interactive configuration management.
11
+
12
+ ## Core Architecture
13
+
14
+ ### Directory Structure
15
+ - `config/` - Training configuration files for different scenarios
16
+ - `src/` - Core training and model logic
17
+ - `scripts/` - Utility scripts for deployment, dataset management, and model pushing
18
+ - `docs/` - Comprehensive documentation and guides
19
+ - `templates/` - Templates for HF Spaces and datasets
20
+ - `tests/` - Test files and debugging scripts
21
+ - `outputs/` - Training outputs and checkpoints
22
+
23
+ ### Key Components
24
+
25
+ #### Training Configurations
26
+ - **Basic Training**: SmolLM3-3B + OpenHermes-FR, 3 epochs, batch size 2
27
+ - **H100 Lightweight**: SmolLM3-3B + OpenHermes-FR (80K samples), 1 epoch, batch size 16
28
+ - **A100 Large Scale**: SmolLM3-3B + OpenHermes-FR, 1.3 passes, batch size 8
29
+ - **Multiple Passes**: SmolLM3-3B + OpenHermes-FR, 4 epochs, batch size 6
30
+ - **Custom Configuration**: User-defined parameters
31
+
32
+ #### Core Modules
33
+ - `src/train.py` - Main training orchestration
34
+ - `src/model.py` - Model loading and configuration
35
+ - `src/data.py` - Dataset processing and loading
36
+ - `src/monitoring.py` - Trackio integration and metrics
37
+ - `src/trainer.py` - Training loop and optimization
38
+
39
+ ## Coding Conventions
40
+
41
+ ### Python Style
42
+ - Use type hints for all function parameters and return values
43
+ - Follow PEP 8 for formatting
44
+ - Use descriptive variable names in snake_case
45
+ - Add comprehensive docstrings for all functions
46
+ - Use f-strings for string formatting
47
+
48
+ ### Configuration Management
49
+ - All training configs inherit from `SmolLM3Config` base class
50
+ - Use dataclasses for configuration objects
51
+ - Validate configuration parameters in __post_init__
52
+ - Support both YAML and Python configuration files
53
+
54
+ ### Error Handling
55
+ - Use try-except blocks for external API calls (HF, Trackio)
56
+ - Log errors with appropriate context
57
+ - Provide user-friendly error messages
58
+ - Implement graceful degradation for optional features
59
+
60
+ ### Monitoring Integration
61
+ - Always include Trackio URL and experiment name in configs
62
+ - Log metrics every N steps (configurable)
63
+ - Save checkpoints and artifacts to HF Datasets
64
+ - Use structured logging with consistent field names
65
+
66
+ ## File Naming Conventions
67
+
68
+ ### Configuration Files
69
+ - `train_smollm3_*.py` - Training configurations
70
+ - `*_config.py` - General configuration files
71
+ - Use descriptive suffixes: `_h100_lightweight`, `_a100_large`, `_multiple_passes`
72
+
73
+ ### Script Files
74
+ - `deploy_*.py` - Deployment scripts
75
+ - `setup_*.py` - Setup and initialization scripts
76
+ - `push_*.py` - Model pushing scripts
77
+ - `configure_*.py` - Configuration scripts
78
+
79
+ ### Test Files
80
+ - `test_*.py` - Test files
81
+ - `debug_*.py` - Debugging scripts
82
+ - Include descriptive names indicating what they test
83
+
84
+ ## Training Pipeline Workflow
85
+
86
+ ### Interactive Pipeline (`launch.sh`)
87
+ 1. **Authentication**: HF username and token validation
88
+ 2. **Configuration Selection**: Choose from predefined configs or custom
89
+ 3. **Experiment Setup**: Configure experiment name and repositories
90
+ 4. **Environment Setup**: Install dependencies and setup virtual environment
91
+ 5. **Deployment**: Deploy Trackio Space and setup HF Dataset
92
+ 6. **Training**: Execute training with monitoring
93
+ 7. **Model Push**: Upload model to HF Hub with documentation
94
+ 8. **Testing**: Validate uploaded model functionality
95
+
96
+ ### Configuration Selection Logic
97
+ - Basic Training: Default for beginners and learning
98
+ - H100 Lightweight: Rapid experiments on H100 GPUs
99
+ - A100 Large Scale: Serious research and production
100
+ - Multiple Passes: Thorough training for production models
101
+ - Custom: User-defined parameters for specific needs
102
+
103
+ ## Dataset Management
104
+
105
+ ### Supported Formats
106
+ - Hugging Face Datasets format
107
+ - JSON files with prompt/completion pairs
108
+ - Chat format with messages array
109
+ - Custom formats with conversion functions
110
+
111
+ ### Dataset Processing
112
+ - Automatic format detection and conversion
113
+ - Random sampling for lightweight configurations
114
+ - Validation split creation
115
+ - Bad entry filtering and handling
116
+
117
+ ### Dataset Sampling (H100 Lightweight)
118
+ - 80,000 random samples from OpenHermes-FR
119
+ - 1,000 validation samples (if available)
120
+ - Fixed random seed (42) for reproducibility
121
+ - Automatic sampling during dataset preparation
122
+
123
+ ## Model Management
124
+
125
+ ### Model Loading
126
+ - Support for HuggingFaceTB/SmolLM3-3B
127
+ - Flash attention and gradient checkpointing
128
+ - Mixed precision training (fp16/bf16)
129
+ - Device mapping and memory optimization
130
+
131
+ ### Model Pushing
132
+ - Comprehensive model cards with training details
133
+ - Automatic README generation
134
+ - License and usage information
135
+ - Training metrics and configuration
136
+
137
+ ## Monitoring and Tracking
138
+
139
+ ### Trackio Integration
140
+ - Real-time metrics logging
141
+ - Training curves visualization
142
+ - Resource usage monitoring
143
+ - Artifact storage and versioning
144
+
145
+ ### Metrics to Track
146
+ - Training and validation loss
147
+ - Learning rate schedule
148
+ - Gradient norms
149
+ - GPU utilization and memory
150
+ - Training speed (steps/second)
151
+
152
+ ## Error Handling and Validation
153
+
154
+ ### Input Validation
155
+ - Validate HF tokens before use
156
+ - Check CUDA availability
157
+ - Verify dataset accessibility
158
+ - Validate configuration parameters
159
+
160
+ ### Error Recovery
161
+ - Graceful handling of network issues
162
+ - Automatic retry for failed operations
163
+ - Checkpoint recovery for interrupted training
164
+ - Fallback options for optional features
165
+
166
+ ## Documentation Standards
167
+
168
+ ### README Files
169
+ - Clear project description
170
+ - Installation instructions
171
+ - Usage examples
172
+ - Configuration options
173
+ - Troubleshooting guide
174
+
175
+ ### Code Documentation
176
+ - Comprehensive docstrings
177
+ - Type hints for all functions
178
+ - Example usage in docstrings
179
+ - Parameter descriptions
180
+ - Return value documentation
181
+
182
+ ## Testing and Validation
183
+
184
+ ### Test Categories
185
+ - Unit tests for core functions
186
+ - Integration tests for pipeline
187
+ - Configuration validation tests
188
+ - Model loading and saving tests
189
+ - Dataset processing tests
190
+
191
+ ### Debugging Tools
192
+ - Standalone test scripts
193
+ - Configuration validation
194
+ - Model testing utilities
195
+ - Dataset inspection tools
196
+
197
+ ## Performance Optimization
198
+
199
+ ### H100 Optimizations
200
+ - Larger batch sizes (16 vs 8 for A100)
201
+ - Reduced gradient accumulation (4 vs 16)
202
+ - Higher learning rates (8e-6 vs 5e-6)
203
+ - Optimized data loading (4 workers, pin memory)
204
+
205
+ ### Memory Management
206
+ - Gradient checkpointing for large models
207
+ - Mixed precision training
208
+ - Dynamic batch sizing
209
+ - Memory-efficient data loading
210
+
211
+ ## Security and Best Practices
212
+
213
+ ### Token Management
214
+ - Never hardcode tokens in code
215
+ - Use environment variables
216
+ - Validate tokens before use
217
+ - Secure token storage
218
+
219
+ ### Data Privacy
220
+ - Filter sensitive data from datasets
221
+ - Validate dataset contents
222
+ - Secure data transmission
223
+ - Proper data disposal
224
+
225
+ ## Deployment and CI/CD
226
+
227
+ ### Environment Setup
228
+ - Python virtual environments
229
+ - CUDA-compatible PyTorch
230
+ - Required dependencies installation
231
+ - System package management
232
+
233
+ ### Automated Deployment
234
+ - Trackio Space deployment
235
+ - HF Dataset setup
236
+ - Model repository creation
237
+ - Configuration file generation
238
+
239
+ ## Troubleshooting Guidelines
240
+
241
+ ### Common Issues
242
+ - CUDA out of memory: Reduce batch size
243
+ - Network timeouts: Check internet connection
244
+ - Token validation: Verify HF token permissions
245
+ - Dataset loading: Check dataset accessibility
246
+
247
+ ### Debugging Steps
248
+ 1. Check system requirements
249
+ 2. Validate configuration
250
+ 3. Test individual components
251
+ 4. Review logs and error messages
252
+ 5. Verify external service connectivity
253
+
254
+ ## Future Enhancements
255
+
256
+ ### Planned Features
257
+ - Multi-GPU training support
258
+ - Advanced dataset sampling strategies
259
+ - Automated hyperparameter optimization
260
+ - Enhanced monitoring and visualization
261
+ - Support for additional model architectures
262
+
263
+ ### Extensibility
264
+ - Modular configuration system
265
+ - Plugin architecture for custom features
266
+ - Support for custom datasets and models
267
+ - Flexible monitoring integration
268
+
269
+ ---
270
+
271
+ **When working with this codebase:**
272
+ - Always consider the end-to-end pipeline workflow
273
+ - Follow the established configuration patterns
274
+ - Include proper error handling and validation
275
+ - Maintain comprehensive documentation
276
+ - Test changes thoroughly before deployment
277
+ - Consider performance implications for different hardware configurations
.gitignore CHANGED
@@ -1,4 +1,5 @@
1
- .cursorrules/
 
2
  *.mdc
3
 
4
  # Python
 
1
+ .cursor/
2
+ .cursor/rules/
3
  *.mdc
4
 
5
  # Python