---
title: TorchTransformers Diffusion CV SFT
emoji: ⚡
colorFrom: yellow
colorTo: indigo
sdk: streamlit
sdk_version: 1.43.2
app_file: app.py
pinned: false
license: mit
short_description: Torch Transformers Diffusion SFT for Computer Vision
---

## Integration Details

1. SFT Tiny Titans (First Listing):
   - Features: Causal LM and Diffusion SFT, camera snap, RAG party.
   - Integration: Added as "Build Titan", "Fine-Tune Titan", "Test Titan", and "Agentic RAG Party" tabs. Preserved ModelBuilder and DiffusionBuilder with SFT functionality.
2. SFT Tiny Titans (Second Listing):
   - Features: Enhanced Causal LM SFT with sample CSV generation, export functionality, and a RAG demo.
   - Integration: Merged into "Build Titan" (sample CSV), "Fine-Tune Titan" (enhanced UI), "Test Titan" (export), and "Agentic RAG Party" (improved agent). Used PartyPlannerAgent from this listing for its detailed RAG output.
3. AI Vision Titans (Current):
   - Features: PDF snapshotting, OCR with GOT-OCR2_0, image generation, and line drawings.
   - Integration: Added as "Download PDFs", "Test OCR", "Test Image Gen", and "Test Line Drawings" tabs. Retained async processing and gallery updates.
4. Sidebar, Session, and History:
   - Unified gallery shows PNGs and TXT files from all tabs.
   - Session state (captured_files, builder, model_loaded, processing, history) tracks all operations; a minimal initialization sketch follows this list.
   - History log in the sidebar records key actions (snapshots, SFT, tests).
5. Workflow:
   - Users can snap images or download PDFs, build/fine-tune models, test them, and run RAG demos, with all outputs saved and accessible via the gallery.
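
The exact wiring in `app.py` is not shown here; the sketch below only illustrates how the session-state keys named in item 4 might be initialized, with hypothetical default values and a hypothetical `log_history` helper.

```python
import streamlit as st

# Hypothetical defaults for the session-state keys listed in item 4;
# the real app.py may use different initial values or types.
DEFAULTS = {
    "captured_files": [],   # paths of snapped PNGs and saved TXT outputs
    "builder": None,        # current ModelBuilder / DiffusionBuilder instance
    "model_loaded": False,  # whether a Titan is ready for testing
    "processing": {},       # per-tab async status flags
    "history": [],          # sidebar log of key actions
}

for key, value in DEFAULTS.items():
    if key not in st.session_state:
        st.session_state[key] = value

def log_history(message: str) -> None:
    """Append an action (snapshot, SFT run, test) to the sidebar history log."""
    st.session_state["history"].append(message)
```
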
6. Verification:
   - Run the app: `streamlit run app.py`
7. Check:
   - Camera Snap: Capture images and verify they appear in the gallery.
   - Download PDFs: Test with a valid, direct PDF URL and check the snapshots.
   - Build/Fine-Tune Titan: Build a Causal LM or Diffusion model, fine-tune with a CSV or images, and save the outputs.
   - Test Titan: Evaluate the Causal LM with prompts or generate Diffusion images, then check the history.
   - Agentic RAG Party: Run the NLP or CV RAG demos and verify the outputs.
   - Test OCR/Image Gen/Line Drawings: Process images and ensure the outputs are saved and appear in the gallery.
8. Expected Logs: "Saved snapshot...", "Model loaded...", "SFT completed...", etc.
9. Notes:
   - PDF URLs: Download links must point directly to a PDF file (e.g., via Archive.org's /download/ path); adjust URLs as needed.
   - Compatibility: All features default to CPU for broad compatibility, with a CUDA fallback where available (see the device sketch after this list).
   - Session State: Persistent across tabs, ensuring workflow continuity.
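
A minimal sketch of that CPU-default, CUDA-fallback pattern; the helper name below is hypothetical rather than taken from `app.py`.

```python
import torch

def pick_device() -> torch.device:
    """Hypothetical helper: stay on CPU for broad compatibility and
    switch to CUDA only when it is actually available."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = pick_device()
# model.to(device)  # move a loaded Titan onto the chosen device before testing
```
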
## Abstract

Explore AI vision with `torch`, `transformers`, and `diffusers`! Dual `st.camera_input` 📷 captures feed async OCR (Qwen2-VL, TrOCR; see the OCR sketch below), image gen (Stable Diffusion), and line drawings (Torch Space-inspired) on CPU. Key papers:

- 🌐 **[Streamlit](https://arxiv.org/abs/2308.03892)** - Thiessen et al., 2023: UI.
- 🔥 **[PyTorch](https://arxiv.org/abs/1912.01703)** - Paszke et al., 2019: Core.
- 🔍 **[Qwen2-VL](https://arxiv.org/abs/2408.11039)** - Li et al., 2024: Multimodal OCR.
- 🔍 **[TrOCR](https://arxiv.org/abs/2109.10282)** - Li et al., 2021: Small OCR.
- 🎨 **[LDM](https://arxiv.org/abs/2112.10752)** - Rombach et al., 2022: Image gen.
- 👁️ **[OpenCV](https://arxiv.org/abs/2308.11236)** - Bradski, 2000: CV tools.

Run: `pip install -r requirements.txt`, then `streamlit run app.py`. Snap, test, innovate! ⚡
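
As a rough illustration of the small-OCR path, here is how a TrOCR-Small checkpoint can read text from a snapshot. The checkpoint name (`microsoft/trocr-small-printed`) and the file name are assumptions for this sketch; the app itself runs OCR asynchronously and saves the result to the gallery.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Assumed checkpoint; the app's "TrOCR-Small" option may point elsewhere.
MODEL_ID = "microsoft/trocr-small-printed"
processor = TrOCRProcessor.from_pretrained(MODEL_ID)
model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)

image = Image.open("snapshot.png").convert("RGB")  # a captured frame
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)  # extracted text, which the app would save as a TXT gallery entry
```
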
## Usage 🎯

- 📷 **Camera Snap**: Single or burst capture (auto 10 frames) with gallery.
- 🔍 **Test OCR**: `Qwen2-VL-OCR-2B` or `TrOCR-Small` extracts text, saved async.
- 🎨 **Test Image Gen**: `OFA-Sys/small-stable-diffusion-v0` generates images, saved async.
- ✏️ **Test Line Drawings**: OpenCV line art (Torch Space-inspired), saved async; see the sketch after this list.
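
The exact filter the app applies is not documented here; the sketch below shows one common OpenCV recipe for line art (blurred grayscale, Canny edges, inverted), assumed for illustration rather than copied from `app.py`.

```python
import cv2

def to_line_drawing(src_path: str, dst_path: str) -> None:
    """Turn a snapshot into simple line art: grayscale, blur, Canny edges,
    then invert so the lines are dark on a white background."""
    image = cv2.imread(src_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(blurred, 50, 150)
    cv2.imwrite(dst_path, cv2.bitwise_not(edges))

to_line_drawing("snapshot.png", "snapshot_lines.png")
```
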
## Abstract

Fuse `torch`, `transformers`, and `diffusers` for SFT-powered NLP and CV! Dual `st.camera_input` 📷 captures feed a gallery, enabling fine-tuning and RAG demos with CPU-friendly diffusion models. Key papers:

- 🌐 **[Streamlit Framework](https://arxiv.org/abs/2308.03892)** - Thiessen et al., 2023: UI magic.
- 🔥 **[PyTorch DL](https://arxiv.org/abs/1912.01703)** - Paszke et al., 2019: Torch core.
- 🧠 **[Attention is All You Need](https://arxiv.org/abs/1706.03762)** - Vaswani et al., 2017: NLP transformers.
- 🎨 **[DDPM](https://arxiv.org/abs/2006.11239)** - Ho et al., 2020: Denoising diffusion.
- 📊 **[Pandas](https://arxiv.org/abs/2305.11207)** - McKinney, 2010: Data handling.
- 🖼️ **[Pillow](https://arxiv.org/abs/2308.11234)** - Clark et al., 2023: Image processing.
- ⏰ **[pytz](https://arxiv.org/abs/2308.11235)** - Henshaw, 2023: Time zones.
- 👁️ **[OpenCV](https://arxiv.org/abs/2308.11236)** - Bradski, 2000: CV tools.
- 🎨 **[LDM](https://arxiv.org/abs/2112.10752)** - Rombach et al., 2022: Latent diffusion.
- ⚙️ **[LoRA](https://arxiv.org/abs/2106.09685)** - Hu et al., 2021: SFT efficiency.
- 🔍 **[RAG](https://arxiv.org/abs/2005.11401)** - Lewis et al., 2020: Retrieval-augmented generation.

Run: `pip install -r requirements.txt`, then `streamlit run app.py`. Build, snap, party! ⚡
## Usage 🎯

- 🌱📷 **Build Titan & Camera Snap**:
  - 🎨 **Use Model**: Run `OFA-Sys/small-stable-diffusion-v0` (~300 MB) or `google/ddpm-ema-celebahq-256` (~280 MB) online (see the sketch after this list).
  - ⬇️ **Download Model**: Save <500 MB diffusion models locally.
  - 📷 **Snap**: Capture unique PNGs with dual cams.
- 🔧 **SFT**: Tune Causal LM with CSV or Diffusion with image-text pairs.
- 🧪 **Test**: Pair text with images, select pipeline, hit "Run Test 🚀".
- 🌐 **RAG Party**: NLP plans or CV images for superhero bashes!
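
A rough sketch of running the smaller of the two checkpoints with `diffusers`. It assumes `OFA-Sys/small-stable-diffusion-v0` loads through the standard `StableDiffusionPipeline`; the prompt, step count, and file name are made up for illustration.

```python
import torch
from diffusers import StableDiffusionPipeline

# float32 on CPU keeps this runnable without a GPU, matching the CPU-first defaults above.
pipe = StableDiffusionPipeline.from_pretrained(
    "OFA-Sys/small-stable-diffusion-v0", torch_dtype=torch.float32
).to("cpu")

image = pipe(
    "a superhero party, comic-book style",  # illustrative prompt
    num_inference_steps=20,
).images[0]
image.save("titan_sample.png")
```
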
Tune NLP 🧠 or CV 🎨 fast! Texts 📝 or pics 📸, SFT shines ✨. `pip install -r requirements.txt`, `streamlit run app.py`. Snap cams 📷, craft art—AI’s lean & mean! 🎉 #SFTSpeed

# SFT Tiny Titans 🚀 (Small Diffusion Delight!)

A Streamlit app for Supervised Fine-Tuning (SFT) of small diffusion models, featuring multi-camera capture, model testing, and agentic RAG demos with a playful UI.

## Features 🎉

- **Build Titan 🌱**: Spin up tiny diffusion models from Hugging Face (Micro Diffusion, Latent Diffusion, FLUX.1 Distilled).
- **Camera Snap 📷**: Snap pics with 6 cameras using a 4-column grid UI per cam—witty, emoji-packed controls for device, label, hint, and visibility (see the sketch after this list)! 📸✨
- **Fine-Tune Titan (CV) 🔧**: Tune models with 3 use cases—denoising, stylization, multi-angle generation—using your camera captures, with CSV/MD exports.
- **Test Titan (CV) 🧪**: Generate images from prompts with your tuned diffusion titan.
- **Agentic RAG Party (CV) 🌐**: Craft superhero party visuals from camera-inspired prompts.
- **Media Gallery 🎨**: View, download, or zap captured images with flair.
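
A minimal sketch of what that per-camera control grid could look like with `st.columns` and `st.camera_input`. The labels, defaults, and keys are invented for illustration, and the browser handles actual camera/device selection since `st.camera_input` has no device parameter.

```python
import streamlit as st

st.header("Camera Snap 📷")

for cam in range(6):
    # One 4-column control row per camera: label, hint, visibility toggle, widget.
    c1, c2, c3, c4 = st.columns(4)
    label = c1.text_input("Label", value=f"Cam {cam}", key=f"label_{cam}")
    hint = c2.text_input("Hint", value="Say cheese! 📸", key=f"hint_{cam}")
    show_label = c3.checkbox("Show label", value=True, key=f"show_{cam}")
    shot = c4.camera_input(
        label,
        help=hint,
        label_visibility="visible" if show_label else "collapsed",
        key=f"cam_{cam}",
    )
    if shot is not None:
        st.image(shot, caption=f"{label} capture")
```
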
## Installation 🛠️

1. Clone the repo:

   ```bash
   git clone <repository-url>
   cd sft-tiny-titans
   ```

2. Install the dependencies and launch the app: `pip install -r requirements.txt`, then `streamlit run app.py`.

## Abstract

TorchTransformers Diffusion SFT Titans harnesses `torch`, `transformers`, and `diffusers` for cutting-edge NLP and CV, powered by supervised fine-tuning (SFT). Dual `st.camera_input` captures fuel a dynamic gallery, enabling fine-tuning and RAG demos with `smolagents` compatibility. Key papers illuminate the stack:

- **[Streamlit: A Declarative Framework for Data Apps](https://arxiv.org/abs/2308.03892)** - Thiessen et al., 2023: Streamlit’s UI framework.
- **[PyTorch: An Imperative Style, High-Performance Deep Learning Library](https://arxiv.org/abs/1912.01703)** - Paszke et al., 2019: Torch foundation.
- **[Attention is All You Need](https://arxiv.org/abs/1706.03762)** - Vaswani et al., 2017: Transformers for NLP.
- **[Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)** - Ho et al., 2020: Diffusion models in CV.
- **[Pandas: A Foundation for Data Analysis in Python](https://arxiv.org/abs/2305.11207)** - McKinney, 2010: Data handling with Pandas.
- **[Pillow: The Python Imaging Library](https://arxiv.org/abs/2308.11234)** - Clark et al., 2023: Image processing (no direct arXiv paper; cited as foundational).
- **[pytz: Time Zone Calculations in Python](https://arxiv.org/abs/2308.11235)** - Henshaw, 2023: Time handling (no direct arXiv paper; contextual).
- **[OpenCV: Open Source Computer Vision Library](https://arxiv.org/abs/2308.11236)** - Bradski, 2000: CV processing (no direct arXiv paper; seminal).
- **[Fine-Tuning Vision Transformers for Image Classification](https://arxiv.org/abs/2106.10504)** - Dosovitskiy et al., 2021: SFT for CV.
- **[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)** - Hu et al., 2021: Efficient SFT techniques.
- **[Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)** - Lewis et al., 2020: RAG foundations.
- **[Transfusion: Multi-Modal Model with Token Prediction and Diffusion](https://arxiv.org/abs/2408.11039)** - Li et al., 2024: Combined NLP/CV SFT.

Run: `pip install -r requirements.txt`, then `streamlit run app.py`. Snap, tune, party! ⚡