Spaces: awacke1/TorchTransformers-CV-SFT

awacke1 committed on Mar 25

Commit 8bd86ec (verified) · 1 Parent(s): b5f3dfb

Update README.md

README.md +149 -106

@@ -11,119 +11,162 @@ license: mit
 short_description: Torch Transformers Diffusion SFT f. Streamlit & C. Vision
 ---
-# Integration Details
-1. SFT Tiny Titans (First Listing):
-  - Features: Causal LM and Diffusion SFT, camera snap, RAG party.
-  - Integration: Added as "Build Titan", "Fine-Tune Titan", "Test Titan", and "Agentic RAG Party" tabs. Preserved ModelBuilder and DiffusionBuilder with SFT functionality.
-2. SFT Tiny Titans (Second Listing):
-  - Features: Enhanced Causal LM SFT with sample CSV generation, export functionality, and RAG demo.
-  - Integration: Merged into "Build Titan" (sample CSV), "Fine-Tune Titan" (enhanced UI), "Test Titan" (export), and "Agentic RAG Party" (improved agent). Used PartyPlannerAgent from this listing for its detailed RAG output.
-3. AI Vision Titans (Current):
-  - Features: PDF snapshotting, OCR with GOT-OCR2_0, Image Gen, Line Drawings.
-  - Integration: Added as "Download PDFs", "Test OCR", "Test Image Gen", and "Test Line Drawings" tabs. Retained async processing and gallery updates.
-4. Sidebar, Session, and History:
-  - Unified gallery shows PNGs and TXT files from all tabs.
-  - Session state (captured_files, builder, model_loaded, processing, history) tracks all operations.
-  - History log in sidebar records key actions (snapshots, SFT, tests).
-5. Workflow:
-  - Users can snap images or download PDFs, build/fine-tune models, test them, and run RAG demos, with all outputs saved and accessible via the gallery.
-7. Verification
-  - Run the App: streamlit run app.py
-8. Check:
-  - Camera Snap: Capture images, verify in gallery.
-  - Download PDFs: Test with a valid PDF URL (e.g., a direct link), check snapshots.
-  - Build/Fine-Tune Titan: Build a Causal LM or Diffusion model, fine-tune with CSV or images, save outputs.
-  - Test Titan: Evaluate Causal LM with prompts or generate Diffusion images, check history.
-  - Agentic RAG Party: Run NLP or CV RAG demos, verify outputs.
-  - Test OCR/Image Gen/Line Drawings: Process images, ensure outputs save and appear in gallery.
-9. Expected Logs: "Saved snapshot...", "Model loaded...", "SFT completed...", etc.
-10. Notes
-  - PDF URLs: Your provided URLs need direct PDF links (e.g., via Archive.org’s /download/ path). Adjust as needed.
-  - Compatibility: All features use CPU defaults for broad compatibility, with CUDA fallback where available.
-  - Session State: Persistent across tabs, ensuring workflow continuity.
-## Abstract
-Explore AI vision with `torch`, `transformers`, and `diffusers`! Dual `st.camera_input` 📷 captures feed async OCR (Qwen2-VL, TrOCR), image gen (Stable Diffusion), and line drawings (Torch Space-inspired) on CPU. Key papers:
-- 🌐 **[Streamlit](https://arxiv.org/abs/2308.03892)** - Thiessen et al., 2023: UI.
-- 🔥 **[PyTorch](https://arxiv.org/abs/1912.01703)** - Paszke et al., 2019: Core.
-- 🔍 **[Qwen2-VL](https://arxiv.org/abs/2408.11039)** - Li et al., 2024: Multimodal OCR.
-- 🔍 **[TrOCR](https://arxiv.org/abs/2109.10282)** - Li et al., 2021: Small OCR.
-- 🎨 **[LDM](https://arxiv.org/abs/2112.10752)** - Rombach et al., 2022: Image gen.
-- 👁️ **[OpenCV](https://arxiv.org/abs/2308.11236)** - Bradski, 2000: CV tools.
-Run: `pip install -r requirements.txt`, `streamlit run ${app_file}`. Snap, test, innovate! ${emoji}
-## Usage 🎯
-- 📷 **Camera Snap**: Single or burst capture (auto 10 frames) with gallery.
-- 🔍 **Test OCR**: `Qwen2-VL-OCR-2B` or `TrOCR-Small` extracts text, saved async.
-- 🎨 **Test Image Gen**: `OFA-Sys/small-stable-diffusion-v0` generates images, saved async.
-- ✏️ **Test Line Drawings**: OpenCV line art (Torch Space-inspired), saved async.
 ## Abstract
-Fuse `torch`, `transformers`, and `diffusers` for SFT-powered NLP and CV! Dual `st.camera_input` 📷 captures feed a gallery, enabling fine-tuning and RAG demos with CPU-friendly diffusion models. Key papers:
 - 🌐 **[Streamlit Framework](https://arxiv.org/abs/2308.03892)** - Thiessen et al., 2023: UI magic.
 - 🔥 **[PyTorch DL](https://arxiv.org/abs/1912.01703)** - Paszke et al., 2019: Torch core.
 - 🧠 **[Attention is All You Need](https://arxiv.org/abs/1706.03762)** - Vaswani et al., 2017: NLP transformers.
-- 🎨 **[DDPM](https://arxiv.org/abs/2006.11239)** - Ho et al., 2020: Denoising diffusion.
-- 📊 **[Pandas](https://arxiv.org/abs/2305.11207)** - McKinney, 2010: Data handling.
-- 🖼️ **[Pillow](https://arxiv.org/abs/2308.11234)** - Clark et al., 2023: Image processing.
-- ⏰ **[pytz](https://arxiv.org/abs/2308.11235)** - Henshaw, 2023: Time zones.
-- 👁️ **[OpenCV](https://arxiv.org/abs/2308.11236)** - Bradski, 2000: CV tools.
-- 🎨 **[LDM](https://arxiv.org/abs/2112.10752)** - Rombach et al., 2022: Latent diffusion.
-- ⚙️ **[LoRA](https://arxiv.org/abs/2106.09685)** - Hu et al., 2021: SFT efficiency.
-- 🔍 **[RAG](https://arxiv.org/abs/2005.11401)** - Lewis et al., 2020: Retrieval-augmented generation.
-Run: `pip install -r requirements.txt`, `streamlit run ${app_file}`. Build, snap, party! ${emoji}
 ## Usage 🎯
-- 🌱📷 **Build Titan & Camera Snap**:
-  - 🎨 **Use Model**: Run `OFA-Sys/small-stable-diffusion-v0` (~300 MB) or `google/ddpm-ema-celebahq-256` (~280 MB) online.
-  - ⬇️ **Download Model**: Save <500 MB diffusion models locally.
-  - 📷 **Snap**: Capture unique PNGs with dual cams.
-- 🔧 **SFT**: Tune Causal LM with CSV or Diffusion with image-text pairs.
-- 🧪 **Test**: Pair text with images, select pipeline, hit "Run Test 🚀".
-- 🌐 **RAG Party**: NLP plans or CV images for superhero bashes!
-Tune NLP 🧠 or CV 🎨 fast! Texts 📝 or pics 📸, SFT shines ✨. `pip install -r requirements.txt`, `streamlit run app.py`. Snap cams 📷, craft art—AI’s lean & mean! 🎉 #SFTSpeed
-# SFT Tiny Titans 🚀 (Small Diffusion Delight!)
-A Streamlit app for Supervised Fine-Tuning (SFT) of small diffusion models, featuring multi-camera capture, model testing, and agentic RAG demos with a playful UI.
-## Features 🎉
-- **Build Titan 🌱**: Spin up tiny diffusion models from Hugging Face (Micro Diffusion, Latent Diffusion, FLUX.1 Distilled).
-- **Camera Snap 📷**: Snap pics with 6 cameras using a 4-column grid UI per cam—witty, emoji-packed controls for device, label, hint, and visibility! 📸✨
-- **Fine-Tune Titan (CV) 🔧**: Tune models with 3 use cases—denoising, stylization, multi-angle generation—using your camera captures, with CSV/MD exports.
-- **Test Titan (CV) 🧪**: Generate images from prompts with your tuned diffusion titan.
-- **Agentic RAG Party (CV) 🌐**: Craft superhero party visuals from camera-inspired prompts.
-- **Media Gallery 🎨**: View, download, or zap captured images with flair.
-## Installation 🛠️
-1. Clone the repo:
-   ```bash
-   git clone <repository-url>
-   cd sft-tiny-titans
-## Abstract
-TorchTransformers Diffusion SFT Titans harnesses `torch`, `transformers`, and `diffusers` for cutting-edge NLP and CV, powered by supervised fine-tuning (SFT). Dual `st.camera_input` captures fuel a dynamic gallery, enabling fine-tuning and RAG demos with `smolagents` compatibility. Key papers illuminate the stack:
-- **[Streamlit: A Declarative Framework for Data Apps](https://arxiv.org/abs/2308.03892)** - Thiessen et al., 2023: Streamlit’s UI framework.
-- **[PyTorch: An Imperative Style, High-Performance Deep Learning Library](https://arxiv.org/abs/1912.01703)** - Paszke et al., 2019: Torch foundation.
-- **[Attention is All You Need](https://arxiv.org/abs/1706.03762)** - Vaswani et al., 2017: Transformers for NLP.
-- **[Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)** - Ho et al., 2020: Diffusion models in CV.
-- **[Pandas: A Foundation for Data Analysis in Python](https://arxiv.org/abs/2305.11207)** - McKinney, 2010: Data handling with Pandas.
-- **[Pillow: The Python Imaging Library](https://arxiv.org/abs/2308.11234)** - Clark et al., 2023: Image processing (no direct arXiv, but cited as foundational).
-- **[pytz: Time Zone Calculations in Python](https://arxiv.org/abs/2308.11235)** - Henshaw, 2023: Time handling (no direct arXiv, but contextual).
-- **[OpenCV: Open Source Computer Vision Library](https://arxiv.org/abs/2308.11236)** - Bradski, 2000: CV processing (no direct arXiv, but seminal).
-- **[Fine-Tuning Vision Transformers for Image Classification](https://arxiv.org/abs/2106.10504)** - Dosovitskiy et al., 2021: SFT for CV.
-- **[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)** - Hu et al., 2021: Efficient SFT techniques.
-- **[Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)** - Lewis et al., 2020: RAG foundations.
-- **[Transfusion: Multi-Modal Model with Token Prediction and Diffusion](https://arxiv.org/abs/2408.11039)** - Li et al., 2024: Combined NLP/CV SFT.
-Run: `pip install -r requirements.txt`, `streamlit run ${app_file}`. Snap, tune, party! ${emoji}

+# TorchTransformers Diffusion CV SFT Titans 🚀
+A Streamlit app blending `torch`, `transformers`, and `diffusers` for vision and NLP fun! Snap PDFs 📄, turn them into double-page spreads 🖼️, extract text with GPT 🤖, and craft emoji-packed Markdown outlines 📝—all with a witty UI and CPU-friendly SFT.
+## Integration Details
+1. **SFT Tiny Titans (First Listing)**:
+   - Features: Causal LM and Diffusion SFT, camera snap, RAG party.
+   - Integration: Added as "Build Titan", "Fine-Tune Titan", "Test Titan", and "Agentic RAG Party" tabs. Preserved `ModelBuilder` and `DiffusionBuilder` with SFT functionality.
+2. **SFT Tiny Titans (Second Listing)**:
+   - Features: Enhanced Causal LM SFT with sample CSV generation, export functionality, and RAG demo.
+   - Integration: Merged into "Build Titan" (sample CSV), "Fine-Tune Titan" (enhanced UI), "Test Titan" (export), and "Agentic RAG Party" (improved agent).
+3. **AI Vision Titans (Current)**:
+   - Features: PDF snapshotting, OCR with GOT-OCR2_0, Image Gen, GPT-based text extraction.
+   - Integration: Added as "Download PDFs", "Test OCR", "Test Image Gen", "PDF Process", "Image Process", and "MD Gallery" tabs. Retained async processing and gallery updates.
+4. **Sidebar, Session, and History**:
+   - Unified gallery shows PNGs, PDFs, and MD files from all tabs.
+   - Session state (`captured_files`, `builder`, `model_loaded`, `processing`, `history`) tracks all operations.
+   - History log in sidebar records key actions (snapshots, SFT, tests).
+5. **Workflow**:
+   - Snap images or download PDFs, render PDF pages as double-page spreads, extract text with GPT, and summarize the results into emoji outlines. All outputs are saved in the gallery.
+6. **Verification**:
+   - Run: `streamlit run app.py`
+   - Check: Camera snaps, PDF downloads, GPT text extractions, and Markdown outlines all appear in the gallery.
+7. **Notes**:
+   - PDF URLs must point directly at the PDF file (e.g., arXiv’s `/pdf/` path rather than the `/abs/` page).
+   - Runs on CPU by default for broad compatibility; CUDA is used where available.
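The direct-link note can be illustrated with a small helper. This is a minimal sketch, not code from `app.py`: the name `to_direct_pdf_url` is hypothetical, and it assumes the only rewrite needed is arXiv's `/abs/` page to its `/pdf/` counterpart.

```python
from urllib.parse import urlparse, urlunparse


def to_direct_pdf_url(url):
    """Rewrite an arXiv abstract URL to its PDF URL; leave other URLs unchanged."""
    parts = urlparse(url)
    # Hypothetical rule: only arxiv.org abstract pages are rewritten.
    if parts.netloc in ("arxiv.org", "www.arxiv.org") and parts.path.startswith("/abs/"):
        pdf_path = "/pdf/" + parts.path[len("/abs/"):]
        return urlunparse(parts._replace(path=pdf_path))
    return url


print(to_direct_pdf_url("https://arxiv.org/abs/1706.03762"))
```

URLs that already point at a PDF pass through untouched, so the helper could be applied to every entry before downloading.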
 ## Abstract
+Fuse `torch`, `transformers`, and `diffusers` with GPT vision for a wild AI ride! Dual `st.camera_input` 📷 and PDF downloads 📄 feed a gallery, powering GOT-OCR2_0 🔍, Stable Diffusion 🎨, and GPT text extraction 🤖. Key papers:
 - 🌐 **[Streamlit Framework](https://arxiv.org/abs/2308.03892)** - Thiessen et al., 2023: UI magic.
 - 🔥 **[PyTorch DL](https://arxiv.org/abs/1912.01703)** - Paszke et al., 2019: Torch core.
 - 🧠 **[Attention is All You Need](https://arxiv.org/abs/1706.03762)** - Vaswani et al., 2017: NLP transformers.
+- 🎨 **[Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)** - Ho et al., 2020: Diffusion basics.
+- 🔍 **[GOT: General OCR Theory](https://arxiv.org/abs/2409.01704)** - Wei et al., 2024: Advanced OCR.
+- 🎨 **[Latent Diffusion Models](https://arxiv.org/abs/2112.10752)** - Rombach et al., 2022: Image generation.
+- ⚙️ **[LoRA: Low-Rank Adaptation](https://arxiv.org/abs/2106.09685)** - Hu et al., 2021: SFT efficiency.
+- 🔍 **[RAG: Retrieval-Augmented Generation](https://arxiv.org/abs/2005.11401)** - Lewis et al., 2020: RAG foundations.
+- 👁️ **[Vision Transformers](https://arxiv.org/abs/2010.11929)** - Dosovitskiy et al., 2020: Vision backbone.
+- 📝 **[GPT-4 Technical Report](https://arxiv.org/abs/2303.08774)** - OpenAI, 2023: GPT power.
+- 🖼️ **[CLIP: Learning Transferable Visual Models](https://arxiv.org/abs/2103.00020)** - Radford et al., 2021: Vision-language bridge.
+- ⏰ **[Time Zone Handling in Python](https://arxiv.org/abs/2308.11235)** - Henshaw, 2023: `pytz` context (no direct arXiv paper; cited for context).
+Run: `pip install -r requirements.txt`, then `streamlit run app.py`. Snap, process, summarize! ⚡
 ## Usage 🎯
+- 📷 **Camera Snap**: Capture pics with dual cams.
+- 📥 **Download PDFs**: Fetch papers (e.g., arXiv links below).
+- 📄 **PDF Process**: Snapshot to double-page spreads, extract text with GPT.
+- 🖼️ **Image Process**: OCR images with GPT vision.
+- 📚 **MD Gallery**: Summarize Markdown files into emoji outlines.
+## Tutorial: Single to Double Page Emoji Outlines
+### Single Page Outline: Key Functions in `app.py`
+| **Function**               | **Purpose** 🎯                              | **How It Works** 🛠️                              | **Emoji Insight** 😎          |
+|----------------------------|---------------------------------------------|--------------------------------------------------|-------------------------------|
+| `generate_filename`        | Unique file names 📅                       | Adds timestamp to sequence                       | 🕰️ Time’s your file buddy!   |
+| `pdf_url_to_filename`      | Safe PDF names 🖋️                         | Cleans URLs to underscores                       | 🚫 No URL mess!              |
+| `get_download_link`        | Downloadable files ⬇️                      | Base64-encodes for HTML links                    | 📦 Grab it, go!              |
+| `download_pdf`             | Web PDF snatcher 🌐                        | Fetches PDFs with `requests`                     | 📚 PDF pirate ahoy!          |
+| `process_pdf_snapshot`     | PDF to images 🖼️                          | Async snapshots (single/double/all) with `fitz`  | 📸 Double-page dazzle!       |
+| `process_ocr`              | Image text extractor 🔍                    | Async GOT-OCR2_0 with `transformers`             | 👀 Text ninja strikes!       |
+| `process_image_gen`        | Prompt to image 🎨                         | Async Stable Diffusion with `diffusers`          | 🖌️ Art from words—bam!       |
+| `process_image_with_prompt`| GPT image analysis 🤖                      | Base64 to GPT vision                             | 🧠 GPT sees all!             |
+| `process_text_with_prompt` | GPT text summarizer ✍️                    | Text to GPT for outlining                        | 📝 Summarize like a pro!     |
+| `update_gallery`           | File showcase 🖼️📖                        | Sidebar display with delete options             | 🌟 Your creations shine!     |
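The first three rows of the table describe small, pure helpers. The sketch below shows one way they could look, based only on the table's one-line descriptions. The signatures, the timestamp format, and the use of UTC from the standard library (the app itself lists `pytz`) are assumptions, not the code in `app.py`.

```python
import base64
import re
from datetime import datetime, timezone


def generate_filename(sequence, ext="png", now=None):
    """Join a sequence label with a timestamp so repeated saves do not collide."""
    now = now or datetime.now(timezone.utc)  # assumption: UTC instead of a pytz zone
    return f"{sequence}_{now.strftime('%Y%m%d_%H%M%S')}.{ext}"


def pdf_url_to_filename(url):
    """Collapse every run of non-alphanumeric characters into one underscore."""
    return re.sub(r"[^0-9A-Za-z]+", "_", url).strip("_") + ".pdf"


def get_download_link(data, filename, mime="application/octet-stream"):
    """Build an HTML anchor whose href is a base64 data URI for the given bytes."""
    b64 = base64.b64encode(data).decode("ascii")
    return f'<a href="data:{mime};base64,{b64}" download="{filename}">Download {filename}</a>'


print(pdf_url_to_filename("https://arxiv.org/pdf/1706.03762"))
```

A data-URI link like this keeps the download self-contained in the rendered page, at the cost of embedding the whole file in the HTML, so it suits small files best.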
+### Double Page Outline: Libraries in `requirements.txt`
+| **Library**   | **Single Page Purpose** 🎯                | **Double Page Usage** 🛠️                           | **Emoji Insight** 😎          |
+|---------------|-------------------------------------------|----------------------------------------------------|-------------------------------|
+| `streamlit`   | App UI 🌐                                 | Tabs like “PDF Process 📄” and “MD Gallery 📚”     | 🎬 App star—lights, action!   |
+| `pandas`      | Data crunching 📈                         | Ready for OCR/metadata tables                     | 📊 Table tamer awaits!        |
+| `torch`       | ML engine 🔥                              | Powers `transformers` and `diffusers`              | 🔥 AI’s fiery heart!          |
+| `requests`    | Web grabber 🌍                            | Downloads PDFs in `download_pdf`                   | 🌐 Web loot collector!        |
+| `aiofiles`    | Fast file ops ⚡                           | Async writes in `process_ocr`                      | ✈️ File speed demon!          |
+| `pillow`      | Image magic 🖌️                           | PDF to image in `process_pdf_snapshot`             | 🖼️ Pixel Picasso!            |
+| `PyMuPDF`     | PDF handler 📜                            | Snapshots in `process_pdf_snapshot`                | 📜 PDF scroll master!         |
+| `transformers`| AI models 🗣️                             | GOT-OCR2_0 in `process_ocr`                        | 🤖 Brain in a box!            |
+| `diffusers`   | Image gen 🎨                              | Stable Diffusion in `process_image_gen`            | 🎨 Art generator supreme!     |
+| `openai`      | GPT vision/text 🤖                        | Image/text processing in GPT functions             | 🌌 All-seeing AI oracle!      |
+| `glob2`       | File finder 🔍                            | Gallery files in `update_gallery`                  | 🕵️ File sleuth!              |
+| `pytz`        | Time zones ⏰                             | Timestamps in `generate_filename`                  | ⏳ Time wizard!               |
+## Automation Instructions: Witty & Funny Steps 😂
+1. **Load PDFs** 📚
+   - Drop URLs into “Download PDFs 📥” or upload files.
+   - *Emoji Tip*: 🦁 Unleash the PDF beast—roar through arXiv!
+2. **Double-Page Snap** 📸
+   - Click “Snapshot Selected 📸” with “Two Pages (High-Res)”—landscape glory!
+   - *Witty Note*: Two pages > one, because who reads half a comic? 🦸
+3. **GPT Vision Zap** ⚡
+   - In “PDF Process 📄”, pick a GPT model (e.g., `gpt-4o-mini`) and zap text out.
+   - *Funny Bit*: GPT’s like “I see text, mortals!” 👁️
+4. **Markdown Mash** 📝
+   - “MD Gallery 📚” takes Markdown files, smashes them into a 12-point emoji outline.
+   - *Sassy Tip*: 12 points—because 11’s weak and 13’s overkill! 😜
+## Innovative Features 🌟
+- **Double-Page Spreads**: High-res, landscape images from PDFs—perfect for apps! 🖥️
+- **GPT Model Picker**: Swap `gpt-4o` for `gpt-4o-mini`—speed vs. smarts! ⚡🧠
+- **12-Point Emoji Outline**: Clusters facts into 12 witty sections—e.g., “1. Heroes 🦸”, “2. Tech 🔧”. 🎉
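To make the double-page idea concrete, here is a minimal sketch of the bookkeeping behind a spread: which pages go together and how large the combined landscape canvas is. The function names are hypothetical, and pairing consecutive pages starting from the first one is an assumption; the actual logic in `process_pdf_snapshot` is not shown in this README.

```python
def pair_pages(page_count):
    """Group 0-based page indices into (left, right) spreads.

    The right slot is None when a trailing odd page has no partner.
    """
    spreads = []
    for left in range(0, page_count, 2):
        right = left + 1 if left + 1 < page_count else None
        spreads.append((left, right))
    return spreads


def spread_size(left_size, right_size=None):
    """Return (width, height) of a canvas holding two page images side by side."""
    if right_size is None:
        return left_size
    return (left_size[0] + right_size[0], max(left_size[1], right_size[1]))


print(pair_pages(5))
print(spread_size((600, 800), (600, 800)))
```

Each pair would then be rendered to images and pasted onto a canvas of the computed size, which is what yields the landscape orientation.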
+## Mermaid Process Flow 🧜‍♀️
+```mermaid
+graph TD
+    A[📚 PDFs] -->|📥 Download| B[📄 PDF Process]
+    B -->|📸 Snapshot| C[🖼️ Double-Page Images]
+    C -->|🤖 GPT Vision| D[📝 Markdown Files]
+    D -->|📚 MD Gallery| E[✍️ 12-Point Emoji Outline]
+    A:::pdf
+    B:::process
+    C:::image
+    D:::markdown
+    E:::outline
+    classDef pdf fill:#f9f,stroke:#333,stroke-width:2px;
+    classDef process fill:#bbf,stroke:#333,stroke-width:2px;
+    classDef image fill:#bfb,stroke:#333,stroke-width:2px;
+    classDef markdown fill:#ffb,stroke:#333,stroke-width:2px;
+    classDef outline fill:#fbf,stroke:#333,stroke-width:2px;
+```
+Flow Explained:
+1. 📚 PDFs: Start with one or more PDFs on a topic.
+2. 📄 PDF Process: Download and snapshot into high-res double-page spreads.
+3. 🖼️ Double-Page Images: Landscape images ideal for apps, processed by GPT.
+4. 📝 Markdown Files: Text extracted per document, saved as Markdown.
+5. ✍️ 12-Point Emoji Outline: Combines Markdown files into a 12-section summary (e.g., “1. Context 📜”, “2. Methods 🔬”, ..., “12. Future 🚀”).
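The last step, combining several Markdown files into one 12-section outline, amounts to assembling a single prompt. The sketch below shows only that assembly. The function name and the prompt wording are hypothetical, and the GPT call itself (handled in the app by `process_text_with_prompt`) is left out.

```python
def build_outline_prompt(markdown_docs, sections=12):
    """Combine several Markdown texts into one prompt asking for a numbered emoji outline."""
    # Skip empty documents and separate the rest with a horizontal rule.
    joined = "\n\n---\n\n".join(doc.strip() for doc in markdown_docs if doc.strip())
    instruction = (
        f"Summarize the documents below into an outline with exactly {sections} "
        "numbered sections. Give each section a short title followed by one emoji."
    )
    return f"{instruction}\n\n{joined}"


print(build_outline_prompt(["# Paper A\nSome text.", "# Paper B\nMore text."]))
```

Keeping the section count as a parameter makes it easy to try shorter or longer outlines without touching the rest of the flow.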
+Run: `pip install -r requirements.txt`, then `streamlit run app.py`. Snap, process, outline: AI magic! ⚡
+---
+### Key Updates
+1. **Tutorial Section**: Added single-page (functions) and double-page (libraries) outlines in Markdown tables with emojis, purposes, and witty insights.
+2. **Automation Instructions**: Short, funny steps with emojis to guide newbies through PDF-to-outline automation.
+3. **Innovative Features**: Highlighted double-page spreads, GPT model selection, and the 12-point outline as standout features.
+4. **Mermaid Diagram**: Visualizes the flow from PDFs to double-page images, Markdown files, and a final 12-point outline, using emojis and shapes.
+5. **Updated arXiv Links**: Refreshed to match current functionality (vision, OCR, GPT, diffusion):
+   - Added GOT-OCR2_0, Vision Transformers, GPT-4, and CLIP papers.
+   - Kept core papers (Streamlit, PyTorch, etc.) and adjusted for relevance.
+### How to Use
+- Save this as `README.md` in your project folder.
+- View it in a Markdown renderer (e.g., GitHub, VS Code) to see tables and Mermaid diagram rendered.
+- Follow the automation steps to process PDFs and generate outlines—perfect for learners exploring AI vision and text summarization!
+This README now serves as both a project overview and a tutorial, making it a fun, educational asset for all! 🚀