---
language:
- en
tags:
- vision-language
- phi
- llava
- clip
- qlora
- multimodal
license: mit
datasets:
- laion/instructional-image-caption-data
base_model: microsoft/phi-1_5
library_name: transformers
pipeline_tag: image-to-text
---

# LLaVA-Phi Model

This is a LLaVA-style vision-language model that combines Microsoft's Phi-1.5 language model with a CLIP vision encoder for image understanding.

## Model Description

- **Base Model**: Microsoft Phi-1.5
- **Vision Encoder**: CLIP ViT-B/32, connected to the language model as sketched below
- **Training**: QLoRA fine-tuning
- **Dataset**: LLaVA Instruct-150K
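
The exact vision-language wiring lives in this repo's modeling code, but a LLaVA-style setup typically feeds CLIP patch features through a small trainable projector into the language model's embedding space. Below is a minimal sketch of that idea; the `VisionProjector` module, its two-layer MLP, and the hidden sizes (768 for CLIP ViT-B/32, 2048 for Phi-1.5) are illustrative assumptions, not this repo's actual implementation.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class VisionProjector(nn.Module):
    """Illustrative LLaVA-style connector (not this repo's actual code).

    Maps CLIP patch features into the language model's embedding space.
    Hidden sizes are assumptions: CLIP ViT-B/32 uses 768, Phi-1.5 uses 2048.
    """

    def __init__(self, clip_dim: int = 768, lm_dim: int = 2048):
        super().__init__()
        # The vision tower is typically kept frozen during fine-tuning
        self.vision_tower = CLIPVisionModel.from_pretrained(
            "openai/clip-vit-base-patch32"
        )
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    @torch.no_grad()
    def encode(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # One feature vector per image patch, plus a CLS token
        return self.vision_tower(pixel_values).last_hidden_state

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Projected features are spliced into the token embedding
        # sequence at the position of the <image> placeholder
        return self.proj(self.encode(pixel_values))
```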

## Usage

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor

# Load the model, tokenizer, and the CLIP image processor.
# If loading fails, the repo's custom multimodal code may require
# passing trust_remote_code=True to from_pretrained.
model = AutoModelForCausalLM.from_pretrained("sagar007/Lava_phi")
tokenizer = AutoTokenizer.from_pretrained("sagar007/Lava_phi")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Text-only generation
def generate_text(prompt):
    inputs = tokenizer(f"human: {prompt}\ngpt:", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Image + text generation
def process_image_and_prompt(image_path, prompt):
    # Convert to RGB so grayscale or RGBA images don't break CLIP preprocessing
    image = Image.open(image_path).convert("RGB")
    image_tensor = processor(images=image, return_tensors="pt").pixel_values

    # <image> marks where the image features go in the prompt
    inputs = tokenizer(f"human: <image>\n{prompt}\ngpt:", return_tensors="pt")
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        images=image_tensor,
        max_new_tokens=128,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
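
With the helpers above in place, a quick smoke test looks like this (the image path and prompts are placeholders):

```python
# Text-only query
print(generate_text("What is the capital of France?"))

# Image + text query; replace cat.jpg with a real image path
print(process_image_and_prompt("cat.jpg", "Describe this image."))
```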

## Training Details

- Trained using QLoRA (Quantized Low-Rank Adaptation); see the configuration sketch below
- 4-bit quantization for efficiency
- Gradient checkpointing enabled
- Mixed precision training (bfloat16)
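
The exact training hyperparameters aren't listed here, but a setup matching the bullets above can be reproduced with `bitsandbytes` and `peft` roughly as follows. The LoRA rank, target module names, and optimizer settings are illustrative assumptions, not the values used for this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with bfloat16 compute (the QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-1_5", quantization_config=bnb_config
)
model.gradient_checkpointing_enable()  # trade compute for memory
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA settings; the actual rank/targets may differ
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="llava-phi-qlora",
    bf16=True,  # mixed precision training
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
)
```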

## License

MIT License

## Citation

```bibtex
@software{llava_phi_2024,
  author    = {sagar007},
  title     = {LLaVA-Phi: Vision-Language Model},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/sagar007/Lava_phi}
}
```