---
language:
- en
library_name: transformers
tags:
- jiviai
- medical ai
widget:
- example_title: X-ray Report Generation
  messages:
  - role: user
    content:
    - type: text
      text: Generate an xray report for the above xray.
    - type: image_url
      image_url:
        url: >-
          https://jarvis-01j48hrq5383vpdk8csp3r60xa.s3.amazonaws.com/dev/MISC/2024-10-04/01J9BCTFZBH1AMD246G55DCY93.jpeg
  - role: assistant
    content: |
      Patient: [Name not provided]
      Gender: Female
      Age: [Age not provided]
      Clinical History: [Not provided]
      View: PA (Posterior-Anterior)
      Image Quality: The image is of good quality. The patient positioning is adequate, and the exposure is appropriate for interpretation.
      Findings:
      * The lung fields are clear bilaterally, with no evidence of consolidation, masses, or infiltrates.
      * The heart size is within normal limits, and the cardiac silhouette is unremarkable.
      * The diaphragm and costophrenic angles are clear.
      * A pacemaker and leads are present in the right upper chest, with the pacemaker generator in the left upper chest.
      * No acute bony abnormalities are identified.
      * No pneumothorax, pleural effusion, or consolidation is seen.
      Conclusion:
      * No acute cardiopulmonary abnormality is detected.
      * The presence of a pacemaker and leads suggests a history of cardiac rhythm management.
      * Clinical correlation is advised for further management.
pipeline_tag: image-text-to-text
---
# Jivi-RadX: Large Language Vision Assistant for X-rays
Introducing **Jivi-RadX-v1**, a vision-language model designed for image reasoning in the healthcare domain. The model interprets radiographic X-ray images and answers a wide range of diagnostic and analytical questions about them. Whether it is aiding clinicians in making informed decisions or assisting researchers in understanding complex image data, Jivi-RadX-v1 pushes the boundaries of medical imaging analysis.
**Model Architecture:** `Jivi-RadX-v1` is built on top of the Llama 3.1 text-only model, an auto-regressive language model that uses an optimized transformer architecture. To support image recognition tasks, we use a separately trained vision encoder and a vision projector that integrates with our base language model.
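As a rough illustration of how a vision projector can connect an image encoder to a language model, the toy sketch below maps patch features into the text embedding size and splices them into the sequence at the image placeholder position. It follows the general LLaVA-style pattern and is not Jivi-RadX's actual implementation: the dimensions, weights, single linear layer, and function names are all assumptions chosen for readability.

```python
# Toy sketch of a LLaVA-style vision projector (illustrative only).
# All sizes, weights, and function names are made up; real projectors are
# learned layers operating on tensors.


def project(feature, weight, bias):
    """Map one vision feature vector to the text embedding size: W @ x + b."""
    return [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weight, bias)
    ]


def splice_image(text_embeddings, image_embeddings, image_index):
    """Replace the placeholder embedding at image_index with the image embeddings."""
    return (
        text_embeddings[:image_index]
        + image_embeddings
        + text_embeddings[image_index + 1:]
    )


VISION_DIM, TEXT_DIM = 4, 3  # hypothetical sizes

# Hypothetical projector parameters: TEXT_DIM rows of VISION_DIM weights.
weight = [[0.1 * (r + c) for c in range(VISION_DIM)] for r in range(TEXT_DIM)]
bias = [0.0] * TEXT_DIM

# Two image patches, standing in for the output of a vision encoder.
patch_features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_embeddings = [project(p, weight, bias) for p in patch_features]

# Three text positions; index 1 stands for the image placeholder token.
text_embeddings = [[0.0] * TEXT_DIM for _ in range(3)]
sequence = splice_image(text_embeddings, image_embeddings, image_index=1)

print(len(sequence), len(sequence[0]))
```

The language model then attends over the combined sequence, so every image patch is seen as if it were an ordinary token embedding.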
## Benchmarks
We have released our x-ray benchmark [jivi_chexnet](https://huggingface.co/datasets/jiviai/jivi_chexnet), which combines rich and diverse x-ray images with verified question-and-answer pairs about them. Below is a comparison of our model with other open-source and closed-source vision LLMs.
| Model | CheXpert (Stanford dataset) | NIH Chest X-Ray (NIH dataset) | Overall Accuracy |
|------------------|-----------------------------|----------------------------------|------------------|
| **Jivi AI (Jivi-RadX)** | **85%** | **64.5%** | **75%** |
| Open AI (GPT-4o) | 81.5% | 66.3% | 74% |
| Stanford AIMI (ChexAgent-8b) | 55.6% | 50.3% | 53% |
| Microsoft (Llava-Med) | 51% | 46.6% | 49% |
| *Google (Gemini 1.5 Pro) | 36% | 58% | 47% |
\*Gemini 1.5 Pro occasionally declines to respond to medical questions, which affects its overall accuracy.

Benchmark numbers were calculated using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) by LMMs-Lab.
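The card does not state how the Overall Accuracy column is derived. The reported values are consistent with an unweighted mean of the two per-dataset accuracies, rounded to the nearest percent. The sketch below checks that assumption against the table; the helper name `overall_accuracy` is ours and is not part of the benchmark tooling.

```python
# Accuracies (%) copied from the table above:
# (CheXpert, NIH Chest X-Ray, reported Overall Accuracy)
results = {
    "Jivi-RadX": (85.0, 64.5, 75),
    "GPT-4o": (81.5, 66.3, 74),
    "ChexAgent-8b": (55.6, 50.3, 53),
    "Llava-Med": (51.0, 46.6, 49),
    "Gemini 1.5 Pro": (36.0, 58.0, 47),
}


def overall_accuracy(chexpert, nih):
    """Assumed aggregation: unweighted mean of the two dataset accuracies."""
    return (chexpert + nih) / 2


# Compare the assumed aggregation with the reported column for every model.
matches = {}
for model, (chexpert, nih, reported) in results.items():
    mean = overall_accuracy(chexpert, nih)
    matches[model] = round(mean) == reported
    print(f"{model}: mean={mean:.2f}, reported={reported}, match={matches[model]}")
```

If the two datasets contributed different numbers of questions, a weighted mean would be the more natural aggregate, so treat this only as a consistency check.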
## Training process:
## Training Data
`Jivi-RadX-v1` was pretrained on 365k medical image and text pairs. The instruction tuning data includes over 280k synthetically generated examples.
### Synthetic Data Generation
We leveraged various closed-source and open-source visual LLMs and used the metadata of the x-ray images to generate rich captions for training.
## How to use
### Use with transformers
Please ensure `transformers>=4.45.2` is installed.
```python
import requests
import torch
from PIL import Image
from transformers import (
    AutoProcessor,
    AutoTokenizer,
    LlavaForConditionalGeneration,
)

# "<image>" marks where the image features are inserted into the prompt.
conversation = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {
        "role": "user",
        "content": "<image>\n Please describe this x-ray.",
    },
]

model_id = "jiviai/Jivi-RadX-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, attn_implementation="eager", device_map="cuda", torch_dtype=torch.float16
)

# Render the conversation into the model's chat format as a plain string.
prompt = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)

# Download the example X-ray.
url = "https://jarvis-01j48hrq5383vpdk8csp3r60xa.s3.amazonaws.com/dev/MISC/2024-10-03/01J991DRQ2G5TAB24A9QNMFAXN.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Tokenize the prompt, preprocess the image, and move both to the model's device.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, dtype=model.dtype
)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
output = processor.decode(
    generate_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output)
```
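Generated reports are free text. If the output follows the layout of the sample report in this card's widget example (header lines such as `View: ...`, followed by `Findings:` and `Conclusion:` sections with `*` bullets), a small parser can turn it into structured data. This is a sketch under that assumption: the model is not guaranteed to produce this layout, and `parse_report` is a hypothetical helper, not part of `transformers`.

```python
def parse_report(text):
    """Split a report into header fields and bulleted sections.

    Assumes "Key: value" header lines, "Section:" lines with no value,
    and "*" bullets belonging to the most recent section.
    """
    fields, sections = {}, {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            # Bullet: attach to the open section, ignore if there is none.
            if current is not None:
                sections[current].append(line[1:].strip())
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # Not a "Key: value" or "Section:" line.
        if value.strip():
            fields[key.strip()] = value.strip()
            current = None
        else:
            current = key.strip()
            sections[current] = []
    return fields, sections


sample = """View: PA (Posterior-Anterior)
Findings:
* The lung fields are clear bilaterally, with no evidence of consolidation, masses, or infiltrates.
* No acute bony abnormalities are identified.
Conclusion:
* No acute cardiopulmonary abnormality is detected.
"""

fields, sections = parse_report(sample)
print(fields)
print(sections)
```

A full report needs far more than the 30 new tokens requested in the example above, so raise `max_new_tokens` before relying on a parser like this.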
**Supported Languages:** Currently, only English is supported. We plan to introduce multilingual support shortly.
**Feedback:** To send feedback or questions, please use the community section of the model page.
## Intended use
The data, code, and model checkpoints are intended to be used solely for:
1. Future research on visual-language processing.
2. Reproducibility of the experimental results reported in the reference paper.
### Disclaimer: The data, code, and model checkpoints are not intended to be used in clinical care or for any clinical decision-making purposes.