---
language:
  - en
library_name: transformers
tags:
  - jiviai
  - medical ai
widget:
  - example_title: X-ray Report Generation
    messages:
      - role: user
        content:
          - type: text
            text: Generate an x-ray report for the above x-ray.
          - type: image_url
            image_url:
              url: >-
                https://jarvis-01j48hrq5383vpdk8csp3r60xa.s3.amazonaws.com/dev/MISC/2024-10-04/01J9BCTFZBH1AMD246G55DCY93.jpeg
      - role: assistant
        content: |
          Patient: [Name not provided]
          Gender: Female
          Age: [Age not provided]
          Clinical History: [Not provided]
          View: PA (Posterior-Anterior)

          Image Quality: The image is of good quality. The patient positioning is adequate, and the exposure is appropriate for interpretation.

          Findings:
          * The lung fields are clear bilaterally, with no evidence of consolidation, masses, or infiltrates.
          * The heart size is within normal limits, and the cardiac silhouette is unremarkable.
          * The diaphragm and costophrenic angles are clear.
          * A pacemaker and leads are present in the right upper chest, with the pacemaker generator in the left upper chest.
          * No acute bony abnormalities are identified.
          * No pneumothorax, pleural effusion, or consolidation is seen.

          Conclusion:
          * No acute cardiopulmonary abnormality is detected.
          * The presence of a pacemaker and leads suggests a history of cardiac rhythm management.
          * Clinical correlation is advised for further management.
pipeline_tag: image-text-to-text
---

# Jivi-RadX: Large Language Vision Assistant for X-rays

Introducing **Jivi-RadX-v1**, an advanced visual language model designed for image reasoning in the healthcare domain. The model interprets radiographic X-ray images and provides accurate, insightful responses to a wide range of diagnostic and analytical questions, whether aiding clinicians in making informed decisions or assisting researchers in understanding complex image data.

**Model Architecture:** `Jivi-RadX-v1` is built on top of the Llama 3.1 text-only model, an auto-regressive language model that uses an optimized transformer architecture. To support image recognition tasks, we use a separately trained vision encoder and a vision projector that integrate with our base language model.

## Benchmarks

We have released our x-ray benchmark [jivi_chexnet](https://huggingface.co/datasets/jiviai/jivi_chexnet), which combines rich and diverse x-ray images with verified questions and answers about them. Below is a comparison of our model with other open-source and closed-source vision LLMs.

| Model | CheXpert (Stanford dataset) | NIH Chest X-ray (NIH dataset) | Overall Accuracy |
|------------------------------|-------|-------|-----|
| **Jivi AI (Jivi-RadX)**      | **85%** | **64.5%** | **75%** |
| OpenAI (GPT-4o)              | 81.5% | 66.3% | 74% |
| Stanford AIMI (CheXagent-8b) | 55.6% | 50.3% | 53% |
| Microsoft (LLaVA-Med)        | 51%   | 46.6% | 49% |
| Google (Gemini 1.5 Pro)*     | 36%   | 58%   | 47% |

\* Gemini 1.5 Pro occasionally declines to respond to medical questions, which affects its overall accuracy.

Benchmark numbers were calculated using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) by *LMMs-Lab*.

## Training process

*(Figure: Jivi-RadX-v1 training steps)*

## Training Data

`Jivi-RadX-v1` was pretrained on 365k medical image–text pairs. The instruction-tuning data includes over 280k synthetically generated examples.
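The exact schema of the instruction-tuning corpus is not published. Purely as an illustration, a single record in the common LLaVA-style conversation format might look like the sketch below; the field names, identifier, and image path are hypothetical.

```python
# Hypothetical sketch of one instruction-tuning record in a LLaVA-style
# conversation format. The actual schema used for Jivi-RadX-v1 is not
# published; the field names, id, and image path are illustrative only.
example_record = {
    "id": "cxr-000001",                 # hypothetical sample identifier
    "image": "images/cxr-000001.jpeg",  # hypothetical path to the x-ray image
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nGenerate an x-ray report for the above x-ray.",
        },
        {
            "from": "gpt",
            "value": "Findings: The lung fields are clear bilaterally ...",
        },
    ],
}
```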
### Synthetic Data Generation

We leveraged various closed-source and open-source visual LLMs and used the metadata of the x-ray images to generate rich captions for training.

## How to use

### Use with transformers

Please ensure `transformers>=4.45.2`.

```python
import requests
import torch
from PIL import Image
from transformers import (AutoProcessor, AutoTokenizer,
                          LlavaForConditionalGeneration)

# The "<image>" placeholder marks where the image embeddings are inserted.
conversation = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {
        "role": "user",
        "content": "<image>\n Please describe this x-ray.",
    },
]

model_id = "jiviai/Jivi-RadX-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="eager",
    device_map="cuda",
    torch_dtype=torch.float16,
)

# Build the prompt text from the chat template; the processor pairs it
# with the image below.
prompt = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)

# Download an example chest x-ray.
url = "https://jarvis-01j48hrq5383vpdk8csp3r60xa.s3.amazonaws.com/dev/MISC/2024-10-03/01J991DRQ2G5TAB24A9QNMFAXN.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, dtype=model.dtype
)

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
output = processor.decode(
    generate_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output)
```

**Supported Languages:** Currently we only support English. We plan to introduce multilingual support shortly.

**Feedback:** To send any feedback or questions, please use the community section of the model.

## Intended use

The data, code, and model checkpoints are intended to be used solely for:

1. Future research on visual-language processing.
2. Reproducibility of the experimental results reported in the reference paper.

### Disclaimer

The data, code, and model checkpoints are not intended to be used in clinical care or for any clinical decision-making purposes.