---
title: README
emoji: 🐢
colorFrom: red
colorTo: yellow
sdk: static
pinned: false
---
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions
==============================================================================================================
Wan Ju Kang, Eunki Kim, Na Min An, Sangryul Kim, Haemin Choi, Ki Hoon Kwak, and James Thorne

## 📄 [Paper](https://arxiv.org/abs/2503.13369)
Hello, we are a team of researchers at [KAIST AI](https://gsai.kaist.ac.kr) working on accessible visualization.
Specifically, we compiled a diagram description dataset for blind and low-vision (BLV) individuals.
We worked in close cooperation with two schools for the blind, as well as with over 30 sighted annotators, and we are grateful for their contributions.
Check out our [preprint](https://arxiv.org/abs/2503.13369), and feel free to contact us at [email protected].
---------------------------------------

## Abstract

> Often, the needs and visual abilities differ between the annotator group and the end user
> group. Generating detailed diagram descriptions for blind and low-vision (BLV) users is one such challenging domain.
> Sighted annotators could describe visuals with ease, but existing studies have shown that direct generations by them are costly, bias-prone, and somewhat
> lacking by BLV standards. In this study, we ask sighted individuals to assess—rather than produce—diagram descriptions generated by vision-language models (VLM) that have been
> guided with latent supervision via a multi-pass inference. The sighted assessments prove effective and useful to professional educators
> who are themselves BLV and teach visually impaired learners. We release SIGHTATION, a collection of diagram description datasets
> spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes and
> demonstrate their fine-tuning potential in various downstream tasks.
## Sightation Collection

- SightationCompletions
- SightationPreference
- SightationRetrieval
- SightationVQA
- SightationReasoning
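If you would like to experiment with the subsets, they can be pulled from the Hugging Face Hub with the `datasets` library. Note that the repository IDs in the sketch below are illustrative assumptions, not confirmed names — please check the Hub for the exact identifiers before loading.

```python
# Minimal sketch for loading the Sightation subsets.
# ASSUMPTION: the Hub org/repo naming ("Sightation/<subset>") is hypothetical;
# substitute the actual repository IDs from the Hugging Face Hub.

SUBSETS = [
    "SightationCompletions",
    "SightationPreference",
    "SightationRetrieval",
    "SightationVQA",
    "SightationReasoning",
]

def hub_id(subset: str, org: str = "Sightation") -> str:
    """Build a hypothetical '<org>/<subset>' Hub repository ID."""
    return f"{org}/{subset}"

if __name__ == "__main__":
    # With the `datasets` library installed and network access, one subset
    # could then be loaded like so (left commented since the IDs are assumed):
    # from datasets import load_dataset
    # vqa = load_dataset(hub_id("SightationVQA"), split="train")
    for subset in SUBSETS:
        print(hub_id(subset))
```

The actual `load_dataset` call follows the standard `datasets` API; only the repository names above are placeholders.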
<img src="https://cdn-uploads.huggingface.co/production/uploads/67a86f66c6f66e2fa5888b41/cNshK4QAdiNMqk7x6J6j7.png" width="100%" height="100%" title="visual_abstract" alt="visual_abstract"/>
The key benefit of utilizing sighted user feedback lies in their assessments, which rest on solid visual
grounding. The compiled assessments prove to be effective training material for steering VLMs toward more
accessible descriptions.
<img src="https://cdn-uploads.huggingface.co/production/uploads/67a86f66c6f66e2fa5888b41/8oYvtq7dtv_Ck-U6OlcAE.png" width="70%" height="70%" title="dimensions_assignment" alt="dimensions_assignment"/>

The description qualities assessed by their respective evaluator groups.
## Results

<img src="https://cdn-uploads.huggingface.co/production/uploads/67a86f66c6f66e2fa5888b41/094e9Hw7lauvT1tshg1Wj.png" width="90%" height="90%" title="spider_chart" alt="spider_chart"/>
Tuning VLMs on Sightation enhanced various qualities of the diagram descriptions, as evaluated by BLV educators and shown here as normalized ratings averaged within each aspect.
The benefit of the dataset is most strongly pronounced with the Qwen2-VL-2B model, shown above.
## BibTeX

If you find our dataset helpful, please cite our work!
```bibtex
@misc{kang2025sightationcountsleveragingsighted,
      title={Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions},
      author={Wan Ju Kang and Eunki Kim and Na Min An and Sangryul Kim and Haemin Choi and Ki Hoon Kwak and James Thorne},
      year={2025},
      eprint={2503.13369},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.13369},
}
```
### What's in the name?

- Training on our dataset means using the sighted user feedback; in a distant way, you would be citing them.
- Suppose you refer to our dataset in a spoken conversation. The sightation/citation confusion is meant to mimic a small part of the inconvenience faced by BLV learners, who often must rely on auditory cues alone to disambiguate homophones.