BachelorThesis
AI & ML interests
Medical information extraction and summarization
Welcome to the official repository for the bachelor's thesis titled "Fine-tuning of Large Language Models for the Analysis of Medical Texts," conducted at the Bern University of Applied Sciences.
Disclaimer
This project is part of a bachelor's thesis and is not intended for use in a production environment. The code is not optimized for performance and is not guaranteed to work in all environments. We provide no support for this project and accept no responsibility for any damage caused by its use.
Overview
In modern medicine, medical documentation is essential yet challenging due to the largely unstructured nature of medical data—approximately 80% by current estimates. This unstructured data complicates the analysis and extraction of actionable insights, contributing significantly to the administrative burden and stress experienced by healthcare providers.
To address these challenges, we explore the potential of Artificial Intelligence, particularly Large Language Models (LLMs), to support the management and analysis of medical texts. Our research focuses on the adaptability and effectiveness of open-source LLMs fine-tuned to process and analyze medical documentation efficiently while preserving data privacy and security within medical institutions. The main focus of this project is developing models for German clinical texts; the base models we used were themselves mainly developed for German contexts and documents.
Objectives
- Data Extraction: Accurately extracting relevant information from German medical texts.
- Entity Normalization: Standardizing extracted entities against medical terminologies.
- Attribute Identification: Detecting attributes of the entities within the medical texts, such as the body location of a diagnosis or its level of truth (e.g., negated or speculative).
- Text Summarization: Generating summaries of clinical documents.
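The four objectives above can all be phrased as instructions to a single fine-tuned decoder model. The sketch below shows one way such prompts could be assembled; the German instruction wording and the task names are illustrative assumptions, not the thesis's actual prompt templates.

```python
# Hypothetical instruction templates for the four objectives; the exact
# wording is an assumption for illustration, not the thesis's real prompts.
TASK_INSTRUCTIONS = {
    "extraction": "Extrahiere alle Diagnosen, Medikamente und Prozeduren aus dem folgenden Text.",
    "normalization": "Ordne die folgende Entität einem Code des angegebenen Katalogs zu.",
    "attributes": "Bestimme für die folgende Diagnose die Körperstelle und den Wahrheitsgrad.",
    "summarization": "Fasse den folgenden klinischen Bericht kurz zusammen.",
}

def build_prompt(task: str, text: str) -> str:
    """Combine the task instruction with the clinical text into one prompt."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown task: {task}")
    return f"{TASK_INSTRUCTIONS[task]}\n\nText:\n{text}\n\nAntwort:"

prompt = build_prompt("extraction", "Patient mit arterieller Hypertonie.")
print(prompt)
```

The same prompt-building step would then feed whichever fine-tuned decoder model is being evaluated.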
Data used
- Annotated Medical Gold-Standard Datasets: Specifically BRONCO150 and Cardio:DE.
- Synthetic Data: Around 200 synthetic summaries to enhance model training. One part was created by us using GPT-4; the other comes from the Dev4Med/Notfallberichte-German-100 dataset.
- Medical Coding Systems: Data from the ICD-10-GM, ATC, and OPS coding systems.
Team
- Nicolas Gujer ([email protected])
- Jorma Steiner ([email protected])
Source-Code
All necessary data acquisition scripts and additional resources are available on our GitHub Repository. Access to certain datasets is restricted and requires formal requests to the respective institutions.
For detailed information on each model's performance and guidelines on usage, please refer to the individual model repositories linked within our GitHub.
Collections
- Decoder Information Extraction: These are the main models developed in our thesis. A single model can perform all the objectives of our work. They are all based on the decoder-only transformer architecture.
- Encoder Information Extraction: These are models we developed as baselines for comparison with our decoder models. Encoder transformers are the current state of the art for some of the objectives of this thesis.
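The encoder baselines typically treat extraction as token classification: the model assigns a BIO tag to each token, and entity spans are decoded afterwards. The sketch below shows that decoding step under standard BIO conventions; the tokens and labels are invented for illustration.

```python
# Decode BIO-tagged tokens into (entity_text, label) spans, as an encoder
# token-classification baseline would produce them. Example data is invented.
def decode_bio(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Collect entity spans from parallel token and BIO-tag sequences."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # new entity starts
            if current:
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)             # entity continues
        else:                               # O tag or inconsistent I- tag
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Patient", "mit", "arterieller", "Hypertonie", "."]
tags = ["O", "O", "B-DIAG", "I-DIAG", "O"]
print(decode_bio(tokens, tags))  # [('arterieller Hypertonie', 'DIAG')]
```

Comparing these span outputs against the decoder models' generated answers is what makes the two collections directly comparable on the extraction objectives.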