AngryBacteria committed
Commit dc5fcad · verified · 1 Parent(s): c76346e

Update README.md

Files changed (1)
  1. README.md +27 -15
README.md CHANGED
@@ -6,21 +6,33 @@ colorTo: blue
  sdk: static
  pinned: false
  ---
- This is the official Repository for the bachelor thesis "Fine-tuning of large language models for the analysis of medical texts" at the [bern university of applied sciences](https://www.bfh.ch/de/).
-
- Medical documentation is a fundamental part of modern medicine. However, an estimated 80% of medical data is unstructured, which complicates the analysis and further
- processing of information. Additionally, time-consuming documentation is one of the main sources of stress for physicians. The use of artificial intelligence,
- especially Large Language Models (LLMs), offers significant potential for the analysis of medical texts thanks to their advanced language understanding.
- That is why we tried to find out what can be achieved with Open-Source Large Language Models by finetuning them on task specific data. Compared to using big names like
- GPT-4, with locally deployable models all data stays safe in your institution at all times. The focus of the developed models is german unstructured text, which the models
- should be able to extract relevant data out of. Additionally the extracted entities should be normalized and relevant relations/attributes should be identified as well. Also the
- models should be able to create summarizations of clinical texts. The two main participants of the thesis are:
  - Nicolas Gujer ([[email protected]](mailto:[email protected]))
  - Jorma Steiner ([[email protected]](mailto:[email protected]))

- The code for acquiring the necessary data to finetune the models can be found on our [GitHub Repository](https://github.com/AngryBacteria/ba-gujen1-steij14). Some of the datasets we used are not publicly available and
- you have to formally issue a request to the institutions. You can find more information on the individual models, such as their performance and how to use them, in the respective
- model repositories. We used a mix of different kind of data to finetune the models:
- - Two annotated german medical datasets: [BRONCO150](https://www2.informatik.hu-berlin.de/~leser/bronco/index.html) and [Cardio:DE](https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/AFYQDY).
- - 220 Synthetic summarizations
- - Data from the coding systems ICD10GM, ATC and OPS
 
  sdk: static
  pinned: false
  ---
+ Welcome to the official repository for the bachelor's thesis "Fine-tuning of Large Language Models for the Analysis of Medical Texts," conducted at the [Bern University of Applied Sciences](https://www.bfh.ch/de/).
+
+ ## Disclaimer
+ This project is part of a bachelor's thesis and is not meant to be used in a production environment. The code is not optimized for performance and is not guaranteed to work in all environments. We do not provide support for this project and are not responsible for any damage caused by its use.
+
+ ## Overview
+ In modern medicine, medical documentation is essential yet challenging because medical data is largely unstructured (approximately 80% by current estimates). This unstructured data complicates the analysis and extraction of actionable insights and contributes significantly to the administrative burden and stress experienced by healthcare providers.
+
+ To address these challenges, we explore the potential of artificial intelligence, particularly Large Language Models (LLMs), to support the management and analysis of medical texts. Our research focuses on the adaptability and effectiveness of open-source LLMs, fine-tuned to process and analyze medical documentation efficiently while keeping all data private and secure within the medical institution.
+ The main focus of this project is German clinical texts, even though the base models we fine-tuned were developed primarily for an English-language context.
+
+ ## Objectives
+ - Data Extraction: Accurately extracting relevant information from German medical texts, which are predominantly unstructured.
+ - Entity Normalization: Standardizing extracted entities to align with recognized medical terminologies.
+ - Relationship and Attribute Identification: Detecting and categorizing pertinent relationships and attributes within the medical data.
+ - Text Summarization: Generating concise summaries of extensive clinical documents to aid quick comprehension and decision-making.
+
+ ## Data used
+ - Annotated Medical Gold-Standard Datasets: specifically [BRONCO150](https://www2.informatik.hu-berlin.de/~leser/bronco/index.html) and [Cardio:DE](https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/AFYQDY).
+ - Synthetic Data: Around 200 synthetic summaries to enhance model training. We created one part ourselves using GPT-4; the other comes from the [Dev4Med/Notfallberichte-German-100](https://huggingface.co/datasets/Dev4Med/Notfallberichte-German-100) dataset.
+ - Medical Coding Systems: Data from the ICD10GM, ATC, and OPS coding systems.
+
+ ## Team
  - Nicolas Gujer ([[email protected]](mailto:[email protected]))
  - Jorma Steiner ([[email protected]](mailto:[email protected]))

+ ## Resources
+ All necessary data acquisition scripts and additional resources are available in our [GitHub Repository](https://github.com/AngryBacteria/ba-gujen1-steij14). Access to certain datasets is restricted and requires a formal request to the respective institutions.
+
+ For detailed information on each model's performance and guidelines on usage, please refer to the individual model repositories linked within our GitHub.
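
To make the extraction, normalization, and relation objectives above more concrete, here is a minimal illustrative sketch of what a structured result for a short German clinical sentence could look like. The schema, field names, entity types, and the specific ATC/ICD10GM codes are assumptions chosen for demonstration, not the exact output format or vocabulary used by the thesis models.

```python
import json

# Illustrative sketch only: the schema, field names, and codes below are
# assumptions for demonstration, not the thesis models' actual output format.

# A short German clinical sentence as input.
text = "Der Patient erhielt Aspirin 100 mg täglich wegen einer koronaren Herzkrankheit."

# One plausible structured result combining entity extraction, normalization
# (ATC for medications, ICD10GM for diagnoses), attributes, and relations.
extraction = {
    "entities": [
        {
            "text": "Aspirin",
            "type": "MEDICATION",
            "normalized": {"system": "ATC", "code": "B01AC06"},  # acetylsalicylic acid
            "attributes": {"dose": "100 mg", "frequency": "täglich"},
        },
        {
            "text": "koronaren Herzkrankheit",
            "type": "DIAGNOSIS",
            "normalized": {"system": "ICD10GM", "code": "I25.9"},  # chronic ischaemic heart disease, unspecified
            "attributes": {},
        },
    ],
    "relations": [
        {"head": "Aspirin", "type": "TREATS", "tail": "koronaren Herzkrankheit"}
    ],
}

print(json.dumps(extraction, indent=2, ensure_ascii=False))
```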
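
Since the individual model repositories live on the Hugging Face Hub, a fine-tuned checkpoint would typically be used through the standard transformers API, roughly as sketched below. The repository ID and the prompt are placeholders; the actual model names, prompt templates, and generation settings are documented in the respective model repositories.

```python
# Minimal usage sketch with the standard Hugging Face transformers API.
# "AngryBacteria/example-model" and the prompt below are placeholders;
# consult the individual model repository for real names and templates.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AngryBacteria/example-model"  # placeholder repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A German instruction-style prompt; real prompt formats are model-specific.
prompt = (
    "Extrahiere alle Diagnosen aus dem folgenden Text:\n"
    "Der Patient leidet an Diabetes mellitus Typ 2."
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```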