Update README.md
README.md
CHANGED
@@ -1,10 +1,26 @@
 ---
 title: README
-emoji:
+emoji: 🚀
 colorFrom: blue
 colorTo: blue
 sdk: static
 pinned: false
 ---
+This is the official repository for the bachelor thesis "Fine-tuning of large language models for the analysis of medical texts" at the [Bern University of Applied Sciences](https://www.bfh.ch/de/).
 
-
+Medical documentation is a fundamental part of modern medicine. However, an estimated 80% of medical data is unstructured, which complicates the analysis and further
+processing of information. Additionally, time-consuming documentation is one of the main sources of stress for physicians. The use of artificial intelligence,
+especially large language models (LLMs), offers significant potential for the analysis of medical texts thanks to their advanced language understanding.
+That is why we investigated what can be achieved with open-source LLMs by fine-tuning them on task-specific data. Unlike large proprietary models such as
+GPT-4, locally deployable models keep all data inside your institution at all times. The developed models focus on unstructured German text, from which they
+should be able to extract relevant data. Additionally, the extracted entities should be normalized, and relevant relations and attributes should be identified. The
+models should also be able to summarize clinical texts. The two main participants of the thesis are:
+- Nicolas Gujer
+- Jorma Steiner
+
+The code for acquiring the data needed to fine-tune the models can be found in our [GitHub repository](https://github.com/AngryBacteria/ba-gujen1-steij14). Some of the datasets we used are not publicly available, and
+you have to formally request access from the institutions that provide them. You can find more information on the individual models, such as their performance and how to use them, in the respective
+model repositories. We used a mix of different kinds of data to fine-tune the models:
+- Two annotated German medical datasets: [BRONCO150](https://www2.informatik.hu-berlin.de/~leser/bronco/index.html) and [Cardio:DE](https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/AFYQDY).
+- 220 synthetic summaries
+- Data from the coding systems ICD-10-GM, ATC and OPS