Titovs committed on
Commit 4a2951c · verified · 1 Parent(s): d227a25

Update README.md

Files changed (1)
  1. README.md +10 -14
README.md CHANGED
@@ -21,17 +21,6 @@ tags:
21
  KStack-full models is a collection of fine-tuned open-source generative text models fine-tuned on KStack dataset with rule-based filtering.
22
  This is a repository for fine-tuned CodeLlama-7b model in the Hugging Face Transformers format.
23
 
24
- ## Rule-based filtering
25
- To increase the quality of the dataset and filter out statistical outliers such as homework assignments, we filter out the dataset entries according to the following rules:
26
- * We filter out files which belong to the low-popular repos (the sum of stars and forks is less than 6)
27
- * Next, we filter out files which belong to the repos with less than 5 Kotlin files
28
- * Finally, we remove files which have less than 20 SLOC
29
-
30
- We clean the content of the remaining dataset entries according to the following rules:
31
- * We remove all non-ASCII entries
32
- * We remove all package lines such as _package kotlinx.coroutines.channels_
33
- * We remove half of the import lines.
34
-
35
  # Model use
36
 
37
  ```python
@@ -83,10 +72,17 @@ The model was trained on one A100 GPU with following hyperparameters:
83
 
84
  More details about finetuning can be found in the technical report
85
 
86
- # Fine-tuning data
87
 
88
- For this model we used 15K exmaples of Kotlin Exercices dataset {TODO: link!}. Every example follows HumanEval like format. In total dataset contains about 3.5M tokens.
89
- For more information about the dataset follow the link.
 
 
 
 
 
 
 
90
 
91
  # Evaluation
92
 
 
21
  KStack-full models is a collection of fine-tuned open-source generative text models fine-tuned on KStack dataset with rule-based filtering.
22
  This is a repository for fine-tuned CodeLlama-7b model in the Hugging Face Transformers format.
23
 
 
 
 
 
 
 
 
 
 
 
 
24
  # Model use
25
 
26
  ```python
 
72
 
73
  More details about finetuning can be found in the technical report
74
 
75
+ # Data filtering
76
 
77
+ To increase the quality of the dataset and filter out statistical outliers such as homework assignments, we filter out the dataset entries according to the following rules:
78
+ * We filter out files which belong to the low-popular repos (the sum of stars and forks is less than 6)
79
+ * Next, we filter out files which belong to the repos with less than 5 Kotlin files
80
+ * Finally, we remove files which have less than 20 SLOC
81
+
82
+ We clean the content of the remaining dataset entries according to the following rules:
83
+ * We remove all non-ASCII entries
84
+ * We remove all package lines such as _package kotlinx.coroutines.channels_
85
+ * We remove half of the import lines.
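The filtering and cleaning rules listed above could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `KotlinFile` record and its field names are hypothetical, SLOC is approximated as the count of non-blank lines, "non-ASCII entries" is read as dropping any file that contains a non-ASCII character, and "half of the import lines" is implemented by dropping every second import line, since the README does not say how the half is chosen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KotlinFile:
    # Hypothetical record layout; the real dataset schema may differ.
    content: str
    repo_stars: int
    repo_forks: int
    repo_kotlin_files: int


def sloc(content: str) -> int:
    # Assumption: SLOC approximated as the number of non-blank lines.
    return sum(1 for line in content.splitlines() if line.strip())


def keep_file(f: KotlinFile) -> bool:
    # Rule 1: drop files from repos where stars + forks is less than 6.
    # Rule 2: drop files from repos with fewer than 5 Kotlin files.
    # Rule 3: drop files with fewer than 20 SLOC.
    return (
        f.repo_stars + f.repo_forks >= 6
        and f.repo_kotlin_files >= 5
        and sloc(f.content) >= 20
    )


def clean_content(content: str) -> Optional[str]:
    # Assumption: a file containing any non-ASCII character is dropped.
    if not content.isascii():
        return None
    kept = []
    import_count = 0
    for line in content.splitlines():
        stripped = line.strip()
        if stripped.startswith("package "):
            continue  # remove all package lines
        if stripped.startswith("import "):
            import_count += 1
            # Assumption: "half" means every second import line is dropped.
            if import_count % 2 == 0:
                continue
        kept.append(line)
    return "\n".join(kept)
```

A file would pass through `keep_file` first and then `clean_content`; a `None` result means the entry is discarded.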
86
 
87
  # Evaluation
88