---
license: cc-by-4.0
datasets:
- HuggingFaceH4/no_robots
- Ba2han/Reddit-instruct-curated_rated-1.2k
- Ba2han/databricks-dolly_rated
- Open-Orca/OpenOrca
---
The training dataset consists of the 2k longest examples from no_robots, reddit_instruct, dolly, and OpenOrca, plus two other personal datasets.

Please use ChatML with the default system message, or enter your own. The model was trained with various system messages; the one in the config is the default.
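For reference, a minimal sketch of assembling a ChatML-formatted prompt by hand (the system message shown here is a placeholder — substitute the default from the model's config or your own):

```python
# Minimal sketch of the ChatML prompt layout. The default system message is a
# placeholder, not the one shipped in this model's config.
def build_chatml_prompt(user_message: str,
                        system_message: str = "You are a helpful assistant.") -> str:
    """Wrap a single user turn in ChatML, leaving the assistant turn open."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("Explain gravity like I'm five.")
print(prompt)
```

Libraries such as `transformers` can build this automatically from the tokenizer's chat template, but the string above shows the layout the model expects.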

The model is:
- Very good at generating long, coherent text.
- Creative, thanks to data from Reddit ELI5 and a few other sources.
- Better at handling longer inputs.
- Not great with short text, in either input or generation.

The aim is to see how well the **"Long is More for Alignment"** paper holds up; this is basically a combination of LIMA + LMA. There should be no benchmark contamination as far as I am aware. Around 70% of the data comes from the datasets mentioned above. I am happy with how it turned out.