balrogbob committed · verified
Commit 9f32ea6 · 1 Parent(s): 0115d87

Update README.md

Files changed (1)
  1. README.md +72 -3
README.md CHANGED
@@ -1,3 +1,72 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ datasets:
+ - AdapterOcean/python-code-instructions-18k-alpaca-standardized_cluster_1_alpaca
+ - AdapterOcean/python-code-instructions-18k-alpaca-standardized_cluster_2_alpaca
+ - AdapterOcean/python-code-instructions-18k-alpaca-standardized_cluster_3_alpaca
+ - AdapterOcean/python-code-instructions-18k-alpaca-standardized_cluster_4_alpaca
+ - kejian/codesearchnet-python-raw
+ pipeline_tag: text-generation
+ ---
+
+ # Model Card for MicroBOB-python
+
+ MicroBOB-python is a new, from-scratch micro model based on RWKV x051a that does not require a custom kernel for training or inference. It was developed and trained using a modified version of nanoRWKV.
+
+ ## Model Details
+
+ ### Model Description
+
+ MicroBOB-python is a from-scratch micro model based on the RWKV x051a architecture, which needs no custom kernel to train or run inference, built with a modified version of nanoRWKV.
+ The base model was trained on tens of thousands of lines of open-source and internal Python code, then finetuned in 5 rounds using kejian/codesearchnet-python-raw, AdapterOcean/python-code-instructions-18k-alpaca-standardized_cluster_1_alpaca, and the three other datasets in the same series.
+
+ It was developed as a simple autocomplete for an in-house Python code editor, and it has become capable enough for its extremely small size (30 million parameters) that I thought it was worth sharing. The model weights alone are licensed under MIT.
+
+ - **Developed by:** BalrogBob
+ - **Model type:** Custom implementation of RWKV x051a
+ - **License:** MIT (model weights only)
+ - **Finetuned from model:** MicroBOB
+
+ ## Uses
+
+ Simple autocompletion of Python, or of code with Python-like syntax.
+
+ ### Direct Use
+
+ The sample.py script from https://github.com/BlinkDL/nanoRWKV is sufficient for inference, and the training script included there works with the model weights; training speed and memory usage are slightly worse than with the customized code, but the results should be functionally identical. A minimal sampling sketch follows.
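+
+ This is only a sketch, assuming sample.py keeps nanoGPT-style configurator flags (--out_dir, --start, --num_samples, --max_new_tokens) and that the checkpoint sits in a directory named out-microbob-python; both are assumptions, so check your clone of nanoRWKV and adjust.
+
+ ```python
+ import subprocess
+
+ # Autocomplete-style sampling via sample.py. The flags assume nanoGPT-style
+ # configurator options; the checkpoint directory name is hypothetical.
+ prompt = "def fibonacci(n):\n    "
+ subprocess.run(
+     [
+         "python", "sample.py",
+         "--out_dir=out-microbob-python",
+         f"--start={prompt}",
+         "--num_samples=1",
+         "--max_new_tokens=64",
+     ],
+     check=True,
+ )
+ ```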
+
+ ### Downstream Use
+
+ Code replacement and re-formatting: with a small amount of finetuning and some supporting Python code, the model can be used to replace words, function names, and variable names in Python code, as in the sketch below.
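+
+ A minimal sketch of that idea. The complete() callable is a hypothetical prompt-in/text-out wrapper around MicroBOB-python (it is not part of nanoRWKV), and the prompt format is illustrative only.
+
+ ```python
+ import re
+
+
+ def suggest_rename(source: str, old_name: str, complete) -> str:
+     """Ask the model for a replacement identifier, then substitute it.
+
+     `complete` is a hypothetical prompt-in/text-out wrapper around
+     MicroBOB-python; it is not part of nanoRWKV.
+     """
+     prompt = f"# Rename `{old_name}` to a clearer name in:\n{source}\n# New name:"
+     candidates = complete(prompt).strip().split()
+     new_name = candidates[0] if candidates else old_name
+     # Whole-word substitution so identifiers merely containing old_name stay intact.
+     return re.sub(rf"\b{re.escape(old_name)}\b", new_name, source)
+ ```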
+
+ ### Out-of-Scope Use
+
+ RLHF training can be used to instruction-tune the model, with limited success: it blunts the model's intelligence because of the limited number of available parameters, since the RLHF updates overwrite information learned from the pretraining data in the model weights.
+
+ ## Bias, Risks, and Limitations
+
+ The model has no bias training or safety guardrails. It was trained on code from open web sources and may incidentally produce malicious or insecure code. Use at your own risk! You are fully responsible for any generations produced using these model weights.
+
+ ## How to Get Started with the Model
+
+ Clone https://github.com/BlinkDL/nanoRWKV; all the code contained there is compatible with this model. While the code that produced the model is optimized and customized, the base nanoRWKV package can finetune the MicroBOB-python weights without issue, at somewhat reduced memory efficiency. A quick sanity check on the downloaded weights is sketched below.
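+
+ A small sanity-check sketch, assuming the released weights are a standard PyTorch checkpoint; the file name and the nanoGPT-style "model" key are assumptions, not guarantees about the actual upload.
+
+ ```python
+ import torch
+
+ # Load the released weights and count parameters. The file name is
+ # hypothetical; nanoGPT-style checkpoints nest the weights under "model".
+ ckpt = torch.load("microbob-python.pt", map_location="cpu")
+ state = ckpt.get("model", ckpt)
+ n_params = sum(t.numel() for t in state.values() if hasattr(t, "numel"))
+ print(f"{n_params / 1e6:.1f}M parameters")  # should be roughly 30M
+ ```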
+
+ ## Training Details
+
+ ### Training Data
+
+ - https://huggingface.co/datasets/AdapterOcean/python-code-instructions-18k-alpaca-standardized_cluster_1_alpaca
+ - https://huggingface.co/datasets/AdapterOcean/python-code-instructions-18k-alpaca-standardized_cluster_2_alpaca
+ - https://huggingface.co/datasets/AdapterOcean/python-code-instructions-18k-alpaca-standardized_cluster_3_alpaca
+ - https://huggingface.co/datasets/AdapterOcean/python-code-instructions-18k-alpaca-standardized_cluster_4_alpaca
+ - https://huggingface.co/datasets/kejian/codesearchnet-python-raw
+ - My personal Python code folder: 40+ projects and roughly 30k lines of code
+
+ ### Training Procedure
+
+ Standard nanoRWKV data prep with a custom training loop.
+
+ #### Preprocessing
+
+ All datasets were tokenized with the GPT-2 encoding for simplicity; a version of MicroBOB with a custom BPE encoder is in development. A sketch of the tokenization step is shown below.
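+
+ A minimal sketch of that step, assuming the usual nanoGPT-style data prep: encode the text with the GPT-2 BPE (tiktoken) and dump the token ids to a flat binary file. File names here are examples only.
+
+ ```python
+ import numpy as np
+ import tiktoken
+
+ # Encode the raw Python corpus with the GPT-2 BPE and write the token ids as
+ # uint16 (GPT-2's 50257-token vocab fits) so the training loop can memory-map
+ # the file. Input/output file names are illustrative.
+ enc = tiktoken.get_encoding("gpt2")
+ with open("python_corpus.txt", encoding="utf-8") as f:
+     ids = enc.encode_ordinary(f.read())
+ np.array(ids, dtype=np.uint16).tofile("train.bin")
+ ```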