Reverb committed on
Commit
9a8e37f
1 Parent(s): da59624

Update README.md

  1. README.md +28 -0
README.md CHANGED
@@ -40,6 +40,34 @@ import pytest
 
  import pandas as pd
  ```
 
  import pandas as pd
  ```
+ ---
+
+ ## The Journey
+ Building the model took six major steps:
+
+ 1. Data Collection
+ 2. Raw Data Cleaning
+ 3. Data Preprocessing
+ 4. Building & Training the Tokenizer
+ 5. Testing the Model on Large Dataset
+ 6. Deploying the Final Model on HuggingFace
+
+ #### Data Collection
+ The data was collected from Python GitHub repositories using web scraping techniques. It took nearly a day to gather 200GB of data.
+
+ #### Raw Data Cleaning
+ 200GB of Python code? Sounds ridiculous! That's why we needed to strip the downloaded repositories of any non-Python files, such as PDF, idx, etc.
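
The cleaning step could look roughly like the sketch below. It is not the script that was actually used: the function name `remove_non_python_files` and the rule "keep only files ending in `.py`" are assumptions made for illustration.

```python
from pathlib import Path


def remove_non_python_files(root: str) -> int:
    """Delete every regular file under `root` that does not end in .py.

    Hypothetical helper; returns the number of files removed.
    """
    removed = 0
    # Materialise the listing first so deletions do not disturb the walk.
    for path in list(Path(root).rglob("*")):
        if path.is_file() and path.suffix != ".py":
            path.unlink()
            removed += 1
    return removed
```

Calling `remove_non_python_files("repos/")` on a folder of cloned repositories (the folder name is hypothetical) would leave only the `.py` files in place. The deletion is irreversible, so it is worth running on a copy first.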
+
+ #### Data Preprocessing
+ I split the code of each repository into lines, then merged everything into a single text file named **python_text_data.txt**.
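
A minimal sketch of this merge step, assuming the cleaned repositories sit under one root folder. The function name `merge_python_files` and the choice to replace undecodable bytes are assumptions, not details from the original run.

```python
from pathlib import Path


def merge_python_files(root: str, output_file: str = "python_text_data.txt") -> int:
    """Write the lines of every .py file under `root` into one text file.

    Hypothetical helper; returns the number of lines written.
    """
    written = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(root).rglob("*.py")):
            if not path.is_file():
                continue  # skip directories that happen to end in .py
            # Scraped files may not be valid UTF-8; replace bad bytes rather than crash.
            text = path.read_text(encoding="utf-8", errors="replace")
            for line in text.splitlines():
                out.write(line + "\n")
                written += 1
    return written
```

Writing line by line keeps only one source file in memory at a time, which matters when the corpus is far larger than RAM.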
+
+ #### Building & Training the Tokenizer
+ For this step I used **ByteLevelBPETokenizer**, trained it, then saved the resulting model to the desktop.
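
For reference, training with `ByteLevelBPETokenizer` from the `tokenizers` package might look like the sketch below. The vocabulary size, minimum frequency, special tokens, and the wrapper name `train_tokenizer` are all assumed values for illustration; the settings actually used are not recorded here.

```python
from pathlib import Path


def train_tokenizer(corpus_file: str, output_dir: str, vocab_size: int = 52_000):
    """Train a byte-level BPE tokenizer on `corpus_file` and save it to `output_dir`."""
    # Imported here so the dependency is only needed when training actually runs.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=[corpus_file],
        vocab_size=vocab_size,  # assumed value, not taken from the original run
        min_frequency=2,        # assumed value
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # assumed set
    )
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # save_model writes vocab.json and merges.txt into output_dir.
    tokenizer.save_model(output_dir)
    return tokenizer
```

A call such as `train_tokenizer("python_text_data.txt", "tokenizer_out")` (the output folder name is hypothetical) would produce the two files needed to reload the tokenizer later.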
+
+ #### Testing the Model on Large Dataset
+ After training the tokenizer on a large dataset, it was time for some tests to see how good the model was before proceeding.
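
The exact tests are not described, so the following is only one possible sanity check: encode a few code samples, decode them again, and count how many survive unchanged, along with the average number of tokens per sample. The helper name `roundtrip_report` is hypothetical, and it assumes a tokenizer object whose `encode(text)` returns something with an `.ids` attribute and whose `decode(ids)` returns a string, as the `tokenizers` package provides.

```python
def roundtrip_report(tokenizer, samples):
    """Encode and decode each sample; return (exact_matches, avg_tokens_per_sample)."""
    exact = 0
    total_tokens = 0
    for text in samples:
        ids = tokenizer.encode(text).ids
        total_tokens += len(ids)
        # A sample counts as a match only if decoding reproduces it exactly.
        if tokenizer.decode(ids) == text:
            exact += 1
    avg = total_tokens / len(samples) if samples else 0.0
    return exact, avg
```

A low average token count on typical Python lines suggests the vocabulary has learned common keywords and identifiers, while a mismatch count above zero points to text the tokenizer cannot reproduce.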
+
+ ---
 
  ## Considerations: