aarohanverma committed
Commit 9657ade · verified · 1 Parent(s): a7a95d4

Update README.md

Files changed (1)
  1. README.md +177 -3
README.md CHANGED

---
license: apache-2.0
datasets:
- aarohanverma/simple-daily-conversations-cleaned
language:
- en
pipeline_tag: text-generation
tags:
- lstm
- next-word-prediction
- text-generation
- auto-completion
---

# Model Card

This repository contains an LSTM-based next-word prediction model implemented in PyTorch.
The model combines several techniques to improve performance: an extra fully connected layer with ReLU and dropout, layer normalization, a label smoothing loss, gradient clipping, and learning rate scheduling.
It uses SentencePiece for subword tokenization.

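For orientation, here is a minimal sketch of what an architecture with these components could look like in PyTorch. The class name, vocabulary size, and layer dimensions are illustrative assumptions rather than the exact values used in this repository's training script.

```python
import torch
import torch.nn as nn

class NextWordLSTM(nn.Module):
    """Illustrative sketch: embedding -> LSTM -> layer norm -> extra FC + ReLU + dropout -> vocab logits."""

    def __init__(self, vocab_size=8000, embed_dim=256, hidden_dim=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        self.layer_norm = nn.LayerNorm(hidden_dim)
        self.fc = nn.Linear(hidden_dim, hidden_dim)   # extra fully connected layer
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_dim, vocab_size)  # projection to subword vocabulary logits

    def forward(self, token_ids):
        x = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        output, _ = self.lstm(x)          # (batch, seq_len, hidden_dim)
        last_hidden = output[:, -1, :]    # hidden state at the final time step
        h = self.dropout(self.relu(self.fc(self.layer_norm(last_hidden))))
        return self.out(h)                # logits over the next subword
```

In the actual script, the LSTM layer count, dropout, and hidden dimensions are exposed as configurable parameters (see Training Details below).
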
## Model Details

### Model Description

The LSTM Next Word Predictor predicts the next word or subword given a partial input sentence.
It is trained on a dataset supplied in CSV format (with a `data` column) and uses an LSTM network with the enhancements listed above.

- **Developed by:** Aarohan Verma
- **Model type:** LSTM-based next-word prediction
- **Language(s) (NLP):** English
- **License:** Apache-2.0

### Model Sources

- **Repository:** https://huggingface.co/aarohanverma/lstm-next-word-predictor
- **Demo:** [LSTM Next Word Predictor Demo](https://huggingface.co/spaces/aarohanverma/lstm-next-word-predictor-demo)

## Uses

### Direct Use

The model can be used directly for next-word prediction in text autocompletion.

### Downstream Use

The model can be fine-tuned for related tasks such as:
- Text generation.
- Language modeling for specific domains.

### Out-of-Scope Use

This model is not suitable for:
- Tasks requiring deep contextual understanding beyond next-word prediction.
- Applications that need the longer-context handling of transformer-based architectures.
- Sensitive applications where data bias could lead to unintended outputs.

## Risks and Limitations

- **Risks:** Inaccurate or unexpected predictions may occur if the input context is complex or ambiguous.
- **Limitations:** Performance is bounded by the size and quality of the training data and by the inherent difficulty LSTMs have in modeling long-range dependencies.

### Recommendations

Users should be aware of the above limitations and evaluate the model appropriately before deploying it in production.
Consider further fine-tuning or additional data preprocessing if the model is applied in sensitive contexts.

## How to Get Started with the Model

To get started, follow these steps (a Python sketch for in-process inference follows the list):

1. **Training:**
   - Ensure you have a CSV file with a column named `data` containing your training sentences.
   - Run training with:
     ```bash
     python next_word_prediction.py --data_path data.csv --train
     ```
   - This trains the model, saves a checkpoint (`best_model.pth`), and exports a TorchScript version (`best_model_scripted.pt`).

2. **Inference:**
   - To predict the next word, run:
     ```bash
     python next_word_prediction.py --inference "Your partial sentence"
     ```
   - The model outputs the top predicted word or subword.

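If you prefer to call the model from Python instead of the CLI, the sketch below shows one way to load the exported TorchScript model together with a SentencePiece model. The SentencePiece filename (`spm.model`) and the expected input/output shapes are assumptions; adjust them to whatever the training script actually produces.

```python
import torch
import sentencepiece as spm

# best_model_scripted.pt is produced by training (see step 1 above);
# the SentencePiece model filename below is a placeholder.
sp = spm.SentencePieceProcessor(model_file="spm.model")
model = torch.jit.load("best_model_scripted.pt")
model.eval()

text = "Your partial sentence"
token_ids = torch.tensor([sp.encode(text, out_type=int)])  # shape: (1, seq_len)

with torch.no_grad():
    logits = model(token_ids)                # assumed shape: (1, vocab_size)
    next_id = int(logits.argmax(dim=-1)[0])  # greedy choice of the next subword

print(sp.id_to_piece(next_id))
```
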
## Training Details

### Training Data

- **Data Source:** [aarohanverma/simple-daily-conversations-cleaned](https://huggingface.co/datasets/aarohanverma/simple-daily-conversations-cleaned), supplied as a CSV file with a `data` column containing sentences.
- **Preprocessing:** SentencePiece is used for subword tokenization.
- **Split:** The training and validation sets are split according to a user-defined ratio (a sketch of the tokenizer training and split follows).

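As a rough illustration of that preprocessing, the sketch below trains a SentencePiece model on the `data` column and performs a simple train/validation split. The file names, vocabulary size, and 90/10 ratio are assumptions, not values taken from the training script.

```python
import random
import pandas as pd
import sentencepiece as spm

# Load the corpus; the CSV is expected to have a 'data' column of sentences.
df = pd.read_csv("data.csv")
sentences = df["data"].astype(str).tolist()

# Write raw text for SentencePiece training (corpus.txt is a placeholder name).
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))

# Train a subword tokenizer; vocab_size and model_type are assumed values.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="spm", vocab_size=8000, model_type="unigram"
)

# Simple shuffled split (assumed 90/10 train/validation ratio).
random.seed(42)
random.shuffle(sentences)
split = int(0.9 * len(sentences))
train_sents, val_sents = sentences[:split], sentences[split:]
```
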
### Training Procedure

- **Preprocessing:** Tokenization with the trained SentencePiece model.
- **Training Hyperparameters:**
  - **Batch Size:** Configurable via `--batch_size` (default: 512)
  - **Learning Rate:** Configurable via `--learning_rate` (default: 0.001)
  - **Epochs:** Configurable via `--num_epochs` (default: 25)
  - **LSTM Parameters:** Configurable number of layers, dropout, and hidden dimensions.
  - **Label Smoothing:** Applied with a configurable factor (default: 0.1)
- **Optimization:** Adam optimizer with weight decay and gradient clipping.
- **Learning Rate Scheduling:** `ReduceLROnPlateau` scheduler driven by validation loss (see the sketch below).

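The snippet below sketches how these pieces typically fit together in PyTorch: a cross-entropy loss with label smoothing, Adam with weight decay, gradient-norm clipping, and a `ReduceLROnPlateau` scheduler stepped on the validation loss. The weight decay, clip norm, and scheduler settings are assumptions rather than the script's actual defaults.

```python
import torch
import torch.nn as nn

model = NextWordLSTM()  # the illustrative class sketched earlier in this card

# Label-smoothed cross entropy (0.1 matches the documented default factor).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Adam with weight decay; the decay value here is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

# Reduce the learning rate when validation loss plateaus; factor/patience are assumptions.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

def train_step(inputs, targets):
    """One optimization step with gradient clipping."""
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # assumed clip value
    optimizer.step()
    return loss.item()

# At the end of each epoch, step the scheduler on the validation loss:
# scheduler.step(val_loss)
```
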
#### Speeds, Sizes, Times

- A checkpoint (`best_model.pth`) and a TorchScript export (`best_model_scripted.pt`) are saved during training for later inference.

## Evaluation

### Testing Data, Factors & Metrics

- **Testing Data:** Derived from the same CSV, held out from the training split.
- **Metrics:** The primary metric is the label-smoothed loss, complemented by qualitative checks of next-word accuracy.
- **Factors:** Results may vary with sentence length and dataset diversity.

#### Summary

- The model shows promising performance on next-word prediction; quantitative results (e.g., accuracy, loss) should be validated on your specific dataset (a minimal evaluation sketch follows).

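A minimal evaluation loop for these metrics (label-smoothed validation loss plus top-1 next-token accuracy) might look like the sketch below; the data loader and batch format are assumptions.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

@torch.no_grad()
def evaluate(model, val_loader):
    """Return mean validation loss and top-1 next-token accuracy."""
    model.eval()
    total_loss, correct, count = 0.0, 0, 0
    for inputs, targets in val_loader:   # assumed batch format: (batch, seq_len), (batch,)
        logits = model(inputs)           # (batch, vocab_size)
        total_loss += criterion(logits, targets).item() * targets.size(0)
        correct += (logits.argmax(dim=-1) == targets).sum().item()
        count += targets.size(0)
    return total_loss / count, correct / count
```
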
## Model Examination

- Interpretability techniques such as examining the predicted token distribution can help in understanding model behavior (see the sketch below).

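For example, one simple check is to softmax the logits and inspect the top-k next-subword candidates, as in the sketch below; it assumes the in-process inference setup shown earlier (a loaded `model` and SentencePiece processor `sp`).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def top_k_next_tokens(model, sp, text, k=5):
    """Print the k most likely next subwords and their probabilities."""
    token_ids = torch.tensor([sp.encode(text, out_type=int)])
    logits = model(token_ids)             # assumed shape: (1, vocab_size)
    probs = F.softmax(logits[0], dim=-1)
    top_probs, top_ids = torch.topk(probs, k)
    for p, i in zip(top_probs.tolist(), top_ids.tolist()):
        print(f"{sp.id_to_piece(i)!r}: {p:.3f}")
```
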
## Technical Specifications

### Model Architecture and Objective

- **Architecture:** LSTM-based network with enhancements such as an extra fully connected layer, dropout, and layer normalization (see the sketch near the top of this card).
- **Objective:** Predict the next word/subword given a sequence of tokens.

## Citation

**BibTeX:**

[More Information Needed]

## Model Card Contact

For inquiries or further information, please contact:

LinkedIn: https://www.linkedin.com/in/aarohanverma/