VarsaGupta committed · Commit e65f5e1 · 1 Parent(s): f4e4932

Upload 11 files

Browse files:
- .gitattributes +1 -0
- NLP-Based-Chatbot/.DS_Store +0 -0
- NLP-Based-Chatbot/Dronealexa.csv +0 -0
- NLP-Based-Chatbot/NLP_Based_Chatbot/.gitattributes +35 -0
- NLP-Based-Chatbot/NLP_Based_Chatbot/README.md +78 -0
- NLP-Based-Chatbot/README.md +48 -0
- NLP-Based-Chatbot/backend.py +291 -0
- NLP-Based-Chatbot/config.json +12 -0
- NLP-Based-Chatbot/embeddings_chatbot.csv +3 -0
- NLP-Based-Chatbot/feedback_data.csv +2 -0
- NLP-Based-Chatbot/frontend.py +79 -0
- NLP-Based-Chatbot/model_card.md.txt +80 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+NLP-Based-Chatbot/embeddings_chatbot.csv filter=lfs diff=lfs merge=lfs -text
NLP-Based-Chatbot/.DS_Store
ADDED
Binary file (6.15 kB)
NLP-Based-Chatbot/Dronealexa.csv
ADDED
The diff for this file is too large to render.
NLP-Based-Chatbot/NLP_Based_Chatbot/.gitattributes
ADDED
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
NLP-Based-Chatbot/NLP_Based_Chatbot/README.md
ADDED
@@ -0,0 +1,78 @@
+---
+language: en
+tags:
+- chatbot
+- natural language processing
+license: apache-2.0
+---
+
+Model Card: NLP-Based Chatbot
+
+Overview
+
+The NLP-Based Chatbot is designed to explore Science & Technology topics. It uses a combination of semantic search and summarization to return relevant, concise responses to user queries.
+
+Model Details
+
+- Model Name: NLP-Based Chatbot
+- Model Type: Natural Language Processing (NLP) Chatbot
+- Framework: Gradio Blocks Interface, spaCy, Transformers
+
+Components
+
+1. Semantic Search
+
+The chatbot uses semantic search to retrieve relevant information from a preprocessed dataset (Dronealexa.csv). Retrieval is based on a TF-IDF vectorizer and cosine-similarity ranking.
+
+2. Summarization
+
+A summarization pipeline from the Hugging Face Transformers library generates concise summaries of the retrieved information.
+
+3. Custom Embeddings
+
+The model builds custom text embeddings with spaCy and pre-trained word embeddings. These embeddings improve the understanding of user queries and feed into the semantic search.
+
+4. Gradio Blocks Interface
+
+The chatbot's frontend is built with the Gradio Blocks Interface, providing an interactive, user-friendly platform for entering queries and receiving responses.
+
+5. Model Card Generation
+
+Model card generation constructs prompts from search results and uses the summarization pipeline to produce model card content.
+
+Intended Use
+
+The NLP-Based Chatbot is intended for users interested in exploring Science & Technology topics. It can be used to obtain information from the provided dataset, and users are encouraged to submit feedback for continuous improvement.
+
+Training Data
+
+The model is built on a custom dataset (Dronealexa.csv) containing Science & Technology-related information. The dataset has been preprocessed to handle missing values and support efficient semantic search.
+
+Evaluation Metrics
+
+- Semantic Search: cosine similarity over TF-IDF vectors
+- Summarization: Hugging Face Transformers summarization pipeline
+
+Ethical Considerations
+
+The chatbot aims to provide accurate and relevant information. However, users should evaluate responses critically and understand that the model's knowledge is limited to its training data.
+
+Usage Instructions
+
+1. Enter your query in the textbox provided.
+2. Click the "Send" button to receive a response.
+3. Optionally, submit feedback using the "Submit Feedback" button.
+
+License
+
+This model is released under the Apache 2.0 License.
+
+Contact Information
+
+For inquiries or issues, please contact [email protected].
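For reference, here is a minimal, self-contained sketch of the TF-IDF retrieval described in the Semantic Search component above. The corpus rows are hypothetical stand-ins for the preprocessed Dronealexa.csv entries, not actual data from the repository:

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the preprocessed 'combined' rows of Dronealexa.csv
corpus = [
    "Title: Drone voice control; Abstract: voice command recognition for drones",
    "Title: Aerial imaging; Abstract: directional camera tracking an operator",
    "Title: Battery tech; Abstract: lithium cells for small UAVs",
]

vectorizer = TfidfVectorizer()
embeddings = vectorizer.fit_transform(corpus)

query = "How does the drone recognize voice commands?"
query_vec = vectorizer.transform([query])

# Rank corpus rows by cosine similarity to the query, highest first
scores = cosine_similarity(query_vec, embeddings).flatten()
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {corpus[i]}")
```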
NLP-Based-Chatbot/README.md
ADDED
@@ -0,0 +1,48 @@
+### NLP Powered Chatbot to Explore Science and Technologies
+
+Project URL: https://060d8e72c0a548cb38.gradio.live
+
+Screenshots:
+
+<img width="1552" alt="image" src="https://github.com/dhawansolanki/NLP-Chatbot-Hackaphasia/assets/91565429/d3066564-ae96-44c9-8507-64ee7b4f0a14">
+
+<img width="1552" alt="image" src="https://github.com/dhawansolanki/NLP-Chatbot-Hackaphasia/assets/91565429/4fe19f0c-8fe5-4f9f-af71-a3e7cd5f3138">
+
+Tech Stack: Python, NLP, Gradio
+
+### How to Run
+
+1. Frontend (Gradio):
+   1. Create an environment: `python -m venv envname`
+   2. Install dependencies: `pip install -qU openai==0.28 transformers plotly scikit-learn gradio pandas`
+   3. Run: `python frontend.py`
+2. Backend (Python):
+   1. Run: `python backend.py`
+
+Dummy Input Data:
+
+User Input (Dummy):
+
+```
+In the realm of drone technology, a groundbreaking feature has emerged with the integration of an advanced voice command recognition system. This system is meticulously configured to capture and process an audio stream, discerning voice commands from the operator in real time. What sets this innovation apart is its ability not only to identify the operator's voice within a cacophony of ambient sounds but also to exclusively recognize and act upon the commands issued by the authorized operator. This heightened level of personalization and security is further complemented by the inclusion of a directional camera, which automatically focuses on the operator, providing a continuous and detailed video stream. This video stream not only aids in documentation and monitoring but also plays a pivotal role in disambiguating different voice streams, ensuring that the drone responds with precision in complex and dynamic environments. In essence, this drone system represents a leap forward in human-machine interaction, offering a seamless and efficient mode of control through voice commands.
+Question:
+How does the integration of a directional camera enhance the functionality of the drone's voice command recognition system, and what specific benefits does it bring to the disambiguation of voice streams in dynamic operational environments?
+```
+
+Feedback Input (Dummy):
+
+```
+How does the integration of a directional camera enhance the functionality of the drone's voice command recognition system, and what specific benefits does it bring to the disambiguation of voice streams in dynamic operational environments?
+```
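To sanity-check the pipeline without launching the UI, here is a minimal sketch (not part of the repo) that runs a shortened version of the dummy input above through the same Transformers summarization pipeline that frontend.py uses:

```
from transformers import pipeline

# Same default summarization pipeline as frontend.py / backend.py
summarizer = pipeline("summarization")

dummy_context = (
    "In the realm of drone technology, a groundbreaking feature has emerged "
    "with the integration of an advanced voice command recognition system. "
    "This system captures and processes an audio stream, discerning voice "
    "commands from the operator in real time, and a directional camera "
    "automatically focuses on the operator, providing a continuous video "
    "stream that helps disambiguate different voice streams."
)

summary = summarizer(dummy_context, max_length=130, min_length=30, do_sample=False)
print(summary[0]["summary_text"])
```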
NLP-Based-Chatbot/backend.py
ADDED
@@ -0,0 +1,291 @@
+import openai
+import pandas as pd
+import numpy as np
+# from openai.embeddings_utils import get_embedding
+from transformers import GPT2TokenizerFast
+from tqdm.auto import tqdm
+import os
+
+tqdm.pandas()
+
+import spacy
+
+# Load spaCy model with GloVe embeddings
+import en_core_web_sm
+
+nlp = en_core_web_sm.load()
+
+def custom_embedding(text, model_name="text-embedding-ada-002"):
+    # Process the text with spaCy
+    doc = nlp(text)
+
+    # Extract word embeddings and average them to get the text embedding
+    word_embeddings = [token.vector for token in doc if token.has_vector]
+
+    if not word_embeddings:
+        return None  # No embeddings found for any word in the text
+
+    text_embedding = np.mean(word_embeddings, axis=0)
+
+    # Create a response dictionary
+    response = {
+        "data": [
+            {
+                "embedding": text_embedding.tolist(),
+                "index": 0,
+                "object": "embedding"
+            }
+        ],
+        "model": model_name,
+        "object": "list",
+        "usage": {
+            "prompt_tokens": len(text.split()),
+            "total_tokens": len(text.split())
+        }
+    }
+
+    return response
+
+# Example usage
+text = "Rome"
+response = custom_embedding(text)
+
+if response and response["data"][0]["embedding"] is not None:
+    print(f"Custom Embedding for '{text}': {response['data'][0]['embedding']}")
+else:
+    print(f"No embeddings found for words in '{text}'.")
+
+print(response)
+
+# Batched variant (renamed from custom_embedding to avoid shadowing the
+# single-text version above): embed a list of texts in one call
+def custom_embedding_batch(text_list, model_name="text-embedding-ada-002"):
+    embeddings = []
+
+    for text in text_list:
+        # Process the text with spaCy
+        doc = nlp(text)
+
+        # Extract word embeddings and average them to get the text embedding
+        word_embeddings = [token.vector for token in doc if token.has_vector]
+
+        if not word_embeddings:
+            embeddings.append(None)  # No embeddings found for any word in the text
+        else:
+            text_embedding = np.mean(word_embeddings, axis=0)
+            embeddings.append(text_embedding.tolist())
+
+    # Create a response dictionary
+    response = {
+        "data": [
+            {
+                "embedding": emb,
+                "index": idx,
+                "object": "embedding"
+            }
+            for idx, emb in enumerate(embeddings)
+        ],
+        "model": model_name,
+        "object": "list",
+        "usage": {
+            "prompt_tokens": sum(len(text.split()) for text in text_list),
+            "total_tokens": sum(len(text.split()) for text in text_list)
+        }
+    }
+
+    return response
+
+# Example usage
+texts = ["She is running", "Fitness is good", "I am hungry", "Basketball is healthy"]
+response = custom_embedding_batch(texts)
+
+for idx, embedding in enumerate(response["data"]):
+    if embedding["embedding"] is not None:
+        print(f"Custom Embedding for '{texts[idx]}': {embedding['embedding']}")
+    else:
+        print(f"No embeddings found for words in '{texts[idx]}'.")
+
+print(response)
+
+emb1 = response['data'][0]['embedding']
+emb2 = response['data'][1]['embedding']
+emb3 = response['data'][2]['embedding']
+emb4 = response['data'][3]['embedding']
+
+# Dot products as a rough similarity check between sentence embeddings
+print(np.dot(emb1, emb2))
+print(np.dot(emb2, emb4))
+
+df = pd.read_csv('Dronealexa.csv')
+# Reset the index after dropna() so the positional df.loc writes below stay aligned
+df = df.dropna().reset_index(drop=True)
+df.info()
+df.head()
+df['combined'] = "Title: " + df['Title'].str.strip() + "; URL: " + df['URL'].str.strip() + "; Publication Year: " + df['Publication Year'].astype(str).str.strip() + "; Abstract: " + df['Abstract'].str.strip()
+df.head()
+
+tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
+
+df['n_tokens'] = df.combined.progress_apply(lambda x: len(tokenizer.encode(x)))
+df = df[df.n_tokens < 8000].reset_index(drop=True)
+df.info()
+df.head()
+
+def get_embeddings(text, model):
+    # Process the text with spaCy
+    doc = model(text)
+
+    # Extract word embeddings and average them to get the text embedding
+    word_embeddings = [token.vector for token in doc if token.has_vector]
+
+    if not word_embeddings:
+        return None  # No embeddings found for any word in the text
+
+    text_embedding = np.mean(word_embeddings, axis=0)
+
+    # Create a response dictionary
+    response = {
+        "data": [
+            {
+                "embedding": text_embedding.tolist(),
+                "index": 0,
+                "object": "embedding"
+            }
+        ],
+        "model": model.meta["name"],
+        "object": "list",
+        "usage": {
+            "prompt_tokens": len(text.split()),
+            "total_tokens": len(doc)
+        }
+    }
+
+    return response
+
+# Example usage
+input_text = "Your input text goes here"
+custom_model = nlp  # You can replace this with any other spaCy model
+
+# Renaming 'input_text' to avoid conflict with the built-in 'input' function
+text_to_process = input_text
+
+response = get_embeddings(text_to_process, custom_model)
+
+if response and response["data"][0]["embedding"] is not None:
+    print(f"Custom Embedding for '{text_to_process}': {response['data'][0]['embedding']}")
+else:
+    print(f"No embeddings found for words in '{text_to_process}'.")
+
+print(response)
+
+from tqdm import tqdm
+
+batch_size = 2000
+model_name = 'text-embedding-ada-002'
+
+# Embed the 'combined' column in batches, storing each vector as a str(list)
+for i in tqdm(range(0, len(df.combined), batch_size)):
+    # Find the end of the batch
+    i_end = min(i + batch_size, len(df.combined))
+
+    # Get embeddings for the current batch
+    batch_text = list(df.combined)[i:i_end]
+
+    # Collect the embeddings for each text in the batch
+    batch_embeddings = []
+    for text in batch_text:
+        response = get_embeddings(text, nlp)
+
+        # Check if embeddings were found
+        if response and response["data"][0]["embedding"] is not None:
+            batch_embeddings.append(response["data"][0]["embedding"])
+        else:
+            # Handle the case where no embeddings are found for a text
+            batch_embeddings.append(None)
+
+    # Update the DataFrame with the embeddings
+    for j in range(i, i_end):
+        df.loc[j, 'ada_vector'] = str(batch_embeddings[j - i])
+
+df.head()
+df.info()
+# Persist while 'ada_vector' is still in str(list) form so it can be parsed
+# back with eval() after reloading from CSV (converting to np.array before
+# to_csv would write a space-separated repr that eval cannot parse)
+df.to_csv('embeddings_chatbot.csv', index=False)
+df = pd.read_csv('embeddings_chatbot.csv')
+df['ada_vector'] = df.ada_vector.progress_apply(eval).progress_apply(np.array)
+
+user_query = input("Enter query - ")
+
+query_response = get_embeddings(user_query, nlp)
+
+if query_response and query_response["data"][0]["embedding"] is not None:
+    print(f"Embedding for '{user_query}': {query_response['data'][0]['embedding']}")
+else:
+    print(f"No embeddings found for words in '{user_query}'.")
+
+searchvector = get_embeddings(user_query, custom_model)["data"][0]["embedding"]
+
+from sklearn.metrics.pairwise import cosine_similarity
+
+# Ensure the 'ada_vector' column contains valid numeric arrays
+df['ada_vector'] = df['ada_vector'].apply(lambda x: np.array(x) if isinstance(x, (list, np.ndarray)) else x)
+
+# Filter out rows where 'ada_vector' is not a valid numeric array
+valid_rows = df['ada_vector'].apply(lambda x: isinstance(x, np.ndarray))
+
+# Calculate cosine similarity only for valid rows
+df.loc[valid_rows, 'similarities'] = df.loc[valid_rows, 'ada_vector'].apply(
+    lambda x: cosine_similarity([x], [searchvector])[0][0]
+)
+
+df.head()
+result = df.sort_values('similarities', ascending=False).head(3)
+result.head()
+
+xc = list(result.combined)
+
+def construct_prompt(query, xc):
+    context = ''
+    for i in range(3):
+        context += xc[i] + "\n"
+    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
+    header += context + "\n\n Q: " + query + "\n A:"
+    return header
+
+from transformers import pipeline
+
+summarizer = pipeline("summarization")
+Fresult = construct_prompt(user_query, xc)
+print(summarizer("\n".join(xc), max_length=130, min_length=30, do_sample=False))
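Since backend.py mixes the indexing and query steps, here is a minimal sketch of the standalone query path, under the assumption that embeddings_chatbot.csv was produced as above with one str(list) vector per row in the 'ada_vector' column:

```
import numpy as np
import pandas as pd
import en_core_web_sm
from sklearn.metrics.pairwise import cosine_similarity

nlp = en_core_web_sm.load()

# Reload the precomputed corpus embeddings instead of recomputing them
df = pd.read_csv("embeddings_chatbot.csv")
df["ada_vector"] = df["ada_vector"].apply(eval).apply(np.array)

def embed(text):
    # Mean-pool the spaCy token vectors, as in get_embeddings() above
    vecs = [t.vector for t in nlp(text) if t.has_vector]
    return np.mean(vecs, axis=0) if vecs else None

query_vec = embed("drone voice command recognition")
df["similarities"] = df["ada_vector"].apply(
    lambda v: cosine_similarity([v], [query_vec])[0][0]
)
print(df.sort_values("similarities", ascending=False).head(3)["combined"])
```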
NLP-Based-Chatbot/config.json
ADDED
@@ -0,0 +1,12 @@
+{
+    "model": {
+        "repository": "https://github.com/VarsaGupta/NLP_Based_Chatbot.git",
+        "subfolder": "",
+        "files": ["model_card.md", "embeddings_chatbot.csv", "backend.py", "feedback_data.csv", "frontend.py", "README.txt", "Dronealexa.csv", "config.json"]
+    },
+    "card": {
+        "repository": "https://github.com/VarsaGupta/NLP_Based_Chatbot.git",
+        "subfolder": "",
+        "files": ["model_card.md"]
+    }
+}
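A minimal sketch of consuming this config from Python, assuming config.json sits in the working directory (the script itself is not part of the repo):

```
import json

# Enumerate the files tracked for the model and its card
with open("config.json") as f:
    cfg = json.load(f)

print("model repo :", cfg["model"]["repository"])
print("model files:", ", ".join(cfg["model"]["files"]))
print("card files :", ", ".join(cfg["card"]["files"]))
```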
NLP-Based-Chatbot/embeddings_chatbot.csv
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7387cb50de3d3db315cb8cac661aeceb160883647e279eb8a06e484b58a85255
+size 14166418
NLP-Based-Chatbot/feedback_data.csv
ADDED
@@ -0,0 +1,2 @@
+hello,aerial drone companion device is autonomously oriented such that an image capture device faces the first user in response to detecting the first voice command . a second voice command is detected while the image captured device faces a first user . the task signal is transmitted by the computer based on the second voice commands .,when i say "hello" chatbot should say "hi"
+hai,aerial drone companion device is autonomously oriented such that an image capture device faces the first user in response to detecting the first voice command . a second voice command is detected while the image captured device faces a first user . the task signal is transmitted by the computer based on the second voice commands .,when i say "hai" you should say "hai"
NLP-Based-Chatbot/frontend.py
ADDED
@@ -0,0 +1,79 @@
+import pandas as pd
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+import gradio as gr
+from transformers import pipeline
+
+# Load the CSV file and preprocess it
+def load_csv_and_preprocess(csv_file):
+    df = pd.read_csv(csv_file)
+    df = df.dropna().head(100000)
+
+    column_names = list(df.columns)
+    df['combined'] = df.apply(lambda x: "Title: " + '; '.join(x[column_names].astype(str)), axis=1)
+    df['combined'] = df['combined'].str.strip()
+
+    vectorizer = TfidfVectorizer()
+    embeddings = vectorizer.fit_transform(df['combined'])
+
+    return df, vectorizer, embeddings
+
+# Initialize the summarization pipeline outside of the chatbot_response function
+summarizer = pipeline("summarization")
+
+# Perform semantic search for a given query
+def semantic_search(df, vectorizer, embeddings, query):
+    search_vector = vectorizer.transform([query])
+    similarities = cosine_similarity(search_vector, embeddings).flatten()
+    df['similarities'] = similarities
+    result = df.sort_values('similarities', ascending=False).head(3)
+
+    return result['combined'].tolist()
+
+# Define the chatbot response function with summarization
+def chatbot_response(query, history):
+    if not query.strip():
+        return "", history
+    search_results = semantic_search(df, vectorizer, embeddings, query)
+    # Summarize the search results
+    summary = summarizer("\n".join(search_results), max_length=130, min_length=30, do_sample=False)[0]['summary_text']
+    # Format the summarized response and update the chat history
+    history = f"{history}User: {query}\nBot: {summary}\n\n"
+    return "", history  # Clear the input box after each message, update history
+
+# Load CSV and preprocess on server startup
+csv_file_path = "Dronealexa.csv"  # Update this to your CSV file path
+df, vectorizer, embeddings = load_csv_and_preprocess(csv_file_path)
+
+# Define a function to handle feedback
+def handle_feedback(feedback, history):
+    # Simple logic: acknowledge the feedback in the chat history
+    # This could be replaced with more sophisticated logic or ML model updating
+    response = f"Thanks, your feedback ('{feedback}') has been noted."
+    history = history + "Bot: " + response + "\n"
+    return "", history  # Clear the feedback box, update the history
+
+# Gradio Blocks Interface
+with gr.Blocks() as blocks_app:
+    gr.Markdown("<h1 style='text-align: center;'>Explore Science & Technology with Chatbot</h1>")
+    history_box = gr.Textbox(label="", value="", interactive=False, lines=20)
+    with gr.Row():
+        query_input = gr.Textbox(show_label=False, placeholder="Type your message here...", lines=1)
+    with gr.Row():
+        send_button = gr.Button("Send")
+
+    send_button.click(
+        fn=chatbot_response,
+        inputs=[query_input, history_box],
+        outputs=[query_input, history_box]
+    )
+    feedback_input = gr.Textbox(show_label=False, placeholder="Type your feedback here...", lines=1)
+    feedback_button = gr.Button("Submit Feedback")
+
+    feedback_button.click(
+        fn=handle_feedback,
+        inputs=[feedback_input, history_box],
+        outputs=[feedback_input, history_box]
+    )
+blocks_app.launch(share=True)
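Note that frontend.py never writes feedback to disk even though feedback_data.csv ships with the repository. A minimal sketch of how feedback rows could be persisted in the same three-column layout as the shipped file; the helper name and column order (query, response, feedback) are assumptions inferred from the data:

```
import csv

def save_feedback(query, response, feedback, path="feedback_data.csv"):
    # Append one row in the (query, response, feedback) layout seen in
    # the shipped feedback_data.csv; quoting is handled by the csv module
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([query, response, feedback])

# Example: record that "hello" should elicit a friendly greeting
save_feedback(
    "hello",
    "aerial drone companion device is autonomously oriented ...",
    'when i say "hello" chatbot should say "hi"',
)
```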
NLP-Based-Chatbot/model_card.md.txt
ADDED
@@ -0,0 +1,80 @@
+---
+language: en
+tags:
+- chatbot
+- natural language processing
+license: apache-2.0
+datasets:
+- Custom Dataset (Dronealexa)
+---
+
+Model Card: NLP-Based Chatbot
+
+Overview
+
+The NLP-Based Chatbot is designed to explore Science & Technology topics. It uses a combination of semantic search and summarization to return relevant, concise responses to user queries.
+
+Model Details
+
+- **Model Name:** NLP-Based Chatbot
+- **Model Type:** Natural Language Processing (NLP) Chatbot
+- **Framework:** Gradio Blocks Interface, spaCy, Transformers
+
+Components
+
+1. Semantic Search
+
+The chatbot uses semantic search to retrieve relevant information from a preprocessed dataset (Dronealexa.csv). Retrieval is based on a TF-IDF vectorizer and cosine-similarity ranking.
+
+2. Summarization
+
+A summarization pipeline from the Hugging Face Transformers library generates concise summaries of the retrieved information.
+
+3. Custom Embeddings
+
+The model builds custom text embeddings with spaCy and pre-trained word embeddings. These embeddings improve the understanding of user queries and feed into the semantic search.
+
+4. Gradio Blocks Interface
+
+The chatbot's frontend is built with the Gradio Blocks Interface, providing an interactive, user-friendly platform for entering queries and receiving responses.
+
+5. Model Card Generation
+
+Model card generation constructs prompts from search results and uses the summarization pipeline to produce model card content.
+
+Intended Use
+
+The NLP-Based Chatbot is intended for users interested in exploring Science & Technology topics. It can be used to obtain information from the provided dataset, and users are encouraged to submit feedback for continuous improvement.
+
+Training Data
+
+The model is built on a custom dataset (Dronealexa.csv) containing Science & Technology-related information. The dataset has been preprocessed to handle missing values and support efficient semantic search.
+
+Evaluation Metrics
+
+- Semantic Search: cosine similarity over TF-IDF vectors
+- Summarization: Hugging Face Transformers summarization pipeline
+
+Ethical Considerations
+
+The chatbot aims to provide accurate and relevant information. However, users should evaluate responses critically and understand that the model's knowledge is limited to its training data.
+
+Usage Instructions
+
+1. Enter your query in the textbox provided.
+2. Click the "Send" button to receive a response.
+3. Optionally, submit feedback using the "Submit Feedback" button.
+
+License
+
+This model is released under the Apache 2.0 License.
+
+Contact Information
+
+For inquiries or issues, please contact [email protected].