Commit b30ed6a by srikanth-nm · 1 Parent(s): 557a8c5
Upload 19 files
- SOURCE_DOCUMENTS/confluence.txt +1 -0
- SOURCE_DOCUMENTS/transcript.txt +1 -0
- __init__.py +0 -0
- app.py +216 -0
- chunks.json +1 -0
- chunks_create.py +44 -0
- confluence.py +75 -0
- constants.py +42 -0
- end_calculate.py +24 -0
- ingest.py +131 -0
- jira.csv +4 -0
- modJira.py +49 -0
- requirements.txt +34 -0
- run_localGPT.py +245 -0
- similarity.py +22 -0
- transcript.json +312 -0
- transcript_end.json +312 -0
- utils.py +51 -0
- youtube.py +68 -0
SOURCE_DOCUMENTS/confluence.txt
ADDED
@@ -0,0 +1 @@
India's Income Tax Laws are framed by the Government. The Government imposes a tax on taxable income of all persons who are individuals, Hindu Undivided Families (HUFs), companies, firms, LLPs, associations of persons, bodies of individuals, local authorities and any other artificial juridical person. According to these laws, the levy of tax on a person depends upon his or her residential status. Every individual who qualifies as a resident of India is required to pay tax on his or her global income. Every financial year, taxpayers have to follow certain rules while filing their Income Tax Returns (ITRs). Income Tax Return - What is it? An Income Tax Return (ITR) is a form used to file information about your income and tax to the Income Tax Department. The tax liability of a taxpayer is calculated based on his or her income. In case the return shows that excess tax has been paid during a year, the individual will be eligible to receive an income tax refund from the Income Tax Department. As per the income tax laws, the return must be filed every year by an individual or business that earns any income during a financial year. The income could be in the form of a salary, business profits, income from house property or earned through dividends, capital gains, interest or other sources. Tax returns have to be filed by an individual or a business before a specified date. If a taxpayer fails to abide by the deadline, he or she has to pay a penalty. Is it mandatory to file Income Tax Return? As per the tax laws laid down in India, it is compulsory to file your income tax returns if your income is more than the basic exemption limit. The income tax rate is pre-decided for taxpayers. A delay in filing returns will not only attract late filing fees but also hamper your chances of getting a loan or a visa for travel purposes. Who should file Income Tax Returns? According to the Income Tax Act, income tax has to be paid only by individuals or businesses who fall within certain income brackets.
SOURCE_DOCUMENTS/transcript.txt
ADDED
@@ -0,0 +1 @@
In this video, I'm going to answer the top 3 questions my students ask me about Python. What is Python? What can you do with it? And why is it so popular? In other words, what does it do that other programming languages don't? Python is the world's fastest growing and most popular programming language, not just amongst software engineers, but also amongst mathematicians, data analysts, scientists, accountants, networking engineers, and even kids! Because it's a very beginner-friendly programming language. So people from different disciplines use Python for a variety of different tasks, such as data analysis and visualization, artificial intelligence and machine learning, and automation. In fact, this is one of the big uses of Python amongst people who are not software developers. If you constantly have to do boring, repetitive tasks, such as copying files and folders around, renaming them, uploading them to a server, you can easily write a Python script to automate all that and save your time. And that's just one example. If you continuously have to work with Excel spreadsheets, PDFs, CSV files, download websites and parse them, you can automate all that stuff with Python. So you don't have to be a software developer to use Python. You could be an accountant, a mathematician, or a scientist, and use Python to make your life easier. You can also use Python to build web, mobile and desktop applications as well as software testing or even hacking. So Python is a multi-purpose language. Now if you have some programming experience you may say, "But Mosh, we can do all this stuff with other programming languages, so what's the big deal about Python?" Here are a few reasons. With Python you can solve complex problems in less time with fewer lines of code. Here's an example. Let's say we want to extract the first three letters of the text Hello World. This is the code we have to write in C#, this is how we do it in JavaScript, and here's how we do it in Python.
See how short and clean the language is? And that's just the beginning. Python makes a lot of trivial things really easy with a simple yet powerful syntax. Here are a few other reasons Python is so popular. It's a high-level language, so you don't have to worry about complex tasks such as memory management, like you do in C++. It's cross-platform, which means you can build and run Python applications on Windows, Mac, and Linux. It has a huge community, so whenever you get stuck, there is someone out there to help. It has a large ecosystem of libraries, frameworks and tools, which means whatever you wanna do, it is likely that someone else has done it before, because Python has been around for over 20 years. So in a nutshell, Python is a multi-purpose language with a simple, clean, and beginner-friendly syntax. All of that means Python is awesome. Technically everything you do with Python you can do with other programming languages, but Python's simplicity and elegance has made it grow way more than other programming languages. That's why it's the number one language employers are looking for. So whether you're a programmer or an absolute beginner, learning Python opens up lots of job opportunities to you. In fact, the average Python developer earns a whopping 116,000 dollars a year. If you found this video helpful, please support my hard work by liking and sharing it with others. Also, be sure to subscribe to my channel, because I have a couple of awesome Python tutorials for you. You're going to see them on the screen now. Here's my Python tutorial for beginners. It's a great starting point if you have limited or no programming experience. On the other hand, if you do have some programming experience and want to quickly get up to speed with Python, I have another tutorial just for you. I'm not going to waste your time telling you what a variable or a function is. I will talk to you like a programmer.
There's never been a better time to master Python programming, so click on the tutorial that is right for you and get started. Thank you for watching!
__init__.py
ADDED
File without changes
app.py
ADDED
@@ -0,0 +1,216 @@
# Import required libraries
import streamlit as st
import youtube
import confluence
import modJira
import time
import similarity
import ingest

# global transcript_result
# transcript_result = ""

# Set page configuration and title for Streamlit
st.set_page_config(page_title="AI-Seeker", page_icon="📼", layout="wide")

# Add header with title and description
st.markdown(
    '<p style="display:inline-block;font-size:40px;font-weight:bold;">AI-Seeker</p> <p style="display:inline-block;font-size:16px;">AI-Seeker is a web-app tool that utilizes APIs to extract text content from YouTube, Confluence and Jira. It incorporates Llama-2-7B-Chat-GGML model with Langchain to provide users with a summary and query-based smart response depending on the content of the media source.<br><br></p>',
    unsafe_allow_html=True
)

txtInputBox = "YouTube"


with st.sidebar.title("Configuration"):
    usecase = st.sidebar.selectbox("Select Media Type:",("YouTube", "Confluence", "Jira"))
    if usecase == "YouTube":
        txtInputBox = "Enter ID of YouTube Video"
        default_value = "Y8Tko2YC5hA"
    elif usecase == "Confluence":
        txtInputBox = "Enter ID of your Confluence Page"
        default_value = "393217"
    elif usecase == "Jira":
        txtInputBox = "Enter the name of your JIRA Project"
        default_value = "jira_test"

video_id = st.sidebar.text_input(txtInputBox,value=default_value)

strTranscript = ""
training_status = "yet_to_start"
btnTranscript = st.sidebar.button("Transcript")
btnSummary = st.sidebar.button("Summary")
btnTrain = st.sidebar.button("Train")
if btnTrain:
    with st.spinner("Training in Progress..."):
        ingest.main()

query = st.sidebar.text_input('Enter your question below:', value="What is Python?")
btnAsk = st.sidebar.button("Query")

btnClear = st.sidebar.button("Clear Data")
if btnClear:
    st.session_state.clear()

def fnJira():
    st.info("Transcription")

    if btnTranscript:

        if 'transcript_result' not in st.session_state:
            st.session_state['transcript_result'] = modJira.get_details(video_id)
        transcript_result = st.session_state['transcript_result']
        st.dataframe(transcript_result)
    else:
        if 'transcript_result' in st.session_state:
            transcript_result = st.session_state['transcript_result']
            st.dataframe(transcript_result)

    st.info("Query")

    if btnAsk:
        with st.spinner(text="Retrieving..."):
            if 'transcript_answer' not in st.session_state:
                answer = modJira.ask_question(query)
                st.session_state['transcript_answer'] = answer
                #st.success(answer)
            if 'transcript_answer' in st.session_state:
                answer = st.session_state['transcript_answer']

        st.success(answer)

    else:
        if 'transcript_answer' in st.session_state:
            answer = st.session_state['transcript_answer']

            st.success(answer)


def fnConfluence():
    st.info("Transcription")

    if btnTranscript:

        if 'transcript_result' not in st.session_state:
            st.session_state['transcript_result'] = confluence.transcript(video_id)
        transcript_result = st.session_state['transcript_result']
        st.markdown(f"<div style='height: 100px; overflow-y: scroll;'>{transcript_result}</div>", unsafe_allow_html=True)
    else:
        if 'transcript_result' in st.session_state:
            transcript_result = st.session_state['transcript_result']
            st.markdown(f"<div style='height: 100px; overflow-y: scroll;'>{transcript_result}</div>", unsafe_allow_html=True)

    col1, col2 = st.columns([1, 1])

    with col1:
        # with col12:
        st.info("Summary")
        if btnSummary:
            if 'transcript_summary' not in st.session_state:
                with st.spinner(text="Retrieving..."):
                    st.session_state['transcript_summary'] = confluence.summarize()
            summary = st.session_state['transcript_summary']
            st.success(summary)
        else:
            if 'transcript_summary' in st.session_state:
                summary = st.session_state['transcript_summary']
                st.success(summary)

    with col2:
        st.info("Query")

        if btnAsk:
            with st.spinner(text="Retrieving..."):
                if 'transcript_answer' not in st.session_state:
                    answer = confluence.ask_question(query)
                    st.session_state['transcript_answer'] = answer
                    #st.success(answer)
                if 'transcript_answer' in st.session_state:
                    answer = st.session_state['transcript_answer']

            st.success(answer)

        else:
            if 'transcript_answer' in st.session_state:
                answer = st.session_state['transcript_answer']

                st.success(answer)

def fnYoutube():
    st.info("Transcription")

    if btnTranscript:

        if 'transcript_result' not in st.session_state:
            st.session_state['transcript_result'] = youtube.audio_to_transcript(video_id)
        transcript_result = st.session_state['transcript_result']
        st.markdown(f"<div style='height: 100px; overflow-y: scroll;'>{transcript_result}</div>", unsafe_allow_html=True)
    else:
        if 'transcript_result' in st.session_state:
            transcript_result = st.session_state['transcript_result']
            st.markdown(f"<div style='height: 100px; overflow-y: scroll;'>{transcript_result}</div>", unsafe_allow_html=True)

    col1, col2 = st.columns([1, 1])

    with col1:
        # with col12:
        st.info("Summary")
        if btnSummary:
            if 'transcript_summary' not in st.session_state:
                with st.spinner(text="Retrieving..."):
                    st.session_state['transcript_summary'] = youtube.summarize()
            summary = st.session_state['transcript_summary']
            st.success(summary)
        else:
            if 'transcript_summary' in st.session_state:
                summary = st.session_state['transcript_summary']
                st.success(summary)

    with col2:
        st.info("Query")

        if btnAsk:
            with st.spinner(text="Retrieving..."):
                if 'transcript_answer' not in st.session_state:
                    answer = youtube.ask_question(query)
                    st.session_state['transcript_answer'] = answer
                    #st.success(answer)
                if 'transcript_answer' in st.session_state:
                    answer = st.session_state['transcript_answer']

            st.success(answer)

            transcript_start_time, transcript_end_time = similarity.similarity(strQuery=answer)

            st.video(f"https://www.youtube.com/embed/{video_id}", format="video/mp4", start_time=int(transcript_start_time))

        else:
            if 'transcript_answer' in st.session_state:
                answer = st.session_state['transcript_answer']

                st.success(answer)

                transcript_start_time, transcript_end_time = similarity.similarity(strQuery=answer)

                st.video(f"https://www.youtube.com/embed/{video_id}", format="video/mp4", start_time=int(transcript_start_time))

if usecase == "YouTube":
    fnYoutube()
elif usecase == "Confluence":
    fnConfluence()
elif usecase == "Jira":
    fnJira()

# Hide Streamlit header, footer, and menu
hide_st_style = """
<style>
#MainMenu {visibility: hidden;}
footer {visibility: hidden;}
header {visibility: hidden;}
</style>
"""
#"""footer {visibility: hidden;}
# header {visibility: hidden;}"""

# Apply CSS code to hide header, footer, and menu
st.markdown(hide_st_style, unsafe_allow_html=True)
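The three handlers in app.py (`fnJira`, `fnConfluence`, `fnYoutube`) repeat one pattern: a result is computed only when its key is missing from `st.session_state`, and the stored value is rendered on every rerun. The following is a minimal sketch of that pattern, not the author's code. It assumes a plain dict as a stand-in for `st.session_state`, and `get_or_compute` and `fake_transcript` are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable, MutableMapping


def get_or_compute(state: MutableMapping[str, Any], key: str,
                   compute: Callable[[], Any]) -> Any:
    """Return state[key], calling compute() only if the key is missing."""
    if key not in state:
        state[key] = compute()
    return state[key]


# A plain dict stands in for st.session_state in this sketch.
state: dict = {}
calls = []


def fake_transcript() -> str:
    # Hypothetical stand-in for youtube.audio_to_transcript(video_id).
    calls.append("transcript")
    return "In this video..."


# The second lookup is served from the cached value; compute() is not called again.
first = get_or_compute(state, "transcript_result", fake_transcript)
second = get_or_compute(state, "transcript_result", fake_transcript)
```

One consequence of this pattern, which the handlers share, is that a stored answer keeps being reused for later queries until the "Clear Data" button empties the session state.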
chunks.json
ADDED
@@ -0,0 +1 @@
[{"text": "In this video, I'm going to answer the top 3 questions my students ask me about Python. What is Python? What can you do with it? And why is it so popular? In other words, what does it do that other programming languages don't? Python is the ", "start": 0.0, "end": 16.0}, {"text": "world's fastest growing and most popular programming language, not just amongst software engineers, but also amongst mathematicians, data analysts, scientists, accountants, networking engineers, and even kids! Because it's a very beginner friendly programming ", "start": 16.0, "end": 32.0}, {"text": "language. So people from different disciplines use Python for a variety of different tasks, such as data analysis and visualization, artificial intelligence and machine learning, automation in fact this is one of the big uses of Python amongst people who are not software", "start": 32.0, "end": 48.0}, {"text": "developers. If you constantly have to do boring, repetitive tasks, such as copying files and folders around, renaming them, uploading them to a server, you can easily write a Python script to automate all that and save your time. And that's just one example, if you", "start": 48.0, "end": 64.0}, {"text": "continuously have to work with excel spreadsheets, PDF's, CS View files, download websites and parse them, you can automate all that stuff with Python. So you don't have to be a software developer to use Python. You could be an accountant, a mathematician, or a scientist, and use Python ", "start": 64.0, "end": 80.0}, {"text": "to make your life easier. You can also use Python to build web, mobile and desktop applications as well as software testing or even hacking. So Python is a multi purpose language. Now if you have some programming experience you may say, \"But Mosh", "start": 80.0, "end": 96.0}, {"text": "we can do all this stuff with other programming languages, so what's the big deal about Python?\" Here are a few reasons. 
With Python you can solve complex problems in less time with fewer lines of code. Here's an example. Let's say we want to extract the first three ", "start": 96.0, "end": 112.0}, {"text": "letters of the text Hello World. This is the code we have to write in C# this is how we do it in JavaScript and here's how we do it in Python. See how short and clean the language is? And that's just the beginning. Python makes a lot of trivial things", "start": 112.0, "end": 128.0}, {"text": "really easy with a simple yet powerful syntax. Here are a few other reasons Python is so popular. It's a high level language so you don't have to worry about complex tasks such as memory management, like you do in C++. It's cross platform which means ", "start": 128.0, "end": 144.0}, {"text": "you can build and run Python applications on Windows, Mac, and Linux. It has a huge community so whenever you get stuck, there is someone out there to help. It has a large ecosystem of libraries, frameworks and tools which means whatever you wanna do", "start": 144.0, "end": 160.0}, {"text": "it is likely that someone else has done it before because Python has been around for over 20 years. So in a nutshell, Python is a multi-purpose language with a simple, clean, and beginner-friendly syntax. All of that means Python is awesome.", "start": 160.0, "end": 176.0}, {"text": "Technically everything you do with Python you can do with other programming languages, but Python's simplicity and elegance has made it grow way more than other programming languages. That's why it's the number onne language employers are looking for. So whether you're a programmer or ", "start": 176.0, "end": 192.0}, {"text": "an absolute beginner, learning Python opens up lots of job opportunities to you. In fact, the average Python developer earns a whopping 116,000 dollars a year. If you found this video helpful, please support my hard work by liking and sharing it with others. 
", "start": 192.0, "end": 208.0}, {"text": "Also, be sure to subscribe to my channel, because I have a couple of awesome Python tutorials for you, you're going to see them on the screen now. Here's my Python tutorial for beginners, it's a great starting point if you have limited or no programming experience. On the other hand, if you ", "start": 208.0, "end": 224.0}, {"text": "do have some programming experience and want to quickly get up to speed with Python, I have another tutorial just for you. I'm not going to waste your time telling you what a variable or a function is. I will talk to you like a programmer. There's never been a better time to master Python programming,", "start": 224.0, "end": 240.0}, {"text": "so click on the tutorial that is right for you and get started. Thank you for watching!", "start": 240.0, "end": 246.63}]
chunks_create.py
ADDED
@@ -0,0 +1,44 @@
import json

def combine_and_calculate(input_file_path, output_file_path):
    with open(input_file_path, 'r') as file:
        output_data = json.load(file)

    combined_json_list = []

    # Calculate the number of groups to create (8 segments per group, rounded up)
    num_groups = (len(output_data) + 7) // 8

    for group_num in range(num_groups):
        # Calculate the starting index and ending index for the current group
        start_index = group_num * 8
        end_index = min(start_index + 8, len(output_data))

        # Extract the "text" values from the current group of dictionaries
        combined_text = " ".join([item["text"] for item in output_data[start_index:end_index]])

        # Take "start" from the first segment and "end" from the last segment of the group
        group_start = output_data[start_index]["start"]
        group_end = output_data[end_index - 1]["end"]

        # Create the combined JSON for the current group
        combined_json = {
            "text": combined_text,
            "start": group_start,
            "end": group_end,
        }

        combined_json_list.append(combined_json)

    # Save the combined JSON list to a new file
    with open(output_file_path, 'w') as output_file:
        json.dump(combined_json_list, output_file)

# Path to the input JSON file (the transcript_end.json written by end_calculate.py)
input_file_path = '/home/bharathi/langchain_experiments/GenAI/transcript_end.json'

# Path and filename for the combined JSON file
output_file_path = '/home/bharathi/langchain_experiments/GenAI/chunks.json'

# Call the function to create the combined JSON and save it to a new file
combine_and_calculate(input_file_path, output_file_path)
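The grouping step in chunks_create.py can be tried without any files on disk. The sketch below restates it as a pure function over an in-memory list. It is an illustration under stated assumptions rather than the author's implementation: `combine_segments` and its `group_size` parameter are hypothetical names, and the sample segments are made up.

```python
from typing import Dict, List


def combine_segments(segments: List[Dict], group_size: int = 8) -> List[Dict]:
    """Merge consecutive transcript segments into groups of `group_size`."""
    combined = []
    for start_index in range(0, len(segments), group_size):
        group = segments[start_index:start_index + group_size]
        combined.append({
            "text": " ".join(item["text"] for item in group),
            "start": group[0]["start"],   # start of the first segment in the group
            "end": group[-1]["end"],      # end of the last segment in the group
        })
    return combined


# Ten made-up two-second segments: the first eight form one chunk, the last two another.
sample = [{"text": f"s{i}", "start": i * 2.0, "end": i * 2.0 + 2.0} for i in range(10)]
chunks = combine_segments(sample)
```

Stepping `range` by the group size replaces the explicit group count and `min(...)` bound, since slicing past the end of a list simply returns the shorter final group.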
confluence.py
ADDED
@@ -0,0 +1,75 @@
import os

from bs4 import BeautifulSoup
import requests
from requests.auth import HTTPBasicAuth
import ingest
import run_localGPT

def start_training():
    training_status = ingest.main()
    return training_status

def replace_substring_and_following(input_string, substring):
    # Drop `substring` and everything after it; return the input unchanged if it is absent
    index = input_string.find(substring)
    if index != -1:
        return input_string[:index]
    else:
        return input_string

def ask_question(strQuestion):
    answer = run_localGPT.main(device_type='cpu', strQuery=strQuestion)
    answer_cleaned = replace_substring_and_following(answer, "Unhelpful Answer")
    return answer_cleaned

def transcript(page_id):

    url = f"https://srikanthnm.atlassian.net/wiki/rest/api/content/{page_id}?expand=body.storage"
    # Assumption: the credentials are read from these (hypothetical) environment
    # variables instead of being hardcoded in the source file.
    username = os.environ["CONFLUENCE_USERNAME"]
    password = os.environ["CONFLUENCE_API_TOKEN"]

    response = requests.get(url, auth=HTTPBasicAuth(username, password))

    # Return an error message unless the request was successful (status code 200)
    if response.status_code != 200:
        return f"Error: Unable to access the URL. Status code: {response.status_code}"

    data = response.json()

    soup = BeautifulSoup(data['body']['storage']['value'],"html.parser")

    page_content = soup.get_text()
    # Replace non-breaking spaces and keep only the first 1998 characters
    page_content_cleaned = page_content.replace('\xa0',' ')[:1998]

    with open('SOURCE_DOCUMENTS/confluence.txt', 'w') as outfile:
        outfile.write(page_content_cleaned)

    return page_content_cleaned

def summarize():
    from langchain.chains.summarize import load_summarize_chain
    from langchain.docstore.document import Document
    from langchain.text_splitter import CharacterTextSplitter

    model_id = "TheBloke/Llama-2-7B-Chat-GGML"
    model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"

    llm = run_localGPT.load_model(device_type='cpu', model_id=model_id, model_basename=model_basename)

    text_splitter = CharacterTextSplitter()

    with open("SOURCE_DOCUMENTS/confluence.txt") as f:
        file_content = f.read()
    texts = text_splitter.split_text(file_content)

    # Wrap each text chunk in a Document so the summarize chain can consume it
    docs = [Document(page_content=t) for t in texts]

    chain = load_summarize_chain(llm, chain_type="map_reduce")
    summary = chain.run(docs)

    return summary
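The text clean-up that `transcript()` and `ask_question()` perform can be exercised without a Confluence account or the model. The sketch below is a rough approximation under stated assumptions: the standard library's `html.parser` stands in for BeautifulSoup, the HTML fragment and the answer string are made up, and `strip_tags` and `cut_at_marker` are hypothetical names rather than functions from this repository.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text content of `html`, with non-breaking spaces normalized."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts).replace("\xa0", " ")


def cut_at_marker(text: str, marker: str) -> str:
    """Drop `marker` and everything after it; return `text` unchanged if absent."""
    index = text.find(marker)
    return text if index == -1 else text[:index]


# Made-up inputs in the shape the module handles.
page_text = strip_tags("<p>Income\xa0Tax <strong>Return</strong></p>")
answer = cut_at_marker("Python is a language. Unhelpful Answer: ...", "Unhelpful Answer")
```

Note that cutting at the marker leaves any whitespace that preceded it, so a caller may want to strip the result before display.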
constants.py
ADDED
@@ -0,0 +1,42 @@
import os

# from dotenv import load_dotenv
from chromadb.config import Settings

# https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/excel.html?highlight=xlsx#microsoft-excel
from langchain.document_loaders import CSVLoader, PDFMinerLoader, TextLoader, UnstructuredExcelLoader, Docx2txtLoader

# load_dotenv()
ROOT_DIRECTORY = os.path.dirname(os.path.realpath(__file__))

# Define the folders for the source documents and for the persisted vector database
SOURCE_DIRECTORY = f"{ROOT_DIRECTORY}/SOURCE_DOCUMENTS"

PERSIST_DIRECTORY = f"{ROOT_DIRECTORY}/DB"

# Can be changed to a specific number
INGEST_THREADS = os.cpu_count() or 8

# Define the Chroma settings
CHROMA_SETTINGS = Settings(
    chroma_db_impl="duckdb+parquet", persist_directory=PERSIST_DIRECTORY, anonymized_telemetry=False
)

# https://python.langchain.com/en/latest/_modules/langchain/document_loaders/excel.html#UnstructuredExcelLoader
DOCUMENT_MAP = {
    ".txt": TextLoader,
    ".md": TextLoader,
    ".py": TextLoader,
    ".pdf": PDFMinerLoader,
    ".csv": CSVLoader,
    ".xls": UnstructuredExcelLoader,
    ".xlsx": UnstructuredExcelLoader,
    ".docx": Docx2txtLoader,
    ".doc": Docx2txtLoader,
}

# Default Instructor Model
EMBEDDING_MODEL_NAME = "hkunlp/instructor-large"
# You can also choose a smaller model, don't forget to change HuggingFaceInstructEmbeddings
# to HuggingFaceEmbeddings in both ingest.py and run_localGPT.py
# EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
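`DOCUMENT_MAP` is a dispatch table keyed by file extension, which `load_single_document` in ingest.py consults to pick a loader. The sketch below shows the same lookup in isolation. It is illustrative only: the loader functions are hypothetical stand-ins that return strings, not the LangChain loader classes, and `LOADERS` and `load_any` are names introduced for this example.

```python
import os
from typing import Callable, Dict


def load_text(path: str) -> str:
    # Hypothetical stand-in for TextLoader.
    return f"text:{path}"


def load_csv(path: str) -> str:
    # Hypothetical stand-in for CSVLoader.
    return f"csv:{path}"


# Extension-to-loader dispatch table, in the spirit of DOCUMENT_MAP.
LOADERS: Dict[str, Callable[[str], str]] = {
    ".txt": load_text,
    ".md": load_text,
    ".csv": load_csv,
}


def load_any(path: str) -> str:
    """Pick a loader from the file extension and apply it to the path."""
    extension = os.path.splitext(path)[1]
    loader = LOADERS.get(extension)
    if loader is None:
        raise ValueError(f"Document type is undefined: {extension!r}")
    return loader(path)


result = load_any("SOURCE_DOCUMENTS/transcript.txt")
```

Because the lookup is an exact dictionary match, it is case-sensitive: a file named `NOTES.TXT` would not match the `.txt` entry unless the extension is lowercased first.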
end_calculate.py
ADDED
@@ -0,0 +1,24 @@
import json

def calculate_end_from_file(input_file_path, output_file_path):
    with open(input_file_path, 'r') as file:
        input_data = json.load(file)

    # Iterate through the list of dictionaries and calculate "end" for each one
    for item in input_data:
        item["end"] = round(item["start"] + item["duration"], 2)
        del item["duration"]  # Remove the "duration" key from each dictionary

    # Save the updated data to a new JSON file
    with open(output_file_path, 'w') as output_file:
        json.dump(input_data, output_file)

# # Replace 'input_file.json' with the actual path to your input JSON file
# input_file_path = '/home/bharathi/langchain_experiments/GenAI/transcript.json'

# # Replace 'output_file.json' with the desired path and filename for the output JSON file
# output_file_path = '/home/bharathi/langchain_experiments/GenAI/transcript_end.json'

# # Call the function to calculate the "end" values and remove "duration" and save the new JSON file
# calculate_end_from_file(input_file_path, output_file_path)
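The conversion in end_calculate.py replaces each segment's `duration` with an `end` timestamp equal to `start + duration`, rounded to two decimals. A minimal in-memory sketch of that step follows. The function name `add_end_times` and the sample segments are assumptions made for illustration, and unlike the script this version returns new dictionaries instead of modifying the input.

```python
from typing import Dict, List


def add_end_times(segments: List[Dict]) -> List[Dict]:
    """Return new segments with "end" = start + duration, dropping "duration"."""
    result = []
    for item in segments:
        # Copy every field except "duration", then add the computed "end".
        converted = {key: value for key, value in item.items() if key != "duration"}
        converted["end"] = round(item["start"] + item["duration"], 2)
        result.append(converted)
    return result


# Made-up sample in the shape the script expects (text, start, duration).
sample = [
    {"text": "What is Python?", "start": 0.0, "duration": 2.5},
    {"text": "Why is it popular?", "start": 2.5, "duration": 4.13},
]
converted = add_end_times(sample)
```

Leaving the input untouched makes the step safe to rerun on the same list, at the cost of one extra copy per segment.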
ingest.py
ADDED
@@ -0,0 +1,131 @@
1 |
+
import logging
|
2 |
+
import os
|
3 |
+
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
|
4 |
+
|
5 |
+
import click
|
6 |
+
import torch
|
7 |
+
from langchain.docstore.document import Document
|
8 |
+
from langchain.embeddings import HuggingFaceInstructEmbeddings
|
9 |
+
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
|
10 |
+
from langchain.vectorstores import Chroma
|
11 |
+
|
12 |
+
from constants import (
|
13 |
+
CHROMA_SETTINGS,
|
14 |
+
DOCUMENT_MAP,
|
15 |
+
EMBEDDING_MODEL_NAME,
|
16 |
+
INGEST_THREADS,
|
17 |
+
PERSIST_DIRECTORY,
|
18 |
+
SOURCE_DIRECTORY,
|
19 |
+
)
|
20 |
+
|
21 |
+
|
22 |
+
def load_single_document(file_path: str) -> Document:
|
23 |
+
# Loads a single document from a file path
|
24 |
+
file_extension = os.path.splitext(file_path)[1]
|
25 |
+
loader_class = DOCUMENT_MAP.get(file_extension)
|
26 |
+
if loader_class:
|
27 |
+
loader = loader_class(file_path)
|
28 |
+
else:
|
29 |
+
raise ValueError("Document type is undefined")
|
30 |
+
return loader.load()[0]
|
31 |
+
|
32 |
+
|
33 |
+
def load_document_batch(filepaths):
|
34 |
+
logging.info("Loading document batch")
|
35 |
+
# create a thread pool
|
36 |
+
with ThreadPoolExecutor(len(filepaths)) as exe:
|
37 |
+
# load files
|
38 |
+
futures = [exe.submit(load_single_document, name) for name in filepaths]
|
39 |
+
# collect data
|
40 |
+
data_list = [future.result() for future in futures]
|
41 |
+
# return data and file paths
|
42 |
+
return (data_list, filepaths)
|
43 |
+
|
44 |
+
|
45 |
+
def load_documents(source_dir: str) -> list[Document]:
|
46 |
+
# Loads all documents from the source documents directory
|
47 |
+
all_files = os.listdir(source_dir)
|
48 |
+
paths = []
|
49 |
+
for file_path in all_files:
|
50 |
+
file_extension = os.path.splitext(file_path)[1]
|
51 |
+
source_file_path = os.path.join(source_dir, file_path)
|
52 |
+
if file_extension in DOCUMENT_MAP.keys():
|
53 |
+
paths.append(source_file_path)
|
54 |
+
|
55 |
+
# Have at least one worker and at most INGEST_THREADS workers
|
56 |
+
n_workers = min(INGEST_THREADS, max(len(paths), 1))
|
57 |
+
    chunksize = max(round(len(paths) / n_workers), 1)  # never 0: range() rejects a zero step
|
58 |
+
docs = []
|
59 |
+
with ProcessPoolExecutor(n_workers) as executor:
|
60 |
+
futures = []
|
61 |
+
# split the load operations into chunks
|
62 |
+
for i in range(0, len(paths), chunksize):
|
63 |
+
# select a chunk of filenames
|
64 |
+
filepaths = paths[i : (i + chunksize)]
|
65 |
+
# submit the task
|
66 |
+
future = executor.submit(load_document_batch, filepaths)
|
67 |
+
futures.append(future)
|
68 |
+
# process all results
|
69 |
+
for future in as_completed(futures):
|
70 |
+
# open the file and load the data
|
71 |
+
contents, _ = future.result()
|
72 |
+
docs.extend(contents)
|
73 |
+
|
74 |
+
return docs
|
75 |
+
|
76 |
+
|
77 |
+
def split_documents(documents: list[Document]) -> tuple[list[Document], list[Document]]:
|
78 |
+
# Splits documents for correct Text Splitter
|
79 |
+
text_docs, python_docs = [], []
|
80 |
+
for doc in documents:
|
81 |
+
file_extension = os.path.splitext(doc.metadata["source"])[1]
|
82 |
+
if file_extension == ".py":
|
83 |
+
python_docs.append(doc)
|
84 |
+
else:
|
85 |
+
text_docs.append(doc)
|
86 |
+
|
87 |
+
return text_docs, python_docs
|
88 |
+
|
89 |
+
def main():
|
90 |
+
# Load documents and split in chunks
|
91 |
+
logging.info(f"Loading documents from {SOURCE_DIRECTORY}")
|
92 |
+
documents = load_documents(SOURCE_DIRECTORY)
|
93 |
+
text_documents, python_documents = split_documents(documents)
|
94 |
+
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
|
95 |
+
python_splitter = RecursiveCharacterTextSplitter.from_language(
|
96 |
+
language=Language.PYTHON, chunk_size=1000, chunk_overlap=200
|
97 |
+
)
|
98 |
+
texts = text_splitter.split_documents(text_documents)
|
99 |
+
texts.extend(python_splitter.split_documents(python_documents))
|
100 |
+
logging.info(f"Loaded {len(documents)} documents from {SOURCE_DIRECTORY}")
|
101 |
+
logging.info(f"Split into {len(texts)} chunks of text")
|
102 |
+
|
103 |
+
# Create embeddings
|
104 |
+
embeddings = HuggingFaceInstructEmbeddings(
|
105 |
+
model_name=EMBEDDING_MODEL_NAME,
|
106 |
+
model_kwargs={"device": "cpu"},
|
107 |
+
)
|
108 |
+
# change the embedding type here if you are running into issues.
|
109 |
+
    # These are much smaller embeddings and will work for most applications
|
110 |
+
# If you use HuggingFaceEmbeddings, make sure to also use the same in the
|
111 |
+
# run_localGPT.py file.
|
112 |
+
|
113 |
+
# embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
|
114 |
+
|
115 |
+
db = Chroma.from_documents(
|
116 |
+
texts,
|
117 |
+
embeddings,
|
118 |
+
persist_directory=PERSIST_DIRECTORY,
|
119 |
+
client_settings=CHROMA_SETTINGS,
|
120 |
+
)
|
121 |
+
db.persist()
|
122 |
+
db = None
|
123 |
+
|
124 |
+
return "done"
|
125 |
+
|
126 |
+
|
127 |
+
if __name__ == "__main__":
|
128 |
+
logging.basicConfig(
|
129 |
+
format="%(asctime)s - %(levelname)s - %(filename)s:%(lineno)s - %(message)s", level=logging.INFO
|
130 |
+
)
|
131 |
+
main()
|
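The batching step in `load_documents` can be sketched in isolation. The helper name `split_into_batches` below is hypothetical and not part of ingest.py; it shows one possible way to pick a chunk size that never drops to zero, so an empty source directory yields no batches instead of raising an error.

```python
import math


def split_into_batches(paths, max_workers):
    """Split file paths into at most `max_workers` contiguous batches."""
    if not paths:
        return []
    # Have at least one worker and at most max_workers workers.
    n_workers = min(max_workers, len(paths))
    # Ceiling division keeps the chunk size >= 1 and the batch count <= n_workers.
    chunksize = math.ceil(len(paths) / n_workers)
    return [paths[i : i + chunksize] for i in range(0, len(paths), chunksize)]


if __name__ == "__main__":
    print(split_into_batches(["a.txt", "b.pdf", "c.py", "d.txt", "e.txt"], 2))
```

Each batch could then be submitted to the process pool the same way `load_document_batch` is submitted above.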
jira.csv
ADDED
@@ -0,0 +1,4 @@
1 |
+
Key,Summary,Assignee
|
2 |
+
JT-3,Youtube querying,Srikanth Murleedharan
|
3 |
+
JT-2,Functionality for Teams transcript,Bharathi Sriram
|
4 |
+
JT-1,Multiple-page UI,Lakshmi Narayanan
|
modJira.py
ADDED
@@ -0,0 +1,49 @@
1 |
+
from jira.client import JIRA
|
2 |
+
import pandas as pd
|
3 |
+
|
4 |
+
def get_details(project_id):
|
5 |
+
    # Specify the server URL. It should be your Atlassian domain,
|
6 |
+
    # e.g. https://yourdomainname.atlassian.net
|
7 |
+
jiraOptions = {'server': "https://srikanthnm.atlassian.net"}
|
8 |
+
|
9 |
+
# Get a JIRA client instance, pass,
|
10 |
+
# Authentication parameters
|
11 |
+
# and the Server name.
|
12 |
+
# emailID = your emailID
|
13 |
+
# token = token you receive after registration
|
14 |
+
    import os  # local import; JIRA_EMAIL and JIRA_API_TOKEN are assumed variable names
    jira = JIRA(options=jiraOptions, basic_auth=(
|
15 |
+
        os.environ["JIRA_EMAIL"], os.environ["JIRA_API_TOKEN"]))
|
16 |
+
|
17 |
+
# Search all issues mentioned against a project name.
|
18 |
+
lstKeys = []
|
19 |
+
lstSummary = []
|
20 |
+
lstReporter = []
|
21 |
+
for singleIssue in jira.search_issues(jql_str=f'project = {project_id}'):
|
22 |
+
lstKeys.append(singleIssue.key)
|
23 |
+
lstSummary.append(singleIssue.fields.summary)
|
24 |
+
        lstReporter.append(singleIssue.fields.assignee.displayName if singleIssue.fields.assignee else None)
|
25 |
+
|
26 |
+
df_output = pd.DataFrame()
|
27 |
+
df_output['Key'] = lstKeys
|
28 |
+
df_output['Summary'] = lstSummary
|
29 |
+
df_output['Assignee'] = lstReporter
|
30 |
+
|
31 |
+
df_output.to_csv('jira.csv', index=False)
|
32 |
+
|
33 |
+
return df_output
|
34 |
+
|
35 |
+
def ask_question(strQuery):
|
36 |
+
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
|
37 |
+
import pandas as pd
|
38 |
+
|
39 |
+
tokenizer = AutoTokenizer.from_pretrained("Yale-LILY/reastap-large")
|
40 |
+
model = AutoModelForSeq2SeqLM.from_pretrained("Yale-LILY/reastap-large")
|
41 |
+
|
42 |
+
table = pd.read_csv("jira.csv")
|
43 |
+
|
44 |
+
query = strQuery
|
45 |
+
encoding = tokenizer(table=table, query=query, return_tensors="pt")
|
46 |
+
|
47 |
+
outputs = model.generate(**encoding)
|
48 |
+
|
49 |
+
return (tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
|
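The issue-flattening step in `get_details` can be sketched without a live Jira server. This is a minimal illustration under stated assumptions: `issues_to_rows` and `read_jira_credentials` are hypothetical helpers, the environment variable names `JIRA_EMAIL` and `JIRA_API_TOKEN` are invented for the example, and `SimpleNamespace` objects stand in for the issue objects the client returns. The point is that an unassigned issue has no assignee object, so the display name needs a guard.

```python
import os
from types import SimpleNamespace


def read_jira_credentials(env=os.environ):
    """Return (email, token) from the environment (variable names are assumptions)."""
    return env["JIRA_EMAIL"], env["JIRA_API_TOKEN"]


def issues_to_rows(issues):
    """Flatten issue-like objects into dicts, tolerating unassigned issues."""
    rows = []
    for issue in issues:
        assignee = issue.fields.assignee
        rows.append(
            {
                "Key": issue.key,
                "Summary": issue.fields.summary,
                "Assignee": assignee.displayName if assignee else None,
            }
        )
    return rows


if __name__ == "__main__":
    # Stand-in objects shaped like the fields used in get_details.
    assigned = SimpleNamespace(
        key="JT-1",
        fields=SimpleNamespace(
            summary="Multiple-page UI",
            assignee=SimpleNamespace(displayName="Lakshmi Narayanan"),
        ),
    )
    unassigned = SimpleNamespace(
        key="JT-9", fields=SimpleNamespace(summary="Example", assignee=None)
    )
    print(issues_to_rows([assigned, unassigned]))
```

The resulting list of dicts could be passed straight to `pd.DataFrame(rows)` to produce the same three columns that are written to jira.csv.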
requirements.txt
ADDED
@@ -0,0 +1,34 @@
1 |
+
# Natural Language Processing
|
2 |
+
langchain==0.0.191
|
3 |
+
chromadb==0.3.22
|
4 |
+
llama-cpp-python==0.1.66
|
5 |
+
pdfminer.six==20221105
|
6 |
+
InstructorEmbedding
|
7 |
+
sentence-transformers
|
8 |
+
faiss-cpu
|
9 |
+
huggingface_hub
|
10 |
+
transformers
|
11 |
+
protobuf==3.20.0; sys_platform != 'darwin'
|
12 |
+
protobuf==3.20.0; sys_platform == 'darwin' and platform_machine != 'arm64'
|
13 |
+
protobuf==3.20.3; sys_platform == 'darwin' and platform_machine == 'arm64'
|
14 |
+
auto-gptq==0.2.2
|
15 |
+
docx2txt
|
16 |
+
|
17 |
+
# Utilities
|
18 |
+
urllib3==1.26.6
|
19 |
+
accelerate
|
20 |
+
bitsandbytes ; sys_platform != 'win32'
|
21 |
+
bitsandbytes-windows ; sys_platform == 'win32'
|
22 |
+
click
|
23 |
+
flask
|
24 |
+
requests
|
25 |
+
|
26 |
+
# Excel File Manipulation
|
27 |
+
openpyxl
|
28 |
+
|
29 |
+
# Custom
|
30 |
+
youtube-transcript-api
|
31 |
+
streamlit
|
32 |
+
jira
|
33 |
+
|
34 |
+
|
run_localGPT.py
ADDED
@@ -0,0 +1,245 @@
1 |
+
import logging
|
2 |
+
|
3 |
+
import click
|
4 |
+
import torch
|
5 |
+
from auto_gptq import AutoGPTQForCausalLM
|
6 |
+
from huggingface_hub import hf_hub_download
|
7 |
+
from langchain.chains import RetrievalQA
|
8 |
+
from langchain.embeddings import HuggingFaceInstructEmbeddings
|
9 |
+
from langchain.llms import HuggingFacePipeline, LlamaCpp
|
10 |
+
|
11 |
+
# from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
|
12 |
+
from langchain.vectorstores import Chroma
|
13 |
+
from transformers import (
|
14 |
+
AutoModelForCausalLM,
|
15 |
+
AutoTokenizer,
|
16 |
+
GenerationConfig,
|
17 |
+
LlamaForCausalLM,
|
18 |
+
LlamaTokenizer,
|
19 |
+
pipeline,
|
20 |
+
)
|
21 |
+
|
22 |
+
from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY
|
23 |
+
|
24 |
+
|
25 |
+
def load_model(device_type, model_id, model_basename=None):
|
26 |
+
"""
|
27 |
+
Select a model for text generation using the HuggingFace library.
|
28 |
+
If you are running this for the first time, it will download a model for you.
|
29 |
+
    Subsequent runs will use the model from the disk.
|
30 |
+
|
31 |
+
Args:
|
32 |
+
device_type (str): Type of device to use, e.g., "cuda" for GPU or "cpu" for CPU.
|
33 |
+
model_id (str): Identifier of the model to load from HuggingFace's model hub.
|
34 |
+
model_basename (str, optional): Basename of the model if using quantized models.
|
35 |
+
Defaults to None.
|
36 |
+
|
37 |
+
Returns:
|
38 |
+
HuggingFacePipeline: A pipeline object for text generation using the loaded model.
|
39 |
+
|
40 |
+
Raises:
|
41 |
+
ValueError: If an unsupported model or device type is provided.
|
42 |
+
"""
|
43 |
+
logging.info(f"Loading Model: {model_id}, on: {device_type}")
|
44 |
+
logging.info("This action can take a few minutes!")
|
45 |
+
|
46 |
+
if model_basename is not None:
|
47 |
+
if ".ggml" in model_basename:
|
48 |
+
logging.info("Using Llamacpp for GGML quantized models")
|
49 |
+
model_path = hf_hub_download(repo_id=model_id, filename=model_basename)
|
50 |
+
max_ctx_size = 2048
|
51 |
+
kwargs = {
|
52 |
+
"model_path": model_path,
|
53 |
+
"n_ctx": max_ctx_size,
|
54 |
+
"max_tokens": max_ctx_size,
|
55 |
+
}
|
56 |
+
if device_type.lower() == "mps":
|
57 |
+
kwargs["n_gpu_layers"] = 1000
|
58 |
+
if device_type.lower() == "cuda":
|
59 |
+
kwargs["n_gpu_layers"] = 1000
|
60 |
+
kwargs["n_batch"] = max_ctx_size
|
61 |
+
return LlamaCpp(**kwargs)
|
62 |
+
|
63 |
+
else:
|
64 |
+
            # The code supports all huggingface models that end with GPTQ and have some variation
|
65 |
+
# of .no-act.order or .safetensors in their HF repo.
|
66 |
+
logging.info("Using AutoGPTQForCausalLM for quantized models")
|
67 |
+
|
68 |
+
if ".safetensors" in model_basename:
|
69 |
+
# Remove the ".safetensors" ending if present
|
70 |
+
model_basename = model_basename.replace(".safetensors", "")
|
71 |
+
|
72 |
+
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
|
73 |
+
logging.info("Tokenizer loaded")
|
74 |
+
|
75 |
+
model = AutoGPTQForCausalLM.from_quantized(
|
76 |
+
model_id,
|
77 |
+
model_basename=model_basename,
|
78 |
+
use_safetensors=True,
|
79 |
+
trust_remote_code=True,
|
80 |
+
device="cuda:0",
|
81 |
+
use_triton=False,
|
82 |
+
quantize_config=None,
|
83 |
+
)
|
84 |
+
elif (
|
85 |
+
device_type.lower() == "cuda"
|
86 |
+
    ):  # The code supports all huggingface models that end with -HF or that have a .bin
|
87 |
+
# file in their HF repo.
|
88 |
+
logging.info("Using AutoModelForCausalLM for full models")
|
89 |
+
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
90 |
+
logging.info("Tokenizer loaded")
|
91 |
+
|
92 |
+
model = AutoModelForCausalLM.from_pretrained(
|
93 |
+
model_id,
|
94 |
+
device_map="auto",
|
95 |
+
torch_dtype=torch.float16,
|
96 |
+
low_cpu_mem_usage=True,
|
97 |
+
trust_remote_code=True,
|
98 |
+
            # max_memory={0: "15GB"} # Uncomment this line if you encounter CUDA out of memory errors
|
99 |
+
)
|
100 |
+
model.tie_weights()
|
101 |
+
else:
|
102 |
+
logging.info("Using LlamaTokenizer")
|
103 |
+
tokenizer = LlamaTokenizer.from_pretrained(model_id)
|
104 |
+
model = LlamaForCausalLM.from_pretrained(model_id)
|
105 |
+
|
106 |
+
# Load configuration from the model to avoid warnings
|
107 |
+
generation_config = GenerationConfig.from_pretrained(model_id)
|
108 |
+
# see here for details:
|
109 |
+
# https://huggingface.co/docs/transformers/
|
110 |
+
# main_classes/text_generation#transformers.GenerationConfig.from_pretrained.returns
|
111 |
+
|
112 |
+
# Create a pipeline for text generation
|
113 |
+
pipe = pipeline(
|
114 |
+
"text-generation",
|
115 |
+
model=model,
|
116 |
+
tokenizer=tokenizer,
|
117 |
+
max_length=2048,
|
118 |
+
temperature=0,
|
119 |
+
top_p=0.95,
|
120 |
+
repetition_penalty=1.15,
|
121 |
+
generation_config=generation_config,
|
122 |
+
)
|
123 |
+
|
124 |
+
local_llm = HuggingFacePipeline(pipeline=pipe)
|
125 |
+
logging.info("Local LLM Loaded")
|
126 |
+
|
127 |
+
return local_llm
|
128 |
+
|
129 |
+
|
130 |
+
# # choose the device type to run on, and whether to show source documents.
|
131 |
+
# @click.command()
|
132 |
+
# @click.option(
|
133 |
+
# "--device_type",
|
134 |
+
# default="cuda" if torch.cuda.is_available() else "cpu",
|
135 |
+
# type=click.Choice(
|
136 |
+
# [
|
137 |
+
# "cpu",
|
138 |
+
# "cuda",
|
139 |
+
# "ipu",
|
140 |
+
# "xpu",
|
141 |
+
# "mkldnn",
|
142 |
+
# "opengl",
|
143 |
+
# "opencl",
|
144 |
+
# "ideep",
|
145 |
+
# "hip",
|
146 |
+
# "ve",
|
147 |
+
# "fpga",
|
148 |
+
# "ort",
|
149 |
+
# "xla",
|
150 |
+
# "lazy",
|
151 |
+
# "vulkan",
|
152 |
+
# "mps",
|
153 |
+
# "meta",
|
154 |
+
# "hpu",
|
155 |
+
# "mtia",
|
156 |
+
# ],
|
157 |
+
# ),
|
158 |
+
# help="Device to run on. (Default is cuda)",
|
159 |
+
# )
|
160 |
+
# @click.option(
|
161 |
+
# "--show_sources",
|
162 |
+
# "-s",
|
163 |
+
# is_flag=True,
|
164 |
+
# help="Show sources along with answers (Default is False)",
|
165 |
+
# )
|
166 |
+
def main(device_type, strQuery):
|
167 |
+
"""
|
168 |
+
This function implements the information retrieval task.
|
169 |
+
|
170 |
+
|
171 |
+
1. Loads an embedding model, can be HuggingFaceInstructEmbeddings or HuggingFaceEmbeddings
|
172 |
+
    2. Loads the existing vectorstore that was created by ingest.py
|
173 |
+
3. Loads the local LLM using load_model function - You can now set different LLMs.
|
174 |
+
    4. Sets up the question-answering retrieval chain.
|
175 |
+
    5. Answers the question.
|
176 |
+
"""
|
177 |
+
|
178 |
+
logging.info(f"Running on: {device_type}")
|
179 |
+
#logging.info(f"Display Source Documents set to: {show_sources}")
|
180 |
+
|
181 |
+
embeddings = HuggingFaceInstructEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs={"device": device_type})
|
182 |
+
|
183 |
+
# uncomment the following line if you used HuggingFaceEmbeddings in the ingest.py
|
184 |
+
# embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
|
185 |
+
|
186 |
+
# load the vectorstore
|
187 |
+
db = Chroma(
|
188 |
+
persist_directory=PERSIST_DIRECTORY,
|
189 |
+
embedding_function=embeddings,
|
190 |
+
client_settings=CHROMA_SETTINGS,
|
191 |
+
)
|
192 |
+
retriever = db.as_retriever()
|
193 |
+
|
194 |
+
# load the LLM for generating Natural Language responses
|
195 |
+
|
196 |
+
# for HF models
|
197 |
+
# model_id = "TheBloke/vicuna-7B-1.1-HF"
|
198 |
+
# model_basename = None
|
199 |
+
# model_id = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"
|
200 |
+
# model_id = "TheBloke/guanaco-7B-HF"
|
201 |
+
# model_id = 'NousResearch/Nous-Hermes-13b' # Requires ~ 23GB VRAM. Using STransformers
|
202 |
+
# alongside will 100% create OOM on 24GB cards.
|
203 |
+
# llm = load_model(device_type, model_id=model_id)
|
204 |
+
|
205 |
+
# for GPTQ (quantized) models
|
206 |
+
# model_id = "TheBloke/Nous-Hermes-13B-GPTQ"
|
207 |
+
# model_basename = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"
|
208 |
+
# model_id = "TheBloke/WizardLM-30B-Uncensored-GPTQ"
|
209 |
+
# model_basename = "WizardLM-30B-Uncensored-GPTQ-4bit.act-order.safetensors" # Requires
|
210 |
+
# ~21GB VRAM. Using STransformers alongside can potentially create OOM on 24GB cards.
|
211 |
+
# model_id = "TheBloke/wizardLM-7B-GPTQ"
|
212 |
+
# model_basename = "wizardLM-7B-GPTQ-4bit.compat.no-act-order.safetensors"
|
213 |
+
# model_id = "TheBloke/WizardLM-7B-uncensored-GPTQ"
|
214 |
+
# model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"
|
215 |
+
|
216 |
+
# for GGML (quantized cpu+gpu+mps) models - check if they support llama.cpp
|
217 |
+
# model_id = "TheBloke/wizard-vicuna-13B-GGML"
|
218 |
+
# model_basename = "wizard-vicuna-13B.ggmlv3.q4_0.bin"
|
219 |
+
# model_basename = "wizard-vicuna-13B.ggmlv3.q6_K.bin"
|
220 |
+
# model_basename = "wizard-vicuna-13B.ggmlv3.q2_K.bin"
|
221 |
+
# model_id = "TheBloke/orca_mini_3B-GGML"
|
222 |
+
# model_basename = "orca-mini-3b.ggmlv3.q4_0.bin"
|
223 |
+
|
224 |
+
model_id = "TheBloke/Llama-2-7B-Chat-GGML"
|
225 |
+
model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"
|
226 |
+
|
227 |
+
llm = load_model(device_type, model_id=model_id, model_basename=model_basename)
|
228 |
+
|
229 |
+
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
|
230 |
+
# Interactive questions and answers
|
231 |
+
|
232 |
+
query = strQuery
|
233 |
+
# Get the answer from the chain
|
234 |
+
res = qa(query)
|
235 |
+
answer, docs = res["result"], res["source_documents"]
|
236 |
+
|
237 |
+
return(answer)
|
238 |
+
|
239 |
+
|
240 |
+
|
241 |
+
if __name__ == "__main__":
|
242 |
+
logging.basicConfig(
|
243 |
+
format="%(asctime)s - %(levelname)s - %(filename)s:%(lineno)s - %(message)s", level=logging.INFO
|
244 |
+
)
|
245 |
+
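    # main() requires both arguments; the device default mirrors the commented-out click option.
    print(main("cuda" if torch.cuda.is_available() else "cpu", input("Enter a query: ")))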
|
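The branching in `load_model` can be summarized as a small decision function. This is only a sketch of the control flow shown above: `select_loader` is a hypothetical name, and the returned strings are illustrative labels rather than identifiers from any library.

```python
def select_loader(device_type, model_basename=None):
    """Return a label for the loading path that load_model would take."""
    if model_basename is not None:
        if ".ggml" in model_basename:
            # GGML quantized files are handed to llama.cpp via LlamaCpp.
            return "llama.cpp"
        # Any other basename is treated as a GPTQ quantized model.
        return "auto-gptq"
    if device_type.lower() == "cuda":
        # Full-precision Hugging Face models on GPU use AutoModelForCausalLM.
        return "transformers-auto"
    # Everything else falls back to LlamaTokenizer / LlamaForCausalLM.
    return "transformers-llama"


if __name__ == "__main__":
    print(select_loader("cpu", "llama-2-7b-chat.ggmlv3.q4_0.bin"))
```

Note that the basename alone decides between the two quantized paths; the device type only matters when no basename is given.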
similarity.py
ADDED
@@ -0,0 +1,22 @@
1 |
+
from sentence_transformers import SentenceTransformer, util
|
2 |
+
import json
|
3 |
+
import numpy as np
|
4 |
+
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
|
5 |
+
|
6 |
+
def similarity(strQuery):
|
7 |
+
|
8 |
+
inputs = json.load(open('chunks.json','r'))
|
9 |
+
lstCorpus = [dct['text'] for dct in inputs]
|
10 |
+
|
11 |
+
    # strQuery comes from the caller, e.g. "How many different document types?"
|
12 |
+
qryEmbedding = model.encode(strQuery, convert_to_tensor=True)
|
13 |
+
corpusEmbedding= model.encode(lstCorpus, convert_to_tensor=True)
|
14 |
+
|
15 |
+
sim_mat = util.pytorch_cos_sim(qryEmbedding, corpusEmbedding)
|
16 |
+
lstSim = sim_mat[0].tolist()
|
17 |
+
npSim = np.array(lstSim)
|
18 |
+
indexMax = npSim.argmax()
|
19 |
+
scoreMax = npSim.max()
|
20 |
+
|
21 |
+
return(inputs[indexMax]['start'], inputs[indexMax]['end'])
|
22 |
+
|
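The retrieval idea in `similarity` (embed the query, score it against every chunk, return the timestamps of the best match) can be illustrated with plain lists in place of model embeddings. This is a toy sketch: the `embedding` key and the helper names `cosine` and `best_chunk` are assumptions made for the example, since chunks.json as used above only carries `text`, `start`, and `end`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_chunk(query_vec, chunks):
    """Return (start, end) of the chunk whose embedding is closest to the query."""
    best = max(chunks, key=lambda c: cosine(query_vec, c["embedding"]))
    return best["start"], best["end"]


if __name__ == "__main__":
    toy_chunks = [
        {"embedding": [0.0, 1.0], "start": 0.0, "end": 4.0},
        {"embedding": [1.0, 0.1], "start": 4.0, "end": 8.0},
    ]
    print(best_chunk([1.0, 0.0], toy_chunks))
```

In the real function the vectors would come from `model.encode`, and the argmax over the similarity row plays the role of `max` here.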
transcript.json
ADDED
@@ -0,0 +1,312 @@
1 |
+
[
|
2 |
+
{
|
3 |
+
"text": "In this video, I'm going to answer the top 3 questions",
|
4 |
+
"start": 0.0,
|
5 |
+
"duration": 4.0
|
6 |
+
},
|
7 |
+
{
|
8 |
+
"text": "my students ask me about Python. What is Python? What ",
|
9 |
+
"start": 4.0,
|
10 |
+
"duration": 4.0
|
11 |
+
},
|
12 |
+
{
|
13 |
+
"text": "can you do with it? And why is it so popular? In other words, what",
|
14 |
+
"start": 8.0,
|
15 |
+
"duration": 4.0
|
16 |
+
},
|
17 |
+
{
|
18 |
+
"text": "does it do that other programming languages don't? Python is the ",
|
19 |
+
"start": 12.0,
|
20 |
+
"duration": 4.0
|
21 |
+
},
|
22 |
+
{
|
23 |
+
"text": "world's fastest growing and most popular programming language, not just ",
|
24 |
+
"start": 16.0,
|
25 |
+
"duration": 4.0
|
26 |
+
},
|
27 |
+
{
|
28 |
+
"text": "amongst software engineers, but also amongst mathematicians, ",
|
29 |
+
"start": 20.0,
|
30 |
+
"duration": 4.0
|
31 |
+
},
|
32 |
+
{
|
33 |
+
"text": "data analysts, scientists, accountants, networking engineers,",
|
34 |
+
"start": 24.0,
|
35 |
+
"duration": 4.0
|
36 |
+
},
|
37 |
+
{
|
38 |
+
"text": "and even kids! Because it's a very beginner friendly programming ",
|
39 |
+
"start": 28.0,
|
40 |
+
"duration": 4.0
|
41 |
+
},
|
42 |
+
{
|
43 |
+
"text": "language. So people from different disciplines use Python",
|
44 |
+
"start": 32.0,
|
45 |
+
"duration": 4.0
|
46 |
+
},
|
47 |
+
{
|
48 |
+
"text": "for a variety of different tasks, such as data analysis and visualization, ",
|
49 |
+
"start": 36.0,
|
50 |
+
"duration": 4.0
|
51 |
+
},
|
52 |
+
{
|
53 |
+
"text": "artificial intelligence and machine learning, automation ",
|
54 |
+
"start": 40.0,
|
55 |
+
"duration": 4.0
|
56 |
+
},
|
57 |
+
{
|
58 |
+
"text": "in fact this is one of the big uses of Python amongst people who are not software",
|
59 |
+
"start": 44.0,
|
60 |
+
"duration": 4.0
|
61 |
+
},
|
62 |
+
{
|
63 |
+
"text": "developers. If you constantly have to do boring, repetitive ",
|
64 |
+
"start": 48.0,
|
65 |
+
"duration": 4.0
|
66 |
+
},
|
67 |
+
{
|
68 |
+
"text": "tasks, such as copying files and folders around, renaming them, ",
|
69 |
+
"start": 52.0,
|
70 |
+
"duration": 4.0
|
71 |
+
},
|
72 |
+
{
|
73 |
+
"text": "uploading them to a server, you can easily write a Python script to",
|
74 |
+
"start": 56.0,
|
75 |
+
"duration": 4.0
|
76 |
+
},
|
77 |
+
{
|
78 |
+
"text": "automate all that and save your time. And that's just one example, if you",
|
79 |
+
"start": 60.0,
|
80 |
+
"duration": 4.0
|
81 |
+
},
|
82 |
+
{
|
83 |
+
"text": "continuously have to work with excel spreadsheets, PDF's, CS",
|
84 |
+
"start": 64.0,
|
85 |
+
"duration": 4.0
|
86 |
+
},
|
87 |
+
{
|
88 |
+
"text": "View files, download websites and parse them, you can automate all",
|
89 |
+
"start": 68.0,
|
90 |
+
"duration": 4.0
|
91 |
+
},
|
92 |
+
{
|
93 |
+
"text": "that stuff with Python. So you don't have to be a software developer to use Python.",
|
94 |
+
"start": 72.0,
|
95 |
+
"duration": 4.0
|
96 |
+
},
|
97 |
+
{
|
98 |
+
"text": "You could be an accountant, a mathematician, or a scientist, and use Python ",
|
99 |
+
"start": 76.0,
|
100 |
+
"duration": 4.0
|
101 |
+
},
|
102 |
+
{
|
103 |
+
"text": "to make your life easier. You can also use Python to build ",
|
104 |
+
"start": 80.0,
|
105 |
+
"duration": 4.0
|
106 |
+
},
|
107 |
+
{
|
108 |
+
"text": "web, mobile and desktop applications as well as software ",
|
109 |
+
"start": 84.0,
|
110 |
+
"duration": 4.0
|
111 |
+
},
|
112 |
+
{
|
113 |
+
"text": "testing or even hacking. So Python is a multi purpose language. ",
|
114 |
+
"start": 88.0,
|
115 |
+
"duration": 4.0
|
116 |
+
},
|
117 |
+
{
|
118 |
+
"text": "Now if you have some programming experience you may say, \"But Mosh",
|
119 |
+
"start": 92.0,
|
120 |
+
"duration": 4.0
|
121 |
+
},
|
122 |
+
{
|
123 |
+
"text": "we can do all this stuff with other programming languages, so what's the big deal ",
|
124 |
+
"start": 96.0,
|
125 |
+
"duration": 4.0
|
126 |
+
},
|
127 |
+
{
|
128 |
+
"text": "about Python?\" Here are a few reasons. With Python you can ",
|
129 |
+
"start": 100.0,
|
130 |
+
"duration": 4.0
|
131 |
+
},
|
132 |
+
{
|
133 |
+
"text": "solve complex problems in less time with fewer lines of code. ",
|
134 |
+
"start": 104.0,
|
135 |
+
"duration": 4.0
|
136 |
+
},
|
137 |
+
{
|
138 |
+
"text": "Here's an example. Let's say we want to extract the first three ",
|
139 |
+
"start": 108.0,
|
140 |
+
"duration": 4.0
|
141 |
+
},
|
142 |
+
{
|
143 |
+
"text": "letters of the text Hello World. This is the code we have to write ",
|
144 |
+
"start": 112.0,
|
145 |
+
"duration": 4.0
|
146 |
+
},
|
147 |
+
{
|
148 |
+
"text": "in C# this is how we do it in JavaScript and here's how we ",
|
149 |
+
"start": 116.0,
|
150 |
+
"duration": 4.0
|
151 |
+
},
|
152 |
+
{
|
153 |
+
"text": "do it in Python. See how short and clean the language is?",
|
154 |
+
"start": 120.0,
|
155 |
+
"duration": 4.0
|
156 |
+
},
|
157 |
+
{
|
158 |
+
"text": "And that's just the beginning. Python makes a lot of trivial things",
|
159 |
+
"start": 124.0,
|
160 |
+
"duration": 4.0
|
161 |
+
},
|
162 |
+
{
|
163 |
+
"text": "really easy with a simple yet powerful syntax. Here are a few",
|
164 |
+
"start": 128.0,
|
165 |
+
"duration": 4.0
|
166 |
+
},
|
167 |
+
{
|
168 |
+
"text": "other reasons Python is so popular. It's a high level language",
|
169 |
+
"start": 132.0,
|
170 |
+
"duration": 4.0
|
171 |
+
},
|
172 |
+
{
|
173 |
+
"text": "so you don't have to worry about complex tasks such as memory management, ",
|
174 |
+
"start": 136.0,
|
175 |
+
"duration": 4.0
|
176 |
+
},
|
177 |
+
{
|
178 |
+
"text": "like you do in C++. It's cross platform which means ",
|
179 |
+
"start": 140.0,
|
180 |
+
"duration": 4.0
|
181 |
+
},
|
182 |
+
{
|
183 |
+
"text": "you can build and run Python applications on Windows, Mac, ",
|
184 |
+
"start": 144.0,
|
185 |
+
"duration": 4.0
|
186 |
+
},
|
187 |
+
{
|
188 |
+
"text": "and Linux. It has a huge community so whenever you get ",
|
189 |
+
"start": 148.0,
|
190 |
+
"duration": 4.0
|
191 |
+
},
|
192 |
+
{
|
193 |
+
"text": "stuck, there is someone out there to help. It has a large ecosystem ",
|
194 |
+
"start": 152.0,
|
195 |
+
"duration": 4.0
|
196 |
+
},
|
197 |
+
{
|
198 |
+
"text": "of libraries, frameworks and tools which means whatever you wanna do",
|
199 |
+
"start": 156.0,
|
200 |
+
"duration": 4.0
|
201 |
+
},
|
202 |
+
{
|
203 |
+
"text": "it is likely that someone else has done it before because Python has been around ",
|
204 |
+
"start": 160.0,
|
205 |
+
"duration": 4.0
|
206 |
+
},
|
207 |
+
{
|
208 |
+
"text": "for over 20 years. So in a nutshell, Python",
|
209 |
+
"start": 164.0,
|
210 |
+
"duration": 4.0
|
211 |
+
},
|
212 |
+
{
|
213 |
+
"text": "is a multi-purpose language with a simple, clean, and beginner-friendly ",
|
214 |
+
"start": 168.0,
|
215 |
+
"duration": 4.0
|
216 |
+
},
|
217 |
+
{
|
218 |
+
"text": "syntax. All of that means Python is awesome.",
|
219 |
+
"start": 172.0,
|
220 |
+
"duration": 4.0
|
221 |
+
},
|
222 |
+
{
|
223 |
+
"text": "Technically everything you do with Python you can do with other programming languages, ",
|
224 |
+
"start": 176.0,
|
225 |
+
"duration": 4.0
|
226 |
+
},
|
227 |
+
{
|
228 |
+
"text": "but Python's simplicity and elegance has made it grow way ",
|
229 |
+
"start": 180.0,
|
230 |
+
"duration": 4.0
|
231 |
+
},
|
232 |
+
{
|
233 |
+
"text": "more than other programming languages. That's why it's the number onne",
|
234 |
+
"start": 184.0,
|
235 |
+
"duration": 4.0
|
236 |
+
},
|
237 |
+
{
|
238 |
+
"text": "language employers are looking for. So whether you're a programmer or ",
|
239 |
+
"start": 188.0,
|
240 |
+
"duration": 4.0
|
241 |
+
},
|
242 |
+
{
|
243 |
+
"text": "an absolute beginner, learning Python opens up lots of job opportunities ",
|
244 |
+
"start": 192.0,
|
245 |
+
"duration": 4.0
|
246 |
+
},
|
247 |
+
{
|
248 |
+
"text": "to you. In fact, the average Python developer earns a whopping",
|
249 |
+
"start": 196.0,
|
250 |
+
"duration": 4.0
|
251 |
+
},
|
252 |
+
{
|
253 |
+
"text": "116,000 dollars a year. If you",
|
254 |
+
"start": 200.0,
|
255 |
+
"duration": 4.0
|
256 |
+
},
|
257 |
+
{
|
258 |
+
"text": "found this video helpful, please support my hard work by liking and sharing it with others. ",
|
259 |
+
"start": 204.0,
|
260 |
+
"duration": 4.0
|
261 |
+
},
|
262 |
+
{
|
263 |
+
"text": "Also, be sure to subscribe to my channel, because I have a couple of",
|
264 |
+
"start": 208.0,
|
265 |
+
"duration": 4.0
|
266 |
+
},
|
267 |
+
{
|
268 |
+
"text": "awesome Python tutorials for you, you're going to see them on the screen now. ",
|
269 |
+
"start": 212.0,
|
270 |
+
"duration": 4.0
|
271 |
+
},
|
272 |
+
{
|
273 |
+
"text": "Here's my Python tutorial for beginners, it's a great starting point if you ",
|
274 |
+
"start": 216.0,
|
275 |
+
"duration": 4.0
|
276 |
+
},
|
277 |
+
{
|
278 |
+
"text": "have limited or no programming experience. On the other hand, if you ",
|
279 |
+
"start": 220.0,
|
280 |
+
"duration": 4.0
|
281 |
+
},
|
282 |
+
{
|
283 |
+
"text": "do have some programming experience and want to quickly get up to speed with Python, ",
|
284 |
+
"start": 224.0,
|
285 |
+
"duration": 4.0
|
286 |
+
},
|
287 |
+
{
|
288 |
+
"text": "I have another tutorial just for you. I'm not going to waste your time ",
|
289 |
+
"start": 228.0,
|
290 |
+
"duration": 4.0
|
291 |
+
},
|
292 |
+
{
|
293 |
+
"text": "telling you what a variable or a function is. I will talk to you like a programmer.",
|
294 |
+
"start": 232.0,
|
295 |
+
"duration": 4.0
|
296 |
+
},
|
297 |
+
{
|
298 |
+
"text": "There's never been a better time to master Python programming,",
|
299 |
+
"start": 236.0,
|
300 |
+
"duration": 4.0
|
301 |
+
},
|
302 |
+
{
|
303 |
+
"text": "so click on the tutorial that is right for you and get started. Thank you for",
|
304 |
+
"start": 240.0,
|
305 |
+
"duration": 4.0
|
306 |
+
},
|
307 |
+
{
|
308 |
+
"text": "watching!",
|
309 |
+
"start": 244.0,
|
310 |
+
"duration": 2.633
|
311 |
+
}
|
312 |
+
]
|
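The transcript entries above carry `start` and `duration`, while transcript_end.json carries `start` and `end`. One plausible way to derive the second shape from the first is sketched below; `add_end_times` is a hypothetical helper, and it is an assumption that the end time is simply start plus duration.

```python
def add_end_times(segments):
    """Convert {'text', 'start', 'duration'} entries to {'text', 'start', 'end'}."""
    return [
        {
            "text": seg["text"],
            "start": seg["start"],
            # Assumed relation: a segment ends where its duration runs out.
            "end": seg["start"] + seg["duration"],
        }
        for seg in segments
    ]


if __name__ == "__main__":
    sample = [
        {"text": "In this video, I'm going to answer the top 3 questions", "start": 0.0, "duration": 4.0},
        {"text": "watching!", "start": 244.0, "duration": 2.633},
    ]
    for row in add_end_times(sample):
        print(row)
```

Because the values are floats, a caller that needs tidy output might round the computed `end` to a fixed number of decimals before writing JSON.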
transcript_end.json
ADDED
@@ -0,0 +1,312 @@
[
    {
        "text": "In this video, I'm going to answer the top 3 questions",
        "start": 0.0,
        "end": 4.0
    },
    {
        "text": "my students ask me about Python. What is Python? What ",
        "start": 4.0,
        "end": 8.0
    },
    {
        "text": "can you do with it? And why is it so popular? In other words, what",
        "start": 8.0,
        "end": 12.0
    },
    {
        "text": "does it do that other programming languages don't? Python is the ",
        "start": 12.0,
        "end": 16.0
    },
    {
        "text": "world's fastest growing and most popular programming language, not just ",
        "start": 16.0,
        "end": 20.0
    },
    {
        "text": "amongst software engineers, but also amongst mathematicians, ",
        "start": 20.0,
        "end": 24.0
    },
    {
        "text": "data analysts, scientists, accountants, networking engineers,",
        "start": 24.0,
        "end": 28.0
    },
    {
        "text": "and even kids! Because it's a very beginner friendly programming ",
        "start": 28.0,
        "end": 32.0
    },
    {
        "text": "language. So people from different disciplines use Python",
        "start": 32.0,
        "end": 36.0
    },
    {
        "text": "for a variety of different tasks, such as data analysis and visualization, ",
        "start": 36.0,
        "end": 40.0
    },
    {
        "text": "artificial intelligence and machine learning, automation ",
        "start": 40.0,
        "end": 44.0
    },
    {
        "text": "in fact this is one of the big uses of Python amongst people who are not software",
        "start": 44.0,
        "end": 48.0
    },
    {
        "text": "developers. If you constantly have to do boring, repetitive ",
        "start": 48.0,
        "end": 52.0
    },
    {
        "text": "tasks, such as copying files and folders around, renaming them, ",
        "start": 52.0,
        "end": 56.0
    },
    {
        "text": "uploading them to a server, you can easily write a Python script to",
        "start": 56.0,
        "end": 60.0
    },
    {
        "text": "automate all that and save your time. And that's just one example, if you",
        "start": 60.0,
        "end": 64.0
    },
    {
        "text": "continuously have to work with excel spreadsheets, PDF's, CS",
        "start": 64.0,
        "end": 68.0
    },
    {
        "text": "View files, download websites and parse them, you can automate all",
        "start": 68.0,
        "end": 72.0
    },
    {
        "text": "that stuff with Python. So you don't have to be a software developer to use Python.",
        "start": 72.0,
        "end": 76.0
    },
    {
        "text": "You could be an accountant, a mathematician, or a scientist, and use Python ",
        "start": 76.0,
        "end": 80.0
    },
    {
        "text": "to make your life easier. You can also use Python to build ",
        "start": 80.0,
        "end": 84.0
    },
    {
        "text": "web, mobile and desktop applications as well as software ",
        "start": 84.0,
        "end": 88.0
    },
    {
        "text": "testing or even hacking. So Python is a multi purpose language. ",
        "start": 88.0,
        "end": 92.0
    },
    {
        "text": "Now if you have some programming experience you may say, \"But Mosh",
        "start": 92.0,
        "end": 96.0
    },
    {
        "text": "we can do all this stuff with other programming languages, so what's the big deal ",
        "start": 96.0,
        "end": 100.0
    },
    {
        "text": "about Python?\" Here are a few reasons. With Python you can ",
        "start": 100.0,
        "end": 104.0
    },
    {
        "text": "solve complex problems in less time with fewer lines of code. ",
        "start": 104.0,
        "end": 108.0
    },
    {
        "text": "Here's an example. Let's say we want to extract the first three ",
        "start": 108.0,
        "end": 112.0
    },
    {
        "text": "letters of the text Hello World. This is the code we have to write ",
        "start": 112.0,
        "end": 116.0
    },
    {
        "text": "in C# this is how we do it in JavaScript and here's how we ",
        "start": 116.0,
        "end": 120.0
    },
    {
        "text": "do it in Python. See how short and clean the language is?",
        "start": 120.0,
        "end": 124.0
    },
    {
        "text": "And that's just the beginning. Python makes a lot of trivial things",
        "start": 124.0,
        "end": 128.0
    },
    {
        "text": "really easy with a simple yet powerful syntax. Here are a few",
        "start": 128.0,
        "end": 132.0
    },
    {
        "text": "other reasons Python is so popular. It's a high level language",
        "start": 132.0,
        "end": 136.0
    },
    {
        "text": "so you don't have to worry about complex tasks such as memory management, ",
        "start": 136.0,
        "end": 140.0
    },
    {
        "text": "like you do in C++. It's cross platform which means ",
        "start": 140.0,
        "end": 144.0
    },
    {
        "text": "you can build and run Python applications on Windows, Mac, ",
        "start": 144.0,
        "end": 148.0
    },
    {
        "text": "and Linux. It has a huge community so whenever you get ",
        "start": 148.0,
        "end": 152.0
    },
    {
        "text": "stuck, there is someone out there to help. It has a large ecosystem ",
        "start": 152.0,
        "end": 156.0
    },
    {
        "text": "of libraries, frameworks and tools which means whatever you wanna do",
        "start": 156.0,
        "end": 160.0
    },
    {
        "text": "it is likely that someone else has done it before because Python has been around ",
        "start": 160.0,
        "end": 164.0
    },
    {
        "text": "for over 20 years. So in a nutshell, Python",
        "start": 164.0,
        "end": 168.0
    },
    {
        "text": "is a multi-purpose language with a simple, clean, and beginner-friendly ",
        "start": 168.0,
        "end": 172.0
    },
    {
        "text": "syntax. All of that means Python is awesome.",
        "start": 172.0,
        "end": 176.0
    },
    {
        "text": "Technically everything you do with Python you can do with other programming languages, ",
        "start": 176.0,
        "end": 180.0
    },
    {
        "text": "but Python's simplicity and elegance has made it grow way ",
        "start": 180.0,
        "end": 184.0
    },
    {
        "text": "more than other programming languages. That's why it's the number onne",
        "start": 184.0,
        "end": 188.0
    },
    {
        "text": "language employers are looking for. So whether you're a programmer or ",
        "start": 188.0,
        "end": 192.0
    },
    {
        "text": "an absolute beginner, learning Python opens up lots of job opportunities ",
        "start": 192.0,
        "end": 196.0
    },
    {
        "text": "to you. In fact, the average Python developer earns a whopping",
        "start": 196.0,
        "end": 200.0
    },
    {
        "text": "116,000 dollars a year. If you",
        "start": 200.0,
        "end": 204.0
    },
    {
        "text": "found this video helpful, please support my hard work by liking and sharing it with others. ",
        "start": 204.0,
        "end": 208.0
    },
    {
        "text": "Also, be sure to subscribe to my channel, because I have a couple of",
        "start": 208.0,
        "end": 212.0
    },
    {
        "text": "awesome Python tutorials for you, you're going to see them on the screen now. ",
        "start": 212.0,
        "end": 216.0
    },
    {
        "text": "Here's my Python tutorial for beginners, it's a great starting point if you ",
        "start": 216.0,
        "end": 220.0
    },
    {
        "text": "have limited or no programming experience. On the other hand, if you ",
        "start": 220.0,
        "end": 224.0
    },
    {
        "text": "do have some programming experience and want to quickly get up to speed with Python, ",
        "start": 224.0,
        "end": 228.0
    },
    {
        "text": "I have another tutorial just for you. I'm not going to waste your time ",
        "start": 228.0,
        "end": 232.0
    },
    {
        "text": "telling you what a variable or a function is. I will talk to you like a programmer.",
        "start": 232.0,
        "end": 236.0
    },
    {
        "text": "There's never been a better time to master Python programming,",
        "start": 236.0,
        "end": 240.0
    },
    {
        "text": "so click on the tutorial that is right for you and get started. Thank you for",
        "start": 240.0,
        "end": 244.0
    },
    {
        "text": "watching!",
        "start": 244.0,
        "end": 246.63
    }
]
utils.py
ADDED
@@ -0,0 +1,51 @@
import json


def calculate_ends(input_file_path, output_file_path):
    """Replace each segment's "duration" with an "end" time and save the result."""
    with open(input_file_path, 'r') as file:
        input_data = json.load(file)

    # Iterate through the list of dictionaries and calculate "end" for each one
    for item in input_data:
        item["end"] = round(item["start"] + item["duration"], 2)
        del item["duration"]  # Remove the "duration" key from each dictionary

    # Save the updated data to a new JSON file
    with open(output_file_path, 'w') as output_file:
        json.dump(input_data, output_file)


def create_chunks(input_file_path, output_file_path):
    """Merge every four consecutive segments into one chunk and save the result."""
    with open(input_file_path, 'r') as file:
        output_data = json.load(file)

    combined_json_list = []

    # Calculate the number of groups to create (ceiling of len / 4)
    num_groups = (len(output_data) + 3) // 4

    for group_num in range(num_groups):
        # Calculate the starting index and ending index for the current group
        start_index = group_num * 4
        end_index = min(start_index + 4, len(output_data))

        # Extract the "text" values from the current group of dictionaries
        combined_text = " ".join([item["text"] for item in output_data[start_index:end_index]])

        # Calculate the "start" and "end" for the current group
        group_start = output_data[start_index]["start"]
        group_end = output_data[end_index - 1]["end"]

        # Create the combined JSON for the current group
        combined_json = {
            "text": combined_text,
            "start": group_start,
            "end": group_end,
        }

        combined_json_list.append(combined_json)

    # Save the combined JSON list to a new file
    with open(output_file_path, 'w') as output_file:
        json.dump(combined_json_list, output_file)
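The two transformations in utils.py can also be sketched without file I/O, which may make them easier to try out on a few segments. This is only an illustrative sketch: the helper names add_end_times and group_segments are hypothetical and not part of this commit, and the sample data is taken from the last two entries of transcript.json.

```python
# Hypothetical in-memory variants of calculate_ends / create_chunks (not part of the commit).

def add_end_times(segments):
    """Return new segments where "duration" is replaced by "end" = start + duration."""
    return [
        {
            "text": s["text"],
            "start": s["start"],
            "end": round(s["start"] + s["duration"], 2),
        }
        for s in segments
    ]


def group_segments(segments, size=4):
    """Merge every `size` consecutive segments into one chunk."""
    groups = []
    for i in range(0, len(segments), size):
        window = segments[i:i + size]
        groups.append({
            "text": " ".join(s["text"] for s in window),
            "start": window[0]["start"],   # start of the first segment in the chunk
            "end": window[-1]["end"],      # end of the last segment in the chunk
        })
    return groups


sample = [
    {"text": "so click on the tutorial", "start": 240.0, "duration": 4.0},
    {"text": "watching!", "start": 244.0, "duration": 2.633},
]
chunks = group_segments(add_end_times(sample))
print(chunks)
```

A final group shorter than four segments is kept as a smaller chunk, which matches the min(start_index + 4, len(output_data)) bound in create_chunks.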
youtube.py
ADDED
@@ -0,0 +1,68 @@
import json

from youtube_transcript_api import YouTubeTranscriptApi

import ingest
import run_localGPT
import utils


def audio_to_transcript(video_id):
    """Fetch a video's transcript, write the derived files, and return the plain text."""
    sub = YouTubeTranscriptApi.get_transcript(video_id)

    # Raw segments, each with "text", "start" and "duration"
    with open("transcript.json", "w") as outfile:
        json.dump(sub, outfile)

    # Plain-text transcript written to the source documents folder
    transcript = ' '.join(dct['text'] for dct in sub)
    with open('SOURCE_DOCUMENTS/transcript.txt', 'w') as outfile:
        outfile.write(transcript)

    # Convert durations to end times, then merge segments into chunks
    utils.calculate_ends('transcript.json', 'transcript_end.json')
    utils.create_chunks('transcript_end.json', 'chunks.json')

    return transcript


def start_training():
    """Run ingest.main() and return whatever status it reports."""
    training_status = ingest.main()
    return training_status


def replace_substring_and_following(input_string, substring):
    """Return input_string cut off at the first occurrence of substring, if any."""
    index = input_string.find(substring)
    if index != -1:
        return input_string[:index]
    return input_string


def ask_question(strQuestion):
    answer = run_localGPT.main(device_type='cpu', strQuery=strQuestion)
    # Drop everything from the "Unhelpful Answer" marker onwards
    answer_cleaned = replace_substring_and_following(answer, "Unhelpful Answer")
    return answer_cleaned


def summarize():
    from langchain.chains.summarize import load_summarize_chain
    from langchain.docstore.document import Document
    from langchain.text_splitter import CharacterTextSplitter

    model_id = "TheBloke/Llama-2-7B-Chat-GGML"
    model_basename = "llama-2-7b-chat.ggmlv3.q4_0.bin"

    llm = run_localGPT.load_model(device_type='cpu', model_id=model_id, model_basename=model_basename)

    text_splitter = CharacterTextSplitter()

    with open("SOURCE_DOCUMENTS/transcript.txt") as f:
        file_content = f.read()
    texts = text_splitter.split_text(file_content)

    # Wrap each text piece in a Document for the map-reduce summarization chain
    docs = [Document(page_content=t) for t in texts]

    chain = load_summarize_chain(llm, chain_type="map_reduce")
    summary = chain.run(docs)
    return summary
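The answer clean-up in ask_question cuts the model output at the "Unhelpful Answer" marker. A minimal sketch of the same idea using str.partition is shown below. The name truncate_at and the sample strings are hypothetical, and the sketch assumes a non-empty marker (str.partition raises ValueError on an empty separator, whereas the find-based version would return an empty string).

```python
# Hypothetical alternative to replace_substring_and_following (not part of the commit).

def truncate_at(text, marker):
    """Return the part of `text` before the first `marker`, or all of `text` if absent."""
    # partition returns (before, marker, after); when the marker is missing,
    # "before" is the whole string, so no explicit "not found" branch is needed.
    return text.partition(marker)[0]


with_marker = truncate_at("Paris. Unhelpful Answer: blah", "Unhelpful Answer")
without_marker = truncate_at("Paris.", "Unhelpful Answer")
print(repr(with_marker), repr(without_marker))
```

Trailing whitespace before the marker is kept, as in the original helper. Calling .rstrip() on the result would be one way to trim it if that is wanted.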