# -*- coding: utf-8 -*-
"""AmiteshKumarDwivedi_IR_Project.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1Y99JVHjS_jukrw_27BTOe1TAYUquFRSO

# **What is RAG?**
RAG stands for Retrieval Augmented Generation. It can be broken down into three steps:


*  **Retrieval** - Seeking relevant information from a source given a query, e.g. "*What is a term-document incidence matrix?*" -> retrieves passages related to the term-document incidence matrix
*  **Augmented** - Using the retrieved information to augment the input prompt to the LLM with a relevant knowledge base.
*   **Generation** - Using the first two steps to generate an output for a given input.

The goal is to retrieve information and pass it to a large language model so that it generates output grounded in the knowledge provided.

**Why use RAG?**

The main goal is to improve the quality of the generated output:

1.   **Reduce hallucination** - LLMs are prone to hallucination (generating something that looks correct but isn't). RAG pipelines can help LLMs generate more fact/knowledge-based output by providing fact-based input (like a textbook). Furthermore, even if we doubt an answer from a RAG pipeline, we always know which source we can refer to.
2.   **Work with custom data** - LLMs are trained on text data: they model language well but lack specific knowledge. RAG can provide domain-specific knowledge to suit user information needs and use cases, e.g. looking up niche information from niche textbooks.
"""

import os
# if "COLAB_GPU" in os.environ:
#     print("Installing requirements.")
#     !pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
#     !pip install PyMuPDF # for reading PDFs with Python
#     !pip install tqdm # for progress bars
#     !pip install sentence-transformers # for embedding models
#     !pip install accelerate # for quantization model loading
#     !pip install bitsandbytes # for quantizing models (less storage space)
#     !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference
#     !pip install nltk
#     !pip install spacy

"""# Some common terms to know before we proceed

Term | Description
-----|-------------
Token | The smallest meaningful unit of text that a computer can understand. For example, the sentence "hello, world!" could be broken down into the tokens "hello", ",", "world", and "!". A token can be a whole word, part of a word, or a group of punctuation marks. On average, 1 token is roughly equal to 4 characters in English, and 100 tokens is about 75 words. Before an LLM can process text, it needs to be broken down into tokens.
Embedding | A way of representing a piece of data (like a sentence or paragraph of text) as a list of numbers. Similar pieces of data (like sentences with similar meanings) will have similar numerical representations or "embeddings". An embedding for a sentence might be a list of 768 numbers, for example.
Embedding model | A type of computer program that takes in data (like text) and outputs a numerical representation or "embedding" of that data. For example, an embedding model might take in 384 tokens of text and convert it into a list of 768 numbers.
Similarity search/Vector search | A technique for finding data points (like text embeddings) that are "close" or similar to each other in a high-dimensional space. Text about similar topics should have embeddings with high similarity scores, while text on different topics should have lower scores. Common ways to measure similarity include dot product and cosine similarity.
Large Language Model (LLM) | A very large computer program that has learned patterns from a vast amount of text data. When given a piece of text, a generative LLM can continue the text in a way that seems natural and coherent based on the patterns it has learned. For example, if given "hello, world!", it might generate "we're going to build a program today!".
LLM context window | The amount of input data (measured in tokens) that an LLM can process at once. Larger LLMs can handle longer context windows. For example, as of March 2024, GPT-4 could process up to 128,000 tokens (about 384 pages) at once.
Prompt | The input data that is provided to an LLM to generate an output. The way the prompt is structured and framed can greatly influence the LLM's generated text. The technique of carefully designing prompts is called "prompt engineering".

# Document Processing and Creation of Embeddings

## What we need:

*   Information Retrieval Textbook
*   Embedding Model of choice

## Steps:

1.   Import Information Retrieval Textbook - Online or Offline
2.   Process textbook for embedding - splitting into chunks of sentences
3.   Embed text chunks with an embedding model
4.   Save embeddings to a file for later

Importing PDFs and Opening PDFs
"""

import requests

pdf_path = "ir_book.pdf"
if not os.path.exists(pdf_path):
  print("File not available, let me download from the internet")

  #Enter URL to download

  url = "https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf"

  #local filename to save downloaded file
  filename = pdf_path
  download_pdf(url, filename)

  #Sending a GET request to URL
  response = request.get(url)

  #checking if req was success or fail
  if response.status_code == 200:
    #open file and save
    with open(filename, "wb") as file:
      file.write(response.content)
      print(f"The file has been downloaded and saved as {filename}")
  else:
    print(f"Failed to download the file. Status Code: {response.status_code}")

else:

    print(f"File exists")

import fitz  # PyMuPDF
from tqdm import tqdm
import re
import spacy
import random

nlp = spacy.load("en_core_web_sm")

def textFormat(text: str) -> str:
    """Performs basic cleaning of text extracted from the PDF."""
    clean_text = text.replace("\n", " ")                     # replace newlines with spaces
    clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', clean_text)   # remove non-alphanumeric characters
    clean_text = re.sub(r'\s+', ' ', clean_text)             # collapse runs of whitespace into one space
    return clean_text.strip()

def open_read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    page_n_text = []

    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = textFormat(text=text)
        page_n_text.append({
            "page_number": page_number - 25,
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            "page_token_count": len(text) / 7,  # 1 token in mostly 4 characters in english - #100 tokens are approx 75 words. I changed this from 4-6 after EDA because of choice of embedding models used
            "text": text
        })

    return page_n_text

page_n_text = open_read_pdf(pdf_path=pdf_path)
random.sample(page_n_text, k=3)

#EDA

import pandas as pd

df = pd.DataFrame(page_n_text)
df.head()

df.describe().round(3)

"""Token count is important, since
1.   embedding models & LLM dont deal with endless tokens

# Further Text Processing

I will be breaking the text into chunks of sentences.

Workflow
```
Ingest Text -> split it in groups -> make embeddings -> use embeddings
```

Two ways of doing this:
1. Split on "."
2. Using spacy and nltk(already installed here)
"""

import spacy
from spacy.lang.en import English

nlp = English()

#add a sentencizer pipeline
#sentencizer - turning text into sentences

nlp.add_pipe("sentencizer")

for item in tqdm(page_n_text):
  item["sentences"] = list(nlp(item["text"]).sents)

  item["sentences"] = [str(sentence) for sentence in item["sentences"]]#making sure all sentences are string

    # Count the sentences
  item["page_sentence_count_spacy"] = len(item["sentences"])

random.sample(page_n_text, k=2)

import pandas as pd

df = pd.DataFrame(page_n_text)
df.describe().round(2)

"""### Chunking our sentences together

The concept of splitting large text into smaller pieces is referred to as chunking. I will be splitting into groups of 10 sentences.
"""

# define split size to turn groups of sentences into chunks

num_sentence_chunk_size = 10

# split a list into sub-lists of the given size, e.g. a list of 20 becomes 2 lists of 10
def split_list(input_list: list,
               slice_size: int = num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

#Loop thru pages and split sentences into chunks

for item in tqdm(page_n_text):
  item["sentence_chunks"] = split_list(input_list = item["sentences"], slice_size=num_sentence_chunk_size)
  item["num_chunks"] = len(item["sentence_chunks"])

random.sample(page_n_text, k=2)

df = pd.DataFrame(page_n_text)
df.describe().round(2)

"""Splitting each chunk into its own item (a dictionary) so we can attach metadata and perform other operations.

"""

page_n_chunk = []
for item in tqdm(page_n_text):
    for sentence_chunks in item["sentence_chunks"]:
        chunk_dic = {}
        chunk_dic["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)

        joined_sentence_chunk = "".join(sentence_chunks).replace(" ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dic["sentence_chunk"] = joined_sentence_chunk

        #Getting stats on chunks
        chunk_dic["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dic["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dic["chunk_token_count"] = len(joined_sentence_chunk) / 4          #As done above in pre-processing
        page_n_chunk.append(chunk_dic)

# How many chunks do we have?
len(page_n_chunk)

random.sample(page_n_chunk, k=3)

df = pd.DataFrame(page_n_chunk)
df.describe().round(2)

# Show random chunks under the minimum token length, as they might not be very useful.

min_token_length = 20
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

# Filter out chunks at or below the minimum token length
pages_n_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_n_chunks_over_min_token_len[:10]

random.sample(pages_n_chunks_over_min_token_len, k=2)

"""Now we create embeddings for our text chunks. Embeddings are very powerful concept as machines understand numbers more than free-language.

Embeddings are useful numerical representation of text data. They are a learned representation. Vicky Boykus has a great blog on this which I referred to
"""

from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cuda")

sentences = [
"RAG combines the power of retrieval and generation to enhance the quality of AI-generated text.",
"It retrieves relevant documents to provide context, which is then used by a generator to produce coherent and informed responses.",
"This method leverages both neural retrieval and transformer-based generative models, merging the best of both worlds.",
"Using RAG can significantly improve the informativeness and accuracy of responses in natural language processing tasks."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

embeddings[0].shape

# Commented out IPython magic to ensure Python compatibility.
# %%time
# # 2.5 mins on CPU vs 20.8s on CUDA/GPU
# # Switch the device below to compare how long creating embeddings takes on CPU vs GPU
# # Make sure the model is on the GPU
# embedding_model.to("cuda")
# 
# # Embed each chunk one by one
# for item in tqdm(pages_n_chunks_over_min_token_len):
#     item["embedding"] = embedding_model.encode(item["sentence_chunk"])

#Saving Embeddings into file
text_chunks_and_embeddings_df = pd.DataFrame(pages_n_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

#Importing saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

"""Now I will be using these embeddings to retrieve relevant passages based on a query and use relevant passages to augment input to LLM so it generates an output on those relevant passages."""

import random
import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# importing txts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# since csv was string, converting embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# converting texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# convert embeddings to torch tensor and send to device (!!!note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

"""Now I will create a small semantic search pipelines - I will want to search for a query ex. Vector Databases and get relevant passages from textbook

Steps:
1. Define query string
2. Turning query string into embeddings
3. Perform cosine similarity/dot product
4. Rank in decreasing order
"""

from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device=device) # choose the device to load the model to

query = "vector databases"
print(f"Query: {query}")

# Embed the query with the same embedding model

query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# Score the query against every chunk embedding with the dot product
dot_score = util.dot_score(a=query_embedding, b=embeddings)[0]

top_res_dot_product = torch.topk(dot_score, k=5)
top_res_dot_product

# defining helper function to print wrapped text
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_res_dot_product[0], top_res_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

import fitz

# Open PDF and load target page
pdf_path = "ir_book.pdf" # requires PDF to be downloaded
doc = fitz.open(pdf_path)
page = doc.load_page(5 + 25) # textbook page number + 25 to account for the PDF's front matter (matches the page_number offset used above)

# Get the image of the page
img = page.get_pixmap(dpi=300)

# Optional: save the image
#img.save("output_filename.png")
doc.close()

# Convert the Pixmap to a numpy array
img_array = np.frombuffer(img.samples_mv,
                          dtype=np.uint8).reshape((img.h, img.w, img.n))

# Display the image using Matplotlib
import matplotlib.pyplot as plt
plt.figure(figsize=(13, 10))
plt.imshow(img_array)
plt.title(f"Query: '{query}' | Most relevant page:")
plt.axis('off') # Turn off axis
plt.show()

"""Now I'll be exploring both dot-product and cosine similarity metrics

"""

import torch

def dot_product(vector1, vector2):
    return torch.dot(vector1, vector2)

def cosine_similarity(vector1, vector2):
    dot_product = torch.dot(vector1, vector2)

    # Get Euclidean/L2 norm of each vector (removes the magnitude, keeps direction)
    norm_vector1 = torch.sqrt(torch.sum(vector1**2))
    norm_vector2 = torch.sqrt(torch.sum(vector2**2))

    return dot_product / (norm_vector1 * norm_vector2)

# Example tensors
vector1 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector2 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector3 = torch.tensor([4, 5, 6], dtype=torch.float32)
vector4 = torch.tensor([-1, -2, -3], dtype=torch.float32)

# Calculate dot product
print("Dot product between vector1 and vector2:", dot_product(vector1, vector2))
print("Dot product between vector1 and vector3:", dot_product(vector1, vector3))
print("Dot product between vector1 and vector4:", dot_product(vector1, vector4))

# Calculate cosine similarity
print("Cosine similarity between vector1 and vector2:", cosine_similarity(vector1, vector2))
print("Cosine similarity between vector1 and vector3:", cosine_similarity(vector1, vector3))
print("Cosine similarity between vector1 and vector4:", cosine_similarity(vector1, vector4))

"""Now I'll be functionizing my semantic search pipeline"""

from timeit import default_timer as timer

def retrieve_relevant_resources(query: str,
                                embeddings: torch.tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query,
                                   convert_to_tensor=True)

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores,
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """

    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)

    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

#testing
query = "what is a vector database"

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

print_top_results_and_scores(query=query,
                             embeddings=embeddings)

import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
    # Default to the smallest quantized option so the variables below are always defined
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False
    model_id = "google/gemma-2b-it"
else:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

# !pip install huggingface_hub

# !pip install --upgrade transformers

"""Loading the LLM Locally"""

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers.utils import is_flash_attn_2_available

from huggingface_hub import notebook_login

notebook_login()  # Log in with a Hugging Face access token (required to download the gated Gemma models)


# 1. Create quantization config for smaller model loading (optional)
# Requires !pip install bitsandbytes accelerate, see: https://github.com/TimDettmers/bitsandbytes, https://huggingface.co/docs/accelerate/
# For models that require 4-bit quantization (use this if you have low GPU memory available)
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)

# Bonus: Setup Flash Attention 2 for faster inference, default to "sdpa" or "scaled dot product attention" if it's not available
# Flash Attention 2 requires NVIDIA GPU compute capability of 8.0 or above, see: https://developer.nvidia.com/cuda-gpus
# Requires !pip install flash-attn, see: https://github.com/Dao-AILab/flash-attention
if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"
print(f"[INFO] Using attention implementation: {attn_implementation}")

# 2. Pick a model we'd like to use (this will depend on how much GPU memory you have available)
model_id = "google/gemma-7b-it"  # Ensure this model ID is accessible or replace with an accessible model ID
print(f"[INFO] Using model_id: {model_id}")

# 3. Instantiate tokenizer (tokenizer turns text into numbers ready for the model)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

# 4. Instantiate the model
use_quantization_config = True  # Set this to True if you want to use quantization, otherwise set it to False
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 torch_dtype=torch.float16,  # Datatype to use, we want float16
                                                 quantization_config=quantization_config if use_quantization_config else None,
                                                 low_cpu_mem_usage=False,  # Use full memory
                                                 attn_implementation=attn_implementation)  # Which attention version to use

if not use_quantization_config:  # Quantization takes care of device setting automatically, so if it's not used, send model to GPU
    llm_model.to("cuda")

"""Generating Texts.

The tokenized input comes after I pass a string of text to the tokenizer.
"""

input_text = "What is a Vector Database?"
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)
print(f"\nPrompt (formatted):\n{prompt}")

# Commented out IPython magic to ensure Python compatibility.
# %%time
# 
# # Tokenize the input text (turn it into numbers) and send it to GPU
# input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
# print(f"Model input (tokenized):\n{input_ids}\n")
# 
# # Generate outputs passed on the tokenized input
# # See generate docs: https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig
# outputs = llm_model.generate(**input_ids,
#                              max_new_tokens=256) # define the maximum number of new tokens to create
# print(f"Model output (tokens):\n{outputs[0]}\n")

"""decoding output now

"""

# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

""" formatting to replace the prompt in the output text

"""

print(f"Input text: {input_text}\n")
print(f"Output text:\n{outputs_decoded.replace(prompt, '').replace('<bos>', '').replace('<eos>', '')}")

"""Now I will try to use some questions from ChatGPT 3.5, and some of my own questions to out"""

chatgpt_questions = [
    "How does the textbook describe the process of constructing an inverted index?",
    "What are the key differences between Boolean retrieval and vector space models?",
    "What does the textbook say about the evaluation of information retrieval systems?"
]

manual_questions = [
    "What is Boolean IR?",
    "What are the types of term weighing schemes",
]


query_list = chatgpt_questions + manual_questions

#Checking if retrieve_relevant_resources is working with my queries

import random
query = random.choice(query_list)

print(f"Query: {query}")

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

"""Now I will be focusing on Augmenting. In augmentation, we take results from our search for relevant resources and insert them into the prompt we give to the LLM.

We start with a base prompt and update it after we get the retrieved text as context text.
"""

def prompt_formatter(query: str,
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one dotted paragraph
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    # Note: this is very customizable, I've chosen to use 3 examples of the answer style we'd like.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = """Based on the following context items, please answer the query.
Extract relevant passages from the context before answering the query, but do not return the extraction process in your response.
Ensure that the answers are explanatory, leveraging the technical details and examples provided in the textbook as needed.
Use the following examples as reference for the ideal answer style.

Example 1:
Query: How does an inverted index support efficient query processing?
Answer: An inverted index enhances query processing efficiency by mapping each term found in documents to its corresponding list of documents, thus avoiding a linear scan of all documents. This is achieved by maintaining a dictionary where each term is linked to a postings list that records all documents containing the term. To answer a query, the system retrieves the postings lists for the query terms and intersects them, which is efficient due to the sorted nature of these lists. This method significantly reduces the time required to find documents that meet the query criteria compared to scanning each document sequentially.

Example 2:
Query: What role does tokenization play in text preprocessing for information retrieval?
Answer: Tokenization is a crucial step in text preprocessing for information retrieval, where it involves breaking down text into smaller pieces or tokens. This process is fundamental because it determines the granularity at which information is indexed and retrieved. Proper tokenization helps in identifying meaningful elements in the text, such as words or phrases, that are used to build the index. Effective tokenization directly impacts the retrieval effectiveness, as it influences both the construction of the inverted index and the accuracy of the response to user queries.

Example 3:
Query: What are the advantages of using vector space models for information retrieval?
Answer: Vector space models offer significant advantages for information retrieval by allowing the ranking of documents based on their relevance to a query, unlike Boolean models that provide binary results. This model represents both documents and queries as vectors in a multi-dimensional space where each dimension corresponds to a separate term. Relevance is calculated based on the cosine similarity between these vectors, enabling a more nuanced identification of documents that are most likely to satisfy the user's information need. This method facilitates effective retrieval by accommodating partial matching and ranking documents based on their query relevance score.

Now use the following context items from your textbook to answer the user query:
{context}

Relevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""


    # Update base prompt with context items and query
    base_prompt = base_prompt.format(context=context, query=query)

    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
        "content": base_prompt}
    ]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                          tokenize=False,
                                          add_generation_prompt=True)
    return prompt

#Trying out above function

query = random.choice(query_list)
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)

# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                          context_items=context_items)
print(prompt)

"""Tokenizing the above and passing it to our LLM now"""

# Commented out IPython magic to ensure Python compatibility.
# %%time
# 
# input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
# 
# # Generate an output of tokens
# outputs = llm_model.generate(**input_ids,
#                              temperature=0.7, # lower temperature = more deterministic outputs, higher temperature = more creative outputs
#                              do_sample=True, # whether or not to use sampling, referenced from https://huyenchip.com/2024/01/16/sampling.html f
#                              max_new_tokens=256) # how many new tokens to generate from prompt
# 
# # Turn the output tokens into text
# output_text = tokenizer.decode(outputs[0])
# 
# print(f"Query: {query}")
# print(f"RAG answer:\n{output_text.replace(prompt, '')}")

"""Functionizing generation step to make it easier. Also formatting output text to make it easier to read and enabling option to return context items"""

def ask(query,
        temperature=0.7,
        max_new_tokens=512,
        format_answer_text=True,
        return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """

    # Get just the scores and indices of top related results
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings)

    # Create a list of context items
    context_items = [pages_and_chunks[i] for i in indices]

    # Add score to context item
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu() # return score back to CPU

    # Format the prompt with context items
    prompt = prompt_formatter(query=query,
                              context_items=context_items)

    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate an output of tokens
    outputs = llm_model.generate(**input_ids,
                                 temperature=temperature,
                                 do_sample=True,
                                 max_new_tokens=max_new_tokens)

    # Turn the output tokens into text
    output_text = tokenizer.decode(outputs[0])

    if format_answer_text:
        # Replace special tokens and unnecessary help message
        output_text = output_text.replace(prompt, "").replace("<bos>", "").replace("<eos>", "").replace("Sure, here is the answer to the user query:\n\n", "")

    # Only return the answer without the context items
    if return_answer_only:
        return output_text

    return output_text, context_items

"""Trying out the above"""

query = random.choice(query_list)
print(f"Query: {query}")

# Answer query with context and return context
answer, context_items = ask(query=query,
                            temperature=0.7,
                            max_new_tokens=512,
                            return_answer_only=False)

print(f"Answer:\n")
print_wrapped(answer)
print(f"Context items:")
context_items

# !pip install gradio

import gradio as gr

# Function to integrate with Gradio
def rag_chatbot(query):
    answer_text = ask(query, return_answer_only=True)  # This ensures we only get the answer
    return answer_text

# Gradio interface setup
interface = gr.Interface(
    fn=rag_chatbot,
    inputs="text",
    outputs="text",
    title="RAG Chatbot",
    description="This is a Retrieval-Augmented Generation (RAG) Chatbot for Information Retrieval Textbook queries."
)

# Launch the Gradio app
# interface.launch(share = True)

# !git clone [email protected]:spaces/AmiDwivedi/IR_Project

# !gradio deploy