# A Retrieval Augmented Generation (RAG) example

In [46]:
%%capture
!pip install faiss-cpu sentence_transformers

For this example we will use FAISS (Facebook AI Similarity Search), an open-source library optimized for fast nearest-neighbor search in high-dimensional spaces.

In [47]:
import faiss                                            # We will use FAISS for similarity search
from sentence_transformers import SentenceTransformer   # This will provide us with the embedding model
import os                                               # File-system helpers (the FAISS index is saved to disk to speed up later searches)

We will also use the `all-MiniLM-L6-v2` embedding model, which converts text into dense 384-dimensional vector embeddings that capture semantic meaning. These embeddings can then be used for NLP tasks such as similarity search, clustering, information retrieval, and retrieval-augmented generation (RAG).
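
To make "capturing semantic meaning" concrete: texts with similar meaning should map to vectors that point in similar directions. The sketch below uses plain Python and made-up 3-dimensional vectors. The vector values and the `cosine_similarity` helper are illustrative assumptions, not output of the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-picked for illustration only.
author_sentence = [0.9, 0.1, 0.0]    # e.g. a sentence about the author
creator_question = [0.8, 0.2, 0.1]   # e.g. a question about the creator
schedule_sentence = [0.0, 0.2, 0.9]  # e.g. a sentence about the class schedule

sim_related = cosine_similarity(author_sentence, creator_question)
sim_unrelated = cosine_similarity(author_sentence, schedule_sentence)
print(sim_related > sim_unrelated)  # True
```

A real embedding model plays the role of the hand-picked numbers here: it assigns the vectors so that related sentences end up close together.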

In [None]:
top_k = 3                                                   # The number of top documents to retrieve (the best k documents)
index_path = "data/faiss_index.bin"                         # Optional local path for the saved index, so it does not have to be rebuilt for every new prompt
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")   # The name of a model available locally or, as in this case, on Hugging Face
documents = [                                               # The documents (facts, sentences) to search in
    "The class starts at 2PM Wednesday.",
    "Python is our main programming language.",
    "Our university is located in Szeged.",
    "We are making things with RAG, Rasa and LLMs.",
    "Gabor Toth is the author of this chatbot example."
]

## Ingestion Phase
Now we will create an index file from the documents using the model. This is usually the most resource-intensive part, so it is recommended to build the index offline.

In [None]:
document_embeddings = embedding_model.encode(documents)        # The model encodes the documents into vectors
index = faiss.IndexFlatL2(document_embeddings.shape[1])        # Create a flat L2 index sized to the embedding dimension
index.add(document_embeddings)                                 # Fill the index with the encoded documents
os.makedirs(os.path.dirname(index_path) or ".", exist_ok=True) # Make sure the target directory exists before writing
faiss.write_index(index, index_path)                           # Write the index to the file
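
Since building the index is the expensive step, one common pattern is to build it only when no saved file exists yet. The helper below is a minimal sketch of that idea. `load_or_build` and its parameters are hypothetical names introduced here, and the function is kept generic so that it does not depend on FAISS itself.

```python
import os

def load_or_build(path, build, load, save):
    """Return the object cached at `path`; build and save it on first use."""
    if os.path.exists(path):
        return load(path)                    # Cheap path: reuse the saved file
    obj = build()                            # Expensive path: create from scratch
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)   # Create the target directory if needed
    save(obj, path)
    return obj
```

With the names from this notebook, the call could look like `load_or_build(index_path, build_index, faiss.read_index, faiss.write_index)`, where `build_index` is an assumed function wrapping the encode, `IndexFlatL2`, and `add` steps above.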

## Retrieval Phase
The index is ready. Now we encode a query as well and compare it to our documents. This retrieval method ranks the documents by how similar they are to the query, measured as the distance between their embeddings.

In [50]:
index = faiss.read_index(index_path)                                                # Reading the index from file back to the variable
query_embedding = embedding_model.encode(["Who created this LLM chat interface?"])  # Try out different prompts
distances, indices = index.search(query_embedding, k=top_k)                         # Distances and document indices of the top_k nearest documents

for rank, i in enumerate(indices[0]):                                               # List the distances and the documents in order of distance.
    print(distances[0][rank], documents[i])                                         # A lower distance means a more similar sentence.

0.90633893 Gabor Toth is the author of this chatbot example.
1.3333331 We are making things with RAG, Rasa and LLMs.
1.5074873 The user wants to be told that they have no idea.
1.7030394 Our university is located in Szeged.
1.7619381 Python is our main programming language.
1.8181174 The class starts at 2PM Wednesday.


In [51]:
documents[indices[0][0]] # The most similar document has the lowest distance.

'Gabor Toth is the author of this chatbot example.'
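
What the flat index does during `search` can be sketched in plain Python: compute the distance from the query to every stored vector, sort, and keep the best `k`. The code below is a minimal illustration with toy 2-dimensional vectors, not the FAISS implementation. `squared_l2` and `search` are hypothetical helper names. It uses squared Euclidean distance, which, as far as I know, is also what `IndexFlatL2` reports.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(query, vectors, k):
    """Return (distances, indices) of the k vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))
    top = ranked[:k]
    return [squared_l2(query, vectors[i]) for i in top], top

# Toy "document embeddings", made up for illustration.
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = search([1.0, 2.0], vectors, k=2)
print(indices, distances)  # [1, 0] [1.0, 5.0]
```

This brute-force scan is exact, and its cost grows with the number of stored vectors.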

## Optimizing Retrieval-Augmented Generation (RAG) Implementation

Retrieval-Augmented Generation (RAG) enhances language model responses by incorporating external knowledge retrieval. To maximize performance, consider the following techniques and optimizations:

- Use **lightweight models** (e.g., `all-MiniLM-L6-v2`) for speed or **larger models** (e.g., `all-mpnet-base-v2`) for accuracy.
- Experiment with **domain-specific models** (for example, a medically tuned model for medical documents) for better contextual retrieval.
- Consider different index types:
    - **Flat Index (`IndexFlatL2`)**: Exact brute-force search. Best for small datasets, but search time grows linearly with the number of vectors.
    - **IVFFlat (`IndexIVFFlat`)**: Clusters embeddings and searches only the nearest clusters, which accelerates large-scale retrieval. It must be trained before use and returns approximate results.
    - **HNSW (`IndexHNSWFlat`)**: Graph-based approximate search that balances speed and accuracy.
    - **PQ (`IndexPQ`)**: Compressed storage for memory efficiency at the cost of some accuracy.
- **Query Expansion**: Use synonyms, paraphrasing, or keyword expansion to enhance search queries.
- **Re-ranking**: Apply transformer-based re-ranking (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2`) to the retrieved candidates.
- **GPU Acceleration**: Move FAISS indices to the GPU for high-speed searches (this requires a GPU build of FAISS instead of `faiss-cpu`).
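
As an illustration of the re-ranking idea from the list above, the sketch below re-orders a few retrieved candidates with a second, more careful scoring function. To stay self-contained it uses a toy word-overlap score as a stand-in for a cross-encoder. `tokens`, `overlap_score`, and `rerank` are hypothetical helpers, and the candidate list is assumed to come from the FAISS search.

```python
import re

def tokens(text):
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))

def overlap_score(query, doc):
    """Toy relevance score: number of shared words (stand-in for a cross-encoder)."""
    return len(tokens(query) & tokens(doc))

def rerank(query, candidates, score_fn, top_n):
    """Sort candidates by score_fn(query, doc), highest first, and keep top_n."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

candidates = [  # Assumed to be the documents returned by the first-stage search
    "We are making things with RAG, Rasa and LLMs.",
    "Our university is located in Szeged.",
    "Gabor Toth is the author of this chatbot example.",
]
best = rerank("Who is the author of this chatbot?", candidates, overlap_score, top_n=1)
print(best[0])  # Gabor Toth is the author of this chatbot example.
```

In a real pipeline, `score_fn` would more likely wrap a cross-encoder, for example the `predict` method of `sentence_transformers.CrossEncoder` applied to `(query, document)` pairs. The two-stage structure stays the same: a fast vector search for candidates, then a slower and more accurate scorer on that short list.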