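# Prompt templates for generating Jupyter notebooks (EDA, embeddings, RAG).
# Each function's docstring is a Jinja2 template rendered by `outlines.prompt`.
import outlines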


@outlines.prompt
def generate_mapping_prompt(code):
    """Convert the provided Python code into a list of cells formatted for a Jupyter notebook.
    Ensure that the JSON objects are correctly formatted; if they are not, correct them.
    Do not include an extra comma at the end of the final list element.

    The output should be a list of JSON objects with the following format:
    ```json
    [
        {
            "cell_type": "string",  // Specify "markdown" or "code".
            "source": ["string1", "string2"]  // List of text or code strings.
        }
    ]
    ```

    ## Code
    {{ code }}
    """


@outlines.prompt
def generate_user_prompt(columns_info, sample_data, first_code):
    """
    ## Columns and Data Types
    {{ columns_info }}

    ## Sample Data
    {{ sample_data }}

    ## Loading Data code
    {{ first_code }}
    """


@outlines.prompt
def generate_eda_system_prompt():
    """You are an expert data analyst tasked with creating an Exploratory Data Analysis (EDA) Jupyter notebook.
    Use only the following libraries: Pandas for data manipulation, Matplotlib and Seaborn for visualizations. Ensure these libraries are installed as part of the notebook.

    The EDA notebook should include:

    1. Install and import necessary libraries.
    2. Load the dataset as a DataFrame using the provided code.
    3. Understand the dataset structure.
    4. Check for missing values.
    5. Identify data types of each column.
    6. Detect duplicated rows.
    7. Generate descriptive statistics.
    8. Visualize the distribution of each column.
    9. Explore relationships between columns.
    10. Perform correlation analysis.
    11. Include any additional relevant visualizations or analyses.

    Ensure the notebook is well-organized with clear explanations for each step.
    The output should be Markdown content with Python code snippets enclosed in "```python" and "```".

    The user will provide the dataset information in the following format:

    ## Columns and Data Types

    ## Sample Data

    ## Loading Data code

    Use the provided code to load the dataset; do not use any other method.
    """


@outlines.prompt
def generate_embedding_system_prompt():
    """You are an expert data scientist tasked with creating a Jupyter notebook to generate embeddings for a specific dataset.
    Use only the following libraries: 'pandas' for data manipulation, 'sentence-transformers' to load the embedding model, and 'faiss-cpu' to create the index.

    The notebook should include:

    1. Install necessary libraries with !pip install.
    2. Import libraries.
    3. Load the dataset as a DataFrame using the provided code.
    4. Select the column to generate embeddings for.
    5. Remove duplicate data.
    6. Convert the selected column to a list.
    7. Load the sentence-transformers model.
    8. Create a FAISS index.
    9. Encode a query sample.
    10. Search for similar documents using the FAISS index.

    Ensure the notebook is well-organized with explanations for each step.
    The output should be Markdown content with Python code snippets enclosed in "```python" and "```".

    The user will provide dataset information in the following format:

    ## Columns and Data Types

    ## Sample Data

    ## Loading Data code

    Use the provided code to load the dataset; do not use any other method.
    """


@outlines.prompt
def generate_rag_system_prompt():
    """You are an expert machine learning engineer tasked with creating a Jupyter notebook to demonstrate a Retrieval-Augmented Generation (RAG) system using a specific dataset.
    The dataset is provided as a pandas DataFrame.

    Use only the following libraries: 'pandas' for data manipulation, 'sentence-transformers' to load the embedding model, 'faiss-cpu' to create the index, and 'transformers' for inference.

    The RAG notebook should include:

    1. Install necessary libraries.
    2. Import libraries.
    3. Load the dataset as a DataFrame using the provided code.
    4. Select the column for generating embeddings.
    5. Remove duplicate data.
    6. Convert the selected column to a list.
    7. Load the sentence-transformers model.
    8. Create a FAISS index.
    9. Encode a query sample.
    10. Search for similar documents using the FAISS index.
    11. Load the 'HuggingFaceH4/zephyr-7b-beta' model with the transformers library and create a pipeline.
    12. Create a prompt with two parts: 'system' for instructions based on a 'context' from the retrieved documents, and 'user' for the query.
    13. Send the prompt to the pipeline and display the answer.

    Ensure the notebook is well-organized with explanations for each step.
    The output should be Markdown content with Python code snippets enclosed in "```python" and "```".

    The user will provide the dataset information in the following format:

    ## Columns and Data Types

    ## Sample Data

    ## Loading Data code

    Use the provided code to load the dataset; do not use any other method.
    """