omarsol committed on
Commit a7eefa4 · 1 Parent(s): 6c8910a

Add comprehensive Claude instructions for AI Tutor App

Files changed (1): CLAUDE.md (+142, -0)
CLAUDE.md ADDED
# AI Tutor App Instructions for Claude

## Project Overview
This is an AI tutor application that uses RAG (Retrieval-Augmented Generation) to provide accurate responses about AI concepts by searching through multiple documentation sources. The application has a Gradio UI and uses ChromaDB for vector storage.

## Key Repositories and URLs
- Main code: https://github.com/towardsai/ai-tutor-app
- Live demo: https://huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot
- Vector database: https://huggingface.co/datasets/towardsai-tutors/ai-tutor-vector-db
- Private JSONL repo: https://huggingface.co/datasets/towardsai-tutors/ai-tutor-data

## Architecture Overview
- Frontend: Gradio-based UI in `scripts/main.py`
- Retrieval: Custom retriever using ChromaDB vector stores
- Embedding: Cohere embeddings for vector search
- LLM: OpenAI models (GPT-4o, etc.) for context addition and responses
- Storage: Individual JSONL files per source + a combined file for retrieval
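
These components combine into a standard RAG loop. The sketch below is illustrative only; the Chroma path and collection name, the Cohere embedding model, and the prompt format are assumptions, not values taken from the repo (the custom retriever in `scripts/` is authoritative):

```python
# Illustrative RAG loop only; path, collection, model, and prompt are assumptions.
import os

import chromadb
import cohere
from openai import OpenAI

co = cohere.Client(os.environ["COHERE_API_KEY"])
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "What is retrieval augmented generation?"

# 1. Embed the query with Cohere.
query_embedding = co.embed(
    texts=[question],
    model="embed-english-v3.0",
    input_type="search_query",
).embeddings[0]

# 2. Retrieve the closest chunks from the persisted ChromaDB store.
chroma = chromadb.PersistentClient(path="chroma-db-all_sources")
collection = chroma.get_collection("all_sources")
hits = collection.query(query_embeddings=[query_embedding], n_results=5)

# 3. Answer with an OpenAI model, grounded in the retrieved chunks.
context = "\n\n".join(hits["documents"][0])
response = llm.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```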

## Data Update Workflows

### 1. Adding a New Course
```bash
python data/scraping_scripts/add_course_workflow.py --course [COURSE_NAME]
```
- This requires the course to be configured in `process_md_files.py` under `SOURCE_CONFIGS`
- The workflow pauses for manual URL addition after processing the markdown files
- Only new content has context added by default (efficient)
- Use `--process-all-context` if you need to regenerate context for all documents
- Both the database and the data files are uploaded to HuggingFace by default
- Use `--skip-data-upload` if you don't want to upload data files

### 2. Updating Documentation from GitHub
```bash
python data/scraping_scripts/update_docs_workflow.py
```
- Updates all supported documentation sources (or select particular ones with `--sources`)
- Downloads fresh documentation from the GitHub repositories
- Only new content has context added by default (efficient)
- Use `--process-all-context` if you need to regenerate context for all documents
- Both the database and the data files are uploaded to HuggingFace by default
- Use `--skip-data-upload` if you don't want to upload data files

### 3. Data File Management
```bash
# Upload both JSONL and PKL files to the private HuggingFace repository
python data/scraping_scripts/upload_data_to_hf.py
```

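A minimal sketch of what `upload_data_to_hf.py` plausibly does with `huggingface_hub`; the file list and paths below are assumptions (only the repo ID appears in this document):

```python
# Hedged sketch of the upload step; file paths are assumptions.
from huggingface_hub import HfApi

api = HfApi()  # uses HF_TOKEN from the environment or the cached login

for path in ["data/all_sources_data.jsonl", "data/all_sources_contextual_nodes.pkl"]:
    api.upload_file(
        path_or_fileobj=path,
        path_in_repo=path.split("/")[-1],
        repo_id="towardsai-tutors/ai-tutor-data",
        repo_type="dataset",
    )
```
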
## Data Flow and File Relationships

### Document Processing Pipeline
1. Markdown files → `process_md_files.py` → individual JSONL files (e.g., `transformers_data.jsonl`)
2. Individual JSONL files → `combine_all_sources()` → `all_sources_data.jsonl`
3. `all_sources_data.jsonl` → `add_context_to_nodes.py` → `all_sources_contextual_nodes.pkl`
4. `all_sources_contextual_nodes.pkl` → `create_vector_stores.py` → ChromaDB vector stores

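For orientation, step 2 can be pictured as a simple merge of the per-source JSONL files. This is a hedged sketch, not the actual `combine_all_sources()` implementation in `process_md_files.py`:

```python
# Hedged sketch of the combine step; the real implementation may differ.
import glob

with open("data/all_sources_data.jsonl", "w") as out:
    for source_file in sorted(glob.glob("data/*_data.jsonl")):
        if source_file.endswith("all_sources_data.jsonl"):
            continue  # don't re-ingest the combined output file
        with open(source_file) as f:
            for line in f:  # each line is one JSON document record
                out.write(line)
```
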
### Important Files and Their Purpose
- `all_sources_data.jsonl` - Combined raw document data without context
- Source-specific JSONL files (e.g., `transformers_data.jsonl`) - Raw data for individual sources
- `all_sources_contextual_nodes.pkl` - Processed nodes with added context
- `chroma-db-all_sources` - Vector database directory containing embeddings
- `document_dict_all_sources.pkl` - Dictionary mapping document IDs to full documents

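When debugging, the two pickled artifacts can be inspected directly. Paths and object shapes below are assumptions based on the descriptions above:

```python
# Quick inspection of the pickled artifacts; paths/shapes are assumptions.
import pickle

with open("all_sources_contextual_nodes.pkl", "rb") as f:
    nodes = pickle.load(f)
print(f"{len(nodes)} contextual nodes")  # assumes a list of nodes

with open("document_dict_all_sources.pkl", "rb") as f:
    document_dict = pickle.load(f)
print(f"{len(document_dict)} documents keyed by ID")
```
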
## Configuration Details

### Adding a New Course Source
1. Update `SOURCE_CONFIGS` in `process_md_files.py`:
```python
"new_course": {
    "base_url": "",
    "input_directory": "data/new_course",
    "output_file": "data/new_course_data.jsonl",
    "source_name": "new_course",
    "use_include_list": False,
    "included_dirs": [],
    "excluded_dirs": [],
    "excluded_root_files": [],
    "included_root_files": [],
    "url_extension": "",
},
```

2. Update the UI configurations (see the sketch below) in:
   - `setup.py`: Add to `AVAILABLE_SOURCES` and `AVAILABLE_SOURCES_UI`
   - `main.py`: Add a mapping to the `source_mapping` dictionary

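Purely illustrative, the step-2 edits for a hypothetical `new_course` might look like this; check the actual variable shapes in `setup.py` and `main.py` before editing:

```python
# Hypothetical step-2 additions; variable shapes are assumptions.

# setup.py
AVAILABLE_SOURCES = [
    # ... existing sources ...
    "new_course",
]
AVAILABLE_SOURCES_UI = [
    # ... existing display names ...
    "New Course",
]

# main.py
source_mapping = {
    # ... existing entries ...
    "New Course": "new_course",
}
```
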
## Deployment and Publishing

### GitHub Actions Workflow
The application is automatically deployed to HuggingFace Spaces when changes are pushed to the main branch (excluding documentation and scraping scripts).

### Manual Deployment
```bash
git push --force https://$HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/towardsai-tutors/ai-tutor-chatbot main:main
```

## Development Environment Setup

### Required Environment Variables
- `OPENAI_API_KEY` - For LLM processing
- `COHERE_API_KEY` - For embeddings
- `HF_TOKEN` - For HuggingFace uploads
- `GITHUB_TOKEN` - For accessing documentation via the GitHub API

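A small fail-fast check (an optional addition, not part of the repo) can verify these before running any workflow script:

```python
# Optional pre-flight check for the environment variables listed above.
import os

REQUIRED = ["OPENAI_API_KEY", "COHERE_API_KEY", "HF_TOKEN", "GITHUB_TOKEN"]
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```
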
### Running the Application Locally
```bash
# Install dependencies
pip install -r requirements.txt

# Start the Gradio UI
python scripts/main.py
```

## Important Notes

1. When adding new courses, make sure to:
   - Place markdown files exported from Notion in the appropriate directory
   - Add URLs manually from the live course platform
   - Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
   - Configure the course in `process_md_files.py`
   - Verify it appears in the UI after deployment

2. For updating documentation:
   - The GitHub API is used to fetch the latest documentation
   - The workflow handles updating existing sources without affecting course data

3. For efficient context addition:
   - Only new content gets processed by default
   - Old nodes for updated sources are removed from the PKL file
   - This ensures no duplicate content in the vector database

## Technical Details for Debugging

### Node Removal Logic
- When adding context, the workflow now removes existing nodes for sources being updated
- This prevents duplication of content in the vector database
- The source of each node is extracted from either `node.source_node.metadata` or `node.metadata` (see the sketch below)

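A hedged sketch of that fallback; the metadata key name (`"source"`) and the PKL path are assumptions:

```python
# Sketch of the source-extraction fallback; key name and path are assumptions.
import pickle

with open("all_sources_contextual_nodes.pkl", "rb") as f:
    nodes = pickle.load(f)

def get_node_source(node):
    # Prefer the parent document's metadata, fall back to the node's own.
    source_node = getattr(node, "source_node", None)
    if source_node is not None and source_node.metadata.get("source"):
        return source_node.metadata["source"]
    return node.metadata.get("source")

updated_sources = {"transformers"}  # sources being refreshed (example)
nodes = [n for n in nodes if get_node_source(n) not in updated_sources]
```
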
### Performance Considerations
- Context addition is the most time-consuming step (it calls the OpenAI API)
- The new default behavior only processes new content
- For large updates, consider running in batches (see the sketch below)
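
One possible batching pattern; `new_documents` and `add_context` are placeholders for the workflow's own objects, not names from the repo:

```python
# Generic batching helper; batch size is arbitrary.
def batches(items, size=50):
    for i in range(0, len(items), size):
        yield items[i : i + size]

# new_documents / add_context stand in for the workflow's own data and call.
for batch in batches(new_documents, size=50):
    add_context(batch)
```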