# AI Tutor App Data Workflows
This directory contains scripts for managing the AI Tutor App's data pipeline.
## Workflow Scripts
### 1. Adding a New Course
To add a new course to the AI Tutor:
```bash
python add_course_workflow.py --course [COURSE_NAME]
```
This will guide you through the complete process:
- Process markdown files from Notion exports
- Prompt you to manually add URLs to the course content
- Merge the course data into the main dataset
- Add contextual information to document nodes
- Create vector stores
- Upload databases to HuggingFace
- Update UI configuration
Requirements before running:
- The course name must be properly configured in `process_md_files.py` under `SOURCE_CONFIGS`
- Course markdown files must be placed in the directory specified in the configuration
- You must have access to the live course platform to add URLs
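As an illustration, a `SOURCE_CONFIGS` entry in `process_md_files.py` might look like the sketch below. The key and field names here are assumptions, not the actual schema — copy an existing entry from the script as your template:

```python
# Hypothetical shape of a SOURCE_CONFIGS entry in process_md_files.py.
# Field names are illustrative; match them to the real schema in the script.
SOURCE_CONFIGS = {
    "python_for_genai": {
        # Directory holding the Notion markdown exports for this course
        "input_dir": "data/python_for_genai_md",
        # Human-readable name shown in the UI source dropdown
        "display_name": "Python for GenAI",
    },
}
```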
### 2. Updating Documentation via GitHub API
To update library documentation from GitHub repositories:
```bash
python update_docs_workflow.py
```
This will update all supported documentation sources. You can also target individual sources:
```bash
python update_docs_workflow.py --sources transformers peft
```
The workflow includes:
- Downloading documentation from GitHub using the API
- Processing markdown files to create JSONL data
- Adding contextual information to document nodes
- Creating vector stores
- Uploading databases to HuggingFace
### 3. Uploading JSONL to HuggingFace
To upload the main JSONL file to a private HuggingFace repository:
```bash
python upload_jsonl_to_hf.py
```
This is useful for sharing the latest data with team members.
## Individual Components
If you need to run specific steps individually:
- GitHub to Markdown: `github_to_markdown_ai_docs.py`
- Process Markdown: `process_md_files.py`
- Add Context: `add_context_to_nodes.py`
- Create Vector Stores: `create_vector_stores.py`
- Upload to HuggingFace: `upload_dbs_to_hf.py`
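The components above run in pipeline order. A minimal sketch of chaining them from Python (assuming each script runs without extra arguments, which may not hold for every step):

```python
import subprocess

# The individual component scripts, in pipeline order.
PIPELINE = [
    "github_to_markdown_ai_docs.py",
    "process_md_files.py",
    "add_context_to_nodes.py",
    "create_vector_stores.py",
    "upload_dbs_to_hf.py",
]

def run_pipeline(scripts=PIPELINE, dry_run=False):
    """Run each script in order, stopping on the first failure."""
    executed = []
    for script in scripts:
        cmd = ["python", script]
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on non-zero exit
        executed.append(" ".join(cmd))
    return executed

if __name__ == "__main__":
    for cmd in run_pipeline(dry_run=True):
        print("would run:", cmd)
```

`check=True` makes a failed step abort the rest of the pipeline, so a broken vector-store build never triggers an upload.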
## Tips for New Team Members
To update the AI Tutor with new content:
- For new courses, use `add_course_workflow.py`
- For updated documentation, use `update_docs_workflow.py`
When adding URLs to course content:
- Get the URLs from the live course platform
- Add them to the generated JSONL file in the `url` field
- Example URL format: `https://academy.towardsai.net/courses/take/python-for-genai/multimedia/62515980-course-structure`
- Make sure every document has a valid URL
By default, only new content will have context added, to save time and resources. Use `--process-all-context` only if you need to regenerate context for all documents. Use `--skip-data-upload` if you don't want to upload data files to the private HuggingFace repo (they're uploaded by default).

When adding a new course, verify that it appears in the Gradio UI:
- The workflow automatically updates `main.py` and `setup.py` to include the new source
- Check that the new source appears in the dropdown menu in the UI
- Make sure it's properly included in the default selected sources
- Restart the Gradio app to see the changes
First time setup or missing files:
- Both workflows automatically check for and download required data files:
  - `all_sources_data.jsonl` - contains the raw document data
  - `all_sources_contextual_nodes.pkl` - contains the processed nodes with added context
- If the PKL file exists, the `--new-context-only` flag will only process new content
- You must have proper HuggingFace credentials with access to the private repository
Make sure you have the required environment variables set:
- `OPENAI_API_KEY` for LLM processing
- `COHERE_API_KEY` for embeddings
- `HF_TOKEN` for HuggingFace uploads
- `GITHUB_TOKEN` for accessing documentation via the GitHub API
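A small preflight check for these variables (names taken from the list above) can save a failed run partway through a workflow:

```python
import os

# Required variables and what they are used for.
REQUIRED_ENV_VARS = {
    "OPENAI_API_KEY": "LLM processing",
    "COHERE_API_KEY": "embeddings",
    "HF_TOKEN": "HuggingFace uploads",
    "GITHUB_TOKEN": "GitHub API access",
}

def missing_env_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]

if __name__ == "__main__":
    for name in missing_env_vars():
        print(f"missing: {name} ({REQUIRED_ENV_VARS[name]})")
```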