joefarrington committed
Commit 73130d4 · 1 Parent(s): b50d8a8

YAML header for HF Space

Files changed (1):
  1. README.md +33 -20
README.md (after this commit):

---
title: Chat with the UCL module catalogue
emoji: 🎓
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "4.44.1"
app_file: app.py
pinned: false
---

# Chat with the 2024/2025 UCL module catalogue

## NOTE

This is a demonstration developed for educational purposes only and is not affiliated with or endorsed by University College London (UCL). The model may provide incorrect or outdated information. Interactions should therefore not be used to inform decisions such as programme choices or module selection.

Please refer to the official [UCL module catalogue](https://www.ucl.ac.uk/module-catalogue) for accurate and up-to-date information.

The code is licensed under the Apache License 2.0, but the module catalogue content is copyright UCL.

## Get started

The easiest way to chat with the model is using the Hugging Face space.

### Local use

You can use the code snippet below to run the app locally. This project uses [uv](https://docs.astral.sh/uv/) to manage dependencies, and the snippet assumes that you have [uv installed](https://docs.astral.sh/uv/getting-started/installation/).

The app requires an [OpenAI API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key) to run locally.

```bash
# Install dependencies
uv sync

# Set your OpenAI API key
export OPENAI_API_KEY=<Your API key>

# Run the app
python app.py
```

One advantage of LangChain is that you could easily substitute the embedding model and/or LLM for alternatives, including locally hosted models using [Hugging Face](https://python.langchain.com/docs/integrations/providers/huggingface/), [llama.cpp](https://python.langchain.com/docs/integrations/providers/llamacpp/) or [Ollama](https://python.langchain.com/docs/integrations/providers/ollama/).
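
For example, a minimal sketch of swapping in locally hosted models through the [langchain-ollama](https://python.langchain.com/docs/integrations/providers/ollama/) integration; the model names are placeholders rather than models this project has been tested with:

```python
# Hypothetical substitution: local models served by Ollama instead of OpenAI.
# Assumes the langchain-ollama package is installed and an Ollama server is running.
from langchain_ollama import ChatOllama, OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")  # placeholder embedding model
llm = ChatOllama(model="llama3.1", temperature=0)        # placeholder chat model
```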

### Rerun scraping and embedding

The repository includes the vectorstore with pages from the module catalogue embedded using OpenAI's [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings).

The process for downloading the pages from the module catalogue, converting the pages to markdown documents, and embedding the documents can be re-run using the script `setup.py`. There is no need to run this script unless you want to change the way data is extracted from the HTML pages to markdown, or embed the documents using an alternative model.

## Implementation details

### Document scraping

The documents in the vectorstore used to provide context to the LLM are based on the publicly available webpages describing each module offered by UCL.

The URLs for the individual module catalogue pages are identified from the module catalogue search page. The module pages are then visited in sequence and the HTML is downloaded for each page.

There are more efficient ways to scrape the content from the module catalogue (e.g. [scrapy](https://scrapy.org/)), but the current method is designed to minimise the load on the server. There is a long wait time between requests, and the raw HTML is saved so that alternative methods of extracting the content can be considered without needing to request additional data from the server.
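
A minimal sketch of this polite-scraping pattern; the wait time, output directory, and filename scheme are illustrative rather than the project's actual values:

```python
import time
from pathlib import Path

import requests

OUTPUT_DIR = Path("data/raw_html")  # illustrative output location
WAIT_SECONDS = 10                   # illustrative delay between requests


def download_pages(urls: list[str]) -> None:
    """Fetch each module page in sequence and save the raw HTML to disk."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    for url in urls:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # Keep the raw HTML so extraction can be reworked without re-scraping.
        filename = url.rstrip("/").rsplit("/", 1)[-1] + ".html"
        (OUTPUT_DIR / filename).write_text(response.text, encoding="utf-8")
        time.sleep(WAIT_SECONDS)  # long pause between requests to spare the server
```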

### Document conversion

The raw HTML for each module page is converted to a markdown document using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the HTML and a [Jinja](https://jinja.palletsprojects.com/en/stable/intro/) template to format the extracted information.
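
A sketch of this parse-then-template pattern; the selectors and template fields are invented for illustration and will differ from the project's actual extraction logic:

```python
from bs4 import BeautifulSoup
from jinja2 import Template

# Hypothetical template; the real one formats many more fields.
MODULE_TEMPLATE = Template("# {{ title }}\n\n## Description\n\n{{ description }}\n")


def html_to_markdown(html: str) -> str:
    """Extract fields from a module page and render them as markdown."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)       # assumed selector
    description = soup.find("p").get_text(strip=True)  # assumed selector
    return MODULE_TEMPLATE.render(title=title, description=description)
```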

### Document embedding

The module pages are relatively short documents, and therefore each is treated as a single chunk and embedded as a whole.

Each page is embedded using [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings). [FAISS](https://faiss.ai/) is used to store and search the embedded documents.
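
A minimal sketch of building and saving the index with LangChain's FAISS and OpenAI integrations; the sample page and save path are illustrative:

```python
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Illustrative input: in the real pipeline each markdown file is one module page.
markdown_pages = {"EXAMPLE0001.md": "# Example Module\n\nAn illustrative description."}

# One Document per page; no chunking, since the pages are short.
docs = [Document(page_content=text, metadata={"source": name})
        for name, text in markdown_pages.items()]

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)
vectorstore.save_local("vectorstore")  # illustrative path
```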

### Q&A using RAG

The chat interface is a simple [Gradio](https://www.gradio.app/) app and uses OpenAI's [gpt-4o-mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) as the underlying LLM.

At each turn of the conversation the following steps are performed, managed using [LangChain](https://python.langchain.com/docs/introduction/) (a sketch follows the list):

- Call the LLM to rephrase the user's query, given the conversation history, so that it includes relevant context from the conversation.

- Embed the rephrased query and retrieve relevant documents from the vectorstore.

- Call the LLM with the current user input, the retrieved documents as context, and the conversation history. Output the result as the LLM's response in the chat interface.
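
A minimal sketch of these three steps using LangChain's history-aware retrieval helpers; the prompts are illustrative, not the app's actual prompts:

```python
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini")
vectorstore = FAISS.load_local(
    "vectorstore",  # illustrative path, as in the embedding sketch above
    OpenAIEmbeddings(model="text-embedding-3-small"),
    allow_dangerous_deserialization=True,  # the index is trusted local data
)

# Step 1: rephrase the user's query in the light of the conversation history.
rephrase_prompt = ChatPromptTemplate.from_messages([
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
    ("human", "Rephrase the question above as a standalone query."),  # illustrative
])

# Step 2: embed the rephrased query and retrieve documents from the vectorstore.
retriever = create_history_aware_retriever(
    llm, vectorstore.as_retriever(), rephrase_prompt
)

# Step 3: answer from the retrieved documents and the conversation history.
answer_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using these module catalogue extracts:\n\n{context}"),  # illustrative
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
qa_chain = create_retrieval_chain(
    retriever, create_stuff_documents_chain(llm, answer_prompt)
)

result = qa_chain.invoke({"input": "Which modules cover reinforcement learning?",
                          "chat_history": []})
print(result["answer"])
```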

## Potential extensions

- Add [course descriptions](https://www.ucl.ac.uk/prospective-students/undergraduate/undergraduate-courses) to the vectorstore so that the app is more useful to potential applicants and can explain, for example, which modules are mandatory on certain courses.

- Provide links to the module catalogue for modules suggested by the application, either within the conversation or as a separate interface element.

- Use an agent-based approach to avoid unnecessary retrieval steps and/or support more complex queries that require multiple retrieval steps.

- Use a LangGraph app to manage the conversation history and state.

## Useful resources

- [UCL module catalogue](https://www.ucl.ac.uk/module-catalogue?collection=drupal-module-catalogue&facetsort=alpha&num_ranks=20&daat=10000&sort=title)

- [LangChain official tutorials](https://python.langchain.com/docs/tutorials/)

- Hands-On Large Language Models [book](https://learning.oreilly.com/library/view/hands-on-large-language/9781098150952/) and [GitHub repository](https://github.com/HandsOnLLM/Hands-On-Large-Language-Models)