Spaces: Volkopat/arXivGPT

Volko committed on Apr 16, 2023

Commit a58f539, 1 parent: ccc9ab3

Reverted

Files changed (2)

app.py CHANGED

@@ -135,11 +135,11 @@ with block:
             <div style="text-align:center">
                 <p>Developed by <a href='https://www.linkedin.com/in/dekay/'>Github and Huggingface: Volkopat</a></p>
                 <p>Powered by <a href='https://openai.com/'>OpenAI</a>, <a href='https://arxiv.org/'>arXiv</a> and <a href='https://github.com/hwchase17/langchain'>LangChain 🦜️🔗</a></p>
-                <p>ArxivGPT is a chatbot that answers questions about research papers from Arxiv.org.</p>
+                <p>ArxivGPT is a chatbot that answers questions about research papers. It uses a pretrained GPT-3.5 model to generate answers.</p>
                 <p>Currently, it can answer questions about the paper you just linked.</p>
-                <p>It's still in development, so please report any bugs you find.</p>
+                <p>It's still in development, so please report any bugs you find. It can take up to a minute to start a conversation for every new paper as there is a parsing delay.</p>
                 <p>The answers can be quite limited as there is a 4096 token limit for GPT-3.5, hence waiting for GPT-4 access to upgrade.</p>
-                <p>Possible upgrades coming up: GPT-4, status messages, other research paper hubs.</p>
+                <p>Possible upgrades coming up: GPT-4, faster parsing, status messages, other research paper hubs.</p>
             </div>
             <style>
                 p {

pdf2vectorstore.py CHANGED

@@ -5,7 +5,6 @@ from bs4 import BeautifulSoup
 from pdf2image import convert_from_path
 import pytesseract
 import pickle
-from concurrent.futures import ThreadPoolExecutor
 from langchain.text_splitter import RecursiveCharacterTextSplitter
 from langchain.document_loaders import UnstructuredFileLoader
@@ -19,19 +18,14 @@ def download_pdf(url, filename):
         for chunk in response.iter_content(chunk_size=8192):
             f.write(chunk)
-def extract_image_text(image):
-    return pytesseract.image_to_string(image)
 def extract_pdf_text(filename):
     print("Extracting text from pdf...")
     pytesseract.pytesseract.tesseract_cmd = 'tesseract'
     images = convert_from_path(filename)
     text = ""
-    with ThreadPoolExecutor() as executor:
-        text_parts = list(executor.map(extract_image_text, images))
-    text = "".join(text_parts)
+    for image in images:
+        text += pytesseract.image_to_string(image)
     return text
 def get_arxiv_pdf_url(paper_link):
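
The two versions of `extract_pdf_text` in this diff differ only in how the page images are OCRed: the removed code maps over the pages with a `ThreadPoolExecutor`, while the restored code calls `pytesseract.image_to_string` in a plain loop. A minimal standalone sketch of both strategies follows. The function names `extract_text_sequential` and `extract_text_threaded`, and the `ocr` and `max_workers` parameters, are hypothetical and not part of the repository; they are included only so the two approaches can be compared, or tried with a stub when Tesseract is not installed.

```python
from concurrent.futures import ThreadPoolExecutor


def _default_ocr(image):
    # Imported lazily so the sketch can be exercised without Tesseract installed.
    import pytesseract
    return pytesseract.image_to_string(image)


def extract_text_sequential(images, ocr=None):
    """OCR pages one after another (the approach this commit restores)."""
    ocr = ocr or _default_ocr
    text = ""
    for image in images:
        text += ocr(image)
    return text


def extract_text_threaded(images, ocr=None, max_workers=None):
    """OCR pages in a thread pool (the approach this commit removes)."""
    ocr = ocr or _default_ocr
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, so page order is kept.
        return "".join(executor.map(ocr, images))
```

Both helpers should return the same string for the same pages, since `executor.map` preserves input order. Whether the threaded variant is actually faster, and whether it is related to the Space's runtime behaviour, is not stated in the commit.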