import os
import requests
from bs4 import BeautifulSoup
from pdf2image import convert_from_path
import pytesseract

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader
from langchain.vectorstores.faiss import FAISS
from langchain.embeddings import OpenAIEmbeddings

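def download_pdf(url, filename):
    """Stream the PDF at `url` to `filename` in 8 KiB chunks."""
    print("Downloading pdf...")
    response = requests.get(url, stream=True, timeout=60)
    # Fail early on HTTP errors rather than saving an error page as a PDF.
    response.raise_for_status()
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)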

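def extract_pdf_text(filename):
    """OCR every page of the PDF and return the concatenated text."""
    print("Extracting text from pdf...")
    # Assumes the `tesseract` binary is available on PATH.
    pytesseract.pytesseract.tesseract_cmd = 'tesseract'
    # Render each page to an image, then OCR the pages in order.
    images = convert_from_path(filename)
    return "".join(pytesseract.image_to_string(image) for image in images)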

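def get_arxiv_pdf_url(paper_link):
    """Return the PDF URL for an arXiv link, scraping the abstract page if needed."""
    if paper_link.endswith('.pdf'):
        return paper_link
    print("Getting pdf url...")
    response = requests.get(paper_link)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # This script expects the download anchor to carry a site-relative href.
    link = soup.find('a', {'class': 'mobile-submission-download'})
    if link is None:
        raise ValueError(f"No PDF download link found at {paper_link}")
    return 'https://arxiv.org' + link['href']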
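
# Hypothetical alternative (not part of the original script): a minimal sketch
# that derives the PDF URL from an abstract link without fetching the page.
# It assumes the common https://arxiv.org/abs/<id> -> https://arxiv.org/pdf/<id>
# layout; the function name `arxiv_abs_to_pdf_url` is illustrative only.
def arxiv_abs_to_pdf_url(paper_link):
    """Return the PDF URL for an arXiv abstract link, or None if it is not one."""
    marker = "arxiv.org/abs/"
    if marker not in paper_link:
        return None
    prefix, paper_id = paper_link.split(marker, 1)
    # Drop any query string or fragment trailing the identifier.
    paper_id = paper_id.split("?", 1)[0].split("#", 1)[0].strip("/")
    if not paper_id:
        return None
    return prefix + "arxiv.org/pdf/" + paper_id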

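def read_paper(paper_link):
    """Download the paper, OCR it, and return its text."""
    print("Reading paper...")
    pdf_filename = 'paper.pdf'
    pdf_url = get_arxiv_pdf_url(paper_link)
    download_pdf(pdf_url, pdf_filename)
    try:
        text = extract_pdf_text(pdf_filename)
    finally:
        # Remove the temporary PDF even if OCR fails.
        os.remove(pdf_filename)

    return text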

def convert_to_vectorstore(arxiv_url, api_key):
    """Build a FAISS vector store from the OCR text of an arXiv paper."""
    if not arxiv_url or not api_key:
        return None
    print("Converting to vectorstore...")
    txtfile = "paper.txt"
    with open(txtfile, 'w', encoding='utf-8') as f:
        f.write(read_paper(arxiv_url))

    loader = UnstructuredFileLoader(txtfile)
    raw_documents = loader.load()
    os.remove(txtfile)
    print("Loaded document")

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    documents = text_splitter.split_documents(raw_documents)
    # Pass the key directly; setting and then blanking OPENAI_API_KEY would
    # clobber any key already present in the environment.
    embeddings = OpenAIEmbeddings(openai_api_key=api_key)
    vectorstore = FAISS.from_documents(documents, embeddings)

    return vectorstore