docx2txt
newspaper3k
PyPDF2
regex
requests
requests-file
requests-oauthlib
torch
transformers
validator
nltk
sentence-transformers