sentence-transformers pdf2image scikit-learn pytesseract