SkimLit: NLP Model for Medical Abstracts

SkimLit is a natural language processing (NLP) project that aims to make medical abstracts easier to read. It replicates the methodology outlined in the paper "PubMed 200K RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts," using TensorFlow and several deep learning techniques.

Project Overview

Section 1

Data Collection

  • The PubMed 200K RCT dataset is obtained from the author's GitHub repository using the following commands:
git clone https://github.com/Franck-Dernoncourt/pubmed-rct
cd pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign

Data Preprocessing

  • Sentences are extracted from the dataset, and numeric labels are assigned for machine learning models.
  • A baseline model is established as a reference point for the more complex models that follow.
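
As a rough illustration of the extraction step, the sketch below parses lines in the dataset's plain-text layout (an abstract ID line starting with ###, then one LABEL<TAB>sentence line per sentence, then a blank line) into one record per sentence and maps each label to an integer. The helper name parse_rct_lines, the record keys, the lowercasing, the zero-based total_lines convention, and the sample sentences are assumptions made for this example, not details confirmed by the project.

```python
def parse_rct_lines(lines):
    """Convert raw PubMed RCT lines into one dict per sentence (hypothetical helper)."""
    samples = []
    current = []  # (label, sentence) pairs of the abstract being read

    def flush():
        # Emit every sentence of the finished abstract together with its position.
        for i, (label, text) in enumerate(current):
            samples.append({
                "target": label,
                "text": text.lower(),
                "line_number": i,
                "total_lines": len(current) - 1,  # assumed zero-based convention
            })
        current.clear()

    for line in lines:
        line = line.strip()
        if line.startswith("###"):   # an abstract ID starts a new abstract
            flush()
        elif not line:               # a blank line ends the current abstract
            flush()
        elif "\t" in line:           # "LABEL<TAB>sentence"
            label, text = line.split("\t", 1)
            current.append((label, text))
    flush()  # covers input that does not end with a blank line
    return samples


# Made-up lines in the dataset's layout, for illustration only.
raw_lines = [
    "###00000001\n",
    "OBJECTIVE\tTo test a treatment over @ weeks .\n",
    "METHODS\tA total of @ patients were randomized .\n",
    "RESULTS\tPain scores improved .\n",
    "\n",
    "###00000002\n",
    "BACKGROUND\tSleep quality matters .\n",
    "CONCLUSIONS\tMore research is needed .\n",
]

samples = parse_rct_lines(raw_lines)

# Assign each label an integer ID, in alphabetical order of the label names.
class_names = sorted({s["target"] for s in samples})
label_to_id = {name: i for i, name in enumerate(class_names)}
numeric_labels = [label_to_id[s["target"]] for s in samples]
```

Keeping line_number and total_lines on each record at this stage means the positional features used by the later models are available without re-reading the file.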

Baseline Model (Model 0)

  • TF-IDF Multinomial Naive Bayes Classifier is implemented.
  • Classification evaluation metrics such as accuracy, precision, recall, and F1-score are employed.
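
A minimal sketch of such a baseline with scikit-learn is shown below. The toy sentences, the helper name calculate_results, and the choice of weighted averaging for precision, recall, and F1 are assumptions for illustration; the project's exact settings are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def calculate_results(y_true, y_pred):
    """Return accuracy plus weighted precision, recall and F1 (hypothetical helper)."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data for illustration only; the real inputs are the extracted sentences.
train_sentences = [
    "patients were randomized to two groups",
    "participants received the drug or placebo",
    "pain scores improved significantly",
    "mortality was lower in the treatment group",
    "the treatment is effective and safe",
    "further trials are warranted",
]
train_labels = ["METHODS", "METHODS", "RESULTS", "RESULTS", "CONCLUSIONS", "CONCLUSIONS"]
val_sentences = ["patients received placebo", "scores improved in the treatment group"]
val_labels = ["METHODS", "RESULTS"]

# TF-IDF features feed a multinomial Naive Bayes classifier.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
baseline.fit(train_sentences, train_labels)

val_preds = baseline.predict(val_sentences)
baseline_results = calculate_results(val_labels, val_preds)
print(baseline_results)
```

Returning the metrics as a dictionary makes it easy to reuse the same function for every later model and compare them side by side.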

Deep Sequence Models

Model 1: Conv1D with Token Embeddings

  • A custom text vectorization layer (TextVectorizer) and a token embedding layer are created.
  • Input pipelines are built with the TensorFlow tf.data API for efficient loading.
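
The sketch below shows one way these pieces can fit together: a TextVectorization layer, a tf.data pipeline with batching and prefetching, and a small Conv1D classifier on top of a token embedding. All sizes (vocabulary, sequence length, filters, batch size) are placeholder assumptions, and applying the vectorizer inside the tf.data pipeline rather than inside the model is a choice made for this example, not necessarily how the project does it.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Toy sentences and integer labels, for illustration only.
sentences = [
    "patients were randomized to two groups",
    "pain scores improved significantly",
    "the treatment is effective and safe",
    "this study examines a new therapy",
]
labels = [2, 4, 1, 3]
NUM_CLASSES = 5

# Learn a vocabulary and map every sentence to a fixed-length integer sequence.
text_vectorizer = layers.TextVectorization(max_tokens=1000, output_sequence_length=20)
text_vectorizer.adapt(tf.data.Dataset.from_tensor_slices(sentences).batch(2))

# tf.data pipeline: batch, vectorize, then prefetch so input prep overlaps training.
dataset = (
    tf.data.Dataset.from_tensor_slices((sentences, labels))
    .batch(2)
    .map(lambda text, label: (text_vectorizer(text), label))
    .prefetch(tf.data.AUTOTUNE)
)

model = tf.keras.Sequential([
    layers.Embedding(input_dim=1000, output_dim=16),  # token embeddings
    layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)
model.fit(dataset, epochs=1, verbose=0)

# One row of class probabilities per sentence.
pred_probs = model.predict(dataset, verbose=0)
```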

Model 2: Pretrained Token Embeddings

  • Universal Sentence Encoder (USE) from TensorFlow Hub is used for feature extraction.

Model 3: Conv1D with Character Embeddings

  • Character-level tokenizer and embedding are implemented.
  • Conv1D model is constructed using character embeddings.
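
To make the character-level idea concrete, the following sketch splits a sentence into characters and encodes them as padded integer IDs in plain Python. The alphabet, the reserved IDs (0 for padding, 1 for unknown characters), and the helper names are assumptions for this example; in a TensorFlow workflow a vectorization layer would typically handle this step.

```python
import string


def split_chars(text):
    """Insert a space between every character so each one becomes a token."""
    return " ".join(list(text))


# Assumed alphabet: lowercase letters, digits and punctuation.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation
# ID 0 is reserved for padding and ID 1 for unknown characters.
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_chars(text, max_len):
    """Map a sentence to a fixed-length list of character IDs (spaces are dropped)."""
    ids = [CHAR_TO_ID.get(ch, 1) for ch in text.lower() if ch != " "]
    ids = ids[:max_len]                      # truncate long sentences
    return ids + [0] * (max_len - len(ids))  # pad short ones with zeros


print(split_chars("rct"))
print(encode_chars("a b", max_len=4))
```

The resulting integer sequences can then be passed to an embedding layer in the same way as token IDs, only with a much smaller vocabulary.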

Model 4: Hybrid Embedding Layer

  • Token and character-level embeddings are combined using layers.Concatenate.
  • A model is developed to process both types of embeddings and output label probabilities.
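
A minimal sketch of a two-branch model joined with layers.Concatenate is given below. To keep it self-contained it takes already-vectorized integer IDs as inputs; the layer sizes, the bidirectional LSTM on the character branch, and the random input data are assumptions for illustration rather than the project's actual architecture.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 5

# Token branch: integer token IDs -> embeddings -> one vector per sentence.
token_inputs = layers.Input(shape=(20,), dtype="int32", name="token_ids")
token_embeddings = layers.Embedding(input_dim=1000, output_dim=32)(token_inputs)
token_features = layers.GlobalAveragePooling1D()(token_embeddings)

# Character branch: integer character IDs -> embeddings -> bidirectional LSTM.
char_inputs = layers.Input(shape=(100,), dtype="int32", name="char_ids")
char_embeddings = layers.Embedding(input_dim=70, output_dim=8)(char_inputs)
char_features = layers.Bidirectional(layers.LSTM(8))(char_embeddings)

# Join both feature vectors into one hybrid representation (32 + 16 values).
hybrid = layers.Concatenate(name="token_char_hybrid")([token_features, char_features])
hidden = layers.Dense(32, activation="relu")(hybrid)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(hidden)

model = tf.keras.Model(inputs=[token_inputs, char_inputs], outputs=outputs)

# Random IDs stand in for real vectorized sentences.
token_ids = np.random.randint(0, 1000, size=(3, 20)).astype("int32")
char_ids = np.random.randint(0, 70, size=(3, 100)).astype("int32")
pred_probs = model.predict([token_ids, char_ids], verbose=0)
```

The softmax output gives one probability per label for each sentence, so the predicted label is the index of the largest value in each row.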

Model 5: Transfer Learning with Positional Embeddings

  • Positional embeddings are introduced to enhance the model's understanding of the sequence.
  • A tribrid embedding model is created, combining token embeddings, character embeddings, and positional features (line_number and total_lines).
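
One simple way to turn line_number and total_lines into model inputs is one-hot encoding. The NumPy sketch below encodes both features; the depths of 15 and 20 and the sample values are assumptions for illustration. Out-of-range positions become all-zero rows, which matches how tf.one_hot treats indices outside the depth.

```python
import numpy as np


def one_hot(indices, depth):
    """One-hot encode integer indices; out-of-range indices yield all-zero rows."""
    indices = np.asarray(indices)
    encoded = np.zeros((len(indices), depth), dtype=np.float32)
    in_range = (indices >= 0) & (indices < depth)
    encoded[np.nonzero(in_range)[0], indices[in_range]] = 1.0
    return encoded


# Illustrative positions for four sentences taken from one long abstract.
line_numbers = [0, 1, 2, 16]
total_lines = [16, 16, 16, 16]

line_numbers_one_hot = one_hot(line_numbers, depth=15)  # assumed depth
total_lines_one_hot = one_hot(total_lines, depth=20)    # assumed depth

print(line_numbers_one_hot.shape, total_lines_one_hot.shape)
```

Choosing a fixed depth keeps the input size constant; the trade-off is that sentences beyond that depth lose their positional signal.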

Model Evaluation and Comparison

  • Models are evaluated on various datasets to compare their performance.

Save and Load Models

  • Models are saved and loaded for future use.

Model Loading and Evaluation

  • Pre-trained models are loaded and evaluated on validation datasets.

Test Dataset Processing and Prediction

  • A test dataset is created, preprocessed, and used for making predictions with the loaded model.

Enriching Test Dataframe with Predictions

  • Predictions and additional columns are added to the test dataframe for analysis.

Finding Top Wrong Predictions

  • The top 100 most inaccurately predicted samples are identified.

Investigating Top Wrong Predictions

  • Detailed information on the top 10 wrong predictions is displayed.
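
The enrichment and error-analysis steps above can be sketched with pandas as follows. The column names, the toy rows, and the reading of "most inaccurately predicted" as wrong predictions with the highest predicted probability are assumptions for this example.

```python
import numpy as np
import pandas as pd

class_names = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]

# Toy test dataframe and prediction probabilities, for illustration only.
test_df = pd.DataFrame({
    "target": ["METHODS", "RESULTS", "BACKGROUND", "CONCLUSIONS"],
    "text": [
        "patients were randomized .",
        "scores were recorded weekly .",
        "we aimed to compare two drugs .",
        "the drug is effective .",
    ],
})
pred_probs = np.array([
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.02, 0.02, 0.02, 0.90, 0.04],
    [0.10, 0.70, 0.10, 0.05, 0.05],
])

# Add the predicted label, its probability, and a correctness flag.
test_df["prediction"] = [class_names[i] for i in pred_probs.argmax(axis=1)]
test_df["pred_prob"] = pred_probs.max(axis=1)
test_df["correct"] = test_df["prediction"] == test_df["target"]

# Wrong predictions, most confident first; keep at most the top 100.
top_wrong = (
    test_df[~test_df["correct"]]
    .sort_values("pred_prob", ascending=False)
    .head(100)
)

# Show the details of the first 10 rows.
for row in top_wrong.head(10).itertuples():
    print(f"Target: {row.target}, Pred: {row.prediction}, Prob: {row.pred_prob:.2f}")
    print(f"Text: {row.text}\n")
```

Confidently wrong predictions are often the most informative ones to inspect, since they can point to mislabeled samples or sentences whose role is genuinely ambiguous.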

Section 2

Example Abstracts

  • Example abstracts are downloaded from a GitHub repository.

Processing Example Abstracts with spaCy

  • spaCy is used to parse sentences from example abstracts.
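
A minimal sketch of sentence splitting with spaCy is shown below. It uses a blank English pipeline with the rule-based sentencizer component, which is an assumption about the setup; the abstract text is made up for illustration.

```python
from spacy.lang.en import English

# Blank English pipeline with the rule-based sentence splitter (spaCy v3 API).
nlp = English()
nlp.add_pipe("sentencizer")

# Made-up abstract text, for illustration only.
abstract = (
    "This trial compared two treatments for chronic pain. "
    "Patients were randomized into two groups. "
    "Pain scores improved in both groups."
)

doc = nlp(abstract)
abstract_lines = [sent.text.strip() for sent in doc.sents]

for i, line in enumerate(abstract_lines):
    print(i, line)
```

The rule-based sentencizer needs no model download, but it splits on punctuation only, so abbreviations and decimal numbers in real abstracts may need a more careful splitter.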

One-Hot Encoding and Prediction on Example Abstracts

  • Line numbers and total lines are one-hot encoded, and predictions are made using the loaded model.

Visualizing Predictions on Example Abstracts

  • Predicted sequence labels for each line in the abstract are displayed.
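
As a small illustration of the display step, the sketch below pairs each abstract line with its most probable label. The helper name label_abstract_lines, the class order, and the probability values are made up for this example.

```python
class_names = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]


def label_abstract_lines(lines, pred_probs, class_names):
    """Prefix each line with the label that has the highest probability."""
    labelled = []
    for line, probs in zip(lines, pred_probs):
        best = max(range(len(probs)), key=lambda i: probs[i])
        labelled.append(f"{class_names[best]}: {line}")
    return labelled


# Hypothetical sentences and model outputs (one probability row per sentence).
abstract_lines = [
    "This trial compared two treatments for chronic pain.",
    "Patients were randomized into two groups.",
    "Pain scores improved in both groups.",
]
pred_probs = [
    [0.10, 0.05, 0.05, 0.70, 0.10],
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.05, 0.10, 0.05, 0.05, 0.75],
]

for labelled_line in label_abstract_lines(abstract_lines, pred_probs, class_names):
    print(labelled_line)
```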

Conclusion

  • SkimLit provides a comprehensive exploration of NLP techniques for medical abstracts, from baseline models to sophisticated deep learning architectures. The models are evaluated, compared, and applied to real-world examples, offering insights into their strengths and limitations.

  • Feel free to explore the code, experiment with different models, and contribute to the advancement of SkimLit.
