---
license: mit
library_name: sklearn
tags:
- sklearn
- skops
- text-classification
model_format: pickle
model_file: model.pkl
---

# Model description

Suicide Detection text classification model.

PYTHON 3.10 ONLY

## Training Procedure

The model was trained on 0.7 (70%) of the Suicide and Depression Detection dataset (https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch).

The model vectorises each text using a trained TF-IDF vectorizer and then classifies it using XGBoost. See main.py for further details.

### Hyperparameters

| Hyperparameter | Value |
|----------------|-------|
| memory | |
| steps | `[('tfidf', TfidfVectorizer(min_df=100, ngram_range=(1, 3), preprocessor=<function preprocessor at 0x7f8d443a30a0>)), ('classifier', XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=None, n_jobs=None, num_parallel_tree=None, random_state=None, ...))]` |
| verbose | True |
| tfidf | `TfidfVectorizer(min_df=100, ngram_range=(1, 3), preprocessor=<function preprocessor at 0x7f8d443a30a0>)` |
| classifier | `XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=None, n_jobs=None, num_parallel_tree=None, random_state=None, ...)` |
| tfidf__analyzer | word |
| tfidf__binary | False |
| tfidf__decode_error | strict |
| tfidf__dtype | |
| tfidf__encoding | utf-8 |
| tfidf__input | content |
| tfidf__lowercase | True |
| tfidf__max_df | 1.0 |
| tfidf__max_features | |
| tfidf__min_df | 100 |
| tfidf__ngram_range | (1, 3) |
| tfidf__norm | l2 |
| tfidf__preprocessor | `<function preprocessor at 0x7f8d443a30a0>` |
| tfidf__smooth_idf | True |
| tfidf__stop_words | |
| tfidf__strip_accents | |
| tfidf__sublinear_tf | False |
| tfidf__token_pattern | `(?u)\b\w\w+\b` |
| tfidf__tokenizer | |
| tfidf__use_idf | True |
| tfidf__vocabulary | |
| classifier__objective | binary:logistic |
| classifier__base_score | |
| classifier__booster | |
| classifier__callbacks | |
| classifier__colsample_bylevel | |
| classifier__colsample_bynode | |
| classifier__colsample_bytree | |
| classifier__device | |
| classifier__early_stopping_rounds | |
| classifier__enable_categorical | False |
| classifier__eval_metric | |
| classifier__feature_types | |
| classifier__gamma | |
| classifier__grow_policy | |
| classifier__importance_type | |
| classifier__interaction_constraints | |
| classifier__learning_rate | |
| classifier__max_bin | |
| classifier__max_cat_threshold | |
| classifier__max_cat_to_onehot | |
| classifier__max_delta_step | |
| classifier__max_depth | |
| classifier__max_leaves | |
| classifier__min_child_weight | |
| classifier__missing | nan |
| classifier__monotone_constraints | |
| classifier__multi_strategy | |
| classifier__n_estimators | |
| classifier__n_jobs | |
| classifier__num_parallel_tree | |
| classifier__random_state | |
| classifier__reg_alpha | |
| classifier__reg_lambda | |
| classifier__sampling_method | |
| classifier__scale_pos_weight | |
| classifier__subsample | |
| classifier__tree_method | |
| classifier__validate_parameters | |
| classifier__verbosity | |

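To make the vectorizer settings above more concrete, here is a minimal plain-Python sketch of what `ngram_range=(1, 3)`, the listed `token_pattern`, and `min_df` mean for feature extraction. It is an approximation written for illustration, not scikit-learn's code: the helper names `word_ngrams` and `vocabulary` are hypothetical, and plain lowercasing stands in for the model's custom `preprocessor`, whose body is not shown in this card. The toy documents and the `min_df=2` threshold are also made up so that the filter has a visible effect on three sentences; the model itself uses `min_df=100`.

```python
import re
from collections import Counter

# Token pattern taken from the tfidf__token_pattern row: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, ngram_range=(1, 3)):
    """Return word n-grams, roughly as a word-level analyzer would produce them."""
    # Assumption: lowercasing stands in for the model's custom preprocessor.
    tokens = TOKEN_PATTERN.findall(text.lower())
    shortest, longest = ngram_range
    grams = []
    for n in range(shortest, longest + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def vocabulary(docs, min_df=100, ngram_range=(1, 3)):
    """Keep only n-grams that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each n-gram once per document (document frequency).
        doc_freq.update(set(word_ngrams(doc, ngram_range)))
    return sorted(gram for gram, df in doc_freq.items() if df >= min_df)


# Hypothetical toy corpus, for illustration only.
toy_docs = ["I feel fine today", "I feel so alone", "Today was fine"]
print(word_ngrams(toy_docs[0]))
print(vocabulary(toy_docs, min_df=2))
```

With `min_df=100` on the real training set, only unigrams, bigrams and trigrams that appear in at least 100 training texts become features, which keeps the vocabulary far smaller than the full set of n-grams.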
### Model Plot
`Pipeline(steps=[('tfidf', TfidfVectorizer(min_df=100, ngram_range=(1, 3), preprocessor=<function preprocessor at 0x7f8d443a30a0>)), ('classifier', XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, featur...importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=None, n_jobs=None, num_parallel_tree=None, random_state=None, ...))], verbose=True)`

## Evaluation Results

| Metric   | Value    |
|----------|----------|
| accuracy | 0.910317 |
| f1 score | 0.910317 |
| ROC AUC  | 0.969008 |

# How to Get Started with the Model

```python
import sklearn  # scikit-learn must be installed to unpickle the pipeline
import dill as pickle  # dill is used in place of the standard pickle module
from skops import hub_utils
from pathlib import Path

# Local directory the model repository is downloaded into.
suicide_detector_repo = Path("./suicide-detector")

hub_utils.download(
    repo_id="AndyJamesTurner/suicideDetector",
    dst=suicide_detector_repo
)

# Load the pickled pipeline (TF-IDF vectorizer + XGBoost classifier).
with open(suicide_detector_repo / "model.pkl", 'rb') as file:
    clf = pickle.load(file)

# predict() takes a list of texts and returns one label per text.
classification = clf.predict(["I want to kill myself"])[0]
```

# Model Evaluation

The model was evaluated on the 0.3 (30%) holdout split using F1 score, accuracy, a confusion matrix and a ROC curve.

## Confusion matrix

![Confusion matrix](confusion_matrix.png)

## ROC Curve

![ROC Curve](roc_curve.png)

# Classification Report

| index        | precision | recall   | f1-score | support  |
|--------------|-----------|----------|----------|----------|
| not suicide  | 0.891721  | 0.934126 | 0.912431 | 34824    |
| suicide      | 0.930785  | 0.886491 | 0.908098 | 34799    |
| accuracy     | 0.910317  | 0.910317 | 0.910317 | 0.910317 |
| macro avg    | 0.911253  | 0.910308 | 0.910265 | 69623    |
| weighted avg | 0.911246  | 0.910317 | 0.910265 | 69623    |

# Model Authors

This model was created by the following authors:

* Andy Turner
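
# Cross-Checking the Classification Report

The per-class figures in the classification report can be related to each other with a few lines of arithmetic. The sketch below reconstructs approximate confusion-matrix counts from the reported recall and support, then recomputes accuracy, precision and F1 from those counts. The counts are an assumption: they are obtained by rounding `recall * support`, not read from `confusion_matrix.png`, and "suicide" is treated as the positive class purely for naming.

```python
# Support (holdout examples per class) and recall, copied from the report.
support = {"not suicide": 34824, "suicide": 34799}
recall = {"not suicide": 0.934126, "suicide": 0.886491}

# Assumption: rounding recall * support recovers the underlying counts.
tn = round(recall["not suicide"] * support["not suicide"])  # true negatives
tp = round(recall["suicide"] * support["suicide"])          # true positives
fp = support["not suicide"] - tn  # "not suicide" texts predicted as "suicide"
fn = support["suicide"] - tp      # "suicide" texts predicted as "not suicide"

# Recompute the headline metrics from the reconstructed counts.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_suicide = tp / (tp + fp)
precision_not_suicide = tn / (tn + fn)
f1_suicide = 2 * tp / (2 * tp + fp + fn)
f1_not_suicide = 2 * tn / (2 * tn + fp + fn)

print(f"accuracy              {accuracy:.6f}")
print(f"precision suicide     {precision_suicide:.6f}")
print(f"precision not suicide {precision_not_suicide:.6f}")
print(f"f1 suicide            {f1_suicide:.6f}")
print(f"f1 not suicide        {f1_not_suicide:.6f}")
```

If the reconstruction holds, the printed values should match the corresponding report entries up to rounding, which is a quick way to confirm that the table is internally consistent.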