Model description

A binary text classification model that labels a text as suicide or not suicide.

PYTHON 3.10 ONLY

Training Procedure

Trained on 70% of the Suicide and Depression Detection dataset (https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch); the remaining 30% was held out for evaluation.

The model vectorises each text with a fitted TF-IDF vectorizer (word 1- to 3-grams, min_df=100) and then classifies the resulting vector with XGBoost.

See main.py for further details.
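To make the vectorisation step concrete, here is a minimal pure-Python sketch of the weighting that the listed TF-IDF settings imply (word n-grams in the range (1, 3), smoothed IDF, L2 normalisation). It is not the code in main.py. The function names `ngrams`, `fit_idf` and `transform` are hypothetical, the toy corpus is made up, and plain lowercasing stands in for the model's custom preprocessor, which is not reproduced here.

```python
import math
import re
from collections import Counter

# Same regular expression as the tfidf__token_pattern hyperparameter.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def ngrams(text, ngram_range=(1, 3)):
    """Lowercase, tokenise and return every word n-gram in the given range."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


def fit_idf(corpus, min_df=1, ngram_range=(1, 3)):
    """Learn smoothed IDF weights for terms found in at least min_df documents."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(ngrams(text, ngram_range)))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
        if df >= min_df
    }


def transform(text, idf, ngram_range=(1, 3)):
    """Return the L2-normalised TF-IDF weights of one text as a sparse dict."""
    counts = Counter(t for t in ngrams(text, ngram_range) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Toy corpus (made up). min_df is lowered from the model's 100 so terms survive.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
idf = fit_idf(corpus, min_df=2)
vector = transform("the cat sat", idf)
```

In the sketch, a term that appears in every document ("the") receives the minimum IDF of 1.0, so after normalisation it weighs less than the rarer terms in the same text. The trained pipeline then passes such vectors to the XGBoost classifier.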

Hyperparameters

Hyperparameter Value
memory
steps [('tfidf', TfidfVectorizer(min_df=100, ngram_range=(1, 3),
preprocessor=<function preprocessor at 0x7f8d443a30a0>)), ('classifier', XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None, n_jobs=None,
num_parallel_tree=None, random_state=None, ...))]
verbose True
tfidf TfidfVectorizer(min_df=100, ngram_range=(1, 3),
preprocessor=<function preprocessor at 0x7f8d443a30a0>)
classifier XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)
tfidf__analyzer word
tfidf__binary False
tfidf__decode_error strict
tfidf__dtype <class 'numpy.float64'>
tfidf__encoding utf-8
tfidf__input content
tfidf__lowercase True
tfidf__max_df 1.0
tfidf__max_features
tfidf__min_df 100
tfidf__ngram_range (1, 3)
tfidf__norm l2
tfidf__preprocessor <function preprocessor at 0x7f8d443a30a0>
tfidf__smooth_idf True
tfidf__stop_words
tfidf__strip_accents
tfidf__sublinear_tf False
tfidf__token_pattern (?u)\b\w\w+\b
tfidf__tokenizer
tfidf__use_idf True
tfidf__vocabulary
classifier__objective binary:logistic
classifier__base_score
classifier__booster
classifier__callbacks
classifier__colsample_bylevel
classifier__colsample_bynode
classifier__colsample_bytree
classifier__device
classifier__early_stopping_rounds
classifier__enable_categorical False
classifier__eval_metric
classifier__feature_types
classifier__gamma
classifier__grow_policy
classifier__importance_type
classifier__interaction_constraints
classifier__learning_rate
classifier__max_bin
classifier__max_cat_threshold
classifier__max_cat_to_onehot
classifier__max_delta_step
classifier__max_depth
classifier__max_leaves
classifier__min_child_weight
classifier__missing nan
classifier__monotone_constraints
classifier__multi_strategy
classifier__n_estimators
classifier__n_jobs
classifier__num_parallel_tree
classifier__random_state
classifier__reg_alpha
classifier__reg_lambda
classifier__sampling_method
classifier__scale_pos_weight
classifier__subsample
classifier__tree_method
classifier__validate_parameters
classifier__verbosity

Model Plot

Pipeline(steps=[('tfidf',TfidfVectorizer(min_df=100, ngram_range=(1, 3),preprocessor=<function preprocessor at 0x7f8d443a30a0>)),('classifier',XGBClassifier(base_score=None, booster=None, callbacks=None,colsample_bylevel=None, colsample_bynode=None,colsample_bytree=None, device=None,early_stopping_rounds=None,enable_categorical=False, eval_metric=None,featur...importance_type=None,interaction_constraints=None, learning_rate=None,max_bin=None, max_cat_threshold=None,max_cat_to_onehot=None, max_delta_step=None,max_depth=None, max_leaves=None,min_child_weight=None, missing=nan,monotone_constraints=None, multi_strategy=None,n_estimators=None, n_jobs=None,num_parallel_tree=None, random_state=None, ...))],verbose=True)

Evaluation Results

Metric Value
accuracy 0.910317
f1 score 0.910317
ROC AUC 0.969008

How to Get Started with the Model

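# scikit-learn and xgboost must be installed for the pickled pipeline to load
import sklearn
# dill is used in place of pickle so the custom preprocessor function can be restored
import dill as pickle

from skops import hub_utils
from pathlib import Path

# Local directory that will receive the model repository
suicide_detector_repo = Path("./suicide-detector")

hub_utils.download(
    repo_id="AndyJamesTurner/suicideDetector",
    dst=suicide_detector_repo
)

# Load the fitted pipeline (TF-IDF vectorizer followed by the XGBoost classifier)
with open(suicide_detector_repo / "model.pkl", 'rb') as file:
    clf = pickle.load(file)

# predict takes a list of raw texts and returns one label per text
classification = clf.predict(["I want to kill myself"])[0]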
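If a score is more useful than a hard label, the loaded pipeline's `predict_proba` method can be used in place of `predict`. The following is a minimal sketch: the helper names `suicide_probabilities` and `flag_texts` are hypothetical, and `positive_index=1` assumes that column 1 of the probability output corresponds to the suicide class, which should be confirmed against `clf.classes_` before relying on it.

```python
def suicide_probabilities(clf, texts, positive_index=1):
    """Return (text, probability) pairs from the classifier's predict_proba.

    positive_index is an ASSUMPTION: it presumes that column 1 of the
    predict_proba output is the "suicide" class. Check clf.classes_ first.
    """
    probabilities = clf.predict_proba(texts)
    return [
        (text, float(row[positive_index]))
        for text, row in zip(texts, probabilities)
    ]


def flag_texts(clf, texts, threshold=0.5, positive_index=1):
    """Return the texts whose positive-class probability reaches the threshold."""
    pairs = suicide_probabilities(clf, texts, positive_index)
    return [text for text, probability in pairs if probability >= threshold]
```

With the `clf` object loaded above, `flag_texts(clf, ["I want to kill myself"], threshold=0.8)` would return only the texts scored at 0.8 or higher. Raising the threshold trades recall for precision.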

Model Evaluation

The model was evaluated on a 0.3 holdout split using f1 score, accuracy, confusion matrix and ROC curves.

Confusion matrix

[Figure: confusion matrix]

ROC Curve

[Figure: ROC curve]

Classification Report

class         precision  recall    f1-score  support
not suicide   0.891721   0.934126  0.912431  34824
suicide       0.930785   0.886491  0.908098  34799
accuracy                           0.910317  69623
macro avg     0.911253   0.910308  0.910265  69623
weighted avg  0.911246   0.910317  0.910265  69623
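The counts behind these figures can be recovered from the classification report itself, since recall multiplied by support gives the number of correct predictions per class. The sketch below does that arithmetic and recomputes the headline metrics as a consistency check. The counts are inferred by rounding, not read from the original evaluation output, and suicide is treated as the positive class by assumption.

```python
# Per-class recall and support copied from the classification report.
report = {
    "not suicide": {"recall": 0.934126, "support": 34824},
    "suicide": {"recall": 0.886491, "support": 34799},
}

# Correct predictions per class = recall * support, rounded to a whole count.
correct = {
    label: round(row["recall"] * row["support"])
    for label, row in report.items()
}

# ASSUMPTION: "suicide" is the positive class.
true_negatives = correct["not suicide"]
false_positives = report["not suicide"]["support"] - true_negatives
true_positives = correct["suicide"]
false_negatives = report["suicide"]["support"] - true_positives

total = sum(row["support"] for row in report.values())
accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={true_negatives} FP={false_positives} "
      f"FN={false_negatives} TP={true_positives}")
print(f"accuracy={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f}")
```

Under these assumptions the inferred counts are 32530 true negatives, 2294 false positives, 3950 false negatives and 30849 true positives, and the recomputed accuracy, precision, recall and F1 agree with the reported values to six decimal places.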

Model Authors

This model was created by the following authors:

  • Andy Turner