PasswordHealthModel

Model Type: Random Forest Classifier
Framework: scikit-learn
Task: Password Strength Classification (Weak / Medium / Strong)

Overview

PasswordHealthModel is a machine learning model that classifies passwords into three strength levels:

Weak (0)
Medium (1)
Strong (2)

The model is a Random Forest Classifier trained on 300,000 labeled passwords. It is designed for integration into password management systems, where it provides real-time strength evaluation and guidance.

Intended Uses

Integration into password managers (e.g., Password Utility) for evaluating password health.
Providing real-time feedback on password strength and generating recommendations for stronger passwords.
Enforcing password strength policies in security-focused applications.

Training Data

Weak: 100,000 passwords sourced from the SecLists dataset.
Medium: 100,000 synthetically generated passwords (8–12 characters, alphanumeric, 20% with symbols).
Strong: 100,000 synthetically generated passwords (12–16 characters, alphanumeric + symbols).

All passwords were stripped of whitespace prior to feature extraction.
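
The synthetic Medium and Strong classes described above could be generated along the following lines. This is a minimal sketch, not the author's generator: the function names, the use of `string.punctuation` as the symbol set, and the choice to place exactly one symbol in the 20% of Medium passwords that contain symbols are all assumptions made for illustration.

```python
import random
import string

ALNUM = string.ascii_letters + string.digits
SYMBOLS = string.punctuation  # assumption: the card does not list the symbol set


def generate_medium(rng, symbol_share=0.2):
    """8-12 alphanumeric characters; roughly 20% of outputs get one symbol."""
    length = rng.randint(8, 12)  # randint is inclusive on both ends
    chars = [rng.choice(ALNUM) for _ in range(length)]
    if rng.random() < symbol_share:
        # Assumption: a single symbol replaces one randomly chosen position.
        chars[rng.randrange(length)] = rng.choice(SYMBOLS)
    return "".join(chars)


def generate_strong(rng):
    """12-16 characters drawn from letters, digits, and symbols."""
    length = rng.randint(12, 16)
    return "".join(rng.choice(ALNUM + SYMBOLS) for _ in range(length))
```

Passing a seeded `random.Random` instance (for example `random.Random(42)`) keeps the generated dataset reproducible. The `random` module is acceptable here only because the output is training data; real passwords should come from the `secrets` module.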

Features (10 Total)

length: Number of characters.
entropy: Shannon entropy of characters.
has_upper: Binary flag indicating presence of uppercase characters.
has_symbol: Binary flag indicating presence of special characters.
has_leet: Binary flag for leet-speak characters (e.g., @, 3, !, 0).
repetition: Binary flag indicating a run of three or more identical consecutive characters.
digit_ratio: Ratio of digits to total length.
unique_ratio: Ratio of unique characters to total length.
bigram_entropy: Entropy of character pairs (bigrams).
compression_ratio: Ratio of compressed length to original length using zlib compression.
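
The ten features above can be sketched with the standard library alone. This is an illustrative reconstruction rather than the project's actual extraction code. The function names, the leet-speak set (limited to the four examples given), the use of `string.punctuation` for symbols, and the use of `str.strip()` for whitespace handling are assumptions.

```python
import math
import re
import string
import zlib
from collections import Counter

# Assumption: only the four example characters are treated as leet-speak.
LEET_CHARS = set("@3!0")


def shannon_entropy(items):
    """Shannon entropy, in bits per item, of a sequence of hashable items."""
    if not items:
        return 0.0
    total = len(items)
    counts = Counter(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(password):
    """Return the ten features as a dict keyed by feature name."""
    pw = password.strip()  # assumption: leading/trailing whitespace is removed
    n = len(pw)
    if n == 0:
        raise ValueError("password is empty after whitespace removal")
    bigrams = [pw[i:i + 2] for i in range(n - 1)]
    raw = pw.encode("utf-8")
    return {
        "length": n,
        "entropy": shannon_entropy(pw),
        "has_upper": int(any(ch.isupper() for ch in pw)),
        "has_symbol": int(any(ch in string.punctuation for ch in pw)),
        "has_leet": int(any(ch in LEET_CHARS for ch in pw)),
        # Three identical characters in a row, e.g. "aaa".
        "repetition": int(re.search(r"(.)\1\1", pw) is not None),
        "digit_ratio": sum(ch.isdigit() for ch in pw) / n,
        "unique_ratio": len(set(pw)) / n,
        "bigram_entropy": shannon_entropy(bigrams),
        "compression_ratio": len(zlib.compress(raw)) / len(raw),
    }
```

For short inputs the compression ratio is typically above 1, because the zlib header and checksum outweigh any savings. The feature is therefore most informative for longer or highly repetitive passwords.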

Model Architecture

Algorithm: Random Forest Classifier (scikit-learn)
Hyperparameters:
- n_estimators: 200
- max_depth: 20
- min_samples_split: 5
- random_state: 42
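
In scikit-learn, the configuration above corresponds to the constructor call below. The `build_model` helper and the `LABELS` mapping are hypothetical names added for illustration. Every parameter not listed above is left at the scikit-learn default, which is an assumption about the original training setup.

```python
from sklearn.ensemble import RandomForestClassifier

# Class indices as given in the Overview section.
LABELS = {0: "Weak", 1: "Medium", 2: "Strong"}


def build_model():
    """Random forest with the four hyperparameters listed in this card."""
    return RandomForestClassifier(
        n_estimators=200,
        max_depth=20,
        min_samples_split=5,
        random_state=42,
    )
```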

Performance

Evaluation Setup: 80/20 train-test split (240,000 training samples, 60,000 test samples)
Accuracy: ~96.7% (±0.6% standard deviation)
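
A hold-out evaluation of this kind can be sketched as follows. The `evaluate_holdout` function is a hypothetical helper, and the stratified split and the seed value are assumptions. The card does not state how the split was drawn or how the standard deviation was obtained, so this sketch reports a single accuracy figure only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_holdout(X, y, model=None, test_size=0.2, seed=42):
    """Fit on 80% of the data and return accuracy on the held-out 20%."""
    if model is None:
        # Hyperparameters as listed under Model Architecture.
        model = RandomForestClassifier(
            n_estimators=200, max_depth=20, min_samples_split=5, random_state=42
        )
    # Assumption: stratify so each class keeps its share in both partitions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))
```

Here `X` is expected to be a table of the ten features (one row per password) and `y` the corresponding 0/1/2 labels.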

Limitations

Feature engineering is heuristic-based and may not fully capture all password patterns across different contexts.
Primarily trained on English-like and synthetic passwords.
Potential overfitting to synthetic strong password patterns.

Ethical Considerations

Weak-password data is drawn from publicly available breach lists (via SecLists) and was handled with care. The model does not store actual user passwords and is intended only for classification tasks.

Dependencies

This project relies on the following open-source libraries and datasets:

pandas: Data manipulation and analysis (BSD-3-Clause License).
scikit-learn: Machine learning framework for the Random Forest Classifier (BSD-3-Clause License).
joblib: Model persistence and parallel computation (BSD-3-Clause License).
SecLists: Dataset for weak passwords (MIT License).

If redistributing this project, please include the respective license texts for these dependencies.

Citation

Khokhar, Naa'il Ahmad. (2025). PasswordHealthModel: A Random Forest Model for Password Strength Classification. Hugging Face Model Hub.