PasswordHealthModel
Model Type: Random Forest Classifier
Framework: scikit-learn
Task: Password Strength Classification (Weak / Medium / Strong)
Overview
PasswordHealthModel is a machine learning model that classifies passwords into three strength levels:
- Weak (0)
- Medium (1)
- Strong (2)
The model leverages a Random Forest Classifier trained on 300,000 labeled passwords and is designed for integration into password management systems to provide real-time strength evaluation and guidance.
Intended Uses
- Integration into password managers (e.g., Password Utility) for evaluating password health.
- Providing real-time feedback on password strength and generating recommendations for stronger passwords.
- Enforcing password strength policies in security-focused applications.
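As a rough illustration of the integration and policy uses listed above, the sketch below maps a predicted class index to its label and checks it against a minimum strength level. The helper names `classify` and `meets_policy` and the `minimum` parameter are hypothetical and not part of the published model. `model` stands for any object with a scikit-learn style `predict` method.

```python
# Label mapping taken from the Overview section of this card.
LABELS = {0: "Weak", 1: "Medium", 2: "Strong"}


def classify(model, feature_row) -> str:
    """Return the strength label for one feature vector.

    `model` is assumed to expose predict(list_of_rows) like a
    scikit-learn estimator; `feature_row` is one 10-value feature vector.
    """
    predicted = int(model.predict([list(feature_row)])[0])
    return LABELS[predicted]


def meets_policy(model, feature_row, minimum: int = 1) -> bool:
    """Hypothetical policy check: True if the predicted class is >= minimum."""
    predicted = int(model.predict([list(feature_row)])[0])
    return predicted >= minimum
```

A password manager could call `classify` on each keystroke to drive a strength meter and use `meets_policy` to decide whether a new password is accepted.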
Training Data
- Weak: 100,000 passwords sourced from the SecLists dataset.
- Medium: 100,000 synthetically generated passwords (8–12 characters, alphanumeric, 20% with symbols).
- Strong: 100,000 synthetically generated passwords (12–16 characters, alphanumeric + symbols).
All passwords were stripped of whitespace prior to feature extraction.
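The card does not include the generator used for the synthetic classes, so the following is only a minimal sketch of how passwords matching the stated length ranges and character sets could be produced. The symbol set (`string.punctuation`), the choice to place a single symbol in "medium" passwords, the guarantee of at least one symbol in "strong" passwords, and the reading of "stripped" as removing leading and trailing whitespace are all assumptions.

```python
import random
import string

ALNUM = string.ascii_letters + string.digits
SYMBOLS = string.punctuation  # assumption: the card does not list the symbol set


def make_medium(rng: random.Random) -> str:
    """8-12 alphanumeric characters; about 20% of samples get one symbol."""
    length = rng.randint(8, 12)
    chars = [rng.choice(ALNUM) for _ in range(length)]
    if rng.random() < 0.20:
        chars[rng.randrange(length)] = rng.choice(SYMBOLS)
    return "".join(chars)


def make_strong(rng: random.Random) -> str:
    """12-16 characters drawn from letters, digits, and symbols."""
    length = rng.randint(12, 16)
    chars = [rng.choice(ALNUM + SYMBOLS) for _ in range(length)]
    # Assumption: force at least one symbol so the class matches its description.
    chars[rng.randrange(length)] = rng.choice(SYMBOLS)
    return "".join(chars)


def strip_whitespace(password: str) -> str:
    """Remove leading and trailing whitespace (one reading of 'stripped')."""
    return password.strip()
```

Passing a seeded `random.Random` instance keeps the generated set reproducible across runs.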
Features (10 Total)
- length: Number of characters.
- entropy: Shannon entropy of characters.
- has_upper: Binary flag indicating presence of uppercase characters.
- has_symbol: Binary flag indicating presence of special characters.
- has_leet: Binary flag for leet-speak characters (e.g., @, 3, !, 0).
- repetition: Binary flag for repeated sequences (≥3 consecutive repeated characters).
- digit_ratio: Ratio of digits to total length.
- unique_ratio: Ratio of unique characters to total length.
- bigram_entropy: Entropy of character pairs (bigrams).
- compression_ratio: Ratio of compressed length to original length using zlib compression.
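The ten features above could be computed along the following lines. This is a sketch based only on the one-line descriptions in this card, not the author's extraction code. The leet character set is limited to the four examples given, "special character" is taken to mean any non-alphanumeric character, and entropy is measured in bits (log base 2). All three are assumptions.

```python
import math
import zlib
from collections import Counter

# Assumption: only the four example characters from the card are treated as leet.
LEET_CHARS = set("@3!0")


def shannon_entropy(items) -> float:
    """Shannon entropy in bits of the empirical distribution over `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def extract_features(password: str) -> dict:
    """Return the ten features named in this card for one password."""
    n = len(password)
    if n == 0:
        raise ValueError("password must not be empty")
    raw = password.encode("utf-8")
    bigrams = [password[i:i + 2] for i in range(n - 1)]
    return {
        "length": n,
        "entropy": shannon_entropy(password),
        "has_upper": int(any(c.isupper() for c in password)),
        "has_symbol": int(any(not c.isalnum() for c in password)),
        "has_leet": int(any(c in LEET_CHARS for c in password)),
        # Three or more identical characters in a row.
        "repetition": int(any(
            password[i] == password[i + 1] == password[i + 2]
            for i in range(n - 2)
        )),
        "digit_ratio": sum(c.isdigit() for c in password) / n,
        "unique_ratio": len(set(password)) / n,
        "bigram_entropy": shannon_entropy(bigrams),
        # zlib adds fixed overhead, so short passwords usually give ratios above 1.
        "compression_ratio": len(zlib.compress(raw)) / len(raw),
    }
```

Calling `list(extract_features(pw).values())` yields a row in the same order as the feature list, which is the shape a scikit-learn estimator expects.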
Model Architecture
- Algorithm: Random Forest Classifier (scikit-learn)
- Hyperparameters:
  - n_estimators: 200
  - max_depth: 20
  - min_samples_split: 5
  - random_state: 42
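For reference, the hyperparameters listed above correspond to the following scikit-learn constructor call. The wrapper function `build_model` is a hypothetical name used for illustration, and every argument not listed in this card is left at the scikit-learn default, which may differ from the author's setup.

```python
from sklearn.ensemble import RandomForestClassifier


def build_model() -> RandomForestClassifier:
    """Random forest configured with the four hyperparameters from this card."""
    return RandomForestClassifier(
        n_estimators=200,      # number of trees
        max_depth=20,          # maximum depth of each tree
        min_samples_split=5,   # minimum samples needed to split a node
        random_state=42,       # fixed seed for reproducibility
    )
```

The returned estimator is unfitted. Calling `fit(X, y)` with a feature matrix of ten columns and labels in {0, 1, 2} would train it.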
Performance
- Evaluation Setup: 80/20 train-test split (240,000 training samples, 60,000 test samples)
- Accuracy: ~96.7% (±0.6% standard deviation)
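The split sizes follow directly from holding out 20% of the 300,000 labeled passwords. The sketch below reproduces that arithmetic with scikit-learn's `train_test_split` on placeholder labels. Stratifying by class and the seed value are assumptions, since the card does not say how the split was drawn or how the standard deviation was estimated.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels: 100,000 samples for each of the three classes.
labels = np.repeat([0, 1, 2], 100_000)
indices = np.arange(labels.size)

train_idx, test_idx = train_test_split(
    indices,
    test_size=0.2,      # 20% held out for testing
    stratify=labels,    # assumption: keep the three classes balanced
    random_state=42,    # assumption: seed reused from the model settings
)

print(len(train_idx), len(test_idx))
```

The index arrays can then be used to slice the feature matrix and label vector consistently.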
Limitations
- Feature engineering is heuristic-based and may not fully capture all password patterns across different contexts.
- Primarily trained on English-like and synthetic passwords.
- Potential overfitting to synthetic strong password patterns.
Ethical Considerations
The weak-password data comes from publicly available, breach-derived password lists (SecLists) and was handled with care. The model does not store actual user passwords and is intended only for classification tasks.
Dependencies
This project relies on the following open-source libraries and datasets:
- pandas: Data manipulation and analysis (BSD-3-Clause License).
- scikit-learn: Machine learning framework for the Random Forest Classifier (BSD-3-Clause License).
- joblib: Model persistence and parallel computation (MIT License).
- SecLists: Dataset for weak passwords (MIT License).
If redistributing this project, please include the respective license texts for these dependencies.
Citation
Khokhar, Naa'il Ahmad. (2025). PasswordHealthModel: A Random Forest Model for Password Strength Classification. Hugging Face Model Hub.