---
language: en
license: mit
tags:
- keras
- lstm
- spam-classification
- text-classification
- binary-classification
- email
- deep-learning
library_name: keras
pipeline_tag: text-classification
model_name: Spam Email Classifier (BiLSTM)
datasets:
- SetFit/enron_spam
---
|
|
|
# 📧 Spam Email Classifier using BiLSTM
|
|
|
This model uses a **Bidirectional LSTM (BiLSTM)** architecture built with **Keras** to classify email messages as **Spam** or **Ham**. It was trained on the [Enron Spam Dataset](https://huggingface.co/datasets/SetFit/enron_spam) using GloVe word embeddings.
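If you want to inspect the training data yourself, it can be loaded with the `datasets` library. A minimal sketch, assuming `datasets` is installed and the dataset's standard `train` split:

```python
from datasets import load_dataset

# Load the Enron spam dataset from the Hugging Face Hub
ds = load_dataset("SetFit/enron_spam")

print(ds)              # inspect the available splits and columns
print(ds["train"][0])  # peek at a single example
```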
|
|
|
---
|
|
|
## 🧠 Model Architecture
|
|
|
- **Tokenizer**: Keras `Tokenizer` fitted on the Enron dataset
- **Embedding**: Pretrained [GloVe.6B.100d](https://nlp.stanford.edu/projects/glove/)
- **Model**: `Embedding → BiLSTM → Dropout → Dense(sigmoid)` (see the sketch below)
- **Input**: English email/message text
- **Output**: `0 = Ham`, `1 = Spam`
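For reference, here is a minimal sketch of how a model with this architecture could be defined in Keras. The hyperparameters (vocabulary size, LSTM units, dropout rate) are illustrative assumptions, not the values used to train this model:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

# Assumed hyperparameters -- the released model's actual values may differ.
VOCAB_SIZE = 20000  # tokenizer vocabulary size
EMBED_DIM = 100     # matches GloVe.6B.100d
MAX_LEN = 50        # padding length used at inference time

# During training, this matrix would be filled row by row with the
# GloVe vector for each token in the tokenizer's vocabulary.
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM,
              weights=[embedding_matrix],  # pretrained GloVe vectors
              input_length=MAX_LEN,
              trainable=False),
    Bidirectional(LSTM(64)),         # reads the sequence in both directions
    Dropout(0.5),                    # regularization
    Dense(1, activation="sigmoid"),  # outputs P(spam)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```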
|
|
|
---
|
|
|
## 🧪 Example Usage
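The snippet below assumes `tensorflow` and `huggingface_hub` are installed (`pip install tensorflow huggingface_hub`).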
|
|
|
```python
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from huggingface_hub import hf_hub_download
import pickle

# Download the model and tokenizer from the Hugging Face Hub
model_path = hf_hub_download("lokas/spam-emails-classifier", "model.h5")
tokenizer_path = hf_hub_download("lokas/spam-emails-classifier", "tokenizer.pkl")

# Load the Keras model and the fitted tokenizer
model = load_model(model_path)
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)

# Prediction function
def predict_spam(text):
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=50)  # must match the maxlen used in training
    pred = model.predict(padded)[0][0]      # sigmoid output: P(spam)
    return "🚫 Spam" if pred > 0.5 else "✅ Not Spam"

# Example
print(predict_spam("Win a free iPhone now!"))
```