- NLP
- Text-Summarization
- CNN
---

# Seq2Seq Model with Attention for Text Summarization

This repository contains a Sequence-to-Sequence (Seq2Seq) model trained on the **CNN/DailyMail** dataset for text summarization. The model is built with Keras and uses pre-trained GloVe embeddings for richer word representations. It follows an encoder-decoder architecture based on LSTM layers, with attention as a natural extension for capturing long-term dependencies.

## Model Architecture

The model follows the classic encoder-decoder structure:

- **Embedding Layer**: Uses pre-trained GloVe embeddings (100-dimensional) for both the input (article) and target (summary) texts.
- **Encoder**: A bidirectional LSTM that encodes the input sequence. The forward and backward hidden states are concatenated.
- **Decoder**: An LSTM initialized with the encoder's hidden and cell states to generate the target sequence (summary).
- **Attention Mechanism**: The base code does not implement attention explicitly, but it can be integrated to improve summarization by focusing on relevant parts of the input sequence during decoding (see the sketch in the Future Work section).
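
Throughout the snippets below, names such as `t_tokenizer`, `s_tokenizer`, `t_max_features`, `s_max_features`, and `maxlen_text` refer to the tokenizers, vocabulary sizes, and padded sequence lengths for the articles and summaries. They are not defined in this README, so the following preprocessing sketch shows one plausible way to produce them; the concrete sizes and lengths are illustrative assumptions, and `train_articles`/`train_summaries` are assumed to be lists of raw article and summary strings.

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Illustrative sizes -- not values taken from the original notebook
t_max_features = 40000   # article vocabulary size
s_max_features = 10000   # summary vocabulary size
maxlen_text = 400        # maximum article length in tokens
maxlen_summ = 40         # maximum summary length in tokens

# Tokenize and pad the articles
t_tokenizer = Tokenizer(num_words=t_max_features)
t_tokenizer.fit_on_texts(train_articles)
train_x = pad_sequences(t_tokenizer.texts_to_sequences(train_articles),
                        maxlen=maxlen_text, padding='post')

# Tokenize and pad the summaries; val_x / val_y for the validation split
# are prepared the same way
s_tokenizer = Tokenizer(num_words=s_max_features)
s_tokenizer.fit_on_texts(train_summaries)
train_y = pad_sequences(s_tokenizer.texts_to_sequences(train_summaries),
                        maxlen=maxlen_summ, padding='post')
```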

### Embeddings

We use GloVe embeddings (100-dimensional) pre-trained on a large text corpus. An embedding matrix is constructed for both the input (article) and output (summary) vocabularies from the GloVe vectors.

```python
import numpy as np

# Load the pre-trained GloVe vectors into a word -> vector lookup
embedding_index = {}
embed_dim = 100
with open('../input/glove6b100dtxt/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

# Embedding matrix for the input (articles)
t_embed = np.zeros((t_max_features, embed_dim))
for word, i in t_tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if i < t_max_features and vec is not None:
        t_embed[i] = vec

# Embedding matrix for the output (summaries)
s_embed = np.zeros((s_max_features, embed_dim))
for word, i in s_tokenizer.word_index.items():
    vec = embedding_index.get(word)
    if i < s_max_features and vec is not None:
        s_embed[i] = vec
```

## Encoder

A bidirectional LSTM encodes the input text. The forward and backward hidden and cell states are concatenated and passed as the initial states of the decoder.

```python
from keras.layers import Input, Embedding, LSTM, Bidirectional, Concatenate

latent_dim = 128
enc_input = Input(shape=(maxlen_text,))
enc_embed = Embedding(t_max_features, embed_dim, input_length=maxlen_text,
                      weights=[t_embed], trainable=False)(enc_input)
enc_lstm = Bidirectional(LSTM(latent_dim, return_state=True))
enc_output, enc_fh, enc_fc, enc_bh, enc_bc = enc_lstm(enc_embed)

# Concatenate the forward and backward states
enc_h = Concatenate(axis=-1)([enc_fh, enc_bh])
enc_c = Concatenate(axis=-1)([enc_fc, enc_bc])
```

## Decoder

The decoder is an LSTM that takes the encoder's final states as its initial states and generates the output summary sequence.

```python
from keras.layers import Dense, TimeDistributed

dec_input = Input(shape=(None,))
dec_embed = Embedding(s_max_features, embed_dim, weights=[s_embed], trainable=False)(dec_input)
dec_lstm = LSTM(latent_dim * 2, return_sequences=True, return_state=True,
                dropout=0.3, recurrent_dropout=0.2)
dec_outputs, _, _ = dec_lstm(dec_embed, initial_state=[enc_h, enc_c])

# Dense layer with softmax activation produces a distribution over the summary vocabulary
dec_dense = TimeDistributed(Dense(s_max_features, activation='softmax'))
dec_output = dec_dense(dec_outputs)
```

## Model Summary

The full Seq2Seq model is compiled with sparse categorical crossentropy loss and the RMSProp optimizer.
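
A minimal sketch of wiring the encoder and decoder together and compiling, following the snippets above (the exact call in the original notebook may differ):

```python
from keras.models import Model

# Training model: article tokens + shifted summary tokens in, next-token probabilities out
model = Model([enc_input, dec_input], dec_output)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
model.summary()
```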

### Model Visualization

A diagram of the model is generated with Keras' `plot_model` function:



## Training

The model is trained with early stopping to prevent overfitting, using batches of 128 for at most 10 epochs, with a validation set for performance monitoring. Training uses teacher forcing: the decoder is fed the summary tokens up to the second-to-last position (`train_y[:, :-1]`) and is trained to predict the next token at each step (`train_y[:, 1:]`).

```python
import keras

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)
model.fit([train_x, train_y[:, :-1]],
          train_y.reshape(train_y.shape[0], train_y.shape[1], 1)[:, 1:],
          epochs=10,
          callbacks=[early_stop],
          batch_size=128,
          verbose=2,
          validation_data=([val_x, val_y[:, :-1]],
                           val_y.reshape(val_y.shape[0], val_y.shape[1], 1)[:, 1:]))
```

## Dataset

The CNN/DailyMail dataset is used for training and validation. It contains news articles paired with human-written summaries (highlights), which makes it well suited to the text summarization task.

- Train set: Used to train the model on article-summary pairs.
- Validation set: Used to evaluate model performance and to drive early stopping.
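
One convenient way to obtain the data is the Hugging Face `datasets` library (an assumption; the original notebook may read the CSV files directly):

```python
from datasets import load_dataset

# The "3.0.0" config provides non-anonymized article/highlights pairs
dataset = load_dataset("cnn_dailymail", "3.0.0")
train_articles = dataset["train"]["article"]
train_summaries = dataset["train"]["highlights"]
```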

## Requirements

- Python 3.x
- Keras
- TensorFlow
- NumPy
- GloVe Embeddings (glove.6B.100d)

## How to Run

1. Download the CNN/DailyMail dataset and the pre-trained GloVe embeddings.
2. Preprocess the dataset and prepare the embedding matrices.
3. Train the model using the provided code.
4. Evaluate the model on the validation set and generate summaries for new articles (a greedy-decoding sketch follows below).
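
A minimal greedy-decoding sketch for step 4. It assumes the decoder `Embedding` layer is kept as a layer object (called `dec_emb_layer` here) so it can be reused at inference time, and that the summaries were wrapped with hypothetical `sostok`/`eostok` start and end tokens during preprocessing:

```python
import numpy as np
from keras.models import Model
from keras.layers import Input

# Encoder inference model: article tokens -> initial decoder states
enc_model = Model(enc_input, [enc_h, enc_c])

# Decoder inference model: previous token + states -> next-token probabilities + new states
dec_h_in = Input(shape=(latent_dim * 2,))
dec_c_in = Input(shape=(latent_dim * 2,))
dec_emb_inf = dec_emb_layer(dec_input)
dec_out_inf, dec_h_out, dec_c_out = dec_lstm(dec_emb_inf, initial_state=[dec_h_in, dec_c_in])
dec_model = Model([dec_input, dec_h_in, dec_c_in],
                  [dec_dense(dec_out_inf), dec_h_out, dec_c_out])

def summarize(article_seq, max_len=40):
    """Greedily decode a single padded article sequence into a summary string."""
    h, c = enc_model.predict(article_seq)
    target = np.array([[s_tokenizer.word_index['sostok']]])
    words = []
    for _ in range(max_len):
        probs, h, c = dec_model.predict([target, h, c])
        idx = int(np.argmax(probs[0, -1, :]))
        word = s_tokenizer.index_word.get(idx, '')
        if not word or word == 'eostok':
            break
        words.append(word)
        target = np.array([[idx]])
    return ' '.join(words)
```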

## Results

The model generates abstractive summaries of news articles. You can tweak the latent dimensions and embedding sizes, or add attention, to improve performance.

## Future Work

* Attention Mechanism: Implement Bahdanau or Luong attention for better results (see the sketch below).
* Beam Search: Incorporate beam search for higher-quality summary generation.
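
A minimal sketch of adding Bahdanau-style (additive) attention with Keras' built-in `AdditiveAttention` layer. It assumes the encoder is rebuilt with `return_sequences=True` so that `enc_seq_output` holds one hidden state per input token; this is an illustration, not the original implementation:

```python
from keras.layers import AdditiveAttention, Concatenate, TimeDistributed, Dense

# enc_seq_output: (batch, maxlen_text, 2 * latent_dim) encoder states, one per input token
# dec_outputs:    (batch, summary_len, 2 * latent_dim) decoder states from dec_lstm
attn = AdditiveAttention()
context = attn([dec_outputs, enc_seq_output])  # query = decoder states, value = encoder states

# Combine each decoder state with its attention context before the softmax projection
dec_concat = Concatenate(axis=-1)([dec_outputs, context])
dec_dense = TimeDistributed(Dense(s_max_features, activation='softmax'))
dec_output = dec_dense(dec_concat)
```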

## Resources

- [Keras Documentation](https://keras.io/)
- [CNN/DailyMail Dataset](https://huggingface.co/datasets/cnn_dailymail)
- [GloVe Embeddings](https://nlp.stanford.edu/projects/glove/)