Commit f724822 by samuelmat19 (parent: c66b641): Added README.md

README.md (added)

## T&C Summarization Model

A Terms and Conditions (T&C) summarization model based on [sshleifer/distilbart-cnn-6-6](https://huggingface.co/sshleifer/distilbart-cnn-6-6).

This abstractive summarization model is one part of a larger end-to-end T&C summarizer pipeline, in which it is preceded by an LSA (Latent Semantic Analysis) extractive summarization step. The extractive step shortens the T&C text, which this model then summarizes further.

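For readers unfamiliar with LSA, the block below sketches one common formulation of LSA-based sentence ranking (the one usually attributed to Steinberger and Ježek). Whether the extractive stage of this pipeline uses exactly this scoring is an assumption; the symbols are introduced here for illustration only.

```latex
A = U \Sigma V^{\top},
\qquad
s_j = \sqrt{\sum_{i=1}^{r} \sigma_i^{2} \, v_{j,i}^{2}}
```

Here $A$ is the term-by-sentence matrix of the document, $\sigma_i$ is the $i$-th singular value, $v_{j,i}$ is the entry of $V$ for sentence $j$ and latent topic $i$, and $r$ is the number of topics retained. The sentences with the highest scores $s_j$ are kept as the extractive summary.
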
## Finetuning Corpus

The model is fine-tuned on a dataset scraped from https://tosdr.org/. Both the article text and the reference summary are shortened via extractive summarization before they are used to fine-tune the model.

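The preprocessing described in this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' training code: `reduce_text` is a hypothetical stand-in that ranks sentences by word frequency instead of LSA (so that the block runs with only the standard library), `build_training_pair` is a hypothetical helper name, and the sentence budgets are illustrative defaults rather than values confirmed for training.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def reduce_text(text, sentences_count):
    """Keep the `sentences_count` highest-scoring sentences, in original order.

    Stand-in for LSA (assumption): a sentence's score is the mean
    document-wide frequency of its words.
    """
    sentences = split_sentences(text)
    if len(sentences) <= sentences_count:
        return " ".join(sentences)
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(word for words in tokenized for word in words)
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    # Pick the top-scoring sentence indices, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    top = sorted(ranked[:sentences_count])
    return " ".join(sentences[i] for i in top)


def build_training_pair(article, summary, article_sentences=12, summary_sentences=3):
    """Reduce both sides of an (article, summary) pair before fine-tuning.

    The default sentence budgets are illustrative assumptions.
    """
    return {
        "article": reduce_text(article, article_sentences),
        "summary": reduce_text(summary, summary_sentences),
    }
```

In a real setup, `reduce_text` would be swapped for an LSA summarizer, and the resulting pairs would be tokenized and passed to the fine-tuning loop.
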
## Load Finetuned Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("ml6team/distilbart-tos-summarizer-tosdr")
model = AutoModelForSeq2SeqLM.from_pretrained("ml6team/distilbart-tos-summarizer-tosdr")
```

## Code Sample

This sample requires [sumy](https://pypi.org/project/sumy/), which provides the LSA extractive summarizer, as an additional package.

```python
import nltk
# Sentence tokenizer data used by sumy (newer NLTK releases may need 'punkt_tab' instead).
nltk.download('punkt')
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.summarizers.lsa import LsaSummarizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

LANGUAGE = "english"
# Number of sentences kept by the extractive step before the abstractive model runs.
EXTRACTED_ARTICLE_SENTENCES_LEN = 12

stemmer = Stemmer(LANGUAGE)
lsa_summarizer = LsaSummarizer(stemmer)
tokenizer = AutoTokenizer.from_pretrained("ml6team/distilbart-tos-summarizer-tosdr")
model = AutoModelForSeq2SeqLM.from_pretrained("ml6team/distilbart-tos-summarizer-tosdr")

def get_extractive_summary(text, sentences_count):
    """Shorten `text` to at most `sentences_count` sentences using LSA."""
    parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
    sentences = lsa_summarizer(parser.document, sentences_count)
    return ' '.join(str(sentence) for sentence in sentences)

def get_summary(dict_summarizer_model, dict_tokenizer, text_content):
    """Run the extractive step, then the abstractive model, and return the summary."""
    text_content = get_extractive_summary(text_content, EXTRACTED_ARTICLE_SENTENCES_LEN)
    tokenizer = dict_tokenizer['tokenizer']
    model = dict_summarizer_model['model']

    inputs = tokenizer(text_content, max_length=dict_tokenizer['max_length'], truncation=True, return_tensors="pt")
    outputs = model.generate(
        inputs["input_ids"],
        max_length=dict_summarizer_model['max_length'],
        min_length=dict_summarizer_model['min_length'],
    )

    # Drop special tokens such as <s> and </s> while decoding.
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

test_tos = """
In addition, certain portions of the Web Site may be subject to additional terms of use that we make available for your review or otherwise link to that portion of the Web Site to which such additional terms apply. By using such portions, or any part thereof, you agree to be bound by the additional terms of use applicable to such portions.
Age Restrictions The Web Site may be accessed and used only by individuals who can form legally binding contracts under applicable laws, who are at least 18 years of age or the age of majority in their state or territory of residence (if higher than 18), and who are not barred from using the Web Site under applicable laws.
Our Technology may not be copied, modified, reproduced, republished, posted, transmitted, sold, offered for sale, or redistributed in any way without our prior written permission and the prior written permission of our applicable licensors. Nothing in these Site Terms of Use grants you any right to receive delivery of a copy of Our Technology or to obtain access to Our Technology except as generally and ordinarily permitted through the Web Site according to these Site Terms of Use.
Furthermore, nothing in these Site Terms of Use will be deemed to grant you, by implication, estoppel or otherwise, a license to Our Technology. Certain of the names, logos, and other materials displayed via the Web site constitute trademarks, tradenames, service marks or logos (“Marks”) of us or other entities. You are not authorized to use any such Marks. Ownership of all such Marks and the goodwill associated therewith remains with us or those other entities.
Any use of third party software provided in connection with the Web Site will be governed by such third parties’ licenses and not by these Site Terms of Use. Information on this Web Site may contain technical inaccuracies or typographical errors. Lenovo provides no assurances that any reported problems may be resolved with the use of any information that Lenovo provides
"""

# Generation settings for the abstractive model (lengths are in tokens).
model_dict = {
    'model': model,
    'max_length': 512,
    'min_length': 4
}

# Tokenizer settings: inputs longer than max_length tokens are truncated.
tokenizer_dict = {
    'tokenizer': tokenizer,
    'max_length': 1024
}

print(get_summary(model_dict, tokenizer_dict, test_tos))
```