---
license: apache-2.0
language:
- en
metrics:
- f1
library_name: transformers
pipeline_tag: token-classification
tags:
- token classification
- information extraction
- NER
- relation extraction
- text cleaning
---
# UTC-DeBERTa-base - universal token classifier
🚀 Meet the second version of our prompt-tuned universal token classification model 🚀

This line of models can perform various information extraction tasks by analysing input prompts and recognizing the parts of texts that satisfy them. In comparison with the first version, the second one is more general: the recognized spans can be entities, whole sentences, or even paragraphs.

To use the model, just specify a prompt, for example: ***“Identify all positive aspects of the product mentioned by John:”*** and append your target text.

This is a model based on `DeBERTaV3-base` that was trained on multiple token classification tasks or tasks that can be represented in this way.

Such *multi-task fine-tuning* enabled better generalization; even small models can be used for zero-shot named entity recognition and demonstrate good performance on reading comprehension tasks.

The model can be used for the following tasks:
* Named entity recognition (NER);
* Open information extraction;
* Question answering;
* Relation extraction;
* Coreference resolution;
* Text cleaning;
* Summarization.

#### How to use
There are several ways to use this model; one of them is to utilize the `token-classification` pipeline from `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

def process(text, prompt, threshold=0.5):
    """
    Processes text by preparing the prompt and adjusting indices.

    Args:
        text (str): The text to process
        prompt (str): The prompt to prepend to the text
        threshold (float): Minimum score a span needs to be kept

    Returns:
        list: A list of dicts with adjusted spans and scores
    """
    # Concatenate prompt and text into the full input
    input_ = f"{prompt}\n{text}"
    results = nlp(input_)  # Run the pipeline on the full input
    processed_results = []
    prompt_length = len(prompt)  # Get prompt length
    for result in results:
        # Skip spans whose score is below the threshold
        if result['score'] < threshold:
            continue
        # Adjust indices by subtracting the prompt length
        start = result['start'] - prompt_length
        # If the span belongs to the prompt, skip it
        if start < 0:
            continue
        end = result['end'] - prompt_length
        # Extract the span from the original text using adjusted indices
        span = text[start:end]
        # Create the processed result dict
        processed_result = {
            'span': span,
            'start': start,
            'end': end,
            'score': result['score']
        }
        processed_results.append(processed_result)
    return processed_results

tokenizer = AutoTokenizer.from_pretrained("knowledgator/UTC-DeBERTa-base-v2")
model = AutoModelForTokenClassification.from_pretrained("knowledgator/UTC-DeBERTa-base-v2")

nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy='first')
```

To use the model for **zero-shot named entity recognition**, we recommend using the following prompt:
```python
prompt = """Identify the following entity classes in the text:
computer

Text:
"""
text = """Apple was founded as Apple Computer Company on April 1, 1976, by Steve Wozniak, Steve Jobs (1955–2011) and Ronald Wayne to develop and sell Wozniak's Apple I personal computer.
It was incorporated by Jobs and Wozniak as Apple Computer, Inc. in 1977. The company's second computer, the Apple II, became a best seller and one of the first mass-produced microcomputers.
Apple went public in 1980 to instant financial success."""

results = process(text, prompt)

print(results)
```
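
To detect several entity classes at once, one simple option is to run the same prompt once per label and merge the results. Below is a minimal sketch that reuses the `process` helper and the `text` from the example above; the label list is hypothetical:

```python
# One pass over the text per entity class; the labels below are hypothetical.
labels = ["company", "person", "date"]

for label in labels:
    # Rebuild the recommended prompt shape for each label
    label_prompt = f"Identify the following entity classes in the text:\n{label}\n\nText:\n"
    for result in process(text, label_prompt):
        print(label, result)
```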

To use the model for **open information extraction**, put any prompt you want:
```python
prompt = """Extract all positive aspects about the product
"""
text = """I recently purchased the Sony WH-1000XM4 Wireless Noise-Canceling Headphones from Amazon and I must say, I'm thoroughly impressed. The package arrived in New York within 2 days, thanks to Amazon Prime's expedited shipping.

The headphones themselves are remarkable. The noise-canceling feature works like a charm in the bustling city environment, and the 30-hour battery life means I don't have to charge them every day. Connecting them to my Samsung Galaxy S21 was a breeze, and the sound quality is second to none.

I also appreciated the customer service from Amazon when I had a question about the warranty. They responded within an hour and provided all the information I needed.

However, the headphones did not come with a hard case, which was listed in the product description. I contacted Amazon, and they offered a 10% discount on my next purchase as an apology.

Overall, I'd give these headphones a 4.5/5 rating and highly recommend them to anyone looking for top-notch quality in both product and service."""

results = process(text, prompt)

print(results)
```

To try the model on **question answering**, just specify a question and a text passage:

```python
question = """Who are the founders of Microsoft?"""

text = """Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop and sell BASIC interpreters for the Altair 8800.
During his career at Microsoft, Gates held the positions of chairman, chief executive officer, president and chief software architect, while also being the largest individual shareholder until May 2014."""

results = process(text, question)

print(results)
```

For **text cleaning**, specify the following prompt; the model will recognize the parts of the text that should be erased:

```python
prompt = """Clean the following text extracted from the web matching not relevant parts:"""

text = """The mechanism of action was characterized using native mass spectrometry, the thermal shift-binding assay, and enzymatic kinetic studies (Figure ). In the native mass spectrometry binding assay, compound 23R showed dose-dependent binding to SARS-CoV-2 Mpro, similar to the positive control GC376, with a binding stoichiometry of one drug per monomer (Figure A).
Similarly, compound 23R showed dose-dependent stabilization of the SARS-CoV-2 Mpro in the thermal shift binding assay with an apparent Kd value of 9.43 μM, a 9.3-fold decrease compared to ML188 (1) (Figure B). In the enzymatic kinetic studies, 23R was shown to be a noncovalent inhibitor with a Ki value of 0.07 μM (Figure C, D top and middle panels). In comparison, the Ki for the parent compound ML188 (1) is 2.29 μM.
The Lineweaver–Burk or double-reciprocal plot with different compound concentrations yielded an intercept at the Y-axis, suggesting that 23R is a competitive inhibitor similar to ML188 (1) (Figure C, D bottom panel). Buy our T-shirts for the lowerst prices you can find!!! Overall, the enzymatic kinetic studies confirmed that compound 23R is a noncovalent inhibitor of SARS-CoV-2 Mpro."""

results = process(text, prompt)

print(results)
```
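
The pipeline only flags the noisy spans; a minimal sketch that actually erases them, assuming the `results` produced above, could look like this:

```python
# Delete the flagged spans from the end of the text backwards, so that
# earlier character indices stay valid after each removal.
cleaned = text
for result in sorted(results, key=lambda r: r['start'], reverse=True):
    cleaned = cleaned[:result['start']] + cleaned[result['end']:]
print(cleaned)
```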

It's possible to use the model for **relation extraction**; this makes it possible to extract all relations between entities in N*C model calls, where N is the number of entities and C is the number of relation classes (a sketch of the full loop follows the example below):

```python
rex_prompt = """
Identify target entity given the following relation: "{}" and the following source entity: "{}"

Text:
"""

text = """Dr. Paul Hammond, a renowned neurologist at Johns Hopkins University, has recently published a paper in the prestigious journal "Nature Neuroscience". """

entity = "Paul Hammond"

relation = "worked at"

prompt = rex_prompt.format(relation, entity)

results = process(text, prompt)

print(results)
```
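
A minimal sketch of that N*C loop, reusing `rex_prompt`, `text`, and `process` from above; the candidate entity and relation lists are hypothetical:

```python
# One model call per (source entity, relation class) pair: N*C calls in total.
entities = ["Paul Hammond", "Johns Hopkins University"]  # hypothetical
relations = ["worked at", "published in"]                # hypothetical

for source in entities:
    for rel in relations:
        pair_prompt = rex_prompt.format(rel, source)
        for result in process(text, pair_prompt):
            print(source, rel, "->", result['span'])
```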

To **find similar entities** in the text, consider the following example:
```python
ent_prompt = "Find all '{}' mentions in the text:"

text = """Several studies have reported its pharmacological activities, including anti-inflammatory, antimicrobial, and antitumoral effects. The effect of E-anethole was studied in the osteosarcoma MG-63 cell line, and the antiproliferative activity was evaluated by an MTT assay. It showed a GI50 value of 60.25 μM with apoptosis induction through the mitochondrial-mediated pathway. Additionally, it induced cell cycle arrest at the G0/G1 phase, up-regulated the expression of p53, caspase-3, and caspase-9, and down-regulated Bcl-xL expression. Moreover, the antitumoral activity of anethole was assessed against oral tumor Ca9-22 cells, and the cytotoxic effects were evaluated by MTT and LDH assays. It demonstrated a LD50 value of 8 μM, and cellular proliferation was 42.7% and 5.2% at anethole concentrations of 3 μM and 30 μM, respectively. It was reported that it could selectively and in a dose-dependent manner decrease cell proliferation and induce apoptosis, as well as induce autophagy, decrease ROS production, and increase glutathione activity. The cytotoxic effect was mediated through NF-kB, MAP kinases, Wnt, caspase-3 and -9, and PARP1 pathways. Additionally, treatment with anethole inhibited cyclin D1 oncogene expression, increased cyclin-dependent kinase inhibitor p21WAF1, up-regulated p53 expression, and inhibited the EMT markers."""

entity = "anethole"

prompt = ent_prompt.format(entity)

results = process(text, prompt)

print(results)
```

We significantly improved the model's **summarization** abilities in comparison to the first version; below is an example:

```python
prompt = "Summarize the following text, highlighting the most important sentences:"

text = """Apple was founded as Apple Computer Company on April 1, 1976, by Steve Wozniak, Steve Jobs (1955–2011) and Ronald Wayne to develop and sell Wozniak's Apple I personal computer. It was incorporated by Jobs and Wozniak as Apple Computer, Inc. in 1977. The company's second computer, the Apple II, became a best seller and one of the first mass-produced microcomputers. Apple went public in 1980 to instant financial success. The company developed computers featuring innovative graphical user interfaces, including the 1984 original Macintosh, announced that year in a critically acclaimed advertisement called "1984". By 1985, the high cost of its products, and power struggles between executives, caused problems. Wozniak stepped back from Apple and pursued other ventures, while Jobs resigned and founded NeXT, taking some Apple employees with him.
Apple Inc. is an American multinational technology company headquartered in Cupertino, California. Apple is the world's largest technology company by revenue, with US$394.3 billion in 2022 revenue. As of March 2023, Apple is the world's biggest company by market capitalization. As of June 2022, Apple is the fourth-largest personal computer vendor by unit sales and the second-largest mobile phone manufacturer in the world. It is considered one of the Big Five American information technology companies, alongside Alphabet (parent company of Google), Amazon, Meta Platforms, and Microsoft.
As the market for personal computers expanded and evolved throughout the 1990s, Apple lost considerable market share to the lower-priced duopoly of the Microsoft Windows operating system on Intel-powered PC clones (also known as "Wintel"). In 1997, weeks away from bankruptcy, the company bought NeXT to resolve Apple's unsuccessful operating system strategy and entice Jobs back to the company. Over the next decade, Jobs guided Apple back to profitability through a number of tactics including introducing the iMac, iPod, iPhone and iPad to critical acclaim, launching the "Think different" campaign and other memorable advertising campaigns, opening the Apple Store retail chain, and acquiring numerous companies to broaden the company's product portfolio. When Jobs resigned in 2011 for health reasons, and died two months later, he was succeeded as CEO by Tim Cook"""

results = process(text, prompt)

print(results)
```
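
Because the model returns the highlighted sentences as spans, one possible way to stitch them into an extractive summary, assuming the `results` from the call above, is:

```python
# Join the highlighted sentences in document order.
summary = " ".join(r['span'].strip() for r in sorted(results, key=lambda r: r['start']))
print(summary)
```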

### Benchmarking
Below is a table that highlights the performance of UTC models on the [CrossNER](https://huggingface.co/datasets/DFKI-SLT/cross_ner) dataset. The values represent Micro F1 scores, with the estimation done at the word level.

| Model             | AI     | Literature | Music  | Politics | Science |
|-------------------|--------|------------|--------|----------|---------|
| UTC-DeBERTa-small | 0.8492 | 0.8792     | 0.8640 | 0.9008   | 0.8500  |
| UTC-DeBERTa-base  | 0.8452 | 0.8587     | 0.8711 | 0.9147   | 0.8631  |
| UTC-DeBERTa-large | 0.8971 | 0.8978     | 0.9204 | 0.9247   | 0.8779  |
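
For clarity, here is a minimal sketch of one common word-level Micro F1 formulation (an assumption on our side, not the exact evaluation script; `gold` and `pred` hold one class label per word, with `"O"` marking non-entity words):

```python
# Word-level micro F1: every non-"O" word counts once; a wrong class on an
# entity word counts as both a false positive and a false negative.
def micro_f1(gold, pred, outside="O"):
    tp = sum(g == p != outside for g, p in zip(gold, pred))
    fp = sum(p != outside and g != p for g, p in zip(gold, pred))
    fn = sum(g != outside and g != p for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(micro_f1(["O", "PER", "PER"], ["O", "PER", "O"]))  # ~0.667
```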

### Further reading
Check out our blog post, ["As GPT4 but for token classification"](https://medium.com/p/9b5a081fbf27), where we highlight possible use cases of the model and explain why next-token prediction is not the only way to achieve amazing zero-shot capabilities.
While most of the AI industry is focused on generative AI and decoder-based models, we are committed to developing encoder-based models.
We aim to achieve the same level of generalization for such models as their decoder counterparts. Encoders have several wonderful properties, such as bidirectional attention, and they are the best choice for many information extraction tasks in terms of efficiency and controllability.

### Feedback
We value your input! Share your feedback and suggestions to help us improve our models.
Fill out the feedback [form](https://forms.gle/5CPFFuLzNWznjcpL7).

### Join Our Discord
Connect with our community on Discord for news, support, and discussion about our models.
Join our [Discord](https://discord.gg/dkyeAgs9DG).