Report for JiaqiLee/imdb-finetuned-bert-base-uncased
Hey Team!🤗✨
We’re thrilled to share some amazing evaluation results that’ll make your day!🎉📊
We have identified 12 potential vulnerabilities in your model based on an automated scan.
This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).
👉Robustness issues (2)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Robustness | major 🔴 | — | Fail rate = 0.115 | Add typos | 92/803 tested samples (11.46%) changed prediction after perturbation |
🔍✨Examples
When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 11.46% of the cases. We expected the predictions not to be affected by this transformation.
| | text | Add typos(text) | Original prediction | Prediction after perturbation |
---|---|---|---|---|
12 | ... the film suffers from a lack of humor ( something needed to balance out the violence ) ... | .... the cfilm sufcers froj a ladk of humor ( domething nweded to balance out the violence ) ... | negative (p = 1.00) | positive (p = 0.99) |
13 | we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity . | w root for ( clara and paul ) , even like them , though perhaps it 's an emlotion closer to pity . | positive (p = 0.99) | negative (p = 0.66) |
46 | a synthesis of cliches and absurdities that seems positively decadent in its cinematic flash and emptiness . | a syhthesis og clichea ajd absurdities thag seems positivey decadet in its cinematic lash ande mptiness . | positive (p = 0.95) | negative (p = 0.99) |
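The fail rate in the summary row is just the share of tested samples whose prediction flipped. A minimal sketch of that arithmetic follows; the function name `fail_rate` is hypothetical and not part of the scanner, and the counts are copied from the two robustness rows of this report.

```python
def fail_rate(changed: int, tested: int) -> float:
    """Share of tested samples whose prediction changed after perturbation."""
    if tested <= 0:
        raise ValueError("tested must be a positive count")
    return changed / tested


# Counts taken from the report rows (hypothetical variable names).
typos = fail_rate(92, 803)        # "Add typos" row
punctuation = fail_rate(51, 866)  # "Punctuation Removal" row

print(f"Add typos:           {typos:.3f} ({typos:.2%})")
print(f"Punctuation Removal: {punctuation:.3f} ({punctuation:.2%})")
```

Both values round to the figures shown in the Metric and Deviation columns.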
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Robustness | medium 🟡 | — | Fail rate = 0.059 | Punctuation Removal | 51/866 tested samples (5.89%) changed prediction after perturbation |
🔍✨Examples
When feature “text” is perturbed with the transformation “Punctuation Removal”, the model changes its prediction in 5.89% of the cases. We expected the predictions not to be affected by this transformation.
| | text | Punctuation Removal(text) | Original prediction | Prediction after perturbation |
---|---|---|---|---|
4 | it 's slow -- very , very slow . | it s slow very very slow | positive (p = 0.52) | negative (p = 0.77) |
33 | if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is far too self-conscious to draw you deeply into its world . | if the movie succeeds in instilling a wary sense of there but for the grace of god it is far too self conscious to draw you deeply into its world | negative (p = 1.00) | positive (p = 0.99) |
66 | if you 're hard up for raunchy college humor , this is your ticket right here . | if you re hard up for raunchy college humor this is your ticket right here | positive (p = 0.89) | negative (p = 0.57) |
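To re-create perturbed inputs like these locally, one possible approach is sketched below. It is an assumption inferred from the three examples above, not the scanner's actual implementation: each ASCII punctuation character is replaced by a space and runs of whitespace are then collapsed. The name `remove_punctuation` is hypothetical.

```python
import string

# Map every ASCII punctuation character to a space (assumed behaviour,
# inferred from the example rows; not taken from the scanner's source).
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def remove_punctuation(text: str) -> str:
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())


print(remove_punctuation("it 's slow -- very , very slow ."))
```

Replacing with a space rather than deleting the character is what turns `self-conscious` into `self conscious` in row 33; deleting it would give `selfconscious` instead.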
👉Performance issues (10)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | major 🔴 | text_length(text) < 89.500 AND text_length(text) >= 80.500 | Precision = 0.719 | — | -15.79% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` < 89.500 AND `text_length(text)` >= 80.500, the Precision is 15.79% lower than the global Precision.
| | text | text_length(text) | label | Predicted label |
---|---|---|---|---|
115 | sam mendes has become valedictorian at the school for soft landings and easy ways out . | 88 | negative | positive (p = 0.95) |
142 | what better message than ` love thyself ' could young women of any size receive ? | 82 | positive | negative (p = 1.00) |
286 | at its best , queen is campy fun like the vincent price horror classics of the '60s . | 86 | positive | negative (p = 1.00) |
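The report gives each slice metric and its deviation but not the global value itself. Assuming the deviation is relative, i.e. (slice - global) / global, the global metric can be backed out as sketched below. Both function names are hypothetical, and the relative-deviation reading is an assumption rather than something the report states.

```python
def relative_deviation(slice_value: float, global_value: float) -> float:
    """Relative deviation of a slice metric from the global metric."""
    return (slice_value - global_value) / global_value


def implied_global(slice_value: float, deviation_pct: float) -> float:
    """Invert the relative deviation to recover the global metric."""
    return slice_value / (1 + deviation_pct / 100)


# Precision = 0.719 at -15.79% (this slice).
global_precision = implied_global(0.719, -15.79)
print(f"implied global precision: {global_precision:.3f}")

# Round-trip check: the deviation recomputed from the implied global value.
print(f"{relative_deviation(0.719, global_precision):.2%}")
```

The other precision rows in this report (0.800 at -6.27%) imply roughly the same global value, about 0.85, which is consistent with the relative reading.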
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | idx >= 500.500 AND idx < 548.500 | Accuracy = 0.812 | — | -6.90% than global |
🔍✨Examples
For records in the dataset where `idx` >= 500.500 AND `idx` < 548.500, the Accuracy is 6.9% lower than the global Accuracy.
| | idx | label | Predicted label |
---|---|---|---|
501 | 501 | positive | negative (p = 1.00) |
509 | 509 | positive | negative (p = 0.99) |
513 | 513 | negative | positive (p = 1.00) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | idx >= 444.500 AND idx < 500.500 | Recall = 0.844 | — | -6.81% than global |
🔍✨Examples
For records in the dataset where `idx` >= 444.500 AND `idx` < 500.500, the Recall is 6.81% lower than the global Recall.
| | idx | label | Predicted label |
---|---|---|---|
445 | 445 | positive | negative (p = 0.67) |
446 | 446 | negative | positive (p = 1.00) |
447 | 447 | positive | negative (p = 0.91) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_whitespace(text) < 0.154 | Recall = 0.844 | — | -6.81% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.154, the Recall is 6.81% lower than the global Recall.
| | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
1 | unflinchingly bleak and desperate | 0.117647 | negative | positive (p = 1.00) |
68 | good old-fashioned slash-and-hack is back ! | 0.136364 | positive | negative (p = 0.60) |
112 | hilariously inept and ridiculous . | 0.142857 | positive | negative (p = 1.00) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_word_length(text) >= 5.511 | Recall = 0.844 | — | -6.81% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 5.511, the Recall is 6.81% lower than the global Recall.
| | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
1 | unflinchingly bleak and desperate | 7.5 | negative | positive (p = 1.00) |
68 | good old-fashioned slash-and-hack is back ! | 6.33333 | positive | negative (p = 0.60) |
112 | hilariously inept and ridiculous . | 6 | positive | negative (p = 1.00) |
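The slice features `text_length`, `avg_whitespace` and `avg_word_length` are not defined in the report. The sketch below shows one reading that reproduces the values listed for rows 1, 68 and 112, under two assumptions: the definitions are inferred from those numbers rather than taken from the scanner's source, and each raw record ends with a trailing space that the tables above do not render.

```python
def text_length(text: str) -> int:
    """Number of characters, including whitespace."""
    return len(text)


def avg_whitespace(text: str) -> float:
    """Fraction of characters that are whitespace (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens (0.0 if there are none)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


# Row 1, with an ASSUMED trailing space (not visible in the report tables).
row_1 = "unflinchingly bleak and desperate "
print(text_length(row_1), round(avg_whitespace(row_1), 6), avg_word_length(row_1))
```

Because the examples are short, high-scoring rows on one feature tend to score on the other as well: fewer, longer words mean a lower whitespace share. This is why the same three records appear in both the `avg_whitespace` and `avg_word_length` slices.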
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | text_length(text) < 59.500 AND text_length(text) >= 50.500 | Precision = 0.800 | — | -6.27% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` < 59.500 AND `text_length(text)` >= 50.500, the Precision is 6.27% lower than the global Precision.
| | text | text_length(text) | label | Predicted label |
---|---|---|---|---|
139 | it 's not the ultimate depression-era gangster movie . | 55 | negative | positive (p = 0.98) |
183 | the lower your expectations , the more you 'll enjoy it . | 58 | negative | positive (p = 0.99) |
205 | falls neatly into the category of good stupid fun . | 52 | positive | negative (p = 0.92) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_whitespace(text) < 0.187 AND avg_whitespace(text) >= 0.183 | Precision = 0.800 | — | -6.27% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.187 AND `avg_whitespace(text)` >= 0.183, the Precision is 6.27% lower than the global Precision.
| | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
86 | the film flat lines when it should peak and is more missed opportunity and trifle than dark , decadent truffle . | 0.185841 | negative | positive (p = 0.93) |
147 | the talented and clever robert rodriguez perhaps put a little too much heart into his first film and did n't reserve enough for his second . | 0.184397 | negative | positive (p = 0.97) |
448 | something akin to a japanese alice through the looking glass , except that it seems to take itself far more seriously . | 0.183333 | positive | negative (p = 0.84) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_word_length(text) >= 4.354 AND avg_word_length(text) < 4.464 | Precision = 0.800 | — | -6.27% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 4.354 AND `avg_word_length(text)` < 4.464, the Precision is 6.27% lower than the global Precision.
| | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
86 | the film flat lines when it should peak and is more missed opportunity and trifle than dark , decadent truffle . | 4.38095 | negative | positive (p = 0.93) |
147 | the talented and clever robert rodriguez perhaps put a little too much heart into his first film and did n't reserve enough for his second . | 4.42308 | negative | positive (p = 0.97) |
448 | something akin to a japanese alice through the looking glass , except that it seems to take itself far more seriously . | 4.45455 | positive | negative (p = 0.84) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_whitespace(text) < 0.195 AND avg_whitespace(text) >= 0.192 | Recall = 0.850 | — | -6.12% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.195 AND `avg_whitespace(text)` >= 0.192, the Recall is 6.12% lower than the global Recall.
| | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
113 | this movie is maddening . | 0.192308 | negative | positive (p = 1.00) |
121 | it seems to me the film is about the art of ripping people off without ever letting them consciously know you have done so | 0.195122 | negative | positive (p = 0.98) |
142 | what better message than ` love thyself ' could young women of any size receive ? | 0.195122 | positive | negative (p = 1.00) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_word_length(text) >= 4.123 AND avg_word_length(text) < 4.209 | Recall = 0.850 | — | -6.12% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 4.123 AND `avg_word_length(text)` < 4.209, the Recall is 6.12% lower than the global Recall.
| | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
113 | this movie is maddening . | 4.2 | negative | positive (p = 1.00) |
121 | it seems to me the film is about the art of ripping people off without ever letting them consciously know you have done so | 4.125 | negative | positive (p = 0.98) |
142 | what better message than ` love thyself ' could young women of any size receive ? | 4.125 | positive | negative (p = 1.00) |
Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.
💡 What's Next?
- Check out the Giskard Space and improve your model.
- The Giskard community is always buzzing with ideas. 🐢🤔 What do you want to see next? Your feedback is our favorite fuel, so drop your thoughts in the community forum! 🗣️💬 Together, we're building something extraordinary.
🙌 Big Thanks!
We're grateful to have you on this adventure with us. 🚀🌟 Here's to more breakthroughs, laughter, and code magic! 🥂✨ Keep hugging that code and spreading the love! 💻 #Giskard #Huggingface #AISafety 🌈👏 Your enthusiasm, feedback, and contributions are what we seek. 🌟 Keep being awesome!