---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
license: other
---

# Model Overview
This is a multilingual text classification model that enables data annotation, the creation of domain-specific blends, and the addition of metadata tags. The model classifies documents into one of 26 domain classes:

'Adult', 'Arts_and_Entertainment', 'Autos_and_Vehicles', 'Beauty_and_Fitness', 'Books_and_Literature', 'Business_and_Industrial', 'Computers_and_Electronics', 'Finance', 'Food_and_Drink', 'Games', 'Health', 'Hobbies_and_Leisure', 'Home_and_Garden', 'Internet_and_Telecom', 'Jobs_and_Education', 'Law_and_Government', 'News', 'Online_Communities', 'People_and_Society', 'Pets_and_Animals', 'Real_Estate', 'Science', 'Sensitive_Subjects', 'Shopping', 'Sports', 'Travel_and_Transportation'

It supports 52 languages (English plus the 51 listed below): 'ar', 'az', 'bg', 'bn', 'ca', 'cs', 'da', 'de', 'el', 'es', 'et', 'fa', 'fi', 'fr', 'gl', 'he', 'hi', 'hr', 'hu', 'hy', 'id', 'is', 'it', 'ka', 'kk', 'kn', 'ko', 'lt', 'lv', 'mk', 'ml', 'mr', 'ne', 'nl', 'no', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'tr', 'uk', 'ur', 'vi', 'ja', 'zh'
```
Code  Language Name
ar    Arabic
az    Azerbaijani
bg    Bulgarian
bn    Bengali
ca    Catalan
cs    Czech
da    Danish
de    German
el    Greek
es    Spanish
et    Estonian
fa    Persian
fi    Finnish
fr    French
gl    Galician
he    Hebrew
hi    Hindi
hr    Croatian
hu    Hungarian
hy    Armenian
id    Indonesian
is    Icelandic
it    Italian
ka    Georgian
kk    Kazakh
kn    Kannada
ko    Korean
lt    Lithuanian
lv    Latvian
mk    Macedonian
ml    Malayalam
mr    Marathi
ne    Nepali
nl    Dutch
no    Norwegian
pl    Polish
pt    Portuguese
ro    Romanian
ru    Russian
sk    Slovak
sl    Slovenian
sq    Albanian
sr    Serbian
sv    Swedish
ta    Tamil
tr    Turkish
uk    Ukrainian
ur    Urdu
vi    Vietnamese
ja    Japanese
zh    Chinese
```
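When routing documents to this classifier it can be useful to check the language code first; a minimal sketch, with the set copied from the list above (the helper name is illustrative, not part of any released API):

```python
# ISO 639-1 codes of the 52 supported languages (English plus the 51 listed above).
SUPPORTED_LANGUAGES = {
    "en", "ar", "az", "bg", "bn", "ca", "cs", "da", "de", "el", "es", "et",
    "fa", "fi", "fr", "gl", "he", "hi", "hr", "hu", "hy", "id", "is", "it",
    "ka", "kk", "kn", "ko", "lt", "lv", "mk", "ml", "mr", "ne", "nl", "no",
    "pl", "pt", "ro", "ru", "sk", "sl", "sq", "sr", "sv", "ta", "tr", "uk",
    "ur", "vi", "ja", "zh",
}

def is_supported(lang_code: str) -> bool:
    """Return True if the language code is one the classifier supports."""
    return lang_code.lower() in SUPPORTED_LANGUAGES
```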
# License
This model is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).

# References
- DeBERTaV3: [Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing](https://arxiv.org/abs/2111.09543)
- DeBERTa: [Decoding-enhanced BERT with Disentangled Attention](https://github.com/microsoft/DeBERTa)

# Model Architecture
- The model architecture is DeBERTa V3 Base
- Context length is 512 tokens

# How To Use in NVIDIA NeMo Curator
NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides pre-built pipelines for generating synthetic data to customize and evaluate generative AI systems.

The inference code for this model is available through the NeMo Curator GitHub repository. Check out this [example notebook](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) to get started.

# Input & Output
## Input
- Input Type: Text
- Input Format: String
- Input Parameters: 1D
- Other Properties Related to Input: Token limit of 512 tokens
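Because of the 512-token limit, longer documents must be truncated or chunked before classification. A rough, illustrative sketch using whitespace tokens as a stand-in for the model's actual tokenizer (real subword token counts will differ):

```python
def truncate_to_limit(text: str, max_tokens: int = 512) -> str:
    """Crude whitespace-token truncation; the model's own tokenizer
    produces subword tokens, so real counts differ from this estimate."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])
```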

## Output
- Output Type: Text Classifications
- Output Format: String
- Output Parameters: 1D
- Other Properties Related to Output: None

The model takes one or several paragraphs of text as input. Example input:
```
最年少受賞者はエイドリアン・ブロディの29歳、最年少候補者はジャッキー・クーパーの9歳。最年長受賞者、最年長候補者は、アンソニー・ホプキンスの83歳。
最多受賞者は3回受賞のダニエル・デイ=ルイス。2回受賞経験者はスペンサー・トレイシー、フレドリック・マーチ、ゲイリー・クーパー、ダスティン・ホフマン、トム・ハンクス、ジャック・ニコルソン(助演男優賞も1回受賞している)、ショーン・ペン、アンソニー・ホプキンスの8人。なお、マーロン・ブランドも2度受賞したが、2度目の受賞を拒否している。最多候補者はスペンサー・トレイシー、ローレンス・オリヴィエの9回。
死後に受賞したのはピーター・フィンチが唯一。ほか、ジェームズ・ディーン、スペンサー・トレイシー、マッシモ・トロイージ、チャドウィック・ボーズマンが死後にノミネートされ、うち2回死後にノミネートされたのはディーンのみである。
非白人(黒人)で初めて受賞したのはシドニー・ポワチエであり、英語以外の演技で受賞したのはロベルト・ベニーニである。
```

The model outputs one of the 26 domain classes as the predicted domain for each input sample. Example output:
```
Arts_and_Entertainment
```
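A classification head of this kind produces one score per class, and the predicted domain is the argmax over the 26 labels. A self-contained sketch of that final step (the scores below are illustrative, not taken from the released model; the label order is copied from the class list above):

```python
DOMAIN_LABELS = [
    "Adult", "Arts_and_Entertainment", "Autos_and_Vehicles", "Beauty_and_Fitness",
    "Books_and_Literature", "Business_and_Industrial", "Computers_and_Electronics",
    "Finance", "Food_and_Drink", "Games", "Health", "Hobbies_and_Leisure",
    "Home_and_Garden", "Internet_and_Telecom", "Jobs_and_Education",
    "Law_and_Government", "News", "Online_Communities", "People_and_Society",
    "Pets_and_Animals", "Real_Estate", "Science", "Sensitive_Subjects",
    "Shopping", "Sports", "Travel_and_Transportation",
]

def predict_label(scores: list) -> str:
    """Map a vector of 26 per-class scores to the predicted domain name."""
    if len(scores) != len(DOMAIN_LABELS):
        raise ValueError("expected one score per domain class")
    best = max(range(len(scores)), key=scores.__getitem__)
    return DOMAIN_LABELS[best]
```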

# Software Integration
- Runtime Engine: Python 3.10 and NeMo Curator
- Supported Hardware Microarchitecture Compatibility: NVIDIA GPU, Volta™ or higher (compute capability 7.0+), CUDA 12 or above
- Preferred/Supported Operating System(s): Ubuntu 22.04/20.04

# Training, Testing, and Evaluation Dataset
## Training data
- 1 million Common Crawl samples, labeled using Google Cloud's Natural Language [API](https://cloud.google.com/natural-language/docs/classifying-text)
- 500k Wikipedia articles, curated using [Wikipedia-API](https://pypi.org/project/Wikipedia-API/)

## Training steps
- Translate the English training data into 51 other languages, so that each sample has 52 copies (the original plus 51 translations).
- During training, randomly pick one of the 52 copies for each sample.
- During validation, evaluate the model on the validation set 52 times to get a validation score for each language.
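The per-sample language sampling described above can be sketched as follows (the data layout and function name are illustrative: each sample is assumed to be stored as a mapping from language code to its translated copy):

```python
import random

def pick_training_copy(sample: dict, rng: random.Random) -> tuple:
    """Pick one of a sample's language copies uniformly at random, so the
    model sees the sample in a (possibly) different language each epoch."""
    lang = rng.choice(sorted(sample))  # sort so a seeded rng is reproducible
    return lang, sample[lang]
```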

## Evaluation
- Metric: PR-AUC
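PR-AUC (area under the precision–recall curve) is commonly approximated by average precision. A pure-Python sketch for binary labels, shown only to make the metric concrete (the evaluation itself would apply it per class over the 26 domains):

```python
def average_precision(labels: list, scores: list) -> float:
    """Average precision: mean of the precision at each true-positive rank,
    with candidates sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(sum(labels), 1)
```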

# Inference
- Engine: PyTorch
- Test Hardware: V100

# Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability).