---
language:
- en
- multilingual
tags:
- grant-classification
- research-funding
- oecd
- text-classification
license: "mit"
datasets:
- "SIRIS-Lab/grant-classification-dataset"
metrics:
- accuracy
- f1
base_model: "intfloat/multilingual-e5-large"
---

# Grant Classification Model

This model classifies research grants according to a custom taxonomy based on the OECD's categorization of science, technology, and innovation (STI) policy instruments.

## Model Description

- **Model architecture**: Fine-tuned version of [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
- **Language(s)**: Multilingual
- **License**: MIT
- **Limitations**: The model is specialized for grant classification and may not perform well on other text classification tasks

## Usage

### Basic usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model and tokenizer
model_name = "your-username/grant-classification-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example grant text
grant_text = """
Title: Advancing Quantum Computing Applications in Drug Discovery
Abstract: This project aims to develop novel quantum algorithms for simulating molecular interactions to accelerate the drug discovery process. The research will focus on overcoming current limitations in quantum hardware by developing error-mitigation techniques specific to chemistry applications.
Funder: National Science Foundation
Funding Scheme: Quantum Leap Challenge Institutes
Beneficiary: University of California, Berkeley
"""

# Get prediction; truncate inputs that exceed the model's 512-token limit
result = classifier(grant_text, truncation=True)
print(f"Predicted category: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")
```
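
To see the score for every category instead of only the top label, the same pipeline call accepts a `top_k` argument (a sketch reusing the `classifier` and `grant_text` defined above; `top_k=None` returns all labels in recent transformers versions):

```python
# Full score distribution over all categories, sorted by confidence
all_scores = classifier(grant_text, top_k=None, truncation=True)
for entry in all_scores:
    print(f"{entry['label']}: {entry['score']:.4f}")
```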

### Batch processing for multiple grants

```python
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

# Load model and tokenizer
model_name = "your-username/grant-classification-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classification pipeline
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)

# Combine the available grant fields into the structured text format shown above
def prepare_grant_text(row):
    parts = []
    if row.get('title'):
        parts.append(f"Title: {row['title']}")
    if row.get('abstract'):
        parts.append(f"Abstract: {row['abstract']}")
    if row.get('funder'):
        parts.append(f"Funder: {row['funder']}")
    if row.get('funding_scheme'):
        parts.append(f"Funding Scheme: {row['funding_scheme']}")
    if row.get('beneficiary'):
        parts.append(f"Beneficiary: {row['beneficiary']}")
    return "\n".join(parts)

# Example data
grants_df = pd.read_csv("grants.csv")
grants_df['text_for_model'] = grants_df.apply(prepare_grant_text, axis=1)

# Classify grants, truncating any text that exceeds the 512-token limit
results = classifier(grants_df['text_for_model'].tolist(), truncation=True)

# Add results to the dataframe
grants_df['predicted_category'] = [r['label'] for r in results]
grants_df['confidence'] = [r['score'] for r in results]
```

## Classification Categories

The model classifies grants into the following categories (the exact label strings emitted by the checkpoint can be inspected as shown below):

1. **business_rnd_innovation**: Direct allocation of funding to private firms for R&D and innovation activities with commercial applications
2. **fellowships_scholarships**: Financial support for individual researchers or higher education students
3. **institutional_funding**: Core funding for higher education institutions and public research institutes
4. **networking_collaborative**: Tools to bring together various actors within the innovation system
5. **other_research_funding**: Alternative funding mechanisms for R&D or higher education
6. **out_of_scope**: Grants unrelated to research, development, or innovation
7. **project_grants_public**: Direct funding for specific research projects in public institutions
8. **research_infrastructure**: Funding for research facilities, equipment, and resources
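
The mapping from output indices to these label strings is stored in the model config; a minimal sketch for printing it, assuming the checkpoint follows the standard transformers `id2label` convention and using the same placeholder repo id as in the usage examples:

```python
from transformers import AutoModelForSequenceClassification

# Placeholder repo id, as in the usage examples above
model = AutoModelForSequenceClassification.from_pretrained("your-username/grant-classification-model")

# id2label maps class indices to the label strings the pipeline returns
for idx in sorted(model.config.id2label):
    print(idx, model.config.id2label[idx])
```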

## Training

This model was fine-tuned on a dataset of grant documents with annotations derived from a consensus of multiple LLM predictions (Gemma, Mistral, Qwen) and human validation.
The training process used the following settings; a `Trainer` configuration sketch follows the list:

- Base model: [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
- Training approach: Fine-tuning with early stopping
- Optimization: AdamW optimizer with weight decay
- Sequence length: 512 tokens
- Batch size: 8
- Learning rate: 2e-5
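
The original training script is not included here; the following is a minimal sketch of how these settings map onto the Hugging Face `Trainer` API. Dataset preparation is elided, `tokenized_train`/`tokenized_eval` are placeholders, and the weight-decay value, epoch cap, and early-stopping patience are assumptions:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
# Inputs are tokenized with truncation to the 512-token sequence length (not shown)

model = AutoModelForSequenceClassification.from_pretrained(
    "intfloat/multilingual-e5-large", num_labels=8  # eight taxonomy categories
)

args = TrainingArguments(
    output_dir="grant-classifier",
    learning_rate=2e-5,                 # as listed above
    per_device_train_batch_size=8,      # batch size 8
    weight_decay=0.01,                  # AdamW with weight decay (value assumed)
    num_train_epochs=10,                # upper bound; early stopping ends training
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",  # monitored metric (assumed)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,      # placeholder: tokenized training split
    eval_dataset=tokenized_eval,        # placeholder: tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience assumed
)
trainer.train()
```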

## Citation and References

This model is based on a custom taxonomy derived from the OECD's categorization of science, technology, and innovation (STI) policy instruments.
For more information, see:

- EC/OECD (2023), STIP Survey, https://stip.oecd.org

## Acknowledgements

- The model builds upon [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)