---
language:
- en
- multilingual
tags:
- grant-classification
- research-funding
- oecd
- text-classification
license: mit
datasets:
- SIRIS-Lab/grant-classification-dataset
metrics:
- accuracy
- f1
base_model: intfloat/multilingual-e5-large
---
# Grant Classification Model
This model classifies research grants according to a custom taxonomy based on the OECD's categorization of science, technology, and innovation (STI) policy instruments.
## Model Description
- Model architecture: Fine-tuned version of intfloat/multilingual-e5-large
- Language(s): Multilingual
- License: MIT
- Limitations: The model is specialized for grant classification and may not perform well on other text classification tasks
## Usage
### Basic usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model and tokenizer
model_name = "SIRIS-Lab/grant-classification-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example grant text
grant_text = """
Title: Advancing Quantum Computing Applications in Drug Discovery
Abstract: This project aims to develop novel quantum algorithms for simulating molecular interactions to accelerate the drug discovery process. The research will focus on overcoming current limitations in quantum hardware by developing error-mitigation techniques specific to chemistry applications.
Funder: National Science Foundation
Funding Scheme: Quantum Leap Challenge Institutes
Beneficiary: University of California, Berkeley
"""

# Get prediction
result = classifier(grant_text)
print(f"Predicted category: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")
```
### Batch processing for multiple grants
```python
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

# Load model and tokenizer
model_name = "SIRIS-Lab/grant-classification-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classification pipeline
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)

# Function to prepare grant text (pd.notna skips columns that are missing or NaN)
def prepare_grant_text(row):
    parts = []
    if pd.notna(row.get('title')):
        parts.append(f"Title: {row['title']}")
    if pd.notna(row.get('abstract')):
        parts.append(f"Abstract: {row['abstract']}")
    if pd.notna(row.get('funder')):
        parts.append(f"Funder: {row['funder']}")
    if pd.notna(row.get('funding_scheme')):
        parts.append(f"Funding Scheme: {row['funding_scheme']}")
    if pd.notna(row.get('beneficiary')):
        parts.append(f"Beneficiary: {row['beneficiary']}")
    return "\n".join(parts)

# Example data
grants_df = pd.read_csv("grants.csv")
grants_df['text_for_model'] = grants_df.apply(prepare_grant_text, axis=1)

# Classify grants (truncate to the model's 512-token limit)
results = classifier(grants_df['text_for_model'].tolist(), truncation=True, max_length=512)

# Add results to dataframe
grants_df['predicted_category'] = [r['label'] for r in results]
grants_df['confidence'] = [r['score'] for r in results]
```
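For larger collections, throughput usually benefits from running the pipeline on a GPU and batching inputs. Both are standard pipeline arguments; the device index and batch size below are illustrative values, not recommendations from the model authors.

```python
# Illustrative: GPU inference with batching (device index and batch size are placeholders)
classifier = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=0,       # first CUDA device; use device=-1 (or omit) for CPU
    batch_size=32,  # tune to available GPU memory
)
results = classifier(grants_df['text_for_model'].tolist(), truncation=True, max_length=512)
```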
## Classification Categories
The model classifies grants into the following categories:
- `business_rnd_innovation`: Direct allocation of funding to private firms for R&D and innovation activities with commercial applications
- `fellowships_scholarships`: Financial support for individual researchers or higher education students
- `institutional_funding`: Core funding for higher education institutions and public research institutes
- `networking_collaborative`: Tools to bring together various actors within the innovation system
- `other_research_funding`: Alternative funding mechanisms for R&D or higher education
- `out_of_scope`: Grants unrelated to research, development, or innovation
- `project_grants_public`: Direct funding for specific research projects in public institutions
- `research_infrastructure`: Funding for research facilities, equipment, and resources
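If you need the label set programmatically (for validation or for mapping onto your own taxonomy), it can be read from the model configuration. The snippet below assumes the categories listed above are stored in the config's `id2label` mapping, as is standard for sequence-classification checkpoints.

```python
from transformers import AutoConfig

# Read the label mapping shipped with the model
# (assumes the categories above are stored in config.id2label)
config = AutoConfig.from_pretrained("SIRIS-Lab/grant-classification-model")
for idx in sorted(config.id2label):
    print(idx, config.id2label[idx])
```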
## Training
This model was fine-tuned on a dataset of grant documents with annotations derived from a consensus of multiple LLM predictions (Gemma, Mistral, Qwen) and human validation. The training process included:
- Base model: intfloat/multilingual-e5-large
- Training approach: Fine-tuning with early stopping
- Optimization: AdamW optimizer with weight decay
- Sequence length: 512 tokens
- Batch size: 8
- Learning rate: 2e-5
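The exact training script is not included here; the following is a minimal sketch of a comparable fine-tuning setup using the hyperparameters listed above. Trainer's default optimizer is AdamW, consistent with the optimization note. The dataset split and column names, epoch count, weight-decay value, and early-stopping patience are illustrative assumptions, not documented settings.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)

# Split and column names ("train", "validation", "text", "label") are
# illustrative assumptions about the dataset layout.
dataset = load_dataset("SIRIS-Lab/grant-classification-dataset")

base_model = "intfloat/multilingual-e5-large"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=8)

def tokenize(batch):
    # Truncate to the 512-token sequence length used during training
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="grant-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,              # illustrative; early stopping ends training sooner
    weight_decay=0.01,                # AdamW weight decay (illustrative value)
    eval_strategy="epoch",            # "evaluation_strategy" on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,              # enables dynamic padding via the default data collator
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```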
## Citation and References
This model is based on a custom taxonomy derived from the OECD's categorization of science, technology, and innovation (STI) policy instruments. For more information, see:
EC/OECD (2023), STIP Survey, https://stip.oecd.org
## Acknowledgements
- The model builds upon intfloat/multilingual-e5-large