---
language:
- en
- multilingual
tags:
- grant-classification
- research-funding
- oecd
- text-classification
license: mit
datasets:
- SIRIS-Lab/grant-classification-dataset
metrics:
- accuracy
- f1
base_model: intfloat/multilingual-e5-large
---
# Grant Classification Model
This model classifies research grants according to a custom taxonomy based on the OECD's categorization of science, technology, and innovation (STI) policy instruments.
## Model Description
- Model architecture: Fine-tuned version of intfloat/multilingual-e5-large
- Language(s): Multilingual
- License: MIT
- Limitations: The model is specialized for grant classification and may not perform well on other text classification tasks
## Usage
### Basic usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model and tokenizer
model_name = "SIRIS-Lab/grant-classification-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example grant text
grant_text = """
Title: Advancing Quantum Computing Applications in Drug Discovery
Abstract: This project aims to develop novel quantum algorithms for simulating molecular interactions to accelerate the drug discovery process. The research will focus on overcoming current limitations in quantum hardware by developing error-mitigation techniques specific to chemistry applications.
Funder: National Science Foundation
Funding Scheme: Quantum Leap Challenge Institutes
Beneficiary: University of California, Berkeley
"""

# Get prediction
result = classifier(grant_text)
print(f"Predicted category: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")
```
### Batch processing for multiple grants
```python
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

# Load model and tokenizer
model_name = "SIRIS-Lab/grant-classification-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classification pipeline
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)

# Function to prepare grant text (pd.notna skips columns that are missing or NaN)
def prepare_grant_text(row):
    parts = []
    if pd.notna(row.get('title')):
        parts.append(f"Title: {row['title']}")
    if pd.notna(row.get('abstract')):
        parts.append(f"Abstract: {row['abstract']}")
    if pd.notna(row.get('funder')):
        parts.append(f"Funder: {row['funder']}")
    if pd.notna(row.get('funding_scheme')):
        parts.append(f"Funding Scheme: {row['funding_scheme']}")
    if pd.notna(row.get('beneficiary')):
        parts.append(f"Beneficiary: {row['beneficiary']}")
    return "\n".join(parts)

# Example data
grants_df = pd.read_csv("grants.csv")
grants_df['text_for_model'] = grants_df.apply(prepare_grant_text, axis=1)

# Classify grants (truncate to the model's 512-token limit)
results = classifier(grants_df['text_for_model'].tolist(), truncation=True, max_length=512)

# Add results to dataframe
grants_df['predicted_category'] = [r['label'] for r in results]
grants_df['confidence'] = [r['score'] for r in results]
```
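For larger collections, throughput usually benefits from running the pipeline on a GPU and batching inputs. Both are standard pipeline arguments; the device index and batch size below are illustrative values, not recommendations from the model authors.

```python
# Illustrative: GPU inference with batching (device index and batch size are placeholders)
classifier = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=0,       # first CUDA device; use device=-1 (or omit) for CPU
    batch_size=32,  # tune to available GPU memory
)
results = classifier(grants_df['text_for_model'].tolist(), truncation=True, max_length=512)
```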
## Classification Categories
The model classifies grants into the following categories:
- `business_rnd_innovation`: Direct allocation of funding to private firms for R&D and innovation activities with commercial applications
- `fellowships_scholarships`: Financial support for individual researchers or higher education students
- `institutional_funding`: Core funding for higher education institutions and public research institutes
- `networking_collaborative`: Tools to bring together various actors within the innovation system
- `other_research_funding`: Alternative funding mechanisms for R&D or higher education
- `out_of_scope`: Grants unrelated to research, development, or innovation
- `project_grants_public`: Direct funding for specific research projects in public institutions
- `research_infrastructure`: Funding for research facilities, equipment, and resources
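If you need the label set programmatically (for validation or for mapping onto your own taxonomy), it can be read from the model configuration. The snippet below assumes the categories listed above are stored in the config's `id2label` mapping, as is standard for sequence-classification checkpoints.

```python
from transformers import AutoConfig

# Read the label mapping shipped with the model
# (assumes the categories above are stored in config.id2label)
config = AutoConfig.from_pretrained("SIRIS-Lab/grant-classification-model")
for idx in sorted(config.id2label):
    print(idx, config.id2label[idx])
```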
## Training
This model was fine-tuned on a dataset of grant documents with annotations derived from a consensus of multiple LLM predictions (Gemma, Mistral, Qwen) and human validation. The training process included:
- Base model: intfloat/multilingual-e5-large
- Training approach: Fine-tuning with early stopping
- Optimization: AdamW optimizer with weight decay
- Sequence length: 512 tokens
- Batch size: 8
- Learning rate: 2e-5
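The exact training script is not included here; the following is a minimal sketch of a comparable fine-tuning setup using the hyperparameters listed above. Trainer's default optimizer is AdamW, consistent with the optimization note. The dataset split and column names, epoch count, weight-decay value, and early-stopping patience are illustrative assumptions, not documented settings.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)

# Split and column names ("train", "validation", "text", "label") are
# illustrative assumptions about the dataset layout.
dataset = load_dataset("SIRIS-Lab/grant-classification-dataset")

base_model = "intfloat/multilingual-e5-large"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=8)

def tokenize(batch):
    # Truncate to the 512-token sequence length used during training
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="grant-classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,              # illustrative; early stopping ends training sooner
    weight_decay=0.01,                # AdamW weight decay (illustrative value)
    eval_strategy="epoch",            # "evaluation_strategy" on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,              # enables dynamic padding via the default data collator
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```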
## Citation and References
This model is based on a custom taxonomy derived from the OECD's categorization of science, technology, and innovation (STI) policy instruments. For more information, see:
EC/OECD (2023), STIP Survey, https://stip.oecd.org
## Acknowledgements
- The model builds upon intfloat/multilingual-e5-large