Structure Extraction Model

nebuia_extract_small is an extraction model inspired by NuExtract. It is a fine-tuned version of Qwen2-1.5B, trained on a private, high-quality synthetic dataset for entity extraction from Spanish legal texts, with an 8k context length. Like NuExtract, it accepts a JSON template describing the information you need to extract. NebuIA Extract specializes in identifying and extracting legal entities and relevant information from Spanish legal documents.
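
For illustration, a template pairs each field you want with an empty placeholder (empty strings for single values, empty arrays for lists), following the NuExtract convention. The field names below are hypothetical, not a fixed schema:

{
    "notario": "",
    "partes": [],
    "fecha_firma": ""
}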

Model Details

Model Description

  • Developed by: NebuIA
  • Language(s) (NLP): Spanish (es)
  • License: MIT
  • Finetuned from model: Qwen2-1.5B
  • Model size: 1.54B parameters (BF16, Safetensors)

Uses

The model uses the same prompt template as NuExtract. The example below loads the model with transformers and runs a single extraction.
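
For reference, the prompt that predict_extract assembles has this layout (the <|input|> and <|output|> markers come directly from the code below; the brace placeholders are illustrative):

<|input|>
### Template:
{pretty-printed JSON template}
### Text:
{document text}
<|output|>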

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def predict_extract(model, tokenizer, text, schema):
    # Normalize the template so the model always sees a consistently indented schema.
    schema = json.dumps(json.loads(schema), indent=4)

    # Assemble the NuExtract-style prompt: template first, then the source text.
    input_llm = "<|input|>\n### Template:\n" + schema + "\n"
    input_llm += "### Text:\n" + text + "\n<|output|>\n"

    # Tokenize and move the inputs to the GPU; long documents are truncated.
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to("cuda")

    # Generate, then keep only the text between the output markers.
    # max_length=4000 mirrors the tokenization limit above; without it, generate()
    # may stop after the default number of new tokens.
    output = tokenizer.decode(model.generate(**input_ids, max_length=4000)[0], skip_special_tokens=True)
    return output.split("<|output|>")[1].split("<|end-output|>")[0]


# Load the model in bfloat16 with the repository's custom code enabled.
model = AutoModelForCausalLM.from_pretrained("NebuIA/nebuia_extract_small", trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("NebuIA/nebuia_extract_small", trust_remote_code=True)

# Move to GPU and switch to inference mode.
model.to("cuda")
model.eval()

text = """large legal text"""

schema = """{
    "calusulas": [],
    "notario": "",
    "jurisdiccion": {
      "clausula_jurisdiccion": "",
      "lugar": ""
    }
}"""

prediction = predict_extract(model, tokenizer, text, schema)
print(prediction)
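
The function returns the completed template as a JSON string. For the template above, an output could look like the following sketch (values invented for illustration, not produced by the model):

{
    "clausulas": ["PRIMERA. Objeto del contrato", "SEGUNDA. Vigencia"],
    "notario": "Lic. Juan Perez, Notario Publico No. 10",
    "jurisdiccion": {
        "clausula_jurisdiccion": "Las partes se someten a los tribunales de la Ciudad de Mexico",
        "lugar": "Ciudad de Mexico"
    }
}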