---
pipeline_tag: text-ranking
library_name: sentence-transformers
license: mit
---
# Model Card: Assisting Mathematical Formalization with A Learning-based Premise Retriever

## Model Description
This model is the first released version of a premise retriever for Lean. It embeds Lean proof states and candidate premises into a shared vector space, so that relevant premises can be retrieved by similarity to the current state. The model follows the architecture described in the paper [Assisting Mathematical Formalization with A Learning-based Premise Retriever](https://arxiv.org/abs/2501.13959).
The model implementation and training code are available in the paper's accompanying repository.
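The model consumes proof states serialized as tagged strings, with hypotheses inside `<VAR> … </VAR>` tags and the goal inside `<GOAL> … </GOAL>` tags. This convention is inferred from the usage examples below; for instance, a Lean proof state such as

```text
n : ℕ
⊢ n + 0 = n
```

would be serialized as `<VAR> (n : ℕ) </VAR> <GOAL> n + 0 = n </GOAL>`.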
## Usage

You can use this model with the `sentence-transformers` library to embed queries and premises, and then rank the premises by their similarity to the query.
```python
from sentence_transformers import SentenceTransformer, util
import torch

# Load the pretrained model
model = SentenceTransformer('ruc-ai4math/Lean_State_Search_Random')

# Example Lean proof state (query) and a list of candidate premises
query = "<GOAL> (n : ℕ), n + 0 = n </GOAL>"
premises = [
    "<VAR> (n : ℕ) </VAR> <GOAL> n + 0 = n </GOAL>",
    "<VAR> (n m : ℕ) </VAR> <GOAL> n + m = m + n </GOAL>",
    "<VAR> (n : ℕ) </VAR> <GOAL> n = n </GOAL>",
    "lemma add_zero (n : ℕ) : n + 0 = n := by sorry",  # an actual Lean lemma
]

# Encode the query and premises into embeddings
query_embedding = model.encode(query, convert_to_tensor=True)
premise_embeddings = model.encode(premises, convert_to_tensor=True)

# Calculate cosine similarity between the query and all premises
cosine_scores = util.cos_sim(query_embedding, premise_embeddings)

# Print the score for each premise
print("Similarity scores:")
for i, score in enumerate(cosine_scores[0]):
    print(f" - Premise {i+1}: '{premises[i]}', Score: {score.item():.4f}")

# Find the premise with the highest similarity score
best_match_idx = torch.argmax(cosine_scores).item()
print(f"\nBest matching premise: '{premises[best_match_idx]}'")
```
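For retrieval over a larger premise corpus, `sentence-transformers` provides `util.semantic_search`, which returns the top-k hits per query directly instead of a full similarity matrix. Below is a minimal sketch; the three-entry corpus is a stand-in for a real premise database (e.g. premises extracted from Mathlib), and in practice the corpus embeddings would be computed once and reused across queries.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('ruc-ai4math/Lean_State_Search_Random')

# Stand-in premise corpus; replace with the full premise set in practice.
corpus = [
    "<VAR> (n : ℕ) </VAR> <GOAL> n + 0 = n </GOAL>",
    "<VAR> (n m : ℕ) </VAR> <GOAL> n + m = m + n </GOAL>",
    "<VAR> (n : ℕ) </VAR> <GOAL> n = n </GOAL>",
]

# Pre-compute corpus embeddings once, then embed each query as it arrives.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(
    "<GOAL> (n : ℕ), n + 0 = n </GOAL>", convert_to_tensor=True
)

# semantic_search returns, for each query, a ranked list of
# {'corpus_id': ..., 'score': ...} dicts.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```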
## Citation
If you use this model, please cite the following paper:
```bibtex
@misc{tao2025assistingmathematicalformalizationlearningbased,
      title={Assisting Mathematical Formalization with A Learning-based Premise Retriever},
      author={Yicheng Tao and Haotian Liu and Shanwen Wang and Hongteng Xu},
      year={2025},
      eprint={2501.13959},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.13959},
}
```