---
pipeline_tag: text-ranking
library_name: sentence-transformers
license: mit
---
# Model Card: Assisting Mathematical Formalization with A Learning-based Premise Retriever

## Model Description
This model is the first released version of a premise retriever for Lean. It embeds Lean proof states and candidate premises into a shared vector space, so that relevant premises can be retrieved by similarity to the current state. The model follows the architecture described in the paper [Assisting Mathematical Formalization with A Learning-based Premise Retriever](https://arxiv.org/abs/2501.13959).
The model implementation and training code are available in the paper's accompanying repository.
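The model consumes proof states serialized as tagged strings, with hypotheses inside `<VAR> … </VAR>` tags and the goal inside `<GOAL> … </GOAL>` tags. This convention is inferred from the usage examples below; for instance, a Lean proof state such as

```text
n : ℕ
⊢ n + 0 = n
```

would be serialized as `<VAR> (n : ℕ) </VAR> <GOAL> n + 0 = n </GOAL>`.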
## Usage

You can use this model with the `sentence-transformers` library to embed queries and premises, and then rank the premises by their similarity to the query.
```python
from sentence_transformers import SentenceTransformer, util
import torch

# Load the pretrained model
model = SentenceTransformer('ruc-ai4math/Lean_State_Search_Random')

# Example Lean proof state (query) and a list of candidate premises
query = "<GOAL> (n : ℕ), n + 0 = n </GOAL>"
premises = [
    "<VAR> (n : ℕ) </VAR> <GOAL> n + 0 = n </GOAL>",
    "<VAR> (n m : ℕ) </VAR> <GOAL> n + m = m + n </GOAL>",
    "<VAR> (n : ℕ) </VAR> <GOAL> n = n </GOAL>",
    "lemma add_zero (n : ℕ) : n + 0 = n := by sorry",  # an actual Lean lemma
]

# Encode the query and premises into embeddings
query_embedding = model.encode(query, convert_to_tensor=True)
premise_embeddings = model.encode(premises, convert_to_tensor=True)

# Calculate cosine similarity between the query and all premises
cosine_scores = util.cos_sim(query_embedding, premise_embeddings)

# Print the score for each premise
print("Similarity scores:")
for i, score in enumerate(cosine_scores[0]):
    print(f" - Premise {i+1}: '{premises[i]}', Score: {score.item():.4f}")

# Find the premise with the highest similarity score
best_match_idx = torch.argmax(cosine_scores).item()
print(f"\nBest matching premise: '{premises[best_match_idx]}'")
```
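For retrieval over a larger premise corpus, `sentence-transformers` provides `util.semantic_search`, which returns the top-k hits per query directly instead of a full similarity matrix. Below is a minimal sketch; the three-entry corpus is a stand-in for a real premise database (e.g. premises extracted from Mathlib), and in practice the corpus embeddings would be computed once and reused across queries.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('ruc-ai4math/Lean_State_Search_Random')

# Stand-in premise corpus; replace with the full premise set in practice.
corpus = [
    "<VAR> (n : ℕ) </VAR> <GOAL> n + 0 = n </GOAL>",
    "<VAR> (n m : ℕ) </VAR> <GOAL> n + m = m + n </GOAL>",
    "<VAR> (n : ℕ) </VAR> <GOAL> n = n </GOAL>",
]

# Pre-compute corpus embeddings once, then embed each query as it arrives.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(
    "<GOAL> (n : ℕ), n + 0 = n </GOAL>", convert_to_tensor=True
)

# semantic_search returns, for each query, a ranked list of
# {'corpus_id': ..., 'score': ...} dicts.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")
```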
## Citation
If you use this model, please cite the following paper:
```bibtex
@misc{tao2025assistingmathematicalformalizationlearningbased,
      title={Assisting Mathematical Formalization with A Learning-based Premise Retriever},
      author={Yicheng Tao and Haotian Liu and Shanwen Wang and Hongteng Xu},
      year={2025},
      eprint={2501.13959},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.13959},
}
```