---
pipeline_tag: text-ranking
library_name: sentence-transformers
license: mit
---
# Model Card: Assisting Mathematical Formalization with A Learning-based Premise Retriever
## Model Description
This model is the first version designed for **premise retrieval** in **Lean**: it embeds the **state representation** of a Lean proof together with candidate premises, so that relevant premises can be retrieved by similarity. The model follows the architecture described in the paper:
[Assisting Mathematical Formalization with A Learning-based Premise Retriever](https://arxiv.org/abs/2501.13959)
The model implementation and code are available in the [GitHub Repository](https://github.com/ruc-ai4math/Premise-Retrieval), and you can [try the model](https://premise-search.com) in your browser.
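The usage example below serializes inputs with lightweight tags: variables/hypotheses inside `<VAR> ... </VAR>` and the goal inside `<GOAL> ... </GOAL>`. As an illustration (inferred from the examples in this card; see the paper and repository for the exact input format), a proof state with hypothesis `n : ℕ` and goal `n + 0 = n` becomes:
```
<VAR> (n : ℕ) </VAR> <GOAL> n + 0 = n </GOAL>
```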
## Usage
You can use this model with the `sentence-transformers` library to embed queries and premises and then calculate their similarity for retrieval.
```python
from sentence_transformers import SentenceTransformer, util
import torch
# Load the pretrained model
model = SentenceTransformer('ruc-ai4math/Lean_State_Search_Random')
# Example Lean proof state (query) and a list of premises
query = "<GOAL> (n : \u2115), n + 0 = n </GOAL>"
premises = [
    "<VAR> (n : \u2115) </VAR> <GOAL> n + 0 = n </GOAL>",
    "<VAR> (n m : \u2115) </VAR> <GOAL> n + m = m + n </GOAL>",
    "<VAR> (n : \u2115) </VAR> <GOAL> n = n </GOAL>",
    "lemma add_zero (n : \u2115) : n + 0 = n := by sorry",  # an actual Lean lemma
]
# Encode the query and premises into embeddings
query_embedding = model.encode(query, convert_to_tensor=True)
premise_embeddings = model.encode(premises, convert_to_tensor=True)
# Calculate cosine similarity between the query and all premises
cosine_scores = util.cos_sim(query_embedding, premise_embeddings)
# Print the scores for each premise
print("Similarity scores:")
for i, score in enumerate(cosine_scores[0]):
    print(f"  - Premise {i+1}: '{premises[i]}', Score: {score.item():.4f}")
# Find the index of the premise with the highest similarity score
best_match_idx = torch.argmax(cosine_scores).item()
print(f"
Best matching premise: '{premises[best_match_idx]}'")
```
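For retrieval over a larger premise corpus, you can reuse the same embeddings with `util.semantic_search`, a standard `sentence-transformers` utility that returns the top-k most similar corpus entries per query. The snippet below is a minimal sketch of that pattern; the toy corpus simply reuses the premises from the example above.
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('ruc-ai4math/Lean_State_Search_Random')

# In practice this would be the full premise library; here we reuse the toy premises.
corpus = [
    "<VAR> (n : \u2115) </VAR> <GOAL> n + 0 = n </GOAL>",
    "<VAR> (n m : \u2115) </VAR> <GOAL> n + m = m + n </GOAL>",
    "<VAR> (n : \u2115) </VAR> <GOAL> n = n </GOAL>",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "<GOAL> (n : \u2115), n + 0 = n </GOAL>"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 premises by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{corpus[hit['corpus_id']]}  (score: {hit['score']:.4f})")
```
For a real premise library, `corpus_embeddings` would be computed once and cached, since only the query needs to be embedded at search time.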
## Citation
If you use this model, please cite the following paper:
```bibtex
@misc{tao2025assistingmathematicalformalizationlearningbased,
title={Assisting Mathematical Formalization with A Learning-based Premise Retriever},
author={Yicheng Tao and Haotian Liu and Shanwen Wang and Hongteng Xu},
year={2025},
eprint={2501.13959},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.13959},
}
``` |