---
pipeline_tag: text-ranking
library_name: sentence-transformers
license: mit
---

# Model Card: Assisting Mathematical Formalization with A Learning-based Premise Retriever

## Model Description

This model is the first version designed for **premise retrieval** in **Lean**. It represents queries using the **Lean proof state** and follows the architecture described in the paper:

[Assisting Mathematical Formalization with A Learning-based Premise Retriever](https://arxiv.org/abs/2501.13959)

The implementation is available at:

[GitHub Repository](https://github.com/ruc-ai4math/Premise-Retrieval)

[Try our model](https://premise-search.com)

## Usage

You can use this model with the `sentence-transformers` library to embed proof states (queries) and premises, then rank the premises by cosine similarity for retrieval.

```python
from sentence_transformers import SentenceTransformer, util
import torch

# Load the pretrained model
model = SentenceTransformer('ruc-ai4math/Lean_State_Search_Random')

# Example Lean proof state (query) and a list of premises
query = "<GOAL> (n : \u2115), n + 0 = n </GOAL>"
premises = [
    "<VAR> (n : \u2115) </VAR> <GOAL> n + 0 = n </GOAL>",
    "<VAR> (n m : \u2115) </VAR> <GOAL> n + m = m + n </GOAL>",
    "<VAR> (n : \u2115) </VAR> <GOAL> n = n </GOAL>",
    "lemma add_zero (n : \u2115) : n + 0 = n := by sorry" # An actual Lean lemma
]

# Encode the query and premises into embeddings
query_embedding = model.encode(query, convert_to_tensor=True)
premise_embeddings = model.encode(premises, convert_to_tensor=True)

# Calculate cosine similarity between the query and all premises
cosine_scores = util.cos_sim(query_embedding, premise_embeddings)

# Print the scores for each premise
print("Similarity scores:")
for i, score in enumerate(cosine_scores[0]):
    print(f"  - Premise {i+1}: '{premises[i]}', Score: {score.item():.4f}")

# Find the index of the premise with the highest similarity score
best_match_idx = torch.argmax(cosine_scores).item()
print(f"
Best matching premise: '{premises[best_match_idx]}'")
```
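
For retrieval over a larger premise library, it is usually cheaper to embed the library once and query it with `util.semantic_search`, which returns the top-k nearest premises by cosine similarity. The sketch below assumes a hypothetical `premise_corpus` list in the same tagged format as above; it is an illustrative pattern, not the paper's exact retrieval pipeline.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('ruc-ai4math/Lean_State_Search_Random')

# Hypothetical premise library; in practice this would be the premise
# statements extracted from your Lean environment, tagged as above.
premise_corpus = [
    "<VAR> (n : \u2115) </VAR> <GOAL> n + 0 = n </GOAL>",
    "<VAR> (n m : \u2115) </VAR> <GOAL> n + m = m + n </GOAL>",
    "<VAR> (n : \u2115) </VAR> <GOAL> n = n </GOAL>",
]

# Embed the corpus once and reuse the embeddings for every query.
corpus_embeddings = model.encode(premise_corpus, convert_to_tensor=True)

query = "<GOAL> (n : \u2115), n + 0 = n </GOAL>"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 premises by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {premise_corpus[hit['corpus_id']]}")
```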

## Citation

If you use this model, please cite the following paper:

```bibtex
@misc{tao2025assistingmathematicalformalizationlearningbased,
      title={Assisting Mathematical Formalization with A Learning-based Premise Retriever}, 
      author={Yicheng Tao and Haotian Liu and Shanwen Wang and Hongteng Xu},
      year={2025},
      eprint={2501.13959},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.13959}, 
}
```