Update README.md
README.md
CHANGED
@@ -14,4 +14,40 @@ tags:
 * This model helps you **find** text within **ancient Chinese** literature, but you can **search with modern Chinese**
 
 # Cross-language search
-## Search the ancient with the modern
+## Search the ancient with the modern
+```python
+from unpackai.interp import CosineSearch
+from sentence_transformers import SentenceTransformer
+import pandas as pd
+import numpy as np
+
+TAG = "raynardj/xlsearch-cross-lang-search-zh-vs-classicical-cn"
+encoder = SentenceTransformer(TAG)
+
+# all_lines is a list of all your sentences:
+# one book split into sentences, or many books combined
+all_lines = ["句子1","句子2",...]
+vec = encoder.encode(all_lines, batch_size=32, show_progress_bar=True)
+
+# cosine similarity searcher
+cosine = CosineSearch(vec)
+
+def search(text):
+    enc = encoder.encode(text) # encode the search key
+    order = cosine(enc) # indices into all_lines, used below to pick the top 5 matches
+    sentence_df = pd.DataFrame({"sentence":np.array(all_lines)[order[:5]]})
+    return sentence_df
+```
+
+After splitting the Shiji (Records of the Grand Historian) into sentences, a search returns the following:
+```python
+>>> search("他是一个很慷慨的人")  # "He is a very generous person"
+```
+```
+sentence
+0 季布者,楚人也。为气任侠,有名於楚。
+1 董仲舒为人廉直。
+2 大将军为人仁善退让,以和柔自媚於上,然天下未有称也。
+3 勃为人木彊敦厚,高帝以为可属大事。
+4 石奢者,楚昭王相也。坚直廉正,无所阿避。
+```
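For readers who want to see what the ranking step in `search` amounts to, here is a minimal, dependency-free sketch of cosine-similarity ranking. It is not the `unpackai` implementation: `cosine_rank` and `toy_vectors` are hypothetical names introduced for illustration, and the 2-D vectors stand in for real `encoder.encode(...)` output. It assumes the searcher returns corpus indices ordered from most to least similar, which is how the README's `order[:5]` uses the result.

```python
import math


def cosine_rank(query, corpus):
    """Return corpus indices sorted from most to least similar to `query`."""

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q_norm = norm(query)
    scores = []
    for row in corpus:
        denom = q_norm * norm(row)
        dot = sum(a * b for a, b in zip(query, row))
        # Treat zero vectors as having no similarity instead of dividing by zero.
        scores.append(dot / denom if denom else 0.0)
    # Highest cosine similarity first.
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)


# Toy 2-D "embeddings" standing in for real sentence vectors (hypothetical data).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_order = cosine_rank([0.9, 0.1], toy_vectors)
print(toy_order[:2])  # the two closest corpus entries
```

With real embeddings, a vectorized NumPy version (normalize once, then one matrix-vector product) would likely be preferable for large corpora; the loop above only makes the arithmetic explicit.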