	---
	language: en
	library_name: bm25s
	tags:
	- bm25
	- bm25s
	- retrieval
	- search
	- lexical
	---

	# BM25S Index

	This is a BM25S index created with the [`bm25s` library](https://github.com/xhluca/bm25s) (version `0.2.0`), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.

	BM25S Related Links:

	* 🏠[Homepage](https://bm25s.github.io)
	* 💻[GitHub Repository](https://github.com/xhluca/bm25s)
	* 🤗[Blog Post](https://huggingface.co/blog/xhluca/bm25s)
	* 📝[Technical Report](https://arxiv.org/abs/2407.03618)


	## Installation

	You can install the `bm25s` library with `pip`:

	```bash
	pip install "bm25s==0.2.0"

	# For huggingface hub usage
	pip install huggingface_hub
	```

	## Loading a `bm25s` index

	You can use this index for information retrieval tasks. Here is an example:

	```python
	import bm25s
	from bm25s.hf import BM25HF

	# Load the index
	retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency")

	# You can retrieve now
	query = "a cat is a feline"
	results = retriever.retrieve(bm25s.tokenize(query), k=3)
	```

	## Saving a `bm25s` index

	You can save a `bm25s` index to the Hugging Face Hub. Here is an example:

	```python
	import bm25s
	from bm25s.hf import BM25HF

	corpus = [
	    "northwest bank",
	    "misfits market",
	    "merrick bank login",
	    "marketing",
	    "market place",
	    "jetblue customer service",
	    "internal revenue service",
	    "how to make money online",
	    "gordon food service",
	    "futures market",
	    "frontier airlines customer service",
	    "food banks near me",
	    "first convenience bank",
	    "eastern bank",
	    "dollar bank",
	]

	retriever = BM25HF(corpus=corpus)
	retriever.index(bm25s.tokenize(corpus))

	token = None # You can get a token from the Hugging Face website
	retriever.save_to_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", token=token)
	```

	## Advanced usage

	You can leverage more advanced features of the BM25S library during `load_from_hub`:

	```python
	from bm25s.hf import BM25HF

	# Load corpus and index in memory-map (mmap=True) to reduce memory
	retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", load_corpus=True, mmap=True)

	# Load a different branch/revision
	retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", revision="main")

	# Change directory where the local files should be downloaded
	retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", local_dir="/path/to/dir")

	# Load private repositories with a token:
	token = None  # Replace with a token from the Hugging Face website
	retriever = BM25HF.load_from_hub("dadashzadeh/2023_10_en_keywords_Cryptocurrency", token=token)
	```

	## Stats

	This index was created from the following data: 497 Cryptocurrency keywords (Semrush)

	| Statistic | Value |
	| --- | --- |
	| Number of documents | 602959 |
	| Number of tokens | 2414020 |
	| Average tokens per document | 4.0 |
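
As a quick sanity check, the average in the table can be recomputed from the other two rows. This is a minimal sketch; the variable names are illustrative and do not come from the `bm25s` library.

```python
# Values copied from the Stats table above
num_documents = 602959
num_tokens = 2414020

# Average document length in tokens; the table reports it rounded to one decimal
avg_tokens_per_document = num_tokens / num_documents

print(f"Average tokens per document: {avg_tokens_per_document:.1f}")
```

The short average length is consistent with a corpus made of keyword-style phrases rather than full sentences.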

	## Parameters

	The index was created with the following parameters:

	| Parameter | Value |
	| --- | --- |
	| k1 | `1.5` |
	| b | `0.75` |
	| delta | `0.5` |
	| method | `lucene` |
	| idf method | `lucene` |
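
To give a feel for what `k1` and `b` do, here is a minimal pure-Python sketch of Lucene-style BM25 scoring using the values from the table. It is not the `bm25s` implementation: the function names, the whitespace tokenization, and the toy corpus are assumptions made for illustration, and the formula is the commonly cited Lucene variant. As far as I understand, `delta` is only used by the BM25L and BM25+ variants, so it does not appear here.

```python
import math
from collections import Counter

K1 = 1.5   # term-frequency saturation, from the Parameters table
B = 0.75   # document-length normalization, from the Parameters table


def lucene_idf(n_docs, doc_freq):
    """Lucene-style IDF: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_lucene_scores(query_tokens, corpus_tokens, k1=K1, b=B):
    """Score every document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # Length normalization: longer-than-average documents are penalized
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in set(query_tokens):
            if tf[term] == 0:
                continue
            score += lucene_idf(n_docs, doc_freq[term]) * tf[term] / (tf[term] + norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, tokenized on whitespace (bm25s uses its own tokenizer)
toy_corpus = ["futures market", "market place", "eastern bank"]
corpus_tokens = [doc.split() for doc in toy_corpus]

scores = bm25_lucene_scores("futures market".split(), corpus_tokens)
best = max(range(len(scores)), key=scores.__getitem__)
print(toy_corpus[best], scores)
```

In this toy corpus, "futures market" ranks first because it matches both query terms, "market place" gets a smaller score from the more common term "market", and "eastern bank" scores zero.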

	## Citation

	To cite `bm25s`, please use the following bibtex:

	```bibtex
	@misc{lu_2024_bm25s,
	    title={BM25S: Orders of magnitude faster lexical search via eager sparse scoring},
	    author={Xing Han Lù},
	    year={2024},
	    eprint={2407.03618},
	    archivePrefix={arXiv},
	    primaryClass={cs.IR},
	    url={https://arxiv.org/abs/2407.03618},
	}
	```