---
license: cdla-permissive-2.0
---

# xlscout-techdata-embeddings

Welcome to **xlscout-techdata-embeddings**, a finetuned embedding model built by the team at [xlscout](https://xlscout.ai/).

## Model Description

The **xlscout-techdata-embeddings** model is a variant of the original BAAI/bge-small-en-v1.5 model, finetuned on a specialized dataset crafted by patent experts at xlscout. We extracted around 50,000 samples from multiple domains, resulting in a performance boost of approximately 40% over the original model on patent-related tasks such as retrieval and categorization.

### Finetuning Data

The finetuning dataset was curated and validated by our experts to ensure its quality and reliability in the patent domain. It spans multiple domains, giving the model a robust foundation for producing patent-related embeddings.

We are open-sourcing a lighter version of the model, which has been finetuned on 10% of the entire data. For access to the full version of the model, please connect with [xlscout](https://xlscout.ai/).

## Usage

Here's a quick guide to computing embeddings with the **xlscout-techdata-embeddings** model, for example as the first step of a retrieval task:

```python
import torch
from transformers import AutoTokenizer, AutoModel

sentences = ["This is an example sentence", "Each sentence is converted"]
model_id = "Khushwant78/xlscout-techdata-embedding"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

# Assumption: [CLS] pooling plus L2 normalization, the convention of the base BGE model.
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:, 0]
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
print(embeddings)
```

## Connect with xlscout

We encourage researchers and developers to explore and use the lighter version of our model. For collaborations, custom solutions, or access to the full model, please get in touch with us through our [website](https://xlscout.ai/).
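
## Retrieval Example

The Usage section mentions retrieval but stops at printing embeddings. As a minimal sketch of the step that would typically follow, the snippet below ranks documents by cosine similarity to a query. The function names `cosine_similarity` and `rank_documents` are illustrative and not part of any library, and the toy 3-dimensional vectors merely stand in for real model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_embedding, document_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [cosine_similarity(query_embedding, d) for d in document_embeddings]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # partially aligned with the query
]

ranking = rank_documents(query, documents)
print([index for index, _ in ranking])  # prints [1, 2, 0]
```

With real embeddings, each row of the embedding tensor could be converted with `.tolist()` and passed in the same way. If the embeddings are already L2-normalized, the cosine similarity reduces to a plain dot product.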
## Acknowledgements

A huge thank you to the original authors of the BAAI/bge-small-en-v1.5 model and to our dedicated team of patent experts at xlscout, who meticulously curated the finetuning dataset, ensuring a significant performance boost in the patent domain.

## Citation

```bibtex
@misc{bge_embedding,
  title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
  author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
  year={2023},
  eprint={2309.07597},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```