|
--- |
|
tags: |
|
- Pretrain_Model |
|
- transformers |
|
- TCM |
|
- herberta |
|
- text embedding
|
license: apache-2.0 |
|
inference: true |
|
language: |
|
- zh |
|
- en |
|
base_model: |
|
- hfl/chinese-roberta-wwm-ext |
|
library_name: transformers |
|
metrics: |
|
- accuracy |
|
new_version: XiaoEnn/herberta_seq_512_V2 |
|
--- |
|
|
|
### Introduction
|
Herberta is an experimental pre-trained model developed by the Angelpro Team, focused on pre-training for the herbal medicine (TCM) domain. Starting from the chinese-roberta-wwm-ext-large model, we continued pre-training with a masked language modeling (MLM) objective on a corpus of 675 ancient books and 32 Chinese medicine textbooks. We named the result Herberta, splicing together "herb" and "RoBERTa". We are committed to contributing to the TCM large-model ecosystem.
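For reference, here is a minimal sketch of what such continued MLM pre-training looks like with the standard `transformers` Trainer API. The corpus file `tcm_corpus.txt` and all hyperparameters are illustrative placeholders, not our actual training configuration.

```python
# Illustrative sketch of continued MLM pre-training (placeholder corpus and settings).
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

base = "hfl/chinese-roberta-wwm-ext"  # base checkpoint listed in the model card metadata
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical plain-text corpus (ancient books + TCM textbooks), one passage per line.
dataset = load_dataset("text", data_files={"train": "tcm_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# 15% dynamic masking, the standard RoBERTa-style MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="herberta-mlm", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```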
|
We hope it can be used as:

- An encoder / embedding model for herbal formulas

- A word-embedding model for Chinese medicine domain data

- A backbone for a wide range of downstream TCM tasks, e.g., classification and sequence labeling (see the sketch after this list)
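As a hedged illustration of the downstream usage above, a classification head can be attached through the standard `transformers` API. The task, label count, and example sentence below are placeholders only, and the classification head is randomly initialized until fine-tuned.

```python
# Hedged sketch: attaching a classification head to Herberta for a hypothetical
# TCM text-classification task. num_labels and the example sentence are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "XiaoEnn/herberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

inputs = tokenizer("柴胡疏肝散主治肝气郁结。", return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits

print("Predicted label id:", logits.argmax(dim=-1).item())
```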
|
|
|
|
|
### Requirements

`transformers` version: 4.45.1
|
```bash |
|
pip install herberta |
|
``` |
|
|
|
### Quickstart |
|
|
|
#### Using Hugging Face
|
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Replace "XiaoEnn/herberta" with the Hugging Face model repository name
model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```
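When embedding a batch of texts, mean pooling should respect the attention mask so that padding tokens do not dilute the sentence vectors. The following is a small sketch of this; the pooling code is illustrative and not part of the released package.

```python
# Illustrative batch embedding with attention-mask-aware mean pooling.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

texts = ["黄芪补气固表。", "当归补血活血，调经止痛。"]
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=128)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (batch, seq_len, hidden)

# Zero out padding positions before averaging so padding does not skew the mean.
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print("Embeddings shape:", embeddings.shape)            # (2, hidden_size)
```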
|
|
|
|
|
#### Local Model
|
```python
from herberta.embedding import TextToEmbedding

# Load the local model checkpoint
embedder = TextToEmbedding("path/to/your/model")

# Single text input
embedding = embedder.get_embeddings("This is a sample text.")

# Multiple text input
texts = ["This is a sample text.", "Another example."]
embeddings = embedder.get_embeddings(texts)
```
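The resulting embeddings can be compared directly, for example with cosine similarity. The sketch below goes through the Hugging Face checkpoint rather than the `herberta` package, and the two formula descriptions are illustrative only.

```python
# Illustrative similarity check between two formula descriptions (placeholder texts).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)  # sentence-level average pooling, as above

a = embed("四君子汤：人参、白术、茯苓、甘草。")
b = embed("六君子汤：人参、白术、茯苓、甘草、陈皮、半夏。")
print("Cosine similarity:", F.cosine_similarity(a, b, dim=0).item())
```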
|
|
|
## Citation |
|
|
|
If you find our work helpful, feel free to cite us.
|
|
|
```bibtex
@misc{herberta-embedding,
  title  = {Herberta: A Pretrain_Model for TCM_herb and downstream Tasks as Text Embedding Generation},
  url    = {https://github.com/15392778677/herberta},
  author = {Yehan Yang and Xinhan Zheng},
  month  = {December},
  year   = {2024}
}

@article{herberta-technical-report,
  title       = {Herberta: A Pretrain_Model for TCM_herb and downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angopro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
```