Herbert: Pretrained BERT Model for Herbal Medicine

Herbert is a pretrained language model for herbal medicine research, built on bert-base-chinese and fine-tuned on domain-specific text from 675 ancient books and 32 Traditional Chinese Medicine (TCM) textbooks. It is designed to support a variety of TCM-related NLP tasks.


Introduction

This model is optimized for TCM-related tasks, including but not limited to:

  • Herbal formula encoding
  • Domain-specific word embedding
  • Classification, labeling, and sequence prediction tasks in TCM research

Herbert combines general-purpose BERT pretraining with TCM domain text, which is intended to make its representations better suited to TCM-related text processing than those of a general-domain model.


Model Config

{
  "hidden_size": 1024,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "torch_dtype": "float32",
  "vocab_size": 21128
}
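
The fields above are enough to estimate the size of the encoder. The following is a minimal sketch, not an official tool: `bert_param_count` is a hypothetical helper, and it assumes two values that the excerpt does not list, namely an intermediate (feed-forward) size of 4 × `hidden_size` and a `type_vocab_size` of 2, which are the usual BERT defaults. It counts the embeddings, the encoder layers, and the pooler, with no task-specific head.

```python
def bert_param_count(hidden_size, num_hidden_layers, vocab_size,
                     max_position_embeddings=512, type_vocab_size=2,
                     intermediate_size=None):
    """Approximate parameter count of a BERT encoder plus pooler (no task head)."""
    h = hidden_size
    # Assumption: feed-forward width defaults to 4 * hidden_size when not given.
    ffn = intermediate_size if intermediate_size is not None else 4 * h
    layer_norm = 2 * h  # weight + bias
    # Word, position, and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm
    # Query, key, value, and output projections (weights + biases), then LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers (h -> ffn -> h) with biases, then LayerNorm.
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


config = {
    "hidden_size": 1024,
    "num_hidden_layers": 24,
    "vocab_size": 21128,
    "max_position_embeddings": 512,
}
total = bert_param_count(**config)
print(f"{total:,} parameters")
print(f"about {total * 4 / 1e9:.2f} GB of weights in float32")
```

`num_attention_heads` does not appear in the formula because splitting the hidden dimension across heads does not change the number of weights. As a sanity check of the formula, the dimensions of bert-base-chinese (hidden size 768, 12 layers, same vocabulary) give about 102M parameters.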

### Requirements

The model was saved with transformers 4.45.1 (`"transformers_version": "4.45.1"`). The examples below also require PyTorch.

### Quickstart

#### Use Hugging Face
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Hugging Face model repository name
model_name = "Chengfengke/herbert"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text ("TCM theory is a treasure of China's traditional culture.")
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Sentence-level average pooling; the attention mask keeps padding tokens out of the mean
mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
sentence_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```

#### Local Model

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load the model and tokenizer; model_name may also be a path to a local
# directory that contains the downloaded model files
model_name = "Chengfengke/herbert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("This is an example text for herbal medicine.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# outputs.logits holds the masked-language-model scores: (batch, sequence_length, vocab_size)
```
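
Sentence embeddings like the one produced by the average-pooling example are usually compared with cosine similarity. The sketch below is dependency-free and works on plain Python lists. The function name `cosine_similarity` and the toy 4-dimensional vectors are illustrative assumptions, not outputs of Herbert.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (illustrative values only).
emb_a = [0.2, 0.1, 0.0, 0.7]
emb_b = [0.1, 0.1, 0.1, 0.6]
emb_c = [-0.5, 0.3, 0.8, 0.0]

print("a vs b:", round(cosine_similarity(emb_a, emb_b), 3))
print("a vs c:", round(cosine_similarity(emb_a, emb_c), 3))
```

A real embedding can be passed in as `sentence_embedding[0].tolist()`. When working directly with tensors, `torch.nn.functional.cosine_similarity` computes the same quantity.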

Citation

If you find our work helpful, please cite it.

@misc{herbert-embedding,
  title = {Herbert: A Pretrain_Bert_Model for TCM_herb and downstream Tasks as Text Embedding Generation},
  author = {Yehan Yang and Xinhan Zheng},
  month = {December},
  year = {2024}
}

@article{herbert-technical-report,
  title={Herbert: A Pretrain_Bert_Model for TCM_herb and downstream Tasks as Text Embedding Generation},
  author={Yehan Yang and Xinhan Zheng},
  institution={Beijing Angopro Technology Co., Ltd.},
  year={2024},
  note={Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}