Email-tuned BGE-M3
This is a fine-tuned version of BAAI/bge-m3 optimized for email content retrieval. The model was trained on a mixed-language (English/Korean) email dataset to improve retrieval performance for various email-related queries.
Model Description
- Model Type: Embedding model (encoder-only)
- Base Model: BAAI/bge-m3
- Languages: English, Korean
- Domain: Email content, business communication
- Training Data: Mixed-language email dataset with various types of queries (metadata, long-form, short-form, yes/no questions)
Quickstart
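# Requires: pip install langchain-community sentence-transformers faiss-cpu
# (recent LangChain releases moved these integrations out of the `langchain` package)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document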
# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="doubleyyh/email-tuned-bge-m3",
    model_kwargs={'device': 'cuda'},  # use 'cpu' if no GPU is available
    encode_kwargs={'normalize_embeddings': True}  # L2-normalize the output vectors
)
# Example emails
emails = [
    {
        "subject": "회의 일정 변경 안내",  # "Notice of meeting schedule change"
        "from": [["김철수", "[email protected]"]],
        "to": [["이영희", "[email protected]"]],
        "cc": [["박지원", "[email protected]"]],
        "date": "2024-03-26T10:00:00",
        # "Hello, I would like to move tomorrow's project meeting to 2 PM."
        "text_body": "안녕하세요, 내일 예정된 프로젝트 미팅을 오후 2시로 변경하고자 합니다."
    },
    {
        "subject": "Project Timeline Update",
        "from": [["John Smith", "[email protected]"]],
        "to": [["Team", "[email protected]"]],
        "cc": [],
        "date": "2024-03-26T11:30:00",
        "text_body": "Hi team, I'm writing to update you on the Q2 project milestones."
    }
]
# Format emails into documents
docs = []
for email in emails:
    # Serialize every field as a "key: value" line so metadata is searchable too
    content = "\n".join([f"{k}: {v}" for k, v in email.items()])
    docs.append(Document(page_content=content))
# Create FAISS index
db = FAISS.from_documents(docs, embeddings)
# Query examples (supports both Korean and English)
queries = [
    "회의 시간이 언제로 변경되었나요?",  # "When was the meeting rescheduled to?"
    "When is the meeting rescheduled?",
    "프로젝트 일정",  # "project schedule"
    "Q2 milestones"
]
# Perform similarity search
for query in queries:
    print(f"\nQuery: {query}")
    results = db.similarity_search(query, k=1)
    print(f"Most relevant email:\n{results[0].page_content[:200]}...")
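
The Quickstart sets `normalize_embeddings` to `True`. One reason this matters: for unit-length vectors, squared Euclidean distance and cosine similarity are related by ||q - d||^2 = 2 - 2 * cos(q, d), so an index that ranks by L2 distance (assumed here to be what LangChain's FAISS wrapper does by default) returns the same order as a cosine ranking. The sketch below checks this with small made-up vectors standing in for real embeddings; the vectors and the names `email_a` and `email_b` are purely illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical toy vectors, not real model output
query = l2_normalize([0.2, 0.9, 0.1])
docs = {
    "email_a": l2_normalize([0.1, 0.8, 0.3]),
    "email_b": l2_normalize([0.9, 0.1, 0.2]),
}

# Rank once by cosine similarity (higher is better) and once by L2 (lower is better)
by_cosine = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
by_l2 = sorted(docs, key=lambda name: squared_l2(query, docs[name]))

for name, vec in docs.items():
    # Identity for unit vectors: ||q - d||^2 = 2 - 2 * cos(q, d)
    assert math.isclose(squared_l2(query, vec), 2 - 2 * dot(query, vec), abs_tol=1e-9)

print("Same ranking under both measures:", by_cosine == by_l2)
```

Without normalization the two rankings can disagree, because vector length then affects the L2 distance but not the cosine.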
Intended Use & Limitations
Intended Use
- Email content retrieval
- Similar document search in email corpora
- Question answering over email content
- Multi-language email search systems
Limitations
- Performance may vary for domains outside of email content
- Best suited for business communication context
- Both English and Korean are supported, but retrieval quality may differ between the two languages
Citation
@misc{email-tuned-bge-m3,
  author = {doubleyyh},
  title = {Email-tuned BGE-M3: Fine-tuned Embedding Model for Email Content},
  year = {2024},
  publisher = {HuggingFace}
}
License
This model follows the same license as the base model (bge-m3).
Contact
For questions or feedback, please open an issue in the GitHub repository or get in touch through Hugging Face.
Evaluation results
- MRR@10 (self-reported): 0.850
- NDCG@10 (self-reported): 0.820
- Recall@10 (self-reported): 0.880
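
For readers unfamiliar with these metrics, the following is a minimal sketch of how MRR@10, NDCG@10 and Recall@10 are commonly computed. It assumes binary relevance labels and ranked lists without duplicate ids. The model card does not describe the evaluation set or script, so this is not necessarily how the reported numbers were produced; the query ids, email ids and function names are hypothetical.

```python
import math


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant result within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant items that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical toy run: query id -> (ranked result ids, set of relevant ids)
runs = {
    "q1": (["e3", "e1", "e7"], {"e1"}),
    "q2": (["e2", "e5", "e9"], {"e2", "e9"}),
}

mrr = mean(reciprocal_rank_at_k(r, rel) for r, rel in runs.values())
ndcg = mean(ndcg_at_k(r, rel) for r, rel in runs.values())
recall = mean(recall_at_k(r, rel) for r, rel in runs.values())

print(f"MRR@10={mrr:.3f}  NDCG@10={ndcg:.3f}  Recall@10={recall:.3f}")
```

Each metric is averaged over queries, so a score such as MRR@10 = 0.850 summarizes how high the first relevant email tends to rank across the whole evaluation set.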