metadata
license: llama3.2
language:
  - vi
base_model:
  - meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: sentence-similarity
library_name: transformers

misa-ai/Llama-3.2-1B-Instruct-Embedding-Base

This is an embedding model for document retrieval. It maps sentences and paragraphs to a 2048-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

We train the model on a merged dataset that spans multiple domains, about 900k Vietnamese triplets in total.

We use Llama-3.2-1B-Instruct as the pre-trained backbone.

This model is intended for document retrieval.

Details:

  • Maximum supported context length: 4096 tokens
  • Pooling: last token (set padding_side = "left" on the tokenizer)
  • Language: Vietnamese
  • Prompts:
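    • Query: "Cho một câu truy vấn tìm kiếm thông tin, hãy truy xuất các tài liệu có liên quan trả lời cho truy vấn đó." (English: "Given a search query, retrieve relevant documents that answer the query.")
    • Document: "" (empty, no prompt)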
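
The last-token pooling listed above can be sketched as follows. This is a minimal illustration on NumPy arrays, not the authors' code: the function name `last_token_pool` and the toy values are assumptions made for the example, and in practice the same indexing would be applied to the model's final hidden states. It also shows why left padding is recommended: when padding is on the left, the final position of every sequence holds its last real token.

```python
import numpy as np


def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Return one vector per sequence: the hidden state of its last real token.

    hidden_states: (batch, seq_len, dim) array of per-token outputs.
    attention_mask: (batch, seq_len) array, 1 for real tokens, 0 for padding.
    """
    # With left padding, every sequence ends at the final position.
    if attention_mask[:, -1].all():
        return hidden_states[:, -1]
    # Fallback for right padding: index the last position where the mask is 1.
    last_index = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_index]


# Toy batch (hypothetical values): 2 sequences, 3 positions, 2-dimensional states.
hidden = np.arange(12, dtype=float).reshape(2, 3, 2)
left_padded_mask = np.array([[0, 1, 1], [1, 1, 1]])
pooled = last_token_pool(hidden, left_padded_mask)
print(pooled.shape)
```

With left padding the function reduces to taking the final position, which avoids per-sequence index bookkeeping.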

Please cite our manuscript if you use this model in your work.