indo-dpr-question_encoder-multiset-base

Indonesian Dense Passage Retrieval trained on translated SQuADv2.0 and Natural Question dataset in DPR format.

Evaluation

Class Precision Recall F1-Score Support
hard_negative 0.9961 0.9961 0.9961 384778
positive 0.8783 0.8783 0.8783 12414
Metric Value
Loss 0.0220
Accuracy 0.9924
Macro Average 0.9372
Weighted Average 0.9924
Accuracy and F1 0.9353
Average Rank 0.2194

Note: This report is for evaluation on the dev set, after 27288 batches.

Usage

from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

tokenizer = DPRContextEncoderTokenizer.from_pretrained('firqaaa/indo-dpr-ctx_encoder-multiset-base')
model = DPRContextEncoder.from_pretrained('firqaaa/indo-dpr-ctx_encoder-multiset-base')
input_ids = tokenizer("Siapa nama tokoh utama dalam serial SLAMDunk?", return_tensors='pt')["input_ids"]
embeddings = model(input_ids).pooler_output

You can use it using haystack as follows:

from haystack.nodes import DensePassageRetriever
from haystack.document_stores import InMemoryDocumentStore

retriever = DensePassageRetriever(document_store=InMemoryDocumentStore(),
                                  query_embedding_model="firqaaa/indo-dpr-ctx_encoder-multiset-base",
                                  passage_embedding_model="firqaaa/indo-dpr-ctx_encoder-multiset-base",
                                  max_seq_len_query=64,
                                  max_seq_len_passage=256,
                                  batch_size=16,
                                  use_gpu=True,
                                  embed_title=True,
                                  use_fast_tokenizers=True)
Downloads last month
35
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Datasets used to train firqaaa/indo-dpr-ctx_encoder-multiset-base