---
license: mit
language:
- en
tags:
- text-embeddings
- telecom
- domain-adaptation
- triplet-loss
- transformer
- semantic-search
- sentence-transformers
- domain-specific
- contrastive-learning
- simcse
- bio-bert
- dont-stop-pretraining
metrics:
- name: Telecom Triplet Score
  type: accuracy
  value: 0.9380
  verified: false
- name: Average MTEB Score
  type: accuracy
  value: 0.825
  verified: false
- name: Average STS Score
  type: spearman
  value: 82.19
  verified: false
- name: AllNLI Triplet Score
  type: accuracy
  value: 0.6150
  verified: false
base_model:
- Alibaba-NLP/gte-Qwen2-1.5B-instruct
model-index:
- name: T-VEC
  results:
  - task:
      type: text-embedding
      name: Telecom Triplet Benchmark
    dataset:
      type: custom
      name: Telecom Triplet Benchmark
    metrics:
    - name: Telecom Triplet Score
      type: accuracy
      value: 0.9380
      verified: false
  - task:
      type: text-embedding
      name: MTEB Benchmark
    dataset:
      type: custom
      name: MTEB Benchmark
    metrics:
    - name: Average MTEB Score
      type: accuracy
      value: 0.825
      verified: false
  - task:
      type: text-embedding
      name: STS Benchmark
    dataset:
      type: custom
      name: STS Benchmark
    metrics:
    - name: Average STS Score
      type: spearman
      value: 82.19
      verified: false
  - task:
      type: text-embedding
      name: AllNLI Triplet
    dataset:
      type: sentence-transformers/all-nli
      name: AllNLI Triplet
    metrics:
    - name: AllNLI Triplet Score
      type: accuracy
      value: 0.6150
      verified: false
extra_gated_prompt: "Please answer the questions below to gain access to the model"
extra_gated_fields:
  Company: text
  Full Name: text
  Email: text
  I want to use this model for:
    type: select
    options:
    - Research
    - Education
    - Commercial
    - label: Other
      value: other
---
|
|
|
|
|
# T-VEC: A Telecom-Specific Text Embedding Model |
|
|
|
## Overview |
|
|
|
**T-VEC (Telecom Vectorization Model)** is a domain-adapted text embedding model developed by NetoAI and fine-tuned from [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct). Using a deeply supervised triplet-loss approach, T-VEC learns rich semantic representations tailored to telecom use cases, achieving strong results on telecom-specific benchmarks while remaining competitive on standard ones.
|
|
|
## Model Details |
|
|
|
- **Model Name**: T-VEC |
|
- **Developer**: [NetoAI](https://www.netoai.ai) |
|
- **Base Model**: Alibaba-NLP/gte-Qwen2-1.5B-instruct |
|
- **Parameters**: 1.5 Billion |
|
- **Embedding Dimension**: 1536 (see the quick check after this list)
|
- **Max Input Tokens**: 32,000 |
|
- **Languages**: Multilingual (optimized for English) |
|
- **License**: MIT |
|
- **Tokenizer**: Custom telecom-specific tokenizer (open-source) |
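
As a quick check of the figures above, the embedding dimension can be read from the model config. A minimal sketch, assuming the `netoai/t-vec` repository id used in the Usage section below:

```python
from transformers import AutoConfig

# The hidden size should match the advertised 1536-dimensional embeddings.
config = AutoConfig.from_pretrained("netoai/t-vec")
print(config.hidden_size)  # expected: 1536
```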
|
|
|
## Intended Uses |
|
|
|
- Semantic search over telecom documents (3GPP standards, vendor manuals); see the sketch after this list
|
- Fault log analysis for root-cause detection |
|
- Telecom-specific chatbots and Q&A systems |
|
- Regulatory compliance analysis and semantic auditing |
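
A minimal semantic-search sketch, assuming the `netoai/t-vec` repository id from the Usage section below and mask-aware mean pooling (the `embed` helper is illustrative, not part of a released API):

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")

def embed(texts):
    # Mask-aware mean pooling over the last hidden state (illustrative choice).
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

docs = [
    "3GPP TS 23.501 defines the 5G system architecture.",
    "X2 handover procedure between neighbouring eNodeBs.",
    "The AMF handles registration and mobility management.",
]
query = "Which document describes the 5G core architecture?"

doc_emb = torch.nn.functional.normalize(embed(docs), dim=1)
q_emb = torch.nn.functional.normalize(embed([query]), dim=1)
scores = (q_emb @ doc_emb.T).squeeze(0)  # cosine similarities after normalization
print(docs[scores.argmax().item()])
```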
|
|
|
## Training Details |
|
|
|
- **Objective**: Triplet loss using cosine similarity (sketched after this list)
|
- **Dataset**: 100k+ telecom triplets curated by domain experts over 3 months |
|
- **Layer Modification**: 338 transformer layers fine-tuned |
|
- **Avg. L2 Norm Weight Change**: 0.7735 |
|
- **Enhancements**: Telecom-specific tokenizer and query-aware anchor strategies |
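
As referenced above, the objective can be sketched as a cosine-distance triplet loss: each anchor is pulled toward its positive and pushed away from its negative by at least a margin. A minimal illustration (the margin value here is an assumption, not the one used in training):

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.5):
    # Cosine distance = 1 - cosine similarity; penalize triplets where the
    # negative is not at least `margin` farther from the anchor than the positive.
    d_pos = 1 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1 - F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy batch: 4 triplets of 1536-dimensional embeddings.
a, p, n = (torch.randn(4, 1536) for _ in range(3))
print(cosine_triplet_loss(a, p, n))
```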
|
|
|
## Evaluation Results |
|
|
|
| Benchmark | Metric | Score | |
|
|-----------------------------|----------------------|--------| |
|
| Telecom Triplet Benchmark | Accuracy | 0.9380 | |
|
| MTEB Benchmark | Accuracy | 0.825 | |
|
| STS Benchmark | Spearman Correlation | 82.19 | |
|
| AllNLI Triplet | Accuracy | 0.6150 | |
|
|
|
T-VEC substantially outperforms its base model and other strong general-purpose models on telecom-specific benchmarks while retaining competitive general performance, as the comparison below on standard retrieval, reranking, and STS tasks shows.
|
|
|
|
|
| Model | ArguAna | SciDocsRR | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | |
|
|--------------------------------|---------|--------------|-------------|------------|------------|------------|------------|--------------| |
|
| gte‑Qwen2‑1.5B‑instruct | 0.62335 | 0.81558 | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379 | |
|
| T‑VEC | 0.61150 | 0.83970 | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050 | |
|
| all‑MiniLM‑L6‑v2 | 0.50167 | 0.87119 | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032 | |
|
| all‑mpnet‑base‑v2 | 0.46521 | 0.88654 | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422 | |
|
| bge‑base‑en‑v1.5 | 0.63616 | 0.87494 | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418 | |
|
| e5‑base‑v2 | 0.51604 | 0.82834 | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480 | |
|
| jina‑embeddings‑v2‑base‑en | 0.44152 | 0.83106 | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842 | |
|
| instructor‑xl | 0.54884 | 0.79538 | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048 | |
|
| gte‑base | 0.57151 | 0.87083 | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738 | |
|
| multilingual‑e5‑base | 0.47829 | 0.80392 | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201 | |
|
|
|
|
|
 |
|
|
|
|
|
|
|
## Limitations |
|
|
|
- Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization |
|
- Large size may impact deployment on edge devices |
|
- May miss recent telecom developments outside the training set |
|
|
|
## Ethical Considerations |
|
|
|
- Use in critical telecom systems should be validated by domain experts |
|
- May reflect terminology biases from dominant vendors in the dataset |
|
- Open licensing (MIT) supports transparency and community contributions |
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install transformers torch
|
``` |
|
|
|
### Load and Run |
|
|
|
```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")

texts = ["5G NR architecture", "LTE handover", "Core network functions"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=32000)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 1536)

# Mask-aware mean pooling: exclude padding tokens from the average.
mask = inputs["attention_mask"].unsqueeze(-1)
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity of the first text against the other two.
cos_sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:], dim=1)
print(cos_sim)
```
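
If the checkpoint is also published in `sentence-transformers` format, the same embeddings can be produced more compactly. A sketch under that assumption (`pip install sentence-transformers`):

```python
from sentence_transformers import SentenceTransformer

# Assumes the repo ships a sentence-transformers config; pooling then follows
# the shipped configuration rather than the manual mean pooling above.
model = SentenceTransformer("netoai/t-vec")
emb = model.encode(["5G NR architecture", "LTE handover"], normalize_embeddings=True)
print(emb.shape)  # (2, 1536)
```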
|
|
|
## Citation |
|
|
|
```bibtex |
|
@article{ethiraj2025tvec, |
|
title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning}, |
|
author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya}, |
|
journal={arXiv preprint}, |
|
year={2025}, |
|
url={https://arxiv.org/abs/2504.16460} |
|
} |
|
``` |
|
|
|
## References |
|
- Ethiraj, V., Menon, S., Vijay, D. “T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning.” arXiv:2504.16460, 2025.
|
- Schroff, F., Kalenichenko, D., Philbin, J. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR, 2015. |
|
- Hermans, A., Beyer, L., Leibe, B. “In Defense of the Triplet Loss for Person Re-Identification.” arXiv:1703.07737, 2017. |
|
- Reimers, N., Gurevych, I. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP, 2019. |
|
- Gao, T., Yao, X., Chen, D. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv:2104.08821, 2021. |
|
- Gururangan, S., et al. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks.” ACL, 2020. |
|
- Lee, J., Yoon, W., et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics, 2020. |
|
- Sahu, S. K., Maheshwari, A. “Automatic extraction of telecom network events from log messages.” IEEE ICC, 2018. |
|
- Wang, X., Li, Y., Han, J. “Log2Vec: A Deep Embedding Model for Network Log Analysis.” IEEE/IFIP DSN, 2021. |
|
|
|
|
|
## Contact |
|
- For questions or contributions, visit https://www.netoai.ai. |
|
--- |
|
|