---
license: mit
language:
- en
tags:
- text-embeddings
- telecom
- domain-adaptation
- triplet-loss
- transformer
- semantic-search
- sentence-transformers
- domain-specific
- contrastive-learning
- simcse
- bio-bert
- dont-stop-pretraining
metrics:
  - name: Telecom Triplet Score
    type: accuracy
    value: 0.9380
    verified: false
  - name: Average MTEB Score
    type: accuracy
    value: 0.825
    verified: false
  - name: Average STS Score
    type: spearman
    value: 82.19
    verified: false
  - name: AllNLI Triplet Score
    type: accuracy
    value: 0.6150
    verified: false
base_model:
- Alibaba-NLP/gte-Qwen2-1.5B-instruct
model-index:
  - name: T-VEC
    results:
      - task:
          type: text-embedding
          name: Telecom Triplet Benchmark
        dataset:
          type: custom
          name: Telecom Triplet Benchmark
        metrics:
          - name: Telecom Triplet Score
            type: accuracy
            value: 0.9380
            verified: false
      - task:
          type: text-embedding
          name: MTEB Benchmark
        dataset:
          type: custom
          name: MTEB Benchmark
        metrics:
          - name: Average MTEB Score
            type: accuracy
            value: 0.825
            verified: false
      - task:
          type: text-embedding
          name: STS Benchmark
        dataset:
          type: mteb/stsbenchmark-sts
          name: STS Benchmark
        metrics:
          - name: Average STS Score
            type: spearman
            value: 82.19
            verified: false
      - task:
          type: text-embedding
          name: AllNLI Triplet
        dataset:
          type: sentence-transformers/all-nli
          name: AllNLI Triplet
        metrics:
          - name: AllNLI Triplet Score
            type: accuracy
            value: 0.6150
            verified: false
extra_gated_prompt: "Please answer the questions below to gain access to the model."
extra_gated_fields:
  Company: text
  Full Name: text
  Email: text
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - Commercial
      - label: Other
        value: other
---
# T-VEC: A Telecom-Specific Text Embedding Model
## Overview
**T-VEC (Telecom Vectorization Model)** is a domain-adapted text embedding model developed by NetoAI and fine-tuned from [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct). Using deep triplet-loss fine-tuning, T-VEC learns rich semantic representations tailored to telecom use cases, achieving state-of-the-art results on a custom telecom triplet benchmark while remaining competitive on standard benchmarks such as MTEB and STS.
## Model Details
- **Model Name**: T-VEC
- **Developer**: [NetoAI](https://www.netoai.ai)
- **Base Model**: Alibaba-NLP/gte-Qwen2-1.5B-instruct
- **Parameters**: 1.5 Billion
- **Embedding Dimension**: 1536
- **Max Input Tokens**: 32,000
- **Languages**: Multilingual (optimized for English)
- **License**: MIT
- **Tokenizer**: Custom telecom-specific tokenizer (open-source)
## Intended Uses
- Semantic search over telecom documents (3GPP standards, vendor manuals)
- Fault log analysis for root-cause detection
- Telecom-specific chatbots and Q&A systems
- Regulatory compliance analysis and semantic auditing
## Training Details
- **Objective**: Triplet loss using cosine similarity (a minimal sketch follows this list)
- **Dataset**: 100k+ telecom triplets curated by domain experts over 3 months
- **Layer Modification**: 338 transformer layers fine-tuned
- **Avg. L2 Norm Weight Change**: 0.7735
- **Enhancements**: Telecom-specific tokenizer and query-aware anchor strategies
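The exact training pipeline is described in the accompanying paper; below is a minimal, illustrative sketch of a cosine-similarity triplet loss of the kind named above. The margin value and batch shapes are assumptions for illustration, not the actual training hyperparameters.
```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.3):
    """Push sim(anchor, positive) above sim(anchor, negative) by at least `margin`."""
    sim_pos = F.cosine_similarity(anchor, positive, dim=1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=1)
    return F.relu(margin - sim_pos + sim_neg).mean()

# Toy batch of random embeddings; 1536 matches T-VEC's embedding dimension.
a, p, n = (torch.randn(8, 1536) for _ in range(3))
print(cosine_triplet_loss(a, p, n))
```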
## Evaluation Results
| Benchmark | Metric | Score |
|-----------------------------|----------------------|--------|
| Telecom Triplet Benchmark | Accuracy | 0.9380 |
| MTEB Benchmark | Accuracy | 0.825 |
| STS Benchmark                | Spearman Correlation (×100) | 82.19 |
| AllNLI Triplet | Accuracy | 0.6150 |
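Here, triplet accuracy is the fraction of evaluation triplets for which the anchor is more similar to the positive than to the negative. A minimal sketch of the metric (the helper below is ours for illustration, not taken from the benchmark code):
```python
import torch.nn.functional as F

def triplet_accuracy(anchor, positive, negative):
    """Fraction of triplets with sim(anchor, positive) > sim(anchor, negative)."""
    sim_pos = F.cosine_similarity(anchor, positive, dim=1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=1)
    return (sim_pos > sim_neg).float().mean().item()
```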
T-VEC significantly outperforms both its base model and other strong general-purpose models on telecom-specific benchmarks while retaining competitive general performance. The table below compares T-VEC with its base model and popular general-purpose embedding models on individual MTEB tasks:
| Model                      | ArguAna | SciDocsRR | STS12   | STS13   | STS14   | STS15   | STS16   | STSBenchmark |
|----------------------------|---------|-----------|---------|---------|---------|---------|---------|--------------|
| gte-Qwen2-1.5B-instruct    | 0.62335 | 0.81558   | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379      |
| T-VEC                      | 0.61150 | 0.83970   | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050      |
| all-MiniLM-L6-v2           | 0.50167 | 0.87119   | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032      |
| all-mpnet-base-v2          | 0.46521 | 0.88654   | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422      |
| bge-base-en-v1.5           | 0.63616 | 0.87494   | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418      |
| e5-base-v2                 | 0.51604 | 0.82834   | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480      |
| jina-embeddings-v2-base-en | 0.44152 | 0.83106   | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842      |
| instructor-xl              | 0.54884 | 0.79538   | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048      |
| gte-base                   | 0.57151 | 0.87083   | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738      |
| multilingual-e5-base       | 0.47829 | 0.80392   | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201      |
![Benchmark comparison chart](https://cdn-uploads.huggingface.co/production/uploads/66fa4fb0ec6983f03c2b1ca2/oIX2bc76Er4TDd5eZCb_C.png)
## Limitations
- Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization
- Large size may impact deployment on edge devices
- May miss recent telecom developments outside the training set
## Ethical Considerations
- Use in critical telecom systems should be validated by domain experts
- May reflect terminology biases from dominant vendors in the dataset
- Open licensing (MIT) supports transparency and community contributions
## Usage
### Installation
```bash
pip install transformers torch
```
### Load and Run
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

# trust_remote_code: the gte-Qwen2 base model ships custom modeling code.
model = AutoModel.from_pretrained("netoai/t-vec", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec", trust_remote_code=True)

texts = ["5G NR architecture", "LTE handover", "Core network functions"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=32000)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 1536)

# Mask-aware mean pooling so padding tokens do not dilute the embeddings.
mask = inputs["attention_mask"].unsqueeze(-1).float()
emb = F.normalize((hidden * mask).sum(dim=1) / mask.sum(dim=1), p=2, dim=1)
cos_sim = F.cosine_similarity(emb[0:1], emb[1:], dim=1)
print(cos_sim)
```
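### Semantic Search
Since the card's tags include `sentence-transformers`, the checkpoint should also load through that library (`pip install sentence-transformers`). The following is a sketch of the semantic-search use case under that assumption, reusing the repo id from the example above; the corpus snippets are illustrative:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("netoai/t-vec", trust_remote_code=True)

corpus = [
    "3GPP TS 38.300 describes the overall NR and NG-RAN architecture.",
    "X2 handover moves a UE between eNodeBs without involving the core.",
    "The AMF handles registration and mobility management in the 5G core.",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode("Which spec covers the 5G RAN architecture?",
                         convert_to_tensor=True, normalize_embeddings=True)

# Top-2 nearest corpus entries by cosine similarity.
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```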
## Citation
```bibtex
@article{ethiraj2025tvec,
title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning},
author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya},
journal={arXiv preprint},
year={2025},
url={https://arxiv.org/abs/2504.16460}
}
```
## References
- Ethiraj, V., Menon, S., Vijay, D. "T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning." arXiv:2504.16460, 2025.
- Schroff, F., Kalenichenko, D., Philbin, J. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR, 2015.
- Hermans, A., Beyer, L., Leibe, B. “In Defense of the Triplet Loss for Person Re-Identification.” arXiv:1703.07737, 2017.
- Reimers, N., Gurevych, I. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP, 2019.
- Gao, T., Yao, X., Chen, D. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv:2104.08821, 2021.
- Gururangan, S., et al. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks.” ACL, 2020.
- Lee, J., Yoon, W., et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics, 2020.
- Sahu, S. K., Maheshwari, A. “Automatic extraction of telecom network events from log messages.” IEEE ICC, 2018.
- Wang, X., Li, Y., Han, J. “Log2Vec: A Deep Embedding Model for Network Log Analysis.” IEEE/IFIP DSN, 2021.
## Contact
- For questions or contributions, visit https://www.netoai.ai.
---