# segment-text/globe.py
from huggingface_hub import hf_hub_download
import json
title = """# 🙋🏻‍♂️Welcome to Tonic's PLeIAs/✂️📜Segment-Text
---
"""
description = """
Segmentext is a specialized language model for text segmentation. It has been trained to be resilient to broken and unstructured texts, including digitization artifacts and poorly recognized layouts.
In contrast with most text-segmentation approaches, Segmentext is based on token classification: editorial structures are reconstructed from the raw text alone, without any reference to the original layout.
Segmentext was trained using HPC resources from GENCI–IDRIS on Ad Astra with 3,500 examples of manually annotated texts, mostly drawn from three large-scale datasets collected by PleIAs: Finance Commons (financial documents in open data), Common Corpus (cultural heritage texts), and the Science Pile (scientific publications under open licenses, to be released).
Given the diversity of the training data, Segmentext should work correctly on diverse document formats in the main European languages.
Segmentext can be tested in PleIAs-Bad-Data-Editor, a free demo, alongside OCRonos, another model trained by PleIAs to correct OCR errors and other digitization artifacts.
"""
joinus = """
## Join us:
🌟TeamTonic🌟 is always making cool demos! Join our active builders' 🛠️community 👻 [![Join us on Discord](https://img.shields.io/discord/1109943800132010065?label=Discord&logo=discord&style=flat-square)](https://discord.gg/qdfnvSPcqP) On 🤗Huggingface: [MultiTransformer](https://huggingface.co/MultiTransformer) On 🌐Github: [Tonic-AI](https://github.com/tonic-ai) & contribute to 🌟 [Build Tonic](https://git.tonic-ai.com/contribute) 🤗 Big thanks to Yuvi Sharma and all the folks at huggingface for the community grant 🤗
"""
model_name = "PleIAs/Segmentext"
config_path = hf_hub_download(repo_id=model_name, filename="config.json")
tokenizer_config_path = hf_hub_download(repo_id=model_name, filename="tokenizer_config.json")
# Load config.json
with open(config_path, 'r') as f:
config = json.load(f)
# Load tokenizer_config.json
with open(tokenizer_config_path, 'r') as f:
tokenizer_config = json.load(f)
id2label = config['id2label']
modelinfor1 = f"""
## Model Information
**Model Name**: `{config['_name_or_path']}`
**Model Architecture**: `{config['architectures'][0]}`
**Model Type**: `{config['model_type']}`
Segmentext is a token classification model trained for segmenting unstructured or noisy text into meaningful categories. The model uses `{config['model_type']}` as its underlying architecture, optimized for token classification tasks with the following characteristics:
- **Hidden Size**: `{config['hidden_size']}`
- **Intermediate Size**: `{config['intermediate_size']}`
- **Number of Hidden Layers**: `{config['num_hidden_layers']}`
- **Number of Attention Heads**: `{config['num_attention_heads']}`
- **Dropout (Attention)**: `{config['attention_probs_dropout_prob']}`
- **Dropout (Hidden)**: `{config['hidden_dropout_prob']}`
- **Hidden Activation Function**: `{config['hidden_act']}`
- **Layer Normalization Epsilon**: `{config['layer_norm_eps']}`
- **Maximum Position Embeddings**: `{config['max_position_embeddings']}`
- **Position Attention Type**: `{', '.join(config['pos_att_type'])}`
- **Relative Attention Mechanism**: `{config['relative_attention']}`
- **Position Buckets**: `{config['position_buckets']}`
- **Vocabulary Size**: `{config['vocab_size']}`
- **Initializer Range**: `{config['initializer_range']}`
- **Torch Dtype**: `{config['torch_dtype']}`
"""
modelinfor2 = f"""### Label Categories:
The model classifies tokens into 14 distinct categories:
1. **{id2label['0']}** (0)
2. **{id2label['1']}** (1)
3. **{id2label['2']}** (2)
4. **{id2label['3']}** (3)
5. **{id2label['4']}** (4)
6. **{id2label['5']}** (5)
7. **{id2label['6']}** (6)
8. **{id2label['7']}** (7)
9. **{id2label['8']}** (8)
10. **{id2label['9']}** (9)
11. **{id2label['10']}** (10)
12. **{id2label['11']}** (11)
13. **{id2label['12']}** (12)
14. **{id2label['13']}** (13)
### Tokenizer:
- **Tokenizer Class**: `{tokenizer_config['tokenizer_class']}`
- **Model Max Length**: No specific limit (`{tokenizer_config['model_max_length']}`)
- **Vocabulary Type**: `{tokenizer_config['vocab_type']}`
- **Special Tokens**:
- `[PAD]`, `[CLS]`, `[SEP]`, `[UNK]`, `[MASK]`
- **Case Sensitivity**: Not lower-cased (`{tokenizer_config['do_lower_case']}`)
- **Subword Tokenization**: SentencePiece-style (word-initial tokens are prefixed with "▁")
This model was trained using diverse datasets, including cultural heritage texts and modern digitized documents. It excels in identifying editorial structures in noisy or unstructured text, making it robust for tasks involving broken text or OCR artifacts.
Segmentext is ideal for text segmentation across a variety of document types, including financial reports, scientific papers, and historical manuscripts.
"""
placeholder = """PLEIAS :
WE BUILD ENERGY-EFFICIENT LLMS FOR INFORMATION-INTENSIVE AND HIGHLY REGULATED INDUSTRIES
WE CARE ABOUT DATA
(01)
We develop unique multilingual synthetic data capacity
Through novel approaches of LLM-driven rephrasing, refining and redocumentarising of the original content, massive high-quality synthetic datasets will be established and routinely expanded for customer use cases.
(02)
We build and open corpus mining pipelines
Numerous untapped training data sources exist beyond the typical web archives and copyrighted material. We develop innovative pipelines for corpus preparation, along with models capable of recognizing various layouts, allowing for the integration of overlooked open data, open science and cultural heritage resources, particularly those in PDF format.
(03)
We integrate and support semantic data
We build an extensive collection of semantic web data for pretraining and alignment, covering a large diversity of standards matching use cases: XML, XBRL, RDF.
Position Overview:
As a Proof of Concept Engineer / Data Demo Engineer, you will play a critical role in building and presenting cutting-edge data-driven demos and prototypes to our clients. Your primary responsibility will be to translate complex data use cases into engaging, real-world proof of concept (PoC) demos, leveraging Pleias' LLM capabilities. You will work closely with cross-functional teams, including product development, customer success, and sales, to create tailored solutions that meet client needs, particularly in information-rich and regulated industries.
"""