add description
globe.py
CHANGED
@@ -5,19 +5,6 @@ title = """# 🙋🏻♂️Welcome to Tonic's PLeIAs/✂️📜Segment-Text
 ---
 """
 
-
-description = """
-✂️📜Segment-Text is a specialized language model for text-segmentation. ✂️📜Segment-Text has been trained to be resilient to broken and unstructured texts including digitzation artifacts and ill-recognized layout formats.
-
-In contrast with most text-segmentation approach, ✂️📜Segment-Text is based on token classification. Editorial structure are reconstructed by the raw text without any reference to the original layout.
-
-✂️📜Segment-Text was trained using HPC resources from GENCI–IDRIS on Ad Astra with 3,500 example of manually annotated texts, mostly coming from three large scale dataset collected by PleIAs, Finance Commons (financial documents in open data), Common Corpus (cultural heritage texts) and the Science Pile (scientific publication in open licenses - to be released).
-
-Given the diversity of the training data, ✂️📜Segment-Text should work correctly on diverse document formats in the main European languages.
-
-✂️📜Segment-Text can be tested on PleIAs-Bad-Data-Editor, a free demo along with OCRonos, another model trained by PleIAs for the correction of OCR errors and other digitization artifact.
-"""
-
 joinus = """
 ## Join us :
 
@@ -39,14 +26,25 @@ with open(tokenizer_config_path, 'r') as f:
 
 id2label = config['id2label']
 
-
+description = f"""
 ## Model Information
 
 **Model Name**: `{config['_name_or_path']}`
 **Model Architecture**: `{config['architectures'][0]}`
 **Model Type**: `{config['model_type']}`
 
-✂️📜Segment-Text is a
+✂️📜Segment-Text is a specialized language model for text-segmentation. ✂️📜Segment-Text has been trained to be resilient to broken and unstructured texts including digitzation artifacts and ill-recognized layout formats.
+
+In contrast with most text-segmentation approach, ✂️📜Segment-Text is based on token classification. Editorial structure are reconstructed by the raw text without any reference to the original layout.
+
+✂️📜Segment-Text was trained using HPC resources from GENCI–IDRIS on Ad Astra with 3,500 example of manually annotated texts, mostly coming from three large scale dataset collected by PleIAs, Finance Commons (financial documents in open data), Common Corpus (cultural heritage texts) and the Science Pile (scientific publication in open licenses - to be released).
+
+Given the diversity of the training data, ✂️📜Segment-Text should work correctly on diverse document formats in the main European languages.
+
+✂️📜Segment-Text can be tested on PleIAs-Bad-Data-Editor, a free demo along with OCRonos, another model trained by PleIAs for the correction of OCR errors and other digitization artifact.
+"""
+
+modelinfor1 = f"""✂️📜Segment-Text is a token classification model trained for segmenting unstructured or noisy text into meaningful categories. The model uses `{config['model_type']}` as its underlying architecture, optimized for token classification tasks with the following characteristics:
 
 - **Hidden Size**: `{config['hidden_size']}`
 - **Intermediate Size**: `{config['intermediate_size']}`
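Note on the second hunk: `description` now sits below the line that loads `config`, and it becomes an f-string. The ordering matters because an f-string is evaluated at the moment the assignment runs, so every name it interpolates must already exist. The sketch below illustrates that pattern in isolation. It is not the code from `globe.py`: the `SAMPLE_CONFIG` values, the `build_description` helper, the temporary file, and the use of `json.load` are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical values for illustration; the real ones come from the model's config file.
SAMPLE_CONFIG = {
    "_name_or_path": "example-org/example-model",
    "architectures": ["ExampleForTokenClassification"],
    "model_type": "example",
}


def build_description(config: dict) -> str:
    """Render the 'Model Information' header from an already-loaded config."""
    return f"""
## Model Information

**Model Name**: `{config['_name_or_path']}`
**Model Architecture**: `{config['architectures'][0]}`
**Model Type**: `{config['model_type']}`
"""


# Write the sample config to a temporary file, then read it back
# (assumed here to be plain JSON).
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(SAMPLE_CONFIG), encoding="utf-8")
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)

# The f-string is evaluated here, after `config` exists.
description = build_description(config)
print(description)
```

Wrapping the template in a function is one way to make the dependency on `config` explicit. Keeping the module-level assignment below the config load, as the commit does, has the same effect.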