Datasets for cleaning up the raw HTR transcription of HCA-13-58
AI & ML interests
Use of LLMs in the post-production clean-up of handwritten text recognition (HTR) output for Early Modern legal depositions
Organization Card
1.1 Fine-tuning Small LLMs
Exploring the potential of small LLMs for cleaning raw HTR output from machine-transcribed English Admiralty depositions.
Fine-Tuned Models
- mT5-small (300M parameters)
- GPT-2 Small (124M parameters)
- LLaMA 3.1 (1B parameters)
Current Training Data
- 100 pages: 40,000 lines (~0.4M words)
- 200 pages: 80,000 lines (~0.8M words)
- 400 pages: 160,000 lines (~1.6M words)
Objectives
- Word Correction: Identify and correct errors using contextual and grammatical cues.
- Language Identification: Distinguish English from Latin text.
- Artefact Removal: Eliminate HTR-generated artefacts.
- Structural Recognition: Detect the components of a deposition (e.g., front matter, headings, articles).
- Insertion Logic: Handle missing text at marked positions.
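One way to frame the objectives above for a seq2seq model such as mT5-small is as prefixed input/target pairs, one pair per line of raw HTR. The sketch below is illustrative only: the task prefix, the field names, and the sample lines are assumptions made for this example, not the project's documented training format.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical task prefix; the project's real prompt format is not documented here.
TASK_PREFIX = "correct HTR: "


@dataclass
class TrainingPair:
    input_text: str   # prefixed raw HTR line fed to the model
    target_text: str  # hand-checked line the model should produce


def normalise_whitespace(line: str) -> str:
    """Collapse runs of whitespace and strip both ends of a line."""
    return " ".join(line.split())


def build_pairs(raw_lines, groundtruth_lines):
    """Pair each raw HTR line with its hand-checked counterpart.

    Lines are aligned by position; pairs with an empty side are skipped.
    """
    if len(raw_lines) != len(groundtruth_lines):
        raise ValueError("raw and ground-truth line counts differ")
    pairs = []
    for raw, truth in zip(raw_lines, groundtruth_lines):
        raw, truth = normalise_whitespace(raw), normalise_whitespace(truth)
        if raw and truth:
            pairs.append(TrainingPair(TASK_PREFIX + raw, truth))
    return pairs


def to_jsonl(pairs) -> str:
    """Serialise pairs as JSON Lines, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


# Invented sample lines, for illustration only.
raw = ["The sayd  shipp the Marv was laden att Lisbone", "   "]
truth = ["The sayd shipp the Mary was laden att Lisbone", ""]
pairs = build_pairs(raw, truth)
print(to_jsonl(pairs))
```

Keeping the pairing step separate from serialisation means the same pairs could be written out at line or paragraph granularity without changing the alignment logic.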
1.2 Integration with RAG Pipeline
Components:
- Retriever: BM25 or Sentence-BERT
- LLM: mT5-small
- Corpus: Curated historical texts or JSON/SQLite databases
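To make the retriever component concrete, the following is a minimal pure-Python sketch of BM25 ranking followed by prompt assembly for the LLM. It is not the project's implementation: the `BM25` class, the `build_prompt` helper, the parameter values `k1=1.5` and `b=0.75`, and the three corpus passages are all assumptions chosen for illustration, and the corpus is assumed to be non-empty.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a passage and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class BM25:
    """Small in-memory BM25 ranker over a non-empty list of passages."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.documents = documents
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in documents]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(documents)
        # Document frequency: number of passages containing each term.
        df = Counter()
        for tokens in self.doc_tokens:
            df.update(set(tokens))
        n = len(documents)
        # The "+ 1" inside the log keeps every IDF value positive.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, index):
        """BM25 score of the passage at `index` for `query`."""
        freqs = self.doc_freqs[index]
        length = len(self.doc_tokens[index])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue  # term absent from this passage
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        """Return the k highest-scoring passages, best first."""
        order = sorted(range(len(self.documents)),
                       key=lambda i: self.score(query, i), reverse=True)
        return [self.documents[i] for i in order[:k]]


def build_prompt(question, passages):
    """Assemble retrieved passages and the question into one LLM prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Invented passages, for illustration only.
corpus = [
    "The deponent saith the ship Mary was laden with sugar at Lisbon.",
    "The master of the ship hired mariners at Wapping for the voyage.",
    "The goods were seized by Dunkirk men of war near the Downs.",
]
retriever = BM25(corpus)
question = "Where was the ship laden with sugar?"
passages = retriever.top_k(question, k=1)
prompt = build_prompt(question, passages)
print(prompt)
```

A Sentence-BERT retriever would replace the `score` method with cosine similarity over embeddings, while the prompt assembly step could stay the same.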
Deployment Highlights:
- Scalable: Runs on platforms such as Hugging Face Spaces using lightweight GPU instances.
- API-Friendly: Supports integration via the Hugging Face Inference API for retrieval-augmented tasks.
📚 2.0 Datasets
2.1 Published Datasets
ENGLISH HIGH COURT OF ADMIRALTY DEPOSITIONS
- MarineLives/English-Expansions
- MarineLives/Latin-Expansions
- MarineLives/Line-Insertions
- MarineLives/HCA-1358-Errors-In-Phrases
- MarineLives/HCA-13-58-TEXT
YIDDISH LETTERS
- MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-lines
- MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-paragraphs
2.2 Unpublished Datasets
- Dataset 1: 420K tokens, full diplomatic transcription (1627–1660)
- Dataset 2: 4.5M tokens, semi-diplomatic transcription (1607–1660)
- Dataset 3: 100K tokens, diplomatic transcription of Early Modern letters (1600–1685)
🌍 Explore MarineLives
Join us in unlocking Early Modern history by exploring our Hugging Face organization and datasets! You can follow us on Bluesky at @marinelives.bsky.social. You can explore our content on our MarineLives wiki and in our ai-and-history-collaboratory GitHub repository.
Spaces (5 of 8 shown)
- Early Modern Legal Rag: Demonstration of retrieval-augmented generation (RAG)
- Mistral 7B V0.2 Summarizer: Chatbot and summarizer based on Mistral-7B-v0.2
- MarineLives Legal Assistant: HTR correction, text summarization, and text question answering
- Yiddish English Translation: UI to translate Hebrew-script Yiddish into English
- Yiddish Transcription Correction: Test UI for byt5-small-fine-tuned-yiddish-experiment-10
Models (10)
- MarineLives/byt5-finetuned-yiddish-experiment-11
- MarineLives/byt5-finetuned-yiddish-experiment-10
- MarineLives/byt5-finetuned-yiddish-experiment-9
- MarineLives/byt5-finetuned-yiddish-experiment-8
- MarineLives/byt5-finetuned-yiddish-experiment-7
- MarineLives/mBert-finetuned-yiddish-experiment-1
- MarineLives/mBert-finetuned-yiddish-experiment-3 (Fill-Mask, 0.2B parameters)
- MarineLives/bert-base-multilingual-cased-finetuned-yiddish-experiment-1
- MarineLives/hca-1370-mt5-paragraph-embedding-rag
- MarineLives/mt5-small-raw-htr-clean-ver.1.0 (0.3B parameters)
Datasets (8)
- MarineLives/Gavin_yiddish_raw_HTR_and_groundtruth_paragraphs
- MarineLives/Gavin_yiddish_raw_HT_and_groundtruth_lines
- MarineLives/raw-htr-handchecked-groundtruth-small
- MarineLives/HCA-1358-HTR-Errors-In-Phrases
- MarineLives/Line-Insertions
- MarineLives/English-Expansions
- MarineLives/Latin-Expansions
- MarineLives/HCA-13-58-TEXT