MarineLives
AI & ML interests
Use of LLMs in post-production cleanup of HTR output for Early Modern legal depositions
Recent Activity
1.1 Fine-tuning Small LLMs
Exploring the potential of small LLMs for cleaning raw HTR output from machine-transcribed English Admiralty depositions.
Fine-Tuned Models
- mT5-small (300M parameters)
- GPT-2 Small (124M parameters)
- Llama 3.2 (1B parameters)
Current Training Data
- 100 pages: 40,000 lines (~0.4M words)
- 200 pages: 80,000 lines (~0.8M words)
- 400 pages: 160,000 lines (~1.6M words)
Objectives
- Word Correction: Identify and correct errors using contextual and grammatical cues.
- Language Identification: Distinguish English from Latin text.
- Artefact Removal: Eliminate HTR-generated artefacts.
- Structural Recognition: Detect depositions' components (e.g., front matter, headings, articles).
- Insertion Logic: Handle missing text at marked positions.
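One common way to check whether a clean-up model is meeting the Word Correction objective is to compare character error rate (CER) before and after correction. The sketch below is a minimal, standard-library illustration of that metric. It is an assumption that CER is a suitable measure here (the page does not say how the models are evaluated), and the sample strings are invented for illustration, not taken from the depositions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of hypothesis against a ground-truth reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


# Invented example strings (not real deposition text).
ground_truth = "the said ship"
raw_htr = "tbe said sbip"
corrected = "the said ship"

print(f"CER before correction: {cer(raw_htr, ground_truth):.3f}")
print(f"CER after correction:  {cer(corrected, ground_truth):.3f}")
```

A drop in CER between the raw HTR line and the model's output suggests the correction helped. A rise would flag over-correction, for example modernising period spelling that the ground truth preserves.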
1.2 Integration with RAG Pipeline
Components:
- Retriever: BM25 or Sentence-BERT
- LLM: mT5-small
- Corpus: Curated historical texts or JSON/SQLite databases
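The retriever step can be sketched with a small, pure-Python BM25 scorer. This is an illustrative sketch only, not the project's implementation. The whitespace tokenizer, the `k1` and `b` defaults, the toy corpus, and the `build_prompt` format are all assumptions introduced here; in practice the retrieved passage would be passed to the fine-tuned LLM as extra context.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over an in-memory list of passages."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in set(d))
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, i: int) -> float:
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 1) -> list[int]:
        order = sorted(range(len(self.docs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def build_prompt(passage: str, raw_line: str) -> str:
    # Hypothetical prompt layout; the real input format is not documented here.
    return f"Context: {passage}\nRaw HTR: {raw_line}\nCorrected:"


# Toy corpus invented for illustration (not real deposition text).
corpus = [
    "the ship was laden with sugar and tobacco at barbados",
    "the deponent saith he was master of the said ship",
    "the goods were seized by a man of war near dunkirk",
]
bm25 = BM25(corpus)
best = bm25.top_k("who was master of the ship", k=1)[0]
print(build_prompt(corpus[best], "tbe deponent saith he was mafter"))
```

Swapping BM25 for Sentence-BERT would replace `score` with a cosine similarity over sentence embeddings, while the surrounding retrieve-then-prompt flow would stay the same.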
Deployment Highlights:
- Scalable: Easily runs on platforms like Hugging Face Spaces with lightweight GPU instances.
- API-Friendly: Supports integrations via Hugging Face Inference API for retrieval-augmented tasks.
2.0 Datasets
2.1 Published Datasets
ENGLISH HIGH COURT OF ADMIRALTY DEPOSITIONS
- MarineLives/English-Expansions
- MarineLives/Latin-Expansions
- MarineLives/Line-Insertions
- MarineLives/HCA-1358-Errors-In-Phrases
- MarineLives/HCA-13-58-TEXT
YIDDISH LETTERS
- MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-lines
- MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-paragraphs
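The published datasets listed above can, in principle, be loaded with the Hugging Face `datasets` library. The following is a hedged sketch: the helper names `repo_id` and `load_marinelives` are hypothetical, and the split names, column names, and access settings of each dataset are not stated on this page, so inspect the returned object rather than assuming a schema.

```python
ORG = "MarineLives"


def repo_id(dataset_name: str) -> str:
    """Build the Hub repository id for a dataset in the MarineLives organization."""
    return f"{ORG}/{dataset_name}"


def load_marinelives(dataset_name: str):
    """Download a published dataset from the Hub (requires network access)."""
    # Imported lazily so the helpers above work without the third-party package.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(repo_id(dataset_name))
```

For example, `load_marinelives("English-Expansions")` would request `MarineLives/English-Expansions`. Printing the result shows the available splits and columns, which is a sensible first step before using the data for fine-tuning.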
2.2 Unpublished Datasets
- Dataset 1: 420K tokens, full diplomatic transcription (1627–1660)
- Dataset 2: 4.5M tokens, semi-diplomatic transcription (1607–1660)
- Dataset 3: 100K tokens, diplomatic transcription of Early Modern letters (1600–1685)
Explore MarineLives
Join us in unlocking Early Modern history by exploring our Hugging Face organization and datasets! You can follow us on Bluesky at @marinelives.bsky.social. You can also explore our content on the MarineLives wiki and in our ai-and-history-collaboratory GitHub repository.
Spaces
Early Modern Legal Rag
Demonstration of retrieval-augmented generation
Mistral 7B V0.2 Summarizer
Chatbot and summarizer based on Mistral-7B-v0.2
MarineLives Legal Assistant
HTR correction, text summarization, and text question answering
Yiddish English Translation
UI to translate Hebrew-script Yiddish into English
Yiddish Transcription Correction
byt5-small-fine-tuned-yiddish-experiment-10 test UI
Mt5 Small Experiment 14
Hugging Face Space deploying the mT5-small-experiment-14 model