AI & ML interests

Use of LLMs for post-production clean-up of handwritten text recognition (HTR) output from Early Modern legal depositions

1.1 Fine-tuning Small LLMs

Exploring the potential of small LLMs for cleaning raw HTR output from machine-transcribed English Admiralty depositions.

Fine-Tuned Models

  • mT5-small (300M parameters)
  • GPT-2 Small (124M parameters)
  • Llama 3.2 (1B parameters)

Current Training Data

  • 100 pages: 40,000 lines (~0.4M words)
  • 200 pages: 80,000 lines (~0.8M words)
  • 400 pages: 160,000 lines (~1.6M words)

Objectives

  • Word Correction: Identify and correct errors using contextual and grammatical cues.
  • Language Identification: Distinguish English from Latin text.
  • Artefact Removal: Eliminate HTR-generated artefacts.
  • Structural Recognition: Detect the components of a deposition (e.g., front matter, headings, articles).
  • Insertion Logic: Handle missing text at marked positions.
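
One way to train a single small seq2seq model such as mT5-small on several of these objectives is to prepend a task prefix to each raw HTR line. The sketch below illustrates that idea only. The prefix strings, the `TrainingPair` type, and the sample lines are all hypothetical and invented for illustration; they are not taken from the MarineLives training data or code.

```python
from dataclasses import dataclass

# Hypothetical task prefixes, one per objective listed above (an assumption,
# not the project's actual prompt format).
TASK_PREFIXES = {
    "word_correction": "correct: ",
    "language_id": "identify language: ",
    "artefact_removal": "remove artefacts: ",
    "structure": "label structure: ",
    "insertion": "resolve insertion: ",
}


@dataclass(frozen=True)
class TrainingPair:
    source: str  # model input: task prefix followed by the raw HTR line
    target: str  # expected model output


def make_pair(task: str, raw_htr: str, target: str) -> TrainingPair:
    """Build one (source, target) pair for seq2seq fine-tuning."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    # Collapse runs of whitespace so every input uses the same spacing.
    cleaned = " ".join(raw_htr.split())
    return TrainingPair(source=TASK_PREFIXES[task] + cleaned, target=target.strip())


# Invented sample lines, for illustration only.
examples = [
    make_pair(
        "word_correction",
        "the sayd shipp  was ladcn with wines",
        "the sayd shipp was laden with wines",
    ),
    make_pair("language_id", "in cuius rei testimonium", "latin"),
]

for pair in examples:
    print(pair.source, "->", pair.target)
```

Pairs in this shape can be passed to any standard seq2seq tokenizer and trainer; keeping the prefix table in one place makes it easy to add or drop an objective.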

1.2 Integration with RAG Pipeline

Components:

  • Retriever: BM25 or Sentence-BERT
  • LLM: mT5-small
  • Corpus: Curated historical texts or JSON/SQLite databases
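
As a rough illustration of how these components fit together, the sketch below implements a small Okapi BM25 retriever in plain Python and assembles a prompt from the top-ranked passages. It is a minimal sketch under stated assumptions: the `k1` and `b` values are common defaults rather than project settings, the three corpus passages are invented, and the `build_prompt` format is hypothetical. The call to the LLM itself is left out.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenizer; a real pipeline would normalize spelling too.
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 ranker over an in-memory list of passages."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of passages containing each term.
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, i: int) -> float:
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            # Length-normalized term frequency saturation.
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 2) -> list[int]:
        ranked = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    # Hypothetical prompt layout for the downstream LLM.
    context = "\n".join(f"- {p}" for p in passages)
    return f"context:\n{context}\nquestion: {question}"


# Invented passages standing in for a curated corpus.
corpus = [
    "the ship was laden with wines at bordeaux",
    "the deponent is a mariner of wapping aged thirty years",
    "the goods were seized by a dunkirk man of war",
]
retriever = BM25(corpus)
question = "where was the ship laden"
hits = retriever.top_k(question, k=1)
prompt = build_prompt(question, [corpus[i] for i in hits])
print(prompt)
```

Swapping BM25 for Sentence-BERT would replace `score` with a cosine similarity over embeddings while leaving `top_k` and `build_prompt` unchanged.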

Deployment Highlights:

  • Scalable: Easily runs on platforms like Hugging Face Spaces with lightweight GPU instances.
  • API-Friendly: Supports integrations via Hugging Face Inference API for retrieval-augmented tasks.

📚 2.0 Datasets

2.1 Published Datasets

ENGLISH HIGH COURT OF ADMIRALTY DEPOSITIONS

  1. MarineLives/English-Expansions
  2. MarineLives/Latin-Expansions
  3. MarineLives/Line-Insertions
  4. MarineLives/HCA-1358-Errors-In-Phrases
  5. MarineLives/HCA-13-58-TEXT

YIDDISH LETTERS

  1. MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-lines
  2. MarineLives/Gavin-yiddish-raw-HTR-and-groundtruth-paragraphs

2.2 Unpublished Datasets

  • Dataset 1: 420K tokens, full diplomatic transcription (1627–1660)
  • Dataset 2: 4.5M tokens, semi-diplomatic transcription (1607–1660)
  • Dataset 3: 100K tokens, diplomatic transcription of Early Modern letters (1600–1685)

🌍 Explore MarineLives

Join us in unlocking Early Modern history by exploring our Hugging Face organization and datasets! You can follow us on Bluesky at @marinelives.bsky.social. You can also explore our content on the MarineLives wiki and in our ai-and-history-collaboratory GitHub repository.