import gradio as gr
import random
import os
from datetime import datetime
from huggingface_hub import HfApi
from typing import Optional

# Source sentences (a short history of handwriting OCR) that contributors are asked to handwrite.
sentences = [
    "Optical character recognition (OCR) is the process of converting images of text into machine-readable data.",
    "When applied to handwriting, OCR faces additional challenges because of the natural variability in individual penmanship.",
    "Over the last century, advances in computer vision and machine learning have transformed handwriting OCR from bulky, specialized hardware into highly accurate, software-driven systems.",
    "The origins of OCR date back to the early 20th century.",
    "Early pioneers explored how machines might read text.",
    "In the 1920s, inventors such as Emanuel Goldberg developed early devices that could capture printed characters by converting them into telegraph codes.",
    "Around the same time, Gustav Tauschek created the Reading Machine using template-matching methods to detect letters in images.",
    "These devices were designed for printed text and depended on fixed, machine-friendly fonts rather than natural handwriting.",
    "In the 1950s, systems like David Shepard's GISMO emerged to begin automating the conversion of paper records into digital form.",
    "Although these early OCR systems were limited in scope and accuracy, they laid the groundwork for later innovations.",
    "The 1960s saw OCR technology being applied to real-world tasks.",
    "In 1965, American inventor Jacob Rabinow developed an OCR machine specifically aimed at sorting mail by reading addresses.",
    "This was a critical step for the U.S. Postal Service.",
    "Soon after, research groups, including those at IBM, began developing machines such as the IBM 1287, which was capable of reading handprinted numbers on envelopes to facilitate automated mail processing.",
    "These systems marked the first attempts to apply computer vision to handwritten data on a large scale.",
    "By the late 1980s and early 1990s, researchers such as Yann LeCun and his colleagues developed neural network architectures to recognize handwritten digits.",
    "Their work, initially applied to reading ZIP codes on mail, demonstrated that carefully designed, constrained neural networks could achieve error rates as low as about 1% on USPS data.",
    "Sargur Srihari and his team at the Center of Excellence for Document Analysis and Recognition extended these ideas to develop complete handwritten address interpretation systems.",
    "These systems, deployed by the USPS and postal agencies worldwide, helped automate the routing of mail and revolutionized the sorting process.",
    "The development and evaluation of handwriting OCR have been driven in part by standard benchmark datasets.",
    "The MNIST dataset, introduced in the 1990s, consists of 70,000 images of handwritten digits and became the de facto benchmark for handwritten digit recognition.",
    "Complementing MNIST is the USPS dataset, which provides images of hand‐written digits derived from actual envelopes and captures real-world variability.",
    "Handwriting OCR entered a new era with the introduction of neural network models.",
    "In 1989, LeCun et al. applied backpropagation to a convolutional neural network tailored for handwritten digit recognition, an innovation that evolved into the LeNet series.",
    "By automatically learning features rather than relying on hand-designed templates, these networks drastically improved recognition performance.",
    "As computational power increased and large labeled datasets became available, deep learning models, particularly convolutional neural networks and recurrent neural networks, pushed the accuracy of handwriting OCR to near-human levels.",
    "Modern systems can handle both printed and cursive text, automatically segmenting and recognizing characters in complex handwritten documents.",
    "Cursive handwriting presents a classic challenge known as Sayre's paradox, where word recognition requires letter segmentation and letter segmentation requires word recognition.",
    "Contemporary approaches use implicit segmentation methods, often combined with hidden Markov models or end-to-end neural networks, to circumvent this paradox.",
    "Today's handwriting OCR systems are highly accurate and widely deployed.",
    "Modern systems combine OCR with artificial intelligence to not only recognize text but also extract meaning, verify data, and integrate into larger enterprise workflows.",
    "Projects such as In Codice Ratio use deep convolutional networks to transcribe historical handwritten documents, further expanding OCR applications.",
    "Despite impressive advances, handwriting OCR continues to face challenges with highly variable or degraded handwriting.",
    "Ongoing research aims to improve recognition accuracy, particularly for cursive and unconstrained handwriting, and to extend support across languages and historical scripts.",
    "With improvements in deep learning architectures, increased computing power, and large annotated datasets, future OCR systems are expected to become even more robust, handling real-world handwriting in diverse applications from postal services to archival digitization.",
    "Today's research in handwriting OCR benefits from a wide array of well-established datasets and ongoing evaluation challenges.",
    "These resources help drive the development of increasingly robust systems for both digit and full-text recognition.",
    "For handwritten digit recognition, the MNIST dataset remains the most widely used benchmark thanks to its simplicity and broad adoption.",
    "Complementing MNIST is the USPS dataset, which is derived from actual mail envelopes and provides additional challenges with real-world variability.",
    "The IAM Handwriting Database is one of the most popular datasets for unconstrained offline handwriting recognition and includes scanned pages of handwritten English text with corresponding transcriptions.",
    "It is frequently used to train and evaluate models that work on full-line or full-page recognition tasks.",
    "For systems designed to capture the dynamic aspects of handwriting, such as pen stroke trajectories, the IAM On-Line Handwriting Database offers valuable data.",
    "The CVL dataset provides multi-writer handwritten texts with a range of writing styles, making it useful for assessing the generalization capabilities of OCR systems across diverse handwriting samples.",
    "The RIMES dataset, developed for French handwriting recognition, contains scanned documents and is a key resource for evaluating systems in multilingual settings.",
    "Various ICDAR competitions, such as ICDAR 2013 and ICDAR 2017, have released datasets that reflect the complexities of real-world handwriting, including historical documents and unconstrained writing.",
    "For Arabic handwriting recognition, the KHATT dataset offers a collection of handwritten texts that capture the unique challenges of cursive and context-dependent scripts.",
    "These datasets, along with continual evaluation efforts through competitions hosted at ICDAR and ICFHR, ensure that the field keeps pushing toward higher accuracy, better robustness, and broader language coverage.",
    "Emerging benchmarks, often tailored to specific scripts, historical documents, or noisy real-world data, will further refine the state-of-the-art in handwriting OCR.",
    "This array of resources continues to shape the development of handwriting OCR systems today.",
    "This additional section outlines today's most influential datasets and benchmarks, highlighting how they continue to shape the development of handwriting OCR systems."
]

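class OCRDataCollector:
    """Serves random text blocks and keeps submitted (text, image) pairs in memory."""

    def __init__(self):
        self.collected_pairs = []
        self.current_text_block = self.get_random_text_block()
        self.hf_api = HfApi()

    def get_random_text_block(self):
        """Return 1 to 5 consecutive sentences joined into a single string."""
        block_length = random.randint(1, 5)
        # Pick a start index that leaves room for the whole block.
        start_index = random.randint(0, len(sentences) - block_length)
        return " ".join(sentences[start_index:start_index + block_length])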

    def submit_image(self, image, text_block, username: Optional[str] = None):
        """Record the pair when both an image and a username are present, then return the next text block."""
        if image is not None and username:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            self.collected_pairs.append({
                "text": text_block,
                "image": image,
                "timestamp": timestamp,
                "username": username
            })
        return self.get_random_text_block()

    def skip_text(self, text_block, username: Optional[str] = None):
        return self.get_random_text_block()
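
# A minimal sketch, not part of the original app: collected pairs are kept only
# in memory, so they are lost when the process exits. The hypothetical helper
# below (its name, the "collected_data" directory, and the file layout are all
# assumptions) writes each pair to disk as a PNG plus one JSON metadata file.
# It only assumes each image exposes a PIL-style save(path) method and that
# usernames are safe to use in file names.
def save_pairs_locally(pairs, out_dir="collected_data"):
    import json  # local imports keep this sketch self-contained
    import os

    os.makedirs(out_dir, exist_ok=True)
    records = []
    for index, pair in enumerate(pairs):
        # The index keeps file names unique when two pairs share a timestamp.
        image_file = f"{pair['timestamp']}_{pair['username']}_{index}.png"
        pair["image"].save(os.path.join(out_dir, image_file))
        records.append({
            "text": pair["text"],
            "image_file": image_file,
            "timestamp": pair["timestamp"],
            "username": pair["username"],
        })
    metadata_path = os.path.join(out_dir, "metadata.json")
    with open(metadata_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return metadata_path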

def create_gradio_interface():
    collector = OCRDataCollector()
    
    with gr.Blocks() as demo:
        gr.Markdown("## Crowdsourcing Handwriting OCR Dataset")
        
        with gr.Row():
            user_info = gr.Markdown("")

        def update_user_info(request: gr.Request):
            if request.username:
                return f"Logged in as: {request.username}", gr.update(visible=True)
            return "Please log in with your Hugging Face account to contribute to the dataset.", gr.update(visible=False)

        with gr.Column(visible=False) as main_interface:
            gr.Markdown("You will be shown between 1 and 5 consecutive sentences. Please handwrite them on paper and upload an image of your handwriting. If you wish to skip the current text, click 'Skip'.")
            
            text_box = gr.Textbox(value=collector.current_text_block, label="Text to Handwrite", interactive=False)
            image_input = gr.Image(type="pil", label="Upload Handwritten Image", sources=["upload"])
            
            with gr.Row():
                submit_btn = gr.Button("Submit")
                skip_btn = gr.Button("Skip")
        
        def check_login(request: gr.Request):
            if request.username is None:
                raise gr.Error("Please log in to use this application")
            return request.username
            
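        def protected_submit(image, text_block, request: gr.Request):
            username = check_login(request)
            # Without this check, submitting with no image silently discards the current text.
            if image is None:
                raise gr.Error("Please upload an image before submitting")
            return collector.submit_image(image, text_block, username)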
            
        def protected_skip(text_block, request: gr.Request):
            username = check_login(request)
            return collector.skip_text(text_block, username)
        
        demo.load(update_user_info, outputs=[user_info, main_interface])
        
        submit_btn.click(
            fn=protected_submit,
            inputs=[image_input, text_box],
            outputs=text_box
        )
        
        skip_btn.click(
            fn=protected_skip,
            inputs=[text_box],
            outputs=text_box
        )
    
    return demo

if __name__ == "__main__":
    demo = create_gradio_interface()
    demo.launch(auth_message="Please log in with your Hugging Face account to contribute to the dataset.")