Why does Chinese image OCR give wrong output?
Hello, I recently used this model to do OCR on a Chinese image, but the output words are wrong. The code I use is below:
from PIL import Image
from transformers import LayoutXLMProcessor

img_pil = Image.open('/kaggle/input/timuimage/timu.jpg')
image = img_pil.convert("RGB")

processor = LayoutXLMProcessor.from_pretrained("Microsoft/layoutlmv3-base-chinese")
feature_extractor = processor.feature_extractor

# preprocess image to text
encoded_inputs = feature_extractor(image)
words = encoded_inputs.words

# just join the words into a single string
text = ""
for word in words[0]:
    text = text + word
print(text)
The output is shown below:
re\1AlltTTiani|iete44si)ii"eahi|WAiL“4HNHHAilKtintteersNaaiftyUeawliditieeaHuseuay1he‘4LrLHauiiiasiliatififiaigMtiiarecuaEtaaii!t~BCpecaaOaeeiyfnaeipiesaoriyeae4raBiia4aiaei{thiulEiuaadlfh,aeaatteateeileweypakPotHsae
The image I use is from https://www.kaggle.com/datasets/viking714/timuimage; it is public, so everyone can see it.
I used the same method to OCR English images with the LayoutXLM and LayoutLMv2 models, and both work fine.
Thank you very much.
You need to set the OCR language to Chinese + English, i.e. 'chi_sim+eng':
model_name="microsoft/layoutlmv3-base-chinese"
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name,ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
processor = LayoutLMv3Processor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)
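For reference, ocr_lang is forwarded to Tesseract's lang argument, so a quick way to sanity-check the OCR language on its own is to call Tesseract directly. This is just a sketch, assuming pytesseract and the chi_sim language pack are installed:

import pytesseract
from PIL import Image

# Run Tesseract directly with the same language setting to verify Chinese is recognized.
img = Image.open('/kaggle/input/timuimage/timu.jpg').convert("RGB")
print(pytesseract.image_to_string(img, lang='chi_sim+eng'))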
Hello, I tried to use it in the same way, but I got this error:
ValueError Traceback (most recent call last)
in <cell line: 4>()
2 image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name,ocr_lang='chi_sim+eng')
3 tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
----> 4 processor = LayoutLMv3Processor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)
ValueError: Received XLMRobertaTokenizer for argument tokenizer, but a ('LayoutLMv3Tokenizer', 'LayoutLMv3TokenizerFast') was expected.
What could be wrong? Thanks.
Find the source code of LayoutLMv3Processor and change
tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast")
to
tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast", 'XLMRobertaTokenizer', 'XLMRobertaTokenizerFast', 'LayoutXLMTokenizer')
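If you would rather not edit the installed library file, a minimal runtime sketch with the same effect is to override the class attribute before constructing the processor (this assumes the tuple above is the only thing that needs to change):

from transformers import LayoutLMv3Processor

# Same change as described above, applied at runtime instead of editing the source file.
LayoutLMv3Processor.tokenizer_class = (
    "LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast",
    "XLMRobertaTokenizer", "XLMRobertaTokenizerFast", "LayoutXLMTokenizer",
)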
Hello, did you manage to solve it? I followed the method above, but the output still only shows English.
Following the earlier answers, you can get Chinese results with the code below. If it does not work, check whether your tesseract-ocr installation is missing the chi_sim.traineddata file; it is usually stored in /usr/share/tesseract-ocr/4.00/tessdata/
from transformers import XLMRobertaTokenizer, AutoModel, AutoProcessor, LayoutLMv3ImageProcessor, LayoutLMv3Processor

model_name = "Microsoft/layoutlmv3-base-chinese"
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name, ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
processor = LayoutLMv3Processor(image_processor=image_processor, tokenizer=tokenizer, apply_ocr=True)

feature_extractor = processor.feature_extractor
# `image` is the PIL image from the original question
inputs = feature_extractor(image)
inputs['words']
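To confirm the Chinese language pack is actually visible to Tesseract before running the processor, a quick check like the following can help. This is only a sketch; it assumes pytesseract is installed, which the apply_ocr path requires anyway:

import os
import pytesseract

# 'chi_sim' should appear in the list of languages Tesseract can see.
print(pytesseract.get_languages(config=""))

# Path mentioned above; adjust it if your tesseract-ocr version stores data elsewhere.
tessdata = "/usr/share/tesseract-ocr/4.00/tessdata"
print(os.path.exists(os.path.join(tessdata, "chi_sim.traineddata")))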
I don't understand why you would modify the source code. You only need to define your own extension class, LayoutLMv3ChineseProcessor:
model_name="microsoft/layoutlmv3-base-chinese"
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name,ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)
class LayoutLMv3ChineseProcessor(LayoutLMv3Processor):
tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast",'XLMRobertaTokenizer','XLMRobertaTokenizerFast','LayoutXLMTokenizer')
processor = LayoutLMv3ChineseProcessor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)
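As a quick usage sketch (assuming `image` is the PIL image from the original question), the OCR words can then be extracted the same way as before:

# apply_ocr defaults to True on the image processor, so Tesseract runs with chi_sim+eng
features = image_processor(image)
print("".join(features.words[0]))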
The replies above miss the core point. There is no need to patch any code at all. Follow my approach, and training and inference are done in under 59 lines of code:
from transformers import LayoutXLMTokenizer, LayoutLMv3ImageProcessor, LayoutLMv3Processor

tokenizer = LayoutXLMTokenizer.from_pretrained(
    "./layoutlmv3-base-chinese"
)
image_processor = LayoutLMv3ImageProcessor.from_pretrained(
    "./layoutlmv3-base-chinese", apply_ocr=False
)
processor = LayoutLMv3Processor(tokenizer=tokenizer, image_processor=image_processor, apply_ocr=False)
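Because apply_ocr=False disables the built-in Tesseract call, the words and bounding boxes have to come from an external OCR engine (e.g. PaddleOCR). Below is a minimal sketch of feeding such results into the processor above; it assumes the processor constructed successfully in your transformers version, the placeholder words and boxes stand in for real OCR output, and the boxes are already normalized to the 0-1000 range that LayoutLM-family models expect:

from PIL import Image

image = Image.open("/kaggle/input/timuimage/timu.jpg").convert("RGB")

# Placeholder OCR results from an external engine; replace with real output.
words = ["示例", "文本"]
boxes = [[100, 100, 200, 130], [210, 100, 320, 130]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())  # e.g. input_ids, attention_mask, bbox, pixel_values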
How was this figure produced?
By model inference.