Python "pytesseract": A Module for Chinese Text Recognition

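While working with .ttf files, I needed to recognize Chinese text in images. A common approach is to call Baidu's text recognition (OCR) API, but because I had a large batch of images to process, I first tried pytesseract. Note that pytesseract is not built into Python: it is a third-party wrapper around the Tesseract OCR engine, and the engine itself must be installed separately. These are my brief notes.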

Contents

I. Introduction

(1) image_to_string

(2) The config parameter

II. Code

I. Introduction

(1) image_to_string

pytesseract.image_to_string

def image_to_string(image, lang=None, config='', nice=0, output_type=Output.STRING, timeout=0):
    """Returns the result of a Tesseract OCR run on the provided image to string"""

Parameters:

image Object or String - PIL Image/NumPy array or file path of the image to be processed by Tesseract. If you pass an object instead of a file path, pytesseract will implicitly convert the image to RGB mode.

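lang String - Tesseract language code string. Defaults to eng if not specified! Example for multiple languages: lang='eng+fra'

For Simplified Chinese the code is chi_sim. The matching language data (chi_sim.traineddata) has to be installed for Tesseract, and several languages can be combined with +.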
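To make the lang parameter concrete, here is a minimal sketch that joins language codes with + and optionally checks them against the languages Tesseract reports as installed. The helper names build_lang and ocr_chinese_english are hypothetical and not part of pytesseract. The sketch assumes the chi_sim and eng language data are installed and that your pytesseract release provides get_languages.

```python
def build_lang(codes, installed=None):
    """Join Tesseract language codes with '+' (for example 'chi_sim+eng').

    `installed` is an optional collection of available language codes;
    when it is given, any requested code missing from it raises ValueError.
    """
    codes = list(codes)
    if not codes:
        raise ValueError("at least one language code is required")
    if installed is not None:
        missing = [code for code in codes if code not in installed]
        if missing:
            raise ValueError(f"language data not installed: {missing}")
    return "+".join(codes)


def ocr_chinese_english(image_path):
    """Hypothetical helper: OCR an image mixing Simplified Chinese and English."""
    # Imported here so build_lang stays usable without Tesseract installed.
    import pytesseract
    from PIL import Image

    installed = pytesseract.get_languages(config="")
    lang = build_lang(["chi_sim", "eng"], installed)
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)
```

Checking the codes up front gives a clearer error message than letting Tesseract fail on missing language data in the middle of a batch.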

config String - Any additional custom configuration flags that are not available via the pytesseract function. For example: config='--psm 6'

nice Integer - modifies the processor priority for the Tesseract run. Not supported on Windows. Nice adjusts the niceness of unix-like processes.

output_type Class attribute - specifies the type of the output, defaults to string. For the full list of all supported types, please check the definition of pytesseract.Output class.
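As a rough illustration of output_type, the sketch below requests a dictionary result and normalizes the different result shapes back to a plain string. The helper names are hypothetical. It assumes, based on my reading of pytesseract, that Output.DICT yields a dictionary with a 'text' key and Output.BYTES yields UTF-8 encoded bytes; check the Output class in your installed version before relying on this.

```python
def text_from_result(result):
    """Normalize an image_to_string result to str.

    Handles str (Output.STRING), bytes (assumed UTF-8, Output.BYTES)
    and a dict with a 'text' key (assumed shape of Output.DICT).
    """
    if isinstance(result, dict):
        return result["text"]
    if isinstance(result, bytes):
        return result.decode("utf-8")
    return result


def ocr_as_dict(image_path, lang="chi_sim"):
    """Hypothetical helper: run OCR and ask for a dictionary result."""
    # Imported here so text_from_result works without Tesseract installed.
    import pytesseract
    from PIL import Image
    from pytesseract import Output

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang, output_type=Output.DICT)
```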

timeout Integer or Float - duration in seconds for the OCR processing, after which, pytesseract will terminate and raise RuntimeError.
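For batch work it can help to skip an image that takes too long instead of aborting the whole run. The following is a minimal sketch of that idea. The name ocr_with_timeout and its injectable ocr argument are hypothetical. It assumes the timeout surfaces as a plain RuntimeError, as described above, and re-raises subclasses of RuntimeError so that other Tesseract failures are not silently swallowed.

```python
def ocr_with_timeout(image, seconds, ocr=None, **kwargs):
    """Run OCR and return the text, or None if it exceeds `seconds`.

    `ocr` defaults to pytesseract.image_to_string; passing another callable
    with the same keyword interface makes the helper easy to test.
    """
    if ocr is None:
        import pytesseract
        ocr = pytesseract.image_to_string
    try:
        return ocr(image, timeout=seconds, **kwargs)
    except RuntimeError as exc:
        # Assumption: a timeout is reported as a plain RuntimeError.
        # Subclasses indicate other failures, so let them propagate.
        if type(exc) is not RuntimeError:
            raise
        return None
```

A call such as ocr_with_timeout('page.png', 2, lang='chi_sim') would then return None for an image that takes longer than two seconds, where 'page.png' is only a placeholder path.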

(2) The config parameter

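This section looks at the config parameter in more detail. One flag commonly passed through config is --psm, the page segmentation mode. Running tesseract --help-psm on the command line lists the available modes: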

$ tesseract --help-psm
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
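The modes listed above map directly onto the config string. As a small sketch, a helper can validate the mode number and build the flag. The name build_config and its extra argument are hypothetical, and the valid range 0 to 13 is taken from the listing above.

```python
PSM_MIN, PSM_MAX = 0, 13  # range taken from the --help-psm listing


def build_config(psm, extra=""):
    """Build a config string such as '--psm 6' for image_to_string.

    `extra` holds any further Tesseract flags, appended unchanged.
    """
    if not isinstance(psm, int) or not PSM_MIN <= psm <= PSM_MAX:
        raise ValueError(f"psm must be an integer from {PSM_MIN} to {PSM_MAX}")
    return f"--psm {psm} {extra}".strip()
```

For an image that holds a single line of text, the result of build_config(7) can be passed as the config argument; for an image cropped to one character, build_config(10) is the corresponding choice.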

II. Code

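Usage is as follows. Note that Tesseract 4 and later expect the flag as --psm with two dashes; the single-dash form -psm belongs to Tesseract 3.x.

import pytesseract
from PIL import Image

# Recognize Simplified Chinese text (requires the chi_sim language data)
text = pytesseract.image_to_string(Image.open('path/to/image.png'), lang='chi_sim', config='--psm 6')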
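Since the motivation was recognizing a large batch of images, the single call can be wrapped in a loop over a directory. This is only a sketch under assumptions: the function name ocr_directory, the set of file suffixes, and the injectable ocr argument are all hypothetical choices, and the default settings simply mirror the call shown in this section.

```python
from pathlib import Path

# Hypothetical choice of file types to treat as images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def ocr_directory(directory, ocr=None, lang="chi_sim", config="--psm 6"):
    """Return {file name: recognized text} for each image file in `directory`.

    `ocr` defaults to pytesseract.image_to_string; another callable with the
    same keyword interface can be passed in for testing.
    """
    if ocr is None:
        import pytesseract
        ocr = pytesseract.image_to_string
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip files that are not images
        # image_to_string accepts a file path as well as a PIL Image.
        results[path.name] = ocr(str(path), lang=lang, config=config).strip()
    return results
```

Calling ocr_directory('images') would then return one entry per image file, where 'images' is a placeholder directory name.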
