Calling Tesseract from C++

Tesseract-ocr is a well-known open-source OCR engine. This post briefly covers how to use its C++ API.

This post is mainly based on:

and on the API documentation: https://ub-mannheim.github.io/tesseract/index.html

I won't go into how to build Tesseract here. With Visual C++ it takes a single command: vcpkg install tesseract.

Let's start with an official example:

#include <cstdio>
#include <cstdlib>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Open input image with leptonica library
    Pix *image = pixRead("phototest.png");
    api->SetImage(image);
    // Get OCR result
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    // Destroy used object and release memory
    api->End();
    delete api;
    delete [] outText;
    pixDestroy(&image);

    return 0;
}

The call api->Init(NULL, "eng") loads eng.traineddata; NULL means it is loaded from the default location. You can also pass in the path where eng.traineddata is stored instead.

To load training data for other languages at the same time, write: api->Init(NULL, "eng+deu")

This loads both the English and the German data.
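When the set of languages is only known at run time, the "eng+deu" string has to be assembled first. The helper below is a minimal sketch of that; join_languages is a name made up for this post, not part of Tesseract.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not a Tesseract function): joins language codes
// into the "lang1+lang2" form shown above, skipping empty entries.
std::string join_languages(const std::vector<std::string>& langs) {
    std::string joined;
    for (const std::string& lang : langs) {
        if (lang.empty()) continue;          // ignore blank codes
        if (!joined.empty()) joined += '+';  // separator between codes
        joined += lang;
    }
    return joined;
}
```

The result would then be passed along the lines of api->Init(NULL, join_languages({"eng", "deu"}).c_str()).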

api->Init(NULL, "xxx") can be called more than once in a program. Each call reinitializes the OCR engine.

api->SetImage(image); loads the image. After that you can also restrict OCR to one region of the image, with a statement like this:

api->SetRectangle(left, top, width, height);

api->GetUTF8Text() returns the recognized text. Note in particular that GetUTF8Text() returns a C string, and the caller has to free its memory:

delete [] outText;

This also shows how low-level Tesseract's API is; it really ought to return a std::string. As it stands, it is easy to leak memory.

After OCR you will usually also want to look at the confidence value of the result:

api->MeanTextConf();

The value ranges from 0 to 100; the higher it is, the more likely the recognition is correct.

When you are done, call api->End(); to release the memory.

That is about the simplest possible usage. The example above uses an image, which I include here:

On my machine the output is:

1284567890 4934567890

This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.

The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

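You can see that a few digits were misrecognized. If you use SetRectangle to enclose just that string of digits and run recognition again, all of them come out right.

Now for a slightly more advanced example: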

Pix *image = pixRead("phototest.png");
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
api->Init(NULL, "eng");
api->SetImage(image);
Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
printf("Found %d textline image components.\n", boxes->n);
for (int i = 0; i < boxes->n; i++) {
    BOX* box = boxaGetBox(boxes, i, L_CLONE);
    api->SetRectangle(box->x, box->y, box->w, box->h);
    char* ocrResult = api->GetUTF8Text();
    int conf = api->MeanTextConf();
    fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
            i, box->x, box->y, box->w, box->h, conf, ocrResult);
    boxDestroy(&box);
}

This example splits the text in the image into lines, using the following call:

api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);

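RIL_TEXTLINE splits by text line. You can also split by paragraph (RIL_PARA), word (RIL_WORD), or character (RIL_SYMBOL).

The output of the example above is shown below; as you can see, the recognition rate is not high.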

Found 9 textline image components.
Box[0]: x=42, y=33, w=321, h=33, confidence: 40, text: 123496 /890 1234567890
Box[1]: x=36, y=92, w=544, h=30, confidence: 92, text: This Is a lot of 12 point text to test the
Box[2]: x=36, y=126, w=582, h=31, confidence: 89, text: ocr code and see If it works on ail types
Box[3]: x=36, y=160, w=187, h=24, confidence: 88, text: of tie format.
Box[4]: x=36, y=194, w=549, h=31, confidence: 90, text: The quick brown dog Jumped over the
Box[5]: x=37, y=228, w=548, h=31, confidence: 75, text: lazy Tox. 1ne quick brown dog Jumped
Box[6]: x=36, y=262, w=561, h=31, confidence: 93, text: over the lazy fox. [he quick brown dog
Box[7]: x=43, y=296, w=518, h=31, confidence: 89, text: jumped over the lazy Tox. [ne quick
Box[8]: x=37, y=330, w=524, h=31, confidence: 82, text: brown dog Jumped over the lazy Tox.

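The recognition rate is low because the statement api->SetRectangle(box->x, box->y, box->w, box->h); is slightly off. If it is changed as follows, so that the rectangle starts one pixel higher: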

api->SetRectangle(box->x, box->y-1, box->w, box->h+1);

the recognition rate improves considerably. The result is now:

Found 9 textline image components.
Box[0]: x=42, y=33, w=321, h=33, confidence: 91, text: 1234567890 1234567890
Box[1]: x=36, y=92, w=544, h=30, confidence: 95, text: This is a lot of 12 point text to test the
Box[2]: x=36, y=126, w=582, h=31, confidence: 95, text: ocr code and see if it works on all types
Box[3]: x=36, y=160, w=187, h=24, confidence: 94, text: of file format.
Box[4]: x=36, y=194, w=549, h=31, confidence: 95, text: The quick brown dog jumped over the
Box[5]: x=37, y=228, w=548, h=31, confidence: 93, text: lazy fox. The quick brown dog jumped
Box[6]: x=36, y=262, w=561, h=31, confidence: 95, text: over the lazy fox. The quick brown dog
Box[7]: x=43, y=296, w=518, h=31, confidence: 95, text: jumped over the lazy fox. The quick
Box[8]: x=37, y=330, w=524, h=31, confidence: 93, text: brown dog jumped over the lazy fox.
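The one-pixel adjustment above can be generalized into a small helper that grows a rectangle by a margin and clips it to the image. This is only a sketch: Rect and pad_rect are made-up names, and unlike the fix above, which moves only the top edge, it pads all four sides. The image size is a parameter the caller has to supply.

```cpp
#include <algorithm>

// Hypothetical rectangle type using the same fields as SetRectangle's arguments.
struct Rect {
    int left, top, width, height;
};

// Hypothetical helper: grows r by `margin` pixels on every side and clips
// the result to an image of img_w x img_h pixels.
Rect pad_rect(const Rect& r, int margin, int img_w, int img_h) {
    int x1 = std::max(0, r.left - margin);
    int y1 = std::max(0, r.top - margin);
    int x2 = std::min(img_w, r.left + r.width + margin);
    int y2 = std::min(img_h, r.top + r.height + margin);
    return Rect{x1, y1, x2 - x1, y2 - y1};
}
```

The padded values would then be passed to api->SetRectangle(p.left, p.top, p.width, p.height). Which margin works best is something to find out by experiment on your own images.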

Another problem with the code above is that the allocated strings are never freed. The correct code should look like this:

Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
printf("Found %d textline image components.\n", boxes->n);
for (int i = 0; i < boxes->n; i++) {
    BOX* box = boxaGetBox(boxes, i, L_CLONE);
    // Start the rectangle one pixel higher than the detected box
    api->SetRectangle(box->x, box->y-1, box->w, box->h+1);
    char* ocrResult = api->GetUTF8Text();
    int conf = api->MeanTextConf();
    fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
            i, box->x, box->y, box->w, box->h, conf, ocrResult);
    delete [] ocrResult;   // free the string returned by GetUTF8Text()
    boxDestroy(&box);
}
boxaDestroy(&boxes);       // also release the box array itself

Finally, one more example:

Pix *image = pixRead("phototest.png");
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
api->Init(NULL, "eng");
api->SetImage(image);
api->Recognize(0);
tesseract::ResultIterator* ri = api->GetIterator();
tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
if (ri != 0) {
    do {
        const char* word = ri->GetUTF8Text(level);
        float conf = ri->Confidence(level);
        int x1, y1, x2, y2;
        ri->BoundingBox(level, &x1, &y1, &x2, &y2);
        printf("word: '%s';  \tconf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
               word, conf, x1, y1, x2, y2);
        delete[] word;
    } while (ri->Next(level));
}

This code is much like the previous example, except that it uses an iterator, so I won't explain it further.

The program prints the following:

word: '1284567890';     conf: 64.73; BoundingBox: 42,33,170,50;
word: '4934567890';     conf: 56.32; BoundingBox: 190,47,363,66;
word: 'This';   conf: 96.59; BoundingBox: 36,92,96,116;
word: 'is';     conf: 96.92; BoundingBox: 109,92,129,116;
word: 'a';      conf: 96.33; BoundingBox: 141,98,156,116;
word: 'lot';    conf: 96.33; BoundingBox: 169,92,201,116;
word: 'of';     conf: 96.45; BoundingBox: 212,92,240,116;
word: '12';     conf: 96.45; BoundingBox: 251,92,282,116;
word: 'point';          conf: 96.47; BoundingBox: 296,92,364,122;
word: 'text';   conf: 96.47; BoundingBox: 374,93,427,116;
word: 'to';     conf: 96.88; BoundingBox: 437,93,463,116;
word: 'test';   conf: 96.98; BoundingBox: 474,93,526,116;
word: 'the';    conf: 96.37; BoundingBox: 536,92,580,116;
word: 'ocr';    conf: 96.07; BoundingBox: 36,132,81,150;
word: 'code';   conf: 96.07; BoundingBox: 91,126,160,150;
word: 'and';    conf: 96.62; BoundingBox: 172,126,223,150;
word: 'see';    conf: 96.53; BoundingBox: 236,132,286,150;
word: 'if';     conf: 94.37; BoundingBox: 299,126,314,150;
word: 'it';     conf: 94.37; BoundingBox: 325,126,339,150;
word: 'works';          conf: 95.96; BoundingBox: 348,126,433,150;
word: 'on';     conf: 93.54; BoundingBox: 445,132,478,150;
word: 'all';    conf: 93.54; BoundingBox: 500,126,529,150;
word: 'types';          conf: 96.90; BoundingBox: 541,127,618,157;
word: 'of';     conf: 96.23; BoundingBox: 36,160,64,184;
word: 'file';   conf: 95.72; BoundingBox: 72,160,113,184;
word: 'format.';        conf: 95.68; BoundingBox: 123,160,223,184;
word: 'The';    conf: 96.51; BoundingBox: 36,194,91,218;
word: 'quick';          conf: 96.63; BoundingBox: 102,194,177,224;
word: 'brown';          conf: 96.82; BoundingBox: 189,194,274,218;
word: 'dog';    conf: 95.79; BoundingBox: 287,194,339,225;
word: 'jumped';         conf: 95.79; BoundingBox: 348,194,456,225;
word: 'over';   conf: 96.60; BoundingBox: 468,200,531,218;
word: 'the';    conf: 96.49; BoundingBox: 540,194,585,218;
word: 'lazy';   conf: 96.40; BoundingBox: 37,228,92,259;
word: 'fox.';   conf: 96.44; BoundingBox: 103,228,153,252;
word: 'The';    conf: 96.70; BoundingBox: 165,228,220,252;
word: 'quick';          conf: 96.63; BoundingBox: 232,228,307,258;
word: 'brown';          conf: 96.62; BoundingBox: 319,228,404,252;
word: 'dog';    conf: 95.80; BoundingBox: 417,228,468,259;
word: 'jumped';         conf: 95.80; BoundingBox: 478,228,585,259;
word: 'over';   conf: 96.29; BoundingBox: 36,268,99,286;
word: 'the';    conf: 96.28; BoundingBox: 109,262,153,286;
word: 'lazy';   conf: 96.51; BoundingBox: 165,262,221,293;
word: 'fox.';   conf: 96.30; BoundingBox: 231,262,281,286;
word: 'The';    conf: 96.65; BoundingBox: 294,262,349,286;
word: 'quick';          conf: 96.61; BoundingBox: 360,262,435,292;
word: 'brown';          conf: 96.12; BoundingBox: 447,262,532,286;
word: 'dog';    conf: 96.12; BoundingBox: 545,262,597,293;
word: 'jumped';         conf: 96.73; BoundingBox: 43,296,150,327;
word: 'over';   conf: 96.38; BoundingBox: 162,302,226,320;
word: 'the';    conf: 96.38; BoundingBox: 235,296,279,320;
word: 'lazy';   conf: 96.80; BoundingBox: 292,296,347,327;
word: 'fox.';   conf: 96.77; BoundingBox: 357,296,407,320;
word: 'The';    conf: 96.17; BoundingBox: 420,296,475,320;
word: 'quick';          conf: 96.95; BoundingBox: 486,296,561,326;
word: 'brown';          conf: 96.83; BoundingBox: 37,330,122,354;
word: 'dog';    conf: 96.32; BoundingBox: 135,330,187,361;
word: 'jumped';         conf: 96.80; BoundingBox: 196,330,304,361;
word: 'over';   conf: 96.95; BoundingBox: 316,336,379,354;
word: 'the';    conf: 96.56; BoundingBox: 388,330,433,354;
word: 'lazy';   conf: 95.99; BoundingBox: 445,330,500,361;
word: 'fox.';   conf: 96.61; BoundingBox: 511,330,561,354;

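As you can see, there are still recognition errors. For the misrecognized words, you can record their positions, enlarge the region slightly, and recognize them again with SetRectangle. But never do this while iterating, because re-recognition destroys the iterator's state.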
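The "record now, retry later" idea can be sketched without touching the engine inside the loop: collect each word's confidence and bounding box during iteration, then compute the regions to retry once the loop has finished. Everything below (WordResult, Region, regions_to_retry, and the threshold and margin parameters) is hypothetical and only illustrates the bookkeeping; clipping to the right and bottom image edges is left out because the image size is not known here.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical record filled in during iteration: confidence plus the
// corner coordinates reported by BoundingBox().
struct WordResult {
    float conf;
    int x1, y1, x2, y2;
};

// Hypothetical region in SetRectangle's (left, top, width, height) form.
struct Region {
    int left, top, width, height;
};

// Returns an enlarged region for every word whose confidence is below min_conf.
std::vector<Region> regions_to_retry(const std::vector<WordResult>& words,
                                     float min_conf, int margin) {
    std::vector<Region> regions;
    for (const WordResult& w : words) {
        if (w.conf >= min_conf) continue;           // good enough, keep as is
        int left = std::max(0, w.x1 - margin);      // do not go past the image origin
        int top = std::max(0, w.y1 - margin);
        int right = w.x2 + margin;
        int bottom = w.y2 + margin;
        regions.push_back(Region{left, top, right - left, bottom - top});
    }
    return regions;
}
```

After the do/while loop has ended, each returned region can be passed to SetRectangle and recognized again.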
