  • Tesseract

    2020-12-09 15:04:14
    There are now two flavours of Tesseract: tesseract (https://github.com/Homebrew/homebrew-core/blob/master/Formula/tesseract.rb) and ... I guess we should allow both: ...
  • tesseract

    2019-08-20 13:21:22
    tesseract installation, tesseract usage guide, tesseract API, improving tesseract recognition quality, tesseract configuration files, tesseract training on Windows, tesseract training on Linux, pytesseract
  • Tesseract:Desafio Vaga Tesseract
  • tesseract 3.0.5 + tesseract 4.0.0 related
  • tesseract installation and usage

    2018-09-12 09:49:43

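    1. Install tesseract

    OCR, short for Optical Character Recognition, is the process of scanning characters and translating them into electronic text based on their shapes. Graphical CAPTCHAs consist of irregular characters, which are in fact ordinary characters that have been slightly distorted.

    tesseract download page: https://digi.bib.uni-mannheim.de/tesseract/

    The download page lists a variety of .exe installers; here we choose a 3.0x release.

    File names containing dev are development builds and those without dev are stable builds. Choose a build without dev, for example tesseract-ocr-setup-3.05.02.exe.

    After the download finishes, double-click the file to open the installer window.

    In the installer you can tick the Additional language data (download) option to install the language packs supported by the OCR engine, so that it can recognize multiple languages. Then click Next through the remaining steps.

    Next, to use tesseract from Python code, install pytesseract with pip: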

    pip install pytesseract

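    2. Configure the environment variable

    To make tesseract available globally, add its installation path (for example D:\Program Files (x86)\Tesseract-OCR) to the PATH environment variable.

    After that, run tesseract -v on the command line. If the version information is printed, the environment variable is configured correctly.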
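The same check can be scripted. The following is a minimal Python sketch, assuming only the standard library; the helper names `find_tesseract`, `first_banner_line` and `tesseract_version` are made up for this illustration. It looks the executable up on PATH and, if found, runs `tesseract -v` as described above.

```python
import shutil
import subprocess


def find_tesseract(name="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def first_banner_line(output):
    """Return the first non-empty line of the version output, or ''."""
    for line in output.splitlines():
        if line.strip():
            return line.strip()
    return ""


def tesseract_version(exe):
    """Run `<exe> -v` and return the first line of its banner."""
    proc = subprocess.run([exe, "-v"], capture_output=True, text=True)
    # The banner may be written to stderr rather than stdout,
    # so both streams are inspected.
    return first_banner_line(proc.stdout + proc.stderr)


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("tesseract is not on PATH; check the environment variable")
    else:
        print(exe, "->", tesseract_version(exe))
```

If the script reports that tesseract is not on PATH, reopen the terminal after editing the environment variable, since running processes keep the old value.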

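    3. Verify the installation

    Next, we test with tesseract and pytesseract in turn.

    We use the following image as the test sample.

    The image is available at https://raw.githubusercontent.com/Python3WebSpider/TestTess/master/image.png and can be saved or downloaded directly.

    First test from the command line: download the image to the chromeDownload folder on drive D, save it as image.png, then open a command prompt in that folder and run tesseract: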

    tesseract image.png result 

    The output is:

    D:\chromeDownload>tesseract image.png result
    Tesseract Open Source OCR Engine v3.05.02 with Leptonica

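    Here we invoked the tesseract command: the first argument is the image file name, and the second, result, is the base name of the output file.

    The result is the text recognized in the image: Python3WebSpider. A result.txt file now appears in the chromeDownload folder, so the text in the image has been converted to electronic text.

    We can also test from Python code with the pytesseract library. The test code is: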

    from PIL import Image
    import pytesseract
    
    text = pytesseract.image_to_string(Image.open(r'D:\chromeDownload\image.png'))
    print(text)
    

    We first read the image file with Image, then call pytesseract's image_to_string() method and print the recognized text.

    The output is:

    Python3WebSpider

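    If the result is printed, both tesseract and pytesseract are installed correctly.

    4. Pitfalls encountered

    When testing with the tesseract command line, it initially reported the following error: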

    Error opening data file \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddata
    Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
    Failed loading language 'eng'
    Tesseract couldn't load any languages!
    Could not initialize tesseract.

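    The error means the TESSDATA_PREFIX environment variable is missing, so no language can be loaded and tesseract cannot be initialized.

    The fix is simple: add TESSDATA_PREFIX to the environment variables.

    Note: the path in the variable value is "D:/Program Files (x86)/Tesseract-OCR", written with forward slashes "/". Paths copied from Windows use backslashes "\" by default.

    After configuring it, reopen the command prompt and tesseract works normally.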
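For scripts, the variable can also be set for the current process instead of system-wide. The sketch below is an illustration only: the helper names are hypothetical, and it assumes the layout implied by the error message above, where TESSDATA_PREFIX is the parent directory of "tessdata". Other Tesseract versions may interpret the variable differently.

```python
import os
import tempfile


def has_language_data(prefix, lang="eng"):
    """True if <prefix>/tessdata/<lang>.traineddata exists."""
    return os.path.isfile(os.path.join(prefix, "tessdata", lang + ".traineddata"))


def set_tessdata_prefix(prefix, lang="eng"):
    """Set TESSDATA_PREFIX for this process and the child processes it starts.

    Assumption: the variable must name the PARENT of the tessdata folder,
    as the error message in the article states.
    """
    if not has_language_data(prefix, lang):
        raise FileNotFoundError(f"{lang}.traineddata not found under {prefix}/tessdata")
    value = prefix.replace("\\", "/")  # forward slashes, as the article advises
    os.environ["TESSDATA_PREFIX"] = value
    return value


if __name__ == "__main__":
    # Demonstration with a throwaway directory standing in for the install folder.
    with tempfile.TemporaryDirectory() as install_dir:
        os.makedirs(os.path.join(install_dir, "tessdata"))
        open(os.path.join(install_dir, "tessdata", "eng.traineddata"), "w").close()
        print(set_tessdata_prefix(install_dir))
```

In real use the argument would be the installation folder, for example the D:/Program Files (x86)/Tesseract-OCR path mentioned above, and the call would come before any tesseract subprocess is started.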

    The second pitfall is the following error when using pytesseract:

    Traceback (most recent call last):
      File "D:\Python36\lib\site-packages\pytesseract\pytesseract.py", line 170, in run_tesseract
        proc = subprocess.Popen(cmd_args, **subprocess_args())
      File "D:\Python36\lib\subprocess.py", line 709, in __init__
        restore_signals, start_new_session)
      File "D:\Python36\lib\subprocess.py", line 997, in _execute_child
        startupinfo)
    FileNotFoundError: [WinError 2] 系统找不到指定的文件。

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "D:/python/20180911.py", line 4, in <module>
        text = pytesseract.image_to_string(Image.open(r'D:\chromeDownload\image.png'))
      File "D:\Python36\lib\site-packages\pytesseract\pytesseract.py", line 294, in image_to_string
        return run_and_get_output(*args)
      File "D:\Python36\lib\site-packages\pytesseract\pytesseract.py", line 202, in run_and_get_output
        run_tesseract(**kwargs)
      File "D:\Python36\lib\site-packages\pytesseract\pytesseract.py", line 172, in run_tesseract
        raise TesseractNotFoundError()
    pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path

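    This one is frustrating: the PATH entry had been added, yet the error still says tesseract is not installed or not in PATH.

    A Baidu search turned up the following solution.

    After pytesseract is installed, a pytesseract folder is created under site-packages in Python's Lib directory (here D:\Python36\Lib\site-packages\pytesseract). Open pytesseract.py in that folder with an editor such as Notepad and find the following two lines: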

    # CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
    tesseract_cmd = 'tesseract'

    Change tesseract_cmd = 'tesseract' to: tesseract_cmd = 'D:/Program Files (x86)/Tesseract-OCR/tesseract.exe'

    This sets tesseract_cmd to the absolute path of your tesseract installation so that it can be found. Save the change and rerun the Python code; it should now succeed.
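As an alternative to editing the installed file, the same `tesseract_cmd` setting can be assigned from your own script through the `pytesseract.pytesseract` module, so the change survives a reinstall of the package. This is a sketch rather than the article's method; `resolve_tesseract_cmd` and `configure_pytesseract` are hypothetical helper names, and the candidate path is the install location used in this article.

```python
import os
import shutil


def resolve_tesseract_cmd(candidates, name="tesseract"):
    """Return the executable found on PATH, else the first existing candidate, else None."""
    on_path = shutil.which(name)
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract(cmd):
    """Point pytesseract at an explicit executable without touching site-packages."""
    import pytesseract  # imported lazily so the resolver above works without it
    pytesseract.pytesseract.tesseract_cmd = cmd


if __name__ == "__main__":
    cmd = resolve_tesseract_cmd([r"D:\Program Files (x86)\Tesseract-OCR\tesseract.exe"])
    if cmd is None:
        print("tesseract executable not found")
    else:
        try:
            configure_pytesseract(cmd)
            print("pytesseract will use", cmd)
        except ImportError:
            print("pytesseract is not installed; run: pip install pytesseract")
```

Call `configure_pytesseract()` once, before the first `image_to_string()` call.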

  • tesseract 4.0

    2017-08-01 21:20:52
    tesseract ocr
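  • tesseract installer package

    2020-04-07 13:45:06
    Tesseract: an open-source OCR engine. It was initially developed at HP Labs, later contributed to the open-source community, and subsequently improved, bug-fixed, optimized and re-released by Google.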
  • tesseract3.01

    2016-10-17 09:19:42
    tesseract3.01
  • tesseract software package

    2018-10-10 15:45:07
    A toolkit for developing with tesseract, containing the tesseract installer, font training tools, and some sample CAPTCHAs
  • Tesseract OCR About This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on ...
  • tesseract4.0

    2018-12-30 23:21:55
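    tesseract 4.0 source files. I recently started learning about recognition engines; my project needs an optical recognition module, so on the recommendation of my teacher and friends I began working with the tesseract optical recognition library,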
  • at Tesseract.TesseractEngine.Initialise(String datapath, String language, EngineMode engineMode, IEnumerable`1 configFiles, IDictionary`2 initialValues, Boolean setOnlyNonDebugVariables)\r\...
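  • tesseract source code

    2018-07-13 14:25:41
    tesseract source code; can be compiled and used on Windows and Linux. Language packs are also included
  • Tesseract installer package

    2018-07-17 09:43:08
    Tesseract installer; can be used to recognize text in images, in English and many other languages. Mostly used for CAPTCHA recognition.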
  • tesseract4

    2018-03-13 16:27:15
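    Source code of Google's open-source OCR project "tesseract". The version is 4.0.0-beta.1
  • tesseract environment

    2018-02-10 09:53:54
    tesseract configured for VS2015; extract the files and configure the environment to use it. For the configuration steps, see my blog
  • tesseract English library

    2020-05-13 15:59:32
    tesseract English library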
  • tesseract-ocr/tesseract

    2020-09-28 20:31:31

    About
    This package contains an OCR engine - libtesseract and a command line program - tesseract. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). It also needs traineddata files which support the legacy engine, for example those from the tessdata repository.

    The lead developer is Ray Smith. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub’s log of contributors.

    Tesseract has unicode (UTF-8) support, and can recognize more than 100 languages “out of the box”.

    Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.

    You should note that in many cases, in order to get better OCR results, you’ll need to improve the quality of the image you are giving Tesseract.

    This project does not include a GUI application. If you need one, please see the 3rdParty documentation.

    Tesseract can be trained to recognize other languages. See Tesseract Training for more information.

    Brief history
    Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. In 2005 Tesseract was open sourced by HP. Since 2006 it has been developed by Google.

    The latest (LSTM based) stable version is 4.1.1, released on December 26, 2019. Latest source code is available from master branch on GitHub. Open issues can be found in issue tracker, and planning documentation.

    The latest 3.0x version is 3.05.02, released on June 19, 2018. Latest source code for 3.05 is available from 3.05 branch on GitHub. There is no development for this version, but it can be used for special cases (e.g. see Regression of features from 3.0x).

    See Release Notes and Change Log for more details of the releases.

    Installing Tesseract
    You can either Install Tesseract via pre-built binary package or build it from source.

    Supported Compilers are:

    GCC 4.8 and above
    Clang 3.4 and above
    MSVC 2015, 2017, 2019
    Other compilers might work, but are not officially supported.

    Running Tesseract
    Basic command line usage:

    tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]

    For more information about the various command line options use tesseract --help or man tesseract.

    Examples can be found in the documentation.
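The usage line above maps directly onto an argument list. Below is a small Python sketch, assuming only the standard library; `build_tesseract_command` and `run_tesseract` are hypothetical helper names. It assembles the optional `-l`, `--oem` and `--psm` flags in the order shown and runs the command line program.

```python
import subprocess


def build_tesseract_command(image, outputbase, lang=None, oem=None, psm=None, configfiles=()):
    """Build the argv list for: tesseract imagename outputbase [options] [configfiles]."""
    cmd = ["tesseract", image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    cmd += list(configfiles)
    return cmd


def run_tesseract(image, outputbase, **options):
    """Run tesseract; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_tesseract_command(image, outputbase, **options), check=True)


if __name__ == "__main__":
    # Legacy engine mode, as mentioned in the About section (--oem 0).
    print(build_tesseract_command("image.png", "result", lang="eng", oem=0))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.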

    For developers
    Developers can use libtesseract C or C++ API to build their own application. If you need bindings to libtesseract for other programming languages, please see the wrapper section in the AddOns documentation.

    Documentation of Tesseract generated from source code by doxygen can be found on tesseract-ocr.github.io.

    Support
    Before you submit an issue, please review the guidelines for this repository.

    For support, first read the documentation, particularly the FAQ to see if your problem is addressed there. If not, search the Tesseract user forum, the Tesseract developer forum and past issues, and if you still can’t find what you need, ask for support in the mailing-lists.

    Mailing-lists:

    tesseract-ocr - For tesseract users.
    tesseract-dev - For tesseract developers.
    Please report an issue only for a bug, not for asking questions.

    License
    The code in this repository is licensed under the Apache License, Version 2.0 (the “License”);
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an “AS IS” BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
    NOTE: This software depends on other packages that may be licensed under different open source licenses.

    Tesseract uses Leptonica library which essentially uses a BSD 2-clause license.

    Dependencies
    Tesseract uses Leptonica library for opening input images (e.g. not documents like pdf). It is suggested to use leptonica with built-in support for zlib, png and tiff (for multipage tiff).

    Latest Version of README
    For the latest online version of the README.md see:

    https://github.com/tesseract-ocr/tesseract/blob/master/README.md

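  • The usual approach: ask the developers to set up a universal CAPTCHA or to disable the check. If you want to challenge your coding skills, you have to write code yourself and use the third-party tool tesseract to parse the image. 1. For installing and using tesseract, see the linked article. After installation, ...

    Background: in web testing you often have to deal with image CAPTCHAs. The usual approach is to ask the developers to set up a universal CAPTCHA or to disable the check. If you want to challenge your coding skills, you have to write code yourself and use the third-party tool tesseract to parse the image.

    1. For installing and using tesseract, see the linked article. After installation, PATH is updated automatically; if the following information appears in cmd, the installation succeeded:

    Test: download a CAPTCHA image, save it locally, and run tesseract on it. A result.txt file then appears in the current directory:

    2. As in the screenshot below, the registration window has an image CAPTCHA. How can JMeter obtain and handle it (tesseract + JMeter (BeanShell script)):

    3. First, find out how the image CAPTCHA is generated (in code, returned by an API) and capture the traffic to get the interface data:

    4. The structure of the JMeter test script is as follows:

    5. The steps above save the image returned by the API and run tesseract through a cmd command to parse the image into a txt file. A JMeter BeanShell sampler then reads the txt content and stores it through the built-in vars object as code, so that the request can reference ${code};

    6. Analysis of the first BeanShell post-processor script (saves the image returned by the API):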

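    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    // BeanShell is loosely typed and does not require strict Java structure:
    // no class is needed, only variables and methods, and methods defined
    // in the same script can be called directly.
    //public class TestImage {
    // A main method is not needed here.
    /*  public static void main(byte[] args) throws IOException {
            Test("test method");  // the parameter type passed to Test() does not matter
        }
    */
        public static void Test(byte[] cgs) throws IOException {
            File f = new File("C:\\Users\\Administrator\\atest.jpg");
            OutputStream out = new FileOutputStream(f); // creates the file if it does not exist
            try {
                out.write(cgs); // write the raw response bytes to the file
            } finally {
                out.close();    // always close the output stream
            }
        }
    //}
    // Get the response body of the previous sampler as a byte[].
    byte[] json = prev.getResponseData();
    Test(json);
    // The call above writes the binary response out as an image.
    // If the method were wrapped in a class, it would be called as:
    //TestImage.Test(json);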

    7. Analysis of the second BeanShell sampler script (runs tesseract through a CMD command):

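    Runtime r = Runtime.getRuntime();
    // Absolute paths (the folder used by the first script) avoid depending on JMeter's working directory.
    String command = "cmd.exe /c tesseract C:\\Users\\Administrator\\atest.jpg C:\\Users\\Administrator\\result -l eng";
    Process p = r.exec(command);
    p.waitFor(); // wait for tesseract to finish before result.txt is read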

    8. Analysis of the third BeanShell sampler script (reads the content of the txt file through file and stream I/O):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    // Read the first line of the OCR result written by tesseract.
    String filePath = "C:\\Users\\Administrator\\result.txt";
    File file = new File(filePath);
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
    String lineTxt = null;
    try {
        lineTxt = bufferedReader.readLine();
    } finally {
        bufferedReader.close(); // also closes the underlying streams
    }
    // Store the recognized text in the JMeter variable "code" (referenced as ${code}).
    // Fall back to an empty string if the file is empty; trim stray whitespace.
    vars.put("code", lineTxt == null ? "" : lineTxt.trim());

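    9. Conclusion: this handles image CAPTCHAs in web functional testing. If tesseract fails to parse the image and only the image is saved, consider importing the Scanner class and typing the CAPTCHA manually at the console to supply it to the next request.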
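To try the same flow outside JMeter, the three BeanShell steps (save the response bytes, run tesseract, read the first line of result.txt) can be sketched in Python. This is an illustration only: `recognize_captcha` and its `run` parameter are hypothetical names, and the file names mirror those in the scripts above.

```python
import os
import subprocess
import tempfile


def recognize_captcha(image_bytes, workdir, run=subprocess.run):
    """Save the image, run tesseract on it, and return the first recognized line."""
    image_path = os.path.join(workdir, "atest.jpg")
    output_base = os.path.join(workdir, "result")

    # Step 1: write the raw response bytes to an image file.
    with open(image_path, "wb") as f:
        f.write(image_bytes)

    # Step 2: run tesseract and wait for it; it writes <output_base>.txt.
    run(["tesseract", image_path, output_base, "-l", "eng"], check=True)

    # Step 3: read the first line of the result file.
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.readline().strip()


if __name__ == "__main__":
    # Stand-in runner so the sketch can be exercised without tesseract installed.
    def fake_run(cmd, check=True):
        with open(cmd[2] + ".txt", "w", encoding="utf-8") as f:
            f.write("AB12\n")

    with tempfile.TemporaryDirectory() as workdir:
        print(recognize_captcha(b"not a real image", workdir, run=fake_run))
```

With tesseract installed, omit the `run` argument and pass the bytes of the CAPTCHA response; the stand-in runner exists only so the flow can be exercised without the binary.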

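  • Tesseract OCR demo. Converts .pdf files to images. Uses tesseract to convert the converted images to x format. To do: write the results to a database. Draw the words and groups recognized by tesseract onto the image. Usage: Ubuntu 20.04 bash 0_install_and_setup_tools...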
