Command-line OCR on Windows 7

3

What are some command-line OCR utilities that will work on 64-bit Windows 7?

    
by Phenom 07.06.2010 / 00:17

3 answers

4

I think Tesseract is the best free command-line OCR software. Unfortunately, there doesn't seem to be a binary available for 64-bit Windows 7, so you would have to compile it yourself. Here are the instructions for doing so (taken from a comment on the Tesseract FAQ page):

  1. Download tesseract 2.04. Unpack it. In this example I've unpacked to C:\projects\tesseract-2.04. Windows 7 still doesn't understand .tar.gz out of the box. My recommendation is to get a copy of 7-Zip.

  2. Download your required language files. I need German and English. I unpack these to the tessdata subdirectory, C:\projects\tesseract-2.04\tessdata.

  3. Install libtiff. On my (64 bit) system the suggested install directory is C:\Program Files (x86)\GnuWin32. Underneath this directory are a bunch of subdirectories containing files we'll need to compile tesseract with tiff support, namely include, bin and lib.

  4. Add C:\Program Files (x86)\GnuWin32\bin to your PATH environment variable so that the output tesseract.exe can find the libtiff dll. Restart.

  5. Open the VC solution (tesseract.sln).

  6. Change the solution configuration to "Release" mode. Note that if you later change back to Debug mode, you'll need to set up all the following again...

  7. In the solution explorer, right click the solution node (Solution 'tesseract') and click "Properties". Change to "Configuration Properties" and select the "Release" configuration from the dropdown at the top of the window. Then navigate to Tools -> Options -> Projects and Solutions -> VC++ Directories. Here we'll add the full paths of the lib and include subdirectories from the libtiff install so that VC can find the required header (.h) and static library (.lib) files. In this example they are $(ProgramFiles)\GnuWin32\include and $(ProgramFiles)\GnuWin32\lib, as I'm using an environment variable; I could just as well have written them out in full, e.g. C:\Program Files (x86)\GnuWin32\include. Change the "Show Directories For" dropdown to "Include files" and add $(ProgramFiles)\GnuWin32\include. Now change the same dropdown to "Library files" and add $(ProgramFiles)\GnuWin32\lib.

  8. Now open the project properties window for the tesseract project (not the solution). In the solution explorer right click the tesseract project and click properties. Navigate the horrendous list of options to Configuration Properties -> C/C++ -> Preprocessor and add HAVE_LIBTIFF to the list of Preprocessor Definitions. This causes a bunch of #includes to be enabled in the code.

  9. You also want to add an "Additional dependency". Go to the "Additional Dependencies" section of the project properties (under Linker -> Input) and add libtiff.lib.

  10. Build the solution. Watch the error list. If you get a bunch of LNK2019 (unresolved external symbol) errors, the linker can't find something tesseract references, which means you're missing a reference to one of the paths from libtiff. If you get an error mentioning mt.exe, you've possibly encountered a bug in the SDK. Just try building again. See http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=106634 for more info.
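
Before building, it can help to confirm that the libtiff pieces from steps 3 and 4 are actually in place, since a missing directory or PATH entry only shows up later as a linker or DLL error. The following is a minimal Python sketch of such a check, not part of the original instructions; the helper names missing_libtiff_dirs and bin_on_path are hypothetical, and the install root is assumed to be the one from step 3.

```python
import ntpath
import os

# Subdirectories of the libtiff install that the build steps rely on.
REQUIRED_SUBDIRS = ("include", "lib", "bin")


def missing_libtiff_dirs(root):
    """Return the required subdirectories that do not exist under root."""
    return [name for name in REQUIRED_SUBDIRS
            if not os.path.isdir(os.path.join(root, name))]


def bin_on_path(root, path_value):
    """Check whether root\\bin appears in a Windows-style PATH string."""
    wanted = ntpath.normcase(ntpath.normpath(ntpath.join(root, "bin")))
    # PATH entries are separated by ';' and may be quoted.
    entries = [e.strip().strip('"') for e in path_value.split(";") if e.strip()]
    return any(ntpath.normcase(ntpath.normpath(e)) == wanted for e in entries)


if __name__ == "__main__":
    # Assumed install location, taken from step 3; adjust for your machine.
    root = r"C:\Program Files (x86)\GnuWin32"
    print("missing:", missing_libtiff_dirs(root))
    print("bin on PATH:", bin_on_path(root, os.environ.get("PATH", "")))
```

An empty "missing" list and "bin on PATH: True" suggest that steps 3 and 4 took effect. The check says nothing about whether the Visual Studio directories from step 7 are set correctly.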

If/when the solution builds successfully, you'll have a tesseract.exe file in the same directory as the tesseract solution file. Drag your multipage compressed TIFF there and try running tesseract. For example, if your TIFF is called in.tif, you want to output text to out.txt, and the document's language is German, then your command line would look like:

    tesseract.exe in.tif out -l deu

The output file will have .txt appended to it by tesseract. If you're just recognizing English text then you can leave off the -l option, as tesseract assumes "eng" if you don't specify anything. If your TIFF file has the file extension .tiff, then tesseract will crap itself thusly:

    C:\projects\tesseract-2.04>tesseract.exe in.tiff out -l deu
    Tesseract Open Source OCR Engine
    name_to_image_type:Error:Unrecognized image type:in.tiff
    IMAGE::read_header:Error:Can't read this image type:in.tiff
    tesseract.exe:Error:Read of file failed:in.tiff

Hopefully (fingers crossed, heh) you've now got an OCR'd out.txt file sitting in C:\projects\tesseract-2.04.
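
If you run this often, the command line and the .tiff pitfall can be wrapped in a small script. This is a rough Python sketch, not something from the original comment: the function names are hypothetical, and it assumes that copying in.tiff to in.tif is enough to get past the "Unrecognized image type" error, which the text above implies but does not state outright.

```python
import ntpath
import shutil
import subprocess


def tif_name(image_path):
    """Return image_path with a .tiff extension swapped for .tif."""
    base, ext = ntpath.splitext(image_path)
    return base + ".tif" if ext.lower() == ".tiff" else image_path


def build_command(image_path, output_base, lang=None, exe="tesseract.exe"):
    """Build the argument list: tesseract.exe <image> <output_base> [-l <lang>]."""
    cmd = [exe, image_path, output_base]
    if lang:  # without -l, tesseract assumes "eng"
        cmd += ["-l", lang]
    return cmd


def run_ocr(image_path, output_base, lang=None):
    """Copy in.tiff to in.tif if needed, then run tesseract on the copy."""
    usable = tif_name(image_path)
    if usable != image_path:
        shutil.copyfile(image_path, usable)
    subprocess.run(build_command(usable, output_base, lang), check=True)
    return output_base + ".txt"  # tesseract appends .txt itself
```

For example, run_ocr("in.tiff", "out", "deu") would copy the image to in.tif, run tesseract.exe in.tif out -l deu, and return "out.txt". It expects tesseract.exe to be in the current directory or on your PATH.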

    
by 07.06.2010 / 02:15
1

JOCR is the only one I know of that works on Windows and is command-line based. See the web page here

    
by 07.06.2010 / 00:27
0

There is a Windows 7 installer for tesseract. I just installed it and managed to run OCR on a small image. The result was terrible, but I hope that with some tuning I can improve the results.

    
by 19.01.2011 / 21:05