Is there any option to extract Unicode script?
When I try it on Unicode text such as Hindi, Marathi, or other Devanagari script, it produces the wrong output.
It seems that only Hindi is supported out of the box.
You need to use the -l lang option. Pass the output name without an extension, because Tesseract appends .txt to it:
tesseract 1.png output -l hin
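For scripted use, the call can be wrapped in a small shell function. This is only a sketch: `ocr_image` is a hypothetical helper name (not part of Tesseract), and it assumes `tesseract` is on the PATH and the requested language pack is installed.

```shell
# Hypothetical helper: OCR one image with a given language pack.
# Usage: ocr_image IMAGE OUTPUT_BASE LANG
# Prints the path of the text file on success.
ocr_image() {
    image=$1
    outbase=$2
    lang=$3

    # Fail early with a clear message instead of a Tesseract error.
    if [ ! -f "$image" ]; then
        echo "input image not found: $image" >&2
        return 1
    fi

    # Tesseract appends ".txt" to the output base name itself.
    tesseract "$image" "$outbase" -l "$lang" || return 1
    echo "$outbase.txt"
}
```

Since the manual allows several languages joined with plus characters, a mixed Hindi and English page could be tried with `ocr_image 1.png output hin+eng`.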
You can train Tesseract to recognize other languages, such as Marathi or other languages written in Devanagari.
See "How to use the tools provided to train Tesseract 3.0x for a new language".
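Before investing time in training, it may be worth checking whether a pack for the language is already installed. The sketch below assumes a Tesseract build that supports the `--list-langs` option (older 3.0x builds may not have it); `has_lang` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if LANG appears in Tesseract's language list.
# Assumes a build that supports --list-langs; the list may be written to
# stderr or stdout depending on the version, so both are captured.
has_lang() {
    tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}

# Example: mar is the ISO 639-2 code for Marathi.
#   has_lang mar || echo "no Marathi pack installed, training is needed"
```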
TESSERACT(1) Manual Page
OPTIONS
...
-l lang
The language to use. If none is specified, English is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes. (See LANGUAGES)
...
LANGUAGES
There are currently language packs available for the following languages:
ara (Arabic), aze (Azerbaijani), bul (Bulgarian), cat (Catalan), ces (Czech), chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), chr (Cherokee), dan (Danish), dan-frak (Danish (Fraktur)), deu (German), ell (Greek), eng (English), enm (Old English), epo (Esperanto), est (Estonian), fin (Finnish), fra (French), frm (Old French), glg (Galician), heb (Hebrew), hin (Hindi), hrv (Croatian), hun (Hungarian), ind (Indonesian), ita (Italian), jpn (Japanese), kor (Korean), lav (Latvian), lit (Lithuanian), nld (Dutch), nor (Norwegian), pol (Polish), por (Portuguese), ron (Romanian), rus (Russian), slk (Slovakian), slv (Slovenian), sqi (Albanian), spa (Spanish), srp (Serbian), swe (Swedish), tam (Tamil), tel (Telugu), tgl (Tagalog), tha (Thai), tur (Turkish), ukr (Ukrainian), vie (Vietnamese)
To use a non-standard language pack named foo.traineddata, set the TESSDATA_PREFIX environment variable so the file can be found at TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the argument -l foo.
Source: TESSERACT(1) manual page
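If you train your own pack, the TESSDATA_PREFIX rule quoted from the manual can be checked from the shell before running OCR. This is a rough sketch assuming the 3.0x directory layout described there (newer releases may interpret the variable differently); `traineddata_path` and `have_traineddata` are hypothetical helper names, and the paths in the example are placeholders.

```shell
# Hypothetical helper: print where LANG.traineddata is expected, following
# the TESSDATA_PREFIX/tessdata/ layout quoted from the manual page.
traineddata_path() {
    # Strip one trailing slash so the result has no "//" in it.
    printf '%s/tessdata/%s.traineddata\n' "${TESSDATA_PREFIX%/}" "$1"
}

# Succeed only if the language pack file is actually present.
have_traineddata() {
    [ -f "$(traineddata_path "$1")" ]
}

# Example (placeholder prefix and pack name):
#   export TESSDATA_PREFIX=/opt/ocr
#   have_traineddata foo && tesseract 1.png output -l foo
```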