This link may help you.
Excerpt:
Go to your directory of appropriately named book page scan images (tiff or png). When you type ls, you should see the pages listed in order! Then, try:
ocropus book2pages out image*
This grooms the pages for OCR. Next, let’s make the page objects, and eventually the book:
ocropus pages2lines out
ocropus lines2fsts out/
ocropus fsts2text out/
ocropus buildhtml out/ > book.html
That should create a nice book HTML file for you, in the hOCR format.
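The steps in the excerpt can be chained so that a failure in one step stops the rest. This is only a minimal sketch: the function name ocr_book and its arguments are hypothetical (they are not part of ocropus), and the ocropus subcommands are taken exactly as quoted in the excerpt, so they may differ in other ocropus versions.

```shell
# Hypothetical wrapper (ocr_book is NOT an ocropus command): runs the steps
# from the excerpt in order and stops at the first one that fails.
ocr_book() {
    out_dir=$1      # working directory used by every step, e.g. "out"
    html_file=$2    # file that receives the hOCR output, e.g. "book.html"
    shift 2         # remaining arguments: the page images, in page order

    ocropus book2pages "$out_dir" "$@" || return 1   # groom the pages for OCR
    ocropus pages2lines "$out_dir" || return 1       # split pages into lines
    ocropus lines2fsts "$out_dir/" || return 1       # recognize each line
    ocropus fsts2text "$out_dir/" || return 1        # turn the results into text
    ocropus buildhtml "$out_dir/" > "$html_file" || return 1
}
```

Assuming the function is defined in your shell, calling ocr_book out book.html image* from the directory of scans would mirror the commands in the excerpt.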
In addition, there are many tutorials on how to use ocropus.