Remove flaws in text after converting a PDF with OCR

1

I converted a PDF file using an OCR PDF reader. Originally the text was an image inside the PDF, and Foxit PDF converted it to text using OCR. The problem is that after the conversion the text is not aligned properly: the words and line breaks all seem to have shifted. Sample text:

  biochemistry can be divided in three fields; molecular genetics, protein science and metabolism. Over the last decades 
of the 20th century, biochem
istry has through these three disciplines becom
e successful at explaining living processes. Almost all areas o
f the life sciences are being uncovered and developed by biochemical methodology and research.[2] Biochemistry focuses on unde
rstanding how biolog
ical molecules give 
rise to the processes that occur within living cells and
 between cells,[3] which
 in turn relates greatly to the study and understanding of 
, organs, and organism structure and function[4]

Biochemistry is closely related to mol
ecular biology, the study of the molecular mechanisms by which geneti
c information encoded in DNA is able to result in the processes of life.[5]

Much of biochemistry deals with the structu
res, 
 an
d interactions of biological macromolecules, such as proteins, nucleic acids, carbohydrates and lipids, which provide the structure of cells and perform many of the functions associated with life.[6] The chemistry of the cell also depends on the 
 of smaller molecules and ions. Th
ese can be inorganic, for example water and metal ions, or organic, for example the amino acids, which are used to synthesi
ze proteins.[7]
 The mechanisms by which cells harness energy from their environment via chemical reactions are known as metabolism. The findings of biochemistry are applied primarily in medicine, nutrition, and agriculture. In medicine, b
iochemists investigate the causes and cures of diseases.[8] In nutrition, they study how to maintain health wellness and study the effects of nutritional deficiencies.[9] In agriculture, biochemists investigate soil and fertilizers, and try to discover ways to improve crop cultivation, crop storage and pest control.

Another problem is that some words are cut in half. Is there anything I can do to fix the text so that it is readable?

    
by Jeff Schaller 05.10.2018 / 19:55

2 answers

1

There is probably room for improvement, but here is a start:

perl -0777 -ne 's/([^ ])\n/$1/g; s/\n/ /g; print' < input | fmt

It uses perl to join the lines: a newline that follows a non-blank character is simply removed (gluing a split word back together), any remaining newline, such as one after a trailing blank, is turned into a space, and the output is piped through fmt to re-wrap the long lines.
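
For comparison, a similar joining rule can be sketched in Python. This is only an illustration and not part of the answer above: the function name rejoin_ocr_lines, the width default, and the choice to treat blank lines as real paragraph breaks are all assumptions made for this sketch.

```python
import re
import textwrap


def rejoin_ocr_lines(text: str, width: int = 75) -> str:
    """Rejoin lines that OCR split at arbitrary points (hypothetical helper)."""
    # Normalize Windows/old-Mac line endings so only "\n" remains.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    # Assumption: a blank (or whitespace-only) line is a real paragraph break.
    for block in re.split(r"\n\s*\n", text):
        # Delete the newlines. A blank kept at the end or start of a
        # fragment is what separates whole words, so none is inserted here.
        joined = block.replace("\n", "")
        # Collapse runs of blanks left over from the join.
        joined = re.sub(r"[ \t]+", " ", joined).strip()
        if joined:
            # Re-wrap the long line, the role fmt plays in the shell pipeline.
            paragraphs.append(textwrap.fill(joined, width=width))
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    import sys

    # Usage (assumed file name): python rejoin.py < input
    print(rejoin_ocr_lines(sys.stdin.read()))
```

Like the perl command, this relies on the OCR output keeping a blank wherever a line break fell between two words. If that blank is missing, the two words end up glued together and have to be fixed by hand.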

    
by 05.10.2018 / 20:09
1

You can use an awk one-liner to remove the extra line breaks, something like this:

awk '{gsub(/\n/,""); gsub(/\r/,""); print}' RS='' file

biochemistry can be divided in three fields; molecular genetics, protein science and metabolism. Over the last decades of the 20th century, biochemistry has through these three disciplines become successful at explaining living processes. Almost all areas of the life sciences are being uncovered and developed by biochemical methodology and research.[2] Biochemistry focuses on understanding how biological molecules give rise to the processes that occur within living cells and between cells,[3] which in turn relates greatly to the study and understanding of , organs, and organism structure and function[4]
Biochemistry is closely related to molecular biology, the study of the molecular mechanisms by which genetic information encoded in DNA is able to result in the processes of life.[5]
Much of biochemistry deals with the structures,  and interactions of biological macromolecules, such as proteins, nucleic acids, carbohydrates and lipids, which provide the structure of cells and perform many of the functions associated with life.[6] The chemistry of the cell also depends on the  of smaller molecules and ions. These can be inorganic, for example water and metal ions, or organic, for example the amino acids, which are used to synthesize proteins.[7] The mechanisms by which cells harness energy from their environment via chemical reactions are known as metabolism. The findings of biochemistry are applied primarily in medicine, nutrition, and agriculture. In medicine, biochemists investigate the causes and cures of diseases.[8] In nutrition, they study how to maintain health wellness and study the effects of nutritional deficiencies.[9] In agriculture, biochemists investigate soil and fertilizers, and try to discover ways to improve crop cultivation, crop storage and pest control.
The gsub function has the following form:

gsub(regexp, replacement [, target])

This is similar to the sub function, except that gsub replaces all of the longest, leftmost, non-overlapping matching substrings it can find. The 'g' in gsub stands for "global", which means replace everywhere.

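gsub(/\n/,"") replaces every newline in the record with nothing, that is, it deletes them. Because RS='' makes awk read one blank-line-separated paragraph per record, these are the line breaks inside each paragraph.

gsub(/\r/,"") does the same for every carriage return (ASCII code 13) in the record.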
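
As a rough analogy only, the difference between sub and gsub can be shown with Python's re.sub, where the count argument decides whether one match or every match is replaced. The sample string is made up for this sketch. One difference to keep in mind: awk's gsub changes the target in place and returns the number of substitutions, while re.sub returns a new string.

```python
import re

# Made-up record with Windows-style line breaks in the middle of words.
record = "biochem\r\nistry has becom\r\ne successful"

# Like awk's sub(): replace only the first (leftmost) match.
first_only = re.sub(r"\n", "", record, count=1)

# Like awk's gsub(): replace every non-overlapping match.
no_newlines = re.sub(r"\n", "", record)
cleaned = re.sub(r"\r", "", no_newlines)

print(repr(first_only))
print(repr(cleaned))
```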
    
by 05.10.2018 / 20:00