Extracting information between tags [duplicate]

Question



I have a text file; the following is a sample of its contents:

1234 A novel homeodomain-encoding gene is associated with a large CpG island interrupted by the <category="Modifier">myotonic dystrophy</category> unstable (CTG)n repeat. <category="SpecificDisease">Myotonic dystrophy</category> ( <category="SpecificDisease">DM</category> ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . Characterisation of the expression of this gene in patient tissues has thus far generated conflicting data on alterations in the steady state levels of DMPK mRNA , and on the final DMPK protein levels in the presence of the expansion . The <category="Modifier">DM</category> region of chromosome 19 is gene rich , and it is possible that the repeat expansion may lead to dysfunction of a number of transcription units in the vicinity , perhaps as a consequence of chromatin disruption . We have searched for genes associated with a CpG island at the 3 end of DMPK . Sequencing of this region shows that the island extends over 3 . 5 kb and is interrupted by the ( CTG ) n repeat . Comparison of genomic sequences downstream ( centromeric ) of the repeat in human and mouse identified regions of significant homology . These correspond to exons of a gene predicted to encode a homeodomain protein . RT-PCR analysis shows that this gene , which we have called <category="Modifier">DM</category> locus-associated homeodomain protein ( DMAHP ) , is expressed in a number of human tissues , including skeletal muscle , heart and brain .

I need to extract what is between the tags. For example, given

<category="SpecificDisease">Myotonic dystrophy</category>

I need to extract "Myotonic dystrophy" and write it to a new text file.

text-processing

by nlp, 07.11.2013 / 04:46

2 answers


score 1 · Answer 1

You can do this by using grep to find the text between the tags and then sed to strip the tags:

$ grep -oP '<category.+?>.*?</category>' file.txt | sed 's/<.*>\(.*\)<.*>/\1/'
myotonic dystrophy
Myotonic dystrophy
DM
DM
DM

Explanation

grep -oP: -P enables PCRE (Perl-compatible regular expressions) in grep, and -o prints only the matched part of each line, one match per output line.
'<category.+?>.*?</category>': tells grep to match everything from an opening category tag through its closing tag. The lazy quantifiers (+? and *?) stop each match at the first closing tag instead of running on to the last one on the line.
sed 's/<.*>\(.*\)<.*>/\1/': the output of grep is piped to sed, which strips the tags by replacing each matched line with the text between them (\1, available because the escaped parentheses capture that text).
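The pipeline described above can be tried end to end, with the result redirected into a new file as the question asks. This is only a minimal sketch: it assumes GNU grep built with PCRE support (-P), sample.txt and extracted.txt are hypothetical file names, and the sed replacement is written as \1 so that the captured text is kept.

```shell
# Hypothetical sample input, shortened from the text in the question.
cat > sample.txt <<'EOF'
interrupted by the <category="Modifier">myotonic dystrophy</category> unstable repeat. <category="SpecificDisease">Myotonic dystrophy</category> ( <category="SpecificDisease">DM</category> ) is associated with a repeat expansion
EOF

# 1. grep -o prints each tagged span on its own line.
# 2. sed replaces the whole line with the captured text between the tags.
# 3. The shell redirection writes the result to a new file.
grep -oP '<category.+?>.*?</category>' sample.txt \
  | sed 's/<.*>\(.*\)<.*>/\1/' > extracted.txt

cat extracted.txt
```

Each extracted term ends up on its own line of extracted.txt, in the order it appears in the input.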

score 0 · Answer 2

This can be done with PCRE. This is what I have tried so far ... but I have not fully understood it yet.

Here is an example of what I tried, and it worked:

grep -oP '(?:<category=[A-Za-z\"\s]*>)[A-Za-z\s]+(?:<\/category>)' input|\
awk -F">" '{split($2,a,"<"); print a[1]}'
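The same extraction can probably be done with grep alone, without the awk step. The sketch below is not the answerer's command but a variation on it: it assumes GNU grep with PCRE support, uses \K to discard the opening tag from the reported match and a lookahead so the closing tag is checked but not printed, and the file names tagged.txt and names.txt are hypothetical.

```shell
# Hypothetical sample input, shortened from the text in the question.
cat > tagged.txt <<'EOF'
The <category="Modifier">DM</category> region is gene rich . <category="SpecificDisease">Myotonic dystrophy</category> is associated with a repeat expansion
EOF

# \K drops everything matched so far (the opening tag) from the output;
# (?=</category>) requires the closing tag without including it.
grep -oP '<category="[^"]*">\K[^<]+(?=</category>)' tagged.txt > names.txt

cat names.txt
```

Unlike the character class [A-Za-z\s]+ in the command above, [^<]+ also accepts digits and punctuation inside the tagged text, which may or may not be what is wanted.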