Eu tenho um arquivo fastq. Eu vou explicar o que é. É algo parecido com isto
@SRR1024120.7 DBRHHJN1:259:D0PM7ACXX:1:1101:1386:1189 length=100
GATACAGGATGCCTGGGTCTAGGCTGTGTGACCTTGGGCCAGTTCCTCTC
+SRR1024120.7 DBRHHJN1:259:D0PM7ACXX:1:1101:1386:1189 length=100
DDDFFDDBGFEHEHGIGC9F>HG9EH8?DF4?:DF<?3:D?DHIGGDDFH
@SRR1024120.25 DBRHHJN1:259:D0PM7ACXX:1:1101:1752:1149 length=100
CTGCTGCTCATGCTCAT
+SRR1024120.25 DBRHHJN1:259:D0PM7ACXX:1:1101:1752:1149 length=100
BDDDDD<<CC:C+AFFE
@SRR1024120.42 DBRHHJN1:259:D0PM7ACXX:1:1101:2482:1096 length=100
AGCGTGTGCCACCCTACGCCGGC
+SRR1024120.42 DBRHHJN1:259:D0PM7ACXX:1:1101:2482:1096 length=100
DD>DAA@AA@@?2C8AB)?@:DD
@SRR1024120.1 DBRHHJN1:259:D0PM7ACXX:1:1101:1200:1120 length=100
AGACAGAAGGGGAGTACAGCTCTCTGGAACATGAGAGTGCAAGGGGTTGAGTGTTT
+SRR1024120.1 DBRHHJN1:259:D0PM7ACXX:1:1101:1200:1120 length=100
DDDFFFCFGEHI@CGFADFGCCFFGHFGCFFFHGGDGHIFHDFGGI<BF=DHIHHH
Agora, 4 linhas correspondem a 1 leitura, então
@SRR1024120.7 DBRHHJN1:259:D0PM7ACXX:1:1101:1386:1189 length=100
GATACAGGATGCCTGGGTCTAGGCTGTGTGACCTTGGGCCAGTTCCTCTC
+SRR1024120.7 DBRHHJN1:259:D0PM7ACXX:1:1101:1386:1189 length=100
DDDFFDDBGFEHEHGIGC9F>HG9EH8?DF4?:DF<?3:D?DHIGGDDFH
corresponde a 1 leitura que é GATACAGGATGCCTGGGTCTAGGCTGTGTGACCTTGGGCCAGTTCCTCTC
Mostrei o arquivo fastq acima. O que eu quero fazer é extrair apenas as leituras em que o comprimento do seq de leitura é < = 25, Então minha saída deve ser
@SRR1024120.25 DBRHHJN1:259:D0PM7ACXX:1:1101:1752:1149 length=100
CTGCTGCTCATGCTCAT
+SRR1024120.25 DBRHHJN1:259:D0PM7ACXX:1:1101:1752:1149 length=100
BDDDDD<<CC:C+AFFE
@SRR1024120.42 DBRHHJN1:259:D0PM7ACXX:1:1101:2482:1096 length=100
AGCGTGTGCCACCCTACGCCGGC
+SRR1024120.42 DBRHHJN1:259:D0PM7ACXX:1:1101:2482:1096 length=100
DD>DAA@AA@@?2C8AB)?@:DD
Eu quero usar o awk para esse propósito.
Eu tentei algo assim
awk 'NR % 2 == 0 {if(length($1) <= 25) print $0}; NR % 2 == 1' test.fastq
MAS isso imprime algo assim
@SRR1024120.7 DBRHHJN1:259:D0PM7ACXX:1:1101:1386:1189 length=100
+SRR1024120.7 DBRHHJN1:259:D0PM7ACXX:1:1101:1386:1189 length=100
@SRR1024120.25 DBRHHJN1:259:D0PM7ACXX:1:1101:1752:1149 length=100
CTGCTGCTCATGCTCAT
+SRR1024120.25 DBRHHJN1:259:D0PM7ACXX:1:1101:1752:1149 length=100
BDDDDD<<CC:C+AFFE
@SRR1024120.42 DBRHHJN1:259:D0PM7ACXX:1:1101:2482:1096 length=100
AGCGTGTGCCACCCTACGCCGGC
+SRR1024120.42 DBRHHJN1:259:D0PM7ACXX:1:1101:2482:1096 length=100
DD>DAA@AA@@?2C8AB)?@:DD
@SRR1024120.1 DBRHHJN1:259:D0PM7ACXX:1:1101:1200:1120 length=100
+SRR1024120.1 DBRHHJN1:259:D0PM7ACXX:1:1101:1200:1120 length=100
É evidente que não quero
@SRR1024120.7 DBRHHJN1:259:D0PM7ACXX:1:1101:1386:1189 length=100
+SRR1024120.7 DBRHHJN1:259:D0PM7ACXX:1:1101:1386:1189 length=100
@SRR1024120.1 DBRHHJN1:259:D0PM7ACXX:1:1101:1200:1120 length=100
+SRR1024120.1 DBRHHJN1:259:D0PM7ACXX:1:1101:1200:1120 length=100
na minha saída.
Qualquer ajuda seria apreciada
Obrigado