Como comparar 2 arquivos e extrair a seqüência particular usando o ID no primeiro arquivo e imprimir a saída

5

Arquivo 1 (contém apenas os IDs selecionados):

AAX50250
AAX50251
AAX50252
AAX50253
AAX50254
AAX50257
AAX50258

Arquivo 2 (contém sequências com os IDs). Este é um arquivo FASTA no qual as linhas que começam com > são o ID da sequência e as linhas subsequentes são a própria sequência.

> AAX50250.1 cds:annotated chromosome:ASM1212v1:Chromosome:1:1770:1 gene:CTA_0001.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGAGCATCAGGGGAGTAGGAGGCAACGGGAATAGTCGAATCCCTTCTCATAATGGGGAT
GGATCGAATCGCAGAAGTCAAAATACGAAGAATAAAGTTGAAGATCGAGTTCGTTCTCTA
TATTCATCTCGTAGTAACGAAAATAGAGAATCTCCTTATGCAGTAGTAGACGTCAGCTCT
> AAX50251.1 cds:annotated chromosome:ASM1212v1:Chromosome:1915:2187:-1 gene:CTA_0002.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical membrane associated protein
ATGCTTTGTAAAGTTTGTAGAGGATTATCTTCTCTTATTGTTGTTCTCGGAGCTATAAAC
ACTGGAATTTTAGGAGTAACAGGGTATAAGGTAAACCTACTTACTCACCTGCTTGGTGAA
GGAACCATGTGGACACAAGCAGCTTATGTAGTAACGGGAATCGCTGGGGTTATGGTCTGC
> AAX50252.1 cds:annotated chromosome:ASM1212v1:Chromosome:2388:2690:1 gene:CTA_0003.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:gatC description:glutamyl-tRNA(Gln) amidotransferase subunit C
ATGACAGAGTCATATGTAAACAAAGAAGAAATCATCTCTTTAGCAAAGAATGCTGCATTG
GAGTTGGAAGATGCCCACGTGGAAGAGTTCGTAACATCTATGAATGACGTCATTGCTTTA
ATGCAGGAAGTAATCGCGATAGATATTTCGGATATCATTCTTGAAGCTACAGTGCATCAT
TTCGTTGGTCCAGAGGATCTTAGAGAAGACATGGTGACTTCGGATTTTACTCAAGAAGAA
TAG
> AAX50262.1 cds:annotated chromosome:ASM1212v1:Chromosome:13318:13932:-1 gene:CTA_0013.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ybbP description:hypothetical membrane spanning protein
ATGTTCGTAGGTATAACGTATTACACCACACCTCTGTTGGAGATAGCTTTAATTTGGGTG
GTCCTTAATTATTTGCTAAAGTTTTTCTGGGGAACAGGCGCCATGGACCTCGTCTTTGGC
TTGTTGTCTTTTCTTTGCCTATTTGTTCTAGCAGAAAAACTTCATCTCCCCGTTATTCGC
AATTTGATGCTTCATGTAGTGAATATTGCGGCTATCGTGGTATTTATTATCTTCCAACCA
GAAATTCGCCTTGCTCTCTCTAGGATACGCTTGTGTAGAGAGAAATTTGTCATCAATATG

Agora, preciso comparar o Arquivo 1, que contém os IDs selecionados, com o Arquivo2, que possui um número n de seqüências. Precisamos gerar as seqüências associadas aos IDs no arquivo 1.

Por exemplo, o ID AAX50250 que está no arquivo 1 deve retornar:

>AAX50250.1 cds:annotated chromosome:ASM1212v1:Chromosome:1:1770:1 gene:CTA_0001.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGAGCATCAGGGGAGTAGGAGGCAACGGGAATAGTCGAATCCCTTCTCATAATGGGGAT
GGATCGAATCGCAGAAGTCAAAATACGAAGAATAAAGTTGAAGATCGAGTTCGTTCTCTA
TATTCATCTCGTAGTAACGAAAATAGAGAATCTCCTTATGCAGTAGTAGACGTCAGCTCT
ATGATCGAGAGCACCCCAACGAGTGGAGAGACGACAAGAGCTTCGCGTGGAGTATTCAGT
CGTTTCCAAAGAGGTTTAGGACGAGTAGCTGACAAAGTAAGACGAGCTGTTCAGCGTGCG
TGGAGTTCAGTCTCTATAAGAAGATCGTCTGCAACAAGAGCCGCAGAATCCAGATCAAGT

Como posso fazer isso?

    
por Nitha 03.10.2016 / 10:37

2 respostas

6

Você pode escrever um script para fazer isso diretamente, mas sugiro que use os dois pequenos scripts abaixo. Eu passei muitos anos trabalhando com esse tipo de dados e eles foram muito, muito úteis. Se você está planejando fazer muita manipulação de arquivos FASTA, sugiro que você os use.

  1. FastaToTbl :

    #!/usr/bin/awk -f
    {
            if (substr($1,1,1)==">")
            if (NR>1)
                        printf "\n%s\t", substr($0,2,length($0)-1)
            else 
                printf "%s\t", substr($0,2,length($0)-1)
            else 
                    printf "%s", $0
    }END{printf "\n"}
    
  2. TblToFasta :

    #! /usr/bin/awk -f
    {
      sequence=$NF
    
      ls = length(sequence)
      is = 1
      fld  = 1
      while (fld < NF)
      {
         if (fld == 1){printf ">"}
         printf "%s " , $fld
    
         if (fld == NF-1)
          {
            printf "\n"
          }
          fld = fld+1
      }
      while (is <= ls)
      {
        printf "%s\n", substr(sequence,is,60)
        is=is+60
      }
    }
    

Salve os scripts acima no diretório $HOME/bin (se ele não existir, crie-o, efetue logout e faça login novamente para adicioná-lo ao seu $PATH ). Em seguida, torne os scripts executáveis com chmod a+x ~/bin/TblToFasta ~/bin/FastaToTbl .

O FastaToTbl converterá os arquivos de sequência FASTA para o formato tbl (o ID da sequência, seguido por um caractere de tabulação e a sequência, tudo em uma linha). Quando eles estiverem no formato tbl e você tiver o ID e a sequência na mesma linha, use grep para pesquisar seus IDs. Então, você passa a saída através de TblToFasta para converter de volta para FASTA:

$ FastaToTbl file2 | grep -wFf file1 | TblToFasta
>AAX50250.1 cds:annotated chromosome:ASM1212v1:Chromosome:1:1770:1 gene:CTA_0001.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein 
ATGAGCATCAGGGGAGTAGGAGGCAACGGGAATAGTCGAATCCCTTCTCATAATGGGGAT
GGATCGAATCGCAGAAGTCAAAATACGAAGAATAAAGTTGAAGATCGAGTTCGTTCTCTA
TATTCATCTCGTAGTAACGAAAATAGAGAATCTCCTTATGCAGTAGTAGACGTCAGCTCT
>AAX50251.1 cds:annotated chromosome:ASM1212v1:Chromosome:1915:2187:-1 gene:CTA_0002.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical membrane associated protein 
ATGCTTTGTAAAGTTTGTAGAGGATTATCTTCTCTTATTGTTGTTCTCGGAGCTATAAAC
ACTGGAATTTTAGGAGTAACAGGGTATAAGGTAAACCTACTTACTCACCTGCTTGGTGAA
GGAACCATGTGGACACAAGCAGCTTATGTAGTAACGGGAATCGCTGGGGTTATGGTCTGC
>AAX50252.1 cds:annotated chromosome:ASM1212v1:Chromosome:2388:2690:1 gene:CTA_0003.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:gatC description:glutamyl-tRNA(Gln) amidotransferase subunit C 
ATGACAGAGTCATATGTAAACAAAGAAGAAATCATCTCTTTAGCAAAGAATGCTGCATTG
GAGTTGGAAGATGCCCACGTGGAAGAGTTCGTAACATCTATGAATGACGTCATTGCTTTA
ATGCAGGAAGTAATCGCGATAGATATTTCGGATATCATTCTTGAAGCTACAGTGCATCAT
TTCGTTGGTCCAGAGGATCTTAGAGAAGACATGGTGACTTCGGATTTTACTCAAGAAGAA
TAG

As opções grep usadas são:

   -F, --fixed-strings
          Interpret PATTERN as a list of fixed strings  (instead  of  regular
          expressions), separated by newlines, any of which is to be matched.

   -f FILE, --file=FILE
          Obtain  patterns  from  FILE, one per line.  If this option is used
          multiple times or is combined with the -e (--regexp) option, search
          for all patterns given.  The empty file contains zero patterns, and
          therefore matches nothing.
   -w, --word-regexp
          Select only those lines containing matches that form  whole  words.
          The  test  is  that  the  matching  substring must either be at the
          beginning of the  line,  or  preceded  by  a  non-word  constituent
          character.   Similarly, it must be either at the end of the line or
          followed by a  non-word  constituent  character.   Word-constituent
          characters are letters, digits, and the underscore.

Assim, o -f nos permite fornecer um arquivo para ler a pesquisa [padrões de, o -F diz para tratar os padrões como sequências de caracteres e não expressões regulares (caso contrário, seus IDs contêm . , que eles costumam fazer, que serão considerados como "correspondem a qualquer caractere") e o -w garante que correspondamos somente a ID inteira, de forma que foo não corresponda a foobar .

Alternativamente, aqui está um perl rápido e sujo para fazer o que você quer:

perl -e 'open(F,"File1.txt");while(<F>){/(\S+)/; $k{$1}++}; while(<>){if(/>\s*(\S+?)(\.| )/){if($k{$1}){$k=1}else{$k=0}; } print if $k==1;}' File2.fa

Ou, um pouco menos sujo:

perl -e 'open(F,"File1.txt");
         while(<F>){
            /(\S+)/; 
            $k{$1}++
         } 
         while(<>){
            if(/>\s*(\S+?)(\.| )/){
                if($k{$1}){$k=1}
                else{$k=0}
            } 
            print if $k==1
        }' File2.fa
    
por terdon 03.10.2016 / 11:01
3

Que tal algo assim?

$ awk -F'[ .]' 'NR==FNR{a[$0]; next}/^>/{p=$2 in a}p' file1 file2
> AAX50250.1 cds:annotated chromosome:ASM1212v1:Chromosome:1:1770:1 gene:CTA_0001.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGAGCATCAGGGGAGTAGGAGGCAACGGGAATAGTCGAATCCCTTCTCATAATGGGGAT
GGATCGAATCGCAGAAGTCAAAATACGAAGAATAAAGTTGAAGATCGAGTTCGTTCTCTA
TATTCATCTCGTAGTAACGAAAATAGAGAATCTCCTTATGCAGTAGTAGACGTCAGCTCT
> AAX50251.1 cds:annotated chromosome:ASM1212v1:Chromosome:1915:2187:-1 gene:CTA_0002.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical membrane associated protein
ATGCTTTGTAAAGTTTGTAGAGGATTATCTTCTCTTATTGTTGTTCTCGGAGCTATAAAC
ACTGGAATTTTAGGAGTAACAGGGTATAAGGTAAACCTACTTACTCACCTGCTTGGTGAA
GGAACCATGTGGACACAAGCAGCTTATGTAGTAACGGGAATCGCTGGGGTTATGGTCTGC
> AAX50252.1 cds:annotated chromosome:ASM1212v1:Chromosome:2388:2690:1 gene:CTA_0003.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:gatC description:glutamyl-tRNA(Gln) amidotransferase subunit C
ATGACAGAGTCATATGTAAACAAAGAAGAAATCATCTCTTTAGCAAAGAATGCTGCATTG
GAGTTGGAAGATGCCCACGTGGAAGAGTTCGTAACATCTATGAATGACGTCATTGCTTTA
ATGCAGGAAGTAATCGCGATAGATATTTCGGATATCATTCTTGAAGCTACAGTGCATCAT
TTCGTTGGTCCAGAGGATCTTAGAGAAGACATGGTGACTTCGGATTTTACTCAAGAAGAA
TAG

O script acima pode ser expandido para:

awk -F'[ .]' '       # set the field separator to space or dot
        NR == FNR {  # for the lines of the first file
            a[$0]    # Store the line in an array
            next     # And skip to the next record
        }
        /^>/{        # For lines of the second file starting with '>'
            p = $2 in a # Set flag to true if index is present in array
        }
        p            # print line if flag is set
      ' file1 file2
    
por user000001 03.10.2016 / 11:02

Tags