Mesclar campos em um arquivo

4

Eu tenho um arquivo com 7 colunas, um arquivo GFF com regiões cromossômicas.Eu quero recolher as linhas onde REGION="exon" para apenas uma linha no arquivo.A linha tem que ser recolhida com base em regiões sendo sobrepondo uns aos outros.

REGION  START   END  SCORE STRAND FRAME     ATTRIBUTE
 exon   26453   26644   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 exon   26842   27020   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 exon   30355   30899   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 GS_TRAN    30355   34083   .   -   .   GS_TRAN "Hs22_30444_28_1_1"; Name "Hs22_30444_28_1_1"
 snp    30847   30847   .   +   .   SNP "rs2971719"; Name "rs2971719"
 exon   31012   31409   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 exon   34013   34083   .   -   .   Transcript "XM_104663"; Name "XM_104663"
 exon   40932   41071   .   +   .   Transcript "XM_092971"; Name "XM_092971"
 snp    44269   44269   .   +   .   SNP "rs2873227"; Name "rs2873227"
 snp    45723   45723   .   +   .   SNP "rs2227095"; Name "rs2227095"
 exon   134031  134495  .   -   .   Transcript "XM_086913"; Name "XM_086913"            
 exon   134034  134457  .   -   .   Transcript "XM_086914"; Name "XM_086914"            

Observando os dados de amostra acima, somente as duas últimas linhas podem ser mescladas em uma linha. Assim, a nova linha se tornará.

exon    134031  134495  .   -   .   Transcript "XM_086913"; Name "XM_086913"            

No caso, o final da outra linha teria sido maior que o anterior, que seria a região END nesse caso. Basicamente, se houver alguma sobreposição, então pegue a região que inicia Anteriormente, e a que termina depois.

Pode haver várias linhas de tal instância, aqui apenas as últimas 2 linhas estão lá. Uma coisa é que a coluna ATRRIBUTE mostrará definitivamente diferentes nomes de Transcrição para tais linhas, que são na maioria iguais em outros casos.

Sugestões sobre como proceder.

EXEMPLO ATUALIZADO: Se as últimas 2 linhas forem assim

  exon  134031  134457  .   -   .   Transcript "XM_086913"; Name "XM_086913"            
  exon  134034  134495  .   -   .   Transcript "XM_086914"; Name "XM_086914"

Em seguida, a saída deve ser:

exon    134031  134495  .   -   .   Transcript "XM_086913"; Transcript "XM_086914"

Basicamente, o START de primeiro e END de segundo. Desde que eu quero cobrir a sobreposição em uma única linha, em vez de 2 ou 3 ou mais linhas.Aqui a sobreposição é entre 2 linhas, mas poderia ser entre mais de 2 linhas também.

EXEMPLO ATUALIZADO (24/3/2014)

chr1    HAVANA  stop_codon  1120520 1120522 .   +   0   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  UTR 1115077 1115233 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  UTR 1115414 1115433 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  UTR 1120520 1121244 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000379288.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TTLL10-001"; level 2; tag "CCDS"; ccdsid "CCDS8.1"; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002420.2";
chr1    HAVANA  transcript  1115864 1119307 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  exon    1115864 1116240 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  *exon   1117121 1117195*    .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  *exon   1117150 1117826*    .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";
chr1    HAVANA  exon    1118256 1118427 .   +   .   gene_id "ENSG00000162571.9"; transcript_id "ENST00000460998.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TTLL10"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "TTLL10-004"; level 2; havana_gene "OTTHUMG00000000851.3"; havana_transcript "OTTHUMT00000002423.1";

chr1    HAVANA  transcript  1190648 1209229 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1209046 1209229 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1203113 1203372 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  CDS 1203241 1203372 .   -   0   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  start_codon 1203370 1203372 .   -   0   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  stop_codon  1203238 1203240 .   -   0   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1198726 1198766 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1192588 1192690 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  exon    1192372 1192510 .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  *exon   1191425 1191505*    .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";
chr1    HAVANA  *exon   1190648 1191470*    .   -   .   gene_id "ENSG00000160087.16"; transcript_id "ENST00000473215.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "UBE2J2"; transcript_type "nonsense_mediated_decay"; transcript_status "KNOWN"; transcript_name "UBE2J2-003"; level 2; havana_gene "OTTHUMG00000001911.7"; havana_transcript "OTTHUMT00000005432.2";

A parte superior mostra a sobreposição na fita "+", e a parte inferior mostra a sobreposição na fita "-". A fita "-" tem regiões decrescentes, portanto a sobreposição será como mostrada nas últimas duas linhas. são diferentes genes.Assim, a sobreposição tem que ser por gene, como às vezes diferentes genes têm sobreposição de exons também, mas isso é muito raro, como li em alguns posts. A informação do gene pode ser extraída da última coluna, presente como "gene_name".

As duas linhas de gene_name = TTLL10 têm exons sobrepostos, então elas serão mescladas na saída final.

chr1    HAVANA  *exon   1117121 1117195*    .   +   .   transcript_id "ENST00000460998.1"; gene_name "TTLL10"; 
chr1    HAVANA  *exon   1117150 1117826*    .   +   .   transcript_id "ENST00000460998.1"; gene_name "TTLL10"; 

As duas linhas de gene_name = UBE2J2 têm sobreposição de exons.

 chr1   HAVANA  *exon   1191425 1191505*    .   -   .   transcript_id "ENST00000473215.1"; gene_name "UBE2J2"; 
  chr1  HAVANA  *exon   1190648 1191470*    .   -   .   transcript_id "ENST00000473215.1";  gene_name "UBE2J2"; 

SAÍDA DE AMOSTRA

O restante das linhas permanece igual e as linhas acima são mescladas para cada gene.

chr1    HAVANA  *exon   1117121 1117826*    .   +   .   transcript_id "ENST00000460998.1"; gene_name "TTLL10";
chr1    HAVANA  *exon   1190648 1191505*    .   -   .   transcript_id "ENST00000473215.1";  gene_name "UBE2J2"; 

No caso, os transcript_ids são diferentes, ambos os IDs de transcrição seriam impressos, embora gene_name permaneça o mesmo.Por exemplo, para gene, os id de transcrição eram diferentes, como abaixo:

  chr1  HAVANA  *exon   1191425 1191505*    .   -   .   transcript_id "ENST00000473215.1"; gene_name "UBE2J2"; 
  chr1  HAVANA  *exon   1190648 1191470*    .   -   .   transcript_id "ENST00000473215.2";  gene_name "UBE2J2"; 

Isso será mesclado com o acima, mas deve ter os dois nomes de transcrição. Desde então, acredito que possa ser necessário e será importante mais tarde preservar a informação da transcrição.

  chr1  HAVANA  *exon   1190648 1191505*    .   -   .   transcript_id "ENST00000473215.1"; "ENST00000473215.2"  gene_name "UBE2J2"; 
    
por Ron 21.03.2014 / 18:42

2 respostas

6

Uma abordagem "awk"

awk '
  $1!="exon" {                       # If the first died is unequal to "exon"
    if(previous)print previous       # If there is a previous line then print it
    print                            # Print the current line
    previous=start=end=exon_seq=""   # Set all variable to an empty string
    next                             # Move on to the next line in the input file
  }
  {
    if(exon_seq) {                   # if there is a sequence of lines with "exon in field 1
      if(start<=$2 && end>=$3)       # if the start value (field 2) of the previous line 
                                     # is less or equal to the current line and the end
                                     # value of the previous line is greater than or
                                     # equal to field 3 of the current line
        next                         # then do nothing and read the next line
      else                           # if there is no overlap,
        print previous               # then print the previous line
    }
    else {                           # if we are not already in the a sequence of 
                                     # "exon" lines, then this is the first one
      exon_seq=1                     # so exon_seq should become 1
    }
    previous=$0; start=$2; end=$3    # 'start' become field2, 'end' becomes field 3 and
                                     # 'previous' becomes the current record (line)
  }
  END{                               # After all lines are processed
    if(previous) print previous      # If there still is a previous line, then print it
  }
' file
    
por 21.03.2014 / 21:10
2

Eu usaria o Perl para resolver uma tarefa tão complicada. Aqui está uma solução parcial, você pode precisar ajustá-lo para melhor atender a você:

#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw{ max };

sub output {
    my $previous = shift;
    print join ' ', 'exon',
               @{$previous}{qw(start end score strand frame attribute)};
}

$\ = "\n";
my %previous;
while (<>) {
    chomp;
    my ($region, $start, $end, $score, $strand, $frame, $attribute)
        = split ' ', $_, 7;

    if ($. == 1) {
        print;

    } elsif ('exon' eq $region) {
        if (keys %previous
            and $start < $previous{end}                # Overlap.
           ) {

            if ($end > $previous{end}) {               # Not contained.
                $previous{attribute} =~ s/; Name .*//;
                $previous{attribute} .= '; ' 
                                     . ($attribute =~ /(Transcript ".*?")/)[0];
                $previous{end} = $end;
            }

        } else {
            if (keys %previous) {
                output(\%previous);
            }

            %previous = ( start     => $start,
                          end       => $end,
                          score     => $score,
                          strand    => $strand,
                          frame     => $frame,
                          attribute => $attribute,
                        );
        }

    } else {
        output(\%previous) if keys %previous;
        %previous = ();
    }
}

output(\%previous) if keys %previous;
    
por 21.03.2014 / 19:19