How to split a file into separate files based on the column headers in the original file?

Question


5

I would like to split a file into different files based on the information in its first line. For example, I have:

Input:

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 30 30 30 30
0 2 2 0 2 0 2 0 2 0 2 2 0 0 2 2 2 0 1 1 1 2 0 2 0 0 0 2 0 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 0 2 0 2

Desired output:

output1.txt

02202020
02101011
02101011

output2.txt

2022002
1022002
1022002

output3.txt

220111
220000
220000

output4.txt

202000200202
202001200202
202001200202

output30.txt

0202
0202
0202

bash perl awk linux cut

asked by zara 30.09.2015 / 18:52

4 answers

3

$ awk '
    NR == 1 {
        for (i=1; i<=NF; i++) {
            output[i] = "output" $i ".txt"
            files[output[i]] = 1
        }
        next
    }
    {
        for (i=1; i<=NF; i++)  printf "%s", $i > output[i]
        for (file in files)    print ""        > file
    }
' input.filename

$ for f in output*.txt; do echo "$f"; cat "$f"; done
output1.txt
02202020
02101011
02101011
output2.txt
2022002
1022002
1022002
output3.txt
220111
220000
220000
output30.txt
00202
00202
00202
output4.txt
2020002
2020012
2020012

Note that your header line has 32 fields while the other lines have 33. That needs to be fixed first.
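
One way to find such mismatches before splitting is to compare each line's field count with the header's. The following is a minimal sketch, assuming whitespace-separated input; `check_fields` is a hypothetical helper name and not part of the answer above.

```shell
#!/usr/bin/env bash
# check_fields FILE (hypothetical helper): report every line whose number of
# whitespace-separated fields differs from the header line's.
# Exit status is 0 when all lines agree and 1 otherwise.
check_fields() {
    awk '
        NR == 1 { want = NF; next }      # remember the header width
        NF != want {                     # blank lines (NF == 0) are reported too
            printf "line %d: %d fields, expected %d\n", NR, NF, want
            bad = 1
        }
        END { exit bad + 0 }             # +0 forces a numeric exit status
    ' "$1"
}
```

Run against the input from the question, for example with `check_fields input.filename || echo "fix the input first"`, this should flag lines 2 to 4 as having 33 fields where 32 are expected.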

answered 30.09.2015 / 20:57

2

OK, also for fun: a pure Bash version (as requested) that relies heavily on the read builtin to split words into arrays and save them to files. The output files are neatly named output001.txt ... output030.txt. I used the data file modified by @ringO for testing. Untested on large inputs, but on very large files it may be more efficient in time and resources than the other solutions.

Data:

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 30 30 30 30
0 2 2 0 2 0 2 0 2 0 2 2 0 0 2 2 2 0 1 1 1 2 0 2 0 0 0 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2

Source:

#!/usr/bin/env bash

# genome : to sort genome data sets according to patterns of the first (header)
# line of the file.  Data must be space delimited.  No dependencies.
#
# Usage:
#
#                      ./genome "data.txt" 

# global arrays
sc=(  )             # array of set element counts
sn=(  )             # array of set id numbers

# output_file "set id"

# change the output pattern and digit output width as required - default
# pattern is output.txt and digit width of three : output000.txt
output_file(){
    # format concept: pattern000.txt
    local op='output.txt'     # output pattern
    local ow=3                # output width: 3 => 000
    printf "%s%0${ow}d.%s" "${op%%.*}" "$1" "${op##*.}"
}

# define_sets "input.txt"

# identify sets - get elements count and sets id numbers from file
# header.
define_sets(){
    # declare and initialize
    local a an b c n
    read -r c < "$1"
    read -r a b <<< "$c"
    n=0; sn=( $a )

    # recurse header, identify sets
    until [[ -z $b ]]
    do
        n=$((n+1))
        an=$a
        read -r a b <<< "$b"
        [[ $an == $a ]] || { sn+=( $a ); sc+=( $n ); n=0; }
    done
    n=$((n+1))
    sc+=( $n )
}

# reset_files

# optional function, clears file data, otherwise data is appended to existing
# output files.
reset_files(){
    for s in ${sn[@]}
    do
        > "$(output_file "$s")"
    done
}

# extract_data "input.txt"

# use defined sets to extract data from the input file and send it to required
# output files. Uses nested 'while read' to bypass file header as data is saved.
extract_data(){
    local a c n s fn da=( )
    while read -a da
    do
        while read -a da
        do
            a=0 n=0
            for s in ${sc[@]}
            do
                c="$(echo "${da[@]:$a:$s}")" # words => string
                echo "${c// /}" >> "$(output_file "${sn[$n]}")"  # save
                n=$((n+1))
                a=$((a+s))
            done
        done
    done < "$1"
}

define_sets "$1"    # get data set structure from header
reset_files         # optional, clears and resets files
extract_data "$1"   # get data from input file and save

# end file

Output data:

$ cat output001.txt 
02202020
02101011
02101011

$ cat output002.txt 
2022002
1022002
1022002

$ cat output003.txt 
220111
220000
220000

$ cat output004.txt 
2020002
2020012
2020012

$ cat output030.txt 
0202
0202
0202
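
The grouping step in `define_sets` amounts to finding runs of equal consecutive labels in the header. As an illustration only, the sketch below prints each run's label and length; `header_runs` is a hypothetical helper, written separately from the script above, and it assumes a space-delimited header on the first line.

```shell
#!/usr/bin/env bash
# header_runs FILE (hypothetical helper): print "label count" for each run of
# equal consecutive labels in the first line of FILE.
header_runs() {
    local -a labels
    local prev='' count=0 label
    # read returns non-zero when the line lacks a trailing newline,
    # but the words are still stored, so the status is ignored here.
    read -r -a labels < "$1" || true
    for label in "${labels[@]}"; do
        if [[ $count -gt 0 && $label != "$prev" ]]; then
            printf '%s %d\n' "$prev" "$count"    # a run just ended
            count=0
        fi
        prev=$label
        count=$((count + 1))
    done
    if [[ $count -gt 0 ]]; then
        printf '%s %d\n' "$prev" "$count"        # flush the final run
    fi
}
```

For the 32-field header used here, `header_runs data.txt` prints `1 8`, `2 7`, `3 6`, `4 7` and `30 4`, one pair per line, which matches the widths of the five output files shown above. A label that reappears after a different one would start a new run.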

answered 05.10.2015 / 19:04

1

Just for fun, another solution:

awk '{ for (i=1; i<=NF;i++){
          if (NR==1) { file[i]=$i }
          if (NR!=1) { f="output" file[i]   ".txt";
                       g="output" file[i+1] ".txt";
                       printf("%s%s",$i,f==g?OFS:ORS)>>f;
                       close(f);
                      }
          }
      }' file

If you need the fields not to be separated, change ?OFS: to ?"": .

The default file that receives unmatched values is output.txt. It receives values when the number of columns in the first line does not match the number in the lines processed afterwards. If everything is correct, it should be empty. If it exists after the script has run, there is a problem somewhere.
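
As a sketch of how the two notes above fit together, the function below wraps the same awk program with `?OFS:` changed to `?"":` and then checks for the catch-all file. The name `split_by_header` and the warning text are assumptions made for illustration. Because the program appends with `>>`, old output*.txt files should be removed before a rerun.

```shell
#!/usr/bin/env bash
# split_by_header FILE (hypothetical wrapper): run the awk program from this
# answer with unseparated fields, writing output<label>.txt files into the
# current directory, then check whether the catch-all output.txt was created.
split_by_header() {
    awk '{ for (i = 1; i <= NF; i++) {
             if (NR == 1) {
                 file[i] = $i                      # header label for column i
             } else {
                 f = "output" file[i]     ".txt"   # file for this column
                 g = "output" file[i + 1] ".txt"   # file for the next column
                 # end the line only when the next column goes elsewhere
                 printf("%s%s", $i, f == g ? "" : ORS) >> f
                 close(f)
             }
         } }' "$1"
    if [ -e output.txt ]; then
        echo "output.txt exists: a data line has more fields than the header" >&2
        return 1
    fi
}
```

A call such as `rm -f output*.txt; split_by_header file` then fails visibly when a data line is wider than the header, instead of leaving output.txt behind unnoticed.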

answered 01.10.2015 / 08:21


score 3 · Accepted Answer

A Perl script.

Set the file name in $in in place of genome.txt, or pass the name as an argument.

Name the script counter.pl, make it executable, and run it as ./counter.pl

chmod 755 counter.pl
./counter.pl

or alternatively

chmod 755 counter.pl
./counter.pl genome.txt

counter.pl:

#!/usr/bin/perl

use strict;
use warnings;

my $in = $ARGV[0] || 'genome.txt'; # input file name

open (my $F, '<', $in) or die "Cannot open input file $!";
my $n = 0;
my %fd = ();
my @fd = ();

while (<$F>) {
        # trim
        s/^\s+//;
        s/\s+$//;
        next if ($_ eq ''); # Skip empty lines (a line holding only "0" is still data)
        my @x = split(/\s+/, $_);
        # 1st line, open files
        if ( ! $n++)  {
           my $fd = 0;
           for (@x) {
              open ($fd{$_}, '>', "output$_.txt") 
                or die ("Cannot open file $!")
                  if (!exists($fd{$_}));
              $fd[$fd++] = $_;
           }
        }
        else { # Write data
           die ("Should have " . ($#fd+1) . " entries on line $n")
             if ($#x != $#fd);
           for (0 .. $#x) {
              print {$fd{$fd[$_]}} ($x[$_]);
           }
           print {$fd{$_}} ("\n") for (keys %fd);
        }
}

close $fd{$_} for (keys %fd);
close $F;
# the end

I fixed the number of words per line (in the example it was sometimes 32 and sometimes 33).

This version can accommodate any arrangement of columns, but all lines must have the same number of words. The script stops with an error (the die lines) if the number of words differs or if it cannot open a file.

Just adjust the file name ( $in ).

Input file (with the extra 0 near the end removed):

1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 30 30 30 30
0 2 2 0 2 0 2 0 2 0 2 2 0 0 2 2 2 0 1 1 1 2 0 2 0 0 0 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2
0 2 1 0 1 0 1 1 1 0 2 2 0 0 2 2 2 0 0 0 0 2 0 2 0 0 1 2 0 2 0 2

output1.txt

02202020
02101011
02101011

output2.txt

2022002
1022002
1022002

output30.txt

0202
0202
0202

output3.txt

220111
220000
220000

output4.txt

2020002
2020012
2020012