Using awk:
NR == 1 { next }    # skip header

$5 != window {      # new (or first) window
    if (window != "")   # unless this is the first window, print the collected data
        print window, chrom, start, end

    # collect data for next window
    chrom = $2
    start = $3
    window = $5
}

{ end = $3 }        # always update the end position

# at the end, print the collected data for the last window
END { print window, chrom, start, end }
Running this:
$ awk -f script.awk file
"window_1391" 1 69500112 69500233
"window_1747" 1 87300054 87300800
"window_59705" 17 200219189 200241059
With tabs as output separators:
$ awk -v OFS='\t' -f script.awk file
"window_1391"	1	69500112	69500233
"window_1747"	1	87300054	87300800
"window_59705"	17	200219189	200241059
A slightly fancier version moves the output code into a function. The function also prints a header and strips the double quotes from the original window ID.
function output() {
    if (window == "")
        # no previous window, output header
        print "window_id", "chrom", "starting_position", "ending_position"
    else {
        # strip the first and last characters from window ID (the quotes)
        # then output
        w = substr(window, 2, length(window) - 2)
        print w, chrom, start, end
    }
}

NR == 1 { next }    # skip header

$5 != window {      # new (or first) window
    output()

    # collect data for next window
    chrom = $2
    start = $3
    window = $5
}

{ end = $3 }        # always update the end position

# at the end, print the collected data for the last window
END { output() }
Running it:
$ awk -v OFS='\t' -f script.awk file
window_id	chrom	starting_position	ending_position
window_1391	1	69500112	69500233
window_1747	1	87300054	87300800
window_59705	17	200219189	200241059
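To try the approach without the original data, here is a minimal, self-contained sketch. The input below is made up: its five-column layout and values are assumptions, inferred only from the fields the script reads ($2 as chromosome, $3 as position, $5 as the quoted window ID), and the header names are hypothetical. The awk program is a condensed form of the function-based version above.

```shell
# Hypothetical sample input; the layout is an assumption, not the real data.
sample=$(mktemp)
cat >"$sample" <<'EOF'
"id" "chrom" "pos" "value" "window"
"a" 1 100 0.5 "window_1"
"b" 1 150 0.7 "window_1"
"c" 1 300 0.2 "window_2"
"d" 2 40 0.9 "window_3"
"e" 2 90 0.1 "window_3"
EOF

result=$(awk -v OFS='\t' '
    function output() {
        if (window == "")
            # no previous window yet: print the header instead
            print "window_id", "chrom", "starting_position", "ending_position"
        else
            # drop the surrounding quotes from the window ID, then print
            print substr(window, 2, length(window) - 2), chrom, start, end
    }
    NR == 1 { next }                  # skip the input header
    $5 != window {                    # a new (or the first) window starts here
        output()
        chrom = $2; start = $3; window = $5
    }
    { end = $3 }                      # last position seen in the current window
    END { output() }                  # flush the final window
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With this sample the output is the header line followed by one tab-separated row per window: window_1 spanning 100 to 150, window_2 spanning 300 to 300 (a single row), and window_3 spanning 40 to 90. As in the scripts above, this relies on all rows of a window being adjacent in the input.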