Extrai linha do CSV onde a coluna corresponde aos critérios

1

Eu tenho um arquivo CSV de 3 GB (24mil linhas) do qual preciso extrair dados. Eu fiz esse tipo de extração usando o Excel, mas esse arquivo é muito grande para trabalhar no Excel. Quero obter linhas de dados com base no valor da terceira coluna.

Aqui está uma amostra dos dados. Não há cabeçalhos (mas podem ser adicionados, se necessário).

5/23/2017,00:04:50,421,         0, ,B,     657.78240967,        -1
5/23/2017,00:04:50,421,         1, ,B,       1.00000000,        -1
5/23/2017,00:04:50,421,         2, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,         3, ,B,       6.03906202,        -1
5/23/2017,00:04:50,421,         4, ,B,    6000.00000000,        -1
5/23/2017,00:04:50,421,         5, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,         6, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,         7, ,B,      11.00000000,        -1
5/23/2017,00:04:50,421,         8, ,B,      65.00000000,        -1
5/23/2017,00:04:50,421,         9, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        10, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        11, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        12, ,B,       2.00000000,        -1
5/23/2017,00:04:50,421,        13, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        14, ,B,       5.28633022,        -1
5/23/2017,00:04:50,421,        15, ,B,    2000.00000000,        -1
5/23/2017,00:04:50,421,        16, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        17, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        18, ,B,       4.40085888,        -1
5/23/2017,00:04:50,421,        19, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        20, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        21, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        22, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        23, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        24, ,B,       7.00000000,        -1
5/23/2017,00:04:50,421,        25, ,B,      60.00000000,        -1
5/23/2017,00:04:50,421,        26, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        27, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        28, ,B,      80.00000000,        -1
5/23/2017,00:04:50,421,        29, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        30, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        31, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        32, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        33, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        34, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        35, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        36, ,B,       4.32588148,        -1
5/23/2017,00:04:50,421,        37, ,B,    1718.00000000,        -1
5/23/2017,00:04:50,421,        38, ,B,       0.05945313,        -1
5/23/2017,00:04:50,421,        39, ,B,       2.95723248,        -1
5/23/2017,00:04:50,421,        40, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        41, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        42, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        43, ,B,   35946.00000000,        -1
5/23/2017,00:04:50,421,        44, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        45, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        46, ,B,   42271.00000000,        -1
5/23/2017,00:04:50,421,        47, ,B,   25402.00000000,        -1
5/23/2017,00:04:50,421,        48, ,B,       1.00000000,        -1
5/23/2017,00:04:50,421,        49, ,B,       1.00000000,        -1
5/23/2017,00:04:50,421,        50, ,B,   40745.00000000,        -1
5/23/2017,00:04:50,421,        51, ,B,   24373.00000000,        -1
5/23/2017,00:04:50,421,        52, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        53, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        54, ,B,   46517.00000000,        -1
5/23/2017,00:04:50,421,        55, ,B,      35.00000000,        -1
5/23/2017,00:04:50,421,        56, ,B,      51.00000000,        -1
5/23/2017,00:04:50,421,        57, ,B,      51.00000000,        -1
5/23/2017,00:04:50,421,        58, ,B,   49396.00000000,        -1
5/23/2017,00:04:50,421,        59, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        60, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        61, ,B,   13610.00000000,        -1
5/23/2017,00:04:50,421,        62, ,B,   50711.00000000,        -1
5/23/2017,00:04:50,421,        63, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        64, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        65, ,B,   93927.00000000,        -1
5/23/2017,00:04:50,421,        66, ,B,   50173.00000000,        -1
5/23/2017,00:04:50,421,        67, ,B,       1.00000000,        -1
5/23/2017,00:04:50,421,        68, ,B,       1.00000000,        -1
5/23/2017,00:04:50,421,        69, ,B,   13797.00000000,        -1
5/23/2017,00:04:50,421,        70, ,B,      75.00000000,        -1
5/23/2017,00:04:50,421,        71, ,B,     114.00000000,        -1
5/23/2017,00:04:50,421,        72, ,B,      75.00000000,        -1
5/23/2017,00:04:50,421,        73, ,B,   47238.00000000,        -1
5/23/2017,00:04:50,421,        74, ,B,       1.00000000,        -1
5/23/2017,00:04:50,421,        75, ,B,       1.00000000,        -1
5/23/2017,00:04:50,421,        76, ,B,   19338.00000000,        -1
5/23/2017,00:04:50,421,        77, ,B,   29879.00000000,        -1
5/23/2017,00:04:50,421,        78, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        79, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        80, ,B,   19633.00000000,        -1
5/23/2017,00:04:50,421,        81, ,B,   16687.00000000,        -1
5/23/2017,00:04:50,421,        82, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        83, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        84, ,B,    6823.00000000,        -1
5/23/2017,00:04:50,421,        85, ,B,      85.00000000,        -1
5/23/2017,00:04:50,421,        86, ,B,      91.00000000,        -1
5/23/2017,00:04:50,421,        87, ,B,      91.00000000,        -1
5/23/2017,00:04:50,421,        88, ,B,       0.00000000,        -1
5/23/2017,00:04:50,421,        89, ,B,       0.00000000,        -1
5/23/2017,00:04:54,312,         0, , ,     660.28082275,         0
5/23/2017,00:04:54,312,         1, , ,       1.00000000,         1
5/23/2017,00:04:54,312,         2, , ,       0.00000000,         2
5/23/2017,00:04:54,312,         3, , ,       5.95734215,         3
5/23/2017,00:04:54,312,         4, , ,    6000.00000000,         4
5/23/2017,00:04:54,312,         5, , ,       0.00000000,         5
5/23/2017,00:04:54,312,         6, , ,       0.00000000,         6
5/23/2017,00:04:54,312,         7, , ,      11.00000000,         7
5/23/2017,00:04:54,312,         8, , ,      65.00000000,         8
5/23/2017,00:04:54,312,         9, , ,       0.00000000,         9
5/23/2017,00:04:54,312,        10, , ,       0.00000000,        10
5/23/2017,00:04:54,312,        11, , ,       0.00000000,        11
5/23/2017,00:04:54,312,        12, , ,       2.00000000,        12
5/23/2017,00:04:54,312,        13, , ,       0.00000000,        13
5/23/2017,00:04:54,312,        14, , ,       5.32481575,        14
5/23/2017,00:04:54,312,        15, , ,    2000.00000000,        15
5/23/2017,00:04:54,312,        16, , ,       0.00000000,        16
5/23/2017,00:04:54,312,        17, , ,       0.00000000,        17
5/23/2017,00:04:54,312,        18, , ,       4.46131372,        18
5/23/2017,00:04:54,312,        19, , ,       0.00000000,        19
5/23/2017,00:04:54,312,        20, , ,       0.00000000,        20
5/23/2017,00:04:54,312,        21, , ,       0.00000000,        21
5/23/2017,00:04:54,312,        22, , ,       0.00000000,        22
5/23/2017,00:04:54,312,        23, , ,       0.00000000,        23
5/23/2017,00:04:54,312,        24, , ,       7.00000000,        24
5/23/2017,00:04:54,312,        25, , ,      60.00000000,        25
5/23/2017,00:04:54,312,        26, , ,       0.00000000,        26
5/23/2017,00:04:54,312,        27, , ,       0.00000000,        27
5/23/2017,00:04:54,312,        28, , ,      80.00000000,        28
5/23/2017,00:04:54,312,        29, , ,       0.00000000,        29
5/23/2017,00:04:54,312,        30, , ,       0.00000000,        30
5/23/2017,00:04:54,312,        31, , ,       0.00000000,        31
5/23/2017,00:04:54,312,        32, , ,       0.00000000,        32
5/23/2017,00:04:54,312,        33, , ,       0.00000000,        33
5/23/2017,00:04:54,312,        34, , ,       0.00000000,        34
5/23/2017,00:04:54,312,        35, , ,       0.00000000,        35
5/23/2017,00:04:54,312,        36, , ,       4.33114624,        36
5/23/2017,00:04:54,312,        37, , ,    1733.00000000,        37
5/23/2017,00:04:54,312,        38, , ,       0.71863282,        38
5/23/2017,00:04:54,312,        39, , ,       2.95582461,        39
5/23/2017,00:04:54,312,        40, , ,       0.00000000,        40
5/23/2017,00:04:54,312,        41, , ,       0.00000000,        41
5/23/2017,00:04:54,312,        42, , ,       0.00000000,        42
5/23/2017,00:04:54,312,        43, , ,   35946.00000000,        43
5/23/2017,00:04:54,312,        44, , ,       0.00000000,        44
5/23/2017,00:04:54,312,        45, , ,       0.00000000,        45
5/23/2017,00:04:54,312,        46, , ,   42271.00000000,        46
5/23/2017,00:04:54,312,        47, , ,   25402.00000000,        47
5/23/2017,00:04:54,312,        48, , ,       1.00000000,        48
5/23/2017,00:04:54,312,        49, , ,       1.00000000,        49
5/23/2017,00:04:54,312,        50, , ,   40745.00000000,        50
5/23/2017,00:04:54,312,        51, , ,   24373.00000000,        51
5/23/2017,00:04:54,312,        52, , ,       0.00000000,        52
5/23/2017,00:04:54,312,        53, , ,       0.00000000,        53
5/23/2017,00:04:54,312,        54, , ,   46517.00000000,        54
5/23/2017,00:04:54,312,        55, , ,      35.00000000,        55
5/23/2017,00:04:54,312,        56, , ,      50.00000000,        56
5/23/2017,00:04:54,312,        57, , ,      50.00000000,        57
5/23/2017,00:04:54,312,        58, , ,   49396.00000000,        58
5/23/2017,00:04:54,312,        59, , ,       0.00000000,        59
5/23/2017,00:04:54,312,        60, , ,       0.00000000,        60
5/23/2017,00:04:54,312,        61, , ,   13610.00000000,        61
5/23/2017,00:04:54,312,        62, , ,   50711.00000000,        62
5/23/2017,00:04:54,312,        63, , ,       0.00000000,        63
5/23/2017,00:04:54,312,        64, , ,       0.00000000,        64
5/23/2017,00:04:54,312,        65, , ,   93927.00000000,        65
5/23/2017,00:04:54,312,        66, , ,   50173.00000000,        66
5/23/2017,00:04:54,312,        67, , ,       1.00000000,        67
5/23/2017,00:04:54,312,        68, , ,       1.00000000,        68
5/23/2017,00:04:54,312,        69, , ,   13797.00000000,        69
5/23/2017,00:04:54,312,        70, , ,      75.00000000,        70
5/23/2017,00:04:54,312,        71, , ,     114.00000000,        71
5/23/2017,00:04:54,312,        72, , ,      75.00000000,        72
5/23/2017,00:04:54,312,        73, , ,   47238.00000000,        73
5/23/2017,00:04:54,312,        74, , ,       1.00000000,        74
5/23/2017,00:04:54,312,        75, , ,       1.00000000,        75
5/23/2017,00:04:54,312,        76, , ,   19338.00000000,        76
5/23/2017,00:04:54,312,        77, , ,   29879.00000000,        77
5/23/2017,00:04:54,312,        78, , ,       0.00000000,        78
5/23/2017,00:04:54,312,        79, , ,       0.00000000,        79
5/23/2017,00:04:54,312,        80, , ,   19633.00000000,        80
5/23/2017,00:04:54,312,        81, , ,   16687.00000000,        81
5/23/2017,00:04:54,312,        82, , ,       0.00000000,        82
5/23/2017,00:04:54,312,        83, , ,       0.00000000,        83
5/23/2017,00:04:54,312,        84, , ,    6823.00000000,        84
5/23/2017,00:04:54,312,        85, , ,      85.00000000,        85
5/23/2017,00:04:54,312,        86, , ,      91.00000000,        86
5/23/2017,00:04:54,312,        87, , ,      91.00000000,        87
5/23/2017,00:04:54,312,        88, , ,       0.00000000,        88
5/23/2017,00:04:54,312,        89, , ,       0.00000000,        89

Digamos que eu queira a linha que tenha o índice (terceira coluna) de 37 , para que meus dados ficassem assim.

5/23/2017,00:04:50,421,        37, ,B,    1718.00000000,        -1
5/23/2017,00:04:54,312,        37, , ,    1733.00000000,        37

Qual é a ferramenta que posso usar para realizar isso? Alguma dica? Eu estou usando principalmente Windows 10 , mas poderia encontrar uma máquina Linux se há uma ótima ferramenta sobre isso. Eu fiz manipulação básica para este arquivo com Sublime Text 3 e isso funcionou bem.

Qualquer ajuda seria muito apreciada.

    
por blackandorangecat 30.10.2017 / 23:00

2 respostas

2

Eu acho que você quer dizer a quarta coluna. Você poderia usar awk na linha de comando:

awk -F, -v OFS=, '$4==37' input.csv > output.csv

-F, define o separador do campo de entrada como vírgula

-v OFS=, define o separador do campo de saída como vírgula

$4==37 corresponde e imprime todas as linhas em que o quarto campo é 37

Isso cria um novo arquivo, output.csv :

5/23/2017,00:04:50,421,        37, ,B,    1718.00000000,        -1
5/23/2017,00:04:54,312,        37, , ,    1733.00000000,        37

awk é muito versátil. Você também pode usá-lo para combinar vários arquivos CSV e remover o cabeçalho. Ele vem como padrão no Linux e MacOS e versões do Windows estão disponíveis.

    
por 31.10.2017 / 00:00
2

Eu comento no segundo teylyn: o Excel deve ser capaz de lidar com um arquivo de 3GB com 24k linhas .

64-bit environment imposes no hard limits on file size. Workbook size is limited only by available memory and system resources.

Total number of rows and columns on a worksheet: 1,048,576 rows by 16,384 columns

A opção segura é importar ou analisar o CSV, pois pode haver uma vírgula dentro de uma string entre aspas (true, seus dados parecem não ter nenhum).

No entanto, pensei em lançar uma solução perl :

perl -F, -lane 'print if $F[3]==37' input.csv

perl é padrão nos sistemas operacionais do tipo UNIX; para disponibilizá-lo no Windows, você precisa instalar, por exemplo, cygwin . O WSL também deve ser oferecido, embora eu não tenha verificado.

O mesmo pode ser realizado com awk , como sugerido por Mike Fitzpatrick:

awk -F, -v OFS=, '$4==37' input.csv

O mesmo pode ser realizado com sed :

sed -n '/[^,]*,[^,]*,[^,]*, *37,/p' input.csv
    
por 31.10.2017 / 15:48