Compare dois arquivos com quatro colunas

4

Eu tenho dois arquivos csv com algumas informações de log. Eu preciso comparar linha a linha do campo1 (arquivo1) com o campo2 (arquivo2) com um intervalo de 1 minuto (-00: 00: 01 < time < 00:00:01) e o campo5 (file1) com o campo3 ( arquivo2). Se eles corresponderem, copie field3 (input2) para a saída e imprima os próximos campos do arquivo1, senão escreva "NOACT". Eu tentei com awk , join e python , mas não funcionou.

arquivo1 (55000000 linhas):

19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1

Arquivo2 (25000 linhas):

19:15:30,187.173.121.63,42347,NOT
19:15:30,187.173.121.63,52364,OK
19:15:30,187.173.121.63,52364,OK
19:15:32,145.246.158,44444,NOT

Resultado esperado:

19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1,NOACT
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1,OK
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1,NOACT
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1,NOT
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1,NOACT

Exemplo

arquivo1:

A11 A12 A13 A14 A15 A16 A17 A18 A19 A110  
A21 A22 A23 A24 A25 A26 A27 A28 A29 A210  
A31 A32 A33 A34 A35 A36 A37 A38 A39 A310  

arquivo2:

B11 B12 B13  
B21 B22 B23  
B31 B32 B33  

Eu preciso ver se B11 corresponde a A11 e, em caso afirmativo, ver se B12 corresponde a A15. Em caso afirmativo, escreva a primeira linha de saída (outline1 = inputAline1 & B13), caso contrário, vá para a próxima linha de B. Se nenhuma correspondência for encontrada, escreva a primeira linha de A & "NOACT".

    
por beginner 19.05.2015 / 15:44

3 respostas

1

Você pode usar sqlite para adicionar seus dados:

$ sqlite3 <<EOT
.mode csv
CREATE TABLE file1 (A11,A12,A13,A14,A15,A16,A17,A18,A19,A110,A111);
CREATE TABLE file2 (B11,B12,B13,B14);
.import file1 file1
.import file2 file2
SELECT DISTINCT file1.*, ifnull(file2.B14,"NOACT") FROM file1 LEFT JOIN file2 ON abs(julianday(file1.A11) - julianday(file2.B11))*86400.0 < 1 AND A15 = B13;
EOT

que dá:

19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1,NOACT
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1,OK
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1,NOACT
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1,NOT
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1,NOACT
    
por 21.05.2015 / 10:34
0

O script a seguir deve fazer o que você está procurando em python:

#!/usr/bin/env python3

import sys
import csv
import bisect
import re

timeRange = 1 # seconds

# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')

# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
    # We add hours and minutes in case they are missinge
    # We reverse the list
    # We only take the first three elements of the list anyway
    (s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
    result = float(h) * 3600 + float(m) * 60 + float(s)
    return result

if (len(sys.argv) != 4):
    print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
    exit(1)

inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]

# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2) as inputFile2:
    csvReader = csv.reader(inputFile2)
    for row in csvReader:
        if len(row) == 4:
            if not timestampRegex.match(row[0]):
                continue
            time = timeToFloat(row[0])
            file2Entries.append((time, row))

file1Entries = []
with open(inputFileName1) as inputFile1, open(outputFileName, 'w') as outputFile:
    csvReader = csv.reader(inputFile1)
    # For each row in file1 we look for a match among the entries of file2
    for row in csvReader:
        if len(row) == 11:
            if not timestampRegex.match(row[0]):
                # We can't possibly find a match for this line
                outputFile.write(','.join(row + ['NOACT']))
                outputFile.write('\n')
                continue
            time = timeToFloat(row[0])
            # Find the first and last entries of file2 within a range of 1 second
            a = bisect.bisect_right(file2Entries, (time - timeRange,))
            b = bisect.bisect_left(file2Entries, (time + timeRange,))
            # Loop on the possible matches (those within the given range)
            for entry in file2Entries[a:b]:
                if entry[1][2] == row[4]:
                    outputFile.write(','.join(row + [entry[1][3]]))
                    outputFile.write('\n')
                    break;
            else:
                # We haven't found a match in file2
                outputFile.write(','.join(row + ['NOACT']))
                outputFile.write('\n')

Você pode salvá-lo como compare.py e executá-lo como:

./compare.py input1.txt input2.txt output.txt

Note que não estou totalmente certo sobre a exatidão do intervalo extraído, você provavelmente deve ter certeza de que funciona corretamente.

UPDATE : as linhas do segundo arquivo de entrada que não contêm um registro de data e hora válido são ignoradas, enquanto as linhas do primeiro arquivo de entrada que não contêm um registro de data e hora válido são copiadas para o arquivo de saída com NOACT anexado no final

    
por 19.05.2015 / 18:55
0

O novo script editado está ok, mas para em uma nova linha
aqui está o erro

root@localhost:~/python# ./compare.py input1.txt input2.txt output.txt Traceback (most recent call last): File "./compare.py", line 46, in <module> for row in csvReader: File "/usr/lib/python3.2/codecs.py", line 300, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 2099: invalid start byte''

Eu procurei a próxima linha no input1 e é semelhante aos outros, então tentei resolver o problema e achei isso:

link

Eu mudei o script para

#!/usr/bin/env python3

    import sys
    import csv
    import bisect
    import re

timeRange = 1 # seconds

# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')

# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
# We add hours and minutes in case they are missinge
# We reverse the list
# We only take the first three elements of the list anyway
(s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
result = float(h) * 3600 + float(m) * 60 + float(s)
return result

if (len(sys.argv) != 4):
    print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
exit(1)

inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]

# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2, encoding="utf8") as inputFile2:
    csvReader = csv.reader(inputFile2)
    for row in csvReader:

        if len(row) == 4:
            if not timestampRegex.match(row[0]):
                continue
            time = timeToFloat(row[0])
            file2Entries.append((time, row))

file1Entries = []
with open(inputFileName1, encoding="utf8") as inputFile1, open(outputFileName, 'w', encoding="utf8") as outputFile:
csvReader = csv.reader(inputFile1)
# For each row in file1 we look for a match among the entries of file2
for row in csvReader:
        if len(row) == 11:
            if not timestampRegex.match(row[0]):
                # We can't possibly find a match for this line
                outputFile.write(','.join(row + ['NOACT']))
                outputFile.write('\n')
                continue
            time = timeToFloat(row[0])
            # Find the first and last entries of file2 within a range of 1 second
            a = bisect.bisect_right(file2Entries, (time - timeRange,))
            b = bisect.bisect_left(file2Entries, (time + timeRange,))
            # Loop on the possible matches (those within the given range)
            for entry in file2Entries[a:b]:
                if entry[1][2] == row[4]:
                    outputFile.write(','.join(row + [entry[1][3]]))
                    outputFile.write('\n')
                    break;
            else:
                # We haven't found a match in file2
                outputFile.write(','.join(row + ['NOACT']))
                outputFile.write('\n')

mas continua com algum erro

    
por 21.05.2015 / 16:27