Remove duplicate entries using bash, awk or sed

How can I search for duplicate data using batch? The goal is to remove the duplicate "Changelist: XXXXX" entries from the data.txt file. I'm a bit stuck; can anyone help me?

Please see output.txt for the desired output.

data.txt

====================================
 Changelist: 808298
 Date: 2015/03/19
 Developer: A
 ShortDescr: Checking in the following graphics:

 CodeReview: 
 CodeReview: Result: @result___
 ====================================
 Changelist: 808273
 Date: 2015/03/19
 Developer: B
 ShortDescr: Hello

 CodeReview: Result: 
 ====================================
 Changelist: 808271
 Date: 2015/03/19
 Developer: C
 ShortDescr: HI

 CodeReview: 
 ====================================
 Changelist: 808298
 Date: 2015/03/19
 Developer: A
 ShortDescr: Checking in the following graphics:

 CodeReview: 
 CodeReview: Result: @result___
 ====================================
 Changelist: 808273
 Date: 2015/03/19
 Developer: B
 ShortDescr: Hello

 CodeReview: Result:  
 ====================================
  Changelist: 808277
 Date: 2015/03/19
 Developer: D
 ShortDescr: HEY

 CodeReview: 
 ====================================

output.txt

====================================
 Changelist: 808298
 Date: 2015/03/19
 Developer: A
 ShortDescr: Checking in the following graphics:

 CodeReview: 
 CodeReview: Result: @result___
 ====================================
 Changelist: 808273
 Date: 2015/03/19
 Developer: B
 ShortDescr: Hello

 CodeReview: Result: 
 ====================================
 Changelist: 808271
 Date: 2015/03/19
 Developer: C
 ShortDescr: HI

 CodeReview: 
 ====================================
  Changelist: 808277
 Date: 2015/03/19
 Developer: D
 ShortDescr: HEY

 CodeReview: 
 ====================================
    
by Mihir 20.03.2015 / 16:19

1 answer

I'm assuming you don't care about whitespace, because your two Changelist: 808273 records are actually different (select the text to see the difference):

  • First:

    CodeReview: Result: 

    one space after the colon

  • Second:

     CodeReview: Result:  

    two spaces after the colon
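
If selecting the text is inconvenient, the trailing blanks can also be made visible from a shell. The snippet below is only a sketch, not part of the script that follows: it assumes a POSIX sed and awk are available, and the variable names (first, second, visible, trimmed) are made up for illustration. The two strings are the CodeReview lines from the 808273 records.

```shell
# Two lines that look identical but differ in their trailing blanks
# (copied from the two 808273 records in the question).
first=' CodeReview: Result: '
second=' CodeReview: Result:  '

# sed's 'l' command prints each line unambiguously and marks its end with '$',
# so the number of blanks before the '$' becomes visible.
visible=$(printf '%s\n%s\n' "$first" "$second" | sed -n 'l')
printf '%s\n' "$visible"

# After stripping trailing whitespace, count how many distinct lines remain.
trimmed=$(printf '%s\n%s\n' "$first" "$second" \
    | sed 's/[[:space:]]*$//' | sort -u | awk 'END { print NR }')
printf 'distinct lines after trimming: %s\n' "$trimmed"
```

Once trailing whitespace is removed the two lines compare equal, which is why trimming comes before the uniqueness check.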

This is a PowerShell script that removes the duplicates from your data:

# Setup input and output files
$InFile = '.\Data.txt'
$OutFile = '.\Output.txt'

# Separator to split records
$Separator = '^=+$'

# Read file to array and trim strings
# https://mjolinor.wordpress.com/2014/01/18/another-take-on-using-the-operator/
$Reader = New-Object -TypeName System.IO.StreamReader -ArgumentList $InFile -ErrorAction Stop
$Data = while(($line = $Reader.ReadLine()) -ne $null){$line.Trim()}
$Reader.Close()
$Reader.Dispose()

# Find start and end indexes of each record
$RecordBounds = 0..($Data.Length-1) | Where-Object {$Data[$_] -match $Separator}

# Split records into multidimensional array
$Records = @()
for ($i=0 ; $i -lt ($RecordBounds.Length-1) ; $i++)
{
    $Records += ,($Data[($RecordBounds[$i]+1)..($RecordBounds[$i+1]-1)])
}

# Get actual separator string to use it in new file
$LiteralSeparator = $Data | Where-Object {$_ -match $Separator} | Select-Object -First 1

# Get only unique records, combine with separators
$Result = ,$LiteralSeparator + ($Records | Select-Object -Unique | ForEach-Object {$_ ; $LiteralSeparator})

# Write result to file
$Result | Out-File -LiteralPath $OutFile -Encoding Default -Force

Example result:

====================================
Changelist: 808298
Date: 2015/03/19
Developer: A
ShortDescr: Checking in the following graphics:

CodeReview:
CodeReview: Result: @result___
====================================
Changelist: 808273
Date: 2015/03/19
Developer: B
ShortDescr: Hello

CodeReview: Result:
====================================
Changelist: 808271
Date: 2015/03/19
Developer: C
ShortDescr: HI

CodeReview:
====================================
Changelist: 808277
Date: 2015/03/19
Developer: D
ShortDescr: HEY

CodeReview:
====================================
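
Since the title asks about bash, awk or sed, here is a rough awk sketch of the same idea. It is an alternative, not a translation of the PowerShell script above, and it rests on a few assumptions: records are delimited by lines consisting only of "=" characters, leading and trailing whitespace is not significant, and anything after the last separator can be dropped. The function name dedupe_records is hypothetical.

```shell
# Print each record once, keeping the first occurrence.
# Hypothetical usage: dedupe_records data.txt > output.txt
dedupe_records() {
    awk '
        # Trim leading and trailing blanks (and a stray CR) from every line.
        { sub(/^[ \t\r]+/, ""); sub(/[ \t\r]+$/, "") }

        # A line made only of "=" closes the previous record and opens the next.
        /^=+$/ {
            if (rec != "" && !(rec in seen)) {
                seen[rec] = 1
                printf "%s\n%s", sep, rec
            }
            sep = $0
            rec = ""
            in_record = 1
            next
        }

        # Any other line is appended to the record being collected.
        in_record { rec = rec $0 "\n" }

        # Finish with a closing separator if the input contained one.
        END { if (in_record) print sep }
    ' "$1"
}
```

The whole trimmed record is used as the lookup key, so two entries count as duplicates only when every line matches. If the Changelist number alone should decide, the key could be taken from that line instead.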
    
answered 20.03.2015 / 17:57