Compare the similarity or distance between every pair of lines within a file?

0

I would like to find the most similar pair of lines in a file, using something like Levenshtein distance. For example, given a file along the lines of:

What is your favorite color?
What is your favorite food?
Who was the 8th president?
Who was the 9th president?

… it would return lines 3 & 4 as the most similar pair of lines.

Ideally, I would like to be able to compute the X most similar pairs of lines. So, using the example above, the second most similar pair would be lines 1 & 2.

    
by Matt V. 25.07.2017 / 00:46

1 answer

2

I wasn't familiar with Levenshtein distances, but Perl has a module for computing them (Text::Levenshtein), so I wrote a simple Perl script that computes the distance for every pair of lines in the input and then prints the pairs in order of increasing distance, subject to a "top X" parameter (N):

#!/usr/bin/perl -w
use strict;
use Text::Levenshtein qw(distance);
use Getopt::Std;

our $opt_n;
getopts('n:');
$opt_n ||= -1; # print all the matches if -n is not provided

my @lines=<>;
my %distances = ();

# for each combination of two lines, compute distance
foreach(my $i=0; $i <= $#lines - 1; $i++) {
  foreach(my $j=$i + 1; $j <= $#lines; $j++) {
        my $d = distance($lines[$i], $lines[$j]);
        push @{ $distances{$d} }, $lines[$i] . $lines[$j];
  }
}

# print in order of increasing distance
foreach my $d (sort { $a <=> $b } keys %distances) {
  print "At distance $d:\n" . join("\n", @{ $distances{$d} }) . "\n";
  last unless --$opt_n;
}

On the sample input, it gives:

$ ./solve.pl < input
At distance 1:
Who was the 8th president?
Who was the 9th president?

At distance 3:
What is your favorite color?
What is your favorite food?

At distance 21:
What is your favorite color?
Who was the 8th president?
What is your favorite color?
Who was the 9th president?
What is your favorite food?
Who was the 8th president?
What is your favorite food?
Who was the 9th president?

and, demonstrating the optional parameter:

$ ./solve.pl -n 2 < input
At distance 1:
Who was the 8th president?
Who was the 9th president?

At distance 3:
What is your favorite color?
What is your favorite food?

I wasn't sure how to print the output unambiguously, but the strings are there to be printed however you like.
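One possible way to make the output unambiguous is to carry the line numbers along with each pair and report them next to the text. The following is only a sketch, not the script above: it swaps Text::Levenshtein for a small pure-Perl dynamic-programming routine so that it runs with core modules alone, and the names levenshtein, ranked_pairs, @sample and $top are hypothetical, introduced here for illustration. The sample lines are hard-coded instead of being read from a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Pure-Perl Levenshtein distance (stand-in for Text::Levenshtein::distance).
# Keeps only the previous row of the dynamic-programming table.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));      # distance from "" to each prefix of $t
    for my $i (1 .. length($s)) {
        my @cur = ($i);                # distance from a prefix of $s to ""
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,           # deletion
                           $cur[$j - 1] + 1,        # insertion
                           $prev[$j - 1] + $cost);  # substitution or match
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return [distance, line_no_1, line_no_2] for every pair of lines,
# closest pairs first. Line numbers are 1-based; ties are broken by
# line number so the order is deterministic.
sub ranked_pairs {
    my @lines = @_;
    my @pairs;
    for my $i (0 .. $#lines - 1) {
        for my $j ($i + 1 .. $#lines) {
            push @pairs, [ levenshtein($lines[$i], $lines[$j]), $i + 1, $j + 1 ];
        }
    }
    return sort {
        $a->[0] <=> $b->[0] or $a->[1] <=> $b->[1] or $a->[2] <=> $b->[2]
    } @pairs;
}

# Sample data from the question (hard-coded for this sketch).
my @sample = (
    'What is your favorite color?',
    'What is your favorite food?',
    'Who was the 8th president?',
    'Who was the 9th president?',
);

my $top    = 2;                        # hypothetical "top X" setting
my @ranked = ranked_pairs(@sample);
my $last   = min($top, scalar @ranked) - 1;

for my $p (@ranked[0 .. $last]) {
    my ($d, $i, $j) = @$p;
    print "distance $d: line $i and line $j\n";
    print "  $i: $sample[$i - 1]\n";
    print "  $j: $sample[$j - 1]\n";
}
```

With the four sample lines this reports lines 3 and 4 at distance 1 first, then lines 1 and 2 at distance 3, which agrees with the output shown above. Note that this sketch ranks individual pairs rather than grouping them by distance, so "top 2" here means the two closest pairs, not the two smallest distances.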

    
by 25.07.2017 / 03:05