Strategy for extracting movie names from this non-uniform dataset?

Question

6

I'm working on a movie database problem to improve my regular expression skills. This is the problem I'm facing. My dataset looks like this:

Movie Name (variable spaces and tabs) Year
Movie1(can have spaces or multiple spaces between them)(variable spaces and tabs: could be \t+, multiple spaces, or a single space) Year1
Movie2(can have spaces or multiple spaces between them)(variable spaces and tabs: could be \t+, multiple spaces, or a single space) Year2
Movie3(can have spaces or multiple spaces between them)(variable spaces and tabs: could be \t+, multiple spaces, or a single space) Year3
Movie4(can have spaces or multiple spaces between them)(variable spaces and tabs: could be \t+, multiple spaces, or a single space) Year4

I want to extract the names of all the movies. These are the challenges I'm facing in doing so:

1: The delimiter is variable. If it were a colon or something unique, I would have used an awk command to extract the names, like this: awk -F 'separator' '{print $1}'
In this case, it can be a single space, two or more spaces, or a combination of \t and spaces.

2: For rows where the delimiter is \t, I can split on \t, because it does not occur in movie names. But what if the delimiter is one space or two spaces? Those can easily appear in a movie's name. In those cases, I don't know what to do.

I know the question is very narrow and specific. But as I described earlier, I'm really stuck here. I can't think of any way around this problem.

Is there some combination of grep / sed / awk with regular expressions that can be used to achieve this?

bash grep awk sed regular-expression

by Dude, 04.07.2014 / 18:59

5 answers

6

bash:

while read -r line; do
    if [[ $line =~ (.*)[[:blank:]]+[0-9]{4}$ ]]; then
        echo "${BASH_REMATCH[1]}"
    fi
done < data

sed:

sed 's/[[:blank:]]\+[0-9]\{4\}$//' < data
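
A note on portability: \+ in a basic regular expression is a GNU extension, so the sed command above may not behave the same with other sed implementations. Below is a minimal sketch of a variant that sticks to POSIX syntax and also tolerates stray blanks after the year. strip_year is a hypothetical helper name, and the sample titles are made up for illustration:

```shell
# strip_year is a hypothetical helper name used only for this sketch.
# It reads lines on stdin and prints them without the trailing
# "<blanks><4 digits><optional blanks>" suffix.
strip_year() {
    # [[:blank:]][[:blank:]]* is the portable BRE spelling of "one or more
    # blanks"; the final [[:blank:]]* tolerates whitespace after the year.
    sed 's/[[:blank:]][[:blank:]]*[0-9]\{4\}[[:blank:]]*$//'
}

# Sample input (invented): two spaces in one line, tabs in the other.
printf 'Casablanca  1942\nBlade Runner 2049\t\t2017\n' | strip_year
```

Because the pattern is anchored at the end of the line, a number inside the title (the 2049 in the second sample line) is left alone; only the final year and the blanks around it are removed.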

answered 04.07.2014 / 19:54

5

This is actually quite simple. As long as the last field, the year, never contains any whitespace (that isn't clear from your question, but I'm assuming it's the case), all you need to do is remove the last field. For example:

$ cat movies
Casablanca  1942
Eternal Sunshine        of the Spotless Mind            2004
He Died with a Felafel in His Hand                       2001
The Blues Brothers 1980

Then, if you want to print only the title, you can use:

$ perl -lpe 's/[^\s]+$//' movies
Casablanca  
Eternal Sunshine        of the Spotless Mind            
He Died with a Felafel in His Hand                       
The Blues Brothers 

$ sed 's/[^ \t]*$//' movies 
Casablanca  
Eternal Sunshine        of the Spotless Mind            
He Died with a Felafel in His Hand                       
The Blues Brothers

or, to also collapse the whitespace within the titles:

$ sed -r 's/[\t ]+/ /g;s/[^ \t]*$//' movies 
Casablanca 
Eternal Sunshine of the Spotless Mind 
He Died with a Felafel in His Hand 
The Blues Brothers 

$ perl -lpe 's/\s+/ /g;s/[^\s]+$//' movies
Casablanca 
Eternal Sunshine of the Spotless Mind 
He Died with a Felafel in His Hand 
The Blues Brothers 

$ awk '{for(i=1;i<NF-1;i++){printf "%s ",$i} print $(NF-1)}' movies
Casablanca 
Eternal Sunshine of the Spotless Mind 
He Died with a Felafel in His Hand 
The Blues Brothers

If the year always has 4 digits, you can use

$ perl -lpe 's/....$//' movies 
Casablanca  
Eternal Sunshine        of the Spotless Mind            
He Died with a Felafel in His Hand                       
The Blues Brothers 

or

$ perl -lpe 's/\s+/ /g;s/....$//' movies 
Casablanca 
Eternal Sunshine of the Spotless Mind 
He Died with a Felafel in His Hand 
The Blues Brothers

or

$ while read -r line; do echo ${line%%????}; done < movies
Casablanca 
Eternal Sunshine of the Spotless Mind 
He Died with a Felafel in His Hand 
The Blues Brothers
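
If the goal is to drop the last field without touching the spacing inside the title, and without leaving trailing blanks behind, something along these lines might work. It is only a sketch under the same assumption as above (the year is the last whitespace-free field); drop_last_field is a hypothetical name:

```shell
# drop_last_field is a hypothetical helper name used only for this sketch.
# It reads lines on stdin and prints each one without its last field.
drop_last_field() {
    awk '{
        # Remove the blanks that separate the title from the last field,
        # the last field itself (the year), and any blanks after it.
        # Whitespace inside the title is left exactly as it was.
        sub(/[ \t]+[^ \t]+[ \t]*$/, "")
        print
    }'
}

# Sample line taken from the movies file shown earlier in this answer.
printf 'Eternal Sunshine        of the Spotless Mind            2004\n' | drop_last_field
```

A line that consists of a single field has no separating blank, so the substitution does not match and the line is printed unchanged.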

answered 05.07.2014 / 10:21

3

I assume the movie data looks like the following.

cat movies
one flew over the cuckoo's nest          1975
taxi driver      1976
the shining    1980

I also assume that the year is always the last 4 characters of each line.

So, if you use the command below,

 awk '{ gsub (" ", "", $0); print}' movies | rev | cut -c 5- | rev

the output would be,

oneflewoverthecuckoo'snest
taxidriver
theshining

EDIT:

However, a better approach would be,

rev movies | cut -c5- | rev
one flew over the cuckoo's nest          
taxi driver      
the shining

Of course, I'm assuming the year in your data is always 4 characters long. If it always has the same number of characters, you can go with the second approach, since it preserves the spaces in the movie names.
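
Since cutting a fixed number of characters silently damages any line that breaks that assumption, it may be worth checking the data first. Here is a rough sketch (lines_without_year is a hypothetical name, and the sample lines are invented) that lists the lines not ending in a blank followed by exactly four digits:

```shell
# lines_without_year is a hypothetical helper name used only for this sketch.
# It prints, prefixed with its line number, every input line that does NOT
# end in a blank followed by exactly four digits.
lines_without_year() {
    grep -n -v -E '[[:blank:]][0-9]{4}$'
}

# Invented sample: line 2 has a trailing space, line 3 has a 2-digit year.
printf 'taxi driver      1976\nthe shining    1980 \nalien 79\n' | lines_without_year
```

If this prints nothing, every line ends the way rev | cut -c5- | rev expects. Note that grep exits with a non-zero status when it selects no lines, which matters if the check runs under set -e.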

answered 04.07.2014 / 19:44

1

This should remove the trailing digits and the tabs and spaces before them:

sed -e 's#[\t ]*[0-9]*$##' movies.txt

answered 04.07.2014 / 21:39

score 7 · Accepted Answer

Using gawk, and assuming the year always ends the record:

awk -F"[0-9]{4}$" '{print $1}' movies
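
The field separator above leaves the blanks between the title and the year at the end of $1, and older awk implementations may not support the {4} interval expression. A possible variant, offered as a sketch rather than a drop-in replacement (title_field is a hypothetical name, and the sample lines are invented):

```shell
# title_field is a hypothetical helper name used only for this sketch.
# It reads lines on stdin and prints the part before the trailing year.
title_field() {
    # The field separator swallows the blanks before the year, the four
    # digits themselves, and any blanks after them, so $1 is the bare title.
    # Four explicit [0-9] classes are used instead of the {4} interval.
    awk -F'[ \t]+[0-9][0-9][0-9][0-9][ \t]*$' '{print $1}'
}

# Invented sample: a single space in one line, mixed tabs and spaces in the other.
printf 'The Blues Brothers 1980\nCasablanca\t \t1942\n' | title_field
```

As with the original command, this relies on the year being the last thing on the line; a line without a trailing year is printed whole, because the separator never matches.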