How can I parse a YouTube URL?

3

How do I extract just

http://www.youtube.com/watch?v=qdRaf3-OEh4

from a URL such as

http://www.youtube.com/watch?v=qdRaf3-OEh4&playnext=1&list=PL4367CEDBC117AEC6&feature=results_main

I am only interested in the "v" parameter.

    
by Hendré 18.12.2012 / 20:56

2 answers

13

Update:

The best ones would be:

sed 's/^.\+\(\/\|\&\|\?\)v=\([^\&]*\).*/\2/'
awk 'match($0,/((\/|&|\?)v=)([^&]*)/,x){print x[3]}'
grep -Po '(?<=(\/|&|\?)v=)[^&]*'
# Meaning: match /, & or ?, then v=

RFC 3986 states:

   URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

   query         = *( pchar / "/" / "?" )
   fragment      = *( pchar / "/" / "?" )

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="
   …
 

So, since a fragment may contain the same characters as a query (including "&v="), to be safe put:

 | sed 's/#.*//' |    (removes the #fragment part)

in front.

That is,

| sed 's/#.*//' | grep -Po '(?<=(\/|&|\?)v=)[^&]*'
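The same precaution (drop the #fragment first, then look for v= only inside the query) can also be sketched in plain POSIX shell, for systems where grep -P is not available. This is only an illustration: the function name extract_v is made up here and is not part of the answer above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "v" query parameter, or nothing.
extract_v() {
    url=${1%%\#*}                   # drop the #fragment, if any
    case $url in
        *\?*) query=${url#*\?} ;;   # keep only what follows the first "?"
        *)    return 1 ;;           # no query string at all
    esac
    # Walk the "&"-separated key=value pairs.
    while [ -n "$query" ]; do
        pair=${query%%&*}
        case $pair in
            v=*) printf '%s\n' "${pair#v=}"; return 0 ;;
        esac
        case $query in
            *\&*) query=${query#*&} ;;   # move on to the next pair
            *)    query= ;;              # that was the last pair
        esac
    done
    return 1
}

extract_v 'http://www.youtube.com/watch?v=qdRaf3-OEh4&playnext=1&list=PL4367CEDBC117AEC6&feature=results_main'
```

Unlike the regex versions, this only accepts a key that is exactly "v", so a parameter such as "rev=..." is not picked up by mistake.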

SED (2):

echo 'http://www.youtube.com/watch?v=qdRaf3-OEh4&playnext=1&list=PL4367CEDBC117AEC6&feature=results_main' \
| sed 's/^.\+\Wv=\([^\&]*\).*/\1/'

Explanation:


's       
/…/…/    /THIS/WITH THIS/

'substitute/MATCH 0 or MORE THINGS and GROUP them in ()/WITH THIS/

+-------------------------- s    _s_ubstitute
|+------------------------- /    START MATCH
||                    +---- /    END MATCH
||                    |+---- \1  REPLACE WITH - \1 == group 1, i.e. the first set of ().
||                    || +-- /   End of SUBSTITUTE
s/^.\+\Wv=\([^\&]*\).*/\1/'
  +++-+-+-+-+-----+-+------- ^        Match from beginning of line
   ++-+-+-+-+-----+-+------- .        Match any character
    +-+-+-+-+-----+-+------- \+       One or more times (greedy)
      +-+-+-+-----+-+------- \W       Non-word-character
        +-+-+-----+-+------- v=       Literally match "v="
          +-+-----+-+------- \(       Start MATCH GROUP
            +-----+-+------- [^\&]*   Match any character BUT & - as many as possible
                  +-+------- \)       End MATCH GROUP
                    +------- .*       Match anything; *As many times as possible 
                                      - aka to end of line; as there is no 

         [abc]  would match a OR b OR c
         [abc]* would match a AND/OR b AND/OR c - as many times as possible
         [^abc] would match anything BUT a,b or c

/\1/   Replace ENTIRE match with MATCH GROUP number 1.
         That is everything between \( and \), which is anything but "&"
         after the literal string "v=", which in turn has a non-word character in
         front of it.

         That also means that no match means no substitution, which ultimately results in
         no change.

Result: qdRaf3-OEh4

Note: If there is no match, the entire string is returned unchanged.
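If empty output is preferred when nothing matches, a common sed idiom is -n together with the p flag, so that only substituted lines are printed. A minimal sketch, assuming a POSIX sed (it uses [?&] instead of the GNU-only \W and \+); the function name v_or_nothing is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the "v" value, or nothing at all if there is none.
v_or_nothing() {
    printf '%s\n' "$1" |
        sed -n -e 's/#.*//' \
               -e 's/^.*[?&]v=\([^&]*\).*$/\1/p'
    # -n       suppress the default printing of every line
    # 1st -e   drop the #fragment part
    # 2nd -e   keep only group 1; the trailing p prints the line only
    #          when the substitution actually happened
}

v_or_nothing 'http://www.youtube.com/watch?v=qdRaf3-OEh4&playnext=1'
v_or_nothing 'http://www.youtube.com/'    # prints nothing
```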

(G) AWK:

echo 'http://www.youtube.com/watch?v=qdRaf3-OEh4&playnext=1&list=PL4367CEDBC117AEC6&feature=results_main' \
| awk 'match($0,/(\Wv=)([^&]*)/,v){print v[2]}'

Result: qdRaf3-OEh4

Explanation:

In Awk, match(string, regexp) is a function that searches for the longest, leftmost match of regexp in string. Here I have used an extension that comes with Gawk (see Awk, GAwk, MAwk etc.): a third argument, an array that receives the individual matches, that is, whatever is between the parentheses.

The pattern is quite similar to the Perl/grep one below.


  +-------------------------------------- Built in function
  |    +--------------------------------- Entire input ($1 would have been field 1)
  |    |                                  etc. (Using default delimiters " "*)
  |    |
  |    |
  |    |  (....)(....) ------------------ Places \Wv= in group 1 and [^&]* in group 2.
match($0, /(\Wv=)([^&]*)/, v){print v[2]}
                           |   |    | |
                           |   |    +-+---- Use "v" from /, v; v is a user defined name
                           |   |      +---- 2 specifies index in v, which is group from
                           |   |            what is between ()'s in /…/
                           |   |
                           |   +----------- print is a built-in statement.
                           +--------------- Array name (holds the groups) that one can use in print.
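The three-argument match() is specific to Gawk. For other awks, one possible alternative is to split the query on "?" and "&" and look for the pair whose key is "v". This is a sketch under that assumption, not the method used above; the function name extract_v_awk is made up:

```shell
#!/bin/sh
# Hypothetical helper using only POSIX awk features (no Gawk extensions).
extract_v_awk() {
    printf '%s\n' "$1" | awk '{
        sub(/#.*/, "")                      # drop the #fragment part
        n = split($0, part, "[?&]")         # part[1] is everything before "?"
        for (i = 2; i <= n; i++)
            if (index(part[i], "v=") == 1) {   # the pair starts with "v="
                print substr(part[i], 3)       # everything after "v="
                exit
            }
    }'
}

extract_v_awk 'http://www.youtube.com/watch?v=qdRaf3-OEh4&playnext=1&list=PL4367CEDBC117AEC6&feature=results_main'
```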



GREP (using Perl-compatible regex):

echo 'http://www.youtube.com/watch?v=qdRaf3-OEh4&playnext=1&list=PL4367CEDBC117AEC6&feature=results_main' | \
grep -Po '(?<=\Wv=)[^&]*'

Result: qdRaf3-OEh4

Explanation:


-P  Use Perl-compatible regular expressions.
-o  Only print match of the expression.
    - That means: Of our pattern only print/return what it matches.
    If nothing matches; return nothing.

          +------- ^    Negate match, i.e. do not match (ONLY when it is FIRST between [])
          |+------ &    A literal "&" character
          || 
(?<=\Wv=)[^&]*
|   | |  |  ||
|   | |  |  |+---- *     Greedy; as many times as possible.
|   | |  +--+----- []    Wild order/any order of what is inside []
|   | +----------- v=    Literal v=
|   +------------- \W    Non Word character
+----------------- (?<=  What follows should be (immediately) preceded by.
                    Mnemonic: ? = "huh", < = left, = = equals

So: Match the literal "v=" where "v" is preceded by a non-word character. Then match
anything, as many times as possible, until we reach the end of the line or meet an "&".

As "&" separates the key/value pairs in a URL and cannot appear unescaped inside a value, this should be OK.
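-P depends on grep having been built with PCRE support, so it is not available everywhere. Where it is missing, a rough equivalent without lookbehind is to let grep -oE match the separator and "v=" as well, and then cut them off. This is a sketch, assuming a grep that supports -o (GNU and BSD grep do, although -o is not in POSIX); the function name extract_v_grep is made up:

```shell
#!/bin/sh
# Hypothetical helper: same idea as the grep -Po version, without PCRE.
extract_v_grep() {
    printf '%s\n' "$1" |
        sed 's/#.*//' |                 # drop the #fragment part
        grep -oE '[?&]v=[^&]*' |        # e.g. "?v=qdRaf3-OEh4"
        cut -d= -f2-                    # keep what follows the first "="
}

extract_v_grep 'http://www.youtube.com/watch?v=qdRaf3-OEh4&playnext=1&list=PL4367CEDBC117AEC6&feature=results_main'
```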

    
by Runium 18.12.2012 / 21:35
4
echo 'http://www.youtube.com/watch?v=qdRaf3-OEh4&playnext=1&list=PL4367CEDBC117AEC6&feature=results_main' | sed -e 's/&.*//' -e 's/.*watch?//'

you will get v=qdRaf3-OEh4.

    
by evilsoup 18.12.2012 / 21:07