Wget crawling problem


I read that, to crawl an entire site, this command should work:

wget --spider -r https://wikipedia.org/

But my question is: why does this same command, which crawls other sites, not work on Wikipedia?

My goal is not to crawl all of Wikipedia, but to understand the difference.

This is the output of the command:

Spider mode enabled. Check if remote file exists.
--2016-08-31 17:53:56--  http://wikipedia.org/
Resolving wikipedia.org (wikipedia.org)... 91.198.174.192, 2620:0:862:ed1a::1
Connecting to wikipedia.org (wikipedia.org)|91.198.174.192|:80... connected.
HTTP request sent, awaiting response... 301 TLS Redirect
Location: https://wikipedia.org/ [following]
Spider mode enabled. Check if remote file exists.
--2016-08-31 17:53:56--  https://wikipedia.org/
Connecting to wikipedia.org (wikipedia.org)|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.wikipedia.org/ [following]
Spider mode enabled. Check if remote file exists.
--2016-08-31 17:53:56--  https://www.wikipedia.org/
Resolving www.wikipedia.org (www.wikipedia.org)... 91.198.174.192, 2620:0:862:ed1a::1
Connecting to www.wikipedia.org (www.wikipedia.org)|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2016-08-31 17:53:56--  https://www.wikipedia.org/
Reusing existing connection to www.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘wikipedia.org/index.html’

    [ <=>                                                                                                                                                                                                                                   ] 81 292      --.-K/s   in 0,03s   

2016-08-31 17:53:57 (2,44 MB/s) - ‘wikipedia.org/index.html’ saved [81292]

Removing wikipedia.org/index.html.

Found no broken links.

FINISHED --2016-08-31 17:53:57--
Total wall clock time: 0,2s
Downloaded: 1 files, 79K in 0,03s (2,44 MB/s)
    
by 4m1nh4j1 31.08.2016 / 17:57

1 answer


This is covered in the FAQs (for both wget and Wikipedia):

By default, Wget plays the role of a web-spider that plays nice, and obeys a site's robots.txt file and no-follow attributes.

On 18 January 2005 the Google blog entry "Preventing comment spam" declared that Google would henceforth respect a rel="nofollow" attribute on hyperlinks. Their page ranking algorithm now ignores links with this attribute when ranking the destination page. The intended result is that site administrators can modify user-posted links such that the attribute is present, and thus an attempt to googlebomb by posting a link on such a site would yield no increase from that link.

The point is that Wikipedia has configured its site to discourage you from doing exactly this: its robots.txt and nofollow attributes tell polite spiders such as wget not to follow links recursively, so wget fetches the front page, finds nothing it is allowed to follow, and stops.
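The decision wget makes can be illustrated with Python's standard-library robots.txt parser. The rules below are a simplified, hypothetical stand-in in the spirit of Wikipedia's real (much longer) robots.txt, used only to show how a polite spider decides what to skip. (Wget does have an `-e robots=off` switch to bypass this, but crawling a site against its stated wishes is discouraged.)

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, simplified rules in the spirit of Wikipedia's robots.txt;
# NOT a copy of the real file. "Disallow: /w/" blocks the script paths.
rules = """\
User-agent: *
Disallow: /w/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A path under /w/ is disallowed for every user agent...
print(parser.can_fetch("Wget", "https://en.wikipedia.org/w/index.php"))    # False
# ...while an ordinary article path is allowed.
print(parser.can_fetch("Wget", "https://en.wikipedia.org/wiki/Main_Page"))  # True
```

A spider that honors the protocol runs exactly this kind of check before following each discovered link, which is why a recursive crawl can stall at the first page even though the page itself downloads fine.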

    
by 01.09.2016 / 01:34
