Why am I unable to mirror a site (using wget)?

3

I tried running wget --mirror http://tshepang.net/, but it only retrieves a single page, "tshepang.net/index.html". Is this a bug in wget?

Here is the output, using the --debug option:

DEBUG output created by Wget 1.12 on linux-gnu.

Enqueuing http://tshepang.net/ at depth 0
Queue count 1, maxcount 1.
[IRI Enqueuing 'http://tshepang.net/' with None
Dequeuing http://tshepang.net/ at depth 0
Queue count 0, maxcount 1.
--2011-01-15 12:32:51--  http://tshepang.net/
Resolving tshepang.net... 66.216.125.32
Caching tshepang.net => 66.216.125.32
Connecting to tshepang.net|66.216.125.32|:80... connected.
Created socket 4.
Releasing 0x089e2be0 (new refcount 1).

---request begin---
GET / HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: tshepang.net
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 302 Found
Server: nginx/0.7.65
Date: Sat, 15 Jan 2011 10:33:45 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Status: 302 Found
Location: http://posterous.com/sso/verify/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F
X-Runtime: 3
Set-Cookie: cookies_enabled=true; path=/
Cache-Control: no-cache
Content-Length: 141
X-Varnish: 419207385
Age: 0
Via: 1.1 varnish
X-Cache: MISS

---response end---
302 Found

Stored cookie tshepang.net -1 (ANY) / <session> <insecure> [expiry none] cookies_enabled true
Registered socket 4 for persistent reuse.
Location: http://posterous.com/sso/verify/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F [following]
Skipping 141 bytes of body: [<html><body>You are being <a href="http://posterous.com/sso/verify/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F">redirected</a>.</body></html>] done.
--2011-01-15 12:32:52--  http://posterous.com/sso/verify/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F
conaddr is: 66.216.125.32
Resolving posterous.com... 184.106.20.99
Caching posterous.com => 184.106.20.99
Releasing 0x089e3e20 (new refcount 1).
Found posterous.com in host_name_addresses_map (0x89e3e20)
Connecting to posterous.com|184.106.20.99|:80... connected.
Created socket 5.
Releasing 0x089e3e20 (new refcount 1).

---request begin---
GET /sso/verify/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: posterous.com
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 302 Found
Server: nginx/0.7.65
Date: Sat, 15 Jan 2011 10:33:46 GMT
Content-Type: text/html; charset=utf-8
Connection: close
Status: 302 Found
Location: http://tshepang.net/sso/recovery/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F
X-Runtime: 7
Set-Cookie: _sharebymail_session_id=296a636c8ed3cb6e4e7cabb10256008a; domain=.posterous.com; path=/; HttpOnly
Cache-Control: no-cache
Content-Length: 142
X-Varnish: 2019529137
Age: 0
Via: 1.1 varnish
X-Cache: MISS

---response end---
302 Found
cdm: 1 2
Stored cookie posterous.com -1 (ANY) / <session> <insecure> [expiry none] _sharebymail_session_id 296a636c8ed3cb6e4e7cabb10256008a
Location: http://tshepang.net/sso/recovery/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F [following]
Closed fd 5
--2011-01-15 12:32:53--  http://tshepang.net/sso/recovery/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F
Reusing existing connection to tshepang.net:80.
Reusing fd 4.

---request begin---
GET /sso/recovery/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: tshepang.net
Connection: Keep-Alive
Cookie: cookies_enabled=true

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 302 Found
Server: nginx/0.7.65
Date: Sat, 15 Jan 2011 10:33:46 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Status: 302 Found
Location: http://tshepang.net/
X-Runtime: 5
Set-Cookie: _sharebymail_session_id=cab0227db8c38f17e572984ee188dc5e; domain=tshepang.net; path=/; HttpOnly
Cache-Control: no-cache
Content-Length: 86
X-Varnish: 419207606
Age: 0
Via: 1.1 varnish
X-Cache: MISS

---response end---
302 Found
cdm: 1 2
Stored cookie tshepang.net -1 (ANY) / <session> <insecure> [expiry none] _sharebymail_session_id cab0227db8c38f17e572984ee188dc5e
Location: http://tshepang.net/ [following]
Skipping 86 bytes of body: [<html><body>You are being <a href="http://tshepang.net/">redirected</a>.</body></html>] done.
--2011-01-15 12:32:54--  http://tshepang.net/
Reusing existing connection to tshepang.net:80.
Reusing fd 4.

---request begin---
GET / HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: tshepang.net
Connection: Keep-Alive
Cookie: _sharebymail_session_id=cab0227db8c38f17e572984ee188dc5e; cookies_enabled=true

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 200 OK
Server: nginx/0.7.65
Date: Sat, 15 Jan 2011 10:33:49 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Status: 200 OK
ETag: "6ec7aeb4e15e3a80e733f7c2b5e00d6f"
X-Runtime: 1680
Cache-Control: private, max-age=0, must-revalidate
Content-Length: 66513
X-Varnish: 419207692
Age: 0
Via: 1.1 varnish
X-Cache: MISS

---response end---
200 OK
Length: 66513 (65K) [text/html]
Saving to: 'tshepang.net/index.html'

     0K .......... .......... .......... .......... .......... 76% 25.7K 1s
    50K .......... ....                                       100% 39.3K=2.3s

2011-01-15 12:32:58 (27.9 KB/s) - 'tshepang.net/index.html' saved [66513/66513]

Deciding whether to enqueue "http://tshepang.net/".
Already on the black list.
Decided NOT to load it.
Redirection "http://tshepang.net/" failed the test.
FINISHED --2011-01-15 12:32:58--
Downloaded: 1 files, 65K in 2.3s (27.9 KB/s)
    
by Tshepang 15.01.2011 / 10:01

3 answers

5

The --no-cookies option helped (thanks to wag):

It seems like all the redirection caused wget to interrupt the request. Try with --no-cookies.

This was determined by reading the attached log.
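
One way to read the log is to pull out only the redirects that wget followed. The sketch below is an illustration rather than part of the original answer: the helper name `redirect_chain` is hypothetical, and the sample lines are copied from the debug output in the question. It suggests that the chain ends at the starting URL, which wget then declines to enqueue again ("Already on the black list").

```shell
#!/bin/sh
# redirect_chain (hypothetical helper): read a wget log on stdin and print
# each URL that wget followed as a redirect, one per line.
redirect_chain() {
    sed -n 's/^Location: \(.*\) \[following]$/\1/p'
}

# Sample lines copied from the debug log in the question.
chain=$(redirect_chain <<'EOF'
Location: http://posterous.com/sso/verify/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F [following]
Location: http://tshepang.net/sso/recovery/2d35d71b1e728dc99f3c153eaf6f8fa0?jumpto=%2F [following]
Location: http://tshepang.net/ [following]
Redirection "http://tshepang.net/" failed the test.
EOF
)

# Count the hops and show where the chain ends.
hops=$(printf '%s\n' "$chain" | wc -l | tr -d ' ')
last=$(printf '%s\n' "$chain" | tail -n 1)

printf '%s\n' "$chain"
echo "hops: $hops"
echo "final URL: $last"
```

For a real run, the same filter could be fed from a saved log, for example `wget --debug --mirror http://tshepang.net/ 2> wget.log` followed by `redirect_chain < wget.log`. The retry suggested in this answer is `wget --mirror --no-cookies http://tshepang.net/`.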

    
by 15.01.2011 / 15:26
1

Assuming wget is in your path (if it is not, you will need to type the full path), run the following commands:

mkdir wget_files
cd wget_files
wget --mirror --wait=2 --page-requisites --html-extension --convert-links --directory-prefix wget_files/example1 http://www.yourdomain.com
    
by 15.01.2011 / 11:35
-1

You also need to set -r for recursive retrieval and -l X for the link depth, where X is an integer. It is also a good idea to set -A to define the list of acceptable file types to keep (otherwise you only get HTML files).
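
As a rough sketch of the flags named here: the depth and the accept list below are assumed values chosen for illustration, and the URL is the site from the question. Note that, according to the wget manual, `--mirror` already enables recursion with infinite depth (it is equivalent to `-r -N -l inf --no-remove-listing`), so these flags matter mainly when `--mirror` is not used. The script only prints the command instead of running it.

```shell
#!/bin/sh
# Dry run: build the wget command and print it without executing it.
url="http://tshepang.net/"     # site from the question
depth=2                        # assumed link depth (the X in -l X)
accept="html,css,png,jpg"      # assumed list of file suffixes to keep

# -r: recursive retrieval
# -l: maximum recursion depth
# -A: comma-separated list of accepted file name suffixes or patterns
cmd="wget -r -l $depth -A $accept $url"
echo "$cmd"
```

Running the printed line in a terminal would start the actual download.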

    
by 17.01.2011 / 03:19
