Eu pensei que haveria uma maneira de fazer isso facilmente com o wget / curl também, mas também não consegui fazer nada funcionar. Você pode usar essa jóia Ruby, anêmona , para fazê-lo com bastante facilidade.
Instalando gema anêmona
% gem install anemone
Fetching: robotex-1.0.0.gem (100%)
Fetching: anemone-0.7.2.gem (100%)
Successfully installed robotex-1.0.0
Successfully installed anemone-0.7.2
2 gems installed
Installing ri documentation for robotex-1.0.0...
Installing ri documentation for anemone-0.7.2...
Installing RDoc documentation for robotex-1.0.0...
Installing RDoc documentation for anemone-0.7.2...
Exemplo de script de anêmona
#! /usr/bin/env ruby
require 'anemone'
Anemone.crawl("https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/") do |anemone|
anemone.on_every_page do |page|
puts page.url
end
end
Exemplo de execução
% ./anemone.rb | grep -v '?C='
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/README_BCR.txt
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/README_MAF.txt
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/acc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/brca/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/blca/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/cesc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/cntl/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/dlbc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/coad/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/esca/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/gbm/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/hnsc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/kich/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/kirc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/kirp/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lcll/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/laml/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lcml/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lihc/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lgg/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lnnh/
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/lost+found/
...
...
NOTA: O bit grep -v '?C='
está filtrando os cabeçalhos padrão que o Apache está gerando através de sua diretiva de Indexação, ou seja:
IndexOptions FancyIndexing VersionSort NameWidth=* HTMLTable
Estes permitem ordenar as páginas pelas diferentes colunas (Nome, Criar Data, etc.). Eles aparecem como páginas e eu estou apenas filtrando-os da saída.