Nutch, Solr and Ubuntu Server 12.04 LTS

2

I'm running Ubuntu Server 12.04 LTS and would like to know which versions of Nutch and Solr are compatible with it.

Any solution, please?

    
by kadija 24.02.2014 / 10:57

1 answer

4

Nutch 1.5 and Solr 3.6.0 are compatible.

HowTo:

1) Install the JDK

sudo apt-get install openjdk-7-jdk

2) Download and unpack Solr

mkdir -p ~/tmp/solr
cd ~/tmp/solr
wget http://mirror.lividpenguin.com/pub/apache/lucene/solr/3.6.0/apache-solr-3.6.0.tgz
tar -xzvf apache-solr-3.6.0.tgz
Solr ships with a bundled Jetty. To try it out, change into apache-solr-3.6.0/example and run java -jar start.jar; press Ctrl-C to shut it down.

While it is running, check http://localhost:8983/solr

3) Download and unpack Nutch

mkdir -p ~/tmp/nutch
cd ~/tmp/nutch
wget http://mirror.rmg.io/apache/nutch/1.5/apache-nutch-1.5-bin.tar.gz
tar -xzvf apache-nutch-1.5-bin.tar.gz

4) Configure Nutch (run these from the unpacked Nutch directory)

chmod +x bin/nutch
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

(On a 64-bit install the directory is java-7-openjdk-amd64 instead of java-7-openjdk-i386.)
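If you are unsure which JVM directory applies to your machine, the path can be derived from the java binary instead of being typed by hand. This is only a sketch: the helper name java_home_from_binary is made up for illustration, and it assumes the Ubuntu OpenJDK layout where java lives under <jvm dir>/jre/bin/java or <jvm dir>/bin/java.

```shell
# Hypothetical helper: strip the trailing /bin/java (and /jre, if present)
# from a fully resolved path to the java binary.
java_home_from_binary() {
  printf '%s\n' "$1" | sed -e 's|/bin/java$||' -e 's|/jre$||'
}

java_bin=$(command -v java || true)
if [ -n "$java_bin" ]; then
  # readlink -f follows the /etc/alternatives symlinks to the real location
  JAVA_HOME=$(java_home_from_binary "$(readlink -f "$java_bin")")
  export JAVA_HOME
  echo "JAVA_HOME=$JAVA_HOME"
else
  echo "java not found on PATH; install the JDK first" >&2
fi
```

Paste it into the same shell session you will later run bin/nutch from, so the exported variable is still set.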

Add to conf/nutch-site.xml, inside the <configuration> element:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

Save and exit.
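For reference, here is roughly what a minimal nutch-site.xml looks like once the property is in place. The sketch writes it to a temporary file so nothing in conf/ is overwritten; the function name write_nutch_site is hypothetical, and the assumption is that the file contains no other properties.

```shell
# Hypothetical helper: write a minimal nutch-site.xml to the path in $1,
# using $2 as the value of http.agent.name.
write_nutch_site() {
  cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$2</value>
  </property>
</configuration>
EOF
}

# Write to a temporary file and print it for inspection
out=$(mktemp)
write_nutch_site "$out" "My Nutch Spider"
cat "$out"
```

If the result looks right, copy it over conf/nutch-site.xml yourself, or merge the property into the existing file if it already holds other settings.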

mkdir -p urls
cd urls
touch seed.txt
nano seed.txt

Add the URLs to crawl, for example

http://nutch.apache.org/
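The seed file can also be written without opening an editor. In this sketch SEED_DIR and seed_count are illustrative names, not part of Nutch, and a temporary directory is used as a stand-in unless you set SEED_DIR to your real urls directory.

```shell
# Stand-in for the urls directory; set SEED_DIR=urls to use the real one
SEED_DIR=${SEED_DIR:-$(mktemp -d)}

# One URL per line
printf '%s\n' 'http://nutch.apache.org/' > "$SEED_DIR/seed.txt"

# Hypothetical helper: count the non-blank lines in a seed file
seed_count() {
  grep -c '[^[:space:]]' "$1"
}

echo "$(seed_count "$SEED_DIR/seed.txt") seed URL(s) in $SEED_DIR/seed.txt"
```

A count of 0 is worth catching before starting the crawl, since there would be nothing to fetch.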

Then edit conf/regex-urlfilter.txt and replace

# accept anything else
+.

with a regular expression matching the domain you want to crawl. For example, to limit the crawl to the nutch.apache.org domain, the line should be:

+^http://([a-z0-9]*\.)*nutch.apache.org/
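Before crawling, you can get a rough idea of what the pattern accepts by trying it against a few URLs with grep. This is only an approximation: Nutch evaluates the filter with Java regular expressions, while grep -E uses POSIX extended syntax (the two agree for a simple pattern like this one). The leading + in the filter file is the accept marker and is not part of the regex. The function name url_allowed is made up for the example.

```shell
# The regex from the filter line, without the leading "+"
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Hypothetical helper: succeeds if the URL in $1 matches the pattern
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in http://nutch.apache.org/ http://wiki.nutch.apache.org/page http://example.com/; do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Note that the pattern only matches http://, so an https:// seed on the same domain would be rejected unless the expression is widened.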

5) Configure Solr

In ~/tmp/solr/apache-solr-3.6.0/example/solr/conf/schema.xml, add the following field type (inside <types>) and fields (inside <fields>):

<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0"
                splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>


<field name="digest" type="text" stored="true" indexed="true"/>
<field name="boost" type="text" stored="true" indexed="true"/>
<field name="segment" type="text" stored="true" indexed="true"/>
<field name="host" type="text" stored="true" indexed="true"/>
<field name="site" type="text" stored="true" indexed="true"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="tstamp" type="text" stored="true" indexed="false"/>
<field name="url" type="string" stored="true" indexed="true"/>
<field name="anchor" type="text" stored="true" indexed="false" multiValued="true"/>

Still in schema.xml, change <uniqueKey>id</uniqueKey> to
<uniqueKey>url</uniqueKey>

In solrconfig.xml (same directory), add:
<requestHandler name="/nutch" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="echoParams">explicit</str>
        <float name="tie">0.01</float>
        <str name="qf">
            content^0.5 anchor^1.0 title^1.2
        </str>
        <str name="pf">
            content^0.5 anchor^1.5 title^1.2 site^1.5
        </str>
        <str name="fl">
            url
        </str>
        <int name="ps">100</int>
        <bool name="hl">true</bool>
        <str name="q.alt">*:*</str>
        <str name="hl.fl">title url content</str>
        <str name="f.title.hl.fragsize">0</str>
        <str name="f.title.hl.alternateField">title</str>
        <str name="f.url.hl.fragsize">0</str>
        <str name="f.url.hl.alternateField">url</str>
        <str name="f.content.hl.fragmenter">regex</str>
    </lst>
</requestHandler>

6) Run the Nutch crawler and index into Solr (make sure Solr is running, and run this from the Nutch directory, not from urls)

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

Check the indexed documents at http://localhost:8983/solr
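A quick way to confirm that documents actually reached the index is to ask Solr for a match-all count from the command line. This sketch assumes the standard /select handler of the example configuration is still enabled and that curl is installed; SOLR_URL and count_query_url are names chosen for the example.

```shell
SOLR_URL=${SOLR_URL:-http://localhost:8983/solr}

# Hypothetical helper: build a match-all query that returns only the count
# (rows=0), as JSON, against the /select handler of the Solr base URL in $1.
count_query_url() {
  printf '%s/select?q=*:*&rows=0&wt=json\n' "${1%/}"
}

url=$(count_query_url "$SOLR_URL")
echo "Querying $url"

# -s silences progress output, -f makes curl fail on HTTP error codes
if ! curl -sf --max-time 5 "$url"; then
  echo "Solr did not answer at $SOLR_URL; is it running?" >&2
fi
```

In the JSON reply, the numFound value under response is the number of indexed documents; if it is still 0 after the crawl, check the URL filter and the seed file first.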

Source

    
by Noosrep 24.02.2014 / 10:59