Thursday, July 24, 2014

nutch 1.8 + solr 4.9.0 study series, part 3: SolrCloud installation

Following the Apache Solr Reference Guide / SolrCloud, I plan to install SolrCloud on 4 centos 6.3 VMs. The guide introduces SolrCloud as follows:

Apache Solr includes the ability to set up a cluster of Solr servers that combines fault tolerance and high availability. Called SolrCloud, these capabilities provide distributed indexing and search capabilities, supporting the following features:

  • Central configuration for the entire cluster
  • Automatic load balancing and fail-over for queries
  • ZooKeeper integration for cluster coordination and configuration.

SolrCloud is flexible distributed search and indexing, without a master node to allocate nodes, shards and replicas. Instead, Solr uses ZooKeeper to manage these locations, depending on configuration files and schemas. Documents can be sent to any server and ZooKeeper will figure it out.

ZooKeeper will be installed as standalone servers rather than using the embedded ZooKeeper that ships with solr. The 4 VMs are laid out as follows:

solr1(192.168.0.11) : ZooKeeper + solr
solr2(192.168.0.12) : ZooKeeper + solr
solr3(192.168.0.13) : ZooKeeper + solr
solr4(192.168.0.14) : solr only
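
All of the configuration below refers to the VMs by hostname (solr1 ~ solr4), so every VM must be able to resolve those names. A minimal sketch, assuming the IPs above, appended to /etc/hosts on each VM:

# append to /etc/hosts on every VM (IPs taken from the layout above)
192.168.0.11 solr1
192.168.0.12 solr2
192.168.0.13 solr3
192.168.0.14 solr4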
1. solr installation
Install solr on the solr1 VM first, then scp it to the other VMs. For the solr installation itself, see nutch 1.8 + solr 4.9.0 study series, part 1: basic installation. Start by installing scp on all 4 VMs:

yum install openssh-clients
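
Optionally, to keep the scp commands below from prompting for a password every time, you can set up key-based ssh from solr1 first. A sketch, assuming everything runs as root as in this series:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for h in solr2 solr3 solr4; do ssh-copy-id root@$h; done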

The following uploads the solr example/ directory from solr1 to solr2, solr3, solr4, and also copies it to /root on solr1 itself:

cd /root/solr-4.9.0/solr
cp -r example/ /root/
scp -r example/ solr2:/root
scp -r example/ solr3:/root
scp -r example/ solr4:/root

Rename the directory on all 4 VMs (solr1, solr2, solr3, solr4):

mv /root/example/ /root/solr-node
2. ZooKeeper installation
Run the following commands on all 4 VMs (solr1, solr2, solr3, solr4):

cd ~
wget http://ftp.twaren.net/Unix/Web/apache/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
tar zxvf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6
cp conf/zoo_sample.cfg conf/zoo.cfg
vi conf/zoo.cfg

Edit and save:

# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/root/zookeeper-3.4.6/data
# the port at which the clients will connect
clientPort=2181

server.1=solr1:2888:3888
server.2=solr2:2888:3888
server.3=solr3:2888:3888

Create the data directory (still inside /root/zookeeper-3.4.6):

mkdir data

Create the myid file on the solr1 VM:

echo "1" >> data/myid

Create the myid file on the solr2 VM:

echo "2" >> data/myid

Create the myid file on the solr3 VM:

echo "3" >> data/myid

Create the myid file on the solr4 VM (solr4's ZooKeeper is not in the ensemble and is never started, so this myid is unused, but it does no harm):

echo "4" >> data/myid
3. Start the ZooKeeper servers on solr1, solr2, solr3
cd /root/zookeeper-3.4.6
bin/zkServer.sh start
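
To confirm the ensemble actually formed, ask each server for its role; one of the three should report leader and the other two follower. A quick check, assuming nc is installed for the stat four-letter command:

bin/zkServer.sh status
echo stat | nc solr1 2181 | grep Mode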
4. Start solr on solr1
cd /root/solr-node
java -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf \
-Dcollection.configName=myconf -DzkHost=solr1:2181,solr2:2181,solr3:2181 \
-jar start.jar

You can check the cloud view at:

http://192.168.0.11:8983/solr/#/~cloud

You should see the cloud graph. (Screenshot omitted.)

5. Start solr on solr2, solr3, solr4
cd /root/solr-node
java -Djetty.port=7574 -DzkHost=solr1:2181,solr2:2181,solr3:2181 \
-jar start.jar

Remember that the jetty.port on the 2nd, 3rd, and 4th VMs should differ from the first one's; here it is set to 7574. You can check any of the following URLs, and remember that all 4 of them serve as entry points to this SolrCloud.

http://192.168.0.11:8983/solr/#/~cloud
or
http://192.168.0.12:7574/solr/#/~cloud
or
http://192.168.0.13:7574/solr/#/~cloud
or
http://192.168.0.14:7574/solr/#/~cloud

You should see the full cloud graph with both shards and their replicas. (Screenshot omitted.)

Please ignore the grayed-out nodes and text in the figure; they are leftovers from servers started during testing and carry no meaning.

6. Testing SolrCloud index, query, shards, replicas, fault tolerance and high availability
Note that this solr's schema.xml comes from Nutch, so the data import below is specific to it; for solr's default schema.xml, see Updating a Solr Index with JSON. I first imported csv, but testing revealed a problem, possibly a bug, so I switched to json. The URL in the commands below can point to any solr node; every node is a valid entry point.
curl "http://192.168.0.14:7574/solr/update" -H 'Content-type:application/json' -d '
[
 {"id" : "book1",
  "url" : "url1",
  "title" : "A Game of Thrones",
  "author" : "George R.R. Martin"
 },
 {"id" : "book2",
  "url" : "url2",
  "title" : "A Clash of Kings",
  "author" : "George R.R. Martin"
 },
 {"id" : "book3",
  "url" : "url3",
  "title" : "Foundation",
  "author" : "Isaac Asimov"
 },
 {"id" : "book4",
  "url" : "url4",
  "title" : "Foundation and Empire",
  "author" : "Isaac Asimov"
 }
]'
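
If the documents do not show up in the queries below, they may simply not be committed yet; depending on the solrconfig's auto-commit settings, a new searcher may not have opened. An explicit commit can be sent to any node, for example:

curl "http://192.168.0.14:7574/solr/collection1/update?commit=true"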

Then run the following 8 commands and inspect the results. distrib=false means query only the data on that one shard node rather than the whole collection. Remember to wrap the whole URL in double quotes.

curl "http://192.168.0.11:8983/solr/collection1/select?q=*:*"
curl "http://192.168.0.12:7574/solr/collection1/select?q=*:*"
curl "http://192.168.0.13:7574/solr/collection1/select?q=*:*"
curl "http://192.168.0.14:7574/solr/collection1/select?q=*:*"

curl "http://192.168.0.11:8983/solr/collection1/select?q=*:*&distrib=false"
curl "http://192.168.0.12:7574/solr/collection1/select?q=*:*&distrib=false"
curl "http://192.168.0.13:7574/solr/collection1/select?q=*:*&distrib=false"
curl "http://192.168.0.14:7574/solr/collection1/select?q=*:*&distrib=false"

Next, stop solr4, solr3, solr2 one after another, re-running the commands above each time to see what effect it has.

From the shard node diagram above, shard1 and shard2 together make up one complete copy of the data, and the cluster holds 2 copies that are kept in sync automatically. Each of shard1 and shard2 must therefore keep at least one live node; when that condition cannot be met, the queries above return a 503 error. If a shard's Leader node is stopped and restarted, the Leader role has already moved elsewhere because of the stop. For example: stop solr1 and start it again, and you will find solr1 has lost the Leader role, which has moved to solr4.

Next we test ZooKeeper's high availability. The Using Multiple ZooKeepers in an Ensemble section of Getting Started with SolrCloud explains that at least 2 of the 3 ZooKeeper servers must stay live to achieve high availability:

To truly provide high availability, we need to make sure that not only do we also have at least one shard server running at all times, but also that the cluster also has a ZooKeeper running to manage it. To do that, you can set up a cluster to use multiple ZooKeepers. This is called using a ZooKeeper ensemble. A ZooKeeper ensemble can keep running as long as more than half of its servers are up and running, so at least two servers in a three ZooKeeper ensemble, 3 servers in a 5 server ensemble, and so on, must be running at any given time. These required servers are called a quorum.

Testing confirms it: stop any 2 of the ZooKeepers and SolrCloud becomes unusable, so more than half of the ZooKeepers really must be kept running.

7. Indexing Nutch-crawled data into SolrCloud
As noted above, all 4 nodes are entry points to SolrCloud, so any of the following commands works:
bin/crawl urls crawl http://192.168.0.11:8983/solr/ 2
or
bin/crawl urls crawl http://192.168.0.12:7574/solr/ 2
or
bin/crawl urls crawl http://192.168.0.13:7574/solr/ 2
or
bin/crawl urls crawl http://192.168.0.14:7574/solr/ 2

Then use the query commands above to see how the Nutch-crawled data is distributed across the shards.

For the Nutch installation, see nutch 1.8 + solr 4.9.0 study series, part 1: basic installation.

8. Replacing jetty with Tomcat
For installing Tomcat and integrating and configuring it with solr, see steps 6~8 of nutch 1.8 + solr 4.9.0 study series, part 1: basic installation. Install and configure on solr1 first, with step 7 changed as follows.
yum install unzip
unzip /root/solr-node/webapps/solr.war -d /usr/local/apache-tomcat-8.0.9/webapps/solr
vi /usr/local/apache-tomcat-8.0.9/webapps/solr/WEB-INF/web.xml

Modify as follows:

  <!--
    <env-entry>
       <env-entry-name>solr/home</env-entry-name>
       <env-entry-value>/put/your/solr/home/here</env-entry-value>
       <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>
   -->

    <env-entry> 
       <env-entry-name>solr/home</env-entry-name>
       <env-entry-value>/root/solr-node/solr</env-entry-value>
       <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>

Edit server.xml:

vi /usr/local/apache-tomcat-8.0.9/conf/server.xml

Change port 8080 to 8983:

<Connector port="8983" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443" />

Edit catalina.sh:

vi /usr/local/apache-tomcat-8.0.9/bin/catalina.sh

Add the following:

JAVA_OPTS="$JAVA_OPTS -Djetty.port=8983 -DnumShards=2 -Dbootstrap_confdir=/root/solr-node/solr/collection1/conf -Dcollection.configName=myconf -DzkHost=solr1:2181,solr2:2181,solr3:2181"

Create the /usr/local directory on solr2, solr3, solr4 if it does not already exist:

mkdir -p /usr/local

Once solr1 is configured, scp it to solr2, solr3, solr4:

scp -r /usr/local/apache-tomcat-8.0.9/ solr2:/usr/local/
scp -r /usr/local/apache-tomcat-8.0.9/ solr3:/usr/local/
scp -r /usr/local/apache-tomcat-8.0.9/ solr4:/usr/local/

Edit server.xml on solr2, solr3, solr4:

vi /usr/local/apache-tomcat-8.0.9/conf/server.xml

Change port 8983 to 7574:

<Connector port="7574" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443" />

Edit catalina.sh:

vi /usr/local/apache-tomcat-8.0.9/bin/catalina.sh

Change jetty.port to 7574:

JAVA_OPTS="$JAVA_OPTS -Djetty.port=7574 -DnumShards=2 -Dbootstrap_confdir=/root/solr-node/solr/collection1/conf -Dcollection.configName=myconf -DzkHost=solr1:2181,solr2:2181,solr3:2181"

Instead of specifying -Djetty.port=7574, you can also edit hostPort in /root/solr-node/solr/solr.xml directly, changing ${jetty.port:8983} to 7574:

vi /root/solr-node/solr/solr.xml

Change it as follows:

  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">7574</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>

Remember that the port on solr2, solr3, solr4 must differ from solr1's. Readers can try for themselves what happens when all VMs use the same port.

Start tomcat on solr1, solr2, solr3, solr4:

cd /usr/local/apache-tomcat-8.0.9
bin/startup.sh

Open any of the following URLs to verify SolrCloud is working:

http://192.168.0.11:8983/solr/#/~cloud
or
http://192.168.0.12:7574/solr/#/~cloud
or
http://192.168.0.13:7574/solr/#/~cloud
or
http://192.168.0.14:7574/solr/#/~cloud
9. Multicore setup
Since we are switching to multicore, delete the earlier collection1 first; otherwise things misbehave after the steps below are done. Run the following on solr1:
curl "http://192.168.0.11:8983/solr/admin/collections?action=DELETE&name=collection1"

Shut down tomcat on solr1, solr2, solr3, solr4:
cd /usr/local/apache-tomcat-8.0.9
bin/shutdown.sh

SolrCloud relies on the ZooKeeper ensemble to keep conf/ file changes synchronized to every node in a timely way, so conf/* must be uploaded into the ZooKeeper ensemble. We now have 2 cores, core0 and core1 (previously there was only collection1):

Run the following commands on solr1:

core0 :

java -classpath .:/root/solr-node/solr-webapp/webapp/WEB-INF/lib/*:/root/solr-node/lib/ext/* \
     org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost solr1:2181,solr2:2181,solr3:2181 \
     -confdir /root/solr-node/multicore/core0/conf -confname conf0 \
     -solrhome /root/solr-node/multicore

java -classpath .:/root/solr-node/solr-webapp/webapp/WEB-INF/lib/*:/root/solr-node/lib/ext/* \
     org.apache.solr.cloud.ZkCLI -cmd linkconfig -zkhost solr1:2181,solr2:2181,solr3:2181 \
     -collection core0 -confname conf0 -solrhome /root/solr-node/multicore

core1 :

java -classpath .:/root/solr-node/solr-webapp/webapp/WEB-INF/lib/*:/root/solr-node/lib/ext/* \
     org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost solr1:2181,solr2:2181,solr3:2181 \
     -confdir /root/solr-node/multicore/core1/conf -confname conf1 \
     -solrhome /root/solr-node/multicore

java -classpath .:/root/solr-node/solr-webapp/webapp/WEB-INF/lib/*:/root/solr-node/lib/ext/* \
     org.apache.solr.cloud.ZkCLI -cmd linkconfig -zkhost solr1:2181,solr2:2181,solr3:2181 \
     -collection core1 -confname conf1 -solrhome /root/solr-node/multicore

After the upload completes, check what is stored in ZooKeeper. You can run the following on any of solr1, solr2, solr3:

cd /root/zookeeper-3.4.6
bin/zkCli.sh -server solr1:2181
...

[zk: solr1:2181(CONNECTED) 0] ls /
[configs, zookeeper, clusterstate.json, aliases.json, live_nodes, overseer, overseer_elect, collections]
[zk: solr1:2181(CONNECTED) 1] ls /configs
[conf0, conf1, myconf]
[zk: solr1:2181(CONNECTED) 2] ls /configs/conf0
[admin-extra.menu-top.html, currency.xml, protwords.txt, mapping-FoldToASCII.txt, _schema_analysis_synonyms_english.json, solrconfig.xml, _schema_analysis_stopwords_english.json, stopwords.txt, lang, schema.xml.bak, spellings.txt, mapping-ISOLatin1Accent.txt, admin-extra.html, xslt, synonyms.txt, scripts.conf, update-script.js, velocity, elevate.xml, admin-extra.menu-bottom.html, schema.xml, clustering]
[zk: solr1:2181(CONNECTED) 3] quit

You can also check via solr2 or solr3:

bin/zkCli.sh -server solr2:2181
or
bin/zkCli.sh -server solr3:2181

On solr1, solr2, solr3, solr4, edit the tomcat settings and restart:

cd /usr/local/apache-tomcat-8.0.9
vi bin/catalina.sh

Delete the -Dbootstrap_confdir=/root/solr-node/solr/collection1/conf -Dcollection.configName=myconf options entered earlier:

JAVA_OPTS="$JAVA_OPTS -Djetty.port=7574 -DnumShards=2 -Dbootstrap_confdir=/root/solr-node/solr/collection1/conf -Dcollection.configName=myconf -DzkHost=solr1:2181,solr2:2181,solr3:2181"

The content after the deletion:

JAVA_OPTS="$JAVA_OPTS -Djetty.port=7574 -DnumShards=2 -DzkHost=solr1:2181,solr2:2181,solr3:2181"

On solr1, catalina.sh uses -Djetty.port=8983; for brevity that variant is not pasted separately.

Edit web.xml on each of the VMs solr1, solr2, solr3, solr4:

vi /usr/local/apache-tomcat-8.0.9/webapps/solr/WEB-INF/web.xml

Modify as follows:

  <!--
    <env-entry>
       <env-entry-name>solr/home</env-entry-name>
       <env-entry-value>/root/solr-node/solr</env-entry-value>
       <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>
   -->

    <env-entry> 
       <env-entry-name>solr/home</env-entry-name>
       <env-entry-value>/root/solr-node/multicore</env-entry-value>
       <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>

For core0/conf/ and core1/conf/ under /root/solr-node/multicore/, readers can define the related xml themselves, or copy it over from collection1/conf and modify it.

Start tomcat on solr1, solr2, solr3, solr4:

bin/startup.sh
10. References

Tuesday, July 8, 2014

nutch 1.8 + solr 4.9.0 study series, part 2: nutch url filter, re-crawl, crawl script

This article looks into three things: choosing a urlfilter, configuring re-crawl, and the contents of the crawl script.

1. Choosing a urlfilter
In NutchTutorial and many Nutch installation articles there is a step saying that if you want to filter the crawled urls, you can edit the last line of regex-urlfilter.txt. But the conf/ directory contains many *urlfilter.txt files; do the others have any effect? How are they used?

Opening regex-urlfilter.txt, its contents say:

vi /root/apache-nutch-1.8/runtime/local/conf/regex-urlfilter.txt

# The default url filter.
# Better for whole-internet crawling.

So we know regex-urlfilter.txt is the default url filter. But if you open automaton-urlfilter.txt, you find the same description:

vi /root/apache-nutch-1.8/runtime/local/conf/automaton-urlfilter.txt

# The default url filter.
# Better for whole-internet crawling.

So regex-urlfilter.txt and automaton-urlfilter.txt are both "the default url filter" and the others are not. Is that really so?

Looking at the contents of nutch-default.xml, we can find the following properties:

vi /root/apache-nutch-1.8/runtime/local/conf/nutch-default.xml

<!-- indexingfilter plugin properties -->

<property>
  <name>indexingfilter.order</name>
  <value></value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.

  Filter ordering might have impact on result if one filter depends on output of
  another filter.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude.
  </description>
</property>

<!-- urlfilter plugin properties -->

<property>
  <name>urlfilter.domain.file</name>
  <value>domain-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing either top level domains or
  hostnames used by urlfilter-domain (DomainURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.automaton.file</name>
  <value>automaton-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-automaton (AutomatonURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing url prefixes
  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.suffix.file</name>
  <value>suffix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing url suffixes
  used by urlfilter-suffix (SuffixURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.order</name>
  <value></value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>

Reading each property's description carefully: the urlfilter.*.file properties only define the url filter file names; which filters actually get loaded is determined by plugin.includes, with plugin.excludes ruling plugins out (url filters are plugins too), while urlfilter.order and indexingfilter.order let users define the order in which the filters are applied (if empty, the system orders them itself). Given the default settings above, only regex-urlfilter.txt is actually loaded. We are therefore free to decide which url filters to use and in what order, though in my view regex-urlfilter.txt is enough; I would not even bother editing regex-urlfilter.txt to filter the crawled urls further.
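
As an example of that mechanism, to also enable the prefix filter and run it before the regex filter (prefix matching being the cheaper of the two), nutch-site.xml could override both properties. This is only a sketch built from the default values shown above:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(prefix|regex)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.prefix.PrefixURLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
</property>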

2. Re-crawl settings
How to re-crawl with Nutch explains the re-crawl mechanism and several of its settings, but it does not say whether the system uses db.fetch.interval.default or the db.fetch.schedule.adaptive.* settings. Another post, RE: Question about fetch interval value, says that setting the following property makes Nutch use the db.fetch.schedule.adaptive.* settings. You can try it:
vi /root/apache-nutch-1.8/runtime/local/conf/nutch-site.xml

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
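
The adaptive schedule itself is tuned through the db.fetch.schedule.adaptive.* properties defined in nutch-default.xml. A sketch of overriding two of them in nutch-site.xml; the values are purely illustrative:

<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <!-- illustrative: never re-fetch a page more often than once a day -->
  <value>86400</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <!-- illustrative: re-fetch every page at least once every 30 days -->
  <value>2592000</value>
</property>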

但如果要 "auto" re-crawl, 還是得自己設定 crontab. 記得請不要直接修改 nutch-default.xml, 而是將要修改的 property 放到 nutch-site.xml 檔, 這樣 nutch-site.xml 的 properties 就會 override nutch-default.xml 的值.

3. Contents of the bin/crawl script
Since nutch 1.8 the bin/nutch crawl command can no longer be used; the usage was removed as of NUTCH 1.8 and NUTCH 2.3 (see bin/nutch crawl). So we use bin/crawl in place of bin/nutch crawl. Alternatively, follow NutchTutorial and issue the commands step by step; the bin/crawl script really just chains those same commands together. Let's look at the contents of bin/crawl.
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# 
# The Crawl command script : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
#
# 
# UNLIKE THE NUTCH ALL-IN-ONE-CRAWL COMMAND THIS SCRIPT DOES THE LINK INVERSION AND 
# INDEXING FOR EACH SEGMENT

SEEDDIR="$1"
CRAWL_PATH="$2"
SOLRURL="$3"
LIMIT="$4"

# validate the input parameters
if [ "$SEEDDIR" = "" ]; then
    echo "Missing seedDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    exit -1;
fi

if [ "$CRAWL_PATH" = "" ]; then
    echo "Missing crawlDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    exit -1;
fi

if [ "$SOLRURL" = "" ]; then
    echo "Missing SOLRURL : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    exit -1;
fi

if [ "$LIMIT" = "" ]; then
    echo "Missing numberOfRounds : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    exit -1;
fi

#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################

# set the number of slaves nodes
numSlaves=1

# and the total number of available tasks
# sets Hadoop parameter "mapred.reduce.tasks"
numTasks=`expr $numSlaves \* 2`

# number of urls to fetch in one iteration
# 250K per task?
# this parameter is the -topN value; it was originally $numSlaves*50000,
# but considering disk size, and to keep test crawls short, it is set to $numSlaves*50

sizeFetchlist=`expr $numSlaves \* 50`

# time limit for feching
timeLimitFetch=180

# num threads for fetching
numThreads=50

#############################################

# determines whether mode based on presence of job file
# it looks like distributed mode requires running runtime/deploy/bin/crawl,
# whose content is identical to this file, runtime/local/bin/crawl;
# the *nutch-*.job file checked for below sits under runtime/deploy

mode=local
if [ -f ../*nutch-*.job ]; then
    mode=distributed
fi

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

# note that some of the options listed here could be set in the 
# corresponding hadoop site xml param file 
commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"

 # check that hadoop can be found on the path 
if [ $mode = "distributed" ]; then
 if [ $(which hadoop | wc -l ) -eq 0 ]; then
    echo "Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."
    exit -1;
 fi
fi

# initial injection
$bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR

if [ $? -ne 0 ] 
  then exit $? 
fi


# main loop : rounds of generate - fetch - parse - update
# the 4th argument, LIMIT, is used to control the number of loop iterations

for ((a=1; a <= LIMIT ; a++))
do
  if [ -e ".STOP" ]
  then
   echo "STOP file found - escaping loop"
   break
  fi

  echo `date` ": Iteration $a of $LIMIT"

  echo "Generating a new segment"

  # note that -topN $sizeFetchlist and -numFetchers $numSlaves
  # use the two variables set above
  $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -topN $sizeFetchlist -numFetchers $numSlaves -noFilter
  
  if [ $? -ne 0 ] 
  then exit $? 
  fi

  # capture the name of the segment
  # call hadoop in distributed mode
  # or use ls

  if [ $mode = "local" ]; then
   SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`
  else
   SEGMENT=`hadoop fs -ls $CRAWL_PATH/segments/ | grep segments |  sed -e "s/\//\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`
  fi
  
  echo "Operating on segment : $SEGMENT"

  # fetching the segment
  echo "Fetching : $SEGMENT"

  # note that -threads $numThreads uses the variable set above
  $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -noParsing -threads $numThreads

  if [ $? -ne 0 ] 
  then exit $? 
  fi

  # parsing the segment
  echo "Parsing : $SEGMENT"
  # enable the skipping of records for the parsing so that a dodgy document
  # does not fail the full task
  skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
  $bin/nutch parse $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT

  if [ $? -ne 0 ] 
  then exit $? 
  fi

  # updatedb with this segment
  echo "CrawlDB update"
  $bin/nutch updatedb $commonOptions $CRAWL_PATH/crawldb  $CRAWL_PATH/segments/$SEGMENT

  if [ $? -ne 0 ] 
  then exit $? 
  fi

# note that the link inversion - indexing routine can be done within the main loop 
# on a per segment basis
  echo "Link inversion"
  $bin/nutch invertlinks $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT

  if [ $? -ne 0 ] 
  then exit $? 
  fi

  echo "Dedup on crawldb"
  # Once the entire contents have been indexed, duplicate urls must be
  # disposed of; deduping in this way ensures that the urls are unique.
  # <-- from http://wiki.apache.org/nutch/NutchTutorial

  $bin/nutch dedup $CRAWL_PATH/crawldb
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi

  echo "Indexing $SEGMENT on SOLR index -> $SOLRURL"
  $bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi

  echo "Cleanup on SOLR index -> $SOLRURL"

  # The class scans a crawldb directory looking for entries 
  # with status DB_GONE (404) and sends delete requests to 
  # Solr for those documents. Once Solr receives the request 
  # the aforementioned documents are duly deleted. 
  # This maintains a healthier quality of Solr index. 
  # <-- from http://wiki.apache.org/nutch/NutchTutorial

  $bin/nutch clean -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi

done

exit 0
4. References

Wednesday, July 2, 2014

nutch 1.8 + solr 4.9.0 study series, part 1: basic installation

I had been planning to learn nutch, solr, and lucene for quite a while, and finally have the time. I originally meant to install nutch 2.2.x, but the following two articles changed my mind.

In short, nutch 2.2.x is not yet stable at storing data into its database backends, and its overall performance is much worse. Since what I mainly want to learn is solr and lucene, I chose nutch 1.x.

Before installing nutch, readers may want to read the article "Nutch 等 Web Crawler 的著作權問題" on the copyright issues of web crawlers such as Nutch; violations carry civil and criminal liability, so take care.

A while back I practiced Hadoop YARN + Spark + Shark, so I have hadoop installed on 4 centos 6.3 VMs: master, slave1, slave2, slave3. This article does not use nutch together with Hadoop; I will add tests of that combination when I find time.

Start from the master VM. Before anything else, install a jdk; I installed jdk1.7.0_55.
1. Ant 1.9.5 installation
In root's HOME, I first installed Ant 1.9.4, because the Ant version provided by centos 6.3 yum is too old for solr to accept. Building with Ant 1.9.4, however, produced the message below; according to https://www.mail-archive.com/blfs-book@lists.linuxfromscratch.org/msg00345.html it appears to have been patched since, so I switched to fetching the latest version from svn.
/sources/apache-ant/apache-ant-1.9.4/src/tests/junit/org/apache/tools/ant/taskdefs/ExecuteWatchdogTest.java:143:
 error: cannot access Matcher
                         throw new AssumptionViolatedException("process interrupted in thread", e);

yum install svn
svn co http://svn.apache.org/repos/asf/ant/core/trunk/ ant-core
cd ant-core
sh build.sh -Ddist.dir=./install dist

Append the following to /etc/profile (bash is used here).

vi /etc/profile

append :

export ANT_HOME=/root/ant-core/install
export JAVA_HOME=/usr/java/jdk1.7.0_55
export PATH=${PATH}:${ANT_HOME}/bin

Make /etc/profile take effect:

source /etc/profile
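
You can confirm that the freshly built Ant is the one being picked up:

which ant
ant -version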
2. nutch 1.8 installation
cd ~
wget http://ftp.twaren.net/Unix/Web/apache/nutch/1.8/apache-nutch-1.8-src.tar.gz
tar zxvf apache-nutch-1.8-src.tar.gz
cd apache-nutch-1.8
ant
3. nutch config
Set the nutch agent name, and lift the cap on downloaded file size (you may also keep the limits from nutch-default.xml):
cd runtime/local/conf
vi nutch-site.xml

Between <configuration> and </configuration>, add:

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
4. solr 4.9.0 installation
cd ~
wget http://ftp.twaren.net/Unix/Web/apache/lucene/solr/4.9.0/solr-4.9.0-src.tgz
tar zxvf solr-4.9.0-src.tgz
cd solr-4.9.0
ant ivy-bootstrap
ant compile
cd solr
ant example
5. Integrating nutch and solr
cd example/solr
mv collection1/conf/schema.xml collection1/conf/schema.xml.bak
cp /root/apache-nutch-1.8/runtime/local/conf/schema-solr4.xml collection1/conf/schema.xml

Put the following into collection1/conf/schema.xml:

vi collection1/conf/schema.xml

After <fields>, add:

<field name="_version_" type="long" stored="true" indexed="true" multiValued="false"/>
6. Tomcat 8.0.9 installation
cd ~
wget http://ftp.mirror.tw/pub/apache/tomcat/tomcat-8/v8.0.9/src/apache-tomcat-8.0.9-src.tar.gz
tar zxvf apache-tomcat-8.0.9-src.tar.gz
cd apache-tomcat-8.0.9-src
ant -buildfile ./build.xml
cp -r output/build /usr/local/apache-tomcat-8.0.9
7. Deploying solr.war to Tomcat
yum install unzip
unzip /root/solr-4.9.0/solr/example/webapps/solr.war -d /usr/local/apache-tomcat-8.0.9/webapps/solr
vi /usr/local/apache-tomcat-8.0.9/webapps/solr/WEB-INF/web.xml

Modify as follows:

  <!--
    <env-entry>
       <env-entry-name>solr/home</env-entry-name>
       <env-entry-value>/put/your/solr/home/here</env-entry-value>
       <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>
   -->

    <env-entry> 
       <env-entry-name>solr/home</env-entry-name>
       <env-entry-value>/root/solr-4.9.0/solr/example/solr</env-entry-value>
       <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>
8. Starting Tomcat and follow-up fixes
cd /usr/local/apache-tomcat-8.0.9
bin/startup.sh
vi logs/catalina.out


26-Jun-2014 19:31:15.405 INFO [main] org.apache.catalina.core.AprLifecycleListener.init The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

The message says the APR based Apache Tomcat Native library was not found, so install it:

cd /usr/local/apache-tomcat-8.0.9/bin
tar zxvf tomcat-native.tar.gz
cd tomcat-native-1.1.30-src/jni/native
yum install apr-devel openssl-devel

./configure --with-apr=/usr/bin/apr-1-config \
            --with-java-home=/usr/java/jdk1.7.0_55/ \
            --with-ssl=yes
make && make install

Delete the extracted source directory, which is no longer needed:

rm -rf /usr/local/apache-tomcat-8.0.9/bin/tomcat-native-1.1.30-src

Add LD_LIBRARY_PATH:

vi /etc/profile

append :

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/apr/lib
export LD_LIBRARY_PATH

source /etc/profile

Restart Tomcat and check the log to see whether the installation succeeded:

cd /usr/local/apache-tomcat-8.0.9
bin/shutdown.sh
bin/startup.sh
vi logs/catalina.out

26-Jun-2014 21:34:36.695 INFO [main] org.apache.catalina.core.AprLifecycleListener.init Loaded APR based Apache Tomcat Native library 1.1.30 using APR version 1.3.9.

The message shows the APR based Apache Tomcat Native library installed successfully.

vi logs/localhost.2014-06-26.log


26-Jun-2014 19:31:18.276 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.filterStart Exception starting filter SolrRequestFilter
 java.lang.NoClassDefFoundError: Failed to initialize Apache Solr: Could not find necessary SLF4j logging jars. If using Jetty, the SLF4j logging jars need to go in the jetty lib/ext directory. For other containers, the corresponding directory should be used. For more information, see: http://wiki.apache.org/solr/SolrLogging
        at org.apache.solr.servlet.CheckLoggingConfiguration.check(CheckLoggingConfiguration.java:28)

The message shows solr logging is not set up properly, so copy the logging jars and config over:

cp /root/solr-4.9.0/solr/example/lib/ext/* /usr/local/apache-tomcat-8.0.9/lib
cp /root/solr-4.9.0/solr/example/resources/log4j.properties /usr/local/apache-tomcat-8.0.9/lib

bin/shutdown.sh
bin/startup.sh
vi logs/catalina.out

0    [localhost-startStop-1] INFO  org.apache.solr.servlet.SolrDispatchFilter  – SolrDispatchFilter.init()
10   [localhost-startStop-1] INFO  org.apache.solr.core.SolrResourceLoader  – Using JNDI solr.home: /root/solr-4.8.1/solr/example/solr
11   [localhost-startStop-1] INFO  org.apache.solr.core.SolrResourceLoader  – new SolrResourceLoader for directory: '/root/solr-4.8.1/solr/example/solr/'
180  [localhost-startStop-1] INFO  org.apache.solr.core.ConfigSolr  – Loading container configuration from /root/solr-4.8.1/solr/example/solr/solr.xml
359  [localhost-startStop-1] INFO  org.apache.solr.core.CoresLocator  – Config-defined core root directory: /root/solr-4.8.1/solr/example/solr

solr logging is now working. You can also see that a new logs/solr.log file has appeared.

9. Verifying that solr installed successfully

Point a browser at the following URL and check that the solr admin UI appears and works properly:

http://localhost:8080/solr/

You can change the core name or add cores through the admin UI.

10. Testing nutch, without hadoop for now
First limit the number of urls fetched per round, otherwise the disk cannot hold everything:
cd /root/apache-nutch-1.8/runtime/local
vi bin/crawl
# number of urls to fetch in one iteration
# 250K per task?
sizeFetchlist=`expr $numSlaves \* 50000`

Change it to:

# number of urls to fetch in one iteration
# 250K per task?
sizeFetchlist=`expr $numSlaves \* 50`

Start crawling. You can trace progress through /root/apache-nutch-1.8/runtime/local/logs/hadoop.log, the standard output of the running command, and /usr/local/apache-tomcat-8.0.9/logs/solr.log.

mkdir urls
vi urls/seed.txt

Add one line:

http://wiki.apache.org/nutch/

Run the command to start crawling the data and building the solr index.

#crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

bin/crawl urls crawl http://localhost:8080/solr/ 2

Enter the following URL to query the results; just press the Execute Query button near the bottom to see all the crawled data.

http://localhost:8080/solr/#/collection1/query

If you want a better, regular-web-page style user interface, try the following 2 URLs:

http://localhost:8080/solr/select/?q=apache&wt=xslt&tr=example.xsl
http://localhost:8080/solr/collection1/browse

There are actually many enhanced solr front ends on the web, but the focus for now is understanding the internals of nutch and solr; I will try those interfaces some time later. For a first look, see:

http://searchhub.org/2010/01/14/solr-search-user-interface-examples/
11. References