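I had been planning to learn Nutch, Solr, and Lucene for a long time, and I finally have time to do it. I originally intended to install Nutch 2.2.x, but two articles changed my mind (see the references at the end of this post).
In short, Nutch 2.2.x is still unstable when storing data to a database, and its overall performance is much worse. Since what I mainly want to learn is Solr and Lucene, I chose Nutch 1.x.
Before installing Nutch, readers may want to read the article "Nutch 等 Web Crawler 的著作權問題" (on the copyright issues of Nutch and other web crawlers). Breaking the law here carries civil and criminal liability, so be careful.
A while ago I was practicing Hadoop YARN + Spark + Shark, so I already have Hadoop installed on four CentOS 6.3 VMs: master, slave1, slave2, and slave3. This post does not cover combining Nutch with Hadoop; I will add those tests when I have time.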
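Start on master. Install the JDK first; I installed jdk1.7.0_55.
1. Ant 1.9.5 installation
Working in root's HOME directory, I first installed Ant 1.9.4, because the Ant version provided by yum on CentOS 6.3 is too old for Solr. However, building Ant 1.9.4 failed with the message below. According to https://www.mail-archive.com/blfs-book@lists.linuxfromscratch.org/msg00345.html this appears to have been fixed already, so I switched to fetching the latest version from svn.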
/sources/apache-ant/apache-ant-1.9.4/src/tests/junit/org/apache/tools/ant/taskdefs/ExecuteWatchdogTest.java:143: error: cannot access Matcher
throw new AssumptionViolatedException("process interrupted in thread", e);
yum install subversion
svn co http://svn.apache.org/repos/asf/ant/core/trunk/ ant-core
cd ant-core
sh build.sh -Ddist.dir=./install dist
Append the following to /etc/profile (this assumes bash).
vi /etc/profile
append :
export ANT_HOME=/root/ant-core/install
export JAVA_HOME=/usr/java/jdk1.7.0_55
export PATH=${PATH}:${ANT_HOME}/bin
Apply the changes in /etc/profile:
source /etc/profile
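If you rerun these steps, appending to /etc/profile by hand can leave duplicate lines. The following is a minimal sketch of an idempotent append; the helper name append_once is my own and is not part of any tool used in this post.

```shell
# append_once LINE FILE
# Append LINE to FILE only if an identical line is not already present.
# (append_once is a hypothetical helper name, not an existing command.)
append_once() {
    line=$1
    file=$2
    # -q: quiet, -x: match the whole line, -F: fixed string instead of regex
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example usage with the values from this post (run as root).
# Single quotes keep ${PATH} and ${ANT_HOME} literal in the file:
# append_once 'export ANT_HOME=/root/ant-core/install' /etc/profile
# append_once 'export JAVA_HOME=/usr/java/jdk1.7.0_55' /etc/profile
# append_once 'export PATH=${PATH}:${ANT_HOME}/bin' /etc/profile
```

After sourcing /etc/profile, running ant -version is a quick way to confirm that the new Ant is the one found on PATH.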
2. Nutch 1.8 installation
cd ~
wget http://ftp.twaren.net/Unix/Web/apache/nutch/1.8/apache-nutch-1.8-src.tar.gz
tar zxvf apache-nutch-1.8-src.tar.gz
cd apache-nutch-1.8
ant
3. nutch config
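Set the Nutch agent name and remove the size limit on fetched files (this is optional; you can keep the limits from nutch-default.xml). Note that the snippet below lifts only file.content.limit (-1); http.content.limit stays at 65536.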
cd runtime/local/conf
vi nutch-site.xml
Add the following between <configuration> and </configuration>:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the file://
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the http.content.limit setting.
</description>
</property>
<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content using the http://
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
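To double-check what ended up in nutch-site.xml without opening an editor, a small helper like the one below can print the value of one property. This is only a sketch: get_prop is a name I made up, and it assumes each <name> and its <value> sit on separate lines, as in the snippet above. It is not a real XML parser.

```shell
# get_prop FILE NAME
# Print the <value> that follows <name>NAME</name> in a Hadoop-style config file.
# Assumes <name> and <value> are on separate lines (hypothetical helper).
get_prop() {
    awk -v name="$2" '
        # Remember that we have seen the wanted <name> line.
        index($0, "<name>" name "</name>") { found = 1; next }
        # The first <value> line after it holds the answer.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}

# Example usage from runtime/local/conf:
# get_prop nutch-site.xml http.agent.name
# get_prop nutch-site.xml file.content.limit
```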
4. Solr 4.9.0 installation
cd ~
wget http://ftp.twaren.net/Unix/Web/apache/lucene/solr/4.9.0/solr-4.9.0-src.tgz
tar zxvf solr-4.9.0-src.tgz
cd solr-4.9.0
ant ivy-bootstrap
ant compile
cd solr
ant example
5. Integrating Nutch and Solr
cd example/solr
mv collection1/conf/schema.xml collection1/conf/schema.xml.bak
cp /root/apache-nutch-1.8/runtime/local/conf/schema-solr4.xml collection1/conf/schema.xml
Add the following field definition to collection1/conf/schema.xml.
vi collection1/conf/schema.xml
Insert right after <fields>:
<field name="_version_" type="long" stored="true" indexed="true" multiValued="false"/>
6. Tomcat 8.0.9 installation
cd ~
wget http://ftp.mirror.tw/pub/apache/tomcat/tomcat-8/v8.0.9/src/apache-tomcat-8.0.9-src.tar.gz
tar zxvf apache-tomcat-8.0.9-src.tar.gz
cd apache-tomcat-8.0.9-src
ant -buildfile ./build.xml
cp -r output/build /usr/local/apache-tomcat-8.0.9
7. Deploying solr.war to Tomcat
yum install unzip
unzip /root/solr-4.9.0/solr/example/webapps/solr.war -d /usr/local/apache-tomcat-8.0.9/webapps/solr
vi /usr/local/apache-tomcat-8.0.9/webapps/solr/WEB-INF/web.xml
Change the following:
<!--
<env-entry>
<env-entry-name>solr/home</env-entry-name>
<env-entry-value>/put/your/solr/home/here</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>
-->
to:
<env-entry>
<env-entry-name>solr/home</env-entry-name>
<env-entry-value>/root/solr-4.9.0/solr/example/solr</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>
8. Starting Tomcat and follow-up fixes
cd /usr/local/apache-tomcat-8.0.9
bin/startup.sh
vi logs/catalina.out
26-Jun-2014 19:31:15.405 INFO [main] org.apache.catalina.core.AprLifecycleListener.init The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
The message says the APR based Apache Tomcat Native library cannot be found, so install it.
cd /usr/local/apache-tomcat-8.0.9/bin
tar zxvf tomcat-native.tar.gz
cd tomcat-native-1.1.30-src/jni/native
yum install apr-devel openssl-devel
./configure --with-apr=/usr/bin/apr-1-config \
--with-java-home=/usr/java/jdk1.7.0_55/ \
--with-ssl=yes
make && make install
Delete the extracted directory, which is no longer needed.
rm -rf /usr/local/apache-tomcat-8.0.9/bin/tomcat-native-1.1.30-src
Add the library directory to LD_LIBRARY_PATH:
vi /etc/profile
append :
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/apr/lib
export LD_LIBRARY_PATH
source /etc/profile
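One detail about the line above: if LD_LIBRARY_PATH is empty or unset, the result starts with a colon, and the dynamic linker treats an empty entry as the current directory. A slightly more defensive form is sketched below; with_lib_dir is a hypothetical helper name used only for illustration.

```shell
# with_lib_dir DIR
# Print LD_LIBRARY_PATH with DIR appended, without a stray leading colon
# when the variable is empty or unset. (Hypothetical helper.)
with_lib_dir() {
    # ${VAR:+text} expands to "text" only if VAR is set and non-empty.
    printf '%s\n' "${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1"
}

# Equivalent one-liner that could go into /etc/profile instead:
# export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}/usr/local/apr/lib"
```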
Restart Tomcat and check the log to see whether the installation succeeded.
cd /usr/local/apache-tomcat-8.0.9
bin/shutdown.sh
bin/startup.sh
vi logs/catalina.out
26-Jun-2014 21:34:36.695 INFO [main] org.apache.catalina.core.AprLifecycleListener.init Loaded APR based Apache Tomcat Native library 1.1.30 using APR version 1.3.9.
Judging from the message, the APR based Apache Tomcat Native library was installed successfully.
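Instead of paging through catalina.out in vi, the same check can be scripted. The sketch below greps for the "Loaded APR based" line shown above; apr_loaded is a helper name I made up, and the log wording is taken from this Tomcat version and may differ in others.

```shell
# apr_loaded LOGFILE
# Succeed (exit status 0) if the Tomcat log reports that the native library loaded.
# (Hypothetical helper; the message text is the one seen in this post.)
apr_loaded() {
    grep -q 'Loaded APR based Apache Tomcat Native library' "$1"
}

# Example usage:
# if apr_loaded /usr/local/apache-tomcat-8.0.9/logs/catalina.out; then
#     echo "Tomcat Native is active"
# else
#     echo "Tomcat Native was not loaded"
# fi
```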
vi logs/localhost.2014-06-26.log
26-Jun-2014 19:31:18.276 SEVERE [localhost-startStop-1] org.apache.catalina.core.StandardContext.filterStart Exception starting filter SolrRequestFilter
java.lang.NoClassDefFoundError: Failed to initialize Apache Solr: Could not find necessary SLF4j logging jars. If using Jetty, the SLF4j logging jars need to go in the jetty lib/ext directory. For other containers, the corresponding directory should be used. For more information, see: http://wiki.apache.org/solr/SolrLogging
at org.apache.solr.servlet.CheckLoggingConfiguration.check(CheckLoggingConfiguration.java:28)
Judging from the message, the Solr logging jars are not installed yet.
cp /root/solr-4.9.0/solr/example/lib/ext/* /usr/local/apache-tomcat-8.0.9/lib
cp /root/solr-4.9.0/solr/example/resources/log4j.properties /usr/local/apache-tomcat-8.0.9/lib
bin/shutdown.sh
bin/startup.sh
vi logs/catalina.out
0 [localhost-startStop-1] INFO org.apache.solr.servlet.SolrDispatchFilter – SolrDispatchFilter.init()
10 [localhost-startStop-1] INFO org.apache.solr.core.SolrResourceLoader – Using JNDI solr.home: /root/solr-4.8.1/solr/example/solr
11 [localhost-startStop-1] INFO org.apache.solr.core.SolrResourceLoader – new SolrResourceLoader for directory: '/root/solr-4.8.1/solr/example/solr/'
180 [localhost-startStop-1] INFO org.apache.solr.core.ConfigSolr – Loading container configuration from /root/solr-4.8.1/solr/example/solr/solr.xml
359 [localhost-startStop-1] INFO org.apache.solr.core.CoresLocator – Config-defined core root directory: /root/solr-4.8.1/solr/example/solr
Solr logging now works. A new logs/solr.log file also appears.
9. Testing the Solr installation
Open the following URL in a browser and check that the Solr admin interface appears and works.
http://localhost:8080/solr/
The admin interface can be used to rename a core or add a new one.
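The same check can be done from the command line. The sketch below assumes curl is installed and that collection1 still has the /admin/ping handler from the example solrconfig.xml; ping_ok is a hypothetical helper that only inspects the JSON text it is given.

```shell
# ping_ok
# Read a Solr ping response (wt=json) on stdin and succeed if it reports
# "status":"OK". (Hypothetical helper; it does not contact Solr itself.)
ping_ok() {
    grep -q '"status" *: *"OK"'
}

# Example usage (assumes the example /admin/ping handler is enabled):
# curl -s 'http://localhost:8080/solr/collection1/admin/ping?wt=json' | ping_ok \
#     && echo "Solr is up" || echo "Solr did not answer OK"
```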
10. Testing Nutch (without Hadoop for now)
First limit the number of URLs fetched per round; otherwise the disk will fill up.
cd /root/apache-nutch-1.8/runtime/local
vi bin/crawl
# number of urls to fetch in one iteration
# 250K per task?
sizeFetchlist=`expr $numSlaves \* 50000`
to:
# number of urls to fetch in one iteration
# 250K per task?
sizeFetchlist=`expr $numSlaves \* 50`
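The same edit can be made without opening vi. This is a sketch under the assumption that the sizeFetchlist line in bin/crawl looks exactly like the one shown above; limit_fetchlist is a name I made up. It keeps a backup copy with a .bak suffix.

```shell
# limit_fetchlist FILE N
# Replace the per-slave multiplier on the sizeFetchlist line with N,
# keeping the original file as FILE.bak. (Hypothetical helper.)
limit_fetchlist() {
    # Only touch the line starting with sizeFetchlist=, and only the digits
    # directly before the closing backtick at the end of that line.
    sed -i.bak '/^sizeFetchlist=/s/[0-9][0-9]*`$/'"$2"'`/' "$1"
}

# Example usage from /root/apache-nutch-1.8/runtime/local:
# limit_fetchlist bin/crawl 50
# grep '^sizeFetchlist=' bin/crawl
```

With the multiplier at 50, each round fetches at most 50 URLs per slave, as computed by the expr line in the script.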
Start crawling. You can trace progress through /root/apache-nutch-1.8/runtime/local/logs/hadoop.log, the standard output of the command, and /usr/local/apache-tomcat-8.0.9/logs/solr.log.
mkdir urls
vi urls/seed.txt
Add one line:
http://wiki.apache.org/nutch/
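The seed directory can also be created non-interactively. A minimal sketch follows; make_seed is a hypothetical helper name, and it simply writes one URL per line into seed.txt inside the given directory, which is the seed directory that bin/crawl reads in the command below.

```shell
# make_seed DIR URL...
# Create DIR if needed and write the given URLs, one per line, to DIR/seed.txt.
# (Hypothetical helper.)
make_seed() {
    # Require a directory and at least one URL.
    [ "$#" -ge 2 ] || return 1
    dir=$1
    shift
    mkdir -p "$dir"
    printf '%s\n' "$@" > "$dir/seed.txt"
}

# Example usage from /root/apache-nutch-1.8/runtime/local:
# make_seed urls http://wiki.apache.org/nutch/
```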
Run the command to start crawling and build the Solr index.
#crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
bin/crawl urls crawl http://localhost:8080/solr/ 2
Open the following URL to query the results. Clicking the Execute Query button at the bottom returns everything that was crawled.
http://localhost:8080/solr/#/collection1/query
For a friendlier interface in ordinary web page format, try the following two URLs.
http://localhost:8080/solr/select/?q=apache&wt=xslt&tr=example.xsl
http://localhost:8080/solr/collection1/browse
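To see how many documents were indexed without using the browser, the select handler can be queried with curl. The sketch below assumes curl is installed and that the JSON response contains a numFound field; num_found is a hypothetical helper that only parses the text it reads.

```shell
# num_found
# Read a Solr select response (wt=json) on stdin and print the numFound value.
# (Hypothetical helper; it does not contact Solr itself.)
num_found() {
    sed -n 's/.*"numFound" *: *\([0-9][0-9]*\).*/\1/p'
}

# Example usage: match all documents but return no rows, then print the count.
# curl -s 'http://localhost:8080/solr/collection1/select?q=*:*&rows=0&wt=json' | num_found
```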
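There are many enhanced front ends for Solr on the web, but for now my focus is on understanding the internals of Nutch and Solr; I will try out the interfaces later. The following page gives an overview.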
http://searchhub.org/2010/01/14/solr-search-user-interface-examples/
11. References
- 對 Nutch2.1 抽象存儲層的一些看法
- NUTCH FIGHT! 1.7 vs 2.2.1
- nutch1.6 配置SOLR
- Nutch 1.7 + Solr 4.3.1
- Nutch 快速入门(Nutch 1.7)
- nutch1.8+solr 4 配置过程
- nutch 异常集锦
- Solr's Logging Mechanism
- Apache Tomcat Native Library
- Clearing a Solr search index
- Solr Wiki -- XsltResponseWriter
- Solr Tutorial
- Solr Search User Interface Examples