Tuesday, July 8, 2014

Nutch 1.8 + Solr 4.9.0 Exploration Series, Part 2: Nutch URL filters, re-crawl, and the crawl script

This post looks at three topics: which URL filter to use, how to configure re-crawling, and what the crawl script actually does.

1. Which URL filter to use
In NutchTutorial, and in most Nutch installation write-ups, there is a step that says: if you want to filter the URLs being crawled, modify the last line of regex-urlfilter.txt. However, the conf/ directory contains several other *urlfilter.txt files. Do those have any effect? How are they meant to be used?
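
For reference, the edit those tutorials describe looks roughly like the sketch below. This is only an illustration: example.com is a placeholder for whatever domain you actually want to restrict the crawl to.

vi /root/apache-nutch-1.8/runtime/local/conf/regex-urlfilter.txt

# comment out the catch-all rule at the end of the file
# +.
# and accept only URLs under the placeholder domain example.com
+^http://([a-z0-9]*\.)*example.com/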

Open regex-urlfilter.txt and you will see this description in its header:

vi /root/apache-nutch-1.8/runtime/local/conf/regex-urlfilter.txt

# The default url filter.
# Better for whole-internet crawling.

So we know that regex-urlfilter.txt is the default URL filter. But if you open automaton-urlfilter.txt, you will find exactly the same description:

vi /root/apache-nutch-1.8/runtime/local/conf/automaton-urlfilter.txt

# The default url filter.
# Better for whole-internet crawling.

So regex-urlfilter.txt and automaton-urlfilter.txt are both the "default" URL filter, and the other files are not? Is that really how it works?

If we look at nutch-default.xml, we can find the following properties:

vi /root/apache-nutch-1.8/runtime/local/conf/nutch-default.xml

<!-- indexingfilter plugin properties -->

<property>
  <name>indexingfilter.order</name>
  <value></value>
  <description>The order by which index filters are applied.
  If empty, all available index filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter
  then BasicIndexingFilter is applied first, and MoreIndexingFilter second.

  Filter ordering might have impact on result if one filter depends on output of
  another filter.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude.
  </description>
</property>

<!-- urlfilter plugin properties -->

<property>
  <name>urlfilter.domain.file</name>
  <value>domain-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing either top level domains or
  hostnames used by urlfilter-domain (DomainURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.automaton.file</name>
  <value>automaton-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-automaton (AutomatonURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing url prefixes
  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.suffix.file</name>
  <value>suffix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing url suffixes
  used by urlfilter-suffix (SuffixURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.order</name>
  <value></value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>

Reading the descriptions carefully, we can see that the urlfilter.*.file properties only define the file name each URL filter reads. Which filters are actually loaded is decided by plugin.includes, while plugin.excludes rules plugins out (a URL filter is just one kind of plugin). urlfilter.order and indexingfilter.order let you specify the order in which the filters are applied; if left empty, the system picks an order by itself. With the default values of these properties, only regex-urlfilter.txt ends up being used. So we are free to choose which URL filters to enable and in what order, but in my view regex-urlfilter.txt alone is enough; in fact I would not even bother to modify regex-urlfilter.txt to further filter the crawled URLs.
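
If you did want to enable more than one URL filter, the changes would go into nutch-site.xml. The snippet below is only a sketch of how that could look, enabling urlfilter-prefix alongside urlfilter-regex; the class names are the ones shown in the urlfilter.order description above, and the plugin.includes value is the default one with urlfilter-prefix added. It is not a setup I actually use, and the prefix filter would also need its patterns listed in prefix-urlfilter.txt.

vi /root/apache-nutch-1.8/runtime/local/conf/nutch-site.xml

<!-- illustration only: load urlfilter-prefix in addition to urlfilter-regex -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|prefix)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- apply RegexURLFilter first, then PrefixURLFilter -->
<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter</value>
</property>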

2. Configuring re-crawl
How to re-crawl with Nutch explains the re-crawl mechanism and the settings involved, but it does not say whether the system will end up using db.fetch.interval.default or the db.fetch.schedule.adaptive.* settings. Another post, RE: Question about fetch interval value, mentions that setting the following property makes Nutch use the db.fetch.schedule.adaptive.* settings. You can give it a try:
vi /root/apache-nutch-1.8/runtime/local/conf/nutch-site.xml

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
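
Once the adaptive schedule is enabled, the db.fetch.schedule.adaptive.* properties control how the fetch interval grows and shrinks between crawls. The property names below come from nutch-default.xml; the values are only illustrative assumptions, not recommendations, and they would likewise go into nutch-site.xml:

<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400</value> <!-- shortest re-fetch interval in seconds (1 day, assumed) -->
</property>

<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>2592000</value> <!-- longest re-fetch interval in seconds (30 days, assumed) -->
</property>

<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value> <!-- how quickly the interval grows when a page is unmodified -->
</property>

<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value> <!-- how quickly the interval shrinks when a page has changed -->
</property>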

However, if you want re-crawls to run "automatically", you still have to set up a crontab entry yourself. Also, please do not edit nutch-default.xml directly; put the properties you want to change into nutch-site.xml instead, so that the values in nutch-site.xml override those in nutch-default.xml.
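
A crontab entry for this could look like the line below. It is only a sketch under assumed values: the seed URLs are assumed to be in urls/, the crawl data in crawl/, Solr at http://localhost:8983/solr/, and one round per night is assumed to be enough; adjust everything to your own environment.

crontab -e

# run one crawl round every night at 03:00 (paths, Solr URL and log file are assumptions)
0 3 * * * cd /root/apache-nutch-1.8/runtime/local && bin/crawl urls crawl http://localhost:8983/solr/ 1 >> /tmp/nutch-crawl.log 2>&1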

3. Inside the bin/crawl script
As of Nutch 1.8 the bin/nutch crawl command can no longer be used, because this usage has been removed in Nutch 1.8 and Nutch 2.3; see the bin/nutch crawl wiki page. So we use bin/crawl to replace bin/nutch crawl. Alternatively, you can follow NutchTutorial and issue the commands step by step; the bin/crawl script is really just those commands chained together. Let's walk through the contents of bin/crawl (an example invocation is shown after the listing).
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# 
# The Crawl command script : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
#
# 
# UNLIKE THE NUTCH ALL-IN-ONE-CRAWL COMMAND THIS SCRIPT DOES THE LINK INVERSION AND 
# INDEXING FOR EACH SEGMENT

SEEDDIR="$1"
CRAWL_PATH="$2"
SOLRURL="$3"
LIMIT="$4"

# validate the input arguments
if [ "$SEEDDIR" = "" ]; then
    echo "Missing seedDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    exit -1;
fi

if [ "$CRAWL_PATH" = "" ]; then
    echo "Missing crawlDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    exit -1;
fi

if [ "$SOLRURL" = "" ]; then
    echo "Missing SOLRURL : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    exit -1;
fi

if [ "$LIMIT" = "" ]; then
    echo "Missing numberOfRounds : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>"
    exit -1;
fi

#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################

# set the number of slaves nodes
numSlaves=1

# and the total number of available tasks
# sets Hadoop parameter "mapred.reduce.tasks"
numTasks=`expr $numSlaves \* 2`

# number of urls to fetch in one iteration
# 250K per task?
# this value is passed to -topN; the original script sets it to $numSlaves*50000,
# but to save disk space and keep test crawls short it is lowered to $numSlaves*50 here

sizeFetchlist=`expr $numSlaves \* 50`

# time limit for fetching
timeLimitFetch=180

# num threads for fetching
numThreads=50

#############################################

# determines whether mode based on presence of job file
# it looks like distributed mode is meant to be run via runtime/deploy/bin/crawl,
# whose content is identical to this file (runtime/local/bin/crawl);
# the *nutch-*.job file checked for below sits under runtime/deploy

mode=local
if [ -f ../*nutch-*.job ]; then
    mode=distributed
fi

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

# note that some of the options listed here could be set in the 
# corresponding hadoop site xml param file 
commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"

 # check that hadoop can be found on the path 
if [ $mode = "distributed" ]; then
 if [ $(which hadoop | wc -l ) -eq 0 ]; then
    echo "Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."
    exit -1;
 fi
fi

# initial injection
$bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR

if [ $? -ne 0 ] 
  then exit $? 
fi


# main loop : rounds of generate - fetch - parse - update
# the fourth argument, LIMIT, controls how many times this loop runs

for ((a=1; a <= LIMIT ; a++))
do
  if [ -e ".STOP" ]
  then
   echo "STOP file found - escaping loop"
   break
  fi

  echo `date` ": Iteration $a of $LIMIT"

  echo "Generating a new segment"

  # note that -topN $sizeFetchlist and -numFetchers $numSlaves
  # use the two variables set at the top of the script
  $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -topN $sizeFetchlist -numFetchers $numSlaves -noFilter
  
  if [ $? -ne 0 ] 
  then exit $? 
  fi

  # capture the name of the segment
  # call hadoop in distributed mode
  # or use ls

  if [ $mode = "local" ]; then
   SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`
  else
   SEGMENT=`hadoop fs -ls $CRAWL_PATH/segments/ | grep segments |  sed -e "s/\//\\n/g" | egrep 20[0-9]+ | sort -n | tail -n 1`
  fi
  
  echo "Operating on segment : $SEGMENT"

  # fetching the segment
  echo "Fetching : $SEGMENT"

  # note that -threads $numThreads uses the variable set at the top of the script
  $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -noParsing -threads $numThreads

  if [ $? -ne 0 ] 
  then exit $? 
  fi

  # parsing the segment
  echo "Parsing : $SEGMENT"
  # enable the skipping of records for the parsing so that a dodgy document 
  # so that it does not fail the full task
  skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
  $bin/nutch parse $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT

  if [ $? -ne 0 ] 
  then exit $? 
  fi

  # updatedb with this segment
  echo "CrawlDB update"
  $bin/nutch updatedb $commonOptions $CRAWL_PATH/crawldb  $CRAWL_PATH/segments/$SEGMENT

  if [ $? -ne 0 ] 
  then exit $? 
  fi

# note that the link inversion - indexing routine can be done within the main loop 
# on a per segment basis
  echo "Link inversion"
  $bin/nutch invertlinks $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT

  if [ $? -ne 0 ] 
  then exit $? 
  fi

  echo "Dedup on crawldb"
  # Once indexed the entire contents, it must be disposed of 
  # duplicate urls in this way ensures that the urls are unique.
  # <-- from http://wiki.apache.org/nutch/NutchTutorial

  $bin/nutch dedup $CRAWL_PATH/crawldb
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi

  echo "Indexing $SEGMENT on SOLR index -> $SOLRURL"
  $bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi

  echo "Cleanup on SOLR index -> $SOLRURL"

  # The class scans a crawldb directory looking for entries 
  # with status DB_GONE (404) and sends delete requests to 
  # Solr for those documents. Once Solr receives the request 
  # the aforementioned documents are duly deleted. 
  # This maintains a healthier quality of Solr index. 
  # <-- from http://wiki.apache.org/nutch/NutchTutorial

  $bin/nutch clean -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi

done

exit 0
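
For completeness, here is what an invocation of the script above could look like, following its usage string crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>. The seed directory, crawl directory, Solr URL and number of rounds are assumptions for illustration:

cd /root/apache-nutch-1.8/runtime/local

# seed URLs in urls/, crawl data written to crawl/, index into the Solr instance
# at http://localhost:8983/solr/, and run 2 generate-fetch-parse-update rounds
bin/crawl urls crawl http://localhost:8983/solr/ 2
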
4. References

NutchTutorial - http://wiki.apache.org/nutch/NutchTutorial
How to re-crawl with Nutch
RE: Question about fetch interval value
bin/nutch crawl (Nutch wiki page on the removal of the crawl command)
