This is my first time working with Nutch, so I am writing the steps down.

First, the database:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_unicode_ci;

CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` mediumtext,
  `status` int(11) default NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) default NULL,
  `score` float default NULL,
  `typ` varchar(32) default NULL,
  `baseUrl` varchar(767) default NULL,
  `content` longblob,
  `title` varchar(2048) default NULL,
  `reprUrl` varchar(767) default NULL,
  `fetchInterval` int(11) default NULL,
  `prevFetchTime` bigint(20) default NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) default NULL,
  `retriesSinceFetch` int(11) default NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED;
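
A side note on the schema: the database defaults to utf8 while the table itself is declared latin1. One plausible reason (a guess, not something verified in these notes) is the InnoDB index key length limit. The sketch below assumes the 767-byte limit that applies when innodb_large_prefix is not enabled, and the usual maximum bytes per character for each character set; the function names are made up for illustration.

```python
# Rough worst-case size check for the `id` primary key column.
# Assumption: a 767-byte InnoDB index key limit (innodb_large_prefix disabled).
INNODB_KEY_LIMIT_BYTES = 767

# Maximum bytes a single character can occupy in each MySQL character set.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_key_bytes(varchar_len, charset):
    """Worst-case byte length of an indexed VARCHAR(varchar_len) column."""
    return varchar_len * MAX_BYTES_PER_CHAR[charset]


def fits_in_index(varchar_len, charset):
    """True if the column can be indexed in full under the assumed limit."""
    return max_key_bytes(varchar_len, charset) <= INNODB_KEY_LIMIT_BYTES


for charset in MAX_BYTES_PER_CHAR:
    print(charset, max_key_bytes(767, charset), fits_in_index(767, charset))
```

Under these assumptions only latin1 lets a varchar(767) primary key fit, which would explain the table-level charset.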

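Install the SVN, Ivy, and Ant plugins in Eclipse.

These are the plugins the Nutch project team uses; install them yourself.

The remote SVN repository URL for Nutch 2.1: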

https://svn.apache.org/repos/asf/nutch/tags/release-2.1

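Check out the project.

Accept the defaults, click Finish, and let it create a Java project.

Wait for the download to finish.

After the download finishes (note: the project that appears here as nutch2 is renamed to nutch-2.1 further down):

In Project Explorer, right-click the project and choose Properties, then open Java Build Path.
Use Add Folder to add the folders listed below, and also add src/java and src/test from each plugin under src/plugin: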

src/bin
src/java
src/test
src/testresources

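This step can also be done by editing the project's .classpath file directly and then refreshing the project, which adds the entries automatically. That is more convenient, but check that nothing was added incorrectly.

Contents of .classpath: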

<?xml version="1.0" encoding="UTF-8"?>
<classpath>
  <classpathentry kind="src" path="conf"/>
  <classpathentry kind="src" path="src/java"/>
  <classpathentry kind="src" path="src/test"/>
  <classpathentry kind="src" path="src/plugin/protocol-file/src/test"/>
  <classpathentry kind="src" path="src/plugin/protocol-httpclient/src/test"/>
  <classpathentry kind="src" path="src/plugin/subcollection/src/test"/>
  <classpathentry kind="src" path="src/plugin/parse-html/src/test"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/test"/>
  <classpathentry kind="src" path="src/plugin/parse-html/src/java"/>
  <classpathentry kind="src" path="src/plugin/parse-tika/src/test"/>
  <classpathentry kind="src" path="src/plugin/lib-http/src/test"/>
  <classpathentry kind="src" path="src/plugin/parse-tika/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-regex/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-domain/src/java"/>
  <classpathentry kind="src" path="src/plugin/scoring-link/src/java"/>
  <classpathentry kind="src" path="src/plugin/index-anchor/src/test"/>
  <classpathentry kind="src" path="src/plugin/protocol-http/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/test"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-prefix/src/java"/>
  <classpathentry kind="src" path="src/plugin/scoring-opic/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-domain/src/test"/>
  <classpathentry kind="src" path="src/plugin/protocol-file/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/java"/>
  <classpathentry kind="src" path="src/plugin/language-identifier/src/java"/>
  <classpathentry kind="src" path="src/plugin/lib-regex-filter/src/test"/>
  <classpathentry kind="src" path="src/plugin/language-identifier/src/test"/>
  <classpathentry kind="src" path="src/plugin/subcollection/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/test"/>
  <classpathentry kind="src" path="src/plugin/index-basic/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/test"/>
  <classpathentry kind="src" path="src/plugin/creativecommons/src/java"/>
  <classpathentry kind="src" path="src/bin"/>
  <classpathentry kind="src" path="src/plugin/protocol-httpclient/src/java"/>
  <classpathentry kind="src" path="src/plugin/tld/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/java"/>
  <classpathentry kind="src" path="src/plugin/index-basic/src/test"/>
  <classpathentry kind="src" path="src/plugin/lib-http/src/java"/>
  <classpathentry kind="src" path="src/plugin/protocol-ftp/src/java"/>
  <classpathentry kind="src" path="src/plugin/index-anchor/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-validator/src/java"/>
  <classpathentry kind="src" path="src/plugin/index-more/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/test"/>
  <classpathentry kind="src" path="src/plugin/creativecommons/src/test"/>
  <classpathentry kind="src" path="src/plugin/microformats-reltag/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-regex/src/test"/>
  <classpathentry kind="src" path="src/plugin/lib-regex-filter/src/java"/>
  <classpathentry kind="src" path="src/plugin/index-more/src/test"/>
  <classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/java"/>
  <classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/java"/>
  <classpathentry kind="src" path="src/testresources"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=ivy%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fcreativecommons%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Ffeed%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-anchor%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-basic%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Findex-more%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flanguage-identifier%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-http%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-nekohtml%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-regex-filter%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Flib-xml%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fmicroformats-reltag%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fnutch-extensionpoints%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-ext%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-html%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-js%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-swf%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-tika%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fparse-zip%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-file%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-ftp%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-http%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-httpclient%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fprotocol-sftp%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fscoring-link%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fscoring-opic%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Fsubcollection%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Ftld%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-automaton%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-domain%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-prefix%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-regex%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-suffix%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlfilter-validator%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-basic%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-pass%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.apache.ivyde.eclipse.cpcontainer.IVYDE_CONTAINER/?project=nutch-2.1&amp;ivyXmlPath=src%2Fplugin%2Furlnormalizer-regex%2Fivy.xml&amp;confs=*"/>
  <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
  <classpathentry kind="con" path="org.eclipse.jdt.junit.JUNIT_CONTAINER/4"/>
  <classpathentry kind="lib" path="lib/org.restlet-2.0.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.example.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.atom_1.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.atom.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.crypto.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.fileupload_1.2.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.freemarker_2.3.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.freemarker.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.grizzly.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.gwt.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.httpclient.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jaas.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jackson.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jaxb_2.1.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jaxrs_1.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jaxrs-2.0-RC3.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.jibx_1.1.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.json_2.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.json.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.net.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.odata.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.rdf.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.servlet-2.0-RC3.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.servlet-2.0.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.servlet.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.spring_2.5.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.spring-2.0.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.velocity_1.5.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.wadl_1.0.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.xml.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.ext.xstream.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.gae-2.0-RC3.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.gwt.jar"/>
  <classpathentry kind="lib" path="lib/org.restlet.lib.org.json-2.0.jar"/>
  <classpathentry kind="lib" path="src/plugin/urlfilter-automaton/lib/automaton.jar"/>
  <classpathentry kind="lib" path="lib/mysql-connector-java-5.0.7.jar"/>
  <classpathentry kind="output" path="bin"/>
</classpath>

After refreshing the project, the build path matches the setup described above.
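
Writing out the plugin source folders by hand is error-prone, so here is a minimal sketch that generates the `kind="src"` entries by scanning src/plugin. It assumes the checkout layout used in these notes (src/plugin/<name>/src/java and src/plugin/<name>/src/test); the function name `plugin_source_entries` is made up for illustration, and the output still has to be pasted into .classpath and checked manually.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr


def plugin_source_entries(project_root):
    """Return <classpathentry kind="src"> lines for every plugin source folder.

    Returns an empty list if project_root has no src/plugin directory.
    """
    root = Path(project_root)
    plugin_root = root / "src" / "plugin"
    if not plugin_root.is_dir():
        return []
    entries = []
    for plugin_dir in sorted(plugin_root.iterdir()):
        for sub in ("src/java", "src/test"):
            folder = plugin_dir / sub
            # Only plugins that really have this folder get an entry.
            if folder.is_dir():
                rel = folder.relative_to(root).as_posix()
                entries.append(
                    '<classpathentry kind="src" path=%s/>' % quoteattr(rel)
                )
    return entries


if __name__ == "__main__":
    # Run from the project root; prints one entry per line.
    for line in plugin_source_entries("."):
        print(line)
```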

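Next, in Order and Export, move conf to the top so that it is loaded first.

Once that is done, the next step is pulling in the dependency JARs.

With the Ivy plugin installed, you can right-click ivy.xml directly and add it as an Ivy library.

Just click Finish and the JARs are downloaded automatically. Note that there are many ivy.xml files; every one that declares JARs needs Add Ivy Library run on it once.

Tracking them all down takes a while.

After all the ivy.xml files have been imported, a few JARs will still be missing.

(Download those from the web yourself; I added a few extra JARs of my own here.)

One more package, hadoop-core, needs a change in FileUtil.java.

For details, see http://yangshangchuan.iteye.com/blog/1839784

Excerpted here (without the fix, an error is reported at run time):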

Error message:
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-ysc\mapred\staging\ysc-2036315919\.staging to 0700

Official bug reference:
https://issues.apache.org/jira/browse/HADOOP-7682

Solution:
1. Download and extract http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-1.1.2/hadoop-1.1.2.tar.gz
2. Edit hadoop-1.1.2\src\core\org\apache\hadoop\fs\FileUtil.java: search for "Failed to set permissions of path", find line 689, and change throw new IOException to LOG.warn
3. Edit hadoop-1.1.2\build.xml: search for autoreconf and remove the 6 matching exec elements that have executable="autoreconf"
4. Download and extract Ant, and add the bin directory of the Ant installation to the PATH environment variable
5. In a Cygwin shell, change to the hadoop-1.1.2 directory and run ant
6. Replace Nutch's hadoop-core-1.0.3.jar with the newly built hadoop-1.1.2\build\hadoop-core-1.1.3-SNAPSHOT.jar
7. For Eclipse development, replace C:\Users\ysc\.ivy2\cache\org.apache.hadoop\hadoop-core\jars\hadoop-core-1.1.2.jar

The JAR in the attachment was built from a modified Hadoop 1.2.1 and can be used with Nutch 1.7; other Nutch versions have not been tested.

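When I made this change, I simply downloaded the prebuilt JAR below and used it to replace the hadoop-core JAR in the Ivy cache, keeping the same file name.

Download: http://pan.baidu.com/s/1i3FBLEP

Next comes the configuration.

In gora.properties under nutch2.1/conf, add: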

    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
    gora.sqlstore.jdbc.user=root
    gora.sqlstore.jdbc.password=root
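
Before running the crawler it can help to double-check what these settings actually point at. The following is a small sketch, not part of Nutch or Gora: it parses the four properties with a simplified parser (`parse_properties` and `describe_jdbc_url` are made-up helper names) and splits the JDBC URL into host, port, database, and options.

```python
from urllib.parse import urlparse, parse_qs

GORA_PROPERTIES = """\
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=root
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so the URL's query string stays intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def describe_jdbc_url(jdbc_url):
    """Split a jdbc:mysql://host:port/db?opts URL into its parts."""
    parsed = urlparse(jdbc_url[len("jdbc:"):])
    return {
        "host": parsed.hostname,
        "port": parsed.port,
        "database": parsed.path.lstrip("/"),
        "options": parse_qs(parsed.query),
    }


props = parse_properties(GORA_PROPERTIES)
print(describe_jdbc_url(props["gora.sqlstore.jdbc.url"]))
```

The database name in the URL should match the one created in the first step (nutch).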

Comment out the other database connection settings.

In ivy/ivy.xml, uncomment the mysql-connector dependency.

Add the following inside the configuration element of conf/nutch-site.xml.template:

    <property>
      <name>http.agent.name</name>
      <value>Your Nutch Spider</value>
    </property>
    <property>
      <name>http.accept.language</name>
      <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
      <description>Value of the "Accept-Language" request header field.
      This allows selecting non-English language as default one to retrieve.
      It is a useful setting for search engines build for certain national group.</description>
    </property>
    <property>
      <name>parser.character.encoding.default</name>
      <value>utf-8</value>
      <description>The character encoding to fall back to when no other information
      is available</description>
    </property>
    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
      <description>Regular expression naming plugin directory names to
      include.  Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.</description>
    </property>
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.sql.store.SqlStore</value>
      <description>The Gora DataStore class for storing and retrieving data.
      Currently the following stores are available: ….</description>
    </property>
    <property>
      <name>plugin.folders</name>
      <value>./src/plugin</value>
      <description>Directories where nutch plugins are located.  Each
      element may be a relative or absolute path.  If absolute, it is used
      as is.  If relative, it is searched for on the classpath.</description>
    </property>
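
To see which plugin directories the plugin.includes expression selects, here is a minimal sketch. It assumes Nutch applies the expression as a full match against each plugin directory name, and it uses Python's `re` module as a stand-in for the Java regex engine (this particular pattern uses only syntax common to both). The candidate names are plugin directories mentioned in the .classpath above.

```python
import re

# The plugin.includes value from nutch-site.xml above.
PLUGIN_INCLUDES = (
    "protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)"
    "|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic"
)

# A few plugin directory names taken from the .classpath listing.
CANDIDATES = [
    "protocol-http", "protocol-httpclient", "protocol-ftp",
    "parse-html", "parse-tika", "parse-js",
    "index-basic", "index-more",
    "scoring-opic", "scoring-link",
]


def is_included(plugin_name, includes=PLUGIN_INCLUDES):
    """True if the whole plugin name matches the includes expression."""
    return re.fullmatch(includes, plugin_name) is not None


for name in CANDIDATES:
    print(name, "included" if is_included(name) else "excluded")
```

With this value, for example, parse-js, index-more, protocol-ftp, and scoring-link are left out even though their source folders are on the build path.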

In build.xml in the project root, find the following target:

    <target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
      <ivy:resolve file="${ivy.file}" conf="default" log="download-only" />
      <ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
      <antcall target="copy-libs" />
    </target>

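Change the original

    pattern="${build.lib.dir}/[artifact]-[revision].[ext]"

to

    pattern="${build.lib.dir}/[artifact]-[type]-[revision].[ext]"

This avoids a build failure when Ivy retrieves the dependencies again. The reason: Ivy downloads both the class JAR and the source JAR, and with the original pattern the two files end up with the same name and cannot be told apart, which triggers a duplicate-file error.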
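
The name clash is easy to reproduce on paper. The sketch below substitutes Ivy-style tokens into both patterns for two made-up artifacts of the same module, one of type jar and one of type source, both with the jar extension. The artifact name `example-lib` and the helper functions are purely illustrative; this is plain string substitution, not Ivy's own code.

```python
from collections import Counter

OLD_PATTERN = "[artifact]-[revision].[ext]"
NEW_PATTERN = "[artifact]-[type]-[revision].[ext]"

# Hypothetical artifacts: the class JAR and the source JAR of one module.
ARTIFACTS = [
    {"artifact": "example-lib", "revision": "1.0", "type": "jar", "ext": "jar"},
    {"artifact": "example-lib", "revision": "1.0", "type": "source", "ext": "jar"},
]


def render(pattern, artifact):
    """Replace each [token] in the pattern with the artifact's value."""
    name = pattern
    for token, value in artifact.items():
        name = name.replace("[%s]" % token, value)
    return name


def collisions(pattern, artifacts=ARTIFACTS):
    """Return the file names that more than one artifact maps to."""
    counts = Counter(render(pattern, a) for a in artifacts)
    return [name for name, n in counts.items() if n > 1]


print("old pattern collisions:", collisions(OLD_PATTERN))
print("new pattern collisions:", collisions(NEW_PATTERN))
```

With the old pattern both artifacts map to example-lib-1.0.jar; adding [type] gives example-lib-jar-1.0.jar and example-lib-source-1.0.jar, so nothing collides.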

After completing the steps above, run the Ant build from build.xml; this generates the runtime directory.

In the project root, add a urls folder containing a seed.txt file with one site URL in it, for example: http://nutch.apache.org/
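
If you prefer to script this step, a minimal sketch follows. It only creates the urls/seed.txt layout described above, one URL per line; the helper name `write_seed_file` is made up for illustration.

```python
from pathlib import Path


def write_seed_file(project_root, urls):
    """Create <project_root>/urls/seed.txt with one URL per line."""
    urls_dir = Path(project_root) / "urls"
    urls_dir.mkdir(parents=True, exist_ok=True)
    seed = urls_dir / "seed.txt"
    # Skip empty entries and end every URL with a newline.
    lines = [u.strip() + "\n" for u in urls if u.strip()]
    seed.write_text("".join(lines), encoding="utf-8")
    return seed


# Example (run from the project root):
# write_seed_file(".", ["http://nutch.apache.org/"])
```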
Open the Crawler class in the crawl package under src/java and launch it through Run Configurations.

The first tab is already filled in by default.


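Select the second tab, Arguments, and enter the following (the first line is the program arguments, the second the VM arguments):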

urls -depth 3 -topN 5
-Xms64m -Xmx512m


Finally, click Run to crawl the site's links.

When the run completes, it prints:

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://nutch.apache.org/
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0.2 pages/s, 84 84 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:    false
ParserJob: parsing all
Parsing http://nutch.apache.org/
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 6 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://cassandra.apache.org/
fetching http://nutch.apache.org/
fetching http://accumulo.apache.org/
fetching http://avro.apache.org/
fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
fetching http://code.google.com/p/crawler-commons/
-finishing thread FetcherThread1, activeThreads=9
-finishing thread FetcherThread2, activeThreads=8
-finishing thread FetcherThread3, activeThreads=7
-finishing thread FetcherThread6, activeThreads=6
-finishing thread FetcherThread0, activeThreads=5
-finishing thread FetcherThread8, activeThreads=4
-finishing thread FetcherThread7, activeThreads=3
-finishing thread FetcherThread9, activeThreads=2
0/2 spinwaiting/active, 4 pages, 0 errors, 0.8 0.8 pages/s, 136 136 kb/s, 0 URLs in 2 queues
0/2 spinwaiting/active, 4 pages, 0 errors, 0.4 0.0 pages/s, 68 0 kb/s, 0 URLs in 2 queues
0/2 spinwaiting/active, 4 pages, 0 errors, 0.3 0.0 pages/s, 45 0 kb/s, 0 URLs in 2 queues
0/2 spinwaiting/active, 4 pages, 0 errors, 0.2 0.0 pages/s, 34 0 kb/s, 0 URLs in 2 queues
fetch of http://code.google.com/p/crawler-commons/ failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread4, activeThreads=1
fetch of http://blog.foofactory.fi/2007/03/twice-speed-half-size.html failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread5, activeThreads=0
0/0 spinwaiting/active, 6 pages, 2 errors, 0.2 0.4 pages/s, 27 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:    false
ParserJob: parsing all
Skipping http://sched.co/1pav9xl; different batch id (null)
Skipping http://sched.co/1pbE15n; different batch id (null)
Skipping http://t.co/k3VLhbJQhg; different batch id (null)
Skipping http://www.eu.apachecon.com/c/aceu2009/; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/136; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/137; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/138; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/165; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/197; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/201; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/250; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/251; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/schedule; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/331; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/332; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/333; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/334; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/335; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/375; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/427; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/428; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/430; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/437; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/461; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/462; different batch id (null)
Skipping http://www.cafepress.com/nutch; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106189987/; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106200690/; different batch id (null)
Skipping https://www.flickr.com/photos/mrmuskrat/3637703614/; different batch id (null)
Skipping https://www.flickr.com/photos/splorp/3981832163/; different batch id (null)
Skipping https://www.google-melange.com/gsoc/homepage/google/gsoc2014; different batch id (null)
Parsing http://code.google.com/p/crawler-commons/
Skipping https://twitter.com/ApacheNutch; different batch id (null)
Skipping https://twitter.com/ApacheNutch/status/591359830171856896; different batch id (null)
Skipping https://twitter.com/cutting/status/233415059798372353; different batch id (null)
Skipping https://twitter.com/TheASF; different batch id (null)
Skipping http://www.brics.dk/automaton/; different batch id (null)
Skipping http://www.brics.dk/automaton/automaton; different batch id (null)
Parsing http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
Parsing http://accumulo.apache.org/
Parsing http://avro.apache.org/
Skipping https://builds.apache.org/view/M-R/view/Nutch/; different batch id (null)
Parsing http://cassandra.apache.org/
Skipping https://cwiki.apache.org/confluence/display/solr/SolrCloud; different batch id (null)
Skipping http://gora.apache.org/; different batch id (null)
Skipping http://hadoop.apache.org/; different batch id (null)
Skipping http://hbase.apache.org/; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1047; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1591; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-841; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH/; different batch id (null)
Skipping http://lucene.apache.org/; different batch id (null)
Skipping http://lucene.apache.org/solr; different batch id (null)
Skipping http://lucene.apache.org/solr/; different batch id (null)
Parsing http://nutch.apache.org/
Skipping http://nutch.apache.org/bot.html; different batch id (null)
Skipping http://nutch.apache.org/credits.html; different batch id (null)
Skipping http://nutch.apache.org/downloads.html; different batch id (null)
Skipping http://nutch.apache.org/index.html; different batch id (null)
Skipping http://nutch.apache.org/javadoc.html; different batch id (null)
Skipping http://nutch.apache.org/mailing_lists.html; different batch id (null)
Skipping http://nutch.apache.org/version_control.html; different batch id (null)
Skipping http://s.apache.org/1.9-release; different batch id (null)
Skipping http://s.apache.org/1zE; different batch id (null)
Skipping http://s.apache.org/LPB; different batch id (null)
Skipping http://s.apache.org/nutch10; different batch id (null)
Skipping http://s.apache.org/nutch_2.3; different batch id (null)
Skipping http://s.apache.org/oHY; different batch id (null)
Skipping http://s.apache.org/PGa; different batch id (null)
Skipping http://tika.apache.org/; different batch id (null)
Skipping http://tika.apache.org/1.2/index.html; different batch id (null)
Skipping https://whimsy.apache.org/board/minutes/Nutch.html; different batch id (null)
Skipping http://wicket.apache.org/; different batch id (null)
Skipping http://wiki.apache.org/nutch/; different batch id (null)
Skipping http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer; different batch id (null)
Skipping http://wiki.apache.org/nutch/FAQ; different batch id (null)
Skipping http://wiki.apache.org/nutch/NutchPropertiesCompleteList; different batch id (null)
Skipping https://wiki.apache.org/nutch/FrontPage; different batch id (null)
Skipping https://wiki.apache.org/nutch/NutchRESTAPI; different batch id (null)
Skipping http://www.apache.org/; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.6/CHANGES_1.6.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.7/1.7-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.8/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.9/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.0/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.1/CHANGES-2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2/2.2-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.9.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.0.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.2.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.3.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.4.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.5.txt; different batch id (null)
Skipping http://www.apache.org/dyn/closer.cgi/nutch/; different batch id (null)
Skipping http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt; different batch id (null)
Skipping http://www.apache.org/foundation/sponsorship.html; different batch id (null)
Skipping http://www.apache.org/foundation/thanks.html; different batch id (null)
Skipping http://www.apache.org/licenses/; different batch id (null)
Skipping http://www.apache.org/licenses/LICENSE-2.0; different batch id (null)
Skipping http://www.apache.org/security/; different batch id (null)
Skipping http://creativecommons.org/press-releases/entry/5064; different batch id (null)
Skipping https://creativecommons.org/licenses/by-sa/2.0/; different batch id (null)
Skipping http://www.elasticsearch.org/; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-europe; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-north-america; different batch id (null)
Skipping http://search.maven.org/; different batch id (null)
Skipping http://mongodb.org/; different batch id (null)
Skipping http://osuosl.org/news_folder/nutch; different batch id (null)
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
fetching http://cassandra.apache.org/
fetching http://nutch.apache.org/
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://accumulo.apache.org/
fetching http://avro.apache.org/
QueueFeeder finished: total 11 records. Hit by time limit :0
fetching http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
fetching http://www.apache.org/foundation/sponsorship.html
fetching http://code.google.com/p/crawler-commons/
fetching http://www.apache.org/security/
7/10 spinwaiting/active, 5 pages, 0 errors, 1.0 1.0 pages/s, 169 169 kb/s, 3 URLs in 3 queues
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1445831574525
  now           = 1445831574814
  0. http://www.apache.org/foundation/thanks.html
  1. http://www.apache.org/licenses/
  2. http://www.apache.org/
fetching http://www.apache.org/foundation/thanks.html
8/10 spinwaiting/active, 7 pages, 0 errors, 0.7 0.4 pages/s, 113 57 kb/s, 2 URLs in 3 queues
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1445831583211
  now           = 1445831579817
  0. http://www.apache.org/licenses/
  1. http://www.apache.org/
fetching http://www.apache.org/licenses/
8/10 spinwaiting/active, 8 pages, 0 errors, 0.5 0.2 pages/s, 86 31 kb/s, 1 URLs in 3 queues
* queue: http://www.apache.org
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 4000
  minCrawlDelay = 0
  nextFetchTime = 1445831587582
  now           = 1445831584820
  0. http://www.apache.org/
fetching http://www.apache.org/
-finishing thread FetcherThread9, activeThreads=8
-finishing thread FetcherThread2, activeThreads=8
-finishing thread FetcherThread0, activeThreads=7
-finishing thread FetcherThread1, activeThreads=6
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread3, activeThreads=4
-finishing thread FetcherThread5, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
0/2 spinwaiting/active, 9 pages, 0 errors, 0.5 0.2 pages/s, 84 81 kb/s, 0 URLs in 2 queues
fetch of http://blog.foofactory.fi/2007/03/twice-speed-half-size.html failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread8, activeThreads=1
fetch of http://code.google.com/p/crawler-commons/ failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
-finishing thread FetcherThread6, activeThreads=0
0/0 spinwaiting/active, 11 pages, 2 errors, 0.4 0.4 pages/s, 67 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming:    false
ParserJob: forced reparse:    false
ParserJob: parsing all
Skipping http://sched.co/1pav9xl; different batch id (null)
Skipping http://sched.co/1pbE15n; different batch id (null)
Skipping http://t.co/k3VLhbJQhg; different batch id (null)
Skipping http://accumulosummit.com/; different batch id (null)
Skipping http://www.amazon.com/Cassandra-High-Availability-Robbie-Strickland/dp/1783989122; different batch id (null)
Skipping http://www.eu.apachecon.com/c/aceu2009/; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/136; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/137; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/138; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/165; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/197; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/201; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/250; different batch id (null)
Skipping http://eu.apachecon.com/c/aceu2009/sessions/251; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/schedule; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/331; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/332; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/333; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/334; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/335; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/375; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/427; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/428; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/430; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/437; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/461; different batch id (null)
Skipping http://www.us.apachecon.com/c/acus2009/sessions/462; different batch id (null)
Skipping http://www.cafepress.com/nutch; different batch id (null)
Skipping http://www.datastax.com/dev/blog/2012-in-review-performance; different batch id (null)
Skipping http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_intro_c.html; different batch id (null)
Skipping http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_primary_index_c.html; different batch id (null)
Skipping http://www.datastax.com/resources/whitepapers/benchmarking-top-nosql-databases; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106189987/; different batch id (null)
Skipping https://www.flickr.com/photos/andrewfhart/8106200690/; different batch id (null)
Skipping https://www.flickr.com/photos/mrmuskrat/3637703614/; different batch id (null)
Skipping https://www.flickr.com/photos/splorp/3981832163/; different batch id (null)
Skipping http://getbootstrap.com/; different batch id (null)
Skipping https://github.com/apache/accumulo; different batch id (null)
Skipping http://glyphicons.com/; different batch id (null)
Skipping https://www.google-melange.com/gsoc/homepage/google/gsoc2014; different batch id (null)
Parsing http://code.google.com/p/crawler-commons/
Skipping http://research.google.com/archive/bigtable.html; different batch id (null)
Skipping https://www.linkedin.com/groups/Apache-Accumulo-Professionals-4554913; different batch id (null)
Skipping http://blog.markedup.com/2013/02/cassandra-hive-and-hadoop-how-we-picked-our-analytics-stack/; different batch id (null)
Skipping http://maxgrinev.com/2010/07/12/do-you-really-need-sql-to-do-it-all-in-cassandra/; different batch id (null)
Skipping http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html; different batch id (null)
Skipping https://twitter.com/apacheaccumulo; different batch id (null)
Skipping https://twitter.com/ApacheNutch; different batch id (null)
Skipping https://twitter.com/ApacheNutch/status/591359830171856896; different batch id (null)
Skipping https://twitter.com/cutting/status/233415059798372353; different batch id (null)
Skipping https://twitter.com/TheASF; different batch id (null)
Skipping http://www.brics.dk/automaton/; different batch id (null)
Skipping http://www.brics.dk/automaton/automaton; different batch id (null)
Parsing http://blog.foofactory.fi/2007/03/twice-speed-half-size.html
Skipping http://fontawesome.io/; different batch id (null)
Skipping http://freenode.net/; different batch id (null)
Skipping http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra; different batch id (null)
Skipping http://www.slideshare.net/daveconnors/cassandra-puppet-scaling-data-at-15-per-month; different batch id (null)
Skipping http://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376; different batch id (null)
Skipping http://www.slideshare.net/jbellis; different batch id (null)
Skipping http://www.slideshare.net/jbellis/cassandra-at-nosql-matters-2012; different batch id (null)
Skipping http://www.slideshare.net/planetcassandra/3-mohit-anchlia; different batch id (null)
Skipping http://www.slideshare.net/planetcassandra/nyc-tech-day-using-cassandra-for-dvr-scheduling-at-comcast; different batch id (null)
Skipping http://www.slideshare.net/slideshow/embed_code/15832310; different batch id (null)
Parsing http://accumulo.apache.org/
Skipping http://accumulo.apache.org/1.5/accumulo_user_manual.html; different batch id (null)
Skipping http://accumulo.apache.org/1.5/apidocs; different batch id (null)
Skipping http://accumulo.apache.org/1.5/examples; different batch id (null)
Skipping http://accumulo.apache.org/1.6/accumulo_user_manual.html; different batch id (null)
Skipping http://accumulo.apache.org/1.6/apidocs; different batch id (null)
Skipping http://accumulo.apache.org/1.6/examples; different batch id (null)
Skipping http://accumulo.apache.org/1.7/accumulo_user_manual.html; different batch id (null)
Skipping http://accumulo.apache.org/1.7/apidocs; different batch id (null)
Skipping http://accumulo.apache.org/1.7/examples; different batch id (null)
Skipping http://accumulo.apache.org/bylaws.html; different batch id (null)
Skipping http://accumulo.apache.org/contrib.html; different batch id (null)
Skipping http://accumulo.apache.org/downloads; different batch id (null)
Skipping http://accumulo.apache.org/downloads/; different batch id (null)
Skipping http://accumulo.apache.org/get_involved.html; different batch id (null)
Skipping http://accumulo.apache.org/git.html; different batch id (null)
Skipping http://accumulo.apache.org/glossary.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/consensusBuilding.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/lazyConsensus.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/releasing.html; different batch id (null)
Skipping http://accumulo.apache.org/governance/voting.html; different batch id (null)
Skipping http://accumulo.apache.org/index.html; different batch id (null)
Skipping http://accumulo.apache.org/mailing_list.html; different batch id (null)
Skipping http://accumulo.apache.org/notable_features.html; different batch id (null)
Skipping http://accumulo.apache.org/old_documentation.html; different batch id (null)
Skipping http://accumulo.apache.org/papers.html; different batch id (null)
Skipping http://accumulo.apache.org/people.html; different batch id (null)
Skipping http://accumulo.apache.org/projects.html; different batch id (null)
Skipping http://accumulo.apache.org/rb.html; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/1.5.4.html; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/1.6.4.html; different batch id (null)
Skipping http://accumulo.apache.org/release_notes/1.7.0.html; different batch id (null)
Skipping http://accumulo.apache.org/releasing.html; different batch id (null)
Skipping http://accumulo.apache.org/screenshots.html; different batch id (null)
Skipping http://accumulo.apache.org/source.html; different batch id (null)
Skipping http://accumulo.apache.org/verifying_releases.html; different batch id (null)
Skipping http://accumulo.apache.org/versioning.html; different batch id (null)
Parsing http://avro.apache.org/
Skipping http://avro.apache.org/credits.html; different batch id (null)
Skipping http://avro.apache.org/docs/1.6.3; different batch id (null)
Skipping http://avro.apache.org/docs/1.7.7; different batch id (null)
Skipping http://avro.apache.org/docs/current; different batch id (null)
Skipping http://avro.apache.org/docs/current/; different batch id (null)
Skipping http://avro.apache.org/index.html; different batch id (null)
Skipping http://avro.apache.org/irc.html; different batch id (null)
Skipping http://avro.apache.org/issue_tracking.html; different batch id (null)
Skipping http://avro.apache.org/mailing_lists.html; different batch id (null)
Skipping http://avro.apache.org/releases.html; different batch id (null)
Skipping http://avro.apache.org/version_control.html; different batch id (null)
Skipping http://blogs.apache.org/accumulo; different batch id (null)
Skipping https://blogs.apache.org/accumulo/; different batch id (null)
Skipping https://builds.apache.org/view/A-D/view/Accumulo/; different batch id (null)
Skipping https://builds.apache.org/view/M-R/view/Nutch/; different batch id (null)
Parsing http://cassandra.apache.org/
Skipping http://cassandra.apache.org/download/; different batch id (null)
Skipping http://cassandra.apache.org/privacy.html; different batch id (null)
Skipping https://cwiki.apache.org/confluence/display/AVRO/How+To+Contribute; different batch id (null)
Skipping https://cwiki.apache.org/confluence/display/AVRO/Index; different batch id (null)
Skipping https://cwiki.apache.org/confluence/display/solr/SolrCloud; different batch id (null)
Skipping http://forrest.apache.org/; different batch id (null)
Skipping http://gora.apache.org/; different batch id (null)
Skipping http://hadoop.apache.org/; different batch id (null)
Skipping http://hadoop.apache.org/privacy_policy.html; different batch id (null)
Skipping http://hbase.apache.org/; different batch id (null)
Skipping https://issues.apache.org/jira/browse/accumulo; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1047; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-1591; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH-841; different batch id (null)
Skipping https://issues.apache.org/jira/browse/NUTCH/; different batch id (null)
Skipping http://lucene.apache.org/; different batch id (null)
Skipping http://lucene.apache.org/solr; different batch id (null)
Skipping http://lucene.apache.org/solr/; different batch id (null)
Parsing http://nutch.apache.org/
Skipping http://nutch.apache.org/bot.html; different batch id (null)
Skipping http://nutch.apache.org/credits.html; different batch id (null)
Skipping http://nutch.apache.org/downloads.html; different batch id (null)
Skipping http://nutch.apache.org/index.html; different batch id (null)
Skipping http://nutch.apache.org/javadoc.html; different batch id (null)
Skipping http://nutch.apache.org/mailing_lists.html; different batch id (null)
Skipping http://nutch.apache.org/version_control.html; different batch id (null)
Skipping http://s.apache.org/1.9-release; different batch id (null)
Skipping http://s.apache.org/1zE; different batch id (null)
Skipping http://s.apache.org/LPB; different batch id (null)
Skipping http://s.apache.org/nutch10; different batch id (null)
Skipping http://s.apache.org/nutch_2.3; different batch id (null)
Skipping http://s.apache.org/oHY; different batch id (null)
Skipping http://s.apache.org/PGa; different batch id (null)
Skipping http://thrift.apache.org/; different batch id (null)
Skipping http://tika.apache.org/; different batch id (null)
Skipping http://tika.apache.org/1.2/index.html; different batch id (null)
Skipping https://whimsy.apache.org/board/minutes/Nutch.html; different batch id (null)
Skipping http://wicket.apache.org/; different batch id (null)
Skipping http://wiki.apache.org/cassandra; different batch id (null)
Skipping http://wiki.apache.org/cassandra/Durability; different batch id (null)
Skipping http://wiki.apache.org/cassandra/FAQ; different batch id (null)
Skipping http://wiki.apache.org/cassandra/GettingStarted; different batch id (null)
Skipping http://wiki.apache.org/cassandra/HintedHandoff; different batch id (null)
Skipping http://wiki.apache.org/cassandra/HowToContribute; different batch id (null)
Skipping http://wiki.apache.org/cassandra/ReadRepair; different batch id (null)
Skipping http://wiki.apache.org/cassandra/ThirdPartySupport; different batch id (null)
Skipping http://wiki.apache.org/nutch/; different batch id (null)
Skipping http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer; different batch id (null)
Skipping http://wiki.apache.org/nutch/FAQ; different batch id (null)
Skipping http://wiki.apache.org/nutch/NutchPropertiesCompleteList; different batch id (null)
Skipping https://wiki.apache.org/nutch/FrontPage; different batch id (null)
Skipping https://wiki.apache.org/nutch/NutchRESTAPI; different batch id (null)
Parsing http://www.apache.org/
Skipping http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.6/CHANGES_1.6.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.7/1.7-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.8/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/1.9/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.0/CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.1/CHANGES-2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/2.2/2.2-CHANGES.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.8.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-0.9.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.0.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.1.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.2.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.3.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.4.txt; different batch id (null)
Skipping http://www.apache.org/dist/nutch/CHANGES-1.5.txt; different batch id (null)
Skipping http://www.apache.org/dyn/closer.cgi/nutch/; different batch id (null)
Skipping http://www.apache.org/foundation/policies/conduct.html; different batch id (null)
Skipping http://www.apache.org/foundation/records/minutes/2010/board_minutes_2010_04_21.txt; different batch id (null)
Parsing http://www.apache.org/foundation/sponsorship.html
Parsing http://www.apache.org/foundation/thanks.html
Parsing http://www.apache.org/licenses/
Skipping http://www.apache.org/licenses/LICENSE-2.0; different batch id (null)
Parsing http://www.apache.org/security/
Skipping http://zookeeper.apache.org/; different batch id (null)
Skipping http://creativecommons.org/press-releases/entry/5064; different batch id (null)
Skipping https://creativecommons.org/licenses/by-sa/2.0/; different batch id (null)
Skipping http://www.elasticsearch.org/; different batch id (null)
Skipping http://hypertable.org/; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-europe; different batch id (null)
Skipping http://events.linuxfoundation.org/events/apachecon-north-america; different batch id (null)
Skipping http://search.maven.org/; different batch id (null)
Skipping http://mongodb.org/; different batch id (null)
Skipping http://osuosl.org/news_folder/nutch; different batch id (null)
Skipping http://www.planetcassandra.org/; different batch id (null)
Skipping http://planetcassandra.org/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/analytics-at-github-with-apache-cassandra/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/cassandra-at-cern-large-hadron-collider/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/cassandra-used-to-build-scalable-and-highly-available-systems-at-hulu-streaming-content-to-over-5-million-subscribers/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/godaddy-worlds-largest-domain-name-registrar-and-web-host-provider-utilizes-cassandra-for-replication-and-scalability/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/instagram-making-the-switch-to-cassandra-from-redis-75-instasavings/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/make-it-rain-apache-cassandra-at-the-weather-channel-for-severe-weather-alerts/; different batch id (null)
Skipping http://planetcassandra.org/blog/post/reddit-upvotes-apache-cassandras-horizontal-scaling-managing-17000000-votes-daily/; different batch id (null)
Skipping http://planetcassandra.org/companies/; different batch id (null)
Skipping http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf; different batch id (null)
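
The console output above is long, so a small script can help to summarize it. What follows is only a rough sketch, not part of Nutch: it assumes the output has been saved as text, and the names SAMPLE_LOG, PREFIXES and summarize are made up for this illustration. It counts the "fetching", "fetch of ... failed", "Parsing" and "Skipping" lines and reads the page and error totals from the last "spinwaiting/active" status line. The many "Skipping ...; different batch id (null)" lines appear to be rows that do not belong to the current batch (for example outlinks that were discovered but not fetched yet), which would explain why only the fetched URLs show up as "Parsing".

```python
import re
from collections import Counter

# A few lines copied from the console output above, used as sample input.
SAMPLE_LOG = """\
fetching http://nutch.apache.org/
fetching http://code.google.com/p/crawler-commons/
fetch of http://code.google.com/p/crawler-commons/ failed with: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 10000 ms
0/0 spinwaiting/active, 11 pages, 2 errors, 0.4 0.4 pages/s, 67 0 kb/s, 0 URLs in 0 queues
Parsing http://nutch.apache.org/
Skipping http://nutch.apache.org/bot.html; different batch id (null)
Skipping http://tika.apache.org/; different batch id (null)
"""

# Line prefixes as they appear in the log, mapped to hypothetical counter names.
PREFIXES = {
    "fetching ": "fetch_attempts",
    "fetch of ": "fetch_failures",
    "Parsing ": "parsed",
    "Skipping ": "skipped",
}

# Matches the periodic fetcher status line, e.g.
# "0/0 spinwaiting/active, 11 pages, 2 errors, ..."
STATUS_RE = re.compile(r"^(\d+)/(\d+) spinwaiting/active, (\d+) pages, (\d+) errors")


def summarize(log_text):
    """Return (per-prefix line counts, totals from the last status line)."""
    counts = Counter()
    last_status = None
    for line in log_text.splitlines():
        for prefix, key in PREFIXES.items():
            if line.startswith(prefix):
                counts[key] += 1
                break
        match = STATUS_RE.match(line)
        if match:
            last_status = {
                "pages": int(match.group(3)),
                "errors": int(match.group(4)),
            }
    return dict(counts), last_status


if __name__ == "__main__":
    counts, status = summarize(SAMPLE_LOG)
    print(counts)
    print(status)
```

To run it against a real log, replace SAMPLE_LOG with the contents of the saved console output. The final status line in the run above reports 11 pages and 2 errors, which matches the two ConnectTimeoutException failures.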

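After the run, the crawled data has been inserted into the webpage table.

At this point the import into Eclipse is basically complete.

From here on it is a matter of studying Nutch at your own pace.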

---------------------------------------------------------------------------------

Another, simpler approach

File > New > Project > SVN > Checkout Projects from SVN
Create a new repository location >

URL: https://svn.apache.org/repos/asf/nutch/tags/release-1.7/

Select the URL > Finish. The New Project wizard opens; choose Java Project > Next,

enter the Project name: nutch1.7 > Finish

Setting up the environment

In the Package Explorer on the left, right-click the nutch1.7 folder > Build Path > Configure Build Path...
> open the Source tab > select src > Remove > Add Folder... > select src/bin, src/java, src/test and src/testresources
Switch to the Libraries tab >
Add Class Folder... > select nutch1.7/conf
Add Library... > IvyDE Managed Dependencies > Next > Main > Ivy File > Browse > ivy/ivy.xml > Finish
Switch to the Order and Export tab > select conf > Top > OK

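Finally: in the Package Explorer on the left, right-click the build.xml file under the nutch1.7 folder > Run As > Ant Build (then wait for it to finish)
In the Package Explorer, right-click the nutch1.7 folder > Refresh
In the Package Explorer, right-click the nutch1.7 folder > Build Path > Configure Build Path... > open the Libraries tab > Add Class Folder... > select build,
then wait for the workspace build to finish

OK, the whole project is now imported, with no red error markers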
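
As noted earlier for the 2.1 project, the build path can also be set by editing the project's .classpath file directly. The script below is a hedged sketch of what the Source and Libraries steps above might look like as .classpath entries. The ENTRIES list and the render_classpath helper are assumptions made for this example: source folders are written as kind="src", class folders (conf and build) as kind="lib", and conf is placed first to mirror the Order and Export > Top step. It deliberately leaves out the entries Eclipse and IvyDE generate themselves (JRE container, IvyDE container, output folder), so treat the result as a fragment to merge by hand rather than a complete file.

```python
from xml.sax.saxutils import quoteattr

# (kind, path) pairs taken from the Build Path steps above.
# "src" = source folder, "lib" = class folder. conf comes first to
# mirror "Order and Export > conf > Top". This list is illustrative only.
ENTRIES = [
    ("lib", "conf"),
    ("src", "src/bin"),
    ("src", "src/java"),
    ("src", "src/test"),
    ("src", "src/testresources"),
    ("lib", "build"),
]


def render_classpath(entries):
    """Render the given entries as the text of a minimal .classpath fragment."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<classpath>"]
    for kind, path in entries:
        # quoteattr adds the surrounding quotes and escapes special characters.
        lines.append(
            "\t<classpathentry kind=%s path=%s/>" % (quoteattr(kind), quoteattr(path))
        )
    lines.append("</classpath>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print only; copy the entries into .classpath by hand and refresh the project.
    print(render_classpath(ENTRIES))
```

The script only prints the fragment and never overwrites the existing .classpath; after merging the entries, refresh the project and check that nothing was added by mistake.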

Reposted from: https://www.cnblogs.com/hwaggLee/p/4910931.html
