Setting Up a Nutch Environment on Ubuntu

Operating system: Ubuntu 16.04 LTS
Nutch version: 2.2.1

Before configuring Nutch you need to set up Ant; if you are not sure how, see my other article, "UBUNTU环境配置ANT" (Configuring Ant on Ubuntu). Then download Nutch from the official Nutch site. Note that version 2.3.1 failed to build for me: even after switching the maven2 repository, the build stayed stuck at the following output:

    root@ubuntu:/opt/apache-nutch-2.3.1# ant runtime
    Buildfile: /opt/apache-nutch-2.3.1/build.xml
    ivy-probe-antlib:
    ivy-download:
    ivy-download-unchecked:
    ivy-init-antlib:
    ivy-init:
    init:
        [mkdir] Created dir: /opt/apache-nutch-2.3.1/build
        [mkdir] Created dir: /opt/apache-nutch-2.3.1/build/classes
        [mkdir] Created dir: /opt/apache-nutch-2.3.1/build/release
        [mkdir] Created dir: /opt/apache-nutch-2.3.1/build/test
        [mkdir] Created dir: /opt/apache-nutch-2.3.1/build/test/classes
    clean-lib:
    resolve-default:
    [ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ ::
    [ivy:resolve] :: loading settings :: file = /opt/apache-nutch-2.3.1/ivy/ivysettings.xml

So I gave up on 2.3.1 and installed Nutch 2.2.1 instead. Download address for Nutch 2.2.1: http://archive.apache.org/dist/nutch/2.2.1/

Firefox on Ubuntu saves downloads to ~/Downloads by default.

1. Extract the archive and move it to /opt

Change to the download directory, extract the archive, then move the extracted directory to /opt:

    cd ~/Downloads
    tar -xvf apache-nutch-2.2.1-src.tar.gz
    sudo mv apache-nutch-2.2.1 /opt/

2. Enable MySQL support in Nutch

Edit ${NUTCH_HOME}/ivy/ivy.xml. First uncomment the following line:

    <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Then change the following line from its default

    <dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>

to

    <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

Finally, uncomment the following line:

    <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />

3. Configure the database connection

Edit ${NUTCH_HOME}/conf/gora.properties. Comment out the default database connection settings and add the following, replacing xxxx with your MySQL user name and password:

    ###############################
    # Default MySQL properties
    ###############################
    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
    gora.sqlstore.jdbc.user=xxxx
    gora.sqlstore.jdbc.password=xxxx

4. Configure the table mapping

Edit ${NUTCH_HOME}/conf/gora-sql-mapping.xml and change the length of the primarykey from 512 to 767, that is:

    <primarykey column="id" length="767"/>

5. Edit the nutch-site.xml configuration file

You can simply save nutch-default.xml as nutch-site.xml. In ${NUTCH_HOME}/conf, copy the file (copy rather than move, so that nutch-default.xml stays in place) and open the copy:

    sudo cp nutch-default.xml nutch-site.xml
    sudo gedit nutch-site.xml

Add the following just before the closing </configuration> tag at the end of the file:

    <property>

        <name>http.agent.name</name>
        <value>YourNutchSpider</value>
    </property>
    <property>
        <name>http.accept.language</name>
        <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
        <description>Value of the Accept-Language request header field.
        This allows selecting non-English language as default one to retrieve.
        It is a useful setting for search engines build for certain national group.
        </description>
    </property>
    <property>

        <name>storage.data.store.class</name>
        <value>org.apache.gora.sql.store.SqlStore</value>
        <description>The Gora DataStore class for storing and retrieving data.
        Currently the following stores are available:.
        </description>
    </property>
    <property>
        <name>parser.character.encoding.default</name>
        <value>utf-8</value>
        <description>The character encoding to fall back to when no other information
        is available</description>
    </property>
    <property>
        <name>generate.batch.id</name>
        <value>*</value>
    </property>

6. Build with Ant

Change to the Nutch directory and start the build:

    cd ${NUTCH_HOME}
    ant runtime

Problems you may run into:

1) Insufficient permissions: creating directories such as build fails. Switch to a root shell with sudo -i and run the Ant build again.

2) The build prints:

    Trying to override old definition of task javac
    [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

First download sonar-ant-task-2.2.jar and copy it into this directory:

    ${NUTCH_HOME}/lib

Then open the build file with sudo gedit ${NUTCH_HOME}/build.xml, press Ctrl+F to open search, and enter antlib:org.sonar.ant to locate the block below. Add the last <classpath> element, the one containing the <fileset> (it was highlighted in red in the original post):

    <!-- Define the Sonar task if this hasn't been done in a common script -->
    <taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
        <classpath path="${ant.library.dir}" />
        <classpath path="${mysql.library.dir}" />
        <classpath><fileset dir="lib/" includes="sonar*.jar" /></classpath>
    </taskdef>

3) BUILD FAILED, with output such as:

    [ivy:resolve] :: com.google.code.findbugs#jsr305;1.3.9!jsr305.jar
    [ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
    [ivy:resolve]
    [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

    BUILD FAILED
    /opt/apache-nutch-2.2.1/build.xml:444: impossible to resolve dependencies:

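        resolve failed - see output for details

If you hit this, or any other dependency problem that ends in BUILD FAILED, changing the Maven central repository address may resolve it. Open the Ivy settings file with sudo gedit ${NUTCH_HOME}/ivy/ivysettings.xml and find the following:

    <property name="repo.maven.org"
        value="http://repo1.maven.org/maven2/"
        override="false"/>

Replace the Maven central address http://repo1.maven.org/maven2/ with the mirror hosted in China by OSChina (OSC): http://maven.oschina.net/content/groups/public/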
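As a rough sketch, the same replacement can be made from the command line instead of in gedit. The function name switch_ivy_repo and the .bak backup file are assumptions made for illustration and are not part of Nutch; the file path and both URLs are the ones given above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): replace the Maven central URL
# in an ivysettings.xml file with a mirror URL, keeping a .bak backup.
switch_ivy_repo() {
    settings_file="$1"
    mirror_url="$2"
    old_url="http://repo1.maven.org/maven2/"

    # Keep a copy so the change is easy to undo.
    cp "$settings_file" "$settings_file.bak" || return 1

    # '|' is used as the sed delimiter because both URLs contain '/'.
    sed "s|$old_url|$mirror_url|g" "$settings_file.bak" > "$settings_file"
}

# Example call, assuming NUTCH_HOME points at the Nutch source tree:
# switch_ivy_repo "$NUTCH_HOME/ivy/ivysettings.xml" \
#     "http://maven.oschina.net/content/groups/public/"
```

If the source tree under /opt is owned by root, the call needs to run from a root shell, as with the build itself. Restoring the .bak file undoes the change.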

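4) The build hangs at the following output:

    resolve-default:
    [ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ ::
    [ivy:resolve] :: loading settings :: file = /opt/apache-nutch-2.3.1/ivy/ivysettings.xml

Solution: be patient, because loading takes time. If nothing happens for more than 10 minutes, give up and try a different Maven repository (see problem 3).

A build typically takes about half an hour.

[Screenshot: successful build output]

7. Test crawl

7.1 Set the site to crawl

    cd ${NUTCH_HOME}/runtime/local
    sudo mkdir -p urls
    cd urls
    sudo gedit seed.txt

Enter one site in seed.txt, for example http://blog.csdn.net/u010317005, then save and close the file (if you edit with vi rather than gedit, type :wq).

7.2 Run the crawler

Run the crawl from ${NUTCH_HOME}/runtime/local (go back up with cd .. if you are still in urls):

    bin/nutch crawl urls -depth 3 -topN 5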
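The seed-file part of step 7 could be scripted roughly as follows, which avoids opening an editor. This is only a sketch: the function name prepare_seed is made up for illustration, and the crawl command is left as a comment because it needs the built runtime and a running MySQL instance.

```shell
#!/bin/sh
# Hypothetical helper: create urls/seed.txt under the given directory
# with a single seed URL, mirroring step 7.1 without a text editor.
prepare_seed() {
    base_dir="$1"
    seed_url="$2"

    mkdir -p "$base_dir/urls" || return 1
    # One URL per line; this overwrites any earlier seed list.
    printf '%s\n' "$seed_url" > "$base_dir/urls/seed.txt"
}

# Example, assuming NUTCH_HOME is set and `ant runtime` has succeeded:
# cd "$NUTCH_HOME/runtime/local"
# prepare_seed . "http://blog.csdn.net/u010317005"
# bin/nutch crawl urls -depth 3 -topN 5
```

Because the helper takes the base directory as an argument, the shell stays in runtime/local, which is where bin/nutch has to be run from.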
