Building your own crawler search engine on a Mac (Nutch + Elasticsearch was a failed attempt; switching to Scrapy + Elasticsearch)

Source: Internet
Author: User
Tags: kibana

1. Introduction

My project needs a crawler that can support personalized information retrieval and push notifications, so I looked at a variety of crawler frameworks. One of the more attractive options was this:

Building a search engine with Nutch + MongoDB + Elasticsearch + Kibana

English original: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/

I considered using Docker to build a test system.

The Docker image sources are as follows:

https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html

https://store.docker.com/community/images/pure/nutch-mongo

However, the Docker images were too slow to download, so I gave up on Docker.

Setting JAVA_HOME on the Mac:

vi ~/.bash_profile

export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$JAVA_HOME/bin:$PATH
export CLASS_PATH=$JAVA_HOME/lib
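After reloading the profile, it can help to confirm that JAVA_HOME really points at a JDK before moving on. The following is a minimal Python sketch of such a check. The helper names java_binary_path and describe_java_setup are made up for this illustration and are not part of any tool mentioned here.

```python
import os


def java_binary_path(env):
    """Return the expected path of the java executable under JAVA_HOME.

    Returns None when JAVA_HOME is missing or blank in the given mapping.
    """
    java_home = env.get("JAVA_HOME", "").strip()
    if not java_home:
        return None
    return os.path.join(java_home, "bin", "java")


def describe_java_setup(env):
    """Summarize in one sentence whether JAVA_HOME looks usable."""
    path = java_binary_path(env)
    if path is None:
        return "JAVA_HOME is not set"
    if not os.path.isfile(path):
        return f"JAVA_HOME is set, but {path} does not exist"
    return f"java found at {path}"


if __name__ == "__main__":
    # Inspect the environment of the current shell session.
    print(describe_java_setup(os.environ))
```

Run it from a new terminal so that the exports in ~/.bash_profile have been picked up.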

2. Installing MongoDB

Install it directly on the Mac with Homebrew. At the time of writing the latest version was 3.4.7.

After installation, create the /data/db directory and start the service with mongod.

To test it, connect with the mongo command and enter show dbs to list the databases.

brew install mongo
sudo mkdir -p /data/db
sudo chown -R <your username> /data
mongod

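Besides the mongo shell, a scripted check that mongod is accepting connections can be handy. The sketch below only probes the TCP port with the Python standard library and does not speak the MongoDB protocol. It assumes mongod listens on its default port 27017 on localhost; is_port_open is a hypothetical helper name.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 27017 is assumed here; adjust it if mongod was started on another port.
    state = "reachable" if is_port_open("127.0.0.1", 27017) else "not reachable"
    print(f"mongod on 127.0.0.1:27017 is {state}")
```

An open port only shows that something is listening; the show dbs test in the mongo shell remains the real confirmation.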
3. Installing Elasticsearch + Kibana

Download Elasticsearch. The latest version at the time was 5.5.1. URL: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.1.tar.gz

Modify the configuration:

$ vim config/elasticsearch.yml
cluster.name: my-application
node.name: "node-1"
node.master: true
node.data: true
path.data: /opt/elasticsearch/data
network.bind_host: 127.0.0.1
network.publish_host: 127.0.0.1
network.host: 127.0.0.1

Run command: bin/elasticsearch

Browser access: http://localhost:9200
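The browser check can also be scripted. Elasticsearch answers a GET on its root URL with a small JSON document that includes the node name, the cluster name, and the version number. The following is a minimal Python sketch, assuming the node is reachable at http://localhost:9200 with the settings shown; summarize_es_info and fetch_es_info are illustrative names and not part of any Elasticsearch client.

```python
import json
import urllib.request


def summarize_es_info(body: str) -> dict:
    """Pick the fields of interest out of the root-endpoint JSON response."""
    info = json.loads(body)
    return {
        "cluster_name": info.get("cluster_name"),
        "node_name": info.get("name"),
        "version": info.get("version", {}).get("number"),
    }


def fetch_es_info(url: str = "http://localhost:9200", timeout: float = 3.0) -> dict:
    """Fetch the root endpoint and summarize it.

    Raises urllib.error.URLError if the node is not reachable.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return summarize_es_info(resp.read().decode("utf-8"))
```

With the configuration above, calling fetch_es_info() against a running node should report the cluster name my-application, the node name node-1, and version 5.5.1.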

Download Kibana (the Mac build). The latest version at the time was 5.5.1.

Run command: bin/kibana

Browser access: http://localhost:5601

4. Installing Apache Nutch

Download Apache Nutch 2.3.1 (src.tar.gz): http://nutch.apache.org/downloads.html

Set the environment variable (run this from the extracted Nutch source directory): export NUTCH_HOME=$(pwd)

Modify the configuration:

$ cat conf/nutch-site.xml
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>

Uncomment the MongoDB-related dependency in $NUTCH_HOME/ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-mongodb" rev="0.5" conf="*->default" />
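A typo in nutch-site.xml is easy to miss, so it may be worth reading the storage setting back programmatically. The following is a small Python sketch using only the standard library. read_hadoop_properties is a hypothetical helper, and the NUTCH_SITE string simply repeats the configuration shown earlier; in practice you would pass in the contents of conf/nutch-site.xml.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text: str) -> dict:
    """Map each <property> name to its value in a Hadoop-style config file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Same content as the nutch-site.xml shown in this section.
NUTCH_SITE = """<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>"""

if __name__ == "__main__":
    props = read_hadoop_properties(NUTCH_SITE)
    print(props["storage.data.store.class"])
```

Stray spaces inside the class name, like the ones that extraction or copy-paste can introduce, would show up immediately in the printed value.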

$NUTCH_HOME/conf/gora.properties:

#############################
# MongoDBStore properties
#############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch

Important: the Elasticsearch indexer plugin needs to be updated. The plugin originally targets version 1.4.1, while the latest Elasticsearch is now 5.5.1. Modify it as follows:

cd src/plugin/indexer-elastic/

vi ivy.xml

...

<dependencies>
  <dependency org="org.elasticsearch" name="elasticsearch"
              rev="5.5.1" conf="*->default"/>
</dependencies>

...

ant -f ./build-ivy.xml

Run ls lib to check the versions of the downloaded jars, then update the version numbers in plugin.xml accordingly:

<library name="HdrHistogram-2.1.9.jar"/>
<library name="elasticsearch-5.5.1.jar"/>
<library name="hppc-0.7.1.jar"/>
<library name="jackson-core-2.8.6.jar"/>
<library name="jackson-dataformat-cbor-2.8.6.jar"/>
<library name="jackson-dataformat-smile-2.8.6.jar"/>
<library name="jackson-dataformat-yaml-2.8.6.jar"/>
<library name="jna-4.4.0.jar"/>
<library name="joda-time-2.9.5.jar"/>
<library name="jopt-simple-5.0.2.jar"/>
<library name="log4j-api-2.8.2.jar"/>
<library name="lucene-analyzers-common-6.6.0.jar"/>
<library name="lucene-backward-codecs-6.6.0.jar"/>
<library name="lucene-core-6.6.0.jar"/>
<library name="lucene-grouping-6.6.0.jar"/>
<library name="lucene-highlighter-6.6.0.jar"/>
<library name="lucene-join-6.6.0.jar"/>
<library name="lucene-memory-6.6.0.jar"/>
<library name="lucene-misc-6.6.0.jar"/>
<library name="lucene-queries-6.6.0.jar"/>
<library name="lucene-queryparser-6.6.0.jar"/>
<library name="lucene-sandbox-6.6.0.jar"/>
<library name="lucene-spatial-6.6.0.jar"/>
<library name="lucene-spatial-extras-6.6.0.jar"/>
<library name="lucene-spatial3d-6.6.0.jar"/>
<library name="lucene-suggest-6.6.0.jar"/>
<library name="securesm-1.1.jar"/>
<library name="snakeyaml-1.15.jar"/>
<library name="t-digest-3.0.jar"/>
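Typing this list by hand is tedious and error-prone, so the entries can be generated from the contents of the lib directory instead. The following is a minimal Python sketch of that idea. library_entries and library_entries_for_dir are hypothetical helper names, and the sketch assumes each jar maps to one plain library element as in the list above.

```python
import os
from xml.sax.saxutils import quoteattr


def library_entries(jar_names):
    """Return one <library name="..."/> line per .jar file name.

    Non-jar names are skipped; the result is sorted case-insensitively.
    """
    jars = sorted((n for n in jar_names if n.endswith(".jar")), key=str.lower)
    # quoteattr adds the surrounding quotes and escapes special characters.
    return [f"<library name={quoteattr(name)}/>" for name in jars]


def library_entries_for_dir(lib_dir: str):
    """Build the entries from the files found in lib_dir (e.g. the plugin's lib)."""
    return library_entries(os.listdir(lib_dir))


if __name__ == "__main__":
    sample = ["lucene-core-6.6.0.jar", "elasticsearch-5.5.1.jar", "notes.txt"]
    for line in library_entries(sample):
        print(line)
```

Calling library_entries_for_dir("lib") from inside src/plugin/indexer-elastic/ would produce lines that can be pasted into plugin.xml in place of the old entries.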

However, the bigger pitfall is that the plugin's own code then fails with errors. It was not worth the struggle, so I gave up.

To start the build: ant runtime (it ran for 33 minutes!)

5. Conclusion

1. Nutch 2.x and Elasticsearch 5.x are not very compatible. I did not want to keep struggling with them, so I gave up.

2. Next I will try a new architecture: Scrapy + scrapy-redis + MongoDB + Elasticsearch
