1. Introduction
The project needs a crawler that can support personalized information retrieval and push. After looking at a variety of crawler frameworks, one of the more attractive combinations was this:
Building a search engine with Nutch + MongoDB + Elasticsearch + Kibana
English original: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
I considered using Docker to build a test system.
The Docker sources are as follows:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
https://store.docker.com/community/images/pure/nutch-mongo
However, downloading the images with Docker was too slow, so I gave up on Docker!
Setting JAVA_HOME on the Mac:
vi ~/.bash_profile
export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$JAVA_HOME/bin:$PATH
export CLASS_PATH=$JAVA_HOME/lib
2. Installing MongoDB
Install it directly on the Mac with brew. At the time of writing the latest version is 3.4.7.
After installation, create the /data/db directory and start the service with mongod.
To test it, connect with the mongo command and enter show dbs to list the databases.
brew install mongo
sudo mkdir -p /data/db
sudo chown <your user name> /data
mongod
3. Installing Elasticsearch + Kibana
Download Elasticsearch; the latest version is 5.5.1. Address: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.1.tar.gz
Modify the configuration:
$ vim config/elasticsearch.yml
cluster.name: my-application
node.name: "node-1"
node.master: true
node.data: true
path.data: /opt/elasticsearch/data
network.bind_host: 127.0.0.1
network.publish_host: 127.0.0.1
network.host: 127.0.0.1
Run command: bin/elasticsearch
Browser access: http://localhost:9200
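The browser check can also be scripted, which makes it easy to confirm which Elasticsearch version is actually running before touching the Nutch plugin. The following is a minimal Python sketch, assuming the node answers on http://localhost:9200 as configured here and that the root URL returns its usual JSON document containing version.number. The function names es_version and fetch_es_version are made up for this illustration.

```python
import json
import urllib.request


def es_version(body):
    """Extract the version string from the JSON served at the Elasticsearch root URL."""
    info = json.loads(body)
    return info["version"]["number"]


def fetch_es_version(url="http://localhost:9200", timeout=5):
    """Fetch the root document from a running node and return the version it reports."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return es_version(resp.read().decode("utf-8"))
```

With bin/elasticsearch running, print(fetch_es_version()) shows the version the node reports. The parsing is kept separate from the HTTP call so it can be tried on a saved response without a live node.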
Download the Mac build of Kibana; the latest version is 5.5.1.
Run command: bin/kibana
Browser access: http://localhost:5601
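Before moving on to Nutch it can help to confirm that all three services are listening. The following is a small Python sketch using only the standard library. The ports are the ones that appear in this walkthrough (27017 for mongod, 9200 for Elasticsearch, 5601 for Kibana), while the names SERVICES, port_open and check_services are hypothetical helpers, not part of any of these tools.

```python
import socket

# Ports used in this walkthrough; the keys are just labels for the report.
SERVICES = {
    "mongod": 27017,
    "elasticsearch": 9200,
    "kibana": 5601,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="127.0.0.1", services=SERVICES):
    """Map each service label to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in services.items()}
```

Calling check_services() returns a dictionary of True/False values, one per service. An open port only shows that something is listening there, not that the service is healthy.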
4. Installing Apache Nutch
Download Apache Nutch 2.3.1 (src.tar.gz): http://nutch.apache.org/downloads.html
Configure the environment variable: export NUTCH_HOME=$(pwd)
Modify the configuration:
$ cat conf/nutch-site.xml
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
Uncomment the MongoDB-related lines in the following two files.
$NUTCH_HOME/ivy/ivy.xml:
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.5" conf="*->default" />
$NUTCH_HOME/conf/gora.properties:
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
Important! The Elasticsearch plugin needs to be updated. The bundled plugin targets version 1.4.1, while the latest version is now 5.5.1. Modify it as follows:
cd src/plugin/indexer-elastic/
vi ivy.xml
...
<dependencies>
  <dependency org="org.elasticsearch" name="elasticsearch"
    rev="5.5.1" conf="*->default"/>
</dependencies>
...
ant -f ./build-ivy.xml
Run ls lib to see the downloaded versions, then update the version numbers in plugin.xml:
<library name="HdrHistogram-2.1.9.jar"/>
<library name="elasticsearch-5.5.1.jar"/>
<library name="hppc-0.7.1.jar"/>
<library name="jackson-core-2.8.6.jar"/>
<library name="jackson-dataformat-cbor-2.8.6.jar"/>
<library name="jackson-dataformat-smile-2.8.6.jar"/>
<library name="jackson-dataformat-yaml-2.8.6.jar"/>
<library name="jna-4.4.0.jar"/>
<library name="joda-time-2.9.5.jar"/>
<library name="jopt-simple-5.0.2.jar"/>
<library name="log4j-api-2.8.2.jar"/>
<library name="lucene-analyzers-common-6.6.0.jar"/>
<library name="lucene-backward-codecs-6.6.0.jar"/>
<library name="lucene-core-6.6.0.jar"/>
<library name="lucene-grouping-6.6.0.jar"/>
<library name="lucene-highlighter-6.6.0.jar"/>
<library name="lucene-join-6.6.0.jar"/>
<library name="lucene-memory-6.6.0.jar"/>
<library name="lucene-misc-6.6.0.jar"/>
<library name="lucene-queries-6.6.0.jar"/>
<library name="lucene-queryparser-6.6.0.jar"/>
<library name="lucene-sandbox-6.6.0.jar"/>
<library name="lucene-spatial-6.6.0.jar"/>
<library name="lucene-spatial-extras-6.6.0.jar"/>
<library name="lucene-spatial3d-6.6.0.jar"/>
<library name="lucene-suggest-6.6.0.jar"/>
<library name="securesm-1.1.jar"/>
<library name="snakeyaml-1.15.jar"/>
<library name="t-digest-3.0.jar"/>
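Copying jar names from the ls lib output into plugin.xml by hand is tedious and easy to get wrong. The following is a minimal Python sketch that generates the library entries from the directory contents instead. It assumes the jars sit in the plugin's lib directory as described here, and library_entries is a hypothetical helper name, not something Nutch provides.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr


def library_entries(lib_dir):
    """Return one <library name="..."/> line per jar found in lib_dir, sorted by name."""
    jars = sorted(p.name for p in Path(lib_dir).glob("*.jar"))
    # quoteattr adds the surrounding quotes and escapes any special characters.
    return ["<library name=%s/>" % quoteattr(name) for name in jars]
```

Run from src/plugin/indexer-elastic, print("\n".join(library_entries("lib"))) prints lines that can be pasted into plugin.xml. Files that do not end in .jar are ignored.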
However, the bigger pitfall is that the plugin code itself fails with errors against the new version! It is not worth struggling with, so I gave up.
To start compiling: ant runtime (it ran for 33 minutes!)
Conclusion
1. Nutch 2.x and Elasticsearch 5.x are not very compatible. I did not want to keep struggling with them, so I gave up.
2. Next, try a new stack: Scrapy + scrapy-redis + MongoDB + Elasticsearch
Building your own crawler search engine on the Mac (Nutch + Elasticsearch was a failed attempt; use Scrapy + Elasticsearch instead)