Setting Up a Scrapy Environment on Linux


I have recently been using Scrapy for data mining, fetching data with Scrapy and storing it in MongoDB. This post records the environment setup process as a memo.

OS: Ubuntu 14.04; Python: 2.7.6; Scrapy: 1.0.5; DB: MongoDB 3

  Ubuntu 14.04 ships with Python 2.7, so installing Python and pip is not covered again here.

One: Scrapy installation

Run pip install scrapy. Because of Scrapy's dependencies, you may encounter the following issues during installation:

1. ImportError: No module named w3lib
   Fix: pip install w3lib

2. ImportError: No module named twisted
   Fix: pip install twisted

3. ImportError: No module named lxml.html
   Fix: pip install lxml

4. Error: libxml/xmlversion.h: No such file or directory
   Fix: apt-get install libxml2-dev libxslt-dev, then apt-get install python-lxml

5. ImportError: No module named cssselect
   Fix: pip install cssselect

6. ImportError: No module named OpenSSL
   Fix: pip install pyOpenSSL
The above covers most of the dependency problems that may appear during installation; any omissions will be added here as they are discovered.
  
Verify the installation with scrapy --version: if version information is displayed, the installation succeeded.
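
Equivalently, a quick sanity check can be run from Python itself. This is just a minimal sketch confirming that Scrapy and the dependencies listed above import cleanly:

    # sanity check: confirm scrapy and its common dependencies import cleanly
    import scrapy

    for mod in ('w3lib', 'twisted', 'lxml.html', 'cssselect', 'OpenSSL'):
        __import__(mod)

    print('Scrapy %s and dependencies import OK' % scrapy.__version__)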

Two: MongoDB installation and permission settings

1. MongoDB installation

To store the crawled items in MongoDB, first install MongoDB. You can install it directly with apt-get (see http://docs.mongoing.com/manual-zh/tutorial/install-mongodb-on-ubuntu.html for details), or download an installation package from the MongoDB website. The approach I used was to download the latest installation package directly from the official website.

After downloading, you get a .gz file, which can be extracted directly with the tar command (be careful not to use the gunzip command). Then move the unpacked directory into a development directory (for example, I put it in /usr/local/mongodb) and configure the environment variable: export PATH=<mongodb-install-directory>/bin:$PATH

    Note: MongoDB stores its data in the /data/db directory, which needs to be created manually (e.g. /usr/local/mongodb/data/db; if you put the data somewhere other than /data/db, start mongod with --dbpath pointing at it).

Then go into the bin directory and run ./mongod to start MongoDB. Open localhost:27017 (or the HTTP status page on localhost:28017) in a browser; if content appears, the startup succeeded. Then run ./mongo to enter the mongo shell console, from which you can operate on MongoDB.
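
The same check can also be done from Python, which is how the crawler will eventually talk to the database. A minimal sketch, assuming pymongo is already installed (pip install pymongo):

    # connectivity check: assumes pymongo is installed (pip install pymongo)
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)    # default mongod port
    print(client.server_info()['version'])      # prints the server version if the connection works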

2. Permission management

MongoDB does not enable authentication by default, so anyone who can connect can perform any operation. One thing to be aware of is that MongoDB 3 and MongoDB 2 have different permission mechanisms that cannot be copied from one to the other; this post only describes the MongoDB 3 setup.

1) First start mongod without authentication (do not add --auth): ./mongod

2) Enter the mongo shell console and run show dbs; you will find there is only one local database. Now create a user in the admin database:

    use admin
    db.createUser(
      {
        user: "root",
        pwd: "1234",
        roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
      }
    )

Note: the db field in roles must be set.

3) Run show users; if you can see the user you just created, the creation succeeded.

4) Shut down MongoDB: use admin, then db.shutdownServer()

5) Start mongod in authentication mode: ./mongod --auth

6) Enter the mongo shell console again and switch to the admin database; at this point you must authenticate before performing other operations. Run db.auth("root", "1234"); if it returns 1, authentication succeeded. Because this root user only has user-administration rights, performing a query operation with it (such as show collections) is denied.

7) Then create a new user. A user can only be created in its corresponding database. For example, to create a user with read and write permissions in the database pms:

    use pms
    db.createUser(
      {
        user: "zhangsan",
        pwd: "1234",
        roles: [ { role: "readWrite", db: "pms" } ]
      }
    )

You can then switch to the admin database and view all users; you will find the newly created user is there.

8) Then switch back to the pms database to verify: use pms, authenticate with db.auth("zhangsan", "1234"), and run show collections; all the collections in the pms database show up.
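
From the crawler's side, authenticating as this user with pymongo might look like the following sketch (host, database, and credentials are taken from the examples above; it assumes a pymongo 3-era client):

    # connect and authenticate as the read/write user created above
    from pymongo import MongoClient

    # putting the credentials and database in the URI authenticates against pms
    client = MongoClient('mongodb://zhangsan:1234@localhost:27017/pms')
    db = client['pms']
    print(db.collection_names())    # pymongo's analogue of "show collections"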

The above is MongoDB user permission management. For details, refer to the official documentation: http://docs.mongoing.com/manual-zh/core/authentication-mechanisms.html

3. Setting up a daemon

Once mongod is started with ./mongod in a terminal, closing the window kills the process, so set mongod up as a daemon instead.

Just run sudo ./mongod --fork --logpath ./mongodb.log. Note: once you set --fork you must also set the log path (--logpath).

Three: Writing crawlers
Then write the crawlers according to your own requirements. For more information, refer to the official tutorial: http://doc.scrapy.org/en/0.14/intro/tutorial.html
Connecting the crawlers to MongoDB only requires changing pipelines.py. For details, refer to the official documentation: http://doc.scrapy.org/en/1.0/topics/item-pipeline.html
Note the pymongo version issue: I am not sure whether it is due to the differences between MongoDB 2 and MongoDB 3, but the matching pymongo versions differ, so take care to install a pymongo version that matches your database version as closely as possible.
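
As a minimal sketch of what such a pipeline can look like (the class name, settings keys, and connection details below are illustrative, not from the original project; it assumes pymongo 3):

    # pipelines.py -- minimal sketch of a MongoDB item pipeline
    import pymongo

    class MongoPipeline(object):

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # MONGO_URI and MONGO_DATABASE are illustrative settings.py keys
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'pms'),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # one collection per spider; store the item as a plain dict
            # (insert_one requires pymongo 3; pymongo 2.x uses insert() instead)
            self.db[spider.name].insert_one(dict(item))
            return item

The pipeline then has to be enabled in the project's settings.py through the ITEM_PIPELINES setting.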
Since my crawlers are single-purpose (each one is fixed to crawl a single site), I wrote a shell script to run all the crawlers at once, which also makes them easy to run from a scheduled task later. Note that the script should be placed in the top-level folder of the project created by scrapy startproject (not required if the script sets the path itself).
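
The shell script itself is not shown in the original; purely as an illustration (the spider names here are hypothetical), an equivalent runner in Python could look like:

    # run_all.py -- illustrative stand-in for the shell script described above;
    # place it in the top-level folder created by scrapy startproject
    import subprocess

    SPIDERS = ['site_a', 'site_b']    # hypothetical: one spider per site

    for name in SPIDERS:
        # run each crawl sequentially; assumes scrapy is on the PATH
        subprocess.call(['scrapy', 'crawl', name])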
    
Four: Setting a scheduled task
    
Finally, set the crawlers up as a timed task, scheduled to crawl at 8:00 every day.

Use Linux's crontab feature for timed tasks: run crontab -e to enter edit mode and add a line of the form 0 8 * * * <the command you want to execute> at the bottom; the command then runs at 8 o'clock every day.

If you want to write the output of the run to a file, edit the line like this:

0 8 * * * <the command to execute> > <file to write the results to> 2>&1 (e.g. 0 8 * * * /home/ubuntu/test > /home/ubuntu/log.txt 2>&1 means: at 8 every morning, run the /home/ubuntu/test script and write the output to /home/ubuntu/log.txt)

    Note: in scripts that are run on a schedule, use absolute paths for any command that is not built into the system, to avoid errors and other unnecessary hassle.

 

The above covers building the Scrapy environment, connecting it to MongoDB, and setting up scheduled tasks.

Related references:

MongoDB Official Handbook: http://docs.mongoing.com/manual-zh/contents.html

Scrapy Official Document: http://doc.scrapy.org/en/1.0/index.html

  
