Build a simple crawler framework with Scrapy and Django.

Contents

Preface
Environment configuration
Using Scrapy alone to complete the task
A simple Django project
Connecting to the MySQL database
Writing a data class
Adding Scrapy
Writing Items
Write Spiders
Write Pipelines
Crawler Settings
Deploying and running the crawler
Starting Scrapyd
Deploying the crawler to Scrapyd
Run results
Project address
Postscript

Preface

Feel free to skip the preamble and jump straight to the main text.

Writing back-end code all the time gets a bit dull, so on a whim I decided to learn web crawling. The ultimate goal is to use Scrapy and Django to build a simple crawler framework and complete a simple crawl task: crawl the contents of a target web page into a MySQL database.

This article records a complete set of steps for building a simple crawler framework and is ideal for readers who want to get started quickly. If you want to study Scrapy and Django systematically, go to the official documentation instead: the Scrapy official documentation, the Scrapyd official documentation, and the Django official documentation.

Environment configuration

Install a Python development environment (installing Anaconda directly is highly recommended; I am currently using Anaconda2-4.1.1-Windows-x86_64).
Install Scrapy (run "conda install scrapy" on the command line).
Install Django (run "conda install django" on the command line).
Install Scrapyd (Anaconda does not seem to provide this module at the moment, so use pip instead: run "pip install scrapyd" on the command line).
There are two or three other dependencies (such as scrapy-djangoitem and MySQL-python) that I no longer remember exactly; install them with pip whenever Scrapy or Django reports a missing module.
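For convenience, here are the install commands from the steps above collected in one place (assuming Anaconda's conda and pip are on your PATH; the last two lines cover the extra dependencies mentioned above, and your exact list may differ):

conda install scrapy
conda install django
pip install scrapyd
pip install scrapy-djangoitem
pip install MySQL-python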

Using Scrapy alone to complete the task

If you just want to write a simple crawler quickly, there is no need to build any framework; Scrapy alone is enough, as the following program shows. (This code comes from the Scrapy official website.)

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Save the above code as quotes_spider.py and run the command:

scrapy runspider quotes_spider.py -o quotes.json

The crawled content is exported to the quotes.json file.

As you can see, writing a crawler with Scrapy is very simple: just subclass scrapy.Spider and implement the parse function.

However, the output here goes to a file, while our task is to store it in MySQL, so we need another library; the web framework Django is a good choice. Django ships with SQLite and also supports MySQL, SQL Server, and many other databases.

A simple Django project

Before you combine Django with Scrapy, take a quick look at the Django project structure.
Run the following command to create a Django project:

django-admin startproject helloscrapy

You can see the file structure for this project:

helloscrapy/
    manage.py
    helloscrapy/
        __init__.py
        settings.py
        urls.py

At this point, run the following command in the outermost helloscrapy directory:

python manage.py runserver

Open http://127.0.0.1:8000/ in a browser; if you see the "Congratulations on your first Django-powered page" page, your Django development environment is configured correctly.

Of course, our goal is not to develop a website but to use Django to manipulate the database, so here we only need a passing familiarity with the Django project files. (The role of each file is explained in detail in the official Django documentation and is not discussed here.)

Connecting to the MySQL database

Django uses SQLite by default, so before writing any concrete data classes we need to configure the connection between Django and MySQL. This configuration goes into settings.py in the inner helloscrapy directory (replace the IP, user, and other details with your own):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'database_name',
        'USER': 'user',
        'PASSWORD': 'password',
        'HOST': 'database_server_ip',
        'PORT': '3306',
    }
}
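As a quick sanity check that this configuration is picked up, you can run a few lines of Python from the project root (a minimal sketch, assuming the MySQL driver is already installed):

import os
import django

# Point Django at the project settings and initialize it.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "helloscrapy.settings")
django.setup()

from django.db import connection

connection.ensure_connection()  # raises an error if MySQL is unreachable
print(connection.vendor)        # should print 'mysql'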

Once configured, run the following commands under the project root:

python manage.py makemigrations
python manage.py migrate

If you hit a missing-module error during this process, install the missing module with pip. After the commands run successfully, open your database; if you see tables whose names start with auth_ and django_, the MySQL connection is working.

Writing a data class

Run the following command under the root directory to create a new Django app:

python manage.py startapp warehouse

We can see the directory structure of the generated warehouse:

warehouse/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    models.py
    tests.py
    views.py

Then add 'warehouse' to INSTALLED_APPS in settings.py in the inner helloscrapy directory, as follows:

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'warehouse',
]

We will use warehouse as the home for our data classes: the data the crawler collects will be converted into the models defined in warehouse and stored in the database.

Next, write a simple model in models.py:

from django.db import models

class TestScrapy(models.Model):
    text = models.CharField(max_length=255)
    author = models.CharField(max_length=255)

    class Meta:
        app_label = 'warehouse'
        db_table = 'test_scrapy'

Similarly, run the following commands under the project root:

python manage.py makemigrations
python manage.py migrate
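You can also sanity-check the new model from the Django shell (a minimal sketch; start it with "python manage.py shell" from the project root):

from warehouse.models import TestScrapy

# Insert a throwaway row, confirm it is there, then remove it again.
TestScrapy.objects.create(text='hello', author='me')
print(TestScrapy.objects.count())
TestScrapy.objects.filter(author='me').delete()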

Check the database and you should see the test_scrapy table you just created.

Adding Scrapy

You can now add scrapy to your Django project.

First, set the PYTHONPATH environment variable to the root directory of this Django project (for example, E:\PythonProjects\helloscrapy).
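For example (the Windows form matches the path style used above; the export form is the bash equivalent on Linux or macOS, included here as an assumption about your shell):

:: Windows (cmd)
set PYTHONPATH=E:\PythonProjects\helloscrapy

# Linux/macOS (bash)
export PYTHONPATH=/path/to/helloscrapy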

Next, under the project root, create a new bots folder, enter the bots directory, and create a new __init__.py file with the following contents:

def setup_django_env():
    import os, django

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "helloscrapy.settings")
    django.setup()

def check_db_connection():
    from django.db import connection

    if connection.connection:
        # NOTE: (Zacky, 2016.Mar.21st) If the connection was closed by the backend,
        # close it on the Django side too; it will be set up again afterwards.
        if not connection.is_usable():
            connection.close()

In the bots directory, run the following command to create a new Scrapy project:

scrapy startproject testbot

The testbot project structure is as follows:

testbot/
    __init__.py
    scrapy.cfg
    testbot/
        __init__.py
        items.py
        settings.py
        spiders/
Writing Items

The items.py file:

import scrapy
from scrapy_djangoitem import DjangoItem
from warehouse.models import TestScrapy

class TestbotItem(DjangoItem):
    django_model = TestScrapy
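DjangoItem derives the item's fields from the model, so TestbotItem automatically exposes text and author. For comparison only, the equivalent item written with plain Scrapy (a sketch, not used in this project) would be:

import scrapy

class TestbotItemPlain(scrapy.Item):
    # The same fields declared by hand, without scrapy_djangoitem
    text = scrapy.Field()
    author = scrapy.Field()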
Write Spiders

Create a new test_spider.py in the spiders directory with the following contents:

import scrapy
from testbot.items import TestbotItem

class TestSpider(scrapy.Spider):
    name = "test_spider"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = TestbotItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.xpath('span/small/text()').extract_first()
            yield item
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Write Pipelines

The pipelines.py file:

class TestbotPipeline(object):
    def process_item(self, item, spider):
        item.save()
        return item
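item.save() here is DjangoItem's save method, which persists the item as a warehouse TestScrapy row through the Django ORM. The check_db_connection helper defined in bots/__init__.py is never called in this walkthrough; one place it could plausibly be used (my assumption, not part of the original setup) is right before each save, so a stale MySQL connection gets reopened:

# Variant of the pipeline that refreshes a dead DB connection before saving
# (an illustrative sketch; the pipeline above does not do this).
from bots import check_db_connection

class TestbotPipeline(object):
    def process_item(self, item, spider):
        check_db_connection()  # close a dead connection so Django reopens it on save
        item.save()            # persists the item as a warehouse.models.TestScrapy row
        return item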
Crawler Settings

The settings.py under testbot:

from bots import setup_django_env
setup_django_env()

BOT_NAME = 'testbot'

SPIDER_MODULES = ['testbot.spiders']
NEWSPIDER_MODULE = 'testbot.spiders'

DOWNLOAD_HANDLERS = {'s3': None}
DOWNLOAD_DELAY = 0.5
DOWNLOAD_TIMEOUT =

CONCURRENT_REQUESTS_PER_IP = 1

ITEM_PIPELINES = {
    'testbot.pipelines.TestbotPipeline': 1,
}
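Before deploying, you can check that everything is wired together by running the spider directly (a standard Scrapy command, run from the directory containing scrapy.cfg):

scrapy crawl test_spider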

OK, at this point a simple Scrapy crawler is done. Next comes deployment and launch; here we use scrapyd-client to deploy the crawler to Scrapyd.

Deploying and running the crawler

Starting Scrapyd

Run the following command:

scrapyd

Open http://localhost:6800/ and you should see the Scrapyd page.

Deploying the crawler to Scrapyd

Remove the "#" in front of the url line in scrapy.cfg in the testbot directory:

url = http://localhost:6800/

Run the deployment command in the testbot directory, then schedule the spider:

scrapyd-deploy
curl http://localhost:6800/schedule.json -d project=testbot -d spider=test_spider
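If scheduling succeeds, Scrapyd answers with a small JSON document roughly of this shape (the jobid will differ on every run):

{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}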

View the crawler's running status at http://localhost:6800/.

Run results

Once the crawler finishes, you should be able to see the crawled quotes in the test_scrapy table in the database.

Project address

https://github.com/clayandgithub/helloscrapy

Postscript

None.
