Contents

- Preface
- Body
  - Environment configuration
  - Using Scrapy alone to complete the task
  - A simple Django project
  - Connecting to the MySQL database
  - Writing a data class
  - Adding Scrapy
  - Writing items
  - Writing spiders
  - Writing pipelines
  - Crawler settings
  - Deploying and running the crawler
    - Launching Scrapyd
    - Deploying the crawler to Scrapyd
  - Run results
- Project address
- Postscript
Preface
If you want to skip the chatter, jump straight to the body.
I have always written back-end code and found it a bit dull, so on a whim I decided to learn web crawling. The ultimate goal: use Scrapy and Django to build a simple crawler framework and complete a simple crawl task, namely crawling the contents of a target web page into a MySQL database.
This article records the complete set of steps for building a simple crawler framework and is ideal for readers who want to get started quickly. If you want to learn Scrapy and Django systematically, go to the official documentation listed below:

- Scrapy official documentation
- Scrapyd official documentation
- Django official documentation

Body

Environment configuration

- Install a Python development environment (installing Anaconda directly is highly recommended; I am currently using Anaconda2-4.1.1-Windows-x86_64).
- Install Scrapy (run "conda install scrapy" on the command line).
- Install Django (run "conda install django" on the command line).
- Install Scrapyd (at the moment Anaconda does not seem to provide this module, so just use pip: run "pip install scrapyd" on the command line).
- There are two or three other dependencies (such as scrapy-djangoitem and MySQL-python); I do not remember the full list, so when Scrapy or Django raises an import error, install the missing module with pip.

Using Scrapy alone to complete the task
If you just want to write a quick, simple crawler, there is no need to build any framework; Scrapy alone is enough, as in the following program. (This code comes from the Scrapy official website.)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Save the above code as quotes_spider.py and run the command:

scrapy runspider quotes_spider.py -o quotes.json

The crawled content is exported to the quotes.json file.
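The -o flag hands every dict yielded by parse() to Scrapy's JSON exporter, which writes them out as one JSON array. A quick standard-library sketch of reading that file back; the two quotes below are made-up sample data, not real crawl output:

```python
import json

# Sample content in the shape Scrapy's JSON exporter produces:
# a single JSON array of the dicts yielded by parse().
# These entries are illustrative placeholders.
sample = '''[
  {"text": "Sample quote one", "author": "Author A"},
  {"text": "Sample quote two", "author": "Author B"}
]'''

quotes = json.loads(sample)
for q in quotes:
    print(q["author"], "-", q["text"])
```

With a real quotes.json you would pass an open file object to json.load instead of parsing a string.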
As you can see, writing a crawler with Scrapy is very simple: inherit from scrapy.Spider and implement the parse function.
However, the output here goes to a file, while our task is to store the output in MySQL, so we need other libraries; the web framework Django is a good choice. Django ships with SQLite support and also supports MySQL and several other databases.

A simple Django project
Before you combine Django with Scrapy, take a quick look at the Django project structure.
Run the following command to create a Django project:
django-admin startproject helloscrapy
You can see the file structure for this project:
helloscrapy/
manage.py
helloscrapy/
__init__.py
settings.py
urls.py
At this point, run the following command in the outer helloscrapy directory:

python manage.py runserver

Open http://127.0.0.1:8000/ in a browser; if you see a page that reads "It worked! Congratulations on your first Django-powered page.", your Django development environment is configured correctly.
Of course, our goal is not to develop a website but to use Django to manipulate the database, so here we only need to be familiar with the Django project files (the role of each file is explained in detail in the official Django documentation and is not covered here).

Connecting to the MySQL database
Django uses SQLite by default, so before writing the concrete data classes we need to configure the connection between Django and MySQL. This configuration goes into settings.py in the inner helloscrapy directory (change the IP, user, and other fields to your own):
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'database_name',
        'USER': 'user',
        'PASSWORD': 'password',
        'HOST': 'database_server_ip',
        'PORT': '3306',
    }
}
Once configured, run the following commands under the project root:

python manage.py makemigrations
python manage.py migrate

If you encounter a missing-module error during this process, install the missing module with pip. After the commands run successfully, open your database; if you see tables whose names begin with auth_ and django_, the MySQL database connection succeeded.

Writing a data class
Run the following command under the root directory to create a new Django app:
python manage.py startapp warehouse
We can see the directory structure of the generated warehouse:
warehouse/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    models.py
    tests.py
    views.py
Then add 'warehouse' to INSTALLED_APPS in settings.py in the inner helloscrapy directory, as follows:

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'warehouse',
]
We will use warehouse as the data-class warehouse: the data the crawler fetches will be processed, converted into the data classes (models) defined in warehouse, and stored in the database. Next, write a simple model in models.py:
from django.db import models


class TestScrapy(models.Model):
    text = models.CharField(max_length=255)
    author = models.CharField(max_length=255)

    class Meta:
        app_label = 'warehouse'
        db_table = 'test_scrapy'
Similarly, run the following commands under the project root:

python manage.py makemigrations
python manage.py migrate

Check the database and you should see the newly created test_scrapy table.

Adding Scrapy
You can now add Scrapy to the Django project. First, set the PYTHONPATH environment variable to the root directory of this Django project (for example, E:\PythonProjects\helloscrapy).
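Setting PYTHONPATH this way is what later lets the Scrapy process import helloscrapy.settings and warehouse.models. As a minimal sketch of the mechanism: PYTHONPATH entries simply end up on sys.path at interpreter startup, making packages under them importable. The temporary directory below stands in for the real project root:

```python
import os
import sys
import tempfile

# Build a throwaway directory that mimics the Django project root,
# containing a "helloscrapy" package with a settings module.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "helloscrapy")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "settings.py"), "w") as f:
    f.write("DEBUG = True\n")

# Adding the root to sys.path is exactly what PYTHONPATH does
# for every Python process started in that environment.
sys.path.insert(0, root)
import helloscrapy.settings

print(helloscrapy.settings.DEBUG)
```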
Next, create a new bots folder under the project root, enter the bots directory, and create a new __init__.py file with the following contents:
def setup_django_env():
    import os, django
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "helloscrapy.settings")
    django.setup()


def check_db_connection():
    from django.db import connection
    if connection.connection:
        # NOTE: (Zacky, 2016.Mar.21st) If the connection was closed by the
        # backend, close it on the Django side too; it will be set up
        # again afterwards.
        if not connection.is_usable():
            connection.close()
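A detail worth noting in setup_django_env(): os.environ.setdefault only writes the key when it is absent, so a DJANGO_SETTINGS_MODULE supplied by the environment wins over the hard-coded default. A quick demonstration:

```python
import os

# Start from a clean slate for the demonstration.
os.environ.pop("DJANGO_SETTINGS_MODULE", None)

# First call: key is absent, so the default is written.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "helloscrapy.settings")
first = os.environ["DJANGO_SETTINGS_MODULE"]

# Second call: key already exists, so this is a no-op.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "other.settings")
second = os.environ["DJANGO_SETTINGS_MODULE"]

print(first, second)
```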
In the bots directory, run the following command to create a new Scrapy project:

scrapy startproject testbot

The testbot project structure is as follows:
testbot/
    scrapy.cfg
    testbot/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Writing Items
The items.py file:

import scrapy
from scrapy_djangoitem import DjangoItem
from warehouse.models import TestScrapy


class TestbotItem(DjangoItem):
    django_model = TestScrapy
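DjangoItem inspects django_model, generates a matching item field for each model field, and provides a save() that creates and saves a model instance. The dependency-free sketch below illustrates that contract; FakeModel and SketchDjangoItem are stand-ins invented for illustration, not the real scrapy_djangoitem implementation:

```python
class FakeModel:
    """Stand-in for a Django model such as warehouse.models.TestScrapy."""
    field_names = ("text", "author")
    saved = []  # records instances whose save() was called

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def save(self):
        FakeModel.saved.append(self)


class SketchDjangoItem(dict):
    """Simplified DjangoItem: fields come from the model, save() delegates."""
    django_model = FakeModel

    def __setitem__(self, key, value):
        # Like DjangoItem, only keys matching model fields are accepted.
        if key not in self.django_model.field_names:
            raise KeyError("%s is not a field of the model" % key)
        dict.__setitem__(self, key, value)

    def save(self):
        instance = self.django_model(**self)
        instance.save()
        return instance


item = SketchDjangoItem()
item["text"] = "a quote"
item["author"] = "someone"
obj = item.save()
print(obj.text, obj.author)
```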
Write Spiders
Create a new test_spider.py in the spiders directory with the following contents:

import scrapy
from testbot.items import TestbotItem


class TestSpider(scrapy.Spider):
    name = "test_spider"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = TestbotItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.xpath('span/small/text()').extract_first()
            yield item

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Write Pipelines
In testbot's pipelines.py:

class TestbotPipeline(object):
    def process_item(self, item, spider):
        item.save()
        return item
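Every yielded item is passed through process_item(), which must return the item (or raise DropItem) so that any later pipelines still see it; here the DjangoItem's save() is what writes the database row. A dependency-free sketch of that contract, with RecordingItem standing in for the real item:

```python
class RecordingItem(dict):
    """Stand-in for TestbotItem: counts save() calls instead of hitting MySQL."""
    saved_count = 0

    def save(self):
        RecordingItem.saved_count += 1


class TestbotPipeline(object):
    def process_item(self, item, spider):
        item.save()
        return item  # returning the item lets later pipelines process it too


pipeline = TestbotPipeline()
item = RecordingItem(text="t", author="a")
result = pipeline.process_item(item, spider=None)
print(result is item)
```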
Crawler Settings
In testbot's settings.py:

from bots import setup_django_env
setup_django_env()

BOT_NAME = 'testbot'
SPIDER_MODULES = ['testbot.spiders']
NEWSPIDER_MODULE = 'testbot.spiders'
DOWNLOAD_HANDLERS = {'s3': None}
DOWNLOAD_DELAY = 0.5
# DOWNLOAD_TIMEOUT =
CONCURRENT_REQUESTS_PER_IP = 1
ITEM_PIPELINES = {
    'testbot.pipelines.TestbotPipeline': 1,
}
OK, at this point a simple crawler framework is done. Next comes deployment and launch; here we use scrapyd-client to deploy the crawler to Scrapyd.

Deploying and running the crawler

Launching Scrapyd
Run the following command:
scrapyd

Open http://localhost:6800/ and you should be able to see the Scrapyd page.

Deploying the crawler to Scrapyd
Remove the "#" before the url line in scrapy.cfg in the testbot directory:

url = http://localhost:6800/
Run the deployment command in the testbot directory:

scrapyd-deploy

Then schedule the spider:

curl http://localhost:6800/schedule.json -d project=testbot -d spider=test_spider
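The curl call is just a form-encoded POST to Scrapyd's schedule.json endpoint; the same request can be built with the standard library (constructed but not sent here, since sending it needs a running Scrapyd):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Equivalent of: curl http://localhost:6800/schedule.json \
#                     -d project=testbot -d spider=test_spider
data = urlencode({"project": "testbot", "spider": "test_spider"}).encode()
req = Request("http://localhost:6800/schedule.json", data=data)

# Attaching a body makes urllib issue a POST, matching curl's -d behavior.
print(req.get_method(), req.full_url)
```

To actually fire the request, pass req to urllib.request.urlopen while Scrapyd is running.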
You can watch the crawler's running status at http://localhost:6800/.

Run results
Once the crawl has finished, you should be able to see the results in the test_scrapy table in the database, as shown in the following figure:
Project Address
https://github.com/clayandgithub/helloscrapy

Postscript

None