Learn to crawl from scratch (3)------writing the crawler source code


To make the code easier to modify, and to make crawling pages clearer and more efficient, this time we write the crawler as source files.

The work is divided into three main steps:

1. Create the crawler project skeleton with Scrapy.

2. Edit the source code to decide which pages and what content we want to crawl.

3. Start crawling and save the results to a file (or a database).

Note: to avoid a lengthy narrative and keep things intuitive, the concrete steps come first; if you want to understand the principles behind them, the detailed explanation is at the end.

* Method of Operation:

1. Create the crawler project skeleton

Open a command line and use the cd command to go to the directory where you want to create the project, then run:

scrapy startproject <project name>

Assuming the project is called demo:

scrapy startproject demo

The powerful Scrapy automatically generates a project skeleton under the current path, as follows:

demo/
    scrapy.cfg        # project configuration file
    demo/
        __init__.py
        items.py      # defines the objects that hold the scraped data
        pipelines.py  # interacts with the database
        settings.py   # crawler settings
        spiders/
            __init__.py
2. Edit the source code to decide what to crawl

Scrapy not only builds the framework for us, it also provides many built-in methods. We just need to use these methods, combined with the BeautifulSoup library, to filter the information we want out of the web pages.

Start writing code!
First open the items.py file and modify it as follows:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DemoItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    # pass

Then create a new spider.py in the spiders directory and copy the following code into it:
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from demo.items import DemoItem


class Demo(scrapy.Spider):
    name = "demo"
    start_urls = ['http://www.cnblogs.com/KoalaDream/']

    def parse(self, response):
        soup = BeautifulSoup(response.body)
        tmp = soup.find_all('div', attrs={'class': 'postTitle'})
        for i in range(0, len(tmp)):
            item = DemoItem()
            item['name'] = tmp[i].text.encode('utf-8')
            yield item

This code crawls the article titles on my blog home page; what each statement does is explained below.

3. Start crawling and save the results to a file

The code is ready, so let's start crawling the page. Isn't that easy?

On the command line, use cd to enter the directory where scrapy.cfg is located, then run:

scrapy crawl demo -o output.json

This starts the crawl and writes the results to a file named output.json in the directory that contains scrapy.cfg.

Complete!

Open the output.json file. What? Why does it all look garbled?

This is because the Chinese characters we crawled are stored in the file as Unicode escape sequences.

To see the Chinese characters, you can copy the escaped content from the output, open a Python shell, type print u"..." with the copied Unicode escapes pasted between the quotes, and the corresponding Chinese characters will be printed.
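
Alternatively, here is a minimal sketch (assuming the output.json produced above): the json module decodes the \uXXXX escapes back into Unicode strings, so the titles can be printed directly instead of pasting them by hand.

# -*- coding: utf-8 -*-
# Read the exported items and print the decoded titles (Python 2).
import json

with open('output.json') as f:
    items = json.load(f)      # a list of dicts such as {"name": "..."}

for entry in items:
    print entry['name']       # a unicode object; displays Chinese in a UTF-8 terminal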

* Code Explanation  

Inside a spider, the Scrapy framework requires three members to be defined.

name: the spider's identifier; when we crawl, the name we type on the command line (scrapy crawl demo) is the name defined here.

start_urls: a list of URLs; the spider starts crawling from these pages.

parse(): a method; after the pages in start_urls have been downloaded, Scrapy automatically calls this method to parse their content.

In short, name tells the Scrapy framework which spider I want to run. Scrapy then runs that spider, automatically connects to the pages in start_urls, creates a response object, and puts the downloaded page source into it.

Now let's look at how the parse() method works.

First we import the BeautifulSoup library, which greatly simplifies the parsing work.

Then we open the page we want to crawl, http://www.cnblogs.com/KoalaDream/,

and view its source code (in Chrome, right-click an article title and click Inspect Element).

As you can see, each title sits inside a <div> tag whose class attribute is postTitle.

So we use the method provided by BeautifulSoup to filter out the titles:

tmp = soup.find_all('div', attrs={'class': 'postTitle'})

The matching tags are stored in tmp as a list.
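
To see what find_all returns, here is a small self-contained sketch using a made-up HTML fragment (the real page simply has one such <div> per article title):

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# A made-up fragment that mimics the structure of the blog home page.
html = '''
<div class="postTitle"><a href="/post/1">First article</a></div>
<div class="postTitle"><a href="/post/2">Second article</a></div>
<div class="postDesc">posted at 2015-01-01</div>
'''

soup = BeautifulSoup(html)
tmp = soup.find_all('div', attrs={'class': 'postTitle'})

for tag in tmp:
    print tag.text.strip()    # prints "First article", then "Second article"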

Do you remember the items.py file we just modified?

We then walk through each element of tmp and store it in the item we just defined.

item = DemoItem() creates an object of the DemoItem class, and then the operation

item['name'] = tmp[i].text.encode('utf-8')

stores the title text in it. The operator on the left looks a bit confusing: to access a member of a class you would normally write item.name, so what is this [] syntax about?

Look carefully at items.py: the DemoItem class inherits from scrapy.Item. A quick experiment shows that if you remove the inheritance, this operator is no longer available.

This means the [] operator is overloaded in the scrapy.Item class. Personally I feel it is no different from the . operator, and we can simply use this syntax.
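
A small sketch of this dict-style access (DemoItem here just mirrors the class defined in items.py above):

# -*- coding: utf-8 -*-
import scrapy

class DemoItem(scrapy.Item):
    name = scrapy.Field()

item = DemoItem()
item['name'] = 'hello'     # works: scrapy.Item overloads [] for declared fields
print item['name']          # -> hello

# item.name               would raise AttributeError (fields are not plain attributes)
# item['title'] = 'oops'  would raise KeyError (only declared Fields are accepted)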

Finally, there is yield item.

I stared at the yield syntax for a long time; generators and iterators left me a bit dizzy.

If you want to understand the yield syntax in detail, see the link below:

  http://www.oschina.net/translate/improve-your-python-yield-and-generators-explained

As for what the yield statement does here, I will give my own understanding; where it is inaccurate, corrections are welcome.

Let's start with a brief introduction to the yield statement:

yield is used inside a function and, like return, hands a value back to the caller. The difference is that return ends the function, while yield only pauses it: the next time the function is resumed, it continues from the pause point rather than starting from scratch.
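
A minimal sketch of this pause-and-resume behaviour, independent of Scrapy:

def count_to(n):
    for i in range(n):
        yield i           # pause here and hand i back to the caller

gen = count_to(3)         # nothing runs yet; gen is a generator object
print next(gen)           # -> 0 (runs until the first yield, then pauses)
print next(gen)           # -> 1 (resumes right after the yield, not from the top)
print next(gen)           # -> 2 (a fourth next() would raise StopIteration)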

In spider.py, the yield statement hands the current list element back from parse(), once per loop iteration. But we never call parse() or collect its return values ourselves, so where does the output come from?

Since parse() is a method called automatically by the Scrapy framework, we can infer the following:

When we tell Scrapy to output to output.json, each value produced by yield is automatically written to output.json. Scrapy then resumes parse() from the point where it was interrupted, the for loop continues, and the next list element is yielded. This repeats until the function ends. In this way the contents of tmp are stored in item['name'] one by one and exported to output.json.
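
As a rough mental model only (not Scrapy's actual implementation), the framework could drain the generator returned by parse() roughly like this; export_items, spider and response are hypothetical names used purely for illustration:

# -*- coding: utf-8 -*-
import json

def export_items(spider, response, path='output.json'):
    # Hypothetical sketch: iterate the generator the way a framework might.
    items = []
    for item in spider.parse(response):   # each iteration resumes parse() at its last yield
        items.append(dict(item))          # collect the yielded DemoItem as a plain dict
    with open(path, 'w') as f:
        json.dump(items, f)               # non-ASCII text becomes \uXXXX escapes, as seen above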

A function that hands back one value per yield like this is what is called a generator.

Each time it is resumed from the paused state, it automatically moves on to the next object after the pause point; stepping through values one at a time like this is what an iterator does.
