Python Crawler Essays (2): Starting Crawlers and XPath


Starting the Crawler

In the previous section we created our Scrapy project. Looking at this pile of files, many readers are probably at a loss: how do we actually start the crawler?

Since we created the Scrapy project from the command line, we will first run it the orthodox programmer's way: from cmd.

scrapy crawl jobbole

When we enter this command in cmd, the crawler starts running. But having to launch it this way every single time makes debugging in an IDE time-consuming and laborious. Instead, we can use Python itself to issue the command-line startup automatically. Now that's sweet!

So we create a main.py file in our project and write the following code:

# -*- coding: utf-8 -*-
# Author: Albert Shen

from scrapy.cmdline import execute
import sys
import os

if __name__ == '__main__':
    # Add the project root to the path so Scrapy can locate the project
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(["scrapy", "crawl", "jobbole"])

Run main.py and you will find that Scrapy starts successfully and prints its output to the console window. Now we can take full advantage of the IDE to debug and run the crawler with ease.

A First Taste of the Crawler

Now that our crawler is up and running, how do we write the logic to extract the data we want?

First, we set a breakpoint in the parse function.

At this point you will notice that the text attribute of the response object looks a lot like the HTML source of the web page. To confirm this, copy the value of text into an .html file and open it: you will find it is exactly the source code of the target page.

This is because when the program runs, Scrapy fetches the web page and invokes the parse function in the spider file, passing the fetched data in as the response parameter. This is the strength of the Scrapy framework: the whole chain of operations needed to fetch the target page is encapsulated for us, and we only need to care about how to extract the desired information from the page source. Our subsequent work will therefore revolve mainly around this response parameter.
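If you want to keep a copy of what Scrapy fetched for comparison, a minimal sketch like the following can be dropped into the spider's parse function (page_dump.html is just an illustrative file name):

def parse(self, response):
    # A minimal sketch: write the fetched source to a file so it can be
    # opened in a browser and compared with the live page.
    # 'page_dump.html' is only an illustrative file name.
    with open('page_dump.html', 'w', encoding='utf-8') as f:
        f.write(response.text)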

If you try some other pages, you may receive a page different from the one you wanted; this is usually because the target page requires login verification or employs an anti-crawler strategy. We will cover these topics in a later article.

XPath

Now that we have the source code of the web page, how do we parse out the data? The attentive reader may have noticed that response.text is a string, so using regular expressions would be one workable approach. But as a mature crawler framework, Scrapy provides us with a much simpler and more precise tool: XPath. Change our jobbole.py file to the following:

# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        articles = response.xpath('//a[@class="archive-title"]/text()').extract()
        print(articles)

Run the program with a breakpoint set and you will find that the articles variable is a list of 20 strings: exactly the 20 article titles on the target page. Precise extraction of the target data in a single line; that is the power of XPath.
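For illustration, the printed list looks roughly like this; only the first title (the one examined in the page-source excerpt later in this article) is real, and the rest are elided:

['受 SQLite 多年青睐,C 语言到底好在哪儿?', '...', '...']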

Before learning XPath, it helps to know something about front-end development, especially HTML; at minimum you should be familiar with a few basic concepts.

HTML stands for HyperText Markup Language. Unlike back-end languages such as C, Java, or Python, HTML is not a programming language but a markup language: it describes a page through a series of tags. If that means nothing to you, simply think of HTML as telling the browser, sentence by sentence, "first put a piece of text here, then put a picture there", rather like composing a drawing.

HTML markup elements are usually called HTML tags. A tag is a keyword surrounded by angle brackets <>, for example:

<a href="http://www.w3school.com.cn">This is a link</a>

Here the <a></a> tag marks a link, "This is a link" is the text displayed to the reader, and href is one of the tag's attributes, indicating that the destination URL is http://www.w3school.com.cn. The rendered result is an ordinary clickable link reading "This is a link".

Clicking it takes you to the target page.

If you want to learn more about HTML, you can learn it on W3school: http://www.w3school.com.cn/

XPath is built on top of this tag structure. The most common XPath syntax is summarized below (source: W3school):
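expression   meaning
nodename     selects all child nodes of the named node
/            selects from the root node
//           selects matching nodes anywhere in the document, regardless of position
.            selects the current node
..           selects the parent of the current node
@            selects attributes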

W3school tutorial on XPath: http://www.w3school.com.cn/xpath/index.asp

Here are a few examples:
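//title                   selects every <title> node in the document
/html/body//a             selects every <a> node inside <body>
//a/@href                 selects the href attribute of every <a> node
//a[@class="x"]/text()    selects the text of every <a> node whose class attribute is "x"
//div/a[1]                selects the first <a> child of each <div> node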

For a deeper treatment of XPath, W3school is again recommended. Readers unfamiliar with front-end development can simply follow along with this series; I believe that with some practice you will master the relevant knowledge as well.

With the above knowledge, let's analyze how to get all the article titles on the target page.

Open the target page http://blog.jobbole.com/all-posts/ in the browser and press F12 (in Chrome) to open the developer tools.

1. Right-click the element you want to examine and select "Inspect".

2. Or click the element-picker icon in the top-left corner of the developer tools, then click the element you want to examine on the page.

Chrome will automatically jump to the source of the target element.

<a class="archive-title" target="_blank" href="http://blog.jobbole.com/114331/" title="受 SQLite 多年青睐,C 语言到底好在哪儿?">受 SQLite 多年青睐,C 语言到底好在哪儿?</a>

The target element gives us several important pieces of information:

1. It is an <a></a> tag (a link).

2. It has several attributes, including a class attribute with the value archive-title and an href attribute whose value is the URL of the article page the title links to.

3. It has a title attribute whose value is the same as the element text between the opening and closing tags.

articles = response.xpath('//a[@class="archive-title"]/text()').extract()

In the code above, //a[@class="archive-title"] selects, from anywhere in the document (// means the entire document), every <a> node whose class attribute has the value archive-title (the square brackets [] impose a condition on the preceding tag, such as which attribute it must contain and with what value).

If we search the entire page source, we find that the string archive-title occurs exactly 20 times, once in the class attribute of each article-title <a> tag on this page. In other words, this one expression selects precisely the data we want.
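As a quick sanity check, a short sketch like the following inside parse confirms the count (assuming the all-posts listing page described above):

articles = response.xpath('//a[@class="archive-title"]/text()').extract()
# On the all-posts listing page this prints 20, one per article title
print(len(articles))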

/text() means we want the element value of these tags (the content between the opening and closing tags, i.e. what is displayed on the page). If instead you want an attribute value of the tag, such as href, you can use the following statement:

response.xpath('//a[@class="archive-title"]/@href').extract()

Since we are selecting an attribute, do not forget the @ that marks it as one; without the @, href would be interpreted as an <href> child tag under the target <a> tag, which is obviously wrong.
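To make the difference concrete, here is a short sketch contrasting the two forms:

# Correct: @href selects the href attribute of each matched <a> node
links = response.xpath('//a[@class="archive-title"]/@href').extract()

# Wrong: without @, href is read as an <href> child element of <a>,
# which does not exist, so this returns an empty list
broken = response.xpath('//a[@class="archive-title"]/href').extract()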

extract() converts the selected nodes into their extracted data, i.e. the contents of the tags we specified, as plain strings. If that is hard to picture, delete the .extract() call in the program and observe the result.
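Concretely, the difference between the two looks like this in the debugger (a minimal sketch):

# Without .extract(): a SelectorList of Selector objects
selectors = response.xpath('//a[@class="archive-title"]/text()')

# With .extract(): a plain Python list of strings (the title texts)
titles = response.xpath('//a[@class="archive-title"]/text()').extract()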

Bonus tip: in the developer tools you can right-click a node in the source and choose to copy the XPath of the target tag. Be aware, though, that because of dynamic web pages, an XPath obtained this way may not match the page that Scrapy actually receives. The feature can help you understand XPath more deeply, but in the subsequent programming I still suggest you do the analysis yourself.
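For example, Chrome's "Copy XPath" usually produces an id-anchored, position-based expression of the following shape (a hypothetical path, shown for illustration only), which breaks as soon as the page layout shifts:

//*[@id="archive"]/div[1]/div[2]/p[1]/a[1]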

As Albert says: programs are written so that we can be lazy, so don't be lazy while writing them.

Conclusion

In this section, we learned how to launch Scrapy quickly, covered basic XPath syntax, and made our first attempt at a Scrapy crawler. In the following chapters, we will parse whole pages, use depth-first traversal to visit all the articles on Jobbole, extract the data, download the cover images, and more. We will also use regular expressions to parse strings for the data we want; for reasons of space I will not cover regular expressions in detail, so it is recommended that you pick up some of the basics yourself.

