Tianluo Website: A Preliminary Exploration of Python Crawlers


Prepare the Python Environment

We use Python 2.7 for development. Remember to configure the environment variables so that Python is on your PATH.

IDE

We use PyCharm for development. Like the well-known Android Studio and IntelliJ IDEA, it comes from JetBrains.

Shameless cracking posts circulate online with a registration code like this:

Username: yueting3527
Registration code: ==== license begin ==== 93347-1204201000001FMHemWIs "6wozMZnat3IgXKXJ2! NV2I6kSO48hgGLa9JNgjQ5oKz1UsFFR8k "nGzJHzjQT6IBG! 1fbQZn9! Vi ===== license end ====
Requests Module

The Requests module is a third-party networking library that replaces Python's built-in urllib2.

Install
  • Windows: pip install requests
  • Linux & Mac: sudo pip install requests

However, for various network reasons, this download often fails, so we frequently turn to a mirror site to fetch the package:

http://www.lfd.uci.edu/~gohlke/pythonlibs/

This mirror hosts almost all third-party Python libraries. Search for Requests and download it.

After the download completes, change the file extension to .zip, extract it, and put the requests folder under Python's Lib directory.

Obtaining Webpage Source Code with Requests (No Anti-Crawler Mechanism)

Directly use the get method of the Requests library to obtain the webpage source code:

import requests

html = requests.get('http://www.hujiang.com/')
print(html.text)

In the terminal, we can see the webpage's source code.

Anti-crawler Mechanism

However, many websites do not let crawlers fetch their pages so easily. In that case, we need to modify the HTTP headers of our request.

For example, if we use the same code to crawl my blog, http://blog.csdn.net/eclipsexys, we see a 403 error in the terminal. At this point, we need to modify the crawler code.

First, right-click the page and choose Inspect, open the Network tab, refresh the page, select any request, and find the User-Agent field:

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36

This is our HTTP request header. Now let's modify the code:

import requests

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
}
html = requests.get('http://blog.csdn.net/eclipsexys', headers=head)
print(html.text.encode('utf-8'))

We add the request headers and set the encoding to UTF-8. (On Windows the default terminal encoding is GBK, so change it to UTF-8 first.)

PS: if you want to put Chinese characters in a Python source file, you need to declare the file's encoding:

# coding=utf-8
See https://www.python.org/dev/peps/pep-0263/ for details

Now we can get the source code normally.

Regular Expression Search with Requests

What Requests returns is the entire source code of the webpage, which is certainly more than we need, so we extract the parts we want with regular expressions.
For example, suppose we want to extract all the hyperlinks in the page. Let's see how to implement that:

Re Module

First, we need to import the re module. The re module provides regular expressions, which work just like the regular expressions used on the web front end:

import requests
import re

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
}
html = requests.get('http://www.hujiang.com/', headers=head)
html.encoding = "utf-8"
href = re.findall('<a target="_blank" href="(.*?)"', html.text, re.S)
for each in href:
    print each
Submitting Data to a Webpage: GET and POST

Their differences are as follows (a short sketch follows the list):
- GET is used to retrieve data from the server.
- POST is used to send data to the server.
- GET passes data by constructing parameters into the URL.
- POST places the data in the request body.
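
As a quick illustration of the difference, here is a minimal sketch. The echo service httpbin.org is not part of the original article; it is used here only because it conveniently shows what the server receives.

import requests

# GET: the parameters are encoded into the URL itself
r_get = requests.get('http://httpbin.org/get', params={'q': 'python', 'page': '1'})
print(r_get.url)  # http://httpbin.org/get?q=python&page=1

# POST: the data travels in the request body, so the URL stays clean
r_post = requests.post('http://httpbin.org/post', data={'q': 'python', 'page': '1'})
print(r_post.status_code)  # 200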

Web analysis tools

Chrome's debugging tools: the Network panel

In the Network panel, locate the request where the POST is submitted and find the Form Data section; that is the form data we need to reproduce in our code.

Construct a form

We construct the form as a dictionary and POST it:

import requests
import re

url = 'https://www.crowdfunder.com/browse/deals&template=false'
data = {
    'entities_only': 'true',
    'page': '2'
}
html_post = requests.post(url, data=data)
title = re.findall('"card-title">(.*?)</div>', html_post.text, re.S)
for each in title:
    print each
XPath

XPath is the XML Path Language. Finding an element with a regular expression is like describing a person by appearance: "a woman with long hair, 180 cm tall." Expressing it with XPath is like giving her exact position: "the front-end developer of department XX at company XX."

lxml

To use XPath in Python, we need the third-party module lxml, which is installed the same way as Requests.

Obtaining an Element's XPath

Open Chrome's Inspect panel, right-click any element, and select Copy XPath.
Of course, you can also write it by hand. The basic syntax is as follows (a short sketch follows the list):

  • // locates from the root node
  • / searches one level down
  • /text() extracts the text content
  • //@xxxx extracts the content of the attribute xxxx
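
As a quick illustration of these rules, here is a minimal sketch; the HTML snippet is made up for this example.

from lxml import etree

source = '<div id="main"><ul><li><a href="/course/1">Android Basics</a></li></ul></div>'
selector = etree.HTML(source)

print(selector.xpath('//li/a/text()'))  # ['Android Basics'] -- text content
print(selector.xpath('//li/a/@href'))   # ['/course/1']      -- attribute content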

For example, let's take this address: http://www.imooc.com/course/list?c=android&page=2

Open the Inspect panel:

In this way, we can easily obtain the element's XPath, and we can also manually modify it according to the rules.

Crawl content

To use XPath, follow these steps:

  • from lxml import etree
  • selector = etree.HTML(html_source)
  • selector.xpath(xpath_expression)

Using the same website as an example, let's grab the title of the selected course:

# coding=utf-8
import requests
from lxml import etree

html = requests.get("http://www.imooc.com/course/list?c=android&page=2")
html.encoding = 'utf-8'
selector = etree.HTML(html.text)
content = selector.xpath('//*[@id="main"]/div/div/div[3]/div[1]/ul/li[1]/a/h5/span/text()')
for each in content:
    print each

In this way, we get the content we want. The search method works just like locating a person by address: China - Shanghai - Pudong New Area - Zhangjiang Hi-Tech - Hujiang - Xu Yisheng (when a level is unique, the levels before it can be skipped).

What if we need to crawl all the course information? Notice the li[1] in the generated XPath: it corresponds to the list in the source code, and the index 1 is there because we selected one specific item. If we remove the [1], the result is a list of all matching elements, as the sketch below shows.
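
Here is a minimal sketch of that idea, using the same page as above. The long XPath is simply the generated one with the index removed; it may drift if the site's layout changes.

# coding=utf-8
# Same request as before, but li[1] is generalized to li,
# so xpath() returns every course title on the page.
import requests
from lxml import etree

html = requests.get("http://www.imooc.com/course/list?c=android&page=2")
html.encoding = 'utf-8'
selector = etree.HTML(html.text)
titles = selector.xpath('//*[@id="main"]/div/div/div[3]/div[1]/ul/li/a/h5/span/text()')
for each in titles:
    print each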

Advanced XPath: Same Starting String, Different Attributes

For example:

<div id="test-1">required content 1</div>
<div id="test-2">required content 2</div>
<div id="test-3">required content 3</div>

We want to extract all three pieces of content, but their attributes differ. Let's first look at the XPath of a single element:

//*[@id="test-1"]/text()

As you can see, the id pins down exactly one element, so we need XPath's starts-with(@attribute-name, common-prefix) function:

//*[starts-with(@id,"test")]/text()

With the starts-with function, all of the matching content is captured at once, as the sketch below shows.
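
Here is a minimal, self-contained sketch using the three divs above (the HTML is embedded directly as a string for the example):

# coding=utf-8
# starts-with() matches every id beginning with "test".
from lxml import etree

source = '''
<div id="test-1">required content 1</div>
<div id="test-2">required content 2</div>
<div id="test-3">required content 3</div>
'''
selector = etree.HTML(source)
content = selector.xpath('//*[starts-with(@id, "test")]/text()')
for each in content:
    print each
# required content 1
# required content 2
# required content 3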

Nested Tag

For example:

<div id="class">text1
    <font color=red>text2</font>
    text3
</div>

With a nested tag like this, if we use XPath to fetch the first-level text, we only get text1 and text3. To get text2 as well, we need the string(.) method.

data = selector.xpath('//div[@id="class"]')[0]
info = data.xpath('string(.)')
content = info.replace('\n', '').replace(' ', '')
print content

Using string(.), we get the text of the nested tags as well; it is equivalent to traversing the sub-tags and collecting their text.

Finally

The purpose of this blog post is to prepare for an upcoming lucky draw. As you may know, my new book, "Android Group Legend," has been officially released. To repay everyone for their suggestions, I am preparing a lucky draw on my CSDN blog, and this post covers the background knowledge it requires. Consider it a warm-up ~~~

Copyright Disclaimer: This is an original article by the blogger and may not be reproduced without the blogger's permission.
