Tianluo Website: A Preliminary Exploration of Python Crawlers


Prepare the Python Environment

We use Python 2.7 for development. Remember to configure the environment variables so that Python is on your PATH.

IDE

We use PyCharm for development. Like the well-known Android Studio and IntelliJ IDEA, it comes from JetBrains.

Shameless cracking posts circulate online with a registration code like this:

Username: yueting3527
Registration code: ==== license begin ==== 93347-1204201000001FMHemWIs "6wozMZnat3IgXKXJ2! NV2I6kSO48hgGLa9JNgjQ5oKz1UsFFR8k "nGzJHzjQT6IBG! 1fbQZn9! Vi ===== license end ====
Requests Module

The Requests module is a third-party networking library that replaces Python's built-in urllib2.

Install
  • Windows: pip install requests
  • Linux & Mac: sudo pip install requests

However, for various network reasons, this download often fails, so we frequently turn to a mirror site to fetch the package:

http://www.lfd.uci.edu/~gohlke/pythonlibs/

This mirror hosts almost all third-party Python libraries. Search for Requests and download it.

After the download completes, change the file extension to .zip, extract it, and put the requests folder under Python's Lib directory.

Obtaining Webpage Source Code with Requests (No Anti-Crawler Mechanism)

Directly use the get method of the Requests library to obtain the webpage source code:

import requests

html = requests.get('http://www.hujiang.com/')
print(html.text)

In the terminal, we can see the webpage's source code.

Anti-crawler Mechanism

However, many websites do not let crawlers fetch their pages so easily. In that case, we need to modify the HTTP headers of our request.

For example, if we use the same code to crawl my blog, http://blog.csdn.net/eclipsexys, we see a 403 error in the terminal. At this point, we need to modify the crawler code.

First, right-click the page and choose Inspect, open the Network tab, refresh the page, select any request, and find the User-Agent field:

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36

This is our HTTP request header. Now let's modify the code:

import requests

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
}
html = requests.get('http://blog.csdn.net/eclipsexys', headers=head)
print(html.text.encode('utf-8'))

We add the request headers and set the encoding to UTF-8. (On Windows the default terminal encoding is GBK, so change it to UTF-8 first.)

PS: if you want to put Chinese characters in a Python source file, you need to declare the file's encoding:

# coding=utf-8
See https://www.python.org/dev/peps/pep-0263/ for details

Now we can get the source code normally.

Regular Expression Search with Requests

What Requests returns is the entire source code of the webpage, which is certainly more than we need, so we extract the parts we want with regular expressions.
For example, suppose we want to extract all the hyperlinks in the page. Let's see how to implement that:

Re Module

First, we need to import the re module. The re module provides regular expressions, which work just like the regular expressions used on the web front end:

import requests
import re

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
}
html = requests.get('http://www.hujiang.com/', headers=head)
html.encoding = "utf-8"
href = re.findall('<a target="_blank" href="(.*?)"', html.text, re.S)
for each in href:
    print each
Submitting Data to a Webpage: GET and POST

Their differences are as follows (a short sketch follows the list):
- GET is used to retrieve data from the server.
- POST is used to send data to the server.
- GET passes data by constructing parameters into the URL.
- POST places the data in the request body.
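
As a quick illustration of the difference, here is a minimal sketch. The echo service httpbin.org is not part of the original article; it is used here only because it conveniently shows what the server receives.

import requests

# GET: the parameters are encoded into the URL itself
r_get = requests.get('http://httpbin.org/get', params={'q': 'python', 'page': '1'})
print(r_get.url)  # http://httpbin.org/get?q=python&page=1

# POST: the data travels in the request body, so the URL stays clean
r_post = requests.post('http://httpbin.org/post', data={'q': 'python', 'page': '1'})
print(r_post.status_code)  # 200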

Web analysis tools

Chrome's debugging tools: the Network panel

In the Network panel, locate the request where the POST is submitted and find the Form Data section; that is the form data we need to reproduce in our code.

Construct a form

We construct the form as a dictionary and POST it:

import requests
import re

url = 'https://www.crowdfunder.com/browse/deals&template=false'
data = {
    'entities_only': 'true',
    'page': '2'
}
html_post = requests.post(url, data=data)
title = re.findall('"card-title">(.*?)</div>', html_post.text, re.S)
for each in title:
    print each
XPath

XPath is the XML Path Language. Finding an element with a regular expression is like describing a person by appearance: "a woman with long hair, 180 cm tall." Expressing it with XPath is like giving her exact position: "the front-end developer of department XX at company XX."

lxml

To use XPath in Python, we need the third-party module lxml, which is installed the same way as Requests.

Obtaining an Element's XPath

Open Chrome's Inspect panel, right-click any element, and select Copy XPath.
Of course, you can also write it by hand. The basic syntax is as follows (a short sketch follows the list):

  • // locates from the root node
  • / searches one level down
  • /text() extracts the text content
  • //@xxxx extracts the content of the attribute xxxx
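
As a quick illustration of these rules, here is a minimal sketch; the HTML snippet is made up for this example.

from lxml import etree

source = '<div id="main"><ul><li><a href="/course/1">Android Basics</a></li></ul></div>'
selector = etree.HTML(source)

print(selector.xpath('//li/a/text()'))  # ['Android Basics'] -- text content
print(selector.xpath('//li/a/@href'))   # ['/course/1']      -- attribute content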

For example, let's take this address: http://www.imooc.com/course/list?c=android&page=2

Open the Inspect panel:

In this way, we can easily obtain the element's XPath, and we can also manually modify it according to the rules.

Crawl content

To use XPath, follow these steps:

  • from lxml import etree
  • selector = etree.HTML(html_source)
  • selector.xpath(xpath_expression)

Using the same website as an example, let's grab the title of the selected course:

# coding=utf-8
import requests
from lxml import etree

html = requests.get("http://www.imooc.com/course/list?c=android&page=2")
html.encoding = 'utf-8'
selector = etree.HTML(html.text)
content = selector.xpath('//*[@id="main"]/div/div/div[3]/div[1]/ul/li[1]/a/h5/span/text()')
for each in content:
    print each

In this way, we get the content we want. The search method works just like locating a person by address: China - Shanghai - Pudong New Area - Zhangjiang Hi-Tech - Hujiang - Xu Yisheng (when a level is unique, the levels before it can be skipped).

What if we need to crawl all the course information? Notice the li[1] in the generated XPath: it corresponds to the list in the source code, and the index 1 is there because we selected one specific item. If we remove the [1], the result is a list of all matching elements, as the sketch below shows.
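
Here is a minimal sketch of that idea, using the same page as above. The long XPath is simply the generated one with the index removed; it may drift if the site's layout changes.

# coding=utf-8
# Same request as before, but li[1] is generalized to li,
# so xpath() returns every course title on the page.
import requests
from lxml import etree

html = requests.get("http://www.imooc.com/course/list?c=android&page=2")
html.encoding = 'utf-8'
selector = etree.HTML(html.text)
titles = selector.xpath('//*[@id="main"]/div/div/div[3]/div[1]/ul/li/a/h5/span/text()')
for each in titles:
    print each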

Advanced XPath: Same Starting String, Different Attributes

For example:

<div id="test-1">required content 1</div>
<div id="test-2">required content 2</div>
<div id="test-3">required content 3</div>

We want to extract all three pieces of content, but their attributes differ. Let's first look at the XPath of a single element:

//*[@id="test-1"]/text()

As you can see, the id pins down exactly one element, so we need XPath's starts-with(@attribute-name, common-prefix) function:

//*[starts-with(@id,"test")]/text()

With the starts-with function, all of the matching content is captured at once, as the sketch below shows.
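
Here is a minimal, self-contained sketch using the three divs above (the HTML is embedded directly as a string for the example):

# coding=utf-8
# starts-with() matches every id beginning with "test".
from lxml import etree

source = '''
<div id="test-1">required content 1</div>
<div id="test-2">required content 2</div>
<div id="test-3">required content 3</div>
'''
selector = etree.HTML(source)
content = selector.xpath('//*[starts-with(@id, "test")]/text()')
for each in content:
    print each
# required content 1
# required content 2
# required content 3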

Nested Tag

For example:

<div id="class">text1
    <font color=red>text2</font>
    text3
</div>

With a nested tag like this, if we use XPath to fetch the first-level text, we only get text1 and text3. To get text2 as well, we need the string(.) method.

data = selector.xpath('//div[@id="class"]')[0]
info = data.xpath('string(.)')
content = info.replace('\n', '').replace(' ', '')
print content

Using string(.), we get the text of the nested tags as well; it is equivalent to traversing the sub-tags and collecting their text.

Finally

The purpose of this blog post is to prepare for an upcoming lucky draw. As you may know, my new book, "Android Group Legend," has been officially released. To repay everyone for their suggestions, I am preparing a lucky draw on my CSDN blog, and this post covers the background knowledge it requires. Consider it a warm-up ~~~

Copyright Disclaimer: This is an original article by the blogger and may not be reproduced without the blogger's permission.
