Dragnet: A Preliminary Exploration of Python Crawlers


Environment Preparation

Python

We use Python 2.7 for development; make sure the environment variables are configured properly.

IDE

We develop with PyCharm, a JetBrains product from the same family as the famous Android Studio and IntelliJ IDEA.

As for cracking it, here, shamelessly, is a registration code:

Username: yueting3527
Registration code:
===== LICENSE BEGIN =====
93347-1204201000001FMHemWIs"6wozMZnat3IgXKXJ2!nV2I6kSO48hgGLa9JNgjQ5oKz1UsFFR8k"nGzJHzjQT6IBG!1fbQZn9!Vi
===== LICENSE END =====
Requests module

The requests module is a third-party network request library that replaces Python's built-in urllib2.

Installation
    • Windows: pip install requests
    • Linux & Mac: sudo pip install requests
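
Once installed, a quick sanity check from the Python interpreter (a minimal sketch; the exact version printed will depend on what pip fetched):

import requests

# If the import succeeds, the installation worked.
print requests.__version__   # e.g. 2.x.x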

But for various network reasons, this download process often fails, so we frequently need a website like this one to help us download:

http://www.lfd.uci.edu/~gohlke/pythonlibs/

This mirror collects almost all Python third-party libraries. Search for requests and click to download.

When the download completes, rename the file extension to .zip, extract it, and place the extracted requests folder under Python's Lib directory.

Getting page source code via requests

No anti-crawling mechanism

Fetch the page source directly using the get method of the requests library:

import requests

html = requests.get('http://www.hujiang.com/')
print(html.text)

In the terminal we can see the page's source code printed out.
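
Before trusting the text, it is worth checking that the request actually succeeded. A minimal sketch using the standard attributes of the requests response object:

import requests

html = requests.get('http://www.hujiang.com/')
print html.status_code   # 200 means success; 403 or 404 signals a problem
print html.encoding      # the encoding requests guessed from the response headers
if html.status_code == 200:
    print html.text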

With an anti-crawling mechanism

However, many sites will not hand their page content to a crawler so easily. In those cases, we need to modify the HTTP headers of the request.

For example, if we use the same code to crawl my blog, http://blog.csdn.net/eclipsexys, we see this in the terminal:

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>

A 403. Time to modify our crawler code.

First, right-click the page and choose Inspect Element, open the Network tab, refresh, select any request, and find the User-Agent at the end of the request headers:

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36

This is our HTTP request header. Now let's modify the code:

import requests

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
}
html = requests.get('http://blog.csdn.net/eclipsexys', headers=head)
print(html.text.encode('utf-8'))

Add the request header, and encode the printed text as UTF-8 (the console default under Windows is GBK, so change the encoding to UTF-8 first).

PS: if we want to type Chinese in a Python file, we need to declare the file's character set:

# coding=utf-8
See https://www.python.org/dev/peps/pep-0263/ for details.

Run it again, and now we get the source code normally.

Regular-expression search with requests

What get returns is the entire page source, which is definitely more than we need, so we can use regular expressions to extract just the parts we want.
For example, suppose we want to extract all the hyperlinks in the page. Let's see how that is implemented:

The re module

First we import the re module, Python's regular-expression module; its syntax is the same as regular expressions on the web side:

import requests
import re

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
}
html = requests.get('http://www.hujiang.com/', headers=head)
html.encoding = 'utf-8'
href = re.findall('<a target="_blank" href="(.*?)"', html.text, re.S)
for each in href:
    print each
Submitting data to a web page: GET and POST

Their differences are as follows (a short sketch follows the list):
- GET fetches data from the server
- POST transmits data to the server
- GET passes its parameters by constructing them into the URL
- POST submits its data in the request body
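
A minimal sketch of that difference in requests (httpbin.org is just an assumed test endpoint, not part of the original article):

import requests

# GET: parameters are constructed into the URL's query string.
r_get = requests.get('http://httpbin.org/get', params={'page': '2'})
print r_get.url    # http://httpbin.org/get?page=2

# POST: the data travels in the request body, so the URL stays clean.
r_post = requests.post('http://httpbin.org/post', data={'page': '2'})
print r_post.url   # http://httpbin.org/post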

Web analysis tools

Chrome debugging features: Network debugging

Locate the POST submission in the Network tab and find Form Data; that is the submitted form.

Construct a form

Submit the form via POST:

url = 'https://www.crowdfunder.com/browse/deals&template=false'
data = {
    'entities_only': 'true',
    'page': '2'
}
html_post = requests.post(url, data=data)
title = re.findall('"card-title">(.*?)</div>', html_post.text, re.S)
for each in title:
    print each
XPath

XPath is the XML Path Language. Searching for an element with a regular expression is like saying, "I'm looking for a woman with long hair, about 180cm tall." Expressing the same search in XPath is more like, "the receptionist of department XX at company XX."

lxml

To use XPath in Python, we need the third-party module lxml, which installs the same way as requests.

Getting the XPath of an HTML element

Open Chrome's audit element, we find any element, right-click, select Copy XPath.
Of course, we can also handwriting, its basic syntax is as follows:

    • // locates the root node
    • / searches one level down
    • /text() extracts text content
    • /@xxxx extracts attribute content
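
A minimal sketch exercising these pieces of syntax (the HTML snippet here is made up for illustration):

from lxml import etree

# An inline snippet just to demonstrate the syntax above.
html = '<div id="main"><a href="http://example.com">hello</a></div>'
selector = etree.HTML(html)

print selector.xpath('//div[@id="main"]/a/text()')   # ['hello']
print selector.xpath('//div[@id="main"]/a/@href')    # ['http://example.com']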

For example, take this address: http://www.imooc.com/course/list?c=android&page=2

Open Inspect Element:

This makes it very convenient to get an element's XPath, and we can also modify it by hand according to the rules above.

Crawling content

Using XPath is basically the following three steps:

    • from lxml import etree
    • selector = etree.HTML(html_source)
    • selector.xpath(xpath_expression)

Taking the previous URL as an example again, let's crawl the title of a selected course:

# coding=utf-8
import requests
from lxml import etree

html = requests.get('http://www.imooc.com/course/list?c=android&page=2')
html.encoding = 'utf-8'
selector = etree.HTML(html.text)
content = selector.xpath('//*[@id="main"]/div/div/div[3]/div[1]/ul/li[1]/a/h5/span/text()')
for each in content:
    print each

That gives us the content we want. The lookup method is really the same as locating a postal address: China - Shanghai - Pudong New Area - Zhangjiang Hi-Tech - Hujiang - Xu Yisheng (only the final content matters; the front parts merely narrow the search).

So what if we need to crawl all the course information? Look at the copied XPath: it contains li[1], which corresponds to the list in the source. We got the 1 because we picked one specific item; if we remove the index, the query returns a list of all matching elements.
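
For example (a sketch reusing the selector from the example above; the XPath is the copied one with the index dropped):

# li without an index matches every <li>, so the query
# returns the titles of all courses on the page at once.
content = selector.xpath('//*[@id="main"]/div/div/div[3]/div[1]/ul/li/a/h5/span/text()')
for each in content:
    print each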

Advanced XPath tips

Elements whose attributes share the same beginning

For example:

  <div id="test-1">needed content 1</div>
  <div id="test-2">needed content 2</div>
  <div id="test-3">needed content 3</div>

We need the content of all three, but their id attributes differ. Look at the XPath of the first element:

//*[@id="test-1"]/text()

As you can see, the id pins the XPath to a single element. To grab all of them, we use XPath's starts-with(@attribute, common-prefix) function:

//*[starts-with(@id,"test")]/text()

Just use starts-with to match the common part inside the [] and we are done.
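
Putting it together on the three divs above (a minimal, self-contained sketch):

from lxml import etree

html = '''
<div id="test-1">needed content 1</div>
<div id="test-2">needed content 2</div>
<div id="test-3">needed content 3</div>
'''
selector = etree.HTML(html)

# starts-with matches every element whose id begins with "test".
content = selector.xpath('//*[starts-with(@id, "test")]/text()')
for each in content:
    print each   # needed content 1, 2, 3 in turn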

Nested labels

For example:

<div id="class">text1
    <font color=red>text2</font>
    text3
</div>

With nested tags like this, fetching the first level of text with XPath only gets us text1 and text3. To also get text2, we need the string(.) method.

data = selector.xpath('//div[@id="class"]')[0]
info = data.xpath('string(.)')
content = info.replace('\n', '').replace(' ', '')
print content

With string(.), we can get the text of a nested tag; it is equivalent to traversing the child tags and collecting their text.
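
A self-contained sketch of the full flow, using the div from the example above:

from lxml import etree

html = '<div id="class">text1 <font color=red>text2</font> text3</div>'
selector = etree.HTML(html)

data = selector.xpath('//div[@id="class"]')[0]
info = data.xpath('string(.)')            # flattens all nested text into one string
content = info.replace('\n', '').replace(' ', '')
print content                             # text1text2text3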

Finally

The purpose of this post is to set up a giveaway coming later. As you all know, my new book "Android Elite Biography" is officially on sale. To thank everyone for the strong support, I am preparing a giveaway on my CSDN blog, and this post covers the knowledge the giveaway requires, so everyone is welcome to warm up.

Copyright notice: this is the blogger's original article; do not reproduce it without the blogger's permission.
