Dragnet: A Preliminary Exploration of Python Crawlers


Environment Preparation

Python

We use Python 2.7 for development; make sure the environment variables are configured properly.

IDE

We develop with PyCharm, a JetBrains product from the same family as the famous Android Studio and IntelliJ IDEA.

As for cracking it, here, shamelessly, is a registration code:

Username: yueting3527
Registration code:
===== LICENSE BEGIN =====
93347-1204201000001FMHemWIs"6wozMZnat3IgXKXJ2!nV2I6kSO48hgGLa9JNgjQ5oKz1UsFFR8k"nGzJHzjQT6IBG!1fbQZn9!Vi
===== LICENSE END =====
Requests module

The requests module is a third-party network request library that replaces Python's built-in urllib2.

Installation
    • Windows: pip install requests
    • Linux & Mac: sudo pip install requests
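
Once installed, a quick sanity check from the Python interpreter (a minimal sketch; the exact version printed will depend on what pip fetched):

import requests

# If the import succeeds, the installation worked.
print requests.__version__   # e.g. 2.x.x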

But for various network reasons, this download process often fails, so we frequently need a website like this one to help us download:

http://www.lfd.uci.edu/~gohlke/pythonlibs/

This mirror collects almost all Python third-party libraries. Search for requests and click to download.

When the download completes, rename the file extension to .zip, extract it, and place the extracted requests folder under Python's Lib directory.

Getting page source code via requests

No anti-crawling mechanism

Fetch the page source directly using the get method of the requests library:

import requests

html = requests.get('http://www.hujiang.com/')
print(html.text)

In the terminal we can see the page's source code printed out.
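
Before trusting the text, it is worth checking that the request actually succeeded. A minimal sketch using the standard attributes of the requests response object:

import requests

html = requests.get('http://www.hujiang.com/')
print html.status_code   # 200 means success; 403 or 404 signals a problem
print html.encoding      # the encoding requests guessed from the response headers
if html.status_code == 200:
    print html.text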

With an anti-crawling mechanism

However, many sites will not hand their page content to a crawler so easily. In those cases, we need to modify the HTTP headers of the request.

For example, if we use the same code to crawl my blog, http://blog.csdn.net/eclipsexys, we see this in the terminal:

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>

A 403. Time to modify our crawler code.

First, right-click the page and choose Inspect Element, open the Network tab, refresh, select any request, and find the User-Agent at the end of the request headers:

User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36

This is our HTTP request header. Now let's modify the code:

import requests

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
}
html = requests.get('http://blog.csdn.net/eclipsexys', headers=head)
print(html.text.encode('utf-8'))

Add the request header, and encode the printed text as UTF-8 (the console default under Windows is GBK, so change the encoding to UTF-8 first).

PS: if we want to type Chinese in a Python file, we need to declare the file's character set:

# coding=utf-8
See https://www.python.org/dev/peps/pep-0263/ for details.

Run it again, and now we get the source code normally.

Regular-expression search with requests

What get returns is the entire page source, which is definitely more than we need, so we can use regular expressions to extract just the parts we want.
For example, suppose we want to extract all the hyperlinks in the page. Let's see how that is implemented:

The re module

First we import the re module, Python's regular-expression module; its syntax is the same as regular expressions on the web side:

import requests
import re

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
}
html = requests.get('http://www.hujiang.com/', headers=head)
html.encoding = 'utf-8'
href = re.findall('<a target="_blank" href="(.*?)"', html.text, re.S)
for each in href:
    print each
Submitting data to a web page: GET and POST

Their differences are as follows (a short sketch follows the list):
- GET fetches data from the server
- POST transmits data to the server
- GET passes its parameters by constructing them into the URL
- POST submits its data in the request body
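
A minimal sketch of that difference in requests (httpbin.org is just an assumed test endpoint, not part of the original article):

import requests

# GET: parameters are constructed into the URL's query string.
r_get = requests.get('http://httpbin.org/get', params={'page': '2'})
print r_get.url    # http://httpbin.org/get?page=2

# POST: the data travels in the request body, so the URL stays clean.
r_post = requests.post('http://httpbin.org/post', data={'page': '2'})
print r_post.url   # http://httpbin.org/post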

Web analysis tools

Chrome debugging features: Network debugging

Locate the POST submission in the Network tab and find Form Data; that is the submitted form.

Construct a form

Submit the form via POST:

url = 'https://www.crowdfunder.com/browse/deals&template=false'
data = {
    'entities_only': 'true',
    'page': '2'
}
html_post = requests.post(url, data=data)
title = re.findall('"card-title">(.*?)</div>', html_post.text, re.S)
for each in title:
    print each
XPath

XPath is the XML Path Language. Searching for an element with a regular expression is like saying, "I'm looking for a woman with long hair, about 180cm tall." Expressing the same search in XPath is more like, "the receptionist of department XX at company XX."

lxml

To use XPath in Python, we need the third-party module lxml, which installs the same way as requests.

Getting the XPath of an HTML element

Open Chrome's audit element, we find any element, right-click, select Copy XPath.
Of course, we can also handwriting, its basic syntax is as follows:

    • // locates the root node
    • / searches one level down
    • /text() extracts text content
    • /@xxxx extracts attribute content
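
A minimal sketch exercising these pieces of syntax (the HTML snippet here is made up for illustration):

from lxml import etree

# An inline snippet just to demonstrate the syntax above.
html = '<div id="main"><a href="http://example.com">hello</a></div>'
selector = etree.HTML(html)

print selector.xpath('//div[@id="main"]/a/text()')   # ['hello']
print selector.xpath('//div[@id="main"]/a/@href')    # ['http://example.com']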

For example, take this address: http://www.imooc.com/course/list?c=android&page=2

Open Inspect Element:

This makes it very convenient to get an element's XPath, and we can also modify it by hand according to the rules above.

Crawling content

Using XPath is basically the following three steps:

    • from lxml import etree
    • selector = etree.HTML(html_source)
    • selector.xpath(xpath_expression)

Taking the previous URL as an example again, let's crawl the title of a selected course:

# coding=utf-8
import requests
from lxml import etree

html = requests.get('http://www.imooc.com/course/list?c=android&page=2')
html.encoding = 'utf-8'
selector = etree.HTML(html.text)
content = selector.xpath('//*[@id="main"]/div/div/div[3]/div[1]/ul/li[1]/a/h5/span/text()')
for each in content:
    print each

That gives us the content we want. The lookup method is really the same as locating a postal address: China - Shanghai - Pudong New Area - Zhangjiang Hi-Tech - Hujiang - Xu Yisheng (only the final content matters; the front parts merely narrow the search).

So what if we need to crawl all the course information? Look at the copied XPath: it contains li[1], which corresponds to the list in the source. We got the 1 because we picked one specific item; if we remove the index, the query returns a list of all matching elements.
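
For example (a sketch reusing the selector from the example above; the XPath is the copied one with the index dropped):

# li without an index matches every <li>, so the query
# returns the titles of all courses on the page at once.
content = selector.xpath('//*[@id="main"]/div/div/div[3]/div[1]/ul/li/a/h5/span/text()')
for each in content:
    print each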

Advanced XPath tips

Elements whose attributes share the same beginning

For example:

  <div id="test-1">needed content 1</div>
  <div id="test-2">needed content 2</div>
  <div id="test-3">needed content 3</div>

We need the content of all three, but their id attributes differ. Look at the XPath of the first element:

//*[@id="test-1"]/text()

As you can see, the id pins the XPath to a single element. To grab all of them, we use XPath's starts-with(@attribute, common-prefix) function:

//*[starts-with(@id,"test")]/text()

Just use starts-with to match the common part inside the [] and we are done.
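
Putting it together on the three divs above (a minimal, self-contained sketch):

from lxml import etree

html = '''
<div id="test-1">needed content 1</div>
<div id="test-2">needed content 2</div>
<div id="test-3">needed content 3</div>
'''
selector = etree.HTML(html)

# starts-with matches every element whose id begins with "test".
content = selector.xpath('//*[starts-with(@id, "test")]/text()')
for each in content:
    print each   # needed content 1, 2, 3 in turn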

Nested labels

For example:

<div id="class">text1
    <font color=red>text2</font>
    text3
</div>

With nested tags like this, fetching the first level of text with XPath only gets us text1 and text3. To also get text2, we need the string(.) method.

data = selector.xpath('//div[@id="class"]')[0]
info = data.xpath('string(.)')
content = info.replace('\n', '').replace(' ', '')
print content

With string(.), we can get the text of a nested tag; it is equivalent to traversing the child tags and collecting their text.
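
A self-contained sketch of the full flow, using the div from the example above:

from lxml import etree

html = '<div id="class">text1 <font color=red>text2</font> text3</div>'
selector = etree.HTML(html)

data = selector.xpath('//div[@id="class"]')[0]
info = data.xpath('string(.)')            # flattens all nested text into one string
content = info.replace('\n', '').replace(' ', '')
print content                             # text1text2text3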

Finally

The purpose of this post is to set up a giveaway coming later. As you all know, my new book "Android Elite Biography" is officially on sale. To thank everyone for the strong support, I am preparing a giveaway on my CSDN blog, and this post covers the knowledge the giveaway requires, so everyone is welcome to warm up.

Copyright notice: this is the blogger's original article; do not reproduce it without the blogger's permission.
