Python Crawler Learning Notes (1)


Python is a very powerful and complete language, something I did not appreciate when I first started learning it. Thinking back to my original motivation six months ago: ArcGIS provides a Python scripting environment, and when I learned that some of the powerful tools in ArcToolbox are actually written in Python, I naturally wanted to try it myself and simplify tedious work, which is one of the reasons I like programming.

But to be honest, I have only studied Python intermittently and have not written any script tools yet. Still, the gain is not small: at least I learned a language. Although that deviates somewhat from the original goal, learning itself does no harm, because you always accidentally pick up other things along the way.

Python is open source, so in addition to the official library there are many third-party libraries that can do a great deal: scientific computing, machine learning, web frameworks, and of course crawlers, which I find very interesting. At the moment I am just getting started; I am learning NumPy by following a book, and for crawlers I look up material online and study other people's examples.

The first one, written today, follows a blog post at http://www.cnblogs.com/fnng/p/3576154.html. The example shows how to download all the pictures on a web page to the local disk and number them. Before reading the code I had no concept of how to do this; after reading it, the whole process turned out not to be complex. Python provides the very powerful urllib library, and the code is quite simple.

1. Import the related libraries

# library for reading web pages (Python 2)
import urllib

# regular expression module
import re

A regular expression is used here. Regular expressions (Regular Expression) are quite complex, and they are not unique to Python; languages like Java and C# have them too. Put plainly, you write a rule and match strings against it. Mastering the basics is enough; the complex parts have to be learned slowly. Here I again recommend Liao Xuefeng's tutorial: http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001386832260566c26442c671fa489ebc6fe85badda25cd000 — it is not difficult to get started.
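As a minimal sketch of the idea (the pattern and test string here are made up for illustration), re.search scans a string for the first place a pattern matches:

import re

# find the first run of digits in a string
m = re.search(r'\d+', 'page 42 of 100')
if m:
    print m.group()  # prints: 42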

2. Get the HTML of the web page

def gethtml(url):
    # similar to opening a file
    page = urllib.urlopen(url)
    # similar to reading a file
    html = page.read()
    return html

For convenience later on, this step is encapsulated as a function. The code is very simple and closely parallels reading a file.
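Note that urllib.urlopen is the Python 2 API; in Python 3 the same call lives in urllib.request, so the equivalent function would look roughly like this:

import urllib.request

def gethtml(url):
    # urlopen moved to the urllib.request module in Python 3
    page = urllib.request.urlopen(url)
    return page.read()  # returns bytes in Python 3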

3. Write Regular Expressions

This is actually the most complicated part of the whole crawling process. Because site content varies enormously and the information you want to extract differs each time, there is no universal regular expression; it has to be written case by case.

The next step is to analyze the web page itself; HTML + CSS + JavaScript is something I will have to keep learning.

For example, the page http://tieba.baidu.com/p/4571038933?see_lz=1 contains many pictures. Press F12 in Chrome to view the source, and you will find that every image is linked through an <img> tag whose src attribute points to a .jpg URL, followed by a pic_ext attribute.

All we need to do is extract the links inside src="".

The code is as follows:

def getimag(html):
    # compile the pattern string into a regular-expression object;
    # re.S makes '.' match newlines too (multi-line matching), which is commonly needed
    pattern = re.compile(r'src="(.+?\.jpg)" pic_ext', re.S)
    # re.findall returns all matched substrings as a list;
    # also common are match (matches only from the start) and search (matches anywhere)
    imags = re.findall(pattern, html)
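To make the difference between findall, match, and search mentioned in the comment concrete, here is a small sketch (the test string is made up):

import re

s = 'cat catalog cats'
print re.findall(r'cat\w*', s)          # ['cat', 'catalog', 'cats'] - every match, as a list
print re.match(r'catalog', s)           # None - match only succeeds at the start of the string
print re.search(r'catalog', s).group()  # 'catalog' - search finds a match anywhere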

The regular-expression part of the function is now written, but that is only half the job; the next step is to download the files and save them locally based on the extracted links.

4. Get Pictures

    # t numbers the downloaded files
    t = 1
    for img in imags:
        # urllib.urlretrieve fetches the file at the link and saves it to the given path
        urllib.urlretrieve(img, 'D:\\Learn\\Code\\python\\pachong\\photo\\%s.jpg' % t)
        t += 1

5. Trial run

url = 'http://tieba.baidu.com/p/4571038933?see_lz=1'
html = gethtml(url)
getimag(html)

The downloaded files then appear in the target folder.

It really does give a sense of accomplishment.

6. Now let's improve the code

On this site I found a more interesting usage: http://www.nowamagic.net/academy/detail/1302861

Looking at the help documentation, it turns out that urlretrieve accepts a callback function.

Specifically, you can use this to write a progress-reporting callback, reporthook():

def reporthook(a, b, c):
    # a: number of data blocks downloaded so far, b: size of each block, c: total size of the remote file
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print '%.2f%%' % per

Then pass reporthook as an argument to urlretrieve().
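For example, plugged into the download loop from step 4, the call might look like this (same hypothetical save path as before; reporthook is the third positional argument of urlretrieve in Python 2):

# reporthook is called once at the start and once per block received
urllib.urlretrieve(img, 'D:\\Learn\\Code\\python\\pachong\\photo\\%s.jpg' % t, reporthook)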

Very interesting!

Summary:

1. On crawlers I have not found any dedicated books, but online resources are actually quite rich, especially the many technical blogs whose authors write in great detail. Studying them has been a big gain; everything I wrote above was found on the Internet. Whatever I come across that I cannot do, I go and learn properly.

2. The official Python documentation is not very detailed, but it is concise and covers the core. Reading it more often turns up many blind spots, like the callback function in this example, which I suspect many people never use when writing crawlers.

3. Crawling exposes you to a lot of web front-end content. To crawl better, you have to learn HTML + CSS + JavaScript, and perhaps even distributed crawling.
