A First Look at Python Crawlers

Source: Internet
Author: User

I took these notes during a summer course, and now I'm writing them up on the blog, partly to review. I'll keep studying over winter vacation and try to write myself a train-ticket grabbing script.

Since I'm learning on Python 2.7.x, I'm using urllib.
It turns out Python code can be run right inside Sublime Text: Ctrl+B runs the file and shows the output below.
dir(urllib) lists the methods of the module.
help(urllib.urlopen) shows the parameters of that method.

urlopen takes three parameters: first the URL, second the data, third the proxies.

You can call dir() on any object to see what you can do with it.

Originally I wrote baidu.com; entering it redirects to www.baidu.com with a 301.

403: access forbidden
30x: redirection
50x: server-side problem

urllib has a method that downloads the page directly:
urllib.urlretrieve(url, 'save path')
It saves it straight to disk.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib

url = 'http://www.163.com'
html = urllib.urlopen(url)
# print html.read()
print html.getcode()
urllib.urlretrieve(url, 'c:/aaa.html')
html.close()

Remember to close it afterwards: html.close()

Another way to download a webpage

html = urllib.urlopen(url).read()
with open('c:/a1.txt', 'wb') as f:
    f.write(html)

Python also lets you chain the calls:

html = urllib.urlopen(url).read()
print html

Otherwise, if you don't write read() there, you follow up with html = html.read().

If it's just a one-off operation, chaining it like this works.

Encoding errors and garbled output come up all the time. Check what encoding the page uses by viewing its source; look for something like utf-8 or GBK.
Then, after read(), call .decode() on the content and .encode() it into whatever encoding you want.
You can write decode('gbk', 'ignore'); 'ignore' skips bytes that fail to decode, since a single page may mix several encodings.
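A minimal sketch of that decode/encode round trip (the page and the target encoding here are just placeholder choices, not from the notes):

# -*- coding: utf-8 -*-
import urllib

url = 'http://www.163.com'                        # placeholder page
raw = urllib.urlopen(url).read()
# decode from the page's encoding, ignoring undecodable bytes, then re-encode
text = raw.decode('gbk', 'ignore').encode('utf-8')
print text[:200]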

The os module's getcwd() gets the current absolute path;
os.chdir('another path') switches to another path.
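For instance (the target directory below is a made-up example):

import os

print os.getcwd()     # the current absolute path
os.chdir('c:/')       # switch to another path (hypothetical directory)
print os.getcwd()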

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib

def callback(a, b, c):
    # a: blocks downloaded so far, b: block size in bytes, c: total file size
    down_pro = 100.0 * a * b / c
    if down_pro > 100:
        down_pro = 100
    print '%.2f%%' % down_pro

url = 'http://www.iplaypython.com/'
local = 'c:/123.txt'
urllib.urlretrieve(url, local, callback)

The key part is the callback function: it gets 3 arguments, the number of data blocks downloaded so far, the size of each block in bytes, and the total size of the file.

And if you want to print a literal percent sign in a format string, write two percent signs.
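For example, printing a percentage with two decimals:

print '%.2f%%' % 66.6667    # prints 66.67%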

Lesson three: getting the page encoding and reading the page header info.

There's a third-party library that automatically detects a web page's encoding:
download and install the chardet module, run character-set detection on the content, and wrap it in a function.

import urllib

url = 'http://www.163.com'
info = urllib.urlopen(url).info()
print info.getparam('charset')

getparam('charset') on the info object is the call that returns the encoding type.

Detecting the page's encoding:
the third-party module chardet (character-set detection).

Import it, then call read() on the object returned by urlopen, and pass the result to
chardet.detect(the content returned above)
It returns the most likely encoding along with a confidence value.

import urllib
import chardet

url = 'http://www.iplaypython.com'
content = urllib.urlopen(url).read()
print chardet.detect(content)

Then I wrapped it in a function.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import chardet

url = 'http://www.iplaypython.com'

def automatic(url):
    con = urllib.urlopen(url).read()
    res = chardet.detect(con)
    end = res['encoding']
    return end

print automatic(url)

urls = ['http://www.baidu.com', 'http://www.163.com', 'http://www.jd.com']
for x in urls:
    print automatic(x)
Lesson Four

The urllib2 module.

Be sure to pay attention: foreign and domestic sites use different encodings (compare Google and Baidu).
GBK is a Chinese encoding.

Encoding really matters.

Getting around sites that refuse to be crawled...

CSDN, for example, won't answer a bare request; it returns 403.

import urllib2
import random

url = 'http://blog.csdn.net/qq_28295425'
my_headers = {
    'GET': url,
    'Host': 'blog.csdn.net',
    'Referer': 'http://blog.csdn.net/experts.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36'
}
req = urllib2.Request(url, headers=my_headers)
# req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36')
# req.add_header('GET', url)
# req.add_header('Host', 'blog.csdn.net')
# req.add_header('Referer', 'http://blog.csdn.net/experts.html')
asd = urllib2.urlopen(req)
print asd.read()

my_headers = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36',
    'Mozilla/5.5 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36'
]

def get_content(url, headers):
    random_header = random.choice(headers)
    req = urllib2.Request(url)
    req.add_header('User-Agent', random_header)
    req.add_header('Host', 'blog.csdn.net')
    req.add_header('Referer', 'http://blog.csdn.net/')
    req.add_header('GET', url)
    content = urllib2.urlopen(req).read()
    return content

print get_content(url, my_headers)

Note that the random pick of a User-Agent is done with random.choice.

Build the Request from the URL first, then add the header information with add_header, and pass the resulting req to urlopen.

Lesson five: an image-download crawler

For example, grabbing the pictures from a Baidu Tieba post.

Be sure to write the coding declaration.
The -*- is a hyphen, asterisk, hyphen; it isn't typed with Shift:

# -*- coding: utf-8 -*-

Here I learned the regex method findall(pattern, content), which returns the matched content.

There's no trick beyond that: in the pattern, write a regex for the parts that change and copy the parts that stay the same literally.
It's best to wrap it all in functions.

I wrote this one myself, without functions: it extracts the proxy IPs and ports from Kuaidaili.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# first crawl, then extract; setting the header and making it a function are left for later
import re
import urllib2

url1 = 'http://www.kuaidaili.com/'
# <td data-title="IP">123.182.216.241</td>
html1 = urllib2.urlopen(url1)
html = html1.read()
html1.close()
regexip = r'data-title="IP">(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
regexport = r'data-title="PORT">(\d{1,4})'
poxyip = re.findall(regexip, html)
poxyport = re.findall(regexport, html)
for x in range(10):
    print poxyip[x] + ':' + poxyport[x]
Lesson six: the third-party module BeautifulSoup, from bs4

The convenient part is that you don't have to write regular expressions: BeautifulSoup searches by HTML tag. Pass it the string the page returned, then find_all gives back everything carrying that tag. The tag is the first parameter; the second is the class, written class_ (the underscore keeps it from clashing with Python's class keyword). It returns a whole batch of objects. Each returned object works like a dictionary, so looking up ['src'] on it gives the image URL; you can of course look up any other attribute by key too.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib

def get_content(url):
    html = urllib.urlopen(url)
    content = html.read()
    html.close()
    return content

def get_image(info):
    soup = BeautifulSoup(info)
    all_img = soup.find_all('img', class_="BDE_Image")
    for img in all_img:
        print img['src']

url = 'http://tieba.baidu.com/p/4656488748'
info = get_content(url)
get_image(info)

Determine if a file exists

os.path.exists(filename)

Output the current time, formatted:

import time

ISOTIMEFORMAT = '%Y-%m-%d %X'
print time.strftime(ISOTIMEFORMAT, time.localtime())
# e.g. 2016-08-17 16:31:28

Calling time.localtime() directly returns a tuple (a struct_time), not a formatted string.

I wrote some code to grab shared Xunlei (Thunder) VIP accounts.

# -*- coding: utf-8 -*-
import urllib
import re
import os

url1 = 'http://xlfans.com/'
# the literal text in this pattern stands in for the site's Chinese label ("Xunlei VIP account sharing ... password ...")
regex = r'Xunlei VIP account sharing(.+?)password(.*)<'
regex1 = r'class="item"><a href="(.+?)">'
ml = 'c:/xunlei.txt'

def get_html(url):
    html1 = urllib.urlopen(url)
    html = html1.read()
    html1.close()
    return html

def get_re(html):
    xunlei = re.findall(regex, html)
    for a in xunlei:
        with open(ml, 'a') as f:
            b = a[0] + ' ' + a[1]
            f.write(b + '\n')

def get_new(html):
    new = re.findall(regex1, html)
    return new[0]

# f = open(ml, 'wb')
# f.write('1')
# f.close()
html = get_html(url1)
url = get_new(html)
new_html = get_html(url)
if os.path.exists(ml):
    os.remove(ml)
get_re(new_html)
print 'please look at c:/xunlei.txt, thank you!'
print 'newurl=' + url
Writing a directory-scanning tool in Python

Read line by line:

f.readlines()           # read the file into a list, one entry per line
s = 'c:/1.txt'
os.path.splitext(s)

splitext takes the file name apart: one part is the path plus the base name, the other is the extension.
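For example, splitting a path into name and extension:

import os
print os.path.splitext('c:/1.txt')    # ('c:/1', '.txt')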


.empty() checks whether it is empty.

To simulate a form submission: capture the request with a packet sniffer and look at the data that actually gets submitted, then build a dictionary with the same fields (you can change the parameters you need to modify). Build the headers the same way, also from the captured request. The data then has to be encoded with urllib.

data = urllib.urlencode(data)    # encode the dict into URL-encoded form; this is why urllib is still needed here
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
result = response.read().decode('gbk')
# first create a Request from the url, data and headers, then open it with urlopen

Nice, now I get it: to detect a dropped connection, ping Baidu every so often,
then re-establish the connection when the check fails.

# check whether we can currently reach the network (needs: import os, subprocess)
def canConnect(self):
    fnull = open(os.devnull, 'w')
    result = subprocess.call('ping www.baidu.com', shell=True, stdout=fnull, stderr=fnull)
    fnull.close()
    if result:
        return False
    else:
        return True
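A minimal standalone sketch of the polling loop around that check; the 60-second interval and the "reconnect" step are my own placeholders, not from the original article:

import os
import subprocess
import time

def can_connect():
    # standalone version of the canConnect method above
    # (the bare ping terminates on Windows; on Linux add '-c 4' so it returns)
    fnull = open(os.devnull, 'w')
    result = subprocess.call('ping www.baidu.com', shell=True, stdout=fnull, stderr=fnull)
    fnull.close()
    return result == 0

while True:
    if not can_connect():
        print 'connection lost, reconnect logic would go here'   # hypothetical reconnect step
    time.sleep(60)                                               # assumed polling interval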

http://cuiqingcai.com/2083.html

In another article about regex: .* on its own matches any number of any characters; adding a ? makes it non-greedy.
re.S means '.' matches anything, including line breaks.

The usual way to write it is to compile first, i.e.
pattern = re.compile(r'the regex you want')
and then
result = pattern.findall(the text to match)
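A small sketch putting compile and findall together (the pattern and sample text are made up for illustration):

# -*- coding: utf-8 -*-
import re

text = 'name: Alice\nname: Bob'          # made-up sample text
pattern = re.compile(r'name: (\w+)')     # compile the regex once
result = pattern.findall(text)           # then reuse it on whatever text you need
print result                             # ['Alice', 'Bob']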

He reads keyboard input: each time you press Enter the index goes up by one and the next joke is shown,
looping continuously.

Be sure to write exception handling.

• re.I (full name: IGNORECASE): ignore case (the full spelling goes in the parentheses, same below)
• re.M (full name: MULTILINE): multi-line mode, changes the behavior of '^' and '$'
• re.S (full name: DOTALL): makes '.' match anything, including line breaks
• re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale setting
• re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on Unicode character properties
• re.X (full name: VERBOSE): verbose mode; the regex can span multiple lines, whitespace is ignored, and comments can be added.
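For instance, re.S is what lets '.' cross line breaks (the sample text is made up):

import re

text = 'start\nmiddle\nend'
print re.findall(r'start(.*?)end', text)          # [] because '.' stops at the newline
print re.findall(r'start(.*?)end', text, re.S)    # ['\nmiddle\n']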

Before creating a directory, check whether it already exists.
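A minimal sketch of that check (the directory name is a placeholder):

import os

path = 'c:/spider_images'      # placeholder directory
if not os.path.exists(path):
    os.makedirs(path)          # only create it when it isn't there yet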

The three parameters of Request are the URL, the POST data, and the headers.

I'll keep writing these whenever I have free time. Not for anything else; just because I like it.
