Python Crawler Knowledge Points, Part Two


One. Requests Library

import json
import requests
from io import BytesIO

# dir() lists the functions the library exposes, i.e. its API
# print(dir(requests))

url = 'http://www.baidu.com'
r = requests.get(url)
print(r.text)
print(r.status_code)
print(r.encoding)
Result: the response body, the status code, and the encoding requests guessed for the page are printed.
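When the server does not declare a charset in its headers, requests falls back to ISO-8859-1 and r.text can come out garbled. A minimal sketch of the usual fix, assuming the page body is actually UTF-8:

r = requests.get('http://www.baidu.com')
r.encoding = 'utf-8'   # assumption: the page body is UTF-8
print(r.text[:200])    # now decoded with the encoding we set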

# Pass parameters as a dict instead of hand-building a query string
# like http://aaa.com?pageId=1&type=content
params = {'k1': 'v1', 'k2': 'v2'}
r = requests.get('http://httpbin.org/get', params=params)
print(r.url)

Result: r.url shows the query string built from params, i.e. http://httpbin.org/get?k1=v1&k2=v2

 

# Binary data: download an image and save it with PIL
# from PIL import Image
# r = requests.get('http://i-2.shouji56.com/2015/2/11/23dab5c5-336d-4686-9713-ec44d21958e3.jpg')
# image = Image.open(BytesIO(r.content))
# image.save('meinv.jpg')

# JSON processing
r = requests.get('https://github.com/timeline.json')
print(type(r.json()))
print(r.text)
Result: the type of the parsed JSON object and the raw response text are printed.
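The timeline.json endpoint was retired long ago and may now return an error document, and r.json() raises ValueError when the body is not valid JSON, so a guarded call is a sketch worth keeping:

r = requests.get('https://github.com/timeline.json')
try:
    data = r.json()
    print(type(data))
except ValueError:
    print('response body is not valid JSON')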

# Raw data processing: stream the response and write it out in chunks
r = requests.get('http://i-2.shouji56.com/2015/2/11/23dab5c5-336d-4686-9713-ec44d21958e3.jpg', stream=True)
with open('meinv2.jpg', 'wb+') as f:
    for chunk in r.iter_content(1024):
        f.write(chunk)

# Submit a form
form = {'username': 'user', 'password': 'pass'}
r = requests.post('http://httpbin.org/post', data=form)
print(r.text)
Result: the data is submitted form-encoded, so httpbin echoes it back in the form field of its response.

r = requests.post('http://httpbin.org/post', data=json.dumps(form))
print(r.text)
Result: the body is a JSON string rather than form-encoded data, so httpbin places it in the data/json fields instead of the form field.
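Since requests 2.4.2 the json= keyword does the serialization for you and also sets the Content-Type: application/json header, which data=json.dumps(form) does not; a minimal sketch:

r = requests.post('http://httpbin.org/post', json=form)
print(r.text)  # httpbin shows the body under its json field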

# Cookies
r = requests.get('http://www.baidu.com')
cookies = r.cookies  # the cookie jar behaves like a dictionary
for k, v in cookies.get_dict().items():
    print(k, v)

Result: a cookie is, in effect, a set of key-value pairs.

cookies = {'c1': 'v1', 'c2': 'v2'}
r = requests.get('http://httpbin.org/cookies', cookies=cookies)
print(r.text)

Result: httpbin echoes back the cookies that were sent.
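To carry cookies across requests automatically, which is what a crawler usually needs after logging in, requests offers Session; a minimal sketch:

s = requests.Session()
s.get('http://www.baidu.com')   # cookies set by the server are stored in the session
print(s.cookies.get_dict())     # and are sent automatically on later requests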

# Redirects and redirect history
r = requests.head('http://github.com', allow_redirects=True)
print(r.url)
print(r.status_code)
print(r.history)

Result: the request is redirected with a 301; r.history holds the intermediate responses and r.url is the final address.
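To inspect the 301 itself instead of following it, redirects can be switched off; a small sketch:

r = requests.head('http://github.com', allow_redirects=False)
print(r.status_code)           # 301
print(r.headers['Location'])   # where the server wants to send us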

# Proxies
# proxies = {'http': '...', 'https': '...'}
# r = requests.get('...', proxies=proxies)
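A sketch of what a filled-in proxies dict looks like; the addresses below are placeholders, not real servers:

proxies = {
    'http': 'http://10.10.1.10:3128',   # placeholder proxy address
    'https': 'http://10.10.1.10:1080',  # placeholder proxy address
}
# r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)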

Two. BeautifulSoup Library

The example HTML is as follows:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

The parsing code is as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('test.html'), 'html.parser')  # naming a parser avoids a warning

# prettify() makes the HTML text more structured
# print(soup.prettify())

# Tag
print(type(soup.title))
Result: a bs4 Tag class, <class 'bs4.element.Tag'>.
print(soup.title.name)
print(soup.title)
Result: the tag's name (title) and then the whole tag, <title>The Dormouse's story</title>.

# String
print(type(soup.title.string))
print(soup.title.string)

Result: only the text inside the tag is shown (a NavigableString).

# Comment
print(type(soup.a.string))
print(soup.a.string)

Result: the first <a> holds a comment, so its .string is the comment's content; it is therefore sometimes necessary to check whether what you retrieved is a comment rather than ordinary text.
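A small sketch of that check, using the Comment class that bs4 exposes:

from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(open('test.html'), 'html.parser')
s = soup.a.string
if isinstance(s, Comment):
    print('the first <a> contains a comment:', s)
else:
    print(s)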

# Traverse the direct children of <body>
for item in soup.body.contents:
    print(item.name)

Result: the three <p> children of <body> are printed (any bare text nodes print None).

# CSS queries
print(soup.select('.sister'))

Result: class selector; returns every element carrying that class, as a list.

print(soup.select('#link1'))

Result: id selector; returns the element whose id is link1.

print(soup.select('head > title'))

Result: child selector; returns the <title> element directly under <head>.

a_s = soup.select('a')
for a in a_s:
    print(a)

Result: tag selector; returns all the <a> tags.
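In a crawler the next step is usually to pull attributes out of the selected tags; a minimal sketch against the sample page:

for a in soup.select('a.sister'):
    print(a.get('id'), a.get('href'), a.string)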

Updates are ongoing; you are welcome to follow my public account, lhworld.
