BeautifulSoup for Python crawlers

Source: Internet
Author: User

Crawlers that rely on regular expressions sometimes hang as if frozen.

That is because the regex engine can get stuck endlessly backtracking while searching for a match.

Example: https://social.msdn.microsoft.com/forums/azure/en-us/3f4390ac-11eb-4d67-b946-a73ffb51e4f3/netcpu100
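A tiny illustration of that failure mode (the pattern here is a made-up textbook example, not taken from the linked thread):

```python
import re

# A pattern with nested quantifiers: every extra 'a' in a near-miss input
# roughly doubles the number of ways the engine can split the string, so
# the match appears to hang ("catastrophic backtracking").
pattern = re.compile(r'(a+)+b')

print(bool(pattern.match('aaaab')))   # matches instantly
print(bool(pattern.match('a' * 15)))  # fails, but already needs ~2^15 steps
# pattern.match('a' * 40) would effectively never return.
```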

So when parsing web pages, you can use the BeautifulSoup library instead of regular expressions.
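As a minimal sketch of the idea (the HTML snippet is made up; it assumes the bs4 package is installed), extracting links with BeautifulSoup instead of a regex looks like this:

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for a downloaded page.
html = '<html><body><a href="/one">One</a> <a href="/two">Two</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# The parser walks the tag tree, so there is no backtracking to blow up.
links = [a.get('href') for a in soup.find_all('a')]
print(links)  # ['/one', '/two']
```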

The explanations of BeautifulSoup available online are overly complicated.

I just picked out the parts I actually need; the rest can wait until it is required, so there is no need to waste time on it.

That saves a lot of worry, at least.

The explanations are in the comments.

Print out each statement and you'll see what it does.

#!/usr/bin/python3.4
# -*- coding: utf-8 -*-
import urllib.request
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = "http://www.lenggirl.com/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept': 'text/html;q=0.9,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'gzip',
        'Connection': 'close',
        'Referer': None,
    }
    # Note: the headers above are defined but never attached to the request;
    # wrap the URL in urllib.request.Request(url, headers=...) to send them.
    data = urllib.request.urlopen(url).read()
    # Other decodings to try: ('utf-8') ('unicode_escape') ('gbk', 'ignore')
    data = data.decode('utf-8', 'ignore')
    # Parse the page
    soup = BeautifulSoup(data, "html.parser")
    # The whole page, pretty-printed
    html = soup.prettify()
    # <head>...</head>
    head = soup.head
    # <body>...</body>
    body = soup.body
    # The first <p>...</p>
    p = soup.p
    # The text content of that <p>
    p_string = soup.p.string
    # soup.p.contents[0] is '2016\n'
    # soup.p.contents is ['2016\n']
    p_string = soup.p.contents[0]
    # Iterate over every direct child of <body>
    for child in soup.body.children:
        # print(child)
        pass
    # All <a>...</a> and <p>...</p> tags
    a_and_p = soup.find_all(["a", "p"])
    # Every URL inside an <a>...</a>
    for myimg in soup.find_all('a'):
        img_src = myimg.get('href')
        # print(img_src)
    # For each <a class="a">...</a>, the src of the <img> inside it
    for myimg in soup.find_all('a', class_='a'):
        img_src = myimg.find('img').get('src')
    # The full page source
    # print(html)
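To make the `.string` / `.contents` lines in the code concrete, here is a small self-contained example (the toy HTML is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>2016\n</p><p>hello <b>world</b></p></body>',
                     'html.parser')

# .string works when the tag has exactly one text child...
print(repr(soup.p.string))   # '2016\n'
print(soup.p.contents)       # ['2016\n']

# ...but returns None for mixed content, where .contents still works:
second = soup.find_all('p')[1]
print(second.string)         # None
print(len(second.contents))  # 2: the text 'hello ' and the <b> tag
```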
