Python parsing HTML

Source: Internet
Author: User

1. Understand the data on a webpage

The data on a webpage mainly comes in the following formats:

-HTML
-XHTML
-XML
-JSON

Working with this data requires a mechanism for receiving and parsing it, and a mechanism for generating and sending it.

2. Parsing HTML

HTML data is hierarchical. Python offers several libraries for parsing it, such as the third-party lxml and BeautifulSoup and the standard-library HTMLParser.

Problems faced when parsing HTML:

-There are no unified standards in practice.
-Many web pages do not follow the HTML document standard.

2.1 BeautifulSoup
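To make the problem of nonstandard pages concrete, here is a minimal sketch using the standard-library HTMLParser (named html.parser in Python 3, which this sketch assumes). The TitleParser class and the sample markup are hypothetical and only serve as illustration; the sample deliberately leaves tags unclosed.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Hypothetical, deliberately sloppy page: unclosed <p>, no </body> or </html>.
sample = "<html><head><title>Demo page</title></head><body><p>Hello<p>World"

parser = TitleParser()
parser.feed(sample)
parser.close()
print(parser.title)
```

The event-based parser does not complain about the missing closing tags, but it also builds no tree, so any hierarchy has to be tracked by hand. That is the gap the tree-building libraries fill.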

The third-party BeautifulSoup library has the following features:

-Easy to use.
-Version 4 allows the use of lxml and html5lib to better handle nonstandard HTML.
-It also handles character encodings well.

The following compares the available parsers and their advantages and disadvantages:
  
  

3. Sample code

Start the Python interpreter from a terminal and try the following experiment. If the bs4 library is not installed, install it first with the following command (on Ubuntu):
  sudo pip install beautifulsoup4
  
  

>>> from bs4 import BeautifulSoup
>>> import urllib  # Python 2; in Python 3, urlopen lives in urllib.request
>>> html = urllib.urlopen("http://192.168.1.33/temwet/index.html")
>>> html
<addinfourl at 164618764 whose fp = <socket._fileobject object at 0x9cd19ac>>
>>> html.code
200
>>>

Let's take a look at the source code of the webpage:
  
  
  
Use BeautifulSoup for parsing:
Use bt = BeautifulSoup(html.read(), "lxml") to parse the received HTML. bt.title, bt.meta, bt.title.string, and bt.find_all('meta') then search for elements. find_all returns a list of all matches, so multiple results can be stored and accessed by index.
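The lookups above can be sketched on a small inline page, so that no network access is needed. The page content here is hypothetical and stands in for html.read(). The sketch assumes Python 3 with bs4 installed, and it passes "html.parser" (the standard-library backend) so that lxml is not required; "lxml" can be passed instead where it is available.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for html.read().
page = """
<html>
  <head>
    <meta charset="utf-8">
    <meta name="description" content="Temperature and humidity">
    <title>Sensor readings</title>
  </head>
  <body><p>22 C</p></body>
</html>
"""

# The second argument selects the parser backend.
bt = BeautifulSoup(page, "html.parser")

print(bt.title)         # the whole <title> element
print(bt.title.string)  # only the text inside it
print(bt.meta)          # the first <meta> element in the document

# find_all returns every match; the result can be indexed like a list.
metas = bt.find_all("meta")
print(len(metas))
print(metas[1]["content"])
```

Attribute access such as bt.title returns only the first matching element, which is why find_all is the call to use when a tag can occur more than once.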
  
  
  
What if we want to extract the hyperlinks contained in the webpage? We only need to find the "a" tags: links = bt.find_all("a") saves all the hyperlinks in the page in links. If len(links) equals 0, the page contains no hyperlinks; otherwise, the individual links can be accessed by index, as in a list.
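A minimal sketch of the link extraction described above, again on a hypothetical inline page and assuming Python 3 with bs4 installed. Collecting the href attributes is an extra step added here for illustration; a.get("href") is used because an "a" tag may lack that attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; the last <a> tag has no href attribute.
page = """
<body>
  <a href="/index.html">Home</a>
  <a href="http://example.com/data">Data</a>
  <a name="top">Anchor without href</a>
</body>
"""

bt = BeautifulSoup(page, "html.parser")
links = bt.find_all("a")

hrefs = []
if len(links) == 0:
    print("no hyperlinks in this page")
else:
    print(links[0])  # individual tags are accessed by index
    # .get() returns None when the attribute is missing, so such tags are skipped.
    hrefs = [a.get("href") for a in links if a.get("href")]
    print(hrefs)
```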
