Parsing HTML with Python
1. Understand the data on the webpage

The data on a webpage mainly comes in these formats: HTML, XHTML, XML, and JSON. Each requires a mechanism for receiving and parsing data, and for generating and sending it.

2. Parsing hierarchical HTML

Several third-party libraries can parse HTML, such as lxml, BeautifulSoup, and HTMLParser. Parsing HTML faces two problems: there is no single unified standard, and many web pages do not follow the HTML specification.

2.1 BeautifulSoup
The BeautifulSoup library has the following features:
- Easy to use.
- Version 4 can use the lxml and html5lib parsers to handle nonstandard HTML better.
- It handles character encodings well, converting incoming documents to Unicode automatically.
The parsers differ in their trade-offs: the built-in html.parser needs no extra dependencies and is reasonably fast; lxml is very fast but depends on C libraries; html5lib parses pages the same way a web browser does but is the slowest of the three.
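Whichever parser is chosen, BeautifulSoup's main selling point is its tolerance of broken markup. A minimal sketch, using the built-in "html.parser" so no extra install is needed (the HTML string is invented for illustration):

```python
from bs4 import BeautifulSoup

# Deliberately broken HTML: the <li> and <b> tags are never closed.
broken = "<html><body><ul><li>one<li>two<b>bold</ul></body></html>"

# BeautifulSoup still builds a usable tree. "html.parser" is the
# built-in parser; "lxml" or "html5lib" can be passed instead.
soup = BeautifulSoup(broken, "html.parser")

# Both list items are found even though their close tags are missing.
for li in soup.find_all("li"):
    print(li.get_text())
```

Note that different parsers may repair the same broken markup into slightly different trees, which is one reason the choice of parser matters.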
3. Sample code
Start Python from the terminal and follow along. If you do not have the bs4 library, run the following command (on Ubuntu) to install it:
sudo pip install beautifulsoup4
>>> from bs4 import BeautifulSoup
>>> import urllib
>>> html = urllib.urlopen("http://192.168.1.33/temwet/index.html")
>>> html
<addinfourl at 164618764 whose fp = <socket._fileobject object at 0x9cd19ac>>
>>> html.code
200
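The session above uses Python 2's urllib.urlopen. On Python 3 the equivalent call lives in urllib.request; a sketch of the same fetch (the URL is the article's local test server, so the request only succeeds on that network, and the error branch handles every other case):

```python
from urllib.request import urlopen

# Python 3 equivalent of the Python 2 session above. The URL is the
# article's local test server and is only reachable on that network.
try:
    response = urlopen("http://192.168.1.33/temwet/index.html", timeout=5)
    print(response.code)  # HTTP status code; 200 on success
    page = response.read()
except OSError as err:  # URLError and socket errors are OSError subclasses
    print("could not reach the test server:", err)
```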
Let's take a look at the source code of the webpage:
Use BeautifulSoup for parsing:
Use bt = BeautifulSoup(html.read(), "lxml") to parse the received HTML. bt.title, bt.meta, and bt.title.string access individual elements, while bt.find_all('meta') searches for every matching element; the results come back in a list that can be stored and accessed by index.
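To make these accessors concrete, here is a self-contained sketch using a small invented page in place of the downloaded one (and "html.parser" instead of "lxml", so it runs without extra installs; pass "lxml" as in the text if it is available):

```python
from bs4 import BeautifulSoup

# A small stand-in for the page fetched above, so the snippet runs offline.
page = """<html><head><meta charset="utf-8"><title>Demo</title></head>
<body><p>hello</p></body></html>"""

bt = BeautifulSoup(page, "html.parser")

print(bt.title)         # the whole <title> tag
print(bt.title.string)  # just the text inside it: Demo
metas = bt.find_all("meta")
print(len(metas))       # number of <meta> tags found: 1
```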
What if we want to extract the hyperlinks contained in the webpage? We only need to find the "a" tags: links = bt.find_all("a") saves all the hyperlinks on the page in links. If len(links) equals 0, the page contains no hyperlinks; otherwise the results can be accessed by index, like an array.
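A short sketch of that link-extraction step, run against a small invented page; link.get("href") is used so an anchor without an href attribute does not raise a KeyError:

```python
from bs4 import BeautifulSoup

# Invented sample HTML standing in for the downloaded page.
page = ('<html><body>'
        '<a href="http://example.com/a">A</a>'
        '<a href="http://example.com/b">B</a>'
        '</body></html>')

bt = BeautifulSoup(page, "html.parser")
links = bt.find_all("a")

if len(links) == 0:
    print("no hyperlinks on this page")
else:
    for link in links:
        # .get returns None instead of raising if href is missing
        print(link.get("href"))
```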