I. Parser overview
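soup = BeautifulSoup(response.body)
This call parses the page without naming a parser. In that case BeautifulSoup picks the best parser that is installed: lxml first, then html5lib, and Python's built-in "html.parser" only when neither of the others is available.
What is a parser? A parser reads the raw HTML and turns it into a tree of tags that BeautifulSoup can search. Different parsers can build different trees from the same HTML, especially when the markup is invalid.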
An example from the official documentation:
BeautifulSoup("<a></p>", "lxml")         # <html><body><a></a></body></html>
BeautifulSoup("<a></p>", "html5lib")     # <html><head></head><body><a><p></p></a></body></html>
BeautifulSoup("<a></p>", "html.parser")  # <a></a>
The official documentation mentions the "lxml" and "html5lib" parsers frequently because the built-in "html.parser" is weaker at auto-completing missing or mismatched tags and more often produces a problematic tree.
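To see these differences on your own machine, the documentation example above can be wrapped in a small loop. This is only a sketch: the helper name parse_with_each is hypothetical, and it assumes that lxml and html5lib may or may not be installed, so a missing parser is reported rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Same invalid snippet as in the documentation example
BROKEN_MARKUP = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: repaired_markup}; None marks a parser that is not installed.

    parse_with_each is a hypothetical helper written for this sketch.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages
            results[name] = None
    return results


results = parse_with_each(BROKEN_MARKUP)
for name, repaired in results.items():
    shown = repaired if repaired is not None else "not installed"
    print(f"{name:12} -> {shown}")
```

Passing the parser name explicitly, as done here, also keeps the result the same across machines that have different parsers installed.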
II. Crawling news headlines from a news website with BeautifulSoup
import requests
from bs4 import BeautifulSoup

link = "http://tuijian.hao123.com/finance"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")

# find returns only the first matching tag
first_title = soup.find("div", class_="box-text").text
print("The title of the first article is:", first_title)

# find_all returns every matching tag
title_list = soup.find_all("div", class_="box-text")
for i, item in enumerate(title_list):
    title = item.text.strip()
    print('The title of article %s is: %s' % (i + 1, title))
find_all returns every match as a list, so a loop is used to print each headline.
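The difference between find and find_all can be tried without a network connection. The sketch below uses a hypothetical inline HTML string (SAMPLE_HTML) in place of the downloaded page; the div class name follows the crawler above, and everything else is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page
SAMPLE_HTML = """
<div class="box-text"> First headline </div>
<div class="box-text">Second headline</div>
<div class="other">Not a headline</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first matching tag, or None when nothing matches
first = soup.find("div", class_="box-text")
first_title = first.text.strip() if first is not None else None

# find_all returns all matching tags in document order
all_boxes = soup.find_all("div", class_="box-text")
titles = [box.text.strip() for box in all_boxes]

print("The title of the first article is:", first_title)
for i, title in enumerate(titles, start=1):
    print(f"The title of article {i} is: {title}")
```

Checking for None before reading .text avoids an AttributeError when the page layout changes and the expected tag is no longer present.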
For reference, the parsers that BeautifulSoup supports compare as follows:

| Parser | How to use | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python's standard library; moderate execution speed; strong document tolerance | Poor document tolerance in versions prior to Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; strong document tolerance | Requires a C library to be installed |
| lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best document tolerance; parses documents the way a browser does; generates documents in HTML5 format | Slow; requires an external Python package |
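One way to act on this comparison is to ask for the faster or more tolerant parsers first and fall back to the built-in one. The following is a minimal sketch, not a recommended implementation: make_soup and its preferred argument are hypothetical names, and the preference order is an assumption that you may want to change.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair. make_soup is a hypothetical helper.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # optional parser is missing, try the next one
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>Hello<b>world")
print("parser used:", used)
print(soup.get_text())
```

Returning the parser name alongside the soup makes it easy to log which parser actually produced the tree, which helps when results differ between machines.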
Python crawler: several ways to parse web pages with BeautifulSoup