In the previous article, we mentioned that Python's requests library can fetch the entire source code of a web page. What we fetch naturally has to be processed. It is like studying: a book is easy to obtain, but whether it talks about philosophy, history or something else, we need to analyze it carefully, keeping the essence and discarding the dross. In Python there are, of course, 'people' who do this job for us, which is why we need to install the 'beautifulsoup' library.
I wanted to keep using yesterday's website as the example, but the result was not good. I will attach that source code at the end of this article, so that friends, or I myself, can study it again.
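from bs4 import BeautifulSoup

html_doc = """..."""  # the HTML sample from the official handbook goes here
# name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.text)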
The HTML I quoted is the example from the official handbook, a passage from Alice in Wonderland.
We can see that it extracts the key thing: it strips out the markup that people cannot read and leaves behind the text of the novel, which can be understood.
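As a minimal sketch of this extraction, the snippet below parses a shortened stand-in for the handbook example. The content of `sample_html` is an assumption made for illustration, and `html.parser` is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the handbook's HTML sample.
sample_html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text() returns only the text nodes; the tags themselves are dropped.
plain_text = soup.get_text()
print(plain_text)
```

The printed result should contain the story text but none of the angle-bracket markup.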
The reason studying requires a book is not the sum of all its words, but the essence carried by some of them.
We can extract all of the text and, of course, we can also extract just the part we want. For example, to extract the name of the book, I use soup.title.
This is, in fact, Python's way of getting the title. In the HTML code you can see that there are two titles: the upper one is the page title set in the HTML head, and the lower one is a heading inside the body. soup.title returns the page title set in the HTML.
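To make the difference between the two titles concrete, here is a small sketch. The `sample_html` string is again an assumed stand-in for the handbook sample, and looking up the heading with `find("p", class_="title")` is one possible way to reach it, not necessarily the one used in the original screenshots.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened stand-in for the handbook's HTML sample.
sample_html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""

soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.string            # text of the <title> tag in <head>
heading = soup.find("p", class_="title")  # the heading paragraph in <body>
heading_text = heading.get_text()

print(soup.title.name)  # the tag that soup.title points at
print(page_title)
print(heading_text)
```

In this sample both happen to carry the same words, but they come from two different tags.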
I can also extract other things, such as the URL at a particular position, an ID, a name, ... again using the original HTML example.
Before I start, let me introduce one piece of code: soup.find_all('a')
This finds all of the <a> tags.
Once you have fully absorbed this, you can continue.
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
# visit every <a> tag and print the value of its href attribute
for link in soup.find_all('a'):
    print(link.get('href'))
The code above finds the link of every <a> tag in the document. You can find the class and id of the <a> tags in the same way. If you want to test this, it is best to predict what will happen before you run it; there may be a big surprise.
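A possible version of that experiment is sketched below. The `sample_html` string is an assumed stand-in containing only the three links, and the list names `link_ids` and `link_classes` are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Assumption: stand-in for the three links in the handbook sample.
sample_html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
"""

soup = BeautifulSoup(sample_html, "html.parser")

link_ids = []
link_classes = []
for link in soup.find_all("a"):
    link_ids.append(link.get("id"))
    # class may hold several values, so BeautifulSoup returns it as a list
    link_classes.append(link.get("class"))

print(link_ids)
print(link_classes)
```

The id comes back as a plain string, while class comes back as a list, because an HTML tag may carry more than one class.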
The name, however, is part of the tag's body text rather than an attribute. Remember the first example in this short article, which extracted the text of the Alice in Wonderland fragment.
I used a little trick that combines what we have learned so far:
Since soup.find_all('a') finds all of the <a> tags, we start from that result and ask it for its text.
Hey! It fails, because the result is now a list, and a list has no text.
Since it is a list [think of it as an array, or however you prefer to picture it], I need to extract the text from its items one by one.
The result is good, and it is exactly what I wanted.
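The two attempts described above can be sketched roughly as follows. The `sample_html` string is an assumed stand-in for the three links, and the list comprehension is just one way to walk through the list; an ordinary for loop works equally well.

```python
from bs4 import BeautifulSoup

# Assumption: stand-in for the three links in the handbook sample.
sample_html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
"""

soup = BeautifulSoup(sample_html, "html.parser")
links = soup.find_all("a")

# find_all() returns a list-like result, which has no .text of its own
try:
    links.text
    list_has_text = True
except AttributeError:
    list_has_text = False

# Take the text from each tag in turn
names = [link.get_text() for link in links]

print(list_has_text)
print(names)
```

Asking the whole list for its text raises an AttributeError, while asking each tag in turn gives the three names.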
The BeautifulSoup library has many more functions. This short article has only covered a few of them, and I will expand on the rest later as I come to use them. I also need time to study, and learning sinks in best through hands-on practice.
One last thing: at the beginning of the article I said I had meant to reuse yesterday's website, but the result was not good. As for why, I have attached the code below so that you can try it yourself.
import requests
from bs4 import BeautifulSoup

# fetch the page used in the previous article
res = requests.get("http://baike.sogou.com/v77860.htm?pid=baike.box")
soup = BeautifulSoup(res.text, "html.parser")
print(soup.text)
Python Crawler Learning Diary 2: The Power of BeautifulSoup (based on an Ubuntu system)