Python Crawler Learning Diary 2: the power of BeautifulSoup (based on Ubuntu)


In the previous article, we mentioned that Python's requests library can fetch the entire source code of a web page. To actually use what we fetch, we have to process it. It's like learning from a book: some books we can take in easily, but when it comes to philosophy, history, or other heavy subjects, we need to analyze carefully, taking the essence and discarding the dross. In Python there is, of course, "someone" who does this job for us, and that is the BeautifulSoup library we need to install (on Ubuntu, for example, with pip install beautifulsoup4).

I wanted to keep using yesterday's website as the example, but the effect is not good, so I'll attach that source code at the end of this article for friends (or my future self) to study again.

# the example document from the official BeautifulSoup handbook
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.text)

Here I quoted a piece of HTML code; it is the example from the official handbook, a fragment based on Alice in Wonderland.

We can see that it extracts the key thing: it strips away the markup we can't read and leaves behind story text that we can understand.
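With the example document above, the output of print(soup.text) looks roughly like this (exact whitespace may differ):

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.

...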

The reason learning requires a book is not the combination of all of its words, but the essence carried in some of them.

We can extract all of it, or of course extract only what we want; for example, the title of the book is what I want to extract.


In fact, soup.title is the Python way to get the title. In the HTML code you can see that there are two titles: the one above is the page title set in the HTML <head>, and the one below is a heading inside the body. The soup.title call crawls the page title set in the HTML.
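A minimal sketch of the difference, reusing the soup object from the first snippet (the output shown in the comments is what the example document gives):

# the page title set in the HTML <head>
print(soup.title)
# -> <title>The Dormouse's story</title>

# the heading inside the body, marked with class="title"
print(soup.find('p', class_='title'))
# -> <p class="title"><b>The Dormouse's story</b></p>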

We can also crawl other things: the URL at a particular location, an id, a name, and so on, still using the original HTML example.

Before I start, let me introduce one piece of code:

soup.find_all('a')

It finds all the <a> tags in the document.

After you've fully absorbed that, you can continue to learn. Recall the three links in the example document:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

The following code pulls the href out of each of them:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))

The code above finds all the <a> tag links in the document. You can also find <a> tags by their class or id. If you want to test this, it's best to first predict what will happen; perhaps there will be a big surprise.
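For example, a minimal sketch of filtering by class and by id (class_ and the id keyword argument are standard bs4 filters; the attribute values come from the example document):

# all <a> tags whose class is "sister"
for link in soup.find_all('a', class_='sister'):
    print(link)

# the single tag whose id is "link2"
print(soup.find(id='link2'))
# -> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>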

But what about the names, the text part of the fragment? Remember the first example of this little article, which crawled the fragment of the Alice in Wonderland novel.

I used an ingenious method to combine the preceding knowledge:

Since soup.find_all('a') can find all the <a> tag links, we'll start from those links and ask them for their text.

Hey! It complains: because what we have now is a list, it can't give us text directly.

Since it is a list (you can understand it as an array, at least that's how I understand it), I need to extract the body text from each element, one at a time.

The effect is good too; it achieves exactly what I wanted.
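A minimal sketch of that sequence, reusing the soup object from above:

links = soup.find_all('a')

# links is a ResultSet (essentially a list of tags), so links.text fails
# with an AttributeError instead of returning text

# extract the body text one element at a time
for link in links:
    print(link.text)
# -> Elsie
# -> Lacie
# -> Tillie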

The BeautifulSoup library has many more functions; this little article has only touched a small part of them. I'll expand on the rest when we need it later, and I still need time to study myself; learning by absorbing things in actual combat works very well.

Finally, one more thing: it's not that I didn't want to reuse yesterday's page at the beginning of this article; the effect was simply not good. As for why, I've attached the code so you can go and try it:

import requests
from bs4 import BeautifulSoup

res = requests.get("http://baike.sogou.com/v77860.htm?pid=baike.box")
soup = BeautifulSoup(res.text, "html.parser")
print(soup.text)
