Python Crawler Learning Diary 2: the power of BeautifulSoup (based on Ubuntu)


In the previous article, we mentioned that Python's requests library can fetch the entire source code of a web page. To actually use what we fetch, we have to process it. It's like learning from a book: some books we can take in easily, but when it comes to philosophy, history, or other heavy subjects, we need to analyze carefully, taking the essence and discarding the dross. In Python there is, of course, "someone" who does this job for us, and that is the BeautifulSoup library we need to install (on Ubuntu, for example, with pip install beautifulsoup4).

I wanted to keep using yesterday's website as the example, but the effect is not good, so I'll attach that source code at the end of this article for friends (or my future self) to study again.

# the example document from the official BeautifulSoup handbook
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.text)

Here I quoted a piece of HTML code; it is the example from the official handbook, a fragment based on Alice in Wonderland.

We can see that it extracts the key thing: it strips away the markup we can't read and leaves behind story text that we can understand.
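With the example document above, the output of print(soup.text) looks roughly like this (exact whitespace may differ):

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.

...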

The reason learning requires a book is not the combination of all of its words, but the essence carried in some of them.

We can extract all of it, or of course extract only what we want; for example, the title of the book is what I want to extract.


In fact, soup.title is the Python way to get the title. In the HTML code you can see that there are two titles: the one above is the page title set in the HTML <head>, and the one below is a heading inside the body. The soup.title call crawls the page title set in the HTML.
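A minimal sketch of the difference, reusing the soup object from the first snippet (the output shown in the comments is what the example document gives):

# the page title set in the HTML <head>
print(soup.title)
# -> <title>The Dormouse's story</title>

# the heading inside the body, marked with class="title"
print(soup.find('p', class_='title'))
# -> <p class="title"><b>The Dormouse's story</b></p>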

We can also crawl other things: the URL at a particular location, an id, a name, and so on, still using the original HTML example.

Before I start, let me introduce one piece of code:

soup.find_all('a')

It finds all the <a> tags in the document.

After you've fully absorbed that, you can continue to learn. Recall the three links in the example document:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

The following code pulls the href out of each of them:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))

The code above finds all the <a> tag links in the document. You can also find <a> tags by their class or id. If you want to test this, it's best to first predict what will happen; perhaps there will be a big surprise.
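For example, a minimal sketch of filtering by class and by id (class_ and the id keyword argument are standard bs4 filters; the attribute values come from the example document):

# all <a> tags whose class is "sister"
for link in soup.find_all('a', class_='sister'):
    print(link)

# the single tag whose id is "link2"
print(soup.find(id='link2'))
# -> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>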

But what about the names, the text part of the fragment? Remember the first example of this little article, which crawled the fragment of the Alice in Wonderland novel.

I used an ingenious method to combine the preceding knowledge:

Since soup.find_all('a') can find all the <a> tag links, we'll start from those links and ask them for their text.

Hey! It complains: because what we have now is a list, it can't give us text directly.

Since it is a list (you can understand it as an array, at least that's how I understand it), I need to extract the body text from each element, one at a time.

The effect is good too; it achieves exactly what I wanted.
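A minimal sketch of that sequence, reusing the soup object from above:

links = soup.find_all('a')

# links is a ResultSet (essentially a list of tags), so links.text fails
# with an AttributeError instead of returning text

# extract the body text one element at a time
for link in links:
    print(link.text)
# -> Elsie
# -> Lacie
# -> Tillie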

The BeautifulSoup library has many more functions; this little article has only touched a small part of them. I'll expand on the rest when we need it later, and I still need time to study myself; learning by absorbing things in actual combat works very well.

Finally, one more thing: it's not that I didn't want to reuse yesterday's page at the beginning of this article; the effect was simply not good. As for why, I've attached the code so you can go and try it:

import requests
from bs4 import BeautifulSoup

res = requests.get("http://baike.sogou.com/v77860.htm?pid=baike.box")
soup = BeautifulSoup(res.text, "html.parser")
print(soup.text)
