Web Crawler: crawls book information from allitebooks.com and captures the price from amazon.com (1): Basic knowledge of Beautiful Soup

Source: Internet
Author: User

First, let's start with Beautiful Soup (Beautiful Soup is a Python library that parses data out of HTML and XML files). I plan to cover Beautiful Soup in three blog posts. The first is the basic knowledge of Beautiful Soup; the following posts use that knowledge to build a simple crawler that captures book information and ISBN codes from allitebooks.com, and then uses each ISBN code to capture the corresponding price of the book on amazon.com.
1. Introduction to Beautiful Soup
Web data mining refers to the process of extracting data from websites. Data mining techniques allow us to collect a large amount of valuable data from across the web. Beautiful Soup is a Python library that can obtain data from HTML or XML files. You can use it to do many things; for example, you can repeatedly parse the latest price of a product to track price fluctuations.
2. Beautiful Soup installation (Mac)
Install Beautiful Soup

sudo pip3 install beautifulsoup4
Check whether the installation is successful by importing it in Python (no error means it worked):
from bs4 import BeautifulSoup
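To confirm which release is installed, you can also print the library's version string (a quick sketch; `bs4.__version__` is the package's standard version attribute):

```python
import bs4  # the package installed by beautifulsoup4

# Importing without an error confirms the installation;
# the version attribute shows which release you have.
print(bs4.__version__)
```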

 

3. Create a Beautiful Soup object
html_atag = """

 
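A minimal sketch of creating a Beautiful Soup object from an HTML string (the fragment, variable name `html_atag`, and URL below are illustrative):

```python
from bs4 import BeautifulSoup

# A small HTML fragment to parse (content is illustrative)
html_atag = """<html><body>
<p>Test html a tag example</p>
<a href="http://www.example.com">Home</a>
</body></html>"""

# "html.parser" is Python's built-in parser; "lxml" also works if installed
soup = BeautifulSoup(html_atag, "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(soup.a)      # <a href="http://www.example.com">Home</a>
```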

4. Search for content
find() method: Pass a node name to the find() method, such as "ul", to obtain the content of the first matched node of that name, for example:
#input
html_markup = """<div><ul id="students"><li class="student"><div class="name">Carl</div><div class="age">32</div></li><li class="student"><div class="name">Lucy</div><div class="age">25</div></li></ul></div>"""
soup = BeautifulSoup(html_markup, "lxml")
student_entries = soup.find("ul")
print(student_entries)
#output
<ul id="students"><li class="student"><div class="name">Carl</div><div class="age">32</div></li><li class="student"><div class="name">Lucy</div><div class="age">25</div></li></ul>

After finding the ul node, we can see from the HTML that there are two li nodes under the ul and two div nodes under each li. Then student_entries.li can be used to obtain the first li node, and student_entries.li.div can be used to obtain the first div under that li, for example:

#input
print(student_entries.li)
#output
<li class="student"><div class="name">Carl</div><div class="age">32</div></li>

#input
print(student_entries.li.div)
#output
<div class="name">Carl</div>
You can go one step further and use div.string to obtain the div's text content:
#input
print(student_entries.li.div.string)
#output
'Carl'

 

Searching with regular expressions: The find() method supports searching content based on regular expressions, for example:
#input
import re
email_id_example = """<div>The below HTML has the information that has email ids.</div>abc@example.com<div>xyz@example.com</div><span>foo@example.com</span>"""
soup = BeautifulSoup(email_id_example, "lxml")
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
first_email_id = soup.find(text=emailid_regexp)
print(first_email_id)
#output
abc@example.com

 

find_all() method: The find() method returns the first matched content, while the find_all() method returns a list of all matched content. For example, if the regular-expression email search above is changed to use find_all(), all matches are returned:
#input
all_email_id = soup.find_all(text=emailid_regexp)
print(all_email_id)
#output
['abc@example.com', 'xyz@example.com', 'foo@example.com']

 

find_parent() method: The find_parent() method searches upward, toward the root of the tree. For example, calling find_parent('ul') on the first li node obtains the content of its parent node:
#input
first_student = student_entries.li
print(first_student)
#output
<li class="student"><div class="name">Carl</div><div class="age">32</div></li>

#input
all_students = first_student.find_parent('ul')
print(all_students)
#output
<ul id="students"><li class="student"><div class="name">Carl</div><div class="age">32</div></li><li class="student"><div class="name">Lucy</div><div class="age">25</div></li></ul>

 

find_next_sibling() method: Siblings are nodes that share the same parent. The find_next_sibling() method gets the next sibling node, for example:
#input
second_student = first_student.find_next_sibling()
print(second_student)
#output
<li class="student"><div class="name">Lucy</div><div class="age">25</div></li>

 

There are many other methods, such as the find_next(), find_all_next(), find_previous_sibling(), and find_all_previous() methods. Their usage is similar, so they are not repeated here; please refer to the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree
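As a quick sketch of two of these methods, using the same students markup as above (a minimal illustration, not from the original post):

```python
from bs4 import BeautifulSoup

html_markup = """<div><ul id="students"><li class="student"><div class="name">Carl</div><div class="age">32</div></li><li class="student"><div class="name">Lucy</div><div class="age">25</div></li></ul></div>"""
soup = BeautifulSoup(html_markup, "html.parser")

second_student = soup.find_all("li")[1]

# find_previous_sibling() walks backward among siblings of the same parent
print(second_student.find_previous_sibling())
# -> the first <li> (Carl)

# find_all_next() collects every element that appears after this one
# in document order, here filtered to div tags
print([div.string for div in second_student.find_all_next("div")])
# -> ['Lucy', '25']
```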
5. Browsing content
Browsing child nodes: Use a child node's tag name to obtain that child's content. For example:
#input
print(first_student)
#output
<li class="student"><div class="name">Carl</div><div class="age">32</div></li>

#input
name = first_student.div
print(name)
#output
<div class="name">Carl</div>

 

Browsing the parent node: You can use the .parent attribute to browse to the parent node, for example:
#input
print(name.parent)
#output
<li class="student"><div class="name">Carl</div><div class="age">32</div></li>

 

Browsing sibling nodes: Siblings are nodes at the same level. The .next_sibling and .previous_sibling attributes obtain the next and previous sibling nodes respectively. For example:
#input
print(first_student.next_sibling)
#output
<li class="student"><div class="name">Lucy</div><div class="age">25</div></li>
For a complete list of methods related to browsing, see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree
6. Modifying content
Modifying the tag name: You can use the .name attribute to obtain a node's tag name, and assigning a new name to .name easily changes the tag name. For example:
#input
first_student
#output
<li class="student"><div class="name">Carl</div><div class="age">32</div></li>

#input
first_student.name
#output
'li'

#input
first_student.name = 'div'
first_student.name
#output
'div'

#input
first_student
#output
<div class="student"><div class="name">Carl</div><div class="age">32</div></div>

 

Modifying tag attributes:
#input
first_student['class'] = 'student_new'
print(first_student)
#output
<div class="student_new"><div class="name">Carl</div><div class="age">32</div></div>
Note: If the class attribute does not exist, this operation does not report an error; it simply adds the attribute as a new one.
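A small sketch of that behavior (the markup and attribute name below are illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="name">Carl</div>', "html.parser")
tag = soup.div

# Assigning to an attribute that does not exist simply creates it
tag['data-id'] = '42'
print(tag)
# -> <div class="name" data-id="42">Carl</div>
```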
Deleting a tag attribute: The del statement can be used to delete an attribute of a node. For example:
#input
del first_student['class']
print(first_student)
#output
<div><div class="name">Carl</div><div class="age">32</div></div>

 

Modifying tag content: You can use the .string attribute to obtain a tag's text content ('Carl' here). Similarly, you can change the content by assigning a new value to this attribute. For example:
#input
print(first_student.div.string)
#output
Carl

#input
first_student.div.string = 'carl_new'
print(first_student.div.string)
#output
carl_new

 

Deleting a node directly: You can use the decompose() method to delete a node outright:
#input
print(first_student)
#output
<li class="student"><div class="name">carl_new</div><div class="age">32</div></li>

#input
first_student.div.decompose()
print(first_student)
#output
<li class="student"><div class="age">32</div></li>
The extract() method can also be used to delete a node. Unlike the decompose() method, however, extract() returns the removed node.

Next, we will use the basic Beautiful Soup knowledge from this article to complete a simple crawler: the books and prices from the two websites are obtained, combined, and output to a CSV file. If you are interested, please follow this blog and leave a message for discussion.
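A minimal sketch of the difference between the two methods (illustrative markup):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<li class="student"><div class="name">Carl</div><div class="age">32</div></li>',
    "html.parser",
)
li = soup.li

# extract() removes the node AND returns it, so it can be reused
removed = li.div.extract()
print(removed)
# -> <div class="name">Carl</div>
print(li)
# -> <li class="student"><div class="age">32</div></li>

# decompose() would remove the node and destroy it, returning None
```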
