Web Crawler: Scraping Book Information from allitebooks.com and Prices from amazon.com (1): Beautiful Soup Basics
First, let's start with Beautiful Soup, a Python library that parses data from HTML and XML. I plan to cover Beautiful Soup in three blog posts. The first covers the basics of Beautiful Soup. The second builds a simple crawler with that knowledge: it captures book information and ISBN codes from allitebooks.com, and then uses each ISBN to look up the book's price on amazon.com.
1. Introduction to Beautiful Soup
Web data mining (web scraping) is the process of extracting data from websites. It lets us collect large amounts of valuable data from the web. Beautiful Soup is a Python library for pulling data out of HTML or XML files. You can use it for many things; for example, you can repeatedly parse the latest price of a product to track price fluctuations.
2. Installing Beautiful Soup (Mac)
Install Beautiful Soup
sudo pip3 install beautifulsoup4
Check whether the installation is successful
from bs4 import BeautifulSoup
3. Create a Beautiful Soup object
html_atag = """
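The listing above is cut off. As a minimal sketch of this step, assuming Python's built-in html.parser and a made-up one-link snippet (html_doc is a hypothetical name, not the original html_atag markup), a Beautiful Soup object is created by passing the markup and a parser name to the constructor:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not the original html_atag content.
html_doc = '<html><body><a href="https://example.com/">Example</a></body></html>'

# "html.parser" ships with Python; "lxml" also works if it is installed.
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.a["href"])       # https://example.com/
print(soup.a.string)        # Example
```

The examples in the following sections use the "lxml" parser, which has to be installed separately; the built-in parser is used here only to keep the sketch dependency-free.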
4. Search for content
find() method
Pass a tag name such as ul to find() to get the first matching ul node, for example:
#input
html_markup = """<div><ul id="students"><li class="student"><div class="name">Carl</div><div class="age">32</div></li><li class="student"><div class="name">Lucy</div><div class="age">25</div></li></ul></div>"""
# Build the soup object from the markup before searching it
soup = BeautifulSoup(html_markup, "lxml")
student_entries = soup.find("ul")
print(student_entries)
#output
<ul id="students"><li class="student"><div class="name">Carl</div><div class="age">32</div></li><li class="student"><div class="name">Lucy</div><div class="age">25</div></li></ul>
Looking at the HTML, the ul node contains two li nodes, and each li contains two div nodes. student_entries.li returns the first li node, and student_entries.li.div returns the first div under that li, for example:
#input
print(student_entries.li)
#output
<li class="student"><div class="name">Carl</div><div class="age">32</div></li>
#input
print(student_entries.li.div)
#output
<div class="name">Carl</div>
You can then use the .string attribute to get the text inside the div:
#input
print(student_entries.li.div.string)
#output
Carl
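One detail worth knowing about .string: it only has a value when a tag contains a single string (or a single child tag that does). For a tag with several children it is None, and get_text() can be used instead. A small sketch, assuming the built-in html.parser and one student entry copied from the markup above:

```python
from bs4 import BeautifulSoup

markup = '<li class="student"><div class="name">Carl</div><div class="age">32</div></li>'
li = BeautifulSoup(markup, "html.parser").li

# The first div holds exactly one string, so .string returns it.
print(li.div.string)     # Carl

# The li holds two child tags, so .string is ambiguous and returns None.
print(li.string)         # None

# get_text() joins every string under the tag, here with a space as separator.
print(li.get_text(" "))  # Carl 32
```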
Searching with regular expressions
The find() method also accepts a compiled regular expression, for example:
#input
import re

email_id_example = """<div>The below HTML has the information that has email ids.</div>abc@example.com<div>xyz@example.com</div><span>foo@example.com</span>"""
soup = BeautifulSoup(email_id_example, "lxml")
# Raw string so the backslashes reach the regex engine unchanged
emailid_regexp = re.compile(r"\w+@\w+\.\w+")
# string= is the current name of the older text= argument (Beautiful Soup 4.4.0+)
first_email_id = soup.find(string=emailid_regexp)
print(first_email_id)
#output
abc@example.com
find_all() method
The find() method returns only the first match. The find_all() method returns a list of all matches. For example, if the regular-expression search for email addresses above is changed from find() to find_all(), every matching string is returned:
#input
all_email_id = soup.find_all(string=emailid_regexp)
print(all_email_id)
#output
['abc@example.com', 'xyz@example.com', 'foo@example.com']
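find_all() is not limited to regular expressions. The sketch below, assuming the built-in html.parser and the student markup from earlier in this section, filters by tag name and CSS class and uses the limit argument to cap the number of results:

```python
from bs4 import BeautifulSoup

html_markup = (
    '<ul id="students">'
    '<li class="student"><div class="name">Carl</div><div class="age">32</div></li>'
    '<li class="student"><div class="name">Lucy</div><div class="age">25</div></li>'
    '</ul>'
)
soup = BeautifulSoup(html_markup, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class attribute.
name_divs = soup.find_all("div", class_="name")
names = [str(div.string) for div in name_divs]
print(names)  # ['Carl', 'Lucy']

# limit stops the search after the given number of matches.
first_only = soup.find_all("li", limit=1)
print(len(first_only))  # 1
```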
find_parent() method
The find_parent() method searches upward through the tree. For example, calling find_parent() on the first li node returns its parent node:
#input
# first_student is the first li node under the ul found earlier
first_student = student_entries.li
print(first_student)
#output
<li class="student"><div class="name">Carl</div><div class="age">32</div></li>
#input
all_students = first_student.find_parent('ul')
print(all_students)
#output
<ul id="students"><li class="student"><div class="name">Carl</div><div class="age">32</div></li><li class="student"><div class="name">Lucy</div><div class="age">25</div></li></ul>
find_next_sibling() method
Siblings are nodes at the same level that share a parent. The find_next_sibling() method returns the next sibling node, for example:
#input
second_student = first_student.find_next_sibling()
print(second_student)
#output
<li class="student"><div class="name">Lucy</div><div class="age">25</div></li>
There are many other methods, such as find_next(), find_all_next(), find_previous_sibling(), and find_all_previous(). Their usage is similar and is not repeated here; please refer to the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree
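As a brief illustration of three of these methods, here is a sketch that assumes the built-in html.parser and the same student markup as above:

```python
from bs4 import BeautifulSoup

html_markup = (
    '<ul id="students">'
    '<li class="student"><div class="name">Carl</div><div class="age">32</div></li>'
    '<li class="student"><div class="name">Lucy</div><div class="age">25</div></li>'
    '</ul>'
)
soup = BeautifulSoup(html_markup, "html.parser")
second_student = soup.find_all("li")[1]

# find_previous_sibling() looks backward among nodes with the same parent.
first_student = second_student.find_previous_sibling("li")
print(first_student.div.string)  # Carl

# find_next() walks forward in document order, including the tag's own children.
age = first_student.find_next("div", class_="age")
print(age.string)  # 32

# find_all_next() collects every later match in document order.
later_names = [str(d.string) for d in first_student.find_all_next("div", class_="name")]
print(later_names)  # ['Carl', 'Lucy']
```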
5. Browsing content
Browse child nodes
Use a child's tag name as an attribute to get that child node. For example:
#input
print(first_student)
#output
<li class="student"><div class="name">Carl</div><div class="age">32</div></li>
#input
name = first_student.div
print(name)
#output
<div class="name">Carl</div>
Browse the parent node
You can use the .parent attribute to get the parent node, for example:
#input
print(name.parent)
#output
<li class="student"><div class="name">Carl</div><div class="age">32</div></li>
Browse sibling nodes
Sibling nodes are nodes at the same level. The next_sibling and previous_sibling attributes return the next and previous sibling nodes respectively. For example:
#input
print(first_student.next_sibling)
#output
<li class="student"><div class="name">Lucy</div><div class="age">25</div></li>
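One caveat: when the markup contains line breaks between tags, the whitespace is itself a node, so next_sibling may return a string such as "\n" rather than the next tag. The sketch below, assuming the built-in html.parser and a small made-up list, shows the difference between the attribute and the find_next_sibling() method:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with newlines between the tags.
markup = """<ul>
<li>Carl</li>
<li>Lucy</li>
</ul>"""
soup = BeautifulSoup(markup, "html.parser")
first = soup.li

# The attribute returns the very next node, which here is the newline text.
print(repr(str(first.next_sibling)))  # '\n'

# The method skips text nodes and returns the next matching tag.
print(first.find_next_sibling("li"))  # <li>Lucy</li>
```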
For a complete list of methods related to browsing, see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree
6. Modifying content
Modify the tag name
You can read a node's tag name from the .name attribute, and assigning a new value to .name renames the tag. For example:
#input
first_student
#output
<li class="student"><div class="name">Carl</div><div class="age">32</div></li>
#input
first_student.name
#output
'li'
#input
first_student.name = 'div'
first_student.name
#output
'div'
#input
first_student
#output
<div class="student"><div class="name">Carl</div><div class="age">32</div></div>
Modify tag attributes
#input
first_student['class'] = 'student_new'
print(first_student)
#output
<div class="student_new"><div class="name">Carl</div><div class="age">32</div></div>
Note: If the class attribute does not exist, this assignment does not raise an error; it adds the attribute instead.
Delete a tag attribute
The del statement can be used to delete an attribute of a node. For example:
#input
del first_student['class']
print(first_student)
#output
<div><div class="name">Carl</div><div class="age">32</div></div>
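The attribute operations above can be tried together on a fresh tag. This is a sketch that assumes the built-in html.parser; the id value "s1" and the attribute name "data-x" are made up for illustration:

```python
from bs4 import BeautifulSoup

tag = BeautifulSoup('<li class="student">Carl</li>', "html.parser").li

# Assigning to an attribute that does not exist yet simply adds it.
tag["id"] = "s1"
print(tag)  # <li class="student" id="s1">Carl</li>

# get() returns None for a missing attribute instead of raising KeyError.
print(tag.get("data-x"))  # None

# class can hold several values, so Beautiful Soup parses it into a list.
print(tag["class"])  # ['student']

# del removes the attribute again.
del tag["id"]
print(tag)  # <li class="student">Carl</li>
```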
Modify tag content
You can use the .string attribute to read a tag's content ('Carl'). Similarly, you can change the content by assigning a new value to it. For example:
#input
print(first_student.div.string)
#output
Carl
#input
first_student.div.string = 'carl_new'
print(first_student.div.string)
#output
carl_new
Delete a node directly
You can use the decompose() method to delete a node directly:
#input
print(first_student)
#output
<div><div class="name">carl_new</div><div class="age">32</div></div>
#input
first_student.div.decompose()
print(first_student)
#output
<div><div class="age">32</div></div>
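As a small sketch of node removal on a fresh snippet (assuming the built-in html.parser), the following compares decompose() with the related extract() method, which detaches a node and hands it back instead of destroying it:

```python
from bs4 import BeautifulSoup

markup = '<li class="student"><div class="name">Carl</div><div class="age">32</div></li>'
li = BeautifulSoup(markup, "html.parser").li

# extract() removes the node from the tree and returns it for later use.
removed = li.div.extract()
print(removed)  # <div class="name">Carl</div>
print(li)       # <li class="student"><div class="age">32</div></li>

# decompose() removes and destroys the node; it returns None.
result = li.div.decompose()
print(result)   # None
print(li)       # <li class="student"></li>
```

extract() is the better fit when the removed node will be reused, for example to move it elsewhere in the tree.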
The extract() method can also be used to delete a node. However, unlike decompose(), extract() returns the deleted node.
Next, we will use the Beautiful Soup basics from this article to build a simple crawler that collects book information and prices from the two websites, combines them, and writes the result to a CSV file. If you are interested, please follow this blog and leave a comment to discuss.