Crawler: An HTML Content Lookup Method Based on the BS4 Library

Source: Internet
Author: User

BS4 provides a find_all(name, attrs, recursive, string, **kwargs) method that returns a list holding the results of the lookup.

name: a search string for the tag name.

attrs: a search string for tag attribute values; a specific attribute can also be named as a keyword (for example id='link1').

recursive: whether to search all descendants; defaults to True. When False, only direct children are searched.

string: a search string for the text in the string area between <>...</>.
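The examples in this article all operate on a soup object that is never constructed in the text. As a minimal sketch, the snippet below builds one and inspects the list that find_all returns. DEMO_HTML is an assumption: a shortened stand-in for the demo page, reconstructed from the outputs quoted in this article, and html.parser is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the demo page (not the original file).
DEMO_HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(DEMO_HTML, 'html.parser')

# find_all returns a list-like ResultSet, so it can be measured, indexed and iterated.
links = soup.find_all('a')
print(len(links))          # number of <a> tags found
print(links[0]['href'])    # attribute access on the first match
for link in links:
    print(link.name, link.get('id'), link.string)
```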

To illustrate:

name

    soup.find_all('a')            # returns all <a> tags
    soup.find_all(['a', 'b'])     # returns all <a> and <b> tags

    for tag in soup.find_all(True):   # print every tag name in the document
        print(tag.name)
    # prints: html head title body p b p a a

    # With a regular expression:
    import re
    # To match tag names by pattern instead of exactly, pass a compiled regular
    # expression; re is the corresponding library. The pattern is applied with
    # search(), so 'b' matches any name that contains b (use '^b' to match only
    # names that start with b).
    for tag in soup.find_all(re.compile('b')):
        print(tag.name)
    # prints: body b
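The difference between a pattern that contains b and one that starts with b does not show up on the demo page, because body and b both begin with that letter. A small sketch on a hypothetical fragment (the table markup is an assumption, added only so that one tag name contains b without starting with it) makes the distinction visible:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: "table" contains the letter b but does not start with it.
html = '<body><b>bold</b><table><tr><td>cell</td></tr></table></body>'
soup = BeautifulSoup(html, 'html.parser')

all_names = [tag.name for tag in soup.find_all(True)]             # every tag
listed = [tag.name for tag in soup.find_all(['b', 'td'])]         # any name in the list
contains_b = [tag.name for tag in soup.find_all(re.compile('b'))]
starts_with_b = [tag.name for tag in soup.find_all(re.compile('^b'))]

print(all_names)
print(listed)
print(contains_b)      # includes table
print(starts_with_b)   # excludes table
```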

attrs

    soup.find_all('p', 'course')   # find <p> tags whose class attribute is 'course'
    soup.find_all(id='link1')
    # returns [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]

    soup.find_all(id='link')       # returns []: no tag has an id that is exactly 'link'

    import re
    soup.find_all(id=re.compile('link'))   # use a regular expression to find tags whose id contains 'link'
    # returns [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>,
    #          <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
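Attribute filters can be written in several equivalent-looking ways. The sketch below, run on an assumed fragment containing only the two links from the demo page, shows the class_ keyword, an attrs dictionary, and the exact-versus-regex contrast for id:

```python
import re
from bs4 import BeautifulSoup

# Assumed fragment holding just the two demo links.
html = (
    '<p class="course">'
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    '<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>'
    '</p>'
)
soup = BeautifulSoup(html, 'html.parser')

by_class = soup.find_all('a', class_='py1')           # class_ avoids the reserved word class
by_dict = soup.find_all('a', attrs={'id': 'link2'})   # attrs as a dictionary
exact = soup.find_all(id='link')                      # exact match on the whole value
by_regex = soup.find_all(id=re.compile('link'))       # value only has to contain 'link'

print([a['id'] for a in by_class])
print([a['id'] for a in by_dict])
print(len(exact))
print([a['id'] for a in by_regex])
```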

recursive

    soup.find_all('a', recursive=False)   # returns []: the direct children of the soup node include no <a> tag
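Whether recursive=False finds anything depends on which node the search starts from. A minimal sketch, assuming a reduced version of the demo page in which the links sit directly inside a <p>:

```python
from bs4 import BeautifulSoup

# Assumed reduced page: html > body > p > a.
html = (
    '<html><body>'
    '<p class="course"><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
    '</body></html>'
)
soup = BeautifulSoup(html, 'html.parser')

from_root = soup.find_all('a', recursive=False)       # soup's only child is <html>
from_body = soup.body.find_all('a', recursive=False)  # body's only child is <p>
from_p = soup.p.find_all('a', recursive=False)        # both <a> tags are children of <p>
all_descendants = soup.find_all('a')                  # default: search every descendant

print(len(from_root), len(from_body), len(from_p), len(all_descendants))
```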

string

    soup.find_all(string='Basic Python')   # returns ['Basic Python']

    import re
    soup.find_all(string=re.compile('python'))   # retrieves every string that contains 'python'
    # returns ['This is a python demo page', 'The demo python introduces several python courses.']
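String matching is case-sensitive, and it returns text nodes rather than tags unless a tag name is also given. The following sketch, on an assumed fragment with the title and the two links of the demo page, illustrates both points:

```python
import re
from bs4 import BeautifulSoup

# Assumed fragment: the title plus the two demo links.
html = (
    '<title>This is a python demo page</title>'
    '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
)
soup = BeautifulSoup(html, 'html.parser')

exact = soup.find_all(string='Basic Python')                          # whole string must match
lowercase = soup.find_all(string=re.compile('python'))                # case-sensitive search
any_case = soup.find_all(string=re.compile('python', re.IGNORECASE))  # ignore case
tags = soup.find_all('a', string=re.compile('Advanced'))              # name + string returns tags

print(exact)
print(lowercase)
print(any_case)
print([t['id'] for t in tags])
```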

In addition, two shorthand forms are available:

<tag>(..) is equivalent to <tag>.find_all(..)

soup(..) is equivalent to soup.find_all(..)
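A short sketch of the call shorthand, on an assumed two-link fragment, showing that calling the soup or a tag gives the same results as find_all with the same arguments:

```python
from bs4 import BeautifulSoup

# Assumed fragment with the two demo links.
html = '<body><p><a id="link1">Basic Python</a><a id="link2">Advanced Python</a></p></body>'
soup = BeautifulSoup(html, 'html.parser')

short_form = soup('a')               # same as soup.find_all('a')
long_form = soup.find_all('a')
tag_short = soup.p('a', id='link2')  # same as soup.p.find_all('a', id='link2')

print([a['id'] for a in short_form])
print([a['id'] for a in long_form])
print([a['id'] for a in tag_short])
```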

Extension methods of find_all

Method                            Description
<>.find()                         Searches and returns only the first result (a single element, or None if nothing matches); same parameters as find_all()
<>.find_parents()                 Searches the ancestor nodes and returns a list; takes the same filter arguments as find_all()
<>.find_parent()                  Returns one result from the ancestor nodes; takes the same filter arguments as find()
<>.find_next_siblings()           Searches the subsequent sibling nodes and returns a list; takes the same filter arguments as find_all()
<>.find_next_sibling()            Returns one result from the subsequent sibling nodes; takes the same filter arguments as find()
<>.find_previous_siblings()       Searches the preceding sibling nodes and returns a list; takes the same filter arguments as find_all()
<>.find_previous_sibling()        Returns one result from the preceding sibling nodes; takes the same filter arguments as find()
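The methods in the table can be tried out with a minimal sketch like the one below. The HTML is an assumed reduction of the demo page; the point is only to show which methods return a single element and which return a list.

```python
from bs4 import BeautifulSoup

# Assumed reduced page: two sibling links inside one paragraph.
html = (
    '<html><body><p class="course">'
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>'
    '</p></body></html>'
)
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')            # single element: the first <a>
missing = soup.find('table')      # None when nothing matches
second = soup.find(id='link2')

parent = first.find_parent('p')                                      # nearest matching ancestor
ancestors = [t.name for t in first.find_parents(['p', 'body'])]      # list, nearest first
next_one = first.find_next_sibling('a')                              # single element
next_all = first.find_next_siblings('a')                             # list
prev_one = second.find_previous_sibling('a')                         # single element
prev_all = second.find_previous_siblings('a')                        # list

print(first['id'], missing, parent['class'], ancestors)
print(next_one['id'], [a['id'] for a in next_all])
print(prev_one['id'], [a['id'] for a in prev_all])
```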
