Using regular expressions to search with the Python crawler package Beautiful Soup

Source: Internet
Author: User

When using Beautiful Soup, you can pass name and attrs to find() or findAll() to locate the HTML tags that have a specific tag name and specific attribute values.

However, the name or attribute value you are looking for sometimes has many possible forms. When the values follow a pattern rather than being one fixed string, they cannot be written as a fixed value.

In that case, you can use regular expressions.
For example, suppose the HTML contains an h1 tag with class="h1user" inside a div that begins like this:

<div class="icon_col">    

The corresponding BeautifulSoup code is as follows:

h1userSoup = soup.find(name="h1", attrs={"class": "h1user"})
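
Because the HTML above is cut off, here is a minimal runnable sketch of the same fixed-value lookup. The markup and the text "example user" are hypothetical stand-ins, and the code assumes the bs4 package (Beautiful Soup 4) with Python's built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the truncated HTML in the article.
html = """
<div class="icon_col">
    <h1 class="h1user">example user</h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
h1_user = soup.find(name="h1", attrs={"class": "h1user"})

if h1_user is not None:
    print(h1_user.get_text())
```

Checking for None before using the result avoids an AttributeError when the page does not contain the expected tag.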

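Now suppose the HTML instead contains three h1 tags inside the div, with class="h1user", class="h1user test1" and class="h1user test2":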

<div class="icon_col">    

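If you want to find all of the matching h1 tags at once, a search for the fixed value class="h1user" finds only the single tag whose class is exactly "h1user" when the class value is compared as one whole string. (Beautiful Soup 4 treats class as a multi-valued attribute and also matches individual class names, so there the fixed value already matches all three tags.) The remaining two,

class="h1user test1"

and

class="h1user test2"

are not found.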

For this case, Beautiful Soup offers a powerful feature: attrs accepts compiled regular expressions as attribute values.

You can write it as follows:

import re
h1userSoupList = soup.findAll(name="h1", attrs={"class": re.compile(r"h1user(\s\w+)?")})

This finds all three:

class="h1user"
class="h1user test1"
class="h1user test2"
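
The regular-expression search can be tried end to end with the sketch below. The tag texts and the extra class="other" tag are hypothetical and are included only to show that non-matching tags are skipped. The code assumes Beautiful Soup 4 (the bs4 package) and uses find_all, the current spelling of findAll.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: three h1 tags that should match and one that should not.
html = """
<div class="icon_col">
    <h1 class="h1user">first</h1>
    <h1 class="h1user test1">second</h1>
    <h1 class="h1user test2">third</h1>
    <h1 class="other">unrelated</h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "h1user", optionally followed by whitespace and one more word.
pattern = re.compile(r"h1user(\s\w+)?")

matches = soup.find_all(name="h1", attrs={"class": pattern})

# bs4 stores class as a list of names, so join it back into one string.
class_values = [" ".join(tag["class"]) for tag in matches]
print(class_values)  # ['h1user', 'h1user test1', 'h1user test2']
```

Beautiful Soup applies the pattern with a search rather than a full match, so a stricter pattern (for example one anchored with ^ and $) may be preferable when class names share a common prefix.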

A related case is a tag such as
<div aria-label="xxx">
where the content of xxx is unknown (variable).

At first it is not obvious how to find such a div tag.
If you write:

soup.findAll("div", attrs={"aria-label": "xxx"})

then xxx has to be spelled out. With a fixed string in attrs, you cannot match a tag whose attribute value you do not know in advance.
For example, given HTML like this:

<div aria-label="5 stars, 747 scores" class="rating" role="img" tabindex="-1">
    <div>
        <span class="rating-star"></span>
        <span class="rating-star"></span>
        <span class="rating-star"></span>
        <span class="rating-star"></span>
        <span class="rating-star"></span>
    </div>
    <span class="rating-count">747 scores</span>
</div>

You can use:

soup.findAll("div", attrs={"aria-label": True})

to find every div tag that has an aria-label attribute, whatever its value.
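
As a rough, runnable illustration of matching on the presence of an attribute, the sketch below uses a shortened version of the rating markup plus a hypothetical second div without the attribute. It assumes Beautiful Soup 4 and spells the attribute aria-label, as in the HTML. The last query, which combines the attribute with a regular expression, is an extra illustration rather than something taken from the article.

```python
import re

from bs4 import BeautifulSoup

# Shortened rating markup plus a hypothetical div that lacks aria-label.
html = """
<div aria-label="5 stars, 747 scores" class="rating" role="img">
    <span class="rating-count">747 scores</span>
</div>
<div class="no-label">ignored</div>
"""

soup = BeautifulSoup(html, "html.parser")

# True matches any tag that has the attribute, regardless of its value.
labelled = soup.find_all("div", attrs={"aria-label": True})
labels = [tag["aria-label"] for tag in labelled]
print(labels)  # ['5 stars, 747 scores']

# A regular expression narrows the match to values of a given shape.
starred = soup.find_all("div", attrs={"aria-label": re.compile(r"\d+ stars")})
print(len(starred))  # 1
```

Passing True is useful when only the attribute name is known. The regular-expression form helps when the value is variable but still follows a recognizable pattern.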

To summarize how to use BeautifulSoup to find tags with a known attribute name but an unknown attribute value:

For a tag such as

<div aria-label="xxx">

use:

soup.findAll("div", attrs={"aria-label": True})

This finds the div tags that contain the aria-label attribute.
