Using regular expressions to search with the Python crawler package Beautiful Soup
When using Beautiful Soup, you can pass name and attrs to find() or findAll() to locate HTML tags with a specific tag name and specific attribute values.
Sometimes, however, the tag name or attribute value can take many forms in the content being processed. When the values follow a pattern rather than one fixed string, they cannot be written as a fixed value.
In that case, a regular expression solves the problem.
For example, suppose the HTML contains an h1 tag with class="h1user" inside a container such as:
<div class="icon_col">
The corresponding BeautifulSoup code is as follows:
h1userSoup = soup.find(name="h1", attrs={"class": "h1user"})
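The lookup above can be tried end to end with a small sketch. It assumes the bs4 package (Beautiful Soup 4) and its built-in html.parser. The sample markup and the text "Alice" are made up for illustration, since the original HTML is not shown in full.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one h1 with a fixed class inside the container div.
html = """
<div class="icon_col">
    <h1 class="h1user">Alice</h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
h1_user_soup = soup.find(name="h1", attrs={"class": "h1user"})

if h1_user_soup is not None:
    print(h1_user_soup.get_text())
```

Checking for None before using the result avoids an AttributeError when the page does not contain the expected tag.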
If the HTML looks like this instead:
<div class="icon_col">
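To find every matching h1 tag in a single call, a fixed value is not enough: when "h1user" is compared against the whole attribute string, only the tag with exactly class="h1user" is found. The other two,
class="h1user test1"
and
class="h1user test2"
are not found. (This describes versions that compare class as one string, such as BeautifulSoup 3. In bs4, class is treated as a multi-valued attribute, so the fixed string "h1user" also matches these two tags.)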
This is where a powerful BeautifulSoup feature comes in: attrs accepts a compiled regular expression as an attribute value.
You can write it as follows:
import re
h1userSoupList = soup.findAll(name="h1", attrs={"class": re.compile(r"h1user(\s\w+)?")})
This finds all three:
class="h1user"
class="h1user test1"
class="h1user test2"
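The regular-expression search can be sketched as a complete script. It assumes bs4 with html.parser and uses find_all, the bs4 spelling of findAll. The h1 contents and the extra class="other" tag are hypothetical, added only to show that non-matching tags are skipped.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: three h1 tags whose class starts with "h1user",
# plus one unrelated h1 that should not be returned.
html = """
<div class="icon_col">
    <h1 class="h1user">one</h1>
    <h1 class="h1user test1">two</h1>
    <h1 class="h1user test2">three</h1>
    <h1 class="other">four</h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "h1user" optionally followed by whitespace and one more word.
pattern = re.compile(r"h1user(\s\w+)?")

# A compiled pattern is accepted wherever a fixed attribute value is.
h1_user_soup_list = soup.find_all(name="h1", attrs={"class": pattern})

for tag in h1_user_soup_list:
    print(tag.get("class"), tag.get_text())
```

Compiling the pattern once and reusing it keeps the call readable when the same rule is applied to several pages.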
A related case is a tag such as
<div aria-label="xxx">
where the value xxx is unknown (it varies).
At first it may not be obvious how to find such a div tag.
If you write:
soup.findAll("div", attrs={"aria-label": "xxx"})
the value xxx must be spelled out. Without a known attribute value, a fixed string in attrs cannot be used, so the tag cannot be found this way.
So, for HTML such as:
<div aria-label="5 stars, 747 scores" class="rating" role="img" tabindex="-1">
<div><span class="rating-star"></span><span class="rating-star"></span><span class="rating-star"></span><span class="rating-star"></span><span class="rating-star"></span></div>
<span class="rating-count">747 scores</span>
</div>
You can use:
soup.findAll("div", attrs={"aria-label": True})
This finds every div tag that has an aria-label attribute, whatever its value.
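A runnable sketch of this attribute-presence search follows. It assumes bs4 with html.parser and uses find_all, the bs4 spelling of findAll. The markup is a shortened version of the rating snippet above.

```python
from bs4 import BeautifulSoup

# Shortened version of the rating markup; only the outer div has aria-label.
html = """
<div aria-label="5 stars, 747 scores" class="rating" role="img" tabindex="-1">
    <div>
        <span class="rating-star"></span>
        <span class="rating-star"></span>
    </div>
    <span class="rating-count">747 scores</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# True means "the attribute must be present"; its value is not checked.
labelled_divs = soup.find_all("div", attrs={"aria-label": True})

for div in labelled_divs:
    # The unknown value can be read back from the matched tag.
    print(div["aria-label"])
```

The inner div has no aria-label attribute, so only the outer div is returned.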
In summary, to use BeautifulSoup to find tags whose attribute name is known but whose attribute value is not, as in
<div aria-label="xxx">
pass True as the attribute value:
soup.findAll("div", attrs={"aria-label": True})
This finds the div tags that contain the aria-label attribute.