Self-Taught Python, Part 6: Regular Expressions, Essential for Crawlers
To write a crawler, you need regular expressions. For simple string processing, split and substring operations are enough, but once complicated matching is involved it is the world of regular expressions, and regular expressions can look so annoying. What can we do? Memorize the common metacharacters and syntax, find an online matching test site, and test as you go (honestly, I am still a novice at regular expressions myself and keep looking the answers up as I work). But I believe familiarity will come with practice!
First, two recommended blog posts that introduce Python's re library and the regex module: "Python Regular Expression Guide (re)" and "Python regex module: a more powerful regular expression engine". I am not good at exhaustively summarizing library APIs and syntax, so this post will just walk through the common pieces with small examples from crawler development. Let's start with the six most commonly used re functions (a short sketch exercising all six follows the list):
re.compile(pattern, flags)  # returns a pattern object built from the regular expression string and optional flags
re.search(pattern, string)  # scans the whole string and returns the first substring matching the regular expression
re.match(pattern, string)   # checks whether the string matches the regular expression starting from its very first character
re.sub(pattern, replacement, string)  # replaces every substring matching the regular expression with replacement
re.split(pattern, string)   # splits the string on the regular expression and returns the pieces as a list
re.findall(pattern, string) # finds all substrings matching the regular expression and returns them as a list
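Here is a quick sketch exercising all six functions against a throwaway sample string (the string and pattern are made up for illustration):

import re

text = "hello 123 world 456"

# re.compile: build a reusable pattern object
pattern = re.compile(r'\d+')

# re.search: find the first match anywhere in the string
print(pattern.search(text).group())      # 123

# re.match: only matches at the very beginning, so this fails here
print(re.match(r'\d+', text))            # None

# re.sub: replace every match with something else
print(re.sub(r'\d+', '#', text))         # hello # world #

# re.split: split the string on each match
print(re.split(r'\d+', text))            # ['hello ', ' world ', '']

# re.findall: collect every match into a list
print(re.findall(r'\d+', text))          # ['123', '456']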
Matching in Python is greedy by default. Greedy means it tries to match as many characters as possible. For example, if the regular expression "ab*" is used against the string "abbbc", it matches "abbb". With the non-greedy form "ab*?", the result is just "a". We should also always pay attention to escape characters: .NET has @-prefixed verbatim strings, and Python has r-prefixed raw strings, which serve the same purpose.
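A minimal sketch of greedy versus non-greedy matching and of raw strings (the sample strings are made up for illustration):

import re

# greedy: * grabs as many characters as it can
print(re.search(r'ab*', 'abbbc').group())      # abbb

# non-greedy: *? grabs as few as possible, here zero b's
print(re.search(r'ab*?', 'abbbc').group())     # a

# a raw string keeps the backslash literal, so r'\d+' reaches re as \d+
print(re.search(r'\d+', 'file_2016').group())  # 2016

With that in mind, let's look at a first complete example using re.compile: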
>>> import re
>>>
>>> pattern = re.compile(r'hello', re.I)
>>> match = pattern.match('Hello World@!')
>>> if match:
...     print match.group()
...
Hello
>>>
In the code above, we created a pattern object and then matched against it. In the re.compile call, re.I is the optional flag: case-insensitive matching. You can look up the other flags yourself. In fact, we can skip re.compile entirely and write it like this:
>>> match = re.match('hello', 'hello world!')
>>> print match.group()
hello
This saves the re.compile(pattern, flags) line, but we also lose the pattern object; whether to use re.compile is largely a matter of taste.
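One case where re.compile does pay off is when the same pattern is applied over and over, for example to every file name in a directory or every page a crawler downloads. A minimal sketch (the file names below are hypothetical):

import re

# compile once, reuse for every string we test
date_pat = re.compile(r'\d{4}\.\d{2}\.\d{2}')

filenames = ['output_2016.01.18.txt', 'readme.txt', 'output_2016.02.03.txt']
for name in filenames:
    m = date_pat.search(name)
    if m:
        print(name + ' -> ' + m.group())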
In Python, you can further control what a regular expression match returns, for example:
>>> import re
>>> m = re.search("output_(\d{4})", "output_2016.txt")
>>> print m.group()
output_2016
>>> print m.group(1)
2016
>>> print m.groups()
('2016',)
We can see that our regular expression output_(\d{4}) contains a parenthesized part, (\d{4}). A part of a regular expression circled like this is called a group. We can query it with m.group(index): group(0) is the match of the entire regular expression, group(1) is the first group, and so on. Numeric indexes can be inconvenient to read, so we can also name the groups:
>>> import re
>>> m = re.search("(?P<year>\d{4})\.(?P<mon>\d{2})\.(?P<day>\d{2})", "output_2016.01.18.txt")
>>> m.groups()
('2016', '01', '18')
>>> m.groupdict()
{'year': '2016', 'mon': '01', 'day': '18'}
>>> m.group("year")
'2016'
>>> m.group("mon")
'01'
>>> m.group("day")
'18'
Let's see an example: there is a file named output_2016.01.18.txt. Read the date from the file name, work out which day of the week it falls on, and rename the file to output_yyyy-mm-dd-w.txt, where w is the day of the week.
import os, re, datetime

filename = "output_1981.10.21.txt"
get_time = re.search("(?P<year>\d{4})\.(?P<month>\d{2})\.(?P<day>\d{2})\.", filename)
year = get_time.group("year")
month = get_time.group("month")
day = get_time.group("day")
date = datetime.date(int(year), int(month), int(day))
wd = date.weekday() + 1  # weekday() returns 0 for Monday, so add 1
os.rename(filename, "output_" + year + "-" + month + "-" + day + "-" + str(wd) + ".txt")
Now let's look at regular expressions inside a crawler, with a small example: scraping the "cabbage price" (bargain) listings from the What's Worth Buying site.
(Screenshot: What's Worth Buying, cabbage-price bargain listings)
When crawling a web page, the most important step after fetching the page is extracting the information we need. Press F12 in the browser and look at the HTML we want to extract:
content = response.read().decode('utf-8')
pattern = re.compile('
Here .*? is the workhorse of the pattern: . matches any character, * means any number of them, and adding ? turns it into the minimal match, i.e. the non-greedy mode we discussed above. Put bluntly, it matches as short a string as possible.
(.*?), as mentioned above, additionally makes that part a group of the regular expression. The re.S flag puts the dot into "match anything" mode, so that . can also match line breaks and other such characters. In this way we pick out the name and price of each item; a rough sketch of this extraction step follows. (The full crawler source code will be in the next post.)
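As a rough sketch of how the extraction looks end to end (the HTML fragment, tag structure, and class names below are made up for illustration and are not the real What's Worth Buying markup):

import re

# a made-up fragment standing in for the downloaded page source
html = '''
<div class="item"><h2 class="title">Item A</h2>
<span class="price">9.9</span></div>
<div class="item"><h2 class="title">Item B</h2>
<span class="price">19.9</span></div>
'''

# one (.*?) group for the name, one for the price; re.S lets . cross line breaks
pattern = re.compile(r'<h2 class="title">(.*?)</h2>.*?<span class="price">(.*?)</span>', re.S)

for name, price in re.findall(pattern, html):
    print(name + ' : ' + price)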