Four main steps of a crawler
- Define clear goals (know which site or pages you want to scrape)
- Crawl (fetch all of the site's content)
- Extract (filter out the data that is of no use to us)
- Process the data (store and use it in the way we want)
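As a rough illustration, the four steps above can be sketched as a tiny pipeline. This is a minimal sketch, not the article's code: the function names and the title regex are illustrative only.

```python
from urllib.request import urlopen
import re

def crawl(url):
    # step 2: fetch the raw content of a page
    return urlopen(url).read().decode("utf-8")

def extract(html):
    # step 3: keep only the data we care about (here, just the page title)
    titles = re.findall(r"<title>(.*?)</title>", html, re.DOTALL)
    return titles[0].strip() if titles else ""

def process(title):
    # step 4: store or use the data the way we want (here, print it)
    print(title)

if __name__ == "__main__":
    # step 1: a clear goal -- the site we want to scrape
    process(extract(crawl("http://quotes.toscrape.com")))
```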
What is a regular expression
Regular expressions, also known as rule expressions, are commonly used to retrieve and replace text that matches a given pattern (rule).
A regular expression is a logical formula for operating on strings: a "rule string" built from predefined special characters and combinations of them, which expresses a filtering logic to apply to strings.
Given a regular expression and another string, we can achieve the following purposes:
- Test whether the given string conforms to the filtering logic of the regular expression ("match");
- Extract the specific parts we want from a text string ("filter").
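A minimal sketch of these two uses with Python's built-in re module (the date pattern is only an example):

```python
import re

# a pattern for dates like 2021-01-31
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# "match": does the whole string conform to the pattern?
print(bool(pattern.fullmatch("2021-01-31")))  # True
print(bool(pattern.fullmatch("hello")))       # False

# "filter": extract the matching parts out of a longer text
text = "released 2020-05-17, updated 2021-01-31"
print(pattern.findall(text))  # ['2020-05-17', '2021-01-31']
```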
Regular expression matching rules
Python's re module
In Python, we can use the built-in re module to work with regular expressions.
It is important to note that regular expressions use the backslash to escape special characters, which collides with Python's own string escaping. To pass the pattern through untouched, write it as a raw string by adding an r prefix, for example:
r'chuanzhiboke\t\.\tpython'
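To see the difference the r prefix makes, compare the plain and raw forms of the same literal (this is standard Python string behavior, shown here only for illustration):

```python
# without the r prefix, Python turns \t into a single tab character
plain = 'chuanzhiboke\t.\tpython'
# with the r prefix, the backslashes survive and reach the regex engine
raw = r'chuanzhiboke\t\.\tpython'

print(len('\t'), len(r'\t'))  # 1 2 -- a tab vs. a backslash plus 't'
print(raw)                    # chuanzhiboke\t\.\tpython
```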
Use regular expressions to scrape the quotes site, fetching only the 10 quotes on the first page
from urllib.request import urlopen
import re

def spider_quotes():
    url = "http://quotes.toscrape.com"
    response = urlopen(url)
    html = response.read().decode("utf-8")
    # get the 10 quotes
    quotes = re.findall('<span class="text" itemprop="text">(.*)</span>', html)
    list_quotes = []
    for quote in quotes:
        # strip() removes every character that appears in its argument
        # from both ends of the string
        list_quotes.append(quote.strip('“”'))
    # get the authors of the 10 quotes
    list_authors = []
    authors = re.findall('<small class="author" itemprop="author">(.*)</small>', html)
    for author in authors:
        list_authors.append(author)
    # get the tags of the 10 quotes
    tags = re.findall('<div class="tags">(.*?)</div>', html, re.RegexFlag.DOTALL)
    list_tags = []
    for tag in tags:
        temp_tags = re.findall('<a class="tag" href=".*">(.*)</a>', tag)
        tags_t1 = []
        for temp_tag in temp_tags:
            tags_t1.append(temp_tag)
        list_tags.append(",".join(tags_t1))
    # summarize the results
    results = []
    for i in range(len(list_quotes)):
        results.append("\t".join([list_quotes[i], list_authors[i], list_tags[i]]))
    for result in results:
        print(result)

# call the method
spider_quotes()
The BeautifulSoup4 parser
BeautifulSoup makes parsing HTML simple: its API is very user-friendly and it supports CSS selectors. It can use the HTML parser from Python's standard library, and also supports the lxml XML parser.
Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
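A minimal offline sketch of the CSS-selector API (the HTML snippet mirrors the tag markup of quotes.toscrape.com; assumes the bs4 package is installed):

```python
from bs4 import BeautifulSoup

html = '''
<div class="tags">
    <a class="tag" href="/tag/love/">love</a>
    <a class="tag" href="/tag/life/">life</a>
</div>
'''

# parse with the standard-library parser
bs = BeautifulSoup(html, "html.parser")

# CSS selector: every <a class="tag"> inside a <div class="tags">
tags = [a.text for a in bs.select("div.tags a.tag")]
print(tags)  # ['love', 'life']
```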
Use BeautifulSoup4 to scrape the quotes site's first-page data
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com"
response = urlopen(url)
# initialize a BeautifulSoup instance from the response;
# "html.parser" (the standard-library parser) is the most common choice
bs = BeautifulSoup(response, "html.parser")
# get the 10 quotes
spans = bs.select("span.text")
list_quotes = []
for span in spans:
    span_text = span.text
    list_quotes.append(span_text.strip('“”'))
# get the authors of the 10 quotes
authors = bs.select("small")
list_authors = []
for author in authors:
    author_text = author.text
    list_authors.append(author_text)
# get the tags of the 10 quotes
divs = bs.select("div.tags")
list_tags = []
for div in divs:
    tag_text = div.select("a.tag")
    tag_list = [tag_a.text for tag_a in tag_text]
    list_tags.append(",".join(tag_list))
# summarize the results
results = []
for i in range(len(list_quotes)):
    results.append("\t".join([list_quotes[i], list_authors[i], list_tags[i]]))
for result in results:
    print(result)
Python Learning Path (5): Crawlers (4): scraping the quotes site with regular expressions