Four main steps of a crawler
- Define clear goals (know which site or pages you want to scrape)
- Crawl (fetch all of the site's content)
- Extract (filter out the data that is of no use to us)
- Process the data (store and use it in the way we want)
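As a rough illustration, the four steps above can be sketched as a tiny pipeline. This is a minimal sketch, not the article's code: the function names and the title regex are illustrative only.

```python
from urllib.request import urlopen
import re

def crawl(url):
    # step 2: fetch the raw content of a page
    return urlopen(url).read().decode("utf-8")

def extract(html):
    # step 3: keep only the data we care about (here, just the page title)
    titles = re.findall(r"<title>(.*?)</title>", html, re.DOTALL)
    return titles[0].strip() if titles else ""

def process(title):
    # step 4: store or use the data the way we want (here, print it)
    print(title)

if __name__ == "__main__":
    # step 1: a clear goal -- the site we want to scrape
    process(extract(crawl("http://quotes.toscrape.com")))
```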
What is a regular expression
Regular expressions, also known as rule expressions, are commonly used to retrieve and replace text that matches a given pattern (rule).
A regular expression is a logical formula for operating on strings: a "rule string" built from predefined special characters and combinations of them, which expresses a filtering logic to apply to strings.
Given a regular expression and another string, we can achieve the following purposes:
- Test whether the given string conforms to the filtering logic of the regular expression ("match");
- Extract the specific parts we want from a text string ("filter").
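A minimal sketch of these two uses with Python's built-in re module (the date pattern is only an example):

```python
import re

# a pattern for dates like 2021-01-31
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# "match": does the whole string conform to the pattern?
print(bool(pattern.fullmatch("2021-01-31")))  # True
print(bool(pattern.fullmatch("hello")))       # False

# "filter": extract the matching parts out of a longer text
text = "released 2020-05-17, updated 2021-01-31"
print(pattern.findall(text))  # ['2020-05-17', '2021-01-31']
```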
Regular expression matching rules
Python's re module
In Python, we can use the built-in re module to work with regular expressions.
It is important to note that regular expressions use the backslash to escape special characters, which collides with Python's own string escaping. To pass the pattern through untouched, write it as a raw string by adding an r prefix, for example:
r'chuanzhiboke\t\.\tpython'
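To see the difference the r prefix makes, compare the plain and raw forms of the same literal (this is standard Python string behavior, shown here only for illustration):

```python
# without the r prefix, Python turns \t into a single tab character
plain = 'chuanzhiboke\t.\tpython'
# with the r prefix, the backslashes survive and reach the regex engine
raw = r'chuanzhiboke\t\.\tpython'

print(len('\t'), len(r'\t'))  # 1 2 -- a tab vs. a backslash plus 't'
print(raw)                    # chuanzhiboke\t\.\tpython
```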
Use regular expressions to scrape the quotes site, fetching only the 10 quotes on the first page
from urllib.request import urlopen
import re

def spider_quotes():
    url = "http://quotes.toscrape.com"
    response = urlopen(url)
    html = response.read().decode("utf-8")
    # get the 10 quotes
    quotes = re.findall('<span class="text" itemprop="text">(.*)</span>', html)
    list_quotes = []
    for quote in quotes:
        # strip() removes every character that appears in its argument
        # from both ends of the string
        list_quotes.append(quote.strip('“”'))
    # get the authors of the 10 quotes
    list_authors = []
    authors = re.findall('<small class="author" itemprop="author">(.*)</small>', html)
    for author in authors:
        list_authors.append(author)
    # get the tags of the 10 quotes
    tags = re.findall('<div class="tags">(.*?)</div>', html, re.RegexFlag.DOTALL)
    list_tags = []
    for tag in tags:
        temp_tags = re.findall('<a class="tag" href=".*">(.*)</a>', tag)
        tags_t1 = []
        for temp_tag in temp_tags:
            tags_t1.append(temp_tag)
        list_tags.append(",".join(tags_t1))
    # summarize the results
    results = []
    for i in range(len(list_quotes)):
        results.append("\t".join([list_quotes[i], list_authors[i], list_tags[i]]))
    for result in results:
        print(result)

# call the method
spider_quotes()
The BeautifulSoup4 parser
BeautifulSoup makes parsing HTML simple: its API is very user-friendly and it supports CSS selectors. It can use the HTML parser from Python's standard library, and also supports the lxml XML parser.
Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0
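A minimal offline sketch of the CSS-selector API (the HTML snippet mirrors the tag markup of quotes.toscrape.com; assumes the bs4 package is installed):

```python
from bs4 import BeautifulSoup

html = '''
<div class="tags">
    <a class="tag" href="/tag/love/">love</a>
    <a class="tag" href="/tag/life/">life</a>
</div>
'''

# parse with the standard-library parser
bs = BeautifulSoup(html, "html.parser")

# CSS selector: every <a class="tag"> inside a <div class="tags">
tags = [a.text for a in bs.select("div.tags a.tag")]
print(tags)  # ['love', 'life']
```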
Use BeautifulSoup4 to scrape the quotes site's first-page data
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com"
response = urlopen(url)
# initialize a BeautifulSoup instance from the response;
# "html.parser" (the standard-library parser) is the most common choice
bs = BeautifulSoup(response, "html.parser")
# get the 10 quotes
spans = bs.select("span.text")
list_quotes = []
for span in spans:
    span_text = span.text
    list_quotes.append(span_text.strip('“”'))
# get the authors of the 10 quotes
authors = bs.select("small")
list_authors = []
for author in authors:
    author_text = author.text
    list_authors.append(author_text)
# get the tags of the 10 quotes
divs = bs.select("div.tags")
list_tags = []
for div in divs:
    tag_text = div.select("a.tag")
    tag_list = [tag_a.text for tag_a in tag_text]
    list_tags.append(",".join(tag_list))
# summarize the results
results = []
for i in range(len(list_quotes)):
    results.append("\t".join([list_quotes[i], list_authors[i], list_tags[i]]))
for result in results:
    print(result)
Python Learning Path (5): Crawlers (4): scraping the quotes site with regular expressions