Python Learning Path (V) Crawlers (IV): Scraping a quotes website with regular expressions

Source: Internet
Author: User
Tags: XML parser

Four main steps of a web crawler
    1. Define the goal (know which site you are going to crawl or search)
    2. Crawl (fetch all the content of the site)
    3. Extract (discard the data that is useless to us)
    4. Process the data (store and use it in the way we want)
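The four steps can be sketched in a few lines of Python. The inline HTML snippet below is illustrative: it stands in for a page downloaded from the quotes site used later in this article, so the sketch runs without network access.

```python
import re

# Step 1: define the goal -- here, the quote texts on a page
# Step 2: crawl -- normally urlopen(url).read().decode("utf-8");
#         an inline snippet stands in for the downloaded page
html = """
<span class="text" itemprop="text">“Quote one.”</span>
<span class="text" itemprop="text">“Quote two.”</span>
"""

# Step 3: extract -- keep only the data useful to us
quotes = re.findall('<span class="text" itemprop="text">(.*)</span>', html)

# Step 4: process -- store/use the data in the way we want
cleaned = [quote.strip("“”") for quote in quotes]
print(cleaned)
```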
What is a regular expression

Regular expressions (also known as "rule expressions", or regexes) are often used to retrieve and replace text that matches a pattern (rule).

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses filtering logic to apply to other strings.

Given a regular expression and a string, we can achieve the following:

  • Determine whether the given string conforms to the filtering logic of the regular expression ("matching");
  • Extract the specific parts we want from the string ("filtering").
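Both uses can be shown with a small pattern (the sample text and pattern here are my own, for illustration):

```python
import re

text = "python 3.12 was released in 2023"

# "matching": does the whole string conform to the pattern's logic?
ok = re.match(r"[a-z]+ [\d.]+ was released in \d{4}$", text) is not None
print(ok)

# "filtering": extract just the parts we want from the text
years = re.findall(r"\d{4}", text)
print(years)
```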

Regular expression matching rules
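A few of the most common matching rules can be demonstrated directly with re.findall (the sample string below is my own, for illustration):

```python
import re

s = "abc 123 a1b2 end"

print(re.findall(r"\d", s))      # \d matches a single digit
print(re.findall(r"\d+", s))     # + means one or more of the preceding
print(re.findall(r"[a-z]+", s))  # [a-z] is a character class (any lowercase letter)
print(re.findall(r"a.b", s))     # . matches any character except newline
```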

Python's RE module

In Python, we can use the built-in re module to use regular expressions.

It is important to note that regular expressions make heavy use of backslash escapes, which Python's normal string literals also interpret. To keep the pattern exactly as written, add an r prefix to make it a raw string, for example:

r'chuanzhiboke\t\.\tpython'
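The difference the r prefix makes can be seen by comparing string lengths, and by matching a raw pattern against a string containing real tab characters:

```python
import re

print(len("\t"))    # 1: a normal string turns \t into a single tab character
print(len(r"\t"))   # 2: a raw string keeps the backslash and 't' as-is

# the raw pattern below expects: text, tab, a literal dot, tab, text
pattern = r"chuanzhiboke\t\.\tpython"
target = "chuanzhiboke\t.\tpython"   # contains real tab characters
print(re.match(pattern, target) is not None)
```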

Using regular expressions to scrape the quotes site, fetching only the 10 quotes on the first page

from urllib.request import urlopen
import re

def spider_quotes():
    url = "http://quotes.toscrape.com"
    response = urlopen(url)
    html = response.read().decode("utf-8")

    # get the 10 quotes
    quotes = re.findall('<span class="text" itemprop="text">(.*)</span>', html)
    list_quotes = []
    for quote in quotes:
        # strip removes, from both ends, any character that appears in
        # its argument, so this drops the surrounding curly quote marks
        list_quotes.append(quote.strip("“”"))

    # get the authors of the 10 quotes
    list_authors = []
    authors = re.findall('<small class="author" itemprop="author">(.*)</small>', html)
    for author in authors:
        list_authors.append(author)

    # get the tags of the 10 quotes
    tags = re.findall('<div class="tags">(.*?)</div>', html, re.RegexFlag.DOTALL)
    list_tags = []
    for tag in tags:
        temp_tags = re.findall('<a class="tag" href=".*">(.*)</a>', tag)
        tags_t1 = []
        for temp_tag in temp_tags:
            tags_t1.append(temp_tag)
        list_tags.append(",".join(tags_t1))

    # summarize the results
    results = []
    for i in range(len(list_quotes)):
        results.append("\t".join([list_quotes[i], list_authors[i], list_tags[i]]))
    for result in results:
        print(result)

# call the method
spider_quotes()
BeautifulSoup4 parser

BeautifulSoup makes parsing HTML simple, and its API is very user-friendly, with support for CSS selectors. It can use the HTML parser from the Python standard library, and it also supports the lxml XML parser.

Official Document: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

Using BeautifulSoup4 to get the quotes site's home page data

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com"
response = urlopen(url)

# initialize a BeautifulSoup instance;
# "html.parser" (the standard-library parser) is the default
# and most commonly used parsing method
bs = BeautifulSoup(response, "html.parser")

# get the 10 quotes
spans = bs.select("span.text")
list_quotes = []
for span in spans:
    span_text = span.text
    list_quotes.append(span_text.strip("“”"))

# get the authors of the 10 quotes
authors = bs.select("small")
list_authors = []
for author in authors:
    author_text = author.text
    list_authors.append(author_text)

# get the tags of the 10 quotes
divs = bs.select("div.tags")
list_tags = []
for div in divs:
    tag_text = div.select("a.tag")
    tag_list = [tag_a.text for tag_a in tag_text]
    list_tags.append(",".join(tag_list))

# summarize the results
results = []
for i in range(len(list_quotes)):
    results.append("\t".join([list_quotes[i], list_authors[i], list_tags[i]]))
for result in results:
    print(result)

