Python Natural Language Processing: Fetching Data from the Network



Preface

This section covers extracting data from the network with Python 2.7 and the BeautifulSoup library, in short, crawler (web scraping) technology. Network programming is a complex subject; wherever background knowledge is needed, the text gives links to good tutorials you can consult, rather than reinventing the wheel here. The goal of this section is to help you quickly master basic crawler technology and form a workable path for building basic datasets for your own experiments. Once you have mastered crawling, you can fetch data from the network to meet specific analysis needs; the techniques studied here are suitable for data mining, natural language processing, and other fields that need to collect data from outside sources.



1. What is a web crawler?

A web crawler (also called an ant or automatic indexer) is, simply put, a program that keeps requesting network resources and then organizes them into its own data catalog. For a fuller account, see the Wikipedia article on web spiders.

In my own words: web crawler = {a set of initial URLs, a set of rules for filtering data, a technique for traversing the network}.

The pseudocode for a simple web crawler is as follows (adapted from StackOverflow):

an unvisited URL list, a visited URL list, and a set of rules that decide which resources interest you
while the unvisited URL list is not empty:
    take a URL from the unvisited list
    record anything on that page that interests you
    if the page is HTML:
        parse out the links on the page
        for each link:
            if it matches your rules and is not in the visited list or the unvisited list:
                add it to the unvisited list
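To make the pseudocode concrete, here is a minimal sketch in Python 2.7 using urllib2 and BeautifulSoup (the libraries used throughout this section); start_urls and is_interesting are hypothetical placeholders standing in for your own initial URLs and filtering rules.

import urllib2
from bs4 import BeautifulSoup

start_urls = ['http://example.com/']     # the set of initial URLs
unvisited = list(start_urls)             # URLs not yet visited
visited = set()                          # URLs already visited


def is_interesting(url):
    # hypothetical rule: only stay on the same site
    return url.startswith('http://example.com/')


while unvisited:
    url = unvisited.pop()
    visited.add(url)
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError:
        continue
    soup = BeautifulSoup(html)
    # record whatever interests you here, e.g. soup.title
    for a in soup.find_all('a', href=True):          # parse out the page links
        link = a['href']
        if is_interesting(link) and link not in visited and link not in unvisited:
            unvisited.append(link)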
Higher-level crawler techniques are required by search engines and other demanding applications; they are not covered in this section.

When crawling, pay attention to the following points (adapted from: A Few Scraping Rules):

    • Respect the website's copyright notice. The data belongs to the site and the site manages it; review the site's copyright terms before crawling it.
    • Don't be too aggressive. A machine fetches pages far faster than a person; do not hammer other people's servers (a minimal politeness sketch follows this list).
    • Crawler results depend on the site structure. As the web page changes and the site's data is maintained, code that once worked may have to be changed.
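As the promised politeness sketch (not part of the original article), the standard robotparser module can check a site's robots.txt and time.sleep can space out requests; the example.com URLs and the 2-second delay are only illustrative choices.

import time
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

urls = ['http://example.com/page1.html', 'http://example.com/page2.html']
for url in urls:
    if not rp.can_fetch('*', url):   # respect the site's robots.txt rules
        continue
    # ... fetch and parse the page here ...
    time.sleep(2)                    # pause between requests so the server is not hammered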


2. What are the main Python crawler libraries?

The main crawler libraries in the Python language include:

    • Scrapy
    • BeautifulSoup + urllib2
    • Mechanize
    • Twill
    • Harvestman
    • Ruya

Refer to: StackOverflow for more library support.

We are mainly concerned with reaching our goal conveniently; this section mainly uses BeautifulSoup for the examples that follow.

BeautifulSoup's Chinese document address: BeautifulSoup Chinese.

3. Key Points of Crawler Technology

3.1 Find What We Want

How do we find the content of interest in the HTML or XML? There are two main aspects.

The first is how to get the corresponding node in the web page. The details differ slightly between libraries, but they basically include the following approaches.

Iterating over the DOM object and filtering by tag, e.g. selecting the tag a to get all the links. The DOM organizes the entire HTML document as a tree; a simple DOM tree looks like the following:


To learn how to use DOM objects, refer to: HTML DOM.

Using a CSS selector to filter objects, e.g. p > span selects span elements that are direct children of a paragraph. You can refer to the W3School CSS selector reference.

Using regular expressions to filter the contents of a tag and get the text we are interested in; for how to use regular expressions, refer to the Python regular expression guide. A combined sketch of these three approaches follows.
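Here is a short combined sketch of the three approaches, assuming BeautifulSoup and the standard re module; the HTML snippet is invented purely for illustration.

import re
from bs4 import BeautifulSoup

html = '<p>Price: <span>42 USD</span></p><a href="/more.html">more</a>'
soup = BeautifulSoup(html)

links = soup.find_all('a')                  # 1. tag filtering: all <a> links
spans = soup.select('p > span')             # 2. CSS selector: direct child span of p
price = re.search(r'\d+', spans[0].get_text()).group()   # 3. regex on the tag text
print links[0]['href'], spans[0].get_text(), price        # /more.html 42 USD 42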

3.2 Some Crawler Tricks

Identity Camouflage

Some sites do not allow us to crawl them directly, so we need to disguise the crawler as a browser by adding browser and operating system information to the request headers. For example, crawling the CSDN website directly is prone to 403 errors.
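A minimal sketch of this disguise with urllib2 might look as follows; the User-Agent string is just one example browser identity (the full CSDN example later in this section does the same thing).

import urllib2

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0'}
req = urllib2.Request('http://blog.csdn.net/', None, headers)
response = urllib2.urlopen(req)   # without the header this request may return HTTP 403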

Data Decompression

Note that when using this disguise you need to check the Content-Encoding of the returned response headers; if a gzip response is not decompressed, you will usually get an error that the data cannot be read, or: UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte.

Data Encoding

Python 2.7's internal default encoding is ASCII, and the str type holds encoded bytes. When we read a page we need to parse the page header to identify its encoding (a good library will do this for us automatically). Note the conversion between str and unicode:

u = u'中文'                  # explicitly a unicode object
str = u.encode('gb2312')     # encode the unicode object with gb2312
str1 = u.encode('gbk')       # encode the unicode object with gbk
str2 = u.encode('utf-8')     # encode the unicode object with utf-8
u1 = str.decode('gb2312')    # decode the gb2312 string str back to unicode
For more information, refer to the cnblogs article on Python's encode and decode functions.

Here is an application of the three points above: code that uses urllib2 to crawl the CSDN official blog's category list:

# coding:utf-8
"""Get the CSDN official blog category list"""
import urllib2
import gzip
import StringIO
import re
from urllib2 import URLError, HTTPError


def read_data(resp):
    # read the response content; decompress it if it is gzip-encoded
    if resp.info().get('content-encoding') == 'gzip':
        buf = StringIO.StringIO(resp.read())
        gzip_f = gzip.GzipFile(fileobj=buf)
        return gzip_f.read()
    else:
        return resp.read()


URL = 'http://blog.csdn.net/blogdevteam/'
HEADERS = {
    'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:34.0) Gecko/20100101 Firefox/34.0",
    'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    'Accept-Language': "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
    'Accept-Encoding': "gzip, deflate"
}

req = urllib2.Request(URL, None, HEADERS)  # add headers to disguise the crawler as a browser

try:
    response = urllib2.urlopen(req)
    page_content = read_data(response)
    encoding = response.headers['content-type'].split('charset=')[-1]
    # regular expression that matches the category links
    p = re.compile(ur'(<a.*?href=)(.*?category.*?>)(.*?)(</a>)')
    m = p.findall(page_content.decode(encoding))
    if m:
        for x in m:
            print x[2].encode(encoding)
except URLError, e:
    if hasattr(e, 'code'):
        print 'Error code:', e.code, ', unable to complete the request.'
    elif hasattr(e, 'reason'):
        print 'Request failed:', e.reason, ', unable to connect to the server.'
else:
    print 'Request completed.'


The output is:

Home
Bulletin Panel
Expert Interviews
Usage Tips
Complaints & Suggestions
FAQ
Netizens' Voice
Blog Activities
Recommended Download Resources
2011 China Mobile Developer Conference
CSDN Official Events
2011 SD2.0 Conference
2012 SDCC China Software Developers Conference
2012 Mobile Developers Conference
SDCC Conference
2013 Cloud Computing Conference
Microsoft MVP
Community Weekly
Blog Post Recommendation Summary
Forum Post Recommendation Summary


4. Two crawler examples

In the example sections below we give two experiments that fetch data using the BeautifulSoup library.

The key techniques for getting web page tags with BeautifulSoup are as follows.

One is to select tags with find and find_all; the prototype of find is:

find(name, attrs, recursive, text, **kwargs)

You can specify the tag name, attributes, whether to search recursively, and keyword arguments for attribute values.

The other is the CSS selector: select returns a list of tags, e.g. soup.select('p') gets all the paragraphs.

BeautifulSoup's Chinese documentation gives a very detailed description of the API and can be consulted. A short usage sketch of both approaches follows.
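A small usage sketch of find/find_all and select, on an invented HTML snippet, might look like this:

from bs4 import BeautifulSoup

html = '<div id="rank"><p class="hot">A</p><p>B</p><a href="/x">x</a></div>'
soup = BeautifulSoup(html)

first_p = soup.find('p')                      # first <p> tag
hot_ps = soup.find_all('p', class_='hot')     # keyword argument filters on attributes
div_links = soup.select('div#rank > a')       # CSS selector returns a list of tags
print first_p.string, hot_ps[0].string, div_links[0]['href']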

The key points for getting web page data are:

    • Analyze the structure of the web page to find the simplest marker for the part we need, such as a CSS selector or a tag marker
    • Filter using regular expressions or tag attributes
    • For crawl efficiency, multithreading can be used
    • Serialize the crawled resources properly, e.g. write them to files or a database (see the sketch after this list)
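As a minimal sketch of the last point, assuming the crawl produced a list of (category, song) pairs, the results could be serialized to a UTF-8 text file like this; the file name and the data are illustrative only.

import codecs

songs = [(u'Hot list', u'Song A'), (u'Hot list', u'Song B')]   # hypothetical crawl results
with codecs.open('songs.txt', 'w', encoding='utf-8') as f:
    for category, title in songs:
        f.write(category + u'\t' + title + u'\n')              # one tab-separated record per line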
4.1 Extract the Kugou (Cool Dog) Music Website's Homepage Leaderboard Data

The first step in crawling web page data is to analyze the structure of the page. Using a browser's developer tools (Firefox, Google Chrome, etc.), right-click the page element you want and choose [Inspect Element] to get the corresponding selector and code.

Looking at the leaderboard area of the Kugou Music homepage, we want to crawl the three leaderboard divs shown below:

Recommended Song leaderboard:



TOP10 Leaderboard:


Global Hot List:



What these three regions have in common: first a big div; under that div a p that contains the small categories of the leaderboard; and below that, divs whose ul elements hold the song details. You can analyze the page to see the specifics.

Here is our implementation code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
*********************************************************
Get the Kugou music homepage leaderboard lists.
Depends on the div selectors of the HTML page; if the page
changes, the program may fail.
by wangdq 2015-01-05 (http://blog.csdn.net/wangdingqiaoit)
*********************************************************
"""
from bs4 import BeautifulSoup
from urllib2 import urlopen, Request, URLError, HTTPError
import time


def make_soup(url):
    """Open the given URL and return a BeautifulSoup object"""
    try:
        req = Request(url)
        response = urlopen(req)
        html = response.read()
    except URLError, e:
        if hasattr(e, 'code'):
            print 'Error code:', e.code, ', unable to complete the request.'
        elif hasattr(e, 'reason'):
            print 'Request failed:', e.reason, ', unable to connect to the server.'
    else:
        return BeautifulSoup(html)


def get_music(b_soup, sel):
    """Get the song lists under the given leaderboard div selector"""
    main_div = b_soup.select(sel)[0]
    # get the category list
    sum_category = main_div.select('p > strong > a[title]')[0].string
    titles = [sum_category + ' ' + a.string
              for a in main_div.select('p > span > a[title]')]
    index = 0
    song_dict = {}
    # parse the underlying song lists one by one and add category: song list entries
    for div in main_div.find_all('div', recursive=False):  # do not search recursively here
        part = div.find_all('span', class_='text')
        if part:
            song_dict[titles[index]] = part
            index += 1
    return song_dict


base_url = 'http://www.kugou.com/'
# div selectors for the homepage leaderboards; if the page changes, update these
div_list = [
    'div#single0',                          # recommended songs div
    'div.clear_fix.hot_top_10',             # hot list TOP10 div
    'div.clear_fix.hot_global.hot_top_10'   # global hot list div
]


def main():
    soup = make_soup(base_url)
    if soup is None:
        print 'Sorry, unable to complete the extraction task, exiting...'
        exit()
    print 'Acquisition time: ' + time.strftime("%Y-%m-%d %H:%M:%S")
    for k in div_list:
        # parse category: song list entries from each leaderboard div
        for category, items in get_music(soup, k).iteritems():
            print '*' * 20 + category + '*' * 30
            count = 1
            for song in items:
                print count, song.string
                count += 1
    print '*' * 60
    print 'Finished getting the song lists'


if __name__ == "__main__":
    main()

The Kugou homepage leaderboard results are as follows (abridged):

Acquisition time: 2015-01-06 22:29:11
******************** Recommended Singles Live ******************************
1 Han Hong - Dawn (Live)
2 A-Lin - Give Me a Reason to Forget (Live)
... (omitted)
******************** Recommended Singles Mandarin ******************************
1 Leehom Wang - Is Now
2 Fish Leong - Waiting for You ("Only Because of Being Single" theme song)
... (omitted)
******************** Recommended Singles Japanese-Korean ******************************
1 San E - Coach Me
2 Hello Venus - Wiggle Wiggle
... (omitted)
******************** Recommended Singles European ******************************
1 Glee Cast - Problem
2 Justin Bieber, Lil Twist - Intertwine
... (omitted)
******************** Hot List TOP10 Latest ******************************
1 EXO - Machine (Live)
2 Chang - Love Is Not Afraid
... (omitted)
******************** Hot List TOP10 Hottest ******************************
1 Chopsticks Brothers - Little Apple
2 Deng Ziqi - Like You
... (omitted)
Finished getting the song lists


4.2 Extract Free photo Gallery site Image Resources

Next we learn to crawl image resources from the network; other kinds of resources work the same way. This example crawls the free design material site sccnn.com.

We use multithreading where needed; simple use of multithreading is not very complex, see: Python multithreading tutorial. A minimal threading sketch follows.
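Here is the minimal threading sketch referred to above: it splits a list of (invented) URLs across a few worker threads, the same idea the download class below uses; fetch_all is a hypothetical placeholder for the real download work.

import threading


def fetch_all(urls):
    for url in urls:
        pass  # download url here


urls = ['http://example.com/%d.jpg' % i for i in range(10)]
threads = []
for i in xrange(0, len(urls), 5):                  # 5 URLs per thread
    t = threading.Thread(target=fetch_all, args=(urls[i:i + 5],))
    t.start()
    threads.append(t)
for t in threads:
    t.join()                                       # wait for all threads to finish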

We first analyze the structure of this site, as follows:


On this site the search keyword appears as part of the page URL, i.e. url = http://so.sccnn.com/search/ + keyword + / + page number + .html, where the trailing number is the current page number; the figure shows a total of 145 result pages. We need three steps to complete the crawl:

First, get the result counts for the search keyword, including the total number of images and the number of pages;

Second, build the URL from the keyword and page number, and parse the list of pictures each page contains. These pages are very simple: the images are marked mainly with img tags;

Third, once all the image URLs are resolved, start multiple threads to download them and report progress to the user.

The main code for parsing the result counts is as follows:

result_string = unicode(result.get_text()).encode('utf-8')  # e.g. "Total: 2307, in 145 pages, 16 per page"
p = re.compile(r'\d+')
count_list = [int(x) for x in p.findall(result_string)]     # the counts, e.g. [2307, 145, 16]

The threads use Python's threading library; the download progress shown to the user uses Progress 1.2, a very simple library. A short usage sketch follows.
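For reference, a tiny sketch of the Bar class from the progress package (installed with pip install progress); each call to next() advances the bar by one step.

from progress.bar import Bar

bar = Bar('Downloading', max=20)
for i in range(20):
    # ... do one unit of work here ...
    bar.next()
bar.finish()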

The specific implementation code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
*********************************************************
Multi-threaded download of free pictures from sccnn.com
Website URL: http://so.sccnn.com
by wangdq 2015-01-05 (http://blog.csdn.net/wangdingqiaoit)
Usage: follow the prompts to enter a folder to save the
pictures, e.g. /home, and a search keyword, e.g. "pet".
*********************************************************
"""
from bs4 import BeautifulSoup
from progress.bar import Bar
import urllib2
import os
import re
import threading

base_url = "http://so.sccnn.com"


def make_soup(url):
    """Open the given URL and return a BeautifulSoup object"""
    try:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        html = response.read()
    except urllib2.URLError, e:
        if hasattr(e, 'code'):
            print 'Error code %d, unable to complete the request' % e.code
        elif hasattr(e, 'reason'):
            print 'Request failed: %s, unable to connect to the server' % e.reason
    else:
        return BeautifulSoup(html)


def join_url(keyword, count):
    """Build the search URL for a keyword and page number"""
    unicode_url = unicode(base_url + '/search/' + keyword + '/' + str(count) + '.html', 'utf-8')
    return unicode_url.encode('gb2312')


def load_url(keyword, folder):
    """Search the keyword and return the list of img elements"""
    local_url = join_url(keyword, 1)
    soup = make_soup(local_url)
    img_list = []
    if soup is None:
        return img_list
    result = soup.find('td', style=True)
    if result is None:
        print 'No pictures found for "%s", please try another keyword' % keyword
        return img_list
    result_string = unicode(result.get_text()).encode('utf-8')  # e.g. "Total: 2307, in 145 pages, 16 per page"
    p = re.compile(r'\d+')
    count_list = [int(x) for x in p.findall(result_string)]     # e.g. [2307, 145, 16]
    print 'Found %d pictures for "%s", %d pages in total' % (count_list[0], keyword, count_list[1])
    url_bar = Bar('Parsing picture addresses', max=count_list[1])
    for x in range(count_list[1]):
        page_soup = make_soup(join_url(keyword, x + 1))
        images = page_soup.find_all('img', alt=True)
        img_list.extend([img for img in images
                         if img.has_attr('src') and img.has_attr('alt')])
        url_bar.next()
    return img_list


class DownImage(threading.Thread):
    """Picture download thread"""
    def __init__(self, img_list, folder, bar):
        threading.Thread.__init__(self)
        self.img_list = img_list
        self.folder = folder
        self.bar = bar

    def run(self):
        for img in self.img_list:
            photo_url = img['src']
            try:
                u = urllib2.urlopen(photo_url)
                # use the picture description as the file name
                name = img['alt'] + '.' + photo_url.split('.')[-1]
                with open(os.path.join(self.folder, name), "wb") as local_file:
                    local_file.write(u.read())
                self.bar.next()
                u.close()
            except (KeyError, urllib2.HTTPError):
                print 'Error downloading picture %s' % img['alt']
            except KeyboardInterrupt:
                raise


def main():
    """Start fetching data with new threads"""
    max_thread = 5  # download with at most 5 threads at the same time
    folder = raw_input('Please enter the folder to save pictures: ')
    if not os.path.exists(folder):
        print 'Folder "%s" does not exist' % folder
        exit()
    image_list = []
    while not image_list:
        key = raw_input('Please enter a search keyword: ')
        image_list = load_url(key, folder)
    try:
        down_count = len(image_list)
        bar = Bar('Downloading pictures', max=down_count)
        threads = []
        unit_size = max(1, len(image_list) / max_thread)  # workload for each thread
        for i in xrange(0, len(image_list), unit_size):
            thread = DownImage(image_list[i:i + unit_size], folder, bar)
            thread.start()
            threads.append(thread)
        for thread in threads:
            thread.join()
    except threading.ThreadError:
        print 'Sorry, unable to start the threads'
    except KeyboardInterrupt:
        print 'The user has cancelled the download'
    print '\nDownload complete!'


if __name__ == "__main__":
    print __doc__
    main()

The program run looks like this:



The downloaded images look like this:




Summary

Using the BeautifulSoup library, we have become familiar with general crawler technology. With the two examples in this section, I believe you can extend this to general crawling situations, but specific problems require specific analysis; rules, efficiency, and similar concerns must be considered carefully.
