Web Crawling: A First Attempt


In the era of big data, data is a valuable asset. For machine translation, the first step is to collect a large number of original Chinese-English sentence pairs, so where should we go to get them? The simplest, most direct, and most readily available approach is to crawl them from the web. Since I haven't done anything like this before, I'm going to let the powerful Python help me.

The site I chose to crawl is http://news.iyuba.com/, a very nice bilingual news website.

Looking at the home page, we find that it does not directly contain the Chinese-English material we need. But clicking one of the major categories in the navigation bar leads to a category page, and clicking a specific item there brings up an article with both the Chinese and the English translation. So the crawling idea is as follows:

Idea: first get the URL of each category page from the home page, then crawl all the article URLs from each category page, and finally go into each article to extract the Chinese and English text.
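To make the plan concrete, here is a bare skeleton of the three steps as methods of a class; the method names match the ones used in the real code later in this post, and the bodies are placeholders only:

# skeleton of the crawl plan; the real implementations follow later in the post
class Climbing():
    def getCategoryUrl(self):
        # step 1: parse the home page navigation bar and collect one URL per category
        pass

    def getDetailUrl(self, category, url):
        # step 2: parse one category page and collect the URL of every article on it
        pass

    def getTranslateContent(self, url):
        # step 3: parse one article page and pull out the Chinese and English sentences
        pass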

First we need to look at the page source, find the navigation bar in it, and see whether it has any features we can extract by. The way to do this is to right-click on the page and view the source.

After some careful searching, we find the correspondence between the navigation entries and their target URLs.

So the next thing to do is parse the HTML.

After a bit of research, it turns out Python already has a ready-made and quite powerful library for parsing HTML, BeautifulSoup. So the first step is to install it. My Python version here is 2.7; go into the Python installation directory and install it with pip:
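Assuming pip is already available for this Python 2.7 install, the command is simply (the BeautifulSoup 4 package on PyPI is called beautifulsoup4):

pip install beautifulsoup4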

After learning some of the simple API usage, it's time to start practicing:
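Before touching the real site, here is a minimal warm-up sketch of the handful of BeautifulSoup calls this crawler relies on (find_all, get_text and get); the HTML string below is made up purely for illustration:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# a made-up snippet that mimics a navigation bar with links
html = '<ul class="nav"><li><a href="/a.html">A</a></li><li><a href="/b.html">B</a></li></ul>'

soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all('a'):           # find every <a> tag
    print a.get_text(), a.get('href')  # the link text and its href attribute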

The first step is to implement the following code:

# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup
import re

# root url of the site
weburl = 'http://news.iyuba.com/'

# the crawler class
class Climbing():
    # proxy switch
    enable_proxy = True
    # root URL
    url = ''

    # initialize: remember the root URL and install the (optional) proxy opener
    def __init__(self, url):
        self.url = url
        proxy_handler = urllib2.ProxyHandler({"http": 'web-proxy.oa.com:8080'})
        null_proxy_handler = urllib2.ProxyHandler({})
        if self.enable_proxy:
            opener = urllib2.build_opener(proxy_handler)
        else:
            opener = urllib2.build_opener(null_proxy_handler)
        urllib2.install_opener(opener)

    # request the URL and return a soup object built from the response
    def __getResponseSoup(self, url):
        request = urllib2.Request(url)
        #request.add_header('User-Agent', "Mozilla/5.0")
        #request.add_header('Accept-Language', 'zh-CN,zh;q=0.5')
        response = urllib2.urlopen(request)
        result = response.read()
        soup = BeautifulSoup(result, "html.parser")
        return soup

    # grab the URL of each category from the home page
    def getCategoryUrl(self):
        soup = self.__getResponseSoup(self.url)
        allinfo = soup.find_all('ul', attrs={"class": "nav navbar-nav"})[0].find_all('a')
        for info in allinfo:
            chinese = info.get_text().encode('utf-8')
            href = info.get('href')
            if href == self.url:
                continue
            print chinese, href

c = Climbing(weburl)
c.getCategoryUrl()

After running, the output is:

Campus http://news.iyuba.com/essay_category/120/1.html
Entertainment http://news.iyuba.com/essay_category/121/1.html
Technology http://news.iyuba.com/essay_category/122/1.html
Sports http://news.iyuba.com/essay_category/123/1.html
Economic http://news.iyuba.com/essay_category/126/1.html
Workplace http://news.iyuba.com/essay_category/124/1.html
Political http://news.iyuba.com/essay_category/125/1.html
Cultural http://news.iyuba.com/essay_category/127/1.html
Life http://news.iyuba.com/essay_category/128/1.html

You can see that the URL of each category has been captured easily. Next we crawl the article URLs under each category URL, with the following code:

    # continue parsing a category URL to get the specific article URLs
    def getDetailUrl(self, category, url):
        print category, url
        soup = self.__getResponseSoup(url)
        otherurl = soup.find_all('a', attrs={"target": "_blank"})
        for info in otherurl:
            tmp = info.find(re.compile("^b"))
            if tmp:
                detailurl = self.url + info.get('href')
                print detailurl

After running, the result is:

After getting the URLs of the specific articles, the next step is to extract the English and Chinese text from each of them, using much the same approach, with the following code:

    # get the translated data from a specific article URL
    def getTranslateContent(self, url):
        print '***************' + url
        soup = self.__getResponseSoup(url)
        # one set of <p> tags holds the original sentences ...
        all = soup.find_all('p', attrs={"ondblclick": "javascript:doexplain();"})
        for words in all:
            print words.get_text().encode('utf-8')
        # ... and the other holds their translations
        all = soup.find_all('p', attrs={"class": "p2"})
        for words in all:
            print words.get_text().encode('utf-8')

The result of the output is:

Actually, crawling like this is not enough: for each category I only grabbed the article links on the first page, but I need to crawl the whole site, so each category has to be crawled page by page. The number of pages is not known in advance, so I can only set a threshold and keep trying page numbers until a 404 page appears. Of course, the whole process also needs URL de-duplication, and even so a few duplicate URLs still slip through, as the log shows.
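As a rough sketch of that paging loop (the helper names and the 100-page threshold are my own choices here, not the actual code from reptile.py): keep incrementing the page number in the .../essay_category/<id>/<page>.html URL, stop when the server answers 404, and keep a set of already-seen URLs to skip duplicates.

# -*- coding: utf-8 -*-
import urllib2

def crawl_category(category_url_prefix, handle_page, max_pages=100):
    """Try .../1.html, .../2.html, ... until a 404 (or the threshold) stops us."""
    seen = set()                          # de-duplicate article URLs across pages
    for page in range(1, max_pages + 1):
        url = '%s%d.html' % (category_url_prefix, page)
        try:
            response = urllib2.urlopen(urllib2.Request(url))
        except urllib2.HTTPError as e:
            if e.code == 404:             # ran past the last page of this category
                break
            raise
        handle_page(response.read(), seen)

def handle_page(html, seen):
    # placeholder: parse the page, collect article URLs not yet in `seen`,
    # add them to `seen`, then fetch and store the translations
    pass

# e.g. crawl_category('http://news.iyuba.com/essay_category/120/', handle_page)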

Finally, here is the complete code:

reptile.py, torepeat.py, log.py

In the end, this code helped me crawl 6,908 articles and more than 70,000 Chinese-English sentence pairs.

Of course, I also ran into quite a few pitfalls along the way; here are a few of them:

Problem one: Korean content opened in Notepad++ showed up garbled, while Windows Notepad displayed it fine; it turned out that the font Notepad++ was using simply has no Korean glyphs.

Problem two: some sites come back garbled when crawled. Here you need to specify the page's encoding via BeautifulSoup's third parameter, e.g. BeautifulSoup(result, "html.parser", from_encoding='utf-8').

Problem three: when building a URL that contains Chinese, the browser sometimes converts it for you ('你好' becomes '%E4%BD%A0%E5%A5%BD'), and sometimes you have to convert the Chinese into that format yourself. In Python this is done with urllib.quote(keyword); to go the other way, from %-escapes back to Chinese, use urllib.unquote.
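For example, with Python 2's urllib (the string used here is just '你好', i.e. "hello"):

# -*- coding: utf-8 -*-
import urllib

keyword = '你好'                      # UTF-8 bytes for the Chinese word "hello"
encoded = urllib.quote(keyword)       # -> '%E4%BD%A0%E5%A5%BD', safe to embed in a URL
print encoded
print urllib.unquote(encoded)         # back to '你好'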

Problem four: some URLs open fine in a browser, but requesting them from Python just gets an error back, because the site expects a User-Agent (UA) to be set. You need to add it in the code:

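Presumably this is the same kind of header line that is commented out in the Climbing class above; a minimal self-contained version looks something like this:

# -*- coding: utf-8 -*-
import urllib2

url = 'http://news.iyuba.com/'                     # any URL that rejects the default UA
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')    # pretend to be a normal browser
response = urllib2.urlopen(request)
print response.getcode()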

Problem five: in some pages' source, the text inside the <p> tags is wrapped in extra <span> elements.


If you need to strip them out, you can do it with the following one-liner:

[span.extract() for span in soup.find_all('span')]

After this removal, the text obtained directly with soup.find_all('p') no longer contains the span content.
