Using Python to crawl Baidu Search and Baidu News for keyword result counts

Source: Internet
Author: User
Tags: socket, error, urlencode

Because of the requirements of an experiment, I needed to run a series of strings through Baidu search and count the number of results for each keyword, so I wrote a Python script for it.

I ran into quite a few problems while writing this script; I will go through them one by one below.

PS: I have never studied Python systematically; I only used it briefly a long time ago, but found it convenient, so I picked it up again this time. I also have a copy of a Python machine learning practice book, so I expect I will be using it again before long.

Idea: first use Python's library functions to fetch the contents of the web page, then use a regular expression to match the desired string, and finally process that string to get the number I want.

Specific Methods (Baidu Search as an example):

(1) Read the file that holds the keywords

    fid = open(filename, 'r')    # open the keyword file
    all_text = fid.readlines()   # read one keyword per line
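On Python 3 the same step is usually written with a context manager so the file is closed automatically. A minimal sketch (the function name `read_keywords` and the idea of skipping blank lines are my additions, not part of the original script):

```python
def read_keywords(filename):
    """Read one search keyword per line, stripping newlines and blank lines."""
    with open(filename, 'r', encoding='utf-8') as fid:
        return [line.strip() for line in fid if line.strip()]
```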

(2) Search for each keyword read, in turn

    socket.setdefaulttimeout(4)          # set a 4s timeout
    for eachtext in all_text:
        eachtext = eachtext.strip('\n')  # remove the trailing newline from the keyword
        # Create an intermediate file used to store the page that is read.
        # Strictly speaking I do not need to do this here, but it is kept
        # for the convenience of debugging.
        output = open(r'data.txt', 'w+')
        flag = 1                         # set flag
        # Sometimes the network is slow and the program gets stuck; the 4s
        # timeout above plus this flag-controlled loop re-reads the page
        # whenever a timeout error occurs
        while flag:
            try:
                res = urllib2.urlopen("http://www.baidu.com/s?"
                                      + urllib.urlencode({'wd': eachtext})
                                      + "&pn=0&cl=3&rn=100")
                html = res.read()
                flag = 0
            except socket.error:
                errno, errstr = sys.exc_info()[:2]
                if errno == socket.timeout:
                    print "There was a timeout"
                else:
                    print "There is some other socket error"
        content = unicode(html, 'utf-8', 'ignore')
        output.write(html)
        output.seek(0)                   # move the pointer back to the head of the file
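The flag-controlled retry loop above can be isolated into a small helper. A sketch in Python 3 (where `urllib2` has become `urllib.request`); `fetch_with_retry` and the `max_tries` cap are my additions, and `fetch` stands for any callable that may raise `socket.timeout`:

```python
import socket

def fetch_with_retry(fetch, max_tries=5):
    """Call fetch() repeatedly until it succeeds or max_tries is exhausted,
    mirroring the flag-controlled while loop in the script above."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except socket.timeout:
            print("There was a timeout, retrying...")
        except socket.error:
            print("There is some other socket error, retrying...")
    raise RuntimeError("giving up after %d tries" % max_tries)
```

Capping the attempts avoids the original script's risk of spinning forever when a keyword never loads.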

(3) Match content with regular expressions

    for line in output.readlines():
        # the final result is obtained by matching two regular expressions in
        # turn; the first pattern matches the result-count phrase on the page
        # (Chinese in the original script, rendered in English here)
        m = re.search(r'relevant results about.*a', line)
        if m:
            text = m.group()
            re_text = text.replace(',', '')       # drop the thousands separators
            m = re.search(r'[0-9]{1,15}', re_text)
            if m:
                fout.write(m.group() + '\n')      # write the matched count to the results file
                print eachtext + ':' + m.group()  # print some debugging information
                break                             # jump out of the loop once matched

Problems encountered:

(1) Displaying Chinese, which in detail is an encoding problem. I believe everyone learning Python runs into this! It is generally not hard to solve, though; plenty of other people's experience can be found by searching.

>> In my program the global encoding is UTF-8, so running in the shell is fine, but running in the console the Chinese comes out garbled, because a Chinese-language system's default encoding is GBK.

My solution is to decode and then re-encode wherever Chinese is to be displayed, e.g.: print substr.decode('utf-8').encode('gbk')
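The same round trip is easy to see with explicit byte strings; in Python 3 it is the same idea, just with a clear split between `bytes` and `str` (a minimal sketch):

```python
# UTF-8 bytes (what the page delivers) -> text -> GBK bytes (for a GBK console)
utf8_bytes = '百度'.encode('utf-8')  # bytes as fetched from a UTF-8 page
text = utf8_bytes.decode('utf-8')    # decode to a proper (unicode) string
gbk_bytes = text.encode('gbk')       # re-encode for a GBK-encoded console
```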

(2) To run my script on machines without Python installed, I packaged the program with py2exe, but found that the program's icon would not display. The packaging code is as follows:

    from distutils.core import setup
    import py2exe
    import sys

    includes = ["encodings", "encodings.*"]
    sys.argv.append("py2exe")
    options = {"py2exe": {"bundle_files": 1}}
    setup(options=options,
          description='search',
          zipfile=None,
          console=[{"script": 'baidu_search.py',
                    'icon_resources': [(1, 'logo.ico')]}])

Posts online say that changing the 1 to 0 makes the icon display (it did nothing for me), and I tried several other suggested methods before finally finding one that works: http://blog.csdn.net/xugangjava/article/details/8049224

(3) To widen the scope of the search, I also made some attempts with Baidu News search, People's Daily search, and Sogou search.

Baidu News Search:

    # search Baidu News instead, by changing the URL
    res = urllib2.urlopen("http://news.baidu.com/ns?"
                          + 'cl=2&rn=20&tn=news&'
                          + urllib.urlencode({'word': eachtext}))
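`urlencode` is what percent-encodes the (possibly Chinese) keyword into the query string. In Python 3 it lives in `urllib.parse`; a sketch of building the same query (the helper name `build_news_url` is mine):

```python
from urllib.parse import urlencode

def build_news_url(keyword):
    # same fixed parameters as the request above, with the keyword encoded last
    return ("http://news.baidu.com/ns?"
            + "cl=2&rn=20&tn=news&"
            + urlencode({'word': keyword}))
```

For a Chinese keyword, `urlencode` emits UTF-8 percent escapes, e.g. `新闻` becomes `word=%E6%96%B0%E9%97%BB`.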

People's Daily search:

I found that the page is rendered with JS, so viewing the source does nothing, and I do not yet know how to simulate browser behavior (learning to would probably take a lot of time, and it is not necessary for now). As a crawler novice I could only give up and fall back to manual collection.

Sogou Search:

This one detects crawlers, and several of my IPs were blocked. Proxy IPs could be used to get around it, but free proxy IP resources online are scarce. Interestingly, browser access was never blocked, so I suspect there is some trick to it; I will study it when I have time.
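One way to route requests through a proxy with Python 3's urllib is a `ProxyHandler`. A sketch only: the address below is a placeholder (TEST-NET), and whether any given proxy actually evades Sogou's blocking is untested here:

```python
import urllib.request

# hypothetical proxy address -- substitute a working one
proxy = urllib.request.ProxyHandler({'http': 'http://203.0.113.1:8080'})
opener = urllib.request.build_opener(proxy)
# opener.open(url) would then send HTTP requests through the proxy
```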

Summary: there is a great deal of material on crawlers out there, and frameworks such as Scrapy are widely used. I simply applied the basics above to spare myself the tedium of manual searching.

Resources:

http://cuiqingcai.com/1052.html

http://www.cnblogs.com/fnng/p/3576154.html
