Using Python to crawl Baidu search result counts for a large number of disease names


Experimental reasons:

At present, the project has a medical encyclopedia search feature. When a user searches for a keyword, it returns a large number of results, but unfortunately they are poorly ranked, which hurts the user experience. Simply put, the diseases at the top of the results may be the least common ones, while the most likely diseases may only be found many pages in.

Experimental Purpose:

To improve the ranking of the search results, the idea is to use the number of results Baidu reports when you search for each disease name; this count can be used to rank the diseases more effectively. The reasoning is that the more Baidu results a disease name has, the richer the information about it is, which in turn suggests that many people search for it and that the disease is relatively common in the population. Conversely, if a disease is very rare, few people suffer from it and few people search for it, so there are fewer pages about it and the search engine reports fewer results.

Experimental process:

First stage: Get the disease name from the database

This phase uses Python to extract data from a database. I use the MySQLdb library to create a connection and fetch the data with the following code:

db = MySQLdb.connect('localhost', 'root', '', 'medical_app', charset='utf8')
cu = db.cursor()
cu.execute('select * from table order by id')

I set the charset parameter in connect() because, without it, the text Python reads from the database comes back garbled.

At this stage I wrote a DBManager class that is responsible for reading from and inserting into the database. The code above handles the read task; how do we insert data with Python?

cu.execute('insert into table (id, name) values (%s, %s)', [a, b])
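To make the structure a little more concrete, here is a minimal sketch of what such a DBManager class could look like. It only assumes MySQLdb and the connection settings shown above; the class name, method names, and the table/column names are placeholders of my own, not the project's actual code.

# Minimal sketch of a database helper, assuming MySQLdb and the settings above.
import MySQLdb

class DBManager:
    def __init__(self):
        self.db = MySQLdb.connect('localhost', 'root', '', 'medical_app',
                                  charset='utf8')
        self.cu = self.db.cursor()

    def read_all(self, table):
        # fetch every row of the table, ordered by id
        self.cu.execute('select * from ' + table + ' order by id')
        return self.cu.fetchall()

    def insert(self, table, record_id, name):
        # MySQLdb uses %s placeholders for query parameters
        self.cu.execute('insert into ' + table + ' (id, name) values (%s, %s)',
                        [record_id, name])
        self.db.commit()

    def close(self):
        self.cu.close()
        self.db.close()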

Second stage: Crawling the data

At first I tried to crawl Baidu pages with Python's urllib, but that approach does not work: Baidu detects that the request comes from a machine and returns an error page.

So I turned to simulating a browser that interacts with Baidu and scrapes the page content. I found the mechanize library, which turns out to be very useful and easy to use. Besides imitating a browser to read web pages, it can interact with the page (for example, filling and submitting forms), and it lets you set the robots option; with that option you can read pages that block robots, such as Baidu.

br = mechanize.Browser()
br.open("http://www.example.com/")

# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info()   # headers
print response1.read()   # body

br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm (the method here is __setitem__).
br["cheeses"] = ["mozzarella", "caerphilly"]

# Submit current form.  Browser calls .close() on the current response on
# navigation, so this closes response1.
response2 = br.submit()

Those are the basics of using mechanize. Below is the short feature summary from the mechanize website; I have to say that, combined with an HTML parser, it is extremely convenient.

  • mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:

    • any URL can be opened, not just http:

    • mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().

  • Easy HTML form filling.

  • Convenient link parsing and following.

  • Browser history ( .back() and .reload() methods).

  • The Referer HTTP header is added properly (optional).

  • Automatic observance of robots.txt .

  • Automatic handling of HTTP-EQUIV and Refresh.

As for BeautifulSoup, it is a very useful HTML parser in data mining. It parses a web page into a tree of tags, and with the find function it is easy to locate a particular element in the page.

  

import urllib
from bs4 import BeautifulSoup   # older code may use: from BeautifulSoup import BeautifulSoup

html = urllib.urlopen('http://mypage.com')
soup = BeautifulSoup(html.read())
soup.find('div', {'class': 'nums'})

The last line finds the div tag in the page whose class attribute is nums.
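In this project, the text of that div is what carries Baidu's result count. A minimal sketch of turning that text into a number might look like the following; the helper name and the assumption that the count is the only run of digits (possibly comma-separated) in the text are mine, not taken from the original code.

import re

def parse_result_count(nums_text):
    # nums_text is the text of the div with class 'nums'; keep only the
    # digits, since Baidu formats the count with separators and extra words.
    digits = re.sub(r'[^0-9]', '', nums_text)
    return int(digits) if digits else 0

# e.g. parse_result_count(u' 1,230,000 ') == 1230000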

Third stage: Exception capture and timeout settings

With the page-crawling part written, I took it for a run and found that Baidu periodically returns error pages, which makes the form filling or the parsing fail. We need to catch these exceptions so that the program keeps running, and push the disease name whose fetch failed back onto the queue to crawl again later. The exception-handling code is as follows:

  

try:
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(url)
    br.select_form('f')
    br['wd'] = name[1].encode('utf8')
    response = br.submit()
    #print 'form submitted, waiting result...'
    # parse the page; Baidu may have returned an error page
    soup = BeautifulSoup(response.read())
    # text = soup.find('div', {'class': 'nums'}).getText()
    if soup.find('div', {'class': 'nums'}):
        text = soup.find('div', {'class': 'nums'}).getText()
    else:
        print '$Return page error, collect again...'
        self.manual.push_record(name)
        continue
except socket.timeout:
    print '$there is an error occur, it will check later...'
    self.manual.push_record(name)
    print name[1], 'pushed into the list.'
    continue

You can see that, to improve retrieval efficiency, a timeout exception is caught; it comes from the timeout support in the socket module. The timeout period has to be set before this piece of code runs:

import socket

socket.setdefaulttimeout(5)

try:
    ...
except socket.timeout:
    print 'timeout'

This example code sets the 5-second time-out.

  

With all of this in place, I had a single-threaded crawler that no longer breaks during the crawl, but it is obviously very slow. I decided to speed it up with multithreading.

Fourth stage: Multi-threaded crawler control

At this stage we need to design a multi-threaded crawler. There are two main considerations: first, how to implement the multithreading itself; second, how to read and write the shared variables safely.

For the first question, there are two ways to implement multithreading in Python.

The first is the functional style:

Call the start_new_thread() function in the thread module to spawn a new thread. The syntax is as follows (a minimal usage sketch appears after the parameter list below):

thread.start_new_thread(function, args[, kwargs])

Parameter description:

  • function: the thread function to run.
  • args: the arguments passed to the thread function; this must be a tuple.
  • kwargs: an optional dictionary of keyword arguments.
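Here is that minimal sketch of the functional style; the worker function and its arguments are purely illustrative and have nothing to do with the crawler itself.

import thread   # Python 2's low-level threading module
import time

def worker(thread_id, delay):
    # toy worker: print a few ticks, then return
    for i in range(3):
        time.sleep(delay)
        print 'thread %d: tick %d' % (thread_id, i)

# args must be a tuple
thread.start_new_thread(worker, (1, 1))
thread.start_new_thread(worker, (2, 2))

# keep the main thread alive long enough for the workers to finish
time.sleep(7)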

The second way is the threading module:

Python supports threads through two standard libraries, thread and threading. The thread module provides low-level, primitive threads and a simple lock, while threading builds a higher-level interface on top of it.

The threading module also provides the following functions (a short sketch follows the list below):

  • threading.currentThread(): returns the current Thread object.
  • threading.enumerate(): returns a list of all currently running threads. "Running" means the thread has been started and has not yet terminated; threads that have not started or have already finished are not included.
  • threading.activeCount(): returns the number of running threads, which equals len(threading.enumerate()).
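A quick sketch of these functions in use (again, purely illustrative):

import threading
import time

def idle():
    time.sleep(2)

t = threading.Thread(target=idle, name='idle-thread')
t.start()

print threading.currentThread().getName()             # name of the calling thread
print threading.activeCount()                         # number of running threads
print [th.getName() for th in threading.enumerate()]  # all running threads

t.join()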

In addition to these functions, the threading module provides the Thread class, which offers the following methods:

  • run(): the method that represents the thread's activity.
  • start(): starts the thread's activity.
  • join([time]): waits for the thread to terminate. This blocks the calling thread until the thread whose join() method was called terminates, either normally or through an unhandled exception, or until the optional timeout expires.
  • isAlive(): returns whether the thread is alive.
  • getName(): returns the thread's name.
  • setName(): sets the thread's name.

I first tried to implement this with the functional style, but the code structure came out unclear and the variable synchronization became very messy, so I used the second method instead:

class MyThread(threading.Thread):   # inherit from threading.Thread
    def __init__(self, threadID, name, manual):
        threading.Thread.__init__(self)
        self.lock = thread.allocate_lock()
        self.threadID = threadID
        self.name = name
        self.manual = manual

    def run(self):   # put the work in run(); the thread executes run() right after it is started
        print "Starting " + self.name
        self.get_rank()
        print "Exiting " + self.name

    def get_rank(self):   # crawler code: keeps fetching disease scores
        ...

How are the threads created, started, and joined? The code is as follows:

for i in xrange(thread_count):   # create the threads
    mythread = MyThread(i, 'Thread-' + str(i), m)
    thread_queue.append(mythread)
for i in xrange(thread_count):   # start the threads
    thread_queue[i].start()
for i in xrange(thread_count):   # wait for the threads to finish
    thread_queue[i].join()

Now we can crawl the diseases with multiple threads, but how do we synchronize the crawl results, and how do the threads read the disease names safely?

Fifth stage: Synchronizing shared variables

At this stage we need to work out a feasible way to operate on the shared variables, which takes some thought. The design is as follows:

  

Class B is the thread class, responsible for crawling the web pages; class A is the synchronization control class, whose main job is to provide synchronized reads and writes of the shared variables v1 and v2.

Class B has already been shown above; the implementation of class A is as follows:

class Manual:   # synchronization variable control
    def __init__(self, names):
        self.names = names
        self.results = []
        self.lock = threading.RLock()

    def get_name(self):   # get a disease name
        self.lock.acquire()
        if len(self.names):
            name = self.names.pop()
            #print 'name get'
            self.lock.release()
            return name
        else:
            self.lock.release()
            return None

    def put_result(self, result):   # store a score
        self.lock.acquire()
        self.results.append(result)
        print '(%d/6811)' % len(self.results)
        self.lock.release()

    def push_record(self, name):   # put back a disease name that failed to fetch
        self.lock.acquire()
        self.names.append(name)
        self.lock.release()
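The body of get_rank is omitted above, but as an illustration of how the thread class and the Manual class fit together, the worker loop might be organized roughly like this. This is only a sketch under the assumptions already used in this article (Baidu's search form is named 'f', its query field is 'wd', and the result count sits in the div with class 'nums'); it is not the author's exact code.

import socket
import mechanize
from bs4 import BeautifulSoup   # older code may use BeautifulSoup 3 instead

def get_rank(manual):
    # 'manual' is an instance of the Manual class above; in the real crawler
    # this logic lives in MyThread.get_rank and uses self.manual instead.
    while True:
        name = manual.get_name()               # thread-safe pop of a disease row
        if name is None:                       # queue exhausted: stop this worker
            break
        try:
            br = mechanize.Browser()
            br.set_handle_robots(False)
            br.open('http://www.baidu.com')
            br.select_form('f')                # Baidu's search form
            br['wd'] = name[1].encode('utf8')  # name[1] is the disease name
            response = br.submit()
            soup = BeautifulSoup(response.read())
            nums = soup.find('div', {'class': 'nums'})
            if nums:
                manual.put_result((name[0], nums.getText()))
            else:
                manual.push_record(name)       # error page: put back, retry later
        except socket.timeout:
            manual.push_record(name)           # timed out: put back, retry later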

Finally, all of the parts are implemented and assembled, and the crawler is now grinding away in the lab. The network is not cooperating; with 4 threads running, a conservative estimate is that it will take about 4 hours.

"Original-blaxon"
