Crawling Baidu search result URLs with Python: a summary

Source: Internet
Author: User

After writing two articles on this, I think the key part of writing a crawler is the analysis process.

What to analyze:

1) First identify the target you want to crawl

For example, this time the target is all of the result URLs returned by a Baidu search.

2) Analyze how you would obtain the target manually, so the program can reproduce it

With Baidu, we first enter a keyword and search, Baidu returns a results page, and we then click through each result.

3) Think about how to implement this in a program and work through the specific difficulties

Following the steps above, we first look at how the search engine works: it provides a search box, the user enters a keyword, and then clicks search.

We can simulate the search and discover that the key is the full URL shown after clicking search, which looks like this:

http://www.baidu.com/s?wd=search content ...

After stripping the extra parameters from the URL and checking that it still returns the results page, we can conclude that the request only needs the wd parameter filled in.
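
As a quick check, here is a minimal sketch of building such a request URL by hand; the keyword value is just a placeholder, and percent-encoding it with urllib.quote is my own precaution, not something the original article does:

    # -*- coding: utf-8 -*-
    import urllib

    keyword = 'python crawler'  # placeholder keyword, not from the original article
    # only the wd parameter is needed; quote() percent-encodes spaces and non-ASCII text
    url = 'http://www.baidu.com/s?wd=' + urllib.quote(keyword)
    print(url)  # http://www.baidu.com/s?wd=python%20crawler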

Next we try requests.get() to see whether it returns the page normally, in case Baidu's anti-crawler measures block us.

Hey, luckily the page comes back normally, haha ~

(Of course, if you do not get a normal response, setting proper headers or even cookies will do the trick.)

    import requests

    url = 'http://www.baidu.com/s?wd= ... '
    r = requests.get(url)
    print r.status_code, r.content

Okay, the next thing we want to figure out is how to crawl all of the results, not just the first page.

Looking at the URL again, there is a key parameter that controls the page number:

http://www.baidu.com/s?wd=...&pn=x

This x increases by 10 per page: the first page is 0, and there are 76 pages in total, so the maximum is 750; anything greater than 750 just returns the first page again.
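
So, a minimal sketch of generating one URL per results page (the keyword here is again just a placeholder):

    keyword = 'python'  # placeholder keyword
    # pn runs 0, 10, 20, ... 750, i.e. range(0, 760, 10), one value per results page
    for pn in range(0, 760, 10):
        page_url = 'http://www.baidu.com/s?wd=%s&pn=%s' % (keyword, pn)
        print(page_url)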

Next, we can analyze the crawled page.

As always, with the help of the friendly BeautifulSoup.

We find that the URLs we need are in the href attribute of the a tags, in this format:

http://www.baidu.com/link?url=...

Because plenty of other URLs are mixed in, we just need to filter on this prefix.

The URL obtained this way is not yet the result URL we want; it is only Baidu's jump (redirect) link.
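
To make the extraction and filtering concrete, here is a minimal sketch, assuming requests and BeautifulSoup are installed; the headers dict and the example page URL are placeholders of my own, and what it prints are exactly these jump links:

    # -*- coding: utf-8 -*-
    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder headers, not from the original
    r = requests.get('http://www.baidu.com/s?wd=python&pn=0', headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')

    # every candidate sits in an <a> tag; keep only Baidu's jump links
    for a in soup.find_all('a', href=True):
        if 'www.baidu.com/link?url=' in a['href']:
            print(a['href'])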

Fortunately, when we make a GET request to this jump link, the url attribute of the returned response object is exactly the result link we want.

We try it again and find there is no other anti-crawler mechanism in the way, haha.

My original thought was that we might need to filter on the status code returned by the new URL when it is not 200 (or even check some headers).

But I found that even if the status is not 200, we can just take the url of the response object; whether the page itself loads normally does not matter.

Because our goal is not the content of the requested page, but the URL of the request itself.

So I just need to print it all out.
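
Here is a minimal sketch of that step, relying on the default behavior of requests (it follows the redirect automatically); the jump link is deliberately left truncated, and the headers are a placeholder of my own:

    import requests

    jump = 'http://www.baidu.com/link?url=...'  # a jump link taken from the results page
    resp = requests.get(jump, headers={'User-Agent': 'Mozilla/5.0'})  # placeholder headers
    # we ignore resp.status_code; only the final URL after the redirect matters
    print(resp.url)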

Of course, I suggest writing a simple, general-purpose headers dict and passing it to every get(), so that at least some unnecessary results can be excluded; an example of what such a dict might look like follows below.
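
The original article leaves headers = {......} unspecified, so the values below are only placeholders of my own showing what such a general-purpose dict might contain:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    }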

With that, the whole idea of the crawler is pretty much complete.

On to the code:

    # coding=utf-8
    import requests
    import sys
    import Queue
    import threading
    from bs4 import BeautifulSoup as bs
    import re

    headers = {
        ......
    }

    class baiduSpider(threading.Thread):
        def __init__(self, queue, name):
            threading.Thread.__init__(self)
            self._queue = queue
            self._name = name

        def run(self):
            while not self._queue.empty():
                url = self._queue.get()
                try:
                    self.get_url(url)
                except Exception, e:
                    print e
                    pass
                    # Be sure to handle the exception!!! Otherwise the thread stops halfway
                    # and the crawled content is incomplete!!!

        def get_url(self, url):
            r = requests.get(url=url, headers=headers)
            soup = bs(r.content, 'html.parser')
            urls = soup.find_all(name='a', attrs={'href': re.compile('.')})
            # for i in urls:
            #     print i
            # crawl the a tags in the Baidu results page; their href holds Baidu's jump address
            for i in urls:
                if 'www.baidu.com/link?url=' in i['href']:
                    a = requests.get(url=i['href'], headers=headers)
                    # after visiting the jump address once, the url of the response object
                    # is the result URL we want to crawl
                    # if a.status_code == 200:
                    #     print a.url
                    with open('e:/url/' + self._name + '.txt') as f:
                        if a.url not in f.read():
                            f = open('e:/url/' + self._name + '.txt', 'a')
                            f.write(a.url + '\n')
                            f.close()

    def main(keyword):
        name = keyword
        f = open('e:/url/' + name + '.txt', 'w')
        f.close()
        queue = Queue.Queue()
        for i in range(0, 760, 10):
            queue.put('http://www.baidu.com/s?wd=%s&pn=%s' % (keyword, str(i)))
        threads = []
        thread_count = 10
        for i in range(thread_count):
            spider = baiduSpider(queue, name)
            threads.append(spider)
        for i in threads:
            i.start()
        for i in threads:
            i.join()
        print "It's down,sir!"

    if __name__ == '__main__':
        if len(sys.argv) != 2:
            print 'No keyword'
            print 'Please enter keyword'
            sys.exit(-1)
        else:
            main(sys.argv[1])

Our tool is used like this:

python 123.py keyword

and it writes the URL results to a file.

About the use of sys here, I have something to say.

In the if __name__ == '__main__': block we first check the arguments: if only one is given (just the script name), we print a reminder message asking the user to enter a keyword.

If there are two, the second one is taken as the keyword and passed to main().

Of course, there is a flaw in this logic: passing more than two arguments will cause other problems (other problems, oh!!!); a possible workaround is sketched below.
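
One possible workaround (my own suggestion, not part of the original script) is to join everything after the script name into a single keyword before handing it to the main() defined in the full script above:

    import sys

    if __name__ == '__main__':
        if len(sys.argv) < 2:
            print('No keyword')
            print('Please enter keyword')
            sys.exit(-1)
        else:
            # join all remaining arguments into one keyword,
            # then pass it to the main() defined in the full script above
            main(' '.join(sys.argv[1:]))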

Worth studying, but that's not the point of our story.

Well, that is all for today's collection of Baidu URL results!

Thanks for reading!
