Crawling Baidu search result URLs with Python: a summary

Source: Internet
Author: User

After writing two articles on this, I think the key part of writing a crawler is the analysis process.

What to analyze:

1) First identify the target you want to crawl

For example, this time the target is all of the result URLs returned by a Baidu search.

2) Analyze how you would obtain the target manually, so the program can reproduce it

With Baidu, we first enter a keyword and search, Baidu returns a results page, and we then click through each result.

3) Think about how to implement this in a program and work through the specific difficulties

Following the steps above, we first look at how the search engine works: it provides a search box, the user enters a keyword, and then clicks search.

We can simulate the search and discover that the key is the full URL shown after clicking search, which looks like this:

http://www.baidu.com/s?wd=search content ...

After stripping the extra parameters from the URL and checking that it still returns the results page, we can conclude that the request only needs the wd parameter filled in.
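
As a quick check, here is a minimal sketch of building such a request URL by hand; the keyword value is just a placeholder, and percent-encoding it with urllib.quote is my own precaution, not something the original article does:

    # -*- coding: utf-8 -*-
    import urllib

    keyword = 'python crawler'  # placeholder keyword, not from the original article
    # only the wd parameter is needed; quote() percent-encodes spaces and non-ASCII text
    url = 'http://www.baidu.com/s?wd=' + urllib.quote(keyword)
    print(url)  # http://www.baidu.com/s?wd=python%20crawler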

Next we try requests.get() to see whether it returns the page normally, in case Baidu's anti-crawler measures block us.

Hey, luckily the page comes back normally, haha ~

(Of course, if you do not get a normal response, setting proper headers or even cookies will do the trick.)

    import requests

    url = 'http://www.baidu.com/s?wd= ... '
    r = requests.get(url)
    print r.status_code, r.content

Okay, the next thing we want to figure out is how to crawl all of the results, not just the first page.

Looking at the URL again, there is a key parameter that controls the page number:

http://www.baidu.com/s?wd=...&pn=x

This x increases by 10 per page: the first page is 0, and there are 76 pages in total, so the maximum is 750; anything greater than 750 just returns the first page again.
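
So, a minimal sketch of generating one URL per results page (the keyword here is again just a placeholder):

    keyword = 'python'  # placeholder keyword
    # pn runs 0, 10, 20, ... 750, i.e. range(0, 760, 10), one value per results page
    for pn in range(0, 760, 10):
        page_url = 'http://www.baidu.com/s?wd=%s&pn=%s' % (keyword, pn)
        print(page_url)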

Next, we can analyze the crawled page.

As always, with the help of the friendly BeautifulSoup.

We find that the URLs we need are in the href attribute of the a tags, in this format:

http://www.baidu.com/link?url=...

Because plenty of other URLs are mixed in, we just need to filter on this prefix.

The URL obtained this way is not yet the result URL we want; it is only Baidu's jump (redirect) link.
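
To make the extraction and filtering concrete, here is a minimal sketch, assuming requests and BeautifulSoup are installed; the headers dict and the example page URL are placeholders of my own, and what it prints are exactly these jump links:

    # -*- coding: utf-8 -*-
    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder headers, not from the original
    r = requests.get('http://www.baidu.com/s?wd=python&pn=0', headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')

    # every candidate sits in an <a> tag; keep only Baidu's jump links
    for a in soup.find_all('a', href=True):
        if 'www.baidu.com/link?url=' in a['href']:
            print(a['href'])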

Fortunately, when we make a GET request to this jump link, the url attribute of the returned response object is exactly the result link we want.

We try it again and find there is no other anti-crawler mechanism in the way, haha.

My original thought was that we might need to filter on the status code returned by the new URL when it is not 200 (or even check some headers).

But I found that even if the status is not 200, we can just take the url of the response object; whether the page itself loads normally does not matter.

Because our goal is not the content of the requested page, but the URL of the request itself.

So I just need to print it all out.
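
Here is a minimal sketch of that step, relying on the default behavior of requests (it follows the redirect automatically); the jump link is deliberately left truncated, and the headers are a placeholder of my own:

    import requests

    jump = 'http://www.baidu.com/link?url=...'  # a jump link taken from the results page
    resp = requests.get(jump, headers={'User-Agent': 'Mozilla/5.0'})  # placeholder headers
    # we ignore resp.status_code; only the final URL after the redirect matters
    print(resp.url)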

Of course, I suggest writing a simple, general-purpose headers dict and passing it to every get(), so that at least some unnecessary results can be excluded; an example of what such a dict might look like follows below.
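
The original article leaves headers = {......} unspecified, so the values below are only placeholders of my own showing what such a general-purpose dict might contain:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    }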

With that, the whole idea of the crawler is pretty much complete.

On to the code:

    # coding=utf-8
    import requests
    import sys
    import Queue
    import threading
    from bs4 import BeautifulSoup as bs
    import re

    headers = {
        ......
    }

    class baiduSpider(threading.Thread):
        def __init__(self, queue, name):
            threading.Thread.__init__(self)
            self._queue = queue
            self._name = name

        def run(self):
            while not self._queue.empty():
                url = self._queue.get()
                try:
                    self.get_url(url)
                except Exception, e:
                    print e
                    pass
                    # Be sure to handle the exception!!! Otherwise the thread stops halfway
                    # and the crawled content is incomplete!!!

        def get_url(self, url):
            r = requests.get(url=url, headers=headers)
            soup = bs(r.content, 'html.parser')
            urls = soup.find_all(name='a', attrs={'href': re.compile('.')})
            # for i in urls:
            #     print i
            # crawl the a tags in the Baidu results page; their href holds Baidu's jump address
            for i in urls:
                if 'www.baidu.com/link?url=' in i['href']:
                    a = requests.get(url=i['href'], headers=headers)
                    # after visiting the jump address once, the url of the response object
                    # is the result URL we want to crawl
                    # if a.status_code == 200:
                    #     print a.url
                    with open('e:/url/' + self._name + '.txt') as f:
                        if a.url not in f.read():
                            f = open('e:/url/' + self._name + '.txt', 'a')
                            f.write(a.url + '\n')
                            f.close()

    def main(keyword):
        name = keyword
        f = open('e:/url/' + name + '.txt', 'w')
        f.close()
        queue = Queue.Queue()
        for i in range(0, 760, 10):
            queue.put('http://www.baidu.com/s?wd=%s&pn=%s' % (keyword, str(i)))
        threads = []
        thread_count = 10
        for i in range(thread_count):
            spider = baiduSpider(queue, name)
            threads.append(spider)
        for i in threads:
            i.start()
        for i in threads:
            i.join()
        print "It's down,sir!"

    if __name__ == '__main__':
        if len(sys.argv) != 2:
            print 'No keyword'
            print 'Please enter keyword'
            sys.exit(-1)
        else:
            main(sys.argv[1])

Our tool is used like this:

python 123.py keyword

and it writes the URL results to a file.

About the use of sys here, I have something to say.

In the if __name__ == '__main__': block we first check the arguments: if only one is given (just the script name), we print a reminder message asking the user to enter a keyword.

If there are two, the second one is taken as the keyword and passed to main().

Of course, there is a flaw in this logic: passing more than two arguments will cause other problems (other problems, oh!!!); a possible workaround is sketched below.
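
One possible workaround (my own suggestion, not part of the original script) is to join everything after the script name into a single keyword before handing it to the main() defined in the full script above:

    import sys

    if __name__ == '__main__':
        if len(sys.argv) < 2:
            print('No keyword')
            print('Please enter keyword')
            sys.exit(-1)
        else:
            # join all remaining arguments into one keyword,
            # then pass it to the main() defined in the full script above
            main(' '.join(sys.argv[1:]))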

Worth studying, but that's not the point of our story.

Well, that is all for today's collection of Baidu URL results!

Thanks for reading!
