Parsing Baidu web page source code with Python: extracting the first page_num*10 link URLs returned by the search engine

Source: Internet
Author: User


Recently, for a homework assignment in an information-retrieval course, I needed to submit a query to a search engine, inspect the links it returned, judge whether each of the first 30 was relevant, and then evaluate the engine's performance. Clicking into every link by hand to check whether its content actually matched the query was painful, so I decided to parse the result page source instead and extract the URLs of the links the engine returns; with that done, the rest of the homework is straightforward. For now the script only parses Baidu's page source; other search engines would need their own analysis.
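The core idea described above — locate each numbered result in the page source and pull out its href — can be sketched in modern Python 3 as follows. Note that the HTML snippet and the function name here are my own illustrative assumptions, not Baidu's actual markup or the original script's API:

```python
import re

# A minimal, hypothetical snippet standing in for a search results page;
# real Baidu markup differs, this only illustrates the href extraction.
sample_html = (
    '<div id="1"><a href="http://example.com/a" target="_blank">result one</a></div>'
    '<div id="2"><a href="http://example.com/b" target="_blank">result two</a></div>'
)

def extract_result_links(html, num_results=10):
    """Pull the href value between each numbered result id and its link tag."""
    links = []
    for i in range(1, num_results + 1):
        # non-greedy match from id="i" forward to the first href value
        m = re.search(r'id="%d".*?href="([^"]+)"' % i, html)
        if m:
            links.append(m.group(1))
    return links

print(extract_result_links(sample_html))
# -> ['http://example.com/a', 'http://example.com/b']
```

The non-greedy `.*?` keeps each match anchored to its own result block, which is the same trick the original Python 2 code relies on.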

# coding:utf-8
'''
Created on October 27, 2014

@author: shifeng
'''
import urllib2
import urllib
import re

# ----------------------------------------------------------------------
# Percent-decode a URL such as
# http://v.baidu.com/v?s=8&word=...&fr=ala11
def myurldecode(url):
    list_url_sub = url.split("&")            # split the query string on "&"
    s_after_decode = ""
    for i in range(len(list_url_sub)):
        if "=" in list_url_sub[i]:
            pair = list_url_sub[i].split("=")
            key = pair[0]
            value = pair[1]
            # GB2312 round-trip, because Baidu percent-encodes some parameters in GB2312
            value_decode = urllib.unquote(value.encode("GB2312")).decode("GB2312")
            s_after_decode = s_after_decode + key + "=" + value_decode + "&"
        else:
            s_after_decode = s_after_decode + list_url_sub[i] + "&"
    s_after_decode = s_after_decode[:-1]     # drop the trailing "&"
    return s_after_decode

# ----------------------------------------------------------------------
# Parse the 10 result links Baidu returns on one page (other engines are
# not handled yet, but should follow a similar pattern).
# Key observation: id="i" (i = 1..10) is unique in the page source, so take
# the characters between id="i" and the query string, then extract the span
# from "href" to "target", which contains the link URL.
def my_get_link_url(str_url_inbaidu, list_url_query_baidu, query):
    m = urllib2.urlopen(str_url_inbaidu).read()
    m = m.decode("utf-8")
    m = m.replace("\n", "")   # the regex cannot match across line breaks, so strip them
    for i in range(10):
        # mind the Chinese encoding of the query here
        str_list = re.compile('id=\"' + str(i + 1) + '\".*?' + query)
        for j in str_list.finditer(m):
            # cut out the string between id="i" and the query,
            # then take the span from "href" to "target" and clean it up
            str_id_to_query = j.group()
            url_1 = str_id_to_query[str_id_to_query.find("href"):str_id_to_query.find("target")].replace(" ", "")
            # remove spaces: some results come out as 'href ="' instead of 'href="'
            url_2 = url_1[len("href=\""):]   # strip the leading 'href="'
            url = url_2[:url_2.find("\"")]   # drop everything after the closing quote
            list_url_query_baidu.append(url)
    return list_url_query_baidu

# ----------------------------------------------------------------------
# Note: the initial URL itself would not decode as GB2312; switching the
# query to UTF-8 makes the search URL work, but some of the returned link
# URLs still cannot be decoded.
query = u"无限大地"   # the query string, UTF-8 encoded
page_num = 50

list_all_url = []     # one element per page; each element is a list of up to 10 URLs
for page in range(page_num):
    str_url_inbaidu = ("http://www.baidu.com/s?wd=" + query + "&pn=" + str(page) + "0"
                       + "&oq=" + query + "&tn=baiduhome_pg&ie=utf-8"
                       + "&usm=3&rsv_idx=1&f=8&rsv_bp=1")
    # pass a fresh list for every page so results do not accumulate across pages
    page_links = my_get_link_url(str_url_inbaidu, [], query)
    list_all_url.append(page_links)

for i in range(len(list_all_url)):
    print "This is page", str(i + 1), "of", str(page_num), ":"
    for j in range(len(list_all_url[i])):
        print "Page", str(i + 1), "link", str(j + 1), ":", list_all_url[i][j]

# Example of a single-page call with a pre-encoded URL:
# str_url_inbaidu = "http://www.baidu.com/s?wd=%E6%97%A0%E9%99%90%E5%BC%80%E5%85%B3&rsv_spt=1&issp=1&f=8&rsv_bp=0&ie=utf-8&tn=baiduhome_pg&bs=%e6%97%a0%e9%99%90%e5%a4%a7%e5%9c%b0"
# my_get_link_url(str_url_inbaidu, [], query)
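In Python 3, the GB2312 encode/decode dance in myurldecode is no longer needed: urllib.parse.unquote takes an encoding argument directly. A sketch of the same field-by-field decoding (the function name is my own, not from the original script):

```python
from urllib.parse import unquote

def decode_query_fields(url, encoding="utf-8"):
    """Percent-decode each key=value field of a query string, field by field."""
    decoded_parts = []
    for part in url.split("&"):
        if "=" in part:
            # split only on the first "=", in case the value contains one
            key, _, value = part.partition("=")
            decoded_parts.append(key + "=" + unquote(value, encoding=encoding))
        else:
            decoded_parts.append(part)       # field without "=", keep as-is
    return "&".join(decoded_parts)

print(decode_query_fields("wd=%E6%97%A0%E9%99%90&ie=utf-8"))
# -> wd=无限&ie=utf-8
```

For pages that still encode parameters in GB2312, passing `encoding="gb2312"` reproduces the behavior of the Python 2 helper above.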


