Last time I used the requests library to write a simple script that crawls the links on a page. As an extension, we can also use it to fetch a website's PR and Baidu weight; the principle is similar. In the end we can even write a loop to query this information for many sites in bulk.
Let's start with Google PR, short for PageRank. It is Google's official rating of a website's SEO standing, which should be familiar to everyone. Since the rating is official, there is of course an official interface for fetching it, and that is what we use to get the Google PR.
The code is as follows:
import re
import requests

# Seed string that is XOR-ed into the hash, one character per input character.
GPR_HASH_SEED = ("Mining PageRank is AGAINST GOOGLE'S TERMS OF SERVICE. "
                 "Yes, I'm talking to you, scammer.")

def google_hash(value):
    magic = 0x1020345
    for i in range(len(value)):
        magic ^= ord(GPR_HASH_SEED[i % len(GPR_HASH_SEED)]) ^ ord(value[i])
        # Rotate left by 9 bits within a 32-bit word.
        magic = (magic >> 23 | magic << 9) & 0xFFFFFFFF
    return "8%08x" % (magic)

def getPR(www):
    try:
        url = 'http://toolbarqueries.google.com/tbr?' \
              'client=navclient-auto&ch=%s&features=rank&q=info:%s' % (google_hash(www), www)
        response = requests.get(url)
        # The body looks like rank_1:1:0; the PR is the number after the second colon.
        rex = re.search(r'(.*?:.*?:)(\d+)', response.text)
        return rex.group(2)
    except Exception:
        # A network error or an unmatched response both end up here.
        return None
Usage: pass in a domain name and get back its PR value.
The google_hash function is just an algorithm that computes a hash-like value from the domain name and returns it. We don't need to care how it is implemented; the function to look at is getPR. Google's official interface looks like this: http://toolbarqueries.google.com/tbr?client=navclient-auto&ch={hash}&features=rank&q=info:{domain}
For {hash} we call google_hash() with the domain name, which returns the corresponding hash value. For example, for my own site's domain, www.leavesongs.com, the Google hash is 8b1e6ad00, so the query URL becomes: http://toolbarqueries.google.com/tbr?client=navclient-auto&ch=8b1e6ad00&features=rank&q=info:www.leavesongs.com
Visit it and you get rank_1:1:0. The number after the second colon is the PR; my site has no PR, so the value is 0.
So we use requests.get() to request the constructed URL, get back a result such as rank_1:1:0, and finally extract the PR value, 0 here, with a regular expression or some other means.
That is how the getPR function works. Next, let's look at how to obtain the Baidu weight.
Baidu weight is not a standard published by Baidu itself; it is a value computed by a number of third-party websites, so there is no official interface as there is for PR. Instead we have to scrape it from those third-party sites. Here is the function that gets the Baidu weight:
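The two steps just described, building the query URL and pulling the number out of the reply, can be tried without touching the network. The following is only a minimal sketch, assuming Python 3; the helper names build_pr_url and parse_pr are made up for this example, and the hash and the reply string are the ones quoted above rather than a live response.

```python
import re


def build_pr_url(domain, ch):
    """Fill the {hash} and {domain} slots of the query URL (hypothetical helper)."""
    return ("http://toolbarqueries.google.com/tbr?"
            "client=navclient-auto&ch=%s&features=rank&q=info:%s" % (ch, domain))


def parse_pr(body):
    """Return the digits after the second colon, or None if the body does not match."""
    match = re.search(r"(.*?:.*?:)(\d+)", body)
    return match.group(2) if match else None


# Values taken from the text above, not fetched live.
url = build_pr_url("www.leavesongs.com", "8b1e6ad00")
print(url)
print(parse_pr("rank_1:1:0"))
```

Returning None for an unmatched body mirrors what getPR does when anything goes wrong, so callers only have one failure value to check.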
The code is as follows:
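def getBR(www):
    try:
        url = 'http://mytool.chinaz.com/baidusort.aspx?host=%s&sortType=0' % (www,)
        response = requests.get(url)
        data = response.text
        # NOTE: the literal HTML tags that anchor this pattern are missing here;
        # take them from the page source before relying on group(2).
        rex = re.search(r'(.+?)(\d*?)()', data, re.I)
        return rex.group(2)
    except Exception:
        # A network error or an unmatched page both end up here.
        return None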
Usage is the same: pass in a domain name and get back its weight value.
The page I scrape is a weight lookup page from Chinaz Webmaster Tools: http://mytool.chinaz.com/baidusort.aspx?host={domain}&sortType=0
My regular expression is (.+?)(\d*?)(). Look at the HTML source of the page and you will see how to write it.
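To illustrate the idea of pinning the digits between literal pieces of the page's markup, here is a small sketch, assuming Python 3. The HTML snippet, the class name and the function name extract_weight are all invented for this example and are not the real Chinaz markup; the actual anchors have to be read from the page source. The re.S flag is also my own assumption, added so that the lazy `.+?` can cross line breaks in a real page.

```python
import re

# Hypothetical markup, invented for illustration only.
SAMPLE_HTML = '<div class="weight">Baidu weight: <b>3</b></div>'


def extract_weight(html):
    """Capture the digits that sit between the literal <b> and </b> anchors."""
    pattern = r'(<div class="weight">.+?<b>)(\d*?)(</b>)'
    match = re.search(pattern, html, re.I | re.S)
    return match.group(2) if match else None


print(extract_weight(SAMPLE_HTML))
```

The lazy quantifiers only behave as intended when literal text surrounds them; without such anchors the middle group is free to match nothing at all.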
OK, now let's fetch the PR and weight for a batch of sites:
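The loop itself is not reproduced here, so the following is only a minimal sketch of what it might look like, assuming Python 3. The names query_sites and format_row are hypothetical. The two lookups are passed in as parameters so that the getPR and getBR functions above can be plugged in; the stubs at the bottom are placeholders so the sketch runs without network access.

```python
def query_sites(domains, get_pr, get_br):
    """Look up PR and Baidu weight for each domain, one after another."""
    results = []
    for domain in domains:
        results.append((domain, get_pr(domain), get_br(domain)))
    return results


def format_row(domain, pr, br):
    """Render one result line; None (a failed lookup) is shown as '-'."""
    def show(value):
        return "-" if value is None else str(value)
    return "%-24s PR: %-3s BR: %s" % (domain, show(pr), show(br))


if __name__ == "__main__":
    # Placeholder lookups so the sketch runs offline; swap in getPR and getBR.
    def stub_pr(domain):
        return "0"

    def stub_br(domain):
        return None

    for row in query_sites(["www.leavesongs.com", "example.com"], stub_pr, stub_br):
        print(format_row(*row))
```

Because both lookups return None on failure, one unreachable site does not stop the rest of the batch.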
Here are the results:
Scanning in a single process is a bit slow; fetching in bulk with 10 or 20 threads should be considerably faster.
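One possible way to do that, sketched here with the standard library's ThreadPoolExecutor and assuming Python 3. The function name query_sites_threaded and the default of 10 workers are my own choices for illustration, and the lookups are again passed in as parameters so that getPR and getBR can be supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def query_sites_threaded(domains, get_pr, get_br, workers=10):
    """Run the per-domain lookups on a pool of worker threads."""
    def lookup(domain):
        return (domain, get_pr(domain), get_br(domain))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input domains.
        return list(pool.map(lookup, domains))


if __name__ == "__main__":
    # Placeholder lookups so the sketch runs offline.
    sites = ["www.leavesongs.com", "example.com", "example.org"]
    for row in query_sites_threaded(sites, lambda d: "0", lambda d: None):
        print(row)
```

Threads suit this job because each lookup spends most of its time waiting on the network, so the requests can overlap even though the code per site is unchanged.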