Crawling the Baidu Index and reading the index values with image recognition
Objective:
Tuffo once said that the Baidu Index is hard to scrape, and that sellers on Taobao charge 20 yuan per keyword:
I was not going to be scared off by that, so I spent about two and a half days, in bits and pieces, getting it done. Consider this a friendly jab at Tuffo.
Quite a few things need to be installed:
Google's image recognition engine tesseract-ocr
pip3 install pillow
pip3 install pyocr
selenium 2.45
Chrome 47.0.2526.106 m or Firefox 32.0.1
chromedriver.exe
For recognizing verification codes with image recognition, see: http://www.jb51.net/article/92287.htm
For Selenium usage, see: http://www.jb51.net/article/52329.htm
Baidu Index requires a login. The account name and password are read from the text file account.txt, one per line.
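As a rough sketch of how that file could be read, the helper below takes the username from the first non-empty line and the password from the second. The names `parse_account` and `read_account` are hypothetical and not part of the original script, and skipping blank lines is an assumption added here.

```python
def parse_account(lines):
    """Return (username, password) from the lines of an account file.

    Assumption: the first non-empty line is the username,
    the second non-empty line is the password.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) < 2:
        raise ValueError("account file needs a username line and a password line")
    return cleaned[0], cleaned[1]


def read_account(path):
    """Read the credentials file at `path` (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return parse_account(f.readlines())


print(parse_account(["my_user\n", "my_password\n"]))
```

Raising an error when a line is missing makes a badly filled file fail immediately instead of causing an index error later during login.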
The general-purpose login code is as follows:
from selenium import webdriver


# Open the browser and log in to Baidu
def openbrowser():
    global browser

    # https://passport.baidu.com/v2/?login
    url = "https://passport.baidu.com/v2/?login&tpl=mn&u=http%3a%2f%2fwww.baidu.com%2f"
    # Open the browser: webdriver.Firefox() or webdriver.Chrome()
    browser = webdriver.Chrome()
    # Load the login page
    browser.get(url)
    # Optionally wait for the browser to open
    # print("Waiting 10 seconds for the browser to open...")
    # time.sleep(10)

    # Find the input boxes, e.g. id="TANGRAM__PSP_3__userName", and clear them
    browser.find_element_by_id("TANGRAM__PSP_3__userName").clear()
    browser.find_element_by_id("TANGRAM__PSP_3__password").clear()

    # Read the account name and password
    account = []
    try:
        fileaccount = open("../baidu/account.txt")
        accounts = fileaccount.readlines()
        for acc in accounts:
            account.append(acc.strip())
        fileaccount.close()
    except Exception as err:
        print(err)
        input("Please write the account name and password correctly in account.txt")
        exit()
    browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0])
    browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1])

    # Click the login button, id="TANGRAM__PSP_3__submit"
    browser.find_element_by_id("TANGRAM__PSP_3__submit").click()

    # Optionally wait for the login to complete
    # print("Waiting 10 seconds for the login...")
    # time.sleep(10)
    print("Waiting for the page to load...")

    select = input("Check the browser: has the login succeeded (y/n)? ")
    while 1:
        if select == "y" or select == "Y":
            print("Login succeeded!")
            print("Ready to open a new window...")
            # time.sleep(1)
            # browser.quit()
            break
        elif select == "n" or select == "N":
            selectno = input("Press 0 if the account or password was wrong, press 1 if a verification code appeared... ")
            # Wrong account or password: enter them again
            if selectno == "0":
                # Find the input boxes and clear them
                browser.find_element_by_id("TANGRAM__PSP_3__userName").clear()
                browser.find_element_by_id("TANGRAM__PSP_3__password").clear()
                # Read the account name and password again
                account = []
                try:
                    fileaccount = open("../baidu/account.txt")
                    accounts = fileaccount.readlines()
                    for acc in accounts:
                        account.append(acc.strip())
                    fileaccount.close()
                except Exception as err:
                    print(err)
                    input("Please write the account name and password correctly in account.txt")
                    exit()
                browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0])
                browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1])
                # Click the login button, id="TANGRAM__PSP_3__submit"
                browser.find_element_by_id("TANGRAM__PSP_3__submit").click()
            elif selectno == "1":
                # The verification code input box has id="ap_captcha_guess"
                input("Please enter the verification code in the browser and log in...")
            # Ask again after either retry, so the loop can finish
            select = input("Check the browser: has the login succeeded (y/n)? ")
        else:
            print('Please enter "y" or "n"!')
            select = input("Check the browser: has the login succeeded (y/n)? ")
The login page:
After logging in, open a new window for Baidu Index and switch to it. In Selenium:
# Open a new window by executing JavaScript
js = 'window.open("http://index.baidu.com");'
browser.execute_script(js)
# Switch to the new window handle to enter Baidu Index
# Get the handles of all currently open windows
# handles is a list
handles = browser.window_handles
# print(handles)
# Switch to the most recently opened window
browser.switch_to_window(handles[-1])
Clear the input box, search for the keyword, and build the click that selects the number of days:
# Clear the input box
browser.find_element_by_id("schword").clear()
# Type the keyword to search for in Baidu Index
browser.find_element_by_id("schword").send_keys(keyword)
# Click search
# <input type="submit" value="" id="searchWords" onclick="searchDemoWords()">
browser.find_element_by_id("searchWords").click()
time.sleep(2)
# Maximize the window
browser.maximize_window()
# Choose the number of days
sel = int(input("Press 0 for 7 days, 1 for 30 days, 2 for 90 days, 3 for six months: "))
day = 0
if sel == 0:
    day = 7
elif sel == 1:
    day = 30
elif sel == 2:
    day = 90
elif sel == 3:
    day = 180
sel = '//a[@rel="' + str(day) + '"]'
browser.find_element_by_xpath(sel).click()
# Going too fast makes the next lookup fail, so pause
time.sleep(2)
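The same choice-to-XPath step can be written as a small lookup, sketched below. `DAYS_BY_CHOICE` and `day_xpath` are hypothetical names introduced for illustration. The only thing taken from the page is that the day links are matched by their `rel` attribute, as in the code above.

```python
# Menu choice -> number of days, following the prompt text
DAYS_BY_CHOICE = {0: 7, 1: 30, 2: 90, 3: 180}


def day_xpath(choice):
    """Build the XPath of the link whose rel attribute equals the day count."""
    if choice not in DAYS_BY_CHOICE:
        raise ValueError("choice must be 0, 1, 2 or 3")
    return '//a[@rel="{}"]'.format(DAYS_BY_CHOICE[choice])


print(day_xpath(1))  # //a[@rel="30"]
```

A dictionary rejects an invalid choice explicitly, whereas the if/elif chain would silently leave `day` at 0.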
The links for the number of days are here:
Find the chart area:
xoyelement = browser.find_elements_by_css_selector("#trend rect")[2]
The chart area is:
Build the mouse offsets from the spacing between the data points:
Take the 7-day view and look at the coordinates:
The x-coordinate of the first point is 1031.66666.
The x-coordinate of the second point is 1234.
So in the 7-day view the spacing between two adjacent points is 1234 - 1031.67 = 202.33. The other ranges are worked out the same way.
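The arithmetic can be checked with a few lines of Python. The sketch below also assumes, without confirmation from the page itself, that the points of every range are spread evenly over the same chart width, so the 7-day spacing can be rescaled to other ranges. `point_spacing` and `spacing_for_days` are hypothetical helper names.

```python
def point_spacing(x_first, x_second):
    """Horizontal distance between two adjacent data points."""
    return x_second - x_first


def spacing_for_days(chart_width, days):
    """Spacing if `days` points are spread evenly over `chart_width` pixels."""
    return chart_width / (days - 1)


step_7 = point_spacing(1031.66666, 1234)   # about 202.33
chart_width = step_7 * (7 - 1)             # distance from first to last point
for days in (7, 30, 90, 180):
    print(days, round(spacing_for_days(chart_width, days), 2))
```

Under that assumption the estimates are roughly 202.33, 41.86, 13.64 and 6.78 pixels. The 30-day estimate is close to, but not identical with, the 41.68 used in the loop later in this article, so values measured on the page should take precedence over this estimate.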
Use Selenium's ActionChains to simulate moving the mouse and hovering:
from selenium.webdriver.common.action_chains import ActionChains
ActionChains(browser).move_to_element_with_offset(xoyelement, x_0, y_0).perform()
But the point it actually lands on is this position:
That is the upper-left corner of the rectangle, where the JavaScript that shows the pop-up box is not triggered, so add 1 to the x-coordinate:
Write a loop that advances the x-coordinate once for each day in the selected range:
# Loop once per day in the selected range
for i in range(day):
    # Build the offset rule
    if day == 7:
        x_0 = x_0 + 202.33
    elif day == 30:
        x_0 = x_0 + 41.68
    elif day == 90:
        x_0 = x_0 + 13.64
    elif day == 180:
        x_0 = x_0 + 6.78
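The loop above can also be expressed as a function that returns every hover offset up front. This is only a sketch: `STEP_BY_DAYS` and `hover_offsets` are hypothetical names, the step values are the ones measured in this article, and starting at x = 1 follows the "+ 1" adjustment described earlier.

```python
# Per-day horizontal step in pixels, as measured in the article
STEP_BY_DAYS = {7: 202.33, 30: 41.68, 90: 13.64, 180: 6.78}


def hover_offsets(day, x_start=1.0):
    """Return one x offset per day, starting at x_start and advancing by the step."""
    step = STEP_BY_DAYS[day]
    # Multiplying instead of repeatedly adding avoids accumulating float error
    return [x_start + i * step for i in range(day)]


offsets = hover_offsets(7)
print(len(offsets), offsets[0], round(offsets[1], 2))
```

Each value could then be passed as the x offset to `ActionChains(browser).move_to_element_with_offset(xoyelement, x, y_0).perform()`.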
When the mouse hovers over the chart, a box pops up. Find that box in the page source:
Let Selenium locate it automatically:
# <div class="imgtxt" style="margin-left:-117px;"></div>
imgelement = browser.find_element_by_xpath('//div[@id="viewbox"]')
Then determine the position and size of this box:
# Find the picture's coordinates
locations = imgelement.location
print(locations)
# Find the picture's size
sizes = imgelement.size
print(sizes)
# Position of the index value: (left, top, right, bottom)
rangle = (int(locations['x']), int(locations['y']), int(locations['x'] + sizes['width']),
          int(locations['y'] + sizes['height']))
The captured image is:
The idea from here is:
1. Take a screenshot of the whole browser window
2. Open the screenshot and crop it with the rangle coordinates above
But what gets cropped is the whole black box shown above, whereas the result I want is:
So rangle has to be recalculated. I was lazy and ignored the length of the search keyword, hard-coding the fractions by brute force:
# Build the position of the index value
rangle = (int(locations['x'] + sizes['width'] / 3), int(locations['y'] + sizes['height'] / 2),
          int(locations['x'] + sizes['width'] * 2 / 3), int(locations['y'] + sizes['height']))
Hard-coding it this way is not really good. At the very least the length of the keyword should be taken into account, because a keyword that is too long shifts the crop coordinates. I know how to do it, I am just not writing it out for you here!
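To make the hard-coded fractions easier to see, here is the same calculation as a standalone function. `index_crop_box` is a hypothetical name, and the sample numbers are invented; they only mimic the shape of Selenium's `element.location` and `element.size` dictionaries.

```python
def index_crop_box(location, size):
    """Return (left, top, right, bottom): middle third, lower half of the box."""
    left = int(location["x"] + size["width"] / 3)
    top = int(location["y"] + size["height"] / 2)
    right = int(location["x"] + size["width"] * 2 / 3)
    bottom = int(location["y"] + size["height"])
    return (left, top, right, bottom)


# Hypothetical values shaped like element.location and element.size
print(index_crop_box({"x": 900, "y": 300}, {"width": 300, "height": 60}))
# (1000, 330, 1100, 360)
```

The fixed thirds are exactly what breaks for long keywords: a wider pop-up moves the number out of the middle third.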
The complete code for this step is:
from PIL import Image

# <div class="imgtxt" style="margin-left:-117px;"></div>
imgelement = browser.find_element_by_xpath('//div[@id="viewbox"]')
# Find the picture's coordinates
locations = imgelement.location
print(locations)
# Find the picture's size
sizes = imgelement.size
print(sizes)
# Build the position of the index value
rangle = (int(locations['x'] + sizes['width'] / 3), int(locations['y'] + sizes['height'] / 2),
          int(locations['x'] + sizes['width'] * 2 / 3), int(locations['y'] + sizes['height']))
# Take a screenshot of the current browser window
path = "../baidu/" + str(num)
browser.save_screenshot(str(path) + ".png")
# Open the screenshot and crop it
img = Image.open(str(path) + ".png")
jpg = img.crop(rangle)
jpg.save(str(path) + ".jpg")
Later I found that the cropped picture is too small and the recognition accuracy too low, so the picture needs to be enlarged:
# Double the size of the picture
# Original size: 73 x 29
jpgzoom = Image.open(str(path) + ".jpg")
(x, y) = jpgzoom.size
x_s = 146
y_s = 58
out = jpgzoom.resize((x_s, y_s), Image.ANTIALIAS)
# The format argument 'png' means PNG data is written despite the .jpg name
out.save(path + 'zoom.jpg', 'png', quality=95)
To check the original size, right-click the file -> Properties -> Details. Mine is 73 pixels wide and 29 pixels high.
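Instead of typing the doubled numbers by hand, the target size can be computed from the size of the crop. A minimal sketch, with `scaled_size` as a hypothetical helper name:

```python
def scaled_size(size, factor=2):
    """Scale a (width, height) tuple by `factor`, rounding to whole pixels."""
    width, height = size
    return (int(round(width * factor)), int(round(height * factor)))


print(scaled_size((73, 29)))  # (146, 58)
```

The tuple it returns has the same shape as the one passed to `resize`, so it would keep working if the crop size changes.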
The last step is image recognition:
import pytesseract

# Image recognition
index = []
image = Image.open(str(path) + "zoom.jpg")
code = pytesseract.image_to_string(image)
if code:
    index.append(code)
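The string returned by OCR may contain spaces, newlines or thousands separators. That is an assumption about typical OCR output, not something shown in this article. A hedged sketch of a clean-up step, with `parse_index` as a hypothetical helper:

```python
def parse_index(text):
    """Keep only the ASCII digits of an OCR result; return None if there are none."""
    digits = "".join(ch for ch in text if ch in "0123456789")
    return int(digits) if digits else None


print(parse_index(" 12,345\n"))  # 12345
```

This only removes stray characters. It cannot repair a digit that the OCR engine misread, so a quick manual check of some results is still worthwhile.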
Final result:
Source code download: Demo