A Detailed Python 3 Example of Crawling the Baidu Index

Last Update:2017-01-18 Source: Internet

Author: User

Crawl the Baidu Index, then use image recognition (OCR) to read the index values

Objective:

Tuffo once said that the Baidu Index is hard to scrape, and that on Taobao it sells for 20 yuan per keyword:

Someone as capable as me is hardly going to be scared off by that, so I spent odd bits of time adding up to about two and a half days and got it done. Consider this a bit of friendly scorn for Tuffo.

Quite a few things need to be installed:

Google's image recognition engine, Tesseract-OCR

pip3 install pillow
pip3 install pyocr
selenium 2.45
Chrome 47.0.2526.106 m or Firefox 32.0.1
chromedriver.exe

For recognizing verification codes with image recognition, see: http://www.jb51.net/article/92287.htm

For Selenium usage, see: http://www.jb51.net/article/52329.htm

Baidu Index requires a login. The account and password are stored in the text file account.txt (user name on the first line, password on the second):

The general login code is as follows:

import time

from selenium import webdriver


# Open the browser and log in
def openbrowser():
    global browser

    # https://passport.baidu.com/v2/?login
    url = "https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F"
    # Open the browser: webdriver.Firefox() or webdriver.Chrome()
    browser = webdriver.Chrome()
    # Load the login page
    browser.get(url)
    # Give the browser time to open
    # print("Waiting 10 seconds for the browser to open...")
    # time.sleep(10)

    # Find the input boxes (id="TANGRAM__PSP_3__userName" and
    # id="TANGRAM__PSP_3__password") and clear them
    browser.find_element_by_id("TANGRAM__PSP_3__userName").clear()
    browser.find_element_by_id("TANGRAM__PSP_3__password").clear()

    # Read the account and password
    account = []
    try:
        fileaccount = open("../baidu/account.txt")
        accounts = fileaccount.readlines()
        for acc in accounts:
            account.append(acc.strip())
        fileaccount.close()
    except Exception as err:
        print(err)
        input("Please write the account and password correctly in account.txt")
        exit()
    browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0])
    browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1])

    # Click the login button, id="TANGRAM__PSP_3__submit"
    browser.find_element_by_id("TANGRAM__PSP_3__submit").click()

    # Wait for the login to complete
    # print("Waiting 10 seconds for login...")
    # time.sleep(10)
    print("Waiting for the page to load...")
    select = input("Check whether the browser has logged in (y/n): ")
    while 1:
        if select == "y" or select == "Y":
            print("Login succeeded!")
            print("Ready to open a new window...")
            # time.sleep(1)
            # browser.quit()
            break

        elif select == "n" or select == "N":
            selectno = input("Press 0 if the account or password is wrong, press 1 if a verification code appears...")
            # Wrong account or password: enter them again
            if selectno == "0":
                # Find the input boxes and clear them
                browser.find_element_by_id("TANGRAM__PSP_3__userName").clear()
                browser.find_element_by_id("TANGRAM__PSP_3__password").clear()

                # Read the account and password again
                account = []
                try:
                    fileaccount = open("../baidu/account.txt")
                    accounts = fileaccount.readlines()
                    for acc in accounts:
                        account.append(acc.strip())
                    fileaccount.close()
                except Exception as err:
                    print(err)
                    input("Please write the account and password correctly in account.txt")
                    exit()

                browser.find_element_by_id("TANGRAM__PSP_3__userName").send_keys(account[0])
                browser.find_element_by_id("TANGRAM__PSP_3__password").send_keys(account[1])
                # Click the login button, id="TANGRAM__PSP_3__submit"
                browser.find_element_by_id("TANGRAM__PSP_3__submit").click()

            elif selectno == "1":
                # The verification code input box has id="ap_captcha_guess"
                input("Enter the verification code in the browser and log in...")
                select = input("Check whether the browser has logged in (y/n): ")

        else:
            print("Please enter 'y' or 'n'!")
            select = input("Check whether the browser has logged in (y/n): ")

Login page:
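The account-reading step of the login code can be pulled out into a small helper. The sketch below assumes, as the login code does, that the first line of the file holds the user name and the second line the password. The name `read_account` and the error handling are illustrative and not part of the original script.

```python
def read_account(path):
    """Return (username, password) read from a two-line text file."""
    # The with-block closes the file even if reading fails
    with open(path, encoding="utf-8") as f:
        # Strip surrounding whitespace and skip blank lines
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected the user name on line 1 and the password on line 2")
    return lines[0], lines[1]
```

Raising an exception instead of calling `exit()` leaves it to the caller to decide whether to prompt again or stop.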

After logging in, a new window has to be opened for the Baidu Index, and the driver has to switch to it. In Selenium this is done as follows:

# Open a new window by executing JavaScript
js = 'window.open("http://index.baidu.com");'
browser.execute_script(js)
# Switch to the new window handle to enter the Baidu Index
# Get the handles of all currently open windows
# handles is a list
handles = browser.window_handles
# print(handles)
# Switch to the most recently opened window
browser.switch_to_window(handles[-1])
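The window-switching steps can be wrapped in one function. This is only a sketch: `open_in_new_window` is a hypothetical name, `driver` is assumed to be a Selenium WebDriver-like object, and, like the article's code, it assumes the last entry in `window_handles` is the newly opened window. Newer Selenium releases expose `switch_to.window`, while the older `switch_to_window` call is the one used in this article, so the helper tries the newer form first.

```python
import json


def open_in_new_window(driver, url):
    """Open url in a new window via JavaScript and switch the driver to it."""
    # json.dumps yields a safely quoted JavaScript string literal
    driver.execute_script("window.open(" + json.dumps(url) + ");")
    # Assumption: the newest window is the last handle in the list
    newest = driver.window_handles[-1]
    if hasattr(driver, "switch_to"):
        driver.switch_to.window(newest)
    else:
        driver.switch_to_window(newest)
    return newest
```

With a real driver the call would look like `open_in_new_window(browser, "http://index.baidu.com")`.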

Clear the input box, search for the keyword, and choose the number of days to click:

# Clear the input box
browser.find_element_by_id("schword").clear()
# Type the keyword to search for in the Baidu Index
browser.find_element_by_id("schword").send_keys(keyword)
# Click search
# <input type="submit" value="" id="searchWords" onclick="searchDemoWords()">
browser.find_element_by_id("searchWords").click()
time.sleep(2)
# Maximize the window
browser.maximize_window()
# Choose the number of days
sel = int(input("Press 0 for 7 days, 1 for 30 days, 2 for 90 days, 3 for half a year: "))
day = 0
if sel == 0:
    day = 7
elif sel == 1:
    day = 30
elif sel == 2:
    day = 90
elif sel == 3:
    day = 180
sel = '//a[@rel="' + str(day) + '"]'
browser.find_element_by_xpath(sel).click()
# Going too fast breaks things, so wait for the chart to load
time.sleep(2)
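The if/elif chain that maps the menu choice to a day count can also be written as a lookup table. A minimal sketch, assuming the same four choices as the prompt above; `DAY_CHOICES` and `day_link_xpath` are illustrative names that do not appear in the original script.

```python
# Menu choice -> number of days, as listed in the prompt
DAY_CHOICES = {0: 7, 1: 30, 2: 90, 3: 180}


def day_link_xpath(choice):
    """Return (day, xpath) for the link whose rel attribute equals the day count."""
    if choice not in DAY_CHOICES:
        raise ValueError("choice must be 0, 1, 2 or 3")
    day = DAY_CHOICES[choice]
    return day, '//a[@rel="%d"]' % day
```

Unlike the chain, this rejects an unknown choice up front instead of leaving `day` at 0 and building an XPath that matches nothing.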

The number-of-days links are here:

Find the chart's rectangle:

xoyelement = browser.find_elements_by_css_selector("#trend rect")[2]

The chart rectangle is:

The offset between data points depends on the range selected:

Take the 7-day view and look at the coordinates:

The x-coordinate of the first point is 1031.66666.

The x-coordinate of the second point is 1234.

So in the 7-day view the distance between two adjacent points is 202.33. The other ranges are worked out the same way.
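The arithmetic behind the 202.33 figure, written out as a small sketch. The two x-coordinates are the ones quoted above; the variable names and the estimate of the total span (seven points have six gaps between them) are added here for illustration.

```python
x_first = 1031.66666   # x-coordinate of the first point (from the text)
x_second = 1234        # x-coordinate of the second point (from the text)

# Distance between two adjacent points in the 7-day view
step_7 = x_second - x_first
print(round(step_7, 2))   # 202.33

# Seven points are separated by six gaps, so the points span about this many pixels
gaps_7 = 7 - 1
span_7 = step_7 * gaps_7
print(round(span_7, 1))   # 1214.0
```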

Use the Selenium library to simulate moving the mouse and hovering:

from selenium.webdriver.common.action_chains import ActionChains
ActionChains(browser).move_to_element_with_offset(xoyelement, x_0, y_0).perform()

But the point it lands on is this position:

That is the upper-left corner of the rectangle, where the JavaScript pop-up box is not triggered, so add 1 to the x-coordinate:

x_0 = 1
y_0 = 0

Write a loop that advances the x-coordinate according to the number of days:

# Loop over the number of days selected
for i in range(day):
    # Step rule for each range
    if day == 7:
        x_0 = x_0 + 202.33
    elif day == 30:
        x_0 = x_0 + 41.68
    elif day == 90:
        x_0 = x_0 + 13.64
    elif day == 180:
        x_0 = x_0 + 6.78
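The same stepping rule can be expressed as a function that returns every hover position in advance. This is a sketch under the article's numbers: the step values are copied from the loop above and the starting offset of 1 from the earlier snippet, while `STEP_BY_DAYS` and `hover_offsets` are hypothetical names.

```python
# Pixel step between adjacent points for each range, taken from the loop above
STEP_BY_DAYS = {7: 202.33, 30: 41.68, 90: 13.64, 180: 6.78}


def hover_offsets(day, start=1.0):
    """Return the x offset to hover at for each of the `day` data points."""
    step = STEP_BY_DAYS[day]
    # Multiplying avoids the rounding drift of repeated addition
    return [start + i * step for i in range(day)]


offsets = hover_offsets(7)
```

Each value in `offsets` would then be passed as the x offset to `move_to_element_with_offset`, with `y_0` unchanged.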

When the mouse moves, a box pops up. Find this box in the page:

Selenium automatic identification of ...:

# <div class="imgtxt" style="margin-left:-117px;"></div>
imgelement = browser.find_element_by_xpath('//div[@id="viewbox"]')

Then determine the position and size of this box:

# Find the coordinates of the box
locations = imgelement.location
print(locations)
# Find the size of the box
sizes = imgelement.size
print(sizes)
# Region that holds the index value: (left, upper, right, lower)
rangle = (int(locations['x']), int(locations['y']), int(locations['x'] + sizes['width']),
          int(locations['y'] + sizes['height']))

The captured image is:

The idea from here is:

1. Take a screenshot of the entire screen

2. Open the screenshot and crop it using the rangle coordinates above

But what gets cropped is the whole black box shown above, whereas the result I want is:

So rangle needs to be computed properly, but I was lazy: ignoring the length of the search keyword, I simply hard-coded it:

# Build the region that holds the index value
rangle = (int(locations['x'] + sizes['width'] / 3), int(locations['y'] + sizes['height'] / 2),
          int(locations['x'] + sizes['width'] * 2 / 3), int(locations['y'] + sizes['height']))

This is not a good way to write it. At the very least the length of the keyword should be taken into account, because a keyword that is too long shifts the screenshot coordinates. Anyway, I know how to do it; I am just not going to write it out for you!
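The hard-coded region keeps the middle third of the pop-up horizontally and its lower half vertically. A minimal sketch of that calculation as a function; `index_crop_box` is an illustrative name, and the location and size values in the example are made up, not measured from the real page.

```python
def index_crop_box(location, size):
    """Return (left, upper, right, lower) for the middle-third, lower-half region."""
    x, y = location["x"], location["y"]
    w, h = size["width"], size["height"]
    return (int(x + w / 3), int(y + h / 2), int(x + w * 2 / 3), int(y + h))


# Made-up example values, in the dict shape Selenium returns for location and size
box = index_crop_box({"x": 900, "y": 300}, {"width": 219, "height": 58})
print(box)  # (973, 329, 1046, 358)
```

It does not address the keyword-length problem described above; it only names the fractions so they can be adjusted in one place.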

The complete code for this step is:

from PIL import Image

# <div class="imgtxt" style="margin-left:-117px;"></div>
imgelement = browser.find_element_by_xpath('//div[@id="viewbox"]')
# Find the coordinates of the box
locations = imgelement.location
print(locations)
# Find the size of the box
sizes = imgelement.size
print(sizes)
# Build the region that holds the index value
rangle = (int(locations['x'] + sizes['width'] / 3), int(locations['y'] + sizes['height'] / 2),
          int(locations['x'] + sizes['width'] * 2 / 3), int(locations['y'] + sizes['height']))
# Take a screenshot of the current browser window
path = "../baidu/" + str(num)
browser.save_screenshot(str(path) + ".png")
# Open the screenshot and crop it
img = Image.open(str(path) + ".png")
jpg = img.crop(rangle)
# JPEG has no alpha channel, so convert to RGB before saving
jpg.convert("RGB").save(str(path) + ".jpg")

Later I found that the cropped picture is too small and the recognition accuracy is too low, so the picture needs to be enlarged:

# Double the size of the picture
# Original size: 73 x 29
jpgzoom = Image.open(str(path) + ".jpg")
(x, y) = jpgzoom.size
x_s = 146
y_s = 58
# Image.ANTIALIAS is named Image.LANCZOS in newer Pillow releases
out = jpgzoom.resize((x_s, y_s), Image.ANTIALIAS)
out.save(path + 'zoom.jpg', 'png', quality=95)
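The target size above is simply the original size doubled. A small sketch of computing it from the image size instead of hard-coding 146 and 58; `scaled_size` is a hypothetical helper, not part of the original code.

```python
def scaled_size(size, factor=2):
    """Return (width, height) multiplied by factor, rounded to whole pixels."""
    width, height = size
    return (int(round(width * factor)), int(round(height * factor)))


print(scaled_size((73, 29)))  # (146, 58)
```

In the resize call this would be used as `jpgzoom.resize(scaled_size(jpgzoom.size), ...)`, so a crop of a different size is still doubled correctly.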

To find the original size, right-click the image -> Properties -> Details. Mine is 73 pixels wide and 29 pixels high.

The last step is image recognition.

import pytesseract

# Image recognition
index = []
image = Image.open(str(path) + "zoom.jpg")
code = pytesseract.image_to_string(image)
if code:
    index.append(code)
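OCR output is a string and may carry stray whitespace or separators, so it is usually worth normalising before storing it. The sketch below is an assumption layered on the article's code: it treats the index as a non-negative integer and keeps only the digits, and `parse_index` is a hypothetical name.

```python
import re


def parse_index(ocr_text):
    """Return the OCR result as an int, or None if no digits were recognised."""
    # Drop everything that is not a digit (spaces, commas, newlines, stray marks)
    digits = re.sub(r"[^0-9]", "", ocr_text)
    return int(digits) if digits else None


print(parse_index(" 12,345\n"))  # 12345
```

Discarding non-digit characters can hide a misread character, so comparing a few results against the page by eye remains a sensible check.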

Final result:

Source download: Demo

That is the entire content of this article. I hope it helps with your studies, and I hope you will support the Yunqi Community.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
