Python crawler: using Selenium to crawl the kaikaidai credit blacklist

Source: Internet
Author: User

My first attempt at crawling the blacklist data with Selenium was not automated enough: the total number of pages and the number of records per page were variables set by hand, which is not very intelligent.

This version of the code improves on the following points:

(1) The page count is cut out of the pager text, so the number of pages is obtained automatically.

(2) The number of records on each page is detected automatically.

(3) Two lists are used to hold the data, which is easier to maintain.

(4) The data is now obtained with css_selector, a change from the earlier version.

(5) The code is written as functions, which is more standard.

(6) Exceptions are handled and the traceback is printed.

(7) Timeout: it was originally set to 30, but page loads then timed out and raised an exception, so it was changed to 120.
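Points (1) and (2) come down to reading numbers out of the pager text. The following is a minimal sketch of that idea, not the author's code: it assumes the pager text contains exactly four integers in the order current page, page count, records per page, total records. The sample string, `PagerInfo`, and `parse_pager` are hypothetical names made up for illustration.

```python
import re
from typing import NamedTuple


class PagerInfo(NamedTuple):
    current_page: int
    page_count: int
    per_page: int
    total_records: int


def parse_pager(text: str) -> PagerInfo:
    """Extract the four integers from a pager string (assumed layout)."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) != 4:
        raise ValueError(f"unexpected pager text: {text!r}")
    return PagerInfo(*numbers)


# Hypothetical sample text; the real page may word this differently.
info = parse_pager("page 1 / 38 pages / 10 per page / 378 records")
print(info.page_count, info.per_page)
```

Pulling every integer with a regular expression avoids depending on fixed character positions, at the cost of assuming that no other digits appear in the pager text.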

A digression: Selenium is very convenient, and its biggest advantage is that it handles dynamic web pages. The page crawled here is not dynamic, though, and Selenium is comparatively slow: crawling the 378 records took more than 400 seconds.
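To put the speed figure in perspective, here is a small sketch of how a crawl could be timed and the cost per record worked out. The helpers `timed` and `seconds_per_record` are hypothetical and are not part of the script; only the numbers 378 and 400 come from the text above.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def seconds_per_record(elapsed: float, records: int) -> float:
    """Average time spent per scraped record."""
    return elapsed / records


# Figures quoted in the text: 378 records in roughly 400 seconds.
rate = seconds_per_record(400, 378)
print(f"{rate:.2f} s per record")
```

With those figures the crawl averages a little over one second per record. A call such as `timed(parsepage)` would measure a real run in the same way.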

import csv
import traceback

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

url_whole = 'http://www.kaikaidai.com/Lend/Black.aspx'

# CSS path of the container that holds the pager and the record tables
MAIN = '#form1 > div:nth-child(11) > div > div.jklb_bkd > div.main'


# Read the text of the cell at (row, col) in the idx-th record table
def cell(record, idx, row, col, tail=''):
    selector = (MAIN + ' > table:nth-child(' + str(idx) + ') > tbody'
                + ' > tr:nth-child(' + str(row) + ')'
                + ' > td:nth-child(' + str(col) + ')' + tail)
    return record.find_element(By.CSS_SELECTOR, selector).text


# Load all pages
def parsepage():
    # Set up the browser driver
    browser = webdriver.Chrome()
    # Set the page-load timeout in seconds (30 timed out, so 120 is used)
    browser.set_page_load_timeout(120)
    # Open the URL
    browser.get(url_whole)
    # Find out how many pages there are
    page_info = browser.find_element(
        By.CSS_SELECTOR, MAIN + ' > div > table > tbody > tr > td:nth-child(1)')
    # Cut the page count out of the pager text, which is split on '/':
    # page 1 / 38 pages in all / 10 per page / 378 in total
    pages = page_info.text.split('/')[1]
    pages = int(pages[1:3])
    # Walk through every page
    list_data = []
    for page in range(pages):
        # Type the page number into the pager input and press Enter
        elem = browser.find_element(By.NAME, 'Rpmessage')
        elem.send_keys(str(page))
        elem.send_keys(Keys.RETURN)
        # Find how many records this page has
        records = browser.find_element(By.CSS_SELECTOR, MAIN).find_elements(
            By.CLASS_NAME, 'hmd_ytab')
        # page_datas = loadrecords(records)
        idx = 1
        for record in records:
            idx += 1
            try:
                # Use CSS selectors to read each field
                name = cell(record, idx, 1, 3, ' > a')
                hid = cell(record, idx, 2, 2)
                email = cell(record, idx, 1, 5)
                homenumber = cell(record, idx, 2, 4)
                numofloan = cell(record, idx, 1, 7)
                numofkai = cell(record, idx, 2, 6)
                address = cell(record, idx, 3, 2)
                mobilephone = cell(record, idx, 3, 4)
                daysofloan = cell(record, idx, 3, 6)
                companyname = cell(record, idx, 4, 2)
                totalamount = cell(record, idx, 4, 6)
                companyaddress = cell(record, idx, 5, 3)
                data = [name, hid, email, homenumber, numofkai, numofloan,
                        address, mobilephone, daysofloan, companyname,
                        companyaddress, totalamount]
                list_data.append(data)
            except Exception:
                traceback.print_exc()
                # print(record.text)
    print(len(list_data))
    browser.quit()
    return list_data


# Write the CSV file
def writecsv(list_data):
    filepath = 'C:\\USERS\\DESKTOP\\PYWORK\\KKD\\KKD.CSV'
    # title = ['name', 'hid', 'email', 'homenumber', 'numofkai', 'numofloan', 'address', 'mobilephone', 'daysofloan', 'companyname', 'companyaddress', 'totalamount']
    with open(filepath, "w+", newline="") as csvfile:
        writer = csv.writer(csvfile)
        # Write the column names first
        # writer.writerow(title)
        # Use writerows to write multiple rows
        writer.writerows(list_data)


def main():
    list_data = parsepage()
    writecsv(list_data)


if __name__ == "__main__":
    main()
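The script leaves the header row commented out. As a hedged illustration of what enabling it might look like, the sketch below writes a header followed by the data rows with `csv.writer`. The column names follow the commented-out `title` list; `write_rows`, the in-memory buffer, and the placeholder row are assumptions made up for this example, and no real scraped data is shown.

```python
import csv
import io

# Column names taken from the commented-out title list in the script
COLUMNS = ['name', 'hid', 'email', 'homenumber', 'numofkai', 'numofloan',
           'address', 'mobilephone', 'daysofloan', 'companyname',
           'companyaddress', 'totalamount']


def write_rows(stream, rows, header=COLUMNS):
    """Write one header row, then every data row, to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(header)   # a single row
    writer.writerows(rows)    # many rows at once


# Placeholder row with made-up values, one per column
sample_row = [f"value{i}" for i in range(len(COLUMNS))]

buffer = io.StringIO(newline="")
write_rows(buffer, [sample_row])
lines = buffer.getvalue().splitlines()
print(lines[0])
```

The same function accepts a real file opened with `newline=""`, which is how the script above opens its output file.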

The crawled data is correct, but because it contains private information it is not shown here.
