Use Selenium + Chrome to crawl the vulnerability articles disclosed on a WooYun site and save them as PDF files

Source: Internet
Author: User
Tags: urlencode, xpath, blank page, wkhtmltopdf

Purpose: Use Selenium + Chrome to crawl all disclosed vulnerability articles of one specified type from a WooYun site. The vulnerability type (for example, "unauthorized") is typed into the Windows 10 terminal. The script then crawls every article of that type, creates one folder per results page named after the page number, and saves all the articles from that page into it.
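As a rough illustration of how the typed vulnerability type might be turned into a search URL, here is a minimal sketch. It is written for Python 3 (the article's script targets Python 2), the function name build_search_url is hypothetical, and it joins parameters with a single "&" rather than the doubled "&&" used in the script. The host and parameter names are taken from the article's code.

```python
from urllib.parse import quote, urlencode

# Search endpoint taken from the article's code.
SEARCH_URL = "http://wooyun.jozxing.cc/search"


def build_search_url(keyword, page=1):
    """Return the search URL for one result page of a vulnerability type."""
    params = {
        "keywords": keyword,
        "content_search_by": "by_bugs",
        "search_by_html": "false",
        "page": page,
    }
    # urlencode percent-encodes every value, using UTF-8 for non-ASCII text.
    return SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # quote() encodes a single string; urlencode() encodes a whole dict.
    print(quote("编辑器漏洞"))
    print(build_search_url("编辑器漏洞", page=2))
```

quote() is the tool for a single string and urlencode() for a dictionary of parameters, which matches the distinction drawn in the script's comments.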

Summary: This example crawls all articles for a single vulnerability type only; it does not crawl several types in one run. Occasional small bugs can make the crawl crash partway through, in which case the code has to be adjusted by hand and the crawl restarted. Solutions to other problems are described in the code comments. Known unresolved issues:

    1. Handling Chinese text in Python code on Windows is not fully worked out. See "Summary of Chinese encoding problems in Python on Windows".
    2. The timeout error "TimeoutException: Message: timeout" is not resolved.
    3. The example only crawls all articles for the one vulnerability type that is entered. Crawling several types in one run (for example "unauthorized", "SQL") was attempted by importing paramunittest, but this was not made to work.
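For the timeout problem, one possible mitigation is to bound the number of retries instead of retrying once by hand. The helper below is a hypothetical sketch in Python 3, not the author's method. It does not remove the underlying timeout; it only repeats a call a fixed number of times and re-raises the last error.

```python
import time


def call_with_retries(action, attempts=3, delay=0.0, retry_on=(Exception,)):
    """Call action() until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of tries: let the caller see the last error
            time.sleep(delay)  # short pause before the next try
```

With Selenium, the call might look like call_with_retries(lambda: driver.get(url), attempts=3, delay=2, retry_on=(TimeoutException,)), where TimeoutException comes from selenium.common.exceptions and driver and url are whatever the crawl is using.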
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import unittest
import time
from lxml import etree
import urllib2
import pdfkit
import random
import os
import shutil
import urllib
# import chardet
import paramunittest


# @paramunittest.parametrized(
#     {'user': 'Editor vulnerability', 'result': 'true'},
#     {'user': 'Unauthorized', 'result': 'true'},
# )
class WooyunSelenium(unittest.TestCase):
# class TestDemo(unittest.TestCase):
    '''
    Use a unittest test class to handle JS pagination by simulating
    a click on the "next page" link.
    '''

    def setUp(self):
    # def setParameters(self, user, result):
        '''
        Initialization method (fixed name); prepares the environment for the test.
        '''
        # self.user = user
        # self.result = result
        # Create a Chrome browser object.
        self.driver = webdriver.Chrome()
        # Create a headless browser object. In testing it was faster than
        # Chrome, but not by much. It has no browser window.
        # self.driver = webdriver.PhantomJS()
        # With input() (Python 2), a string must be typed in double quotes and
        # a number is typed directly. raw_input() always returns a string,
        # whatever is typed.
        # vname = input('Enter the type of vulnerability to query: ')   'Editor Vulnerability'
        # Text typed into raw_input needs no double quotes.
        self.vname = raw_input('Enter the type of vulnerability to query: ')
        # The value is of type str; detect the encoding of the string.
        # print chardet.detect(vname)
        # URL encoding: use urllib.quote for a string containing Chinese and
        # urllib.urlencode for a dictionary containing Chinese.
        # print type(urllib.quote('editor vulnerability'))
        # print urllib.quote('editor vulnerability')
        # Chinese text entered on Windows is GBK-encoded by default, so it has
        # to be decoded first to avoid garbled characters.
        # print self.vname
        os.mkdir(self.vname.decode('GBK'))
        os.chdir(self.vname.decode('GBK'))
        # The decoded text must be re-encoded as UTF-8 before quoting, so that
        # the response content can be fetched and parsed successfully.
        self.vname = urllib.quote(self.vname.decode('GBK').encode('utf-8'))
        self.page = 1
        # self.vname is a string, so urllib.quote() is used to URL-encode it.
        # A dictionary would be handled with urllib.urlencode().
        # urllib.quote(self.vname)
        # Request the page here. If this line were placed in testWooyun, data
        # would be duplicated and missed. vname must be URL-encoded, otherwise:
        # UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 41: invalid start byte
        self.driver.get("http://wooyun.jozxing.cc/search?keywords=" + self.vname + "&&content_search_by=by_bugs&&search_by_html=false&&page=" + str(self.page))
        # self.driver.get('http://wooyun.jozxing.cc/search?keywords=%E7%BC%96%E8%BE%91%E5%99%A8%E6%BC%8F%E6%B4%9E&content_search_by=by_bugs')

    def testWooyun(self):
    # def testcase(self):
        '''
        The actual test case (the method name must start with "test").
        '''
        # The u prefix handles garbled Chinese, but it always failed with
        # PhantomJS, while Chrome had no such problem.
        # os.mkdir(u'Code execution')
        # os.chdir(u'Command execution')
        # The outer loop controls the page number. No vulnerability type was
        # observed to have as many as 1000 pages, so this bound covers them all.
        for i in range(self.page, 1000):
            # Create a folder, named after the page number, for this page's articles.
            os.mkdir(str(i))
            # Let the page load completely to avoid losing data.
            time.sleep(3)
            # Get the page source.
            html = self.driver.page_source
            # Parse the source into an HTML DOM document.
            content = etree.HTML(html)
            # Use XPath to match all article links.
            links = content.xpath('//td[2]/a/@href')
            n = 0
            # Iterate over the article links on this page.
            for each in links:
                each = 'http://wooyun.jozxing.cc/' + each
                print each
                try:
                    # webdriver.Chrome() sometimes raised an error and the page
                    # was blank. It is unclear whether the blank page was the
                    # article page or the listing page; the code suggests the
                    # article page request came back empty. Around request
                    # number 234 a blank page appeared and the run seemed to
                    # hang. Workaround: put the Chrome driver in the Scripts
                    # directory of the Python installation.
                    self.driver2 = webdriver.Chrome()
                    # self.driver2 = webdriver.PhantomJS()
                    # Setting the two timeouts below was found to remove this error:
                    # Unable to read VR Path Registry from C:\Users\hp\AppData\Local\openvr\openvrpaths.vrpath
                    self.driver2.set_page_load_timeout(10)
                    self.driver2.set_script_timeout(10)
                    # self.driver2.implicitly_wait(10)
                    self.driver2.get(each)
                    html2 = self.driver2.page_source
                    content2 = etree.HTML(html2)
                    # Get the article title.
                    title = content2.xpath("//h3[@class='wybug_title']/text()")[0]
                except Exception:
                    # If the page could not be fetched (blank page), request it
                    # once more (this could be repeated or looped). The cause
                    # may be the site's anti-crawling mechanism.
                    # close() sometimes has no effect, so use quit() to close
                    # the article browser window.
                    self.driver2.quit()
                    # Request the article page again.
                    self.driver2 = webdriver.Chrome()
                    # self.driver2.implicitly_wait(10)
                    self.driver2.set_page_load_timeout(10)
                    self.driver2.set_script_timeout(10)
                    # self.driver2 = webdriver.PhantomJS()
                    # Unresolved: this keeps raising TimeoutException: Message: timeout
                    self.driver2.get(each)
                    html2 = self.driver2.page_source
                    content2 = etree.HTML(html2)
                    title = content2.xpath("//h3[@class='wybug_title']/text()")[0]
                # Build the file name to save under. Windows forbids the
                # characters / \ ? | < > " * : in file names, so filter them.
                filename = title[6:].strip().replace('"', '').replace('/', '_').replace('\\', '_').replace('<', '').replace('>', '').replace('(', '').replace(')', '').replace('[', '').replace(']', '').replace(';', '').replace('*', '').replace('?', '').replace(':', '').replace('|', '').replace('.', '')
                # file = filename + ".pdf"
                n += 1
                # Initial file name.
                file1 = str(n) + '.pdf'
                # Final file name.
                file2 = filename + '.pdf'
                try:
                    path_wk = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
                    config = pdfkit.configuration(wkhtmltopdf=path_wk)
                    pdfkit.from_url(each, file1, configuration=config)
                except Exception:
                    # Retry the conversion once.
                    path_wk = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
                    config = pdfkit.configuration(wkhtmltopdf=path_wk)
                    pdfkit.from_url(each, file1, configuration=config)
                self.driver2.quit()
                # m is the counter used to tell apart files with the same name.
                m = 1
                # Titles can repeat, so a recursive method appends an increasing
                # number to the file name, e.g. a.pdf, a2.pdf.
                self.modify_filename(file1, file2, filename, m)
                # time.sleep(random.randint(1, 3))
            # Move all article files from the current page into its page folder.
            for d in os.listdir('.'):
                if d.split('.')[-1] == 'pdf':
                    shutil.move(d, str(i))
            # Loop exit condition: find() returns -1 when the class name is not
            # in the page source, meaning "next page" is still clickable. Any
            # other value means the link is disabled and this is the last page.
            if html.find('next disabled') != -1:
                break
            # Simulate a click on "next page". PhantomJS did not seem to support
            # paging, although in some tests it did. The reason is unclear.
            self.driver.find_element_by_xpath('//li[@class="next"]/a').click()
            # self.assertTrue(self.result == "true")

    def modify_filename(self, file1, file2, filename, m):
        '''
        Rename a file. If files with the same name already exist, append a
        number to the file name, starting with 2. Implemented recursively.
        '''
        if os.path.exists(file2):
            m += 1
            file2 = filename + str(m) + '.pdf'
            self.modify_filename(file1, file2, filename, m)
        else:
            os.rename(file1, file2)
            return

    def tearDown(self):
        '''
        Exit method (fixed name); cleans up the test environment for the next test.
        '''
        self.driver.quit()


if __name__ == '__main__':
    unittest.main()
    # unittest.main(verbosity=2)
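The file-name filtering and the recursive renaming in the script can also be expressed with a regular expression and a loop. The following is a Python 3 sketch of that variant, not the author's code. The function names are hypothetical, and unlike the script it replaces every forbidden character with "_" instead of deleting most of them.

```python
import os
import re

# Characters that Windows does not allow in file names.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def sanitize_filename(title):
    """Replace characters that Windows forbids in file names with '_'."""
    return _FORBIDDEN.sub("_", title).strip()


def unique_pdf_path(directory, stem):
    """Return a '.pdf' path inside `directory` that does not exist yet.

    The first candidate is 'stem.pdf'; later ones are 'stem2.pdf',
    'stem3.pdf', and so on, mirroring the a.pdf / a2.pdf scheme.
    """
    candidate = os.path.join(directory, stem + ".pdf")
    counter = 1
    while os.path.exists(candidate):
        counter += 1
        candidate = os.path.join(directory, "%s%d.pdf" % (stem, counter))
    return candidate
```

A loop avoids deep recursion when many articles share a title, and returning the path (rather than renaming inside the helper) lets the caller decide whether to write the PDF there directly.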
