[Crawler exercise] Crawling the public vulnerability list and details from the LY.com (Tongcheng) Security Emergency Response Center
As shown in the following figure:
Today, with some free time on my hands, I remembered that the LY.com (Tongcheng) SRC has a public vulnerability module, and I felt the urge to write a crawler for its vulnerability list. There are two versions: a single-threaded one and a multi-threaded one.
Single-threaded version:
# coding=utf-8
import requests
import re

print "\033[0;31m"
print "[@Author: poacher]"
print "\033[0m"

bug_url = "http://sec.ly.com/bugs"

# First collect the paging URLs of the public vulnerability list: they sit in
# href attributes, so match them all with a regex and request each in a loop.
page = requests.get(bug_url)
pattern = re.compile('<a class="ui button" href="(.*?)">', re.S)
page_text = re.findall(pattern, page.content)

# Remove duplicate URLs.
page_url = list(set(page_text))

# Walk every list page, extracting each vulnerability's title and link.
bugs_count = 0
for i in page_url:
    bugs_list = requests.get(i)
    bugs_list_pattern = re.compile('<a title="(.*?)".*?href="(.*?)">', re.S)
    bugs_list_content = re.findall(bugs_list_pattern, bugs_list.content)
    bugs_count = bugs_count + len(bugs_list_content)
    for bugs_list in bugs_list_content:
        # Strip slashes from the title so file creation does not fail.
        bugs_title = bugs_list[0].replace('/', '')
        bugs_content = requests.get("http://sec.ly.com/" + bugs_list[1])
        filename = "./bugs_list/" + str(bugs_title) + ".html"
        html = open(filename.decode('utf-8'), 'w')
        html.write(str(bugs_content.content))
        html.close()
        print "Crawling: " + bugs_title
print "Finished. Total: " + str(bugs_count) + " vulnerabilities."
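The core of the page-discovery step is `re.findall` on the pager anchors followed by `set()` to drop duplicates. A small self-contained Python 3 sketch of just that step (the HTML snippet here is made up for illustration; the real page markup may differ):

```python
import re

# Hypothetical HTML fragment mimicking the pager anchors the crawler parses.
html = (
    '<a class="ui button" href="/bugs?page=1">1</a>'
    '<a class="ui button" href="/bugs?page=2">2</a>'
    '<a class="ui button" href="/bugs?page=1">prev</a>'
)

pattern = re.compile(r'<a class="ui button" href="(.*?)">', re.S)
pages = re.findall(pattern, html)    # one capture per matching anchor, in order
unique_pages = sorted(set(pages))    # de-duplicate; sort for a stable order
print(unique_pages)
```

Note that `set()` does not preserve order, which is why the sketch sorts the result; the original script simply iterates the set in whatever order it comes.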
Multi-threaded version:
The multi-threaded version uses threading plus a Queue.
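The pattern is: fill a queue with URLs, start N worker threads that each pull from the queue until it is drained, then join them all. A minimal Python 3 skeleton of that pattern (`queue.Queue` is the Python 3 name for the `Queue` module used below; the doubling "work" is a stand-in for fetching and saving a page). One deliberate difference from the script below: the sketch uses `get_nowait()` and catches `queue.Empty`, because checking `empty()` and then calling `get()` in two steps can race between threads:

```python
import queue
import threading

def worker(q, results, lock):
    # Drain the queue until it is empty. get_nowait() raises queue.Empty
    # when nothing is left, avoiding the empty()/get() race between threads.
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append(item * 2)  # stand-in for "fetch and save a page"

q = queue.Queue()
for n in range(10):
    q.put(n)

results = []
lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(q, results, lock)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # every queued item processed exactly once
```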
# coding=utf-8
import requests
import re
import threading
import time
import Queue

print "\033[0;31m"
print "[@Author: poacher]"
print "\033[0m"

bug_url = "http://sec.ly.com/bugs"

# First collect the paging URLs of the public vulnerability list: they sit in
# href attributes, so match them all with a regex and request each in a loop.
page = requests.get(bug_url)
pattern = re.compile('<a class="ui button" href="(.*?)">', re.S)
page_text = re.findall(pattern, page.content)

# Remove duplicate URLs.
page_url = list(set(page_text))

class Spider_bugs(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        start_time = time.time()
        while not self._queue.empty():
            bugs_list = requests.get(self._queue.get())
            bugs_list_pattern = re.compile('<a title="(.*?)".*?href="(.*?)">', re.S)
            bugs_list_content = re.findall(bugs_list_pattern, bugs_list.content)
            for bugs_list in bugs_list_content:
                # Strip slashes from the title so file creation does not fail.
                bugs_title = bugs_list[0].replace('/', '')
                bugs_content = requests.get("http://sec.ly.com/" + bugs_list[1])
                filename = "./bugs_list/" + str(bugs_title) + ".html"
                html = open(filename.decode('utf-8'), 'w')
                html.write(str(bugs_content.content))
                html.close()
                print "Crawling: " + bugs_title + " at " + time.strftime('%H:%M:%S')
        print "Total time consumed: " + str(time.time() - start_time)

def main():
    threads = []
    thread_count = 7
    queue = Queue.Queue()
    for url in page_url:
        queue.put(url)
    for i in range(thread_count):
        threads.append(Spider_bugs(queue))
    for i in threads:
        i.start()
    for i in threads:
        i.join()

if __name__ == '__main__':
    main()
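One small hardening worth considering: both versions only strip `/` from the title before using it as a filename, but other characters (`:`, `*`, `?`, `"` and so on) can also break file creation, especially on Windows. A hedged Python 3 sketch of a stricter sanitizer (`safe_filename` is a hypothetical helper, not part of the original script):

```python
import re

def safe_filename(title):
    # Replace every character that is unsafe in file names, not just '/'.
    # Falls back to 'untitled' if nothing printable survives.
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'

print(safe_filename('SQL injection in /admin/login'))  # SQL injection in _admin_login
```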
Sharing is learning. If there is anything wrong with the code above, please point it out so we can learn from each other. I also hope more experienced readers can suggest optimizations. Thank you!