Python simple web crawler + HTML body extraction


Today I put together a BFS crawler with HTML body extraction. The function still has some limitations. For the body-extraction approach, see http://www.fuxiang90.me/2012/02/%E6%8A%BD%E5%8F%96html-%E6%AD%A3%E6%96%87/

  • Currently only HTTP URLs are crawled, and the crawler has only been tested on the intranet, since the connection to the outside Internet is not fast.
  • There is a global URL queue and a global URL set. The queue makes the BFS implementation convenient, and the set ensures that pages are not crawled repeatedly. The overall process and principle are quite simple (see the sketch after this list).
  • The crawler is single-threaded, so it is relatively slow. Later I will consider multithreading, so that crawling pages, extracting URLs, and extracting the body can run concurrently.
  • The crawler design follows https://www.ibm.com/developerworks/cn/opensource/os-cn-crawler/. Besides extracting the URLs in each page, I also extract the body text, so that an index can be built later, which is convenient for Chinese word segmentation.
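The crawl order is just a breadth-first traversal over links: pop a URL from the queue, fetch the page, and push every URL on it that is not already in the set. Here is a minimal sketch of that loop, simplified from the full listing below; the crude href regex stands in for the real BeautifulSoup parsing:

import Queue
import re
import urllib2

url_queue = Queue.Queue()
url_seen = set()

def extract_links(page):
    # crude link extraction for the sketch; the real code uses BeautifulSoup
    return re.findall(r'href="(http://[^"]+)"', page)

def bfs_crawl(start_url, max_pages=100):
    url_queue.put(start_url)
    url_seen.add(start_url)
    count = 0
    while not url_queue.empty() and count < max_pages:
        url = url_queue.get()
        try:
            page = urllib2.urlopen(url).read()
        except Exception:
            continue  # skip pages that time out or cannot be reached
        for link in extract_links(page):
            if link not in url_seen:  # the set prevents re-crawling
                url_seen.add(link)
                url_queue.put(link)
        count += 1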

The code pasted here may have problems (HTML tags may have leaked into it during posting); please see http://www.fuxiang90.me/?p=728 for the original.

# encoding: utf-8
# Use BeautifulSoup to get the a | p | font context.
# Single-threaded version: crawls HTML breadth-first and extracts the body text,
# but a single thread is a little slow.
# You can use this code freely, but keep the following line.
# Author: fuxiang, mail: fuxiang90@gmail.com

from BeautifulSoup import BeautifulSoup  # for processing HTML
import urllib2
import os
import sys
import re
import Queue
import socket
import time

socket.setdefaulttimeout(8)

g_url_queue = Queue.Queue()
g_url_queue.put('http://www.bupt.edu.cn/')
tt = ['http://www.bupt.edu.cn/']
g_url_set = set(tt)
max_deep = 1


# The input parameter is a soup object; this extracts the URLs from it.
def get_url_list(html):
    global g_url_set
    re_html = r'(http://(\w+\.)+\w+)'
    res = html.findAll('a')  # find all <a> tags
    for x in res:
        t = unicode(x)       # here x is a soup object
        m = re.findall(re_html, t)
        if not m:
            continue
        for xx in m:
            str_url = xx[0]
            if str_url not in g_url_set:
                g_url_queue.put(str_url)
                g_url_set |= set([str_url])


######################################################
def strip_tags(html):
    """Filter HTML tags out of a string.
    >>> str_text = strip_tags("<font color=red>hello</font>")
    >>> print str_text
    hello
    """
    from HTMLParser import HTMLParser
    html = html.strip()
    html = html.strip("\n")
    result = []
    parser = HTMLParser()
    parser.handle_data = result.append
    parser.feed(html)
    parser.close()
    return ''.join(result)


######################################################
# You can pass in a URL or a local file name; this parses out the body text.
def get_context(url):
    re_html = r'http[s]?://[a-zA-Z0-9]+\.[a-zA-Z0-9]+\.[a-zA-Z0-9]+'
    m = re.match(re_html, str(url))
    if m is None:
        # the URL is a local file
        fp = open(unicode(url), 'r')
    else:
        fp = urllib2.urlopen(url)
    html = fp.read()
    soup = BeautifulSoup(html)
    allfonttext = soup.findAll(['a', 'p', 'font'])
    if len(allfonttext) <= 0:
        print 'not found text'
    fwrite = open('u' + str(url), 'w')
    for i in allfonttext:
        t = i.renderContents()
        context = strip_tags(t)
        fwrite.write(context)
    fwrite.close()


######################################################
def main_fun(deep):
    global g_url_set
    global g_url_queue
    if deep > max_deep:
        return
    count = 0
    print 'debug'
    while g_url_queue.empty() is not True:
        print 'debug'
        l_url = g_url_queue.get()
        print l_url
        # catch timeout errors -- some pages cannot be reached
        try:
            fp = urllib2.urlopen(l_url)
        except Exception:
            continue
        html = fp.read()
        fwrite = open(str(count + 1), 'w')
        fwrite.write(html)
        fwrite.close()
        soup = BeautifulSoup(html)
        get_url_list(soup)
        get_context(count + 1)
        count += 1
        if count >= 100:
            return


# uncomplete
def get_html_page(url):
    furl = urllib2.urlopen(url)
    html = furl.read()
    soup = BeautifulSoup(html)


if __name__ == "__main__":
    main_fun(1)
    time.sleep(10)
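As a quick usage check: get_context accepts either a URL or the name of a file that the crawl loop has already saved to disk, and main_fun uses the latter form. A minimal example, assuming the crawl loop has already written page 1 to a local file named "1":

# extract the body text of the locally saved page "1";
# the extracted text is written to a file named "u1"
get_context(1)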

Now I want a multi-threaded version, so that downloading pages and analyzing the HTML (extracting the body and the URLs) can proceed concurrently. After a simple modification to the code above it barely runs; the main change is adding threading and lock control around access to the global queues. Since I had not written multi-threaded code before, suggestions are welcome.
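The locking pattern itself is small: every access to a shared queue is wrapped in acquire/release on the corresponding lock. Here is an isolated sketch of that pattern, with stand-alone copies of the locks and queues mirroring the names in the full listing below:

import Queue
import threading

# stand-alone copies of the shared objects, named as in the listing below
queue_lock = threading.RLock()
file_lock = threading.RLock()
g_url_queue = Queue.Queue()
g_file_queue = Queue.Queue()

def hand_over(saved_file_id):
    """Pop one URL under queue_lock, then publish the saved file id under file_lock."""
    queue_lock.acquire()
    try:
        l_url = g_url_queue.get()
    finally:
        queue_lock.release()
    # ... download l_url and write it to disk as file saved_file_id ...
    file_lock.acquire()
    try:
        g_file_queue.put(saved_file_id)
    finally:
        file_lock.release()
    return l_url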

# encoding: utf-8
# Use BeautifulSoup to get the a | p | font context.
# Multi-threaded version: downloading pages and parsing them run in separate threads.
# You can use this code freely, but keep the following line.
# Author: fuxiang, mail: fuxiang90@gmail.com

from BeautifulSoup import BeautifulSoup  # for processing HTML
import urllib2
import os
import sys
import re
import Queue
import socket
import time
import threading

queue_lock = threading.RLock()
file_lock = threading.RLock()

socket.setdefaulttimeout(8)

g_url_queue = Queue.Queue()
g_url_queue.put('http://www.bupt.edu.cn/')
g_file_queue = Queue.Queue()
tt = ['http://www.bupt.edu.cn/']
g_url_set = set(tt)
max_deep = 1


######################################################
def strip_tags(html):
    """Filter HTML tags out of a string.
    >>> str_text = strip_tags("<font color=red>hello</font>")
    >>> print str_text
    hello
    """
    from HTMLParser import HTMLParser
    html = html.strip()
    html = html.strip("\n")
    result = []
    parser = HTMLParser()
    parser.handle_data = result.append
    parser.feed(html)
    parser.close()
    return ''.join(result)


def get_context(soup, url):
    allfonttext = soup.findAll(['a', 'p', 'font'])
    if len(allfonttext) <= 0:
        print 'not found text'
    fwrite = open('u' + str(url), 'w')
    for i in allfonttext:
        t = i.renderContents()
        context = strip_tags(t)
        fwrite.write(context)
    fwrite.close()


class get_page_thread(threading.Thread):
    """Downloads pages from g_url_queue and saves them to numbered files."""

    def __init__(self, name):
        threading.Thread.__init__(self)
        self.t_name = name

    def run(self):
        global g_url_set
        global g_url_queue
        global g_file_queue
        count = 0
        print 'debug'
        while g_url_queue.empty() is not True:
            print self.t_name
            # take a URL off the shared queue under the lock
            queue_lock.acquire()
            l_url = g_url_queue.get()
            queue_lock.release()
            print l_url
            # catch timeout errors -- some pages cannot be reached
            try:
                fp = urllib2.urlopen(l_url)
            except Exception:
                continue
            html = fp.read()
            fwrite = open(str(count + 1), 'w')
            fwrite.write(html)
            fwrite.close()
            # tell the parsing thread about the newly saved file
            file_lock.acquire()
            g_file_queue.put(count + 1)
            file_lock.release()
            count += 1
            if count >= 100:
                return


class get_url_list_thread(threading.Thread):
    """Parses saved pages from g_file_queue, extracting the body text and new URLs."""

    def __init__(self, name):
        threading.Thread.__init__(self)
        self.t_name = name

    def run(self):
        global g_url_set
        global g_file_queue
        global queue_lock
        global file_lock
        while g_file_queue.empty() is not True:
            file_lock.acquire()
            filename = g_file_queue.get()
            file_lock.release()
            fd = open(str(filename), 'r')
            html = fd.read()
            soup = BeautifulSoup(html)
            get_context(soup, filename)
            re_html = r'(http://(\w+\.)+\w+)'
            res = soup.findAll('a')  # find all <a> tags
            for x in res:
                t = unicode(x)       # here x is a soup object
                m = re.findall(re_html, t)
                if not m:
                    continue
                for xx in m:
                    str_url = xx[0]
                    if str_url not in g_url_set:
                        queue_lock.acquire()
                        g_url_queue.put(str_url)
                        queue_lock.release()
                        g_url_set |= set([str_url])


# uncomplete
def get_html_page(url):
    furl = urllib2.urlopen(url)
    html = furl.read()
    soup = BeautifulSoup(html)


if __name__ == "__main__":
    thread1 = get_page_thread('A')
    thread2 = get_url_list_thread('B')
    thread3 = get_page_thread('C')
    thread4 = get_page_thread('D')
    thread1.start()
    time.sleep(20)
    thread2.start()
    time.sleep(20)
    thread3.start()
    thread4.start()
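One design note, offered as a suggestion rather than the author's method: Queue.Queue is already internally synchronized, so the explicit RLock around single get/put calls is not strictly required, and a blocking get with a timeout also avoids the race between empty() and get() when several threads drain the same queue. A minimal sketch of that variant, with the queue passed in to keep the fragment self-contained:

import Queue

def worker_loop(g_url_queue):
    while True:
        try:
            # a blocking get with a timeout replaces the empty() check and
            # needs no extra lock -- Queue.Queue is itself thread-safe
            l_url = g_url_queue.get(timeout=5)
        except Queue.Empty:
            break  # nothing arrived for a while; let the thread exit
        # ... download l_url and push the saved file id onto the file queue ...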
