Python simple web crawler + HTML body extraction


Today I put together a BFS crawler with HTML body extraction. The function still has some limitations. For the body-extraction approach, see http://www.fuxiang90.me/2012/02/%E6%8A%BD%E5%8F%96html-%E6%AD%A3%E6%96%87/

  • Currently only HTTP URLs are crawled, and the crawler has only been tested on the intranet, since the connection to the outside Internet is not fast.
  • There is a global URL queue and a global URL set. The queue makes the BFS implementation convenient, and the set ensures that pages are not crawled repeatedly. The overall process and principle are quite simple (see the sketch after this list).
  • The crawler is single-threaded, so it is relatively slow. Later I will consider multithreading, so that crawling pages, extracting URLs, and extracting the body can run concurrently.
  • The crawler design follows https://www.ibm.com/developerworks/cn/opensource/os-cn-crawler/. Besides extracting the URLs in each page, I also extract the body text, so that an index can be built later, which is convenient for Chinese word segmentation.
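The crawl order is just a breadth-first traversal over links: pop a URL from the queue, fetch the page, and push every URL on it that is not already in the set. Here is a minimal sketch of that loop, simplified from the full listing below; the crude href regex stands in for the real BeautifulSoup parsing:

import Queue
import re
import urllib2

url_queue = Queue.Queue()
url_seen = set()

def extract_links(page):
    # crude link extraction for the sketch; the real code uses BeautifulSoup
    return re.findall(r'href="(http://[^"]+)"', page)

def bfs_crawl(start_url, max_pages=100):
    url_queue.put(start_url)
    url_seen.add(start_url)
    count = 0
    while not url_queue.empty() and count < max_pages:
        url = url_queue.get()
        try:
            page = urllib2.urlopen(url).read()
        except Exception:
            continue  # skip pages that time out or cannot be reached
        for link in extract_links(page):
            if link not in url_seen:  # the set prevents re-crawling
                url_seen.add(link)
                url_queue.put(link)
        count += 1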

The code pasted here may have problems (HTML tags may have leaked into it during posting); please see http://www.fuxiang90.me/?p=728 for the original.

# encoding: utf-8
# Use BeautifulSoup to get the a | p | font context.
# Single-threaded version: crawls HTML breadth-first and extracts the body text,
# but a single thread is a little slow.
# You can use this code freely, but keep the following line.
# Author: fuxiang, mail: fuxiang90@gmail.com

from BeautifulSoup import BeautifulSoup  # for processing HTML
import urllib2
import os
import sys
import re
import Queue
import socket
import time

socket.setdefaulttimeout(8)

g_url_queue = Queue.Queue()
g_url_queue.put('http://www.bupt.edu.cn/')
tt = ['http://www.bupt.edu.cn/']
g_url_set = set(tt)
max_deep = 1


# The input parameter is a soup object; this extracts the URLs from it.
def get_url_list(html):
    global g_url_set
    re_html = r'(http://(\w+\.)+\w+)'
    res = html.findAll('a')  # find all <a> tags
    for x in res:
        t = unicode(x)       # here x is a soup object
        m = re.findall(re_html, t)
        if not m:
            continue
        for xx in m:
            str_url = xx[0]
            if str_url not in g_url_set:
                g_url_queue.put(str_url)
                g_url_set |= set([str_url])


######################################################
def strip_tags(html):
    """Filter HTML tags out of a string.
    >>> str_text = strip_tags("<font color=red>hello</font>")
    >>> print str_text
    hello
    """
    from HTMLParser import HTMLParser
    html = html.strip()
    html = html.strip("\n")
    result = []
    parser = HTMLParser()
    parser.handle_data = result.append
    parser.feed(html)
    parser.close()
    return ''.join(result)


######################################################
# You can pass in a URL or a local file name; this parses out the body text.
def get_context(url):
    re_html = r'http[s]?://[a-zA-Z0-9]+\.[a-zA-Z0-9]+\.[a-zA-Z0-9]+'
    m = re.match(re_html, str(url))
    if m is None:
        # the URL is a local file
        fp = open(unicode(url), 'r')
    else:
        fp = urllib2.urlopen(url)
    html = fp.read()
    soup = BeautifulSoup(html)
    allfonttext = soup.findAll(['a', 'p', 'font'])
    if len(allfonttext) <= 0:
        print 'not found text'
    fwrite = open('u' + str(url), 'w')
    for i in allfonttext:
        t = i.renderContents()
        context = strip_tags(t)
        fwrite.write(context)
    fwrite.close()


######################################################
def main_fun(deep):
    global g_url_set
    global g_url_queue
    if deep > max_deep:
        return
    count = 0
    print 'debug'
    while g_url_queue.empty() is not True:
        print 'debug'
        l_url = g_url_queue.get()
        print l_url
        # catch timeout errors -- some pages cannot be reached
        try:
            fp = urllib2.urlopen(l_url)
        except Exception:
            continue
        html = fp.read()
        fwrite = open(str(count + 1), 'w')
        fwrite.write(html)
        fwrite.close()
        soup = BeautifulSoup(html)
        get_url_list(soup)
        get_context(count + 1)
        count += 1
        if count >= 100:
            return


# uncomplete
def get_html_page(url):
    furl = urllib2.urlopen(url)
    html = furl.read()
    soup = BeautifulSoup(html)


if __name__ == "__main__":
    main_fun(1)
    time.sleep(10)
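As a quick usage check: get_context accepts either a URL or the name of a file that the crawl loop has already saved to disk, and main_fun uses the latter form. A minimal example, assuming the crawl loop has already written page 1 to a local file named "1":

# extract the body text of the locally saved page "1";
# the extracted text is written to a file named "u1"
get_context(1)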

Now I want a multi-threaded version, so that downloading pages and analyzing the HTML (extracting the body and the URLs) can proceed concurrently. After a simple modification to the code above it barely runs; the main change is adding threading and lock control around access to the global queues. Since I had not written multi-threaded code before, suggestions are welcome.
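The locking pattern itself is small: every access to a shared queue is wrapped in acquire/release on the corresponding lock. Here is an isolated sketch of that pattern, with stand-alone copies of the locks and queues mirroring the names in the full listing below:

import Queue
import threading

# stand-alone copies of the shared objects, named as in the listing below
queue_lock = threading.RLock()
file_lock = threading.RLock()
g_url_queue = Queue.Queue()
g_file_queue = Queue.Queue()

def hand_over(saved_file_id):
    """Pop one URL under queue_lock, then publish the saved file id under file_lock."""
    queue_lock.acquire()
    try:
        l_url = g_url_queue.get()
    finally:
        queue_lock.release()
    # ... download l_url and write it to disk as file saved_file_id ...
    file_lock.acquire()
    try:
        g_file_queue.put(saved_file_id)
    finally:
        file_lock.release()
    return l_url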

# encoding: utf-8
# Use BeautifulSoup to get the a | p | font context.
# Multi-threaded version: downloading pages and parsing them run in separate threads.
# You can use this code freely, but keep the following line.
# Author: fuxiang, mail: fuxiang90@gmail.com

from BeautifulSoup import BeautifulSoup  # for processing HTML
import urllib2
import os
import sys
import re
import Queue
import socket
import time
import threading

queue_lock = threading.RLock()
file_lock = threading.RLock()

socket.setdefaulttimeout(8)

g_url_queue = Queue.Queue()
g_url_queue.put('http://www.bupt.edu.cn/')
g_file_queue = Queue.Queue()
tt = ['http://www.bupt.edu.cn/']
g_url_set = set(tt)
max_deep = 1


######################################################
def strip_tags(html):
    """Filter HTML tags out of a string.
    >>> str_text = strip_tags("<font color=red>hello</font>")
    >>> print str_text
    hello
    """
    from HTMLParser import HTMLParser
    html = html.strip()
    html = html.strip("\n")
    result = []
    parser = HTMLParser()
    parser.handle_data = result.append
    parser.feed(html)
    parser.close()
    return ''.join(result)


def get_context(soup, url):
    allfonttext = soup.findAll(['a', 'p', 'font'])
    if len(allfonttext) <= 0:
        print 'not found text'
    fwrite = open('u' + str(url), 'w')
    for i in allfonttext:
        t = i.renderContents()
        context = strip_tags(t)
        fwrite.write(context)
    fwrite.close()


class get_page_thread(threading.Thread):
    """Downloads pages from g_url_queue and saves them to numbered files."""

    def __init__(self, name):
        threading.Thread.__init__(self)
        self.t_name = name

    def run(self):
        global g_url_set
        global g_url_queue
        global g_file_queue
        count = 0
        print 'debug'
        while g_url_queue.empty() is not True:
            print self.t_name
            # take a URL off the shared queue under the lock
            queue_lock.acquire()
            l_url = g_url_queue.get()
            queue_lock.release()
            print l_url
            # catch timeout errors -- some pages cannot be reached
            try:
                fp = urllib2.urlopen(l_url)
            except Exception:
                continue
            html = fp.read()
            fwrite = open(str(count + 1), 'w')
            fwrite.write(html)
            fwrite.close()
            # tell the parsing thread about the newly saved file
            file_lock.acquire()
            g_file_queue.put(count + 1)
            file_lock.release()
            count += 1
            if count >= 100:
                return


class get_url_list_thread(threading.Thread):
    """Parses saved pages from g_file_queue, extracting the body text and new URLs."""

    def __init__(self, name):
        threading.Thread.__init__(self)
        self.t_name = name

    def run(self):
        global g_url_set
        global g_file_queue
        global queue_lock
        global file_lock
        while g_file_queue.empty() is not True:
            file_lock.acquire()
            filename = g_file_queue.get()
            file_lock.release()
            fd = open(str(filename), 'r')
            html = fd.read()
            soup = BeautifulSoup(html)
            get_context(soup, filename)
            re_html = r'(http://(\w+\.)+\w+)'
            res = soup.findAll('a')  # find all <a> tags
            for x in res:
                t = unicode(x)       # here x is a soup object
                m = re.findall(re_html, t)
                if not m:
                    continue
                for xx in m:
                    str_url = xx[0]
                    if str_url not in g_url_set:
                        queue_lock.acquire()
                        g_url_queue.put(str_url)
                        queue_lock.release()
                        g_url_set |= set([str_url])


# uncomplete
def get_html_page(url):
    furl = urllib2.urlopen(url)
    html = furl.read()
    soup = BeautifulSoup(html)


if __name__ == "__main__":
    thread1 = get_page_thread('A')
    thread2 = get_url_list_thread('B')
    thread3 = get_page_thread('C')
    thread4 = get_page_thread('D')
    thread1.start()
    time.sleep(20)
    thread2.start()
    time.sleep(20)
    thread3.start()
    thread4.start()
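One design note, offered as a suggestion rather than the author's method: Queue.Queue is already internally synchronized, so the explicit RLock around single get/put calls is not strictly required, and a blocking get with a timeout also avoids the race between empty() and get() when several threads drain the same queue. A minimal sketch of that variant, with the queue passed in to keep the fragment self-contained:

import Queue

def worker_loop(g_url_queue):
    while True:
        try:
            # a blocking get with a timeout replaces the empty() check and
            # needs no extra lock -- Queue.Queue is itself thread-safe
            l_url = g_url_queue.get(timeout=5)
        except Queue.Empty:
            break  # nothing arrived for a while; let the thread exit
        # ... download l_url and push the saved file id onto the file queue ...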
