This article shows how to use Python to crawl the jQuery course on W3School and save it locally. It has good reference value; let's take a look.
Recently I have been busy looking for work, and in my spare time I have been practicing on some crawler projects to keep writing code. I know I am still a rookie, but there is no shortcut other than practice. If you have a testing position to offer, let me know: automation, functional, and interface testing are all fine.
First, let's be clear about the need. Many of us want to look up some technical material, say the syntax of jQuery, but with no network access and no e-book on the phone this is quite inconvenient. Don't worry, that need can be met here. The requirement is to get the jQuery syntax pages, so let's analyze the site that holds them. http://www.w3school.com.cn/jquery/jquery_syntax.asp is the syntax URL, and http://www.w3school.com.cn/jquery/jquery_intro.asp is the introduction URL. Comparing many such URLs, they all share the www.w3school.com.cn/jquery prefix, so next we analyze how to fetch all of these pages; the corresponding targets are listed in the course menu on the right side of each page.
Let's take a look at these links. They are relative paths, so we stitch each of them together with http://www.w3school.com.cn to make up our new URLs.
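As a side note, instead of concatenating strings by hand, the standard library's urllib.parse.urljoin resolves relative links against a base URL cleanly. A minimal sketch; the two sample hrefs here are taken from the course menu discussed above:

```python
from urllib.parse import urljoin

base = 'http://www.w3school.com.cn'
# Two sample relative links from the course menu
links = ['/jquery/index.asp', '/jquery/jquery_syntax.asp']

# urljoin resolves each relative href against the base URL,
# without producing doubled slashes
full_urls = [urljoin(base, link) for link in links]
print(full_urls[0])  # http://www.w3school.com.cn/jquery/index.asp
```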
Here is the code:
import urllib.request
from bs4 import BeautifulSoup


def head():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}
    return headers


def parse_url(url):
    hea = head()
    response = urllib.request.Request(url, headers=hea)
    # The site is served as gb2312, so decode accordingly
    html = urllib.request.urlopen(response).read().decode('gb2312')
    return html


def url_s():
    url = 'http://www.w3school.com.cn/jquery/index.asp'
    html = parse_url(url)
    soup = BeautifulSoup(html, 'html.parser')
    # The course menu lives in the element with id="course"
    me = soup.find_all(id='course')
    m_url_text = []
    m_url = []
    for link in me:
        m_url_text.append(link.text)
        m = link.find_all('a')
        for i in m:
            m_url.append(i.get('href'))
    # Split the menu text into one entry per line
    for i in m_url_text:
        m_url_text = i.split('\n')
    return m_url, m_url_text
With the url_s function we can get all of our links:
['/jquery/index.asp', '/jquery/jquery_intro.asp', '/jquery/jquery_install.asp', '/jquery/jquery_syntax.asp', '/jquery/jquery_selectors.asp', '/jquery/jquery_events.asp', '/jquery/jquery_hide_show.asp', '/jquery/jquery_fade.asp', '/jquery/jquery_slide.asp', '/jquery/jquery_animate.asp', '/jquery/jquery_stop.asp', '/jquery/jquery_callback.asp', '/jquery/jquery_chaining.asp', '/jquery/jquery_dom_get.asp', '/jquery/jquery_dom_set.asp', '/jquery/jquery_dom_add.asp', '/jquery/jquery_dom_remove.asp', '/jquery/jquery_css_classes.asp', '/jquery/jquery_css.asp', '/jquery/jquery_dimensions.asp', '/jquery/jquery_traversing.asp', '/jquery/jquery_traversing_ancestors.asp', '/jquery/jquery_traversing_descendants.asp', '/jquery/jquery_traversing_siblings.asp', '/jquery/jquery_traversing_filtering.asp', '/jquery/jquery_ajax_intro.asp', '/jquery/jquery_ajax_load.asp', '/jquery/jquery_ajax_get_post.asp', '/jquery/jquery_noconflict.asp', '/jquery/jquery_examples.asp', '/jquery/jquery_quiz.asp', '/jquery/jquery_reference.asp', '/jquery/jquery_ref_selectors.asp', '/jquery/jquery_ref_events.asp', '/jquery/jquery_ref_effects.asp', '/jquery/jquery_ref_manipulation.asp', '/jquery/jquery_ref_attributes.asp', '/jquery/jquery_ref_css.asp', '/jquery/jquery_ref_ajax.asp', '/jquery/jquery_ref_traversing.asp', '/jquery/jquery_ref_data.asp', '/jquery/jquery_ref_dom_element_methods.asp', '/jquery/jquery_ref_core.asp', '/jquery/jquery_ref_prop.asp'], ['jQuery Tutorial', '', 'jQuery Tutorial', 'jQuery Intro', 'jQuery Install', 'jQuery Syntax', 'jQuery Selectors', 'jQuery Events', '', 'jQuery Effects', '', 'jQuery Hide/Show', 'jQuery Fade', 'jQuery Slide', 'jQuery Animate', 'jQuery Stop', 'jQuery Callback', 'jQuery Chaining', '', 'jQuery HTML', '', 'jQuery Get', 'jQuery Set', 'jQuery Add', 'jQuery Remove', 'jQuery CSS Classes', 'jQuery css()', 'jQuery Dimensions', '', 'jQuery Traversing', '', 'jQuery Traversing', 'jQuery Ancestors', 'jQuery Descendants', 'jQuery Siblings', 'jQuery Filtering', '', 'jQuery AJAX', '', 'jQuery AJAX Intro', 'jQuery Load', 'jQuery Get/Post', '', 'jQuery Misc', '', 'jQuery noConflict()', '', 'jQuery Examples', '', 'jQuery Examples', 'jQuery Quiz', '', 'jQuery Reference', '', 'jQuery Reference', 'jQuery Selectors', 'jQuery Events', 'jQuery Effects', 'jQuery HTML/DOM', 'jQuery Attributes', 'jQuery CSS', 'jQuery AJAX', 'jQuery Traversing', 'jQuery Data', 'jQuery DOM Elements', 'jQuery Core', 'jQuery Properties', '', '']
These are all the links together with the names of the corresponding syntax modules. Next we splice the URLs together using string concatenation.
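The splicing itself is plain string concatenation. A minimal sketch of the idea, using two sample relative links of the kind url_s() returns; note that because the base ends with '/' and each href also starts with '/', naive splicing yields a double slash, which matches the output shown below:

```python
base = 'http://www.w3school.com.cn/'
# Sample relative links of the kind url_s() returns
m_url = ['/jquery/index.asp', '/jquery/jquery_intro.asp']

# Naive splicing: base ends with '/' and each href starts with '/',
# so the result contains a double slash
url_jque = [base + link for link in m_url]
print(url_jque[0])  # http://www.w3school.com.cn//jquery/index.asp
```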
['http://www.w3school.com.cn//jquery/index.asp', 'http://www.w3school.com.cn//jquery/jquery_intro.asp', 'http://www.w3school.com.cn//jquery/jquery_install.asp', 'http://www.w3school.com.cn//jquery/jquery_syntax.asp', 'http://www.w3school.com.cn//jquery/jquery_selectors.asp', 'http://www.w3school.com.cn//jquery/jquery_events.asp', 'http://www.w3school.com.cn//jquery/jquery_hide_show.asp', 'http://www.w3school.com.cn//jquery/jquery_fade.asp', 'http://www.w3school.com.cn//jquery/jquery_slide.asp', 'http://www.w3school.com.cn//jquery/jquery_animate.asp', 'http://www.w3school.com.cn//jquery/jquery_stop.asp', 'http://www.w3school.com.cn//jquery/jquery_callback.asp', 'http://www.w3school.com.cn//jquery/jquery_chaining.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_get.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_set.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_add.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_remove.asp', 'http://www.w3school.com.cn//jquery/jquery_css_classes.asp', 'http://www.w3school.com.cn//jquery/jquery_css.asp', 'http://www.w3school.com.cn//jquery/jquery_dimensions.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_ancestors.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_descendants.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_siblings.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_filtering.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_intro.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_load.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_get_post.asp', 'http://www.w3school.com.cn//jquery/jquery_noconflict.asp', 'http://www.w3school.com.cn//jquery/jquery_examples.asp', 'http://www.w3school.com.cn//jquery/jquery_quiz.asp', 'http://www.w3school.com.cn//jquery/jquery_reference.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_selectors.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_events.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_effects.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_manipulation.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_attributes.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_css.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_ajax.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_traversing.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_data.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_dom_element_methods.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_core.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_prop.asp']
Now we have all the URLs, so let's analyze the text of each page.
Analysis shows that all the text we want lives inside an element with id="maincontent", so on each page we parse the id="maincontent" tag directly, extract its text, and save it to a file.
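To illustrate this extraction step without hitting the network, here is a stdlib-only sketch that pulls the text out of an id="maincontent" element from a hypothetical HTML snippet (the full script below does the same job with BeautifulSoup):

```python
from html.parser import HTMLParser


class MainContentExtractor(HTMLParser):
    """Collects the text inside the element with id="maincontent"."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth once inside the target element
        self.chunks = []     # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1                       # a child of the target
        elif dict(attrs).get('id') == 'maincontent':
            self.depth = 1                        # entered the target

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1                       # leaving a level

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)              # keep text inside target


# Hypothetical page fragment standing in for a real W3School page
html = ('<body><div id="maincontent"><h1>jQuery Syntax</h1>'
        '<p>Hello</p></div><p>footer</p></body>')
p = MainContentExtractor()
p.feed(html)
print(''.join(p.chunks))  # jQuery SyntaxHello
```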
So all our code is as follows:
import urllib.request
from bs4 import BeautifulSoup


def head():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}
    return headers


def parse_url(url):
    hea = head()
    response = urllib.request.Request(url, headers=hea)
    # The site is served as gb2312, so decode accordingly
    html = urllib.request.urlopen(response).read().decode('gb2312')
    return html


def url_s():
    url = 'http://www.w3school.com.cn/jquery/index.asp'
    html = parse_url(url)
    soup = BeautifulSoup(html, 'html.parser')
    # The course menu lives in the element with id="course"
    me = soup.find_all(id='course')
    m_url_text = []
    m_url = []
    for link in me:
        m_url_text.append(link.text)
        m = link.find_all('a')
        for i in m:
            m_url.append(i.get('href'))
    # Split the menu text into one entry per line
    for i in m_url_text:
        m_url_text = i.split('\n')
    return m_url, m_url_text


def xml():
    # Splice each relative link onto the site's base URL
    url, url_text = url_s()
    url_jque = []
    for link in url:
        url_jque.append('http://www.w3school.com.cn/' + link)
    return url_jque


def xiazai():
    # Download each page and save the id="maincontent" text to a numbered file
    urls = xml()
    i = 0
    for url in urls:
        html = parse_url(url)
        soup = BeautifulSoup(html, 'html.parser')
        me = soup.find_all(id='maincontent')
        with open(r'%s.txt' % i, 'wb') as f:
            for h in me:
                f.write(h.text.encode('utf-8'))
        print(i)
        i += 1


if __name__ == '__main__':
    xiazai()
Results
Well, at this point our crawl is complete. What remains is minor polishing; the main content is all done.
In fact, a Python crawler is quite simple: as long as we can analyze the elements of the site and find the ones that carry the content we need, we can solve the problem.