This article describes how to use Python to crawl the jQuery course pages on w3school and save them to the local machine for offline reading. It walks through the full code and should make a useful reference.
I have recently been busy job hunting, and in my spare time I practice by writing crawlers. I know I am still a beginner, so I need the exercise; as the saying goes, diligence is the path through the mountain of books. If you know of testing openings involving automation, APIs, and the like, feel free to point me at them.
First, let's be clear about the requirement. Many people want to review a technology when they have a spare moment; for example, I might want to look up jQuery's syntax. But with no network connection and no e-book on the phone, that is frustrating. So don't worry, I'll satisfy that need here: we want the jQuery tutorial available offline. With that requirement in mind, let's analyze the site. http://www.w3school.com.cn/jquery/jquery_syntax.asp is the syntax page and http://www.w3school.com.cn/jquery/jquery_intro.asp is the introduction page, so the URLs clearly follow a pattern we can work with.
Let's take a look at these links. We can join each of them onto http://www.w3school.com.cn to form our new full URLs.
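A side note on the joining step: plain string concatenation with both a trailing and a leading slash produces doubled slashes like `http://www.w3school.com.cn//jquery/...` (visible in the output further down; the server tolerates it). The standard library's `urllib.parse.urljoin` joins cleanly instead. A minimal sketch, using two sample links from the sidebar:

```python
from urllib.parse import urljoin

base = 'http://www.w3school.com.cn'
links = ['/jquery/jquery_intro.asp', '/jquery/jquery_syntax.asp']

# urljoin resolves each relative path against the base without doubling slashes.
full_urls = [urljoin(base, link) for link in links]
print(full_urls[0])  # http://www.w3school.com.cn/jquery/jquery_intro.asp
```

This is optional; the concatenation used later in the article works too, it just leaves the extra slash in.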
On to the code:
```python
import urllib.request
from bs4 import BeautifulSoup


def head():
    # Pretend to be a normal browser so the site serves the pages.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    return headers


def parse_url(url):
    # Fetch a page; w3school.com.cn serves gb2312-encoded HTML.
    request = urllib.request.Request(url, headers=head())
    html = urllib.request.urlopen(request).read().decode('gb2312')
    return html


def url_s():
    # Collect every course link and the sidebar text from the id='course' element.
    url = 'http://www.w3school.com.cn/jquery/index.asp'
    html = parse_url(url)
    soup = BeautifulSoup(html, 'html.parser')
    me = soup.find_all(id='course')
    m_url_text = []
    m_url = []
    for link in me:
        m_url_text.append(link.text)
        for a in link.find_all('a'):
            m_url.append(a.get('href'))
    # The sidebar text comes out as one big string; split it into section names.
    m_url_text = m_url_text[0].split('\n') if m_url_text else []
    return m_url, m_url_text
```
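One caveat about `parse_url`: the site declares gb2312, but if a page ever contains a character outside that charset, `decode('gb2312')` raises `UnicodeDecodeError`. A defensive variant (my suggestion, not part of the original script) is to decode with gb18030, a strict superset of gb2312. The snippet below demonstrates the difference on sample bytes, with no network involved:

```python
# 'ï' can be encoded in gb18030 but lies outside the narrower gb2312 table.
raw = 'naïve'.encode('gb18030')   # stand-in for urlopen(...).read()

try:
    raw.decode('gb2312')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False             # the strict codec rejects the bytes

text = raw.decode('gb18030')      # the superset codec decodes them fine
print(strict_ok, text)
```

Swapping `'gb2312'` for `'gb18030'` in `parse_url` is therefore a low-risk hardening step.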
In this way, we can use the url_s function to obtain all our links.
(['/jquery/index.asp', '/jquery/jquery_intro.asp', '/jquery/jquery_install.asp', '/jquery/jquery_syntax.asp', '/jquery/jquery_selectors.asp', '/jquery/jquery_events.asp', '/jquery/jquery_hide_show.asp', '/jquery/jquery_fade.asp', '/jquery/jquery_slide.asp', '/jquery/jquery_animate.asp', '/jquery/jquery_stop.asp', '/jquery/jquery_callback.asp', '/jquery/jquery_chaining.asp', '/jquery/jquery_dom_get.asp', '/jquery/jquery_dom_set.asp', '/jquery/jquery_dom_add.asp', '/jquery/jquery_dom_remove.asp', '/jquery/jquery_css_classes.asp', '/jquery/jquery_css.asp', '/jquery/jquery_dimensions.asp', '/jquery/jquery_traversing.asp', '/jquery/jquery_traversing_ancestors.asp', '/jquery/jquery_traversing_descendants.asp', '/jquery/jquery_traversing_siblings.asp', '/jquery/jquery_traversing_filtering.asp', '/jquery/jquery_ajax_intro.asp', '/jquery/jquery_ajax_load.asp', '/jquery/jquery_ajax_get_post.asp', '/jquery/jquery_noconflict.asp', '/jquery/jquery_examples.asp', '/jquery/jquery_quiz.asp', '/jquery/jquery_reference.asp', '/jquery/jquery_ref_selectors.asp', '/jquery/jquery_ref_events.asp', '/jquery/jquery_ref_effects.asp', '/jquery/jquery_ref_manipulation.asp', '/jquery/jquery_ref_attributes.asp', '/jquery/jquery_ref_css.asp', '/jquery/jquery_ref_ajax.asp', '/jquery/jquery_ref_traversing.asp', '/jquery/jquery_ref_data.asp', '/jquery/jquery_ref_dom_element_methods.asp', '/jquery/jquery_ref_core.asp', '/jquery/jquery_ref_prop.asp'],
['jquery', '', 'jquery', 'jquery introduction', 'jquery installation', 'jquery syntax', 'jquery selectors', 'jquery events', '', 'jquery effects', '', 'jquery hide/show', 'jquery fade', 'jquery slide', 'jquery animate', 'jquery stop()', 'jquery callback', 'jquery chaining', '', 'jquery HTML', '', 'jquery get', 'jquery set', 'jquery add', 'jquery remove', 'jquery CSS classes', 'jquery css()', 'jquery dimensions', '', 'jquery traversing', '', 'jquery traversing', 'jquery ancestors', 'jquery descendants', 'jquery siblings', 'jquery filtering', '', 'jquery AJAX', '', 'jquery AJAX intro', 'jquery load', 'jquery get/post', '', 'jquery misc', '', 'jquery noConflict()', '', 'jquery examples', '', 'jquery examples', 'jquery quiz', '', 'jquery reference', '', 'jquery reference', 'jquery selectors', 'jquery events', 'jquery effects', 'jquery document manipulation', 'jquery attribute operations', 'jquery CSS operations', 'jquery AJAX', 'jquery traversing', 'jquery data', 'jquery DOM element methods', 'jquery core', 'jquery properties', '', ''])
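As the printed list shows, the raw hrefs can contain stray whitespace and occasional duplicates. Before downloading, it may be worth filtering them. `clean_links` below is a hypothetical helper of my own, not part of the original script; it keeps only `.asp` course pages and drops repeats while preserving order:

```python
def clean_links(links):
    # Keep only .asp course pages, drop duplicates, preserve order.
    seen = set()
    cleaned = []
    for link in links:
        link = link.strip()
        if link.endswith('.asp') and link not in seen:
            seen.add(link)
            cleaned.append(link)
    return cleaned


sample = ['/jquery/index.asp', '/jquery/upload',
          '/jquery/index.asp', '/jquery/jquery_intro.asp']
print(clean_links(sample))  # ['/jquery/index.asp', '/jquery/jquery_intro.asp']
```

Feeding the output of `url_s()` through such a filter saves a few wasted requests.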
These are all the links along with the names of the corresponding tutorial sections. Next we splice the full URLs together with simple string concatenation.
['http://www.w3school.com.cn//jquery/index.asp', 'http://www.w3school.com.cn//jquery/jquery_intro.asp', 'http://www.w3school.com.cn//jquery/jquery_install.asp', 'http://www.w3school.com.cn//jquery/jquery_syntax.asp', 'http://www.w3school.com.cn//jquery/jquery_selectors.asp', 'http://www.w3school.com.cn//jquery/jquery_events.asp', 'http://www.w3school.com.cn//jquery/jquery_hide_show.asp', 'http://www.w3school.com.cn//jquery/jquery_fade.asp', 'http://www.w3school.com.cn//jquery/jquery_slide.asp', 'http://www.w3school.com.cn//jquery/jquery_animate.asp', 'http://www.w3school.com.cn//jquery/jquery_stop.asp', 'http://www.w3school.com.cn//jquery/jquery_callback.asp', 'http://www.w3school.com.cn//jquery/jquery_chaining.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_get.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_set.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_add.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_remove.asp', 'http://www.w3school.com.cn//jquery/jquery_css_classes.asp', 'http://www.w3school.com.cn//jquery/jquery_css.asp', 'http://www.w3school.com.cn//jquery/jquery_dimensions.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_ancestors.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_descendants.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_siblings.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_filtering.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_intro.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_load.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_get_post.asp', 'http://www.w3school.com.cn//jquery/jquery_noconflict.asp', 'http://www.w3school.com.cn//jquery/jquery_examples.asp', 'http://www.w3school.com.cn//jquery/jquery_quiz.asp', 'http://www.w3school.com.cn//jquery/jquery_reference.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_selectors.asp', 
'http://www.w3school.com.cn//jquery/jquery_ref_events.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_effects.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_manipulation.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_attributes.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_css.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_ajax.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_traversing.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_data.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_dom_element_methods.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_core.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_prop.asp']
Now that we have all the URLs, let's analyze the page bodies.
Inspection shows that the body of every page lives in an element with id="maincontent", so we can directly parse the id="maincontent" tag on each page, get the corresponding text, and save it.
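The extraction step can be sketched in isolation. The HTML below is a made-up miniature stand-in for a w3school page (the real ones are fetched over HTTP), but the `find` call is exactly the pattern the full script uses:

```python
from bs4 import BeautifulSoup

# A made-up miniature page; only the id='maincontent' div matters here.
html = '''
<html><body>
<div id="navbar">menu links we do not want</div>
<div id="maincontent"><h1>jQuery Syntax</h1>
<p>With jQuery you select elements and perform actions on them.</p></div>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')
main = soup.find(id='maincontent')   # grab just the article body
print(main.get_text())
```

Because `id` values are unique per page, `find` (first match) is enough; the full script uses `find_all` and loops, which behaves the same on these pages.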
So all our code is as follows:
```python
import urllib.request
from bs4 import BeautifulSoup


def head():
    # Pretend to be a normal browser so the site serves the pages.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    return headers


def parse_url(url):
    # Fetch a page; w3school.com.cn serves gb2312-encoded HTML.
    request = urllib.request.Request(url, headers=head())
    html = urllib.request.urlopen(request).read().decode('gb2312')
    return html


def url_s():
    # Collect every course link and the sidebar text from the id='course' element.
    url = 'http://www.w3school.com.cn/jquery/index.asp'
    html = parse_url(url)
    soup = BeautifulSoup(html, 'html.parser')
    me = soup.find_all(id='course')
    m_url_text = []
    m_url = []
    for link in me:
        m_url_text.append(link.text)
        for a in link.find_all('a'):
            m_url.append(a.get('href'))
    # The sidebar text comes out as one big string; split it into section names.
    m_url_text = m_url_text[0].split('\n') if m_url_text else []
    return m_url, m_url_text


def xml():
    # Splice each relative link onto the site root to form a full URL.
    url, url_text = url_s()
    url_jque = []
    for link in url:
        url_jque.append('http://www.w3school.com.cn/' + link)
    return url_jque


def xiazai():
    # Download every page, extract id='maincontent', and save it as <i>.txt.
    urls = xml()
    for i, url in enumerate(urls):
        html = parse_url(url)
        soup = BeautifulSoup(html, 'html.parser')
        me = soup.find_all(id='maincontent')
        with open('%s.txt' % i, 'wb') as f:
            for h in me:
                f.write(h.text.encode('utf-8'))
        print(i)


if __name__ == '__main__':
    xiazai()
```
Result
At this point the crawling work is complete, and what remains is only minor polish; all the heavy lifting is done.
In fact, a Python crawler is quite simple: as long as we analyze the site's elements and find the pattern they share, we can parse them effectively and solve our problem.
The above is a complete code example of how to use Python to crawl the w3school jQuery course and save it locally. I hope it serves as a useful reference.