Python crawls the JQuery course of w3shcool and saves it to the local machine.

Source: Internet
Author: User

Python crawls the JQuery course of w3shcool and saves it to the local machine.

Recently, I am busy looking for a job. In my spare time, I am also looking for crawlers to practice, write code, and know that I am a newbie. But I need to do more exercises. Shushan has a path to work diligently. You can introduce automation, functions, and interfaces to me if you have a test pitfall.

First of all, we have a clear requirement. Many people want to look at some technologies when they have something to do. For example, I want to look at JQuery's syntax. But now I don't have a network, and I don't have e-books on mobile phones, it really makes us very uncomfortable, so don't worry. I will satisfy you here. First of all, your need is to get the JQuery syntax, so I am seeing this demand, for a website with a response, we will analyze the website. Http://www.w3school.com.cn/jquery/jquery_syntax.asp this is the syntax url, http://www.w3school.com.cn/jquery/jquery_intro.asp This is the introduction url, so we get a lot of url analysis to our pipeline

Let's take a look at these links ,. We can join these links together with the http://www.w3school.com.cn. Then form our new url,

Code on

import urllib.requestfrom bs4 import BeautifulSoup import timedef head(): headers={ 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0' } return headersdef parse_url(url): hea=head() resposne=urllib.request.Request(url,headers=hea) html=urllib.request.urlopen(resposne).read().decode('gb2312') return htmldef url_s(): url='http://www.w3school.com.cn/jquery/index.asp' html=parse_url(url) soup=BeautifulSoup(html) me=soup.find_all(id='course') m_url_text=[] m_url=[] for link in me:  m_url_text.append(link.text)  m=link.find_all('a')  for i in m:   m_url.append(i.get('href')) for i in m_url_text:  h=i.encode('utf-8').decode('utf-8')  m_url_text=h.split('\n') return m_url,m_url_text

In this way, we can use the url_s function to obtain all our links.

['/Jquery/index. asp ','/jquery/jquery_intro.asp ','/jquery/jquery_install.asp ','/jquery/jquery_syntax.asp ','/jquery/jquery_selectors.asp ','/jquery/jquery_events.asp ', '/jquery/jquery_hide_show.asp', '/jquery/jquery_fade.asp', '/jquery/jquery_slide.asp', '/jquery/jquery_animate.asp', '/jquery/jquery_stop.asp ', '/jquery/jquery_callback.asp', '/jquery/jquery_chaining.asp', '/jquery/jquery_dom_get.asp', '/jquery/jquery_dom_set.asp', '/jquery/jquery_dom_add.asp ', '/jquery/jquery_dom_remove.asp', '/jquery/jquery_css_classes.asp', '/jquery/jquery_css.asp', '/jquery/jquery_dimensions.asp', '/jquery/jquery_traversingasp ', '/jquery/jquery_traversing_ancestors.asp', '/jquery/upload','/jquery/upload', '/jquery/jquery_traversing_filtering.asp', '/jquery/jquery_ajax_intro.asp ', '/jquery/jquery_ajax_load.asp', '/jquery/upload','/jquery/jquery_noconflict.asp ','/jquery/jquery_examples.asp ','/jquery/jquery_quiz.asp ', '/jquery/jquery_reference.asp', '/jquery/jquery_ref_selectors.asp', '/jquery/jquery_ref_events.asp', '/jquery/jquery_ref_effects.asp', '/jquery/asp ', '/jquery/callback','/jquery/jquery_ref_css.asp ','/jquery/jquery_ref_ajax.asp ','/jquery/jquery_ref_traversing.asp ','/jquery/jquery_ref_data.asp ', '/jquery/jquery_ref_dom_element_methods.asp', '/jquery/jquery_ref_core.asp', '/jquery/jquery_ref_prop.asp'], ['jquery ', '', 'jquery ', 'jquery introduction', 'jquery installation', 'jquery syntacay', 'jquery selector ', 'jquery event', '', 'jquery effect ','', 'jquery hide/display', 'jquery fade in and out ', 'jquery slide', 'jquery animate', 'jquery stop () ', 'jquery callback', 'jquery chaining ', '', 'jquery HTML ','', 'jquery access', 'jquery settings', 'jquery add', 'jquery delete', 'jquery CSS class ', 'jquery css () ', 'jquery size', '', 'jquery traversal','', 'jquery traversal ', 'jquery ancestor', and 'jquery descendant ', 'jquery compatriot ', 'jquery filter', '', 'jquery AJAX','', 'jquery AJAX introduction', 'jquery upload', 'jquery Get/Post ', '', 'jquery miscellaneous ','', 'jquery noConflict ()', '', 'jquery instances','', 'jquery instances', and 'jquery test ', '', 'jquery reference manual ','', 'jquery reference manual', 'jquery selector ', 'jquery event', 'jquery effect', 'jquery document operation ', 'jquery attribute operation', 'jquery CSS operation', 'jquery Ajax ', 'jquery traversal', 'jquery data', 'jquery DOM elemental ', 'jquery core ', 'jquery attribute', '',''])

This is the name of the syntax module corresponding to all links and corresponding links. Next we will splice the urls and use the str concatenation.

 ['http://www.w3school.com.cn//jquery/index.asp', 'http://www.w3school.com.cn//jquery/jquery_intro.asp', 'http://www.w3school.com.cn//jquery/jquery_install.asp', 'http://www.w3school.com.cn//jquery/jquery_syntax.asp', 'http://www.w3school.com.cn//jquery/jquery_selectors.asp', 'http://www.w3school.com.cn//jquery/jquery_events.asp', 'http://www.w3school.com.cn//jquery/jquery_hide_show.asp', 'http://www.w3school.com.cn//jquery/jquery_fade.asp', 'http://www.w3school.com.cn//jquery/jquery_slide.asp', 'http://www.w3school.com.cn//jquery/jquery_animate.asp', 'http://www.w3school.com.cn//jquery/jquery_stop.asp', 'http://www.w3school.com.cn//jquery/jquery_callback.asp', 'http://www.w3school.com.cn//jquery/jquery_chaining.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_get.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_set.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_add.asp', 'http://www.w3school.com.cn//jquery/jquery_dom_remove.asp', 'http://www.w3school.com.cn//jquery/jquery_css_classes.asp', 'http://www.w3school.com.cn//jquery/jquery_css.asp', 'http://www.w3school.com.cn//jquery/jquery_dimensions.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_ancestors.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_descendants.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_siblings.asp', 'http://www.w3school.com.cn//jquery/jquery_traversing_filtering.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_intro.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_load.asp', 'http://www.w3school.com.cn//jquery/jquery_ajax_get_post.asp', 'http://www.w3school.com.cn//jquery/jquery_noconflict.asp', 'http://www.w3school.com.cn//jquery/jquery_examples.asp', 'http://www.w3school.com.cn//jquery/jquery_quiz.asp', 'http://www.w3school.com.cn//jquery/jquery_reference.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_selectors.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_events.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_effects.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_manipulation.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_attributes.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_css.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_ajax.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_traversing.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_data.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_dom_element_methods.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_core.asp', 'http://www.w3school.com.cn//jquery/jquery_ref_prop.asp']

So we have all the urls, so let's analyze the text.

The analysis shows that all our bodies are in the same id = maincontent, so we can directly parse the tag id = maincontent in each interface, get the text document of the response, and save it.

So all our code is as follows:

import urllib.requestfrom bs4 import BeautifulSoup import timedef head(): headers={ 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0' } return headersdef parse_url(url): hea=head() resposne=urllib.request.Request(url,headers=hea) html=urllib.request.urlopen(resposne).read().decode('gb2312') return htmldef url_s(): url='http://www.w3school.com.cn/jquery/index.asp' html=parse_url(url) soup=BeautifulSoup(html) me=soup.find_all(id='course') m_url_text=[] m_url=[] for link in me:  m_url_text.append(link.text)  m=link.find_all('a')  for i in m:   m_url.append(i.get('href')) for i in m_url_text:  h=i.encode('utf-8').decode('utf-8')  m_url_text=h.split('\n') return m_url,m_url_textdef xml(): url,url_text=url_s() url_jque=[] for link in url:  url_jque.append('http://www.w3school.com.cn/'+link) return url_jquedef xiazai(): urls=xml() i=0 for url in urls:  html=parse_url(url)  soup=BeautifulSoup(html)  me=soup.find_all(id='maincontent')  with open(r'%s.txt'%i,'wb') as f:   for h in me:    f.write(h.text.encode('utf-8'))    print(i)  i+=1if __name__ == '__main__': xiazai()
import urllib.requestfrom bs4 import BeautifulSoup import timedef head(): headers={ 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0' } return headersdef parse_url(url): hea=head() resposne=urllib.request.Request(url,headers=hea) html=urllib.request.urlopen(resposne).read().decode('gb2312') return htmldef url_s(): url='http://www.w3school.com.cn/jquery/index.asp' html=parse_url(url) soup=BeautifulSoup(html) me=soup.find_all(id='course') m_url_text=[] m_url=[] for link in me:  m_url_text.append(link.text)  m=link.find_all('a')  for i in m:   m_url.append(i.get('href')) for i in m_url_text:  h=i.encode('utf-8').decode('utf-8')  m_url_text=h.split('\n') return m_url,m_url_textdef xml(): url,url_text=url_s() url_jque=[] for link in url:  url_jque.append('http://www.w3school.com.cn/'+link) return url_jquedef xiazai(): urls=xml() i=0 for url in urls:  html=parse_url(url)  soup=BeautifulSoup(html)  me=soup.find_all(id='maincontent')  with open(r'%s.txt'%i,'wb') as f:   for h in me:    f.write(h.text.encode('utf-8'))    print(i)  i+=1if __name__ == '__main__': xiazai()

Result

So far, our crawling work is complete, and the rest is xiaoxiu xiaobu. We should have done all the big content.

In fact, the python crawler is still very simple. As long as we analyze the website elements and find out the general items of all elements, we can effectively analyze and solve our problems.

The above is all the content of this article. I hope this article will help you in your study or work. I also hope to provide more support to the customer's home!

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.