Python crawler learning notes: single-thread crawler


Introduction

This article mainly introduces how to crawl the course information of the Maizi Institute (maiziedu.com); the crawler is still single-threaded. Before getting started, let's take a look at the results.

Not bad, right? First, open the Maizi Institute website and find the page that lists all of its courses.

Now turn the pages and watch how the URL changes. The first page is http://www.maiziedu.com/course/list/, the second page becomes http://www.maiziedu.com/course/list/all-all/0-2/, and the third becomes http://www.maiziedu.com/course/list/all-all/0-3/. Each time you turn a page, the number after "0-" increases by 1. So what about the first page? Entering http://www.maiziedu.com/course/list/all-all/0-1/ into the browser's address bar opens the first page as well. That makes things easy: re.sub() can generate the URL of any page for us. Once we have the URL links, the next step is to fetch the page source. Right-click and choose "View source" or "Inspect element" to find where the course information sits in the page.
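As a quick sketch of the re.sub() idea (the page count of 30 is just the value used later in this article; adjust it to match the site), the listing URLs can be generated like this:

import re

base_url = 'http://www.maiziedu.com/course/list/all-all/0-1/'
# Swap the trailing page number to build the URL of every listing page.
links = [re.sub(r'/0-\d+/', '/0-%s/' % i, base_url) for i in range(1, 31)]
for link in links:
    print(link)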

After finding the location of the course information, regular expressions make it easy to extract the content you need. Working out the exact expressions is up to you; trying to find the pattern on your own will teach you more. If you really can't work it out, read on and check my source code.
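As a minimal illustration of that extraction step, here is a sketch; the <li> snippet below is a made-up stand-in shaped like the site's real list items:

import re

# Hypothetical course entry, shaped like the real list items on the site.
li = '<li><a title="Python basics">Python basics</a><p class="color99">1024 people</p></li>'
title = re.search('<a title="(.*?)"', li, re.S).group(1)
people = re.search('<p class="color99">(.*?)</p>', li, re.S).group(1)
print(title)
print(people)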

Actual source code

# coding=utf-8
import re
import requests
import sys

reload(sys)
sys.setdefaultencoding("utf8")

class spider():
    def __init__(self):
        print "Starting to crawl content..."

    # Build the URL of every listing page, from the current page up to total_page.
    def changePage(self, url, total_page):
        nowpage = int(re.search(r'/0-(\d+)/', url, re.S).group(1))
        pagegroup = []
        for i in range(nowpage, total_page + 1):
            link = re.sub(r'/0-\d+/', '/0-%s/' % i, url)
            pagegroup.append(link)
        return pagegroup

    # Download a page and return its source as text.
    def getsource(self, url):
        html = requests.get(url)
        return html.text

    # Cut out the <ul> block that contains the course list.
    def getclasses(self, source):
        classes = re.search('<ul class="zy_course_list">(.*?)</ul>', source, re.S).group(1)
        return classes

    # Split the course list into individual <li> entries.
    def geteach(self, classes):
        eachclasses = re.findall('<li>(.*?)</li>', classes, re.S)
        return eachclasses

    # Extract the title and the student count from one course entry.
    def getinfo(self, eachclass):
        info = {}
        info['title'] = re.search('<a title="(.*?)"', eachclass, re.S).group(1)
        info['people'] = re.search('<p class="color99">(.*?)</p>', eachclass, re.S).group(1)
        return info

    # Append all collected course records to info.txt.
    def saveinfo(self, classinfo):
        f = open('info.txt', 'a')
        for each in classinfo:
            f.writelines('title:' + each['title'] + '\n')
            f.writelines('people:' + each['people'] + '\n\n')
        f.close()

if __name__ == '__main__':
    classinfo = []
    url = 'http://www.maiziedu.com/course/list/all-all/0-1/'
    maizispider = spider()
    all_links = maizispider.changePage(url, 30)
    for each in all_links:
        htmlsources = maizispider.getsource(each)
        classes = maizispider.getclasses(htmlsources)
        eachclasses = maizispider.geteach(classes)
        for each in eachclasses:
            info = maizispider.getinfo(each)
            classinfo.append(info)
    maizispider.saveinfo(classinfo)

The code above is not difficult to understand; it is basically just regular expressions. Run it directly to see the crawled content. Because this is a single-thread crawler, it runs a little slowly; a multi-threaded version will follow in a later post.

At the request of some readers, the following appendix is attached.

Installation and a simple example of the requests library

First install the pip package management tool by downloading get-pip.py and running it. I have both Python 2 and Python 3 installed on my machine.
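If you do not already have get-pip.py, one common way to fetch it (assuming curl is available; wget works just as well) is:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py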

Install pip for Python 2:

python get-pip.py

Install pip for Python 3:

python3 get-pip.py

After pip is installed, install the requests library to start learning about Python crawlers.

Install requests

pip3 install requests

Requests installs the same way on both Python 3 and Python 2; only the pip executable name differs.
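For Python 2 the command is:

pip install requests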

Example

import requests
html = requests.get("http://gupowang.baijia.baidu.com/article/283878")
html.encoding = 'utf-8'
print(html.text)

The first line imports the requests library, the second line uses the get method of requests to fetch the page source, the third line sets the encoding, and the fourth line prints the text.
Save the obtained webpage source code to a text file:

import requests

html = requests.get("http://gupowang.baijia.baidu.com/article/283878")
html.encoding = 'utf-8'
# Write the decoded page text to news.txt.
html_file = open("news.txt", "w")
print(html.text, file=html_file)
html_file.close()
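Note that the file argument to print() is Python 3 syntax; to run this snippet under Python 2, add from __future__ import print_function at the top of the file.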
