Crawling the web with Scrapy: storing all IMOOC course data in a MySQL database

Source: Internet
Author: User
Tags: xpath

Crawl target: use Scrapy to crawl all IMOOC course data, specifically:

1. Course Name 2. Course Description 3. Course Level 4. Number of learners

and store it in a MySQL database (target URL: http://www.imooc.com/course/list).

I. Exporting the data to a local file

1. Create a new imooc project

scrapy startproject imooc

2. Modify items.py and define the project's Item

# -*- coding: utf-8 -*-
from scrapy import Item, Field


class ImoocItem(Item):
    course_name = Field()        # Course name
    course_content = Field()     # Course description
    course_level = Field()       # Course level
    course_attendance = Field()  # Number of learners

3. Create the spider in the spiders directory

vi imooc_spider.py

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector

from imooc.items import ImoocItem


class Imooc(CrawlSpider):
    name = 'imooc'
    allowed_domains = ['imooc.com']
    # Build the start URLs for the 30 list pages up front
    start_urls = ['http://www.imooc.com/course/list?page=%s' % pn
                  for pn in range(1, 31)]

    # Note: the original subclasses CrawlSpider but overrides parse() and
    # defines no rules, so it behaves like a plain Spider here.
    def parse(self, response):
        selector = Selector(response)
        courses = selector.xpath('//a[@class="course-card"]')

        for course in courses:
            item = ImoocItem()
            item['course_name'] = course.xpath(
                'div[@class="course-card-content"]/h3[@class="course-card-name"]/text()').extract()[0]
            item['course_content'] = ';'.join(course.xpath(
                'div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/p[@class="course-card-desc"]/text()').extract())
            item['course_level'] = course.xpath(
                'div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/div[@class="course-card-info"]/span/text()').extract()[0]
            item['course_attendance'] = course.xpath(
                'div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/div[@class="course-card-info"]/span/text()').extract()[1]
            yield item

4. Now you can run the crawler and export the data; here we first test with the CSV format

scrapy crawl imooc -o data.csv -t csv

View the exported file.
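A quick way to sanity-check the export is to peek at the first few lines of the file (a minimal sketch; the column order Scrapy writes in the header row may vary by version):

head -n 3 data.csv

The header should contain the field names defined in ImoocItem: course_attendance, course_content, course_level, course_name.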

II. Crawling the data and storing it in the MySQL database

1. To store the data in MySQL you need the MySQLdb package, so make sure it is installed.
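If it is not installed yet, MySQLdb is published on PyPI under the name MySQL-python (Python 2 only); assuming pip is available, something like this should work:

pip install MySQL-python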

First create the database and table:

-- Create the database
CREATE DATABASE imooc DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

-- Create the table
CREATE TABLE imooc_info2 (
    title   VARCHAR(255) NOT NULL COMMENT 'Course name',
    content VARCHAR(255) NOT NULL COMMENT 'Course description',
    level   VARCHAR(255) NOT NULL COMMENT 'Course level',
    sums    INT          NOT NULL COMMENT 'Number of learners'
);

2. Modify pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi
from scrapy import log


class ImoocPipeline(object):
    """Write every item to a local JSON-lines file."""

    def __init__(self):
        self.file = codecs.open('imooc.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()


class MySQLPipeline(object):
    """Insert every item into the imooc_info2 table via a Twisted connection pool."""

    def __init__(self):
        self.dbpool = adbapi.ConnectionPool(
            "MySQLdb",
            db="imooc",          # Database name
            user="root",         # Database user
            passwd="hwfx1234",   # Password
            cursorclass=MySQLdb.cursors.DictCursor,
            charset="utf8",
            use_unicode=True,
        )

    def process_item(self, item, spider):
        # Run the insert asynchronously on the connection pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tb, item):
        tb.execute(
            "INSERT INTO imooc_info2 (title, content, level, sums) "
            "VALUES (%s, %s, %s, %s)",
            (item['course_name'], item['course_content'],
             item['course_level'], item['course_attendance']))
        log.msg("Item data in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)

3. Modify settings.py

Add the MySQL config and register the new pipeline classes from pipelines.py:

# Start of MySQL database configure setting
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'imooc'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'hwfx1234'
# End of MySQL database configure setting

ITEM_PIPELINES = {
    # Lower values run first; both pipelines share 300 here as in the
    # original, so their relative order is not guaranteed.
    'imooc.pipelines.ImoocPipeline': 300,
    'imooc.pipelines.MySQLPipeline': 300,
}

4. Start the crawler

scrapy crawl imooc

Check the table in the database: the data has been stored.
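To double-check from the MySQL client, a couple of simple queries against the imooc_info2 table created above should show the crawled rows and their count:

SELECT title, level, sums FROM imooc_info2 LIMIT 10;
SELECT COUNT(*) FROM imooc_info2;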

Summary: this was a simple application of Scrapy; anti-crawler measures, distributed crawling, and similar problems were not considered, and more practice is still needed.
