Crawl target: use Scrapy to crawl all course data, namely
1. Course Name 2. Course Description 3. Course Level 4. Number of Learners
and store it in a MySQL database (destination URL: http://www.imooc.com/course/list).
I. Exporting the data to a local file
1. Create the new imooc project
scrapy startproject imooc
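For reference, the project skeleton generated by scrapy startproject looks roughly like this (the default Scrapy project template):

imooc/
    scrapy.cfg
    imooc/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py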
2. Modify items.py to define the item fields
from scrapy import Item, Field

class ImoocItem(Item):
    course_name = Field()        # course name
    course_content = Field()     # course description
    course_level = Field()       # course level
    course_attendance = Field()  # number of learners
3. Create the spider in the spiders directory
vi imooc_spider.py
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from imooc.items import ImoocItem
from scrapy.http import Request


class Imooc(CrawlSpider):
    name = 'imooc'
    allowed_domains = ['imooc.com']
    # build the start URLs for the first 30 list pages
    start_urls = []
    for pn in range(1, 31):
        url = 'http://www.imooc.com/course/list?page=%s' % pn
        start_urls.append(url)

    def parse(self, response):
        selector = Selector(response)
        courses = selector.xpath('//a[@class="course-card"]')

        for eachcourse in courses:
            item = ImoocItem()
            course_name = eachcourse.xpath('div[@class="course-card-content"]/h3[@class="course-card-name"]/text()').extract()[0]
            course_content = eachcourse.xpath('div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/p[@class="course-card-desc"]/text()').extract()
            course_level = eachcourse.xpath('div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/div[@class="course-card-info"]/span/text()').extract()[0]
            course_attendance = eachcourse.xpath('div[@class="course-card-content"]/div[@class="clearfix course-card-bottom"]/div[@class="course-card-info"]/span/text()').extract()[1]
            item['course_name'] = course_name
            item['course_content'] = ';'.join(course_content)
            item['course_level'] = course_level
            item['course_attendance'] = course_attendance
            yield item
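Before running the full crawl, the XPath expressions can be checked interactively with scrapy shell (an optional sanity check; the class names come from the selectors above and will break if the page markup changes):

scrapy shell 'http://www.imooc.com/course/list'
>>> courses = response.xpath('//a[@class="course-card"]')
>>> len(courses)
>>> courses[0].xpath('div[@class="course-card-content"]/h3[@class="course-card-name"]/text()').extract()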
4. Now you can run the crawler and export the data, testing with the CSV format first
scrapy crawl imooc -o data.csv -t csv
View Files
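To sanity-check the export without opening it in a spreadsheet, a few lines of Python are enough (a minimal sketch; data.csv is the file produced above, and the column names come from the item fields):

import csv

with open('data.csv') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        print('%s | %s' % (row['course_name'], row['course_attendance']))
        if i >= 2:  # only show the first three rows
            break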
II. Crawling the data and storing it in the MySQL database
1. Storing the data in MySQL requires the MySQLdb package, so make sure it is installed.
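If it is missing, the MySQLdb module is provided by the MySQL-python package on Python 2 (assuming pip is available):

pip install MySQL-python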
First create the database and the table:
-- create the database
CREATE DATABASE imooc DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

-- create the table
CREATE TABLE imooc_info2 (
    title   varchar(255) NOT NULL COMMENT 'course name',
    content varchar(255) NOT NULL COMMENT 'course description',
    level   varchar(255) NOT NULL COMMENT 'course level',
    sums    int          NOT NULL COMMENT 'number of learners'
);
2. Modify pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


import json
import codecs

import MySQLdb
import MySQLdb.cursors
from scrapy import log
from twisted.enterprise import adbapi


class ImoocPipeline(object):
    """Write every item to a local JSON file."""

    def __init__(self):
        self.file = codecs.open('imooc.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()


class MySQLPipeline(object):
    """Insert every item into the imooc_info2 table."""

    def __init__(self):
        self.dbpool = adbapi.ConnectionPool("MySQLdb",
            db="imooc",          # database name
            user="root",         # database user
            passwd="hwfx1234",   # password
            cursorclass=MySQLdb.cursors.DictCursor,
            charset="utf8",
            use_unicode=True
        )

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tb, item):
        tb.execute(
            "INSERT INTO imooc_info2 (title, content, level, sums) VALUES (%s, %s, %s, %s)",
            (item['course_name'], item['course_content'],
             item['course_level'], item['course_attendance']))
        log.msg("Item data in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)
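Since adbapi runs the insert asynchronously in a thread pool, connection or SQL errors only surface through handle_error. A quick standalone check of the connection and the INSERT statement can save debugging time (a minimal sketch, assuming the same credentials and table as above):

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='hwfx1234',
                       db='imooc', charset='utf8')
cur = conn.cursor()
# insert one dummy row using the same statement as the pipeline
cur.execute("INSERT INTO imooc_info2 (title, content, level, sums) VALUES (%s, %s, %s, %s)",
            ('test course', 'test description', 'beginner', 0))
conn.commit()
cur.execute("SELECT COUNT(*) FROM imooc_info2")
print(cur.fetchone()[0])  # should be at least 1
cur.close()
conn.close()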
3. Modify settings.py
Add the MySQL configuration and register the new pipeline classes from pipelines.py:
# Start MySQL database configure setting
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'imooc'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'hwfx1234'
# End of MySQL database configure setting
ITEM_PIPELINES = {
    'imooc.pipelines.ImoocPipeline': 300,
    'imooc.pipelines.MySQLPipeline': 300,
}
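Note that the numbers in ITEM_PIPELINES are priorities in the 0-1000 range and pipelines run in ascending order; with both set to 300 as above, the relative order of the two pipelines is not guaranteed. If the JSON file should always be written before the database insert, one option is:

ITEM_PIPELINES = {
    'imooc.pipelines.ImoocPipeline': 300,
    'imooc.pipelines.MySQLPipeline': 800,  # runs after the JSON pipeline
}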
4. Start the crawler
scrapy crawl imooc
Check the table in the database: the data has been stored.
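For example, from the MySQL client (using the database and table created above):

USE imooc;
SELECT COUNT(*) FROM imooc_info2;
SELECT title, level, sums FROM imooc_info2 LIMIT 5;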
Summary: this is a simple application of Scrapy; anti-crawling measures, distributed crawling and other issues have not been considered, and more practice is still needed.