For the basics, see the previous post: http://blog.csdn.net/u013082989/article/details/51176073
I. What will be crawled:
(1) Subject types, courses, the chapters of each course, and the reference materials of each course (chapters are crawled from their parent course page; crawling the reference materials is more troublesome and is covered further below).
Course chapters:
Course materials:
Textbook content:
II. Entity class design:
(1) A course class, a chapter class (one-to-many with the course), and a reference-material class (one-to-many with the course); the Hibernate mapping files are not explained here.
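As a rough sketch only (the field names and constructors below are assumptions inferred from how the classes are used in the crawler code later in this post; getters, setters and the Hibernate mapping files are omitted), the course and chapter entities might look like this:

// SpiderCourse.java - assumed shape of the course entity
public class SpiderCourse {
    private Long id;
    private String name;            // course name
    private String url;             // course detail page
    private String info;            // teacher / school text
    private String professionType;  // subject type the course belongs to

    public SpiderCourse() {}

    public SpiderCourse(String name, String url, String info, String professionType) {
        this.name = name;
        this.url = url;
        this.info = info;
        this.professionType = professionType;
    }
}

// SpiderChapter.java - assumed shape of the chapter entity; many chapters belong to one course
public class SpiderChapter {
    private Long id;
    private String num;          // chapter number, e.g. 1.1
    private String name;         // chapter title
    private String url;
    private String courseName;
    private SpiderCourse course; // many-to-one side of the mapping

    public SpiderChapter() {}

    public SpiderChapter(String num, String name, String url, String courseName, SpiderCourse course) {
        this.num = num;
        this.name = name;
        this.url = url;
        this.courseName = courseName;
        this.course = course;
    }
}

The reference-material entity (SpiderDocument) follows the same pattern on the other one-to-many relationship.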
III. Crawling the courses and their chapters
(1) As in the previous post, implement the PageProcessor interface and its methods:
public class CourseSpider implements PageProcessor
(2) Analyzing the subject types: each one is an <li> tag containing a hyperlink. There are only a few subject types, so I wrote a small crawler to put them into the database (they could honestly have been entered by hand). Basically, the links shown below are collected as crawl targets and then requested.
(3) Next, analyze the course list. Taking philosophy as an example, the link is: domain + /category/01, but the list is paginated.
Clicking the second page shows that the link becomes: domain + /category/01/2/24, so the 2 corresponds to the second page and the 24 to the number of records across the first two pages.
So we can load all the content at once simply by adjusting the URL (for example /category/01/0/1000).
(4) Analyzing the course chapters: each <li> tag corresponds to one chapter, which is also fairly simple.
(5) Implementation code:
1. Obtain the services that are needed:
private ApplicationContext ac = new ClassPathXmlApplicationContext("applicationContext.xml");
// get the services
private ISpiderCourseService spiderCourseService = (ISpiderCourseService) ac.getBean("spiderCourseServiceImpl");
private ISpiderProfessionTypeService spiderProfessionTypeService = (ISpiderProfessionTypeService) ac.getBean("spiderProfessionTypeServiceImpl");
private ISpiderChapterService spiderChapterService = (ISpiderChapterService) ac.getBean("spiderChapterServiceImpl");
private ISpiderDocumentService spiderDocumentService = (ISpiderDocumentService) ac.getBean("spiderDocumentServiceImpl");
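For reference, a minimal sketch of what one of these service interfaces might look like; only the methods the crawler actually calls appear here, and everything else about the real interfaces is an assumption:

import java.util.List;

public interface ISpiderCourseService {
    // called whenever a crawled course should be persisted
    void save(SpiderCourse course);
    // findAll() is the pattern the profession-type service uses to list the stored subject types
    List<SpiderCourse> findAll();
}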
2. Define the Site; set the timeout a little longer, because we request far more records at once by modifying the URL, and a short timeout may cause errors.
private Site site = Site.me().setRetryTimes(5).setSleepTime(1000).setTimeOut(23000);
3. The process method: when the page URL matches the corresponding regular expression, the request enters the first or second layer of the crawl and the matching method is called.
@Override
public void process(Page page) {
    // format: http://mooc.chaoxing.com/category/01/0/1000
    if (page.getUrl().regex("http://mooc\\.chaoxing\\.com/category/\\d+/\\d/\\d+").toString() != null) {
        System.out.println("first layer");
        crawerCourse(page);
    }
    // format: http://mooc.chaoxing.com/course/55672.html
    else if (page.getUrl().regex("http://mooc\\.chaoxing\\.com/course/\\d+\\.html").toString() != null) {
        System.out.println("second layer");
        crawCourseInfo(page);
    }
}
4. The method that crawls the course information:
I won't go into every detail of parsing the HTML; I use XPath here, and WebMagic provides plenty of other extraction methods as well.
We wrap the crawled information in a course object and add each course URL as a link to be crawled (the second layer). The course object is placed into the Request (because the second layer, which crawls the chapters, needs to find its corresponding course), and the request priority is set to 1 (the smaller the value, the sooner it is crawled, so the first layer finishes before the second layer starts).
/**
 * Crawl the course information
 */
public void crawerCourse(Page page) {
    // <div class="label">Philosophy</div>
    // extract the subject type
    String professionType = page.getHtml().xpath("//div[@class='label']/text()").toString();
    // each course is an <li> like:
    // <li class="ans-slow-anim">
    //   <div class="picArea ans-slow-anim"><a href="/course/198413.html" target="_blank"></a></div>
    //   <div class="introArea"><a href="/course/198413.html" target="_blank" title="...">...</a></div>
    //   <div class="introArea2" title="teacher / school ..."></div>
    // </li>
    // extract the course names
    List<String> courseNameList = page.getHtml().xpath("//div[@class='introArea']/a/html()").all();
    // extract the course urls
    List<String> courseUrlList = page.getHtml().xpath("//div[@class='introArea']/a/@href").all();
    // extract the teacher / school information
    List<String> infoList = page.getHtml().xpath("//div[@class='introArea2']/@title").all();
    if (courseNameList.size() > 0) {
        for (int i = 0; i < courseNameList.size(); i++) {
            SpiderCourse model = new SpiderCourse(courseNameList.get(i).toString().trim(),
                    courseUrlList.get(i).toString().trim(), infoList.get(i).toString(), professionType);
            spiderCourseService.save(model);
            // add the course page as a second-layer request with priority 1,
            // carrying the course object so the chapter crawl can find it
            page.addTargetRequest(new Request(courseUrlList.get(i)).setPriority(1).putExtra("courseModel", model));
        }
    }
    // add every category page (loading 1000 records at a time) as a first-layer request with priority 0
    List<SpiderProfessionType> list = spiderProfessionTypeService.findAll();
    for (int j = 2; j < list.size(); j++) {
        page.addTargetRequest(new Request(list.get(j).getUrl() + "/0/1000").setPriority(0));
    }
}
5. The method that crawls the chapters of each course:
The course object set by the upper layer can be retrieved with
page.getRequest().getExtra("courseModel");
since every Page carries the Request that produced it.
/**
 * Crawl the chapters of each course
 */
public void crawCourseInfo(Page page) {
    // get the course object passed down from the upper layer
    SpiderCourse courseModel = (SpiderCourse) page.getRequest().getExtra("courseModel");
    // <div class="mt10 f33 l g5"><span>course name</span></div>
    // extract the course name
    String courseName = page.getHtml().xpath("//div[@class='mt10 f33 l g5']/span/text()").toString();
    // each chapter is an <li> like:
    // <li class="mb15 course_section fix">
    //   <a class="wh" href="/nodedetailcontroller/visitnodedetail?knowledgeid=789300">
    //     <div class="f16 chapter_index l">1.1</div>
    //     <div class="f16 pct80 pr10 r">chapter title</div>
    //   </a>
    // </li>
    // extract the chapter urls
    List<String> chapterUrlList = page.getHtml().xpath("//li[@class='mb15 course_section fix']/a[@class='wh']/@href").all();
    // extract the chapter numbers
    List<String> chapterNumList = page.getHtml().xpath("//div[@class='f16 chapter_index l']/text()").all();
    // extract the chapter names
    List<String> chapterNameList = page.getHtml().xpath("//div[@class='f16 pct80 pr10 r']/text()").all();
    if (chapterUrlList.size() > 0) {
        for (int i = 0; i < chapterUrlList.size(); i++) {
            SpiderChapter model = new SpiderChapter(chapterNumList.get(i).toString(),
                    chapterNameList.get(i).toString(), chapterUrlList.get(i).toString(),
                    courseName, courseModel);
            spiderChapterService.save(model);
        }
    }
}
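With the PageProcessor in place, here is a minimal sketch of how the spider might be launched. The post does not show this part, so the start URL and thread count are assumptions (in practice the category URLs are generated from the profession types stored in the database); added to the CourseSpider class, it might look like:

public static void main(String[] args) {
    Spider.create(new CourseSpider())
            // start from the philosophy category, requesting up to 1000 records in one page
            .addUrl("http://mooc.chaoxing.com/category/01/0/1000")
            .thread(5)
            .run();
}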
That completes crawling the courses and their chapters; after some testing, the next part crawls the reference materials.
IV. Crawling the reference materials for each course (this part is more complex)
(1) Analyzing the page
The reference materials are displayed through an iframe that nests another HTML page. Trying the same approach to extract the information fails; debugging shows that the HTML is assembled by JavaScript, and the crawler never sees the HTML produced after the JS runs, which makes this troublesome. There are tools available online for obtaining the post-JS HTML, but they would certainly reduce crawling efficiency.
Looking at all the node information under the enclosing div, we find the iframe, and the iframe has a data attribute holding a piece of JSON; the page evidently parses this JSON and splices together the HTML that displays the reference material. (Fields such as bookName and author in the JSON are Unicode-escaped; at first I struggled to turn them into Chinese characters, but it is just an encoding and needs no special handling once parsed.)
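A small sketch of that last point, assuming json-lib (the JSONObject used in the code below) is on the classpath: once the JSON is parsed, the \uXXXX escapes already come out as Chinese characters, so no extra decoding is needed.

import net.sf.json.JSONObject;

public class UnicodeEscapeDemo {
    public static void main(String[] args) {
        // a shortened, hypothetical sample of the iframe's data attribute
        String data = "{\"bookName\":\"\\u5148\\u79e6\\u54f2\\u5b66\"}";
        JSONObject json = JSONObject.fromObject(data);
        // prints the decoded Chinese book title
        System.out.println(json.getString("bookName"));
    }
}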
Accessing the URL found in the JSON directly produces an error; analyzing the requests the page actually sends shows that the final URL is also assembled by JS, so we transform the URL before using it. With that, the analysis is complete.
(2) Code:
There isn't much code, but the analysis behind it was quite troublesome.
/*
 * Crawl the reference documents for the course. This part is special: the content sits in an iframe,
 * and analysis shows the iframe has a data attribute in JSON format which the site turns into HTML
 * via JS. Chinese characters in the JSON are Unicode-escaped.
 *
 * Sample of the JSON string in the data attribute:
 * {"readUrl":"http://resapi.chaoxing.com/realRead?dxid=000006873411&ssid=12553309&d=bd6eecd6198fdd693fd0e87f715b5f05",
 *  "coverUrl":"http://cover.duxiu.com/cover/Cover.dll?iid=6768656b6b696569666f3839393335393236",
 *  "bookName":"\u5148\u79e6\u54f2\u5b66",
 *  "author":"\u66fe\u4ed5\u793c\u7f16\u8457",
 *  "publisher":"\u6606\u660e\u5e02\uff1a\u4e91\u5357\u5927\u5b66\u51fa\u7248\u793e",
 *  "publishDate":"2009.09",
 *  "id":"ext-gen1223"}
 */
// JSONObject here is net.sf.json.JSONObject from json-lib
List<String> allInfoList = page.getHtml().xpath("//iframe/@data").all();
if (allInfoList.size() > 0) {
    for (int i = 0; i < allInfoList.size(); i++) {
        // string to JSON
        JSONObject json = JSONObject.fromObject(allInfoList.get(i).toString());
        String realUrl = json.getString("readUrl");
        // the readUrl (http://resapi.chaoxing.com/realRead?dxid=...&ssid=...&d=...) cannot be opened directly:
        // replace realRead with innerurl and append the suffix that the page's JS adds
        realUrl = realUrl.replace("realRead", "innerurl")
                + "&unitid=7719&readstyle=4&tp=flip&rotate=true&cpage=1";
        SpiderDocument model = new SpiderDocument(json.getString("bookName"), realUrl,
                json.getString("author"), json.getString("publisher"),
                json.getString("publishDate"), json.getString("coverUrl"), courseModel);
        spiderDocumentService.save(model);
    }
}
V. Testing
The information corresponds correctly.
VI. Summary
Although the crawler ran into some trouble, I still enjoyed the process and learned a lot from it.
WebMagic crawler framework with the Java EE SSH framework: saving data to the database (II)