I have written quite a few small crawler programs, mostly in C# with Html Agility Pack. Because the .NET FCL offers only the low-level HttpWebRequest and the mid-level WebClient, HTTP operations take a lot of code to write. On top of that, writing C# means using Visual Studio, a heavyweight tool, which has always felt like low-efficiency development.
A recent project introduced me to a wonderful language: Groovy, a dynamic language that is fully compatible with Java and adds a great deal of extra syntactic convenience. Combined with the open-source Jsoup project, a lightweight library that parses HTML using CSS selectors, writing crawlers becomes a breeze.
Script to grab the news titles from the cnblogs homepage
Jsoup.connect("http://cnblogs.com").get().select("#post_list > div > div.post_item_body > h3 > a").each {
    println it.text()
}
Output
Grab the cnblogs homepage news details
Jsoup.connect("http://cnblogs.com").get().select("#post_list > div").take(5).each {
    def url = it.select("> div.post_item_body > h3 > a").attr("href")
    def title = it.select("> div.post_item_body > h3 > a").text()
    def description = it.select("> div.post_item_body > p").text()
    def author = it.select("> div.post_item_body > div > a").text()
    def comments = it.select("> div.post_item_body > div > span.article_comment > a").text()
    def view = it.select("> div.post_item_body > div > span.article_view > a").text()
    println ""
    println "News: $title"
    println "Link: $url"
    println "Description: $description"
    println "Author: $author, Comments: $comments, Views: $view"
}
Output
Convenient, isn't it? It feels just like writing front-end JavaScript and jQuery code.
Here's a tip: when writing CSS selectors, you can use Google Chrome's developer tools. Right-click an element on the page, choose Inspect, then use Copy > Copy selector to get a selector you can paste straight into select().
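A copied selector can also be checked against a static HTML snippet before pointing it at the live site. A minimal sketch, where the markup below is a hand-written, simplified stand-in for the real cnblogs post-list structure:

```groovy
// Simplified, made-up stand-in for the cnblogs post-list markup,
// used only to verify the selector offline. @Grab fetches Jsoup from Maven Central.
@Grab('org.jsoup:jsoup:1.15.3')
import org.jsoup.Jsoup

def html = '''
<div id="post_list">
  <div class="post_item">
    <div class="post_item_body">
      <h3><a href="http://example.com/1">Sample title</a></h3>
    </div>
  </div>
</div>
'''

// Jsoup.parse works on a string exactly like connect(...).get() works on a URL
def links = Jsoup.parse(html)
                 .select('#post_list > div > div.post_item_body > h3 > a')
links.each { println it.text() }   // prints: Sample title
```

Once the selector returns what you expect here, swapping Jsoup.parse(html) for Jsoup.connect(url).get() is all that changes.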
Now take a look at how quickly Groovy handles JSON and XML. In a word: as convenient as it gets.
Grab the cnblogs feed
new XmlSlurper().parse("http://feed.cnblogs.com/blog/sitehome/rss").with { xml ->
    def title = xml.title.text()
    def subtitle = xml.subtitle.text()
    def updated = xml.updated.text()
    println "Feeds"
    println "title $title"
    println "subtitle $subtitle"
    println "updated $updated"
    def entryList = xml.entry.take(3).collect {
        def id = it.id.text()
        def subject = it.title.text()
        def summary = it.summary.text()
        def author = it.author.name.text()
        def published = it.published.text()
        [id, subject, summary, author, published]
    }.each {
        println ""
        println "article ${it[1]}"
        println it[0]
        println "author ${it[3]}"
    }
}
Output
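The same XmlSlurper pattern also works on a local string via parseText, which is handy for experimenting without hitting the network. A minimal sketch with made-up feed data:

```groovy
// Made-up miniature feed, parsed offline with parseText instead of parse(URL).
// (XmlSlurper is auto-imported from groovy.util in Groovy 2.x; in Groovy 3+
// it lives in groovy.xml.)
def feed = '''
<feed>
  <title>Sample Feed</title>
  <entry>
    <title>First post</title>
    <author><name>alice</name></author>
  </entry>
</feed>
'''

new XmlSlurper().parseText(feed).with { xml ->
    // GPath expressions navigate the tree by element name
    println xml.title.text()                 // prints: Sample Feed
    println xml.entry[0].title.text()        // prints: First post
    println xml.entry[0].author.name.text()  // prints: alice
}
```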
Grab MSDN subscription product category information
import groovy.json.JsonSlurper

new JsonSlurper().parse(new URL("http://msdn.microsoft.com/en-us/subscriptions/json/GetProductCategories?brand=msdn&localecode=en-us")).with { rs ->
    println rs.collect { it.Name }
}
Output
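JsonSlurper has the same offline counterpart, parseText, so you can prototype against a captured response. A minimal sketch; the JSON below is invented, shaped like the MSDN category list (an array of objects with a Name field):

```groovy
import groovy.json.JsonSlurper

// Invented sample shaped like the MSDN response, so the same
// collect { it.Name } expression extracts the category names.
def json = '''[
  {"Name": "Visual Studio"},
  {"Name": "Windows"},
  {"Name": "SQL Server"}
]'''

new JsonSlurper().parseText(json).with { rs ->
    println rs.collect { it.Name }   // prints: [Visual Studio, Windows, SQL Server]
}
```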
A word about the code editor. Since Groovy is a dynamic language, you can work in a lightweight text editor, and here I recommend Sublime Text. The name roughly translates as "lofty and exalted", and the rich features and excellent user experience of this small editor truly live up to it.
Advantages:
- Lightweight (the client is about 6 MB)
- Syntax highlighting for many languages, including Groovy
- Customizable theme packs (color schemes)
- Column editing
- Quick selection, expand selection, and more
Disadvantages:
- Not free and not open source. Fortunately, the trial version can be used without restriction; it just occasionally pops up a dialog when saving
Finally, here is a quick script that crawls SouFun housing listings:
http://noria.codeplex.com/SourceControl/latest#miles/soufun/soufun.groovy
The results after crawling and tidying up:
That's all for now; I hope this is of some help to readers interested in crawlers.
Original post: http://www.cnblogs.com/stainboy/p/make-crawler-with-groovy-and-jsoup.html
"Crawler Artifact: Groovy + Jsoup + Sublime" (repost)