A great toolkit for writing crawlers: Groovy + Jsoup + Sublime


I have written quite a few small crawler programs, mostly using C# with Html Agility Pack. Because the .NET FCL provides only the low-level HttpWebRequest and the mid-level WebClient, a lot of code has to be written by hand for HTTP operations. On top of that, writing C# requires Visual Studio, a heavyweight tool, which has long meant low development efficiency.

A recent project introduced me to a wonderful language: Groovy, a dynamic language fully compatible with Java that adds a great deal of extra syntactic convenience. Combined with Jsoup, an open-source, lightweight library that parses HTML using CSS selectors, writing crawlers becomes a breeze.

Grab the news titles from the Cnblogs homepage

import org.jsoup.Jsoup  // Jsoup must be on the classpath, e.g. via @Grab('org.jsoup:jsoup:1.15.3')

Jsoup.connect("http://cnblogs.com").get()
     .select("#post_list > div > div.post_item_body > h3 > a")
     .each { println it.text() }

Output

Grab the news details from the Cnblogs homepage

Jsoup.connect("http://cnblogs.com").get().select("#post_list > div").take(5).each {
    def url         = it.select("> div.post_item_body > h3 > a").attr("href")
    def title       = it.select("> div.post_item_body > h3 > a").text()
    def description = it.select("> div.post_item_body > p").text()
    def author      = it.select("> div.post_item_body > div > a").text()
    def comments    = it.select("> div.post_item_body > div > span.article_comment > a").text()
    def views       = it.select("> div.post_item_body > div > span.article_view > a").text()

    println ""
    println "News: $title"
    println "Link: $url"
    println "Description: $description"
    println "Author: $author, Comments: $comments, Views: $views"
}

Output

Convenient, isn't it? It feels just like writing front-end JavaScript and jQuery code!

Here's a tip: when writing CSS selectors, you can use Google Chrome's developer tools to inspect an element and copy its selector.
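If you want to sanity-check a selector before hitting the network, Jsoup can also parse an HTML string directly. A minimal sketch (the HTML snippet below is a made-up stand-in for the Cnblogs markup, and the @Grab version is just one recent release):

```groovy
// Fetch Jsoup from Maven Central at script startup.
@Grab('org.jsoup:jsoup:1.15.3')
import org.jsoup.Jsoup

// A tiny stand-in for the Cnblogs list markup, just to exercise the selector.
def html = '''
<div id="post_list">
  <div class="post_item">
    <div class="post_item_body">
      <h3><a href="http://example.com/1">First post</a></h3>
    </div>
  </div>
</div>
'''

// Parse the string instead of connecting, then apply the same selector.
def titles = Jsoup.parse(html)
                  .select("#post_list > div > div.post_item_body > h3 > a")
                  *.text()

assert titles == ['First post']
```

Once the selector behaves as expected against the sample markup, swap Jsoup.parse(html) for Jsoup.connect(url).get().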

Now let's look at how quickly Groovy handles JSON and XML. In a word: extremely convenient.

Grab Cnblogs's feeds

new XmlSlurper().parse("http://feed.cnblogs.com/blog/sitehome/rss").with { xml ->
    def title = xml.title.text()
    def subtitle = xml.subtitle.text()
    def updated = xml.updated.text()

    println "Feeds"
    println "title    $title"
    println "subtitle $subtitle"
    println "updated  $updated"

    def entryList = xml.entry.take(3).collect {
        def id = it.id.text()
        def subject = it.title.text()
        def summary = it.summary.text()
        def author = it.author.name.text()
        def published = it.published.text()
        [id, subject, summary, author, published]
    }.each {
        println ""
        println "article ${it[1]}"
        println it[0]
        println "author ${it[3]}"
    }
}

Output
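The same GPath navigation works on any XML string via parseText, which is handy for experimenting without a network connection. A small sketch using made-up feed content (in Groovy 2.x the class lives in groovy.util and needs no import):

```groovy
import groovy.xml.XmlSlurper  // groovy.util.XmlSlurper on Groovy 2.x

// Invented Atom-like content, just to demonstrate GPath navigation.
def feed = '''
<feed>
  <title>Demo feed</title>
  <entry><title>Hello</title><author><name>alice</name></author></entry>
  <entry><title>World</title><author><name>bob</name></author></entry>
</feed>
'''

def xml = new XmlSlurper().parseText(feed)

// Element access reads like a path expression; repeated elements act as a list.
assert xml.title.text() == 'Demo feed'
assert xml.entry.collect { it.author.name.text() } == ['alice', 'bob']
```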

Grab MSDN subscription product category information

import groovy.json.JsonSlurper

new JsonSlurper().parse(new URL("http://msdn.microsoft.com/en-us/subscriptions/json/GetProductCategories?brand=msdn&localecode=en-us")).with { rs ->
    println rs.collect { it.Name }
}

Output
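JsonSlurper can likewise parse a literal string with parseText, so you can explore a payload offline. A minimal sketch (the payload below is invented, merely mirroring the shape of the MSDN response):

```groovy
import groovy.json.JsonSlurper

// Made-up JSON payload: an array of objects with a "Name" property.
def json = '[{"Name":"Visual Studio"},{"Name":"Windows Server"}]'

// parseText returns plain Lists and Maps, so GPath-style access just works.
def rs = new JsonSlurper().parseText(json)
assert rs.collect { it.Name } == ['Visual Studio', 'Windows Server']
```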

Now a word about the code editor. Since Groovy is a dynamic language, a lightweight text editor is enough, and here I recommend Sublime Text. The name roughly translates as "lofty and exalted", and the rich features and great user experience this small editor delivers truly live up to it.

Advantages:

    • Lightweight (the client is about 6 MB)
    • Syntax highlighting for many languages, including Groovy
    • Custom theme packs (color schemes)
    • Column editing
    • Quick selection, expand selection, and more

Disadvantages:

    • Not free and not open source. Fortunately, the trial version can be used without restriction; it just occasionally pops up a dialog box when saving

Finally, here is a quick script that crawls SouFun housing listings:

Http://noria.codeplex.com/SourceControl/latest#miles/soufun/soufun.groovy

The results after crawling and cleanup:

That's all for now. I hope this article is of some help to readers interested in crawlers.

Http://www.cnblogs.com/stainboy/p/make-crawler-with-groovy-and-jsoup.html

