[Ruby] Using Ruby to crawl and process Web pages

Source: Internet
Author: User

I am not a professional crawler developer; I only started learning web-scraping tools a while ago to help someone auto-refresh tickets on a very flaky website.
At first I used Python's urllib2 and processed the fetched pages as plain text; later I saw BeautifulSoup recommended on a forum, and that class of tool really does make this kind of processing convenient in Python. But then I met Ruby and Rails and never looked back, and after reading Metaprogramming Ruby I liked Ruby even more.
Work pressure has been light recently, so I idly wanted to crawl some stock-market data. The first problem was where to get the complete list of Shanghai and Shenzhen stock codes. Even though ready-made lists exist on the web, fetching one programmatically felt much cooler, so I found this page: http://quote.eastmoney.com/stocklist.html
After looking at it, the code did not seem complex, but since this was my first time crawling the web with Ruby, I did not know which tools to use or how to use them. After some searching I settled on open-uri and Nokogiri.
First, open-uri, which is a Ruby built-in feature. To use open-uri you only need to add require 'open-uri' to your code, and it is easy to use:

open("http://www.ruby-lang.org/en/") {|f|
  f.each_line {|line| p line}
  p f.base_uri         # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
  p f.content_type     # "text/html"
  p f.charset          # "iso-8859-1"
  p f.content_encoding # []
  p f.last_modified    # Thu Dec 05 02:45:02 UTC 2002
}


The open function yields a file object that has been extended with meta information about the Web page.
When opening a page you can also pass in options such as User-Agent, which is also very convenient to use; for example:

open("http://www.ruby-lang.org/en/",
     "User-Agent" => "Ruby/#{RUBY_VERSION}",
     "From" => "[email protected]",
     "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}

Nokogiri is a gem; it is said there was an earlier tool called Hpricot, but I only glanced at it and will not go into it here.
The first step is reading the page content with Nokogiri, which can be done in two ways:
doc = Nokogiri::HTML(open(url)); if this method runs into character-set problems, you can use doc = Nokogiri::HTML.parse(open(url), nil, "gb2312") to specify the character set.
Next is locating page elements. Nokogiri supports both XPath and CSS lookup. A brief introduction to locating via CSS:
doc.css('a') returns all <a> tag elements on the page
doc.css('.myclass div ul') returns all <ul> elements inside a <div> under elements with CSS class myclass
doc.css('div#main') returns the <div> tag element whose id is main
The above basically covers 90% of the need; if you really get stuck, ask teacher Google.

