For implementing a web crawler in Ruby, the usual choice is mechanize, which is very simple to use.
Installation
The code is as follows:
sudo gem install mechanize
Crawling web pages
The code is as follows:
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://google.com/')
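To confirm that the fetch worked, the returned Mechanize::Page object exposes the title, HTTP status code, and final URI; a minimal check:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://google.com/')

puts page.title   # text of the page's <title> element
puts page.code    # HTTP status code as a string, e.g. "200"
puts page.uri     # final URI after any redirects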
Simulating click events
The code is as follows:
page = agent.page.link_with(:text => 'News').click
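If you are unsure which link to click, you can list all the links on the current page first; links, text, and href are standard Mechanize accessors:

require 'mechanize'

agent = Mechanize.new
agent.get('http://google.com/')

# Print every link's anchor text and target before deciding what to click.
agent.page.links.each do |link|
  puts "#{link.text} -> #{link.href}"
end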
Simulating form submission
The code is as follows:
google_form = page.form('f')
google_form['q'] = 'ruby mechanize'
page = agent.submit(google_form, google_form.buttons.first)
pp page
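Forms can also be located by a criterion rather than by name; a small sketch, assuming the form has a field named q (as Google's search form does):

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://google.com/')

# Find the form by a field it contains instead of by its name.
form = page.forms.find { |f| f.has_field?('q') }
form.field_with(:name => 'q').value = 'ruby mechanize'
result = agent.submit(form, form.buttons.first)
puts result.title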
Parsing the page: mechanize uses Nokogiri to parse pages, so you can refer to the Nokogiri documentation.
The code is as follows:
table = page.search('a')
text = table.inner_text
puts text
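Because search accepts CSS selectors as well as XPath, anything in the Nokogiri documentation applies; the selectors below are illustrative, not specific to any real page:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://google.com/')

# CSS selector: all links inside a div with id "main" (illustrative selector).
page.search('div#main a').each { |a| puts a['href'] }

# XPath works too: the text of the first <h1>, if present.
h1 = page.at('//h1')
puts h1.text if h1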
One point to note: if the page you want requires logging in first, you can log in on the site in a browser, note the JSESSIONID cookie after logging in, and then assign it to the agent:
The code is as follows:
cookie = Mechanize::Cookie.new("JSESSIONID", "ba58528b76124698ad033ee6df12b986:-1")
cookie.domain = "datamirror.csdb.cn"
cookie.path = "/"
agent.cookie_jar.add!(cookie)
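Alternatively, instead of copying a cookie by hand, you can often let mechanize perform the login itself by submitting the login form; the URL, form name, and field names below are hypothetical placeholders for the target site:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://example.com/login')   # hypothetical login URL

# Form and field names are hypothetical; inspect the real login page to find them.
form = page.form_with(:name => 'login')
form.field_with(:name => 'username').value = 'me'
form.field_with(:name => 'password').value = 'secret'
agent.submit(form, form.buttons.first)

# The session cookie is now kept in agent.cookie_jar automatically.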
If you need to save the page, use .save_as (there may also be a plain save; I haven't tried it). For example:
The code is as follows:
agent.get("http://google.com").save_as
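save_as also accepts an explicit filename if you don't want one derived from the URL; the filename here is just an example:

require 'mechanize'

agent = Mechanize.new
# Save under an explicit filename instead of one derived from the URL.
agent.get('http://google.com/').save_as('google_home.html')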
Small Tips
puts Mechanize::AGENT_ALIASES prints all the available user agents.
puts Mechanize.instance_methods(false) outputs the methods defined by the Mechanize class itself.
puts Mechanize.instance_methods outputs all the methods of the Mechanize class, including those inherited from its ancestors.
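A common use of AGENT_ALIASES is to make the crawler identify itself as an ordinary browser; 'Mac Safari' below is one of the keys in that hash:

require 'mechanize'

# List the available alias names, then pick one for this agent.
puts Mechanize::AGENT_ALIASES.keys

agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'   # must be a key of AGENT_ALIASES
puts agent.user_agent                   # the full User-Agent string now in use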