When you crawl Weibo data, the site delivers its HTML fragments through FM.view() calls inside script tags — so how can Jsoup parse them?

Source: Internet
Author: User

Here's a small approach using Jsoup.

I only tried crawling my own home page and other people's Weibo pages; I haven't tried any other kind of crawl (sorry, I was lazy...).

The first step was to use Jsoup to fetch the page. Looking at Weibo's login flow, I found it is done via Ajax, so obtaining a cookie purely in code is a bit difficult.

So I took a shortcut: I used the IE developer tools to grab the cookies, wrote them into a Map, and then ran this code:

Java code

    Response res = Jsoup.connect("http://weibo.com").cookies(map).method(Method.POST).execute();
    String s = res.body();

Fetching this way, I found there are quite a lot of cookie entries:



Rather than typing them all by hand, you can write a small script that prints the map.put(xxx, xxx) lines for you.

I wrote one in Scala; writing the same thing in Java works just as well:

Scala code

    s.split(";").foreach(s => { val x = s.split("="); println(s"""map.put("${x(0)}", "${x(1)}");""") })

Finally I got the body... and it's a big pile of script tags. The first ones hold the fixed top section of Weibo (the navigation bar content).

What I actually needed is the <script>FM.view(...) call whose domid is pl_content_homefeed — that one carries the home page content.

So I did some simple string processing on it. I didn't use a regular expression because... er... I'm bad at writing them:

Java code

    String s = res.body();
    // System.out.println(s);
    String[] ss = s.split("<script>FM.view");
    int i = 0;
    // pl_content_homefeed
    for (String x : ss) {
        System.out.println(i++ + "======================================");
        System.out.println(x.substring(0, x.length() > 100 ? 100 : x.length()));
        System.out.println("===========================================");
    }
    String content = ss[8].split("\"html\":\"")[1]
            .replaceAll("\\\\n", "")
            .replaceAll("\\\\t", "")
            .replaceAll("\\\\", "");
    content = content.substring(0, content.length() <= 13 ? content.length() : content.length() - 13);
    System.out.println(content);

The output is the feed content displayed on the home page.
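Since I avoided regular expressions above, here is a hedged alternative sketch that extracts the "html" field with java.util.regex instead of chained split() calls. The payload shape (an FM.view(...) call carrying an escaped "html" value) follows the description above; the class and method names are my own, not from any Weibo SDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FmViewExtractor {
    // Matches the escaped "html" value inside an FM.view(...) JSON payload.
    // The group accepts either an escaped character (\\.) or any character
    // that is not a quote or backslash, so it stops at the closing quote.
    private static final Pattern HTML_FIELD =
            Pattern.compile("\"html\":\"((?:\\\\.|[^\"\\\\])*)\"");

    public static String extractHtml(String script) {
        Matcher m = HTML_FIELD.matcher(script);
        if (!m.find()) {
            return null;
        }
        // Undo the JSON escaping, mirroring the replaceAll chain above.
        return m.group(1)
                .replace("\\n", "")
                .replace("\\t", "")
                .replace("\\\"", "\"")
                .replace("\\/", "/");
    }

    public static void main(String[] args) {
        String script = "FM.view({\"ns\":\"pl.content.homeFeed.index\","
                + "\"domid\":\"Pl_Content_HomeFeed\","
                + "\"html\":\"<div class=\\\"WB_feed\\\">hello<\\/div>\"})";
        System.out.println(extractHtml(script));
        // prints: <div class="WB_feed">hello</div>
    }
}
```

This also removes the need for the fragile trailing `substring(..., length() - 13)` trim, since the match ends exactly at the closing quote of the "html" value.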

However, the output still contains \uXXXX Unicode escapes instead of Chinese characters, so a native2ascii-style tool is needed to convert them. I found one on the web:

http://soulshard.iteye.com/blog/346807

I tested it and it works:

Java code

    System.out.println(Native2AsciiUtils.ascii2native(content));

Note that the code above is hard-coded for my own home page, which is why the split result is indexed directly at 8.
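Native2AsciiUtils comes from the blog post linked above. If you'd rather not depend on it, a minimal stand-in that decodes \uXXXX escapes might look like this (a sketch under my own naming, not the original class):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Ascii2Native {
    // One \uXXXX escape: a literal backslash, 'u', then four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    public static String ascii2native(String s) {
        Matcher m = UNICODE_ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Parse the hex digits and substitute the real character.
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escapes below decode to the two characters of "Weibo" in Chinese.
        System.out.println(ascii2native("\\u5fae\\u535a hello"));
        // prints: 微博 hello
    }
}
```

Note this simple version does not handle surrogate pairs for characters outside the Basic Multilingual Plane, which is usually fine for feed text.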

If you change the POST method to GET, you can also fetch other people's Weibo pages.

Here is a way to print out all of the HTML content you fetched (tested on a few accessible pages):

Java code

    package jsouptest;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.jsoup.Connection.Method;
    import org.jsoup.Connection.Response;
    import org.jsoup.Jsoup;

    public class JsoupTest {

        public static void main(String[] args) throws IOException {
            Map<String, String> map = new HashMap<>();
            // map.put(...): fill in from your own Weibo cookies
            Response res = Jsoup.connect("http://weibo.com/u/someone else's homepage id")
                    .cookies(map).method(Method.GET).execute();
            String s = res.body();
            System.out.println(s);
            String[] ss = s.split("<script>FM.view");
            int i = 0;
            // pl_content_homefeed
            // pl.content.homeFeed.index
            List<String> list = new ArrayList<>();
            for (String x : ss) {
                System.out.println(i++ + "======================================");
                System.out.println(x.substring(0,
                        x.length() > 200 ? 200 : x.length()));
                System.out.println("===========================================");
                if (x.contains("\"html\":\"")) {
                    String value = getHtml(x);
                    list.add(value);
                    System.out.println(value);
                }
            }
            // String content = ss[8].split("\"html\":\"")[1]
            //         .replaceAll("(\\\\t|\\\\n)", "")
            //         .replaceAll("\\\\\"", "\"").replaceAll("\\\\/", "/");
            // content = content.substring(0,
            //         content.length() <= 13 ? content.length() : content.length() - 13);
            // System.out.println(Native2AsciiUtils.ascii2native(content));
        }

        public static String getHtml(String s) {
            String content = s.split("\"html\":\"")[1]
                    .replaceAll("(\\\\t|\\\\n|\\\\r)", "")
                    .replaceAll("\\\\\"", "\"")
                    .replaceAll("\\\\/", "/");
            content = content.substring(0,
                    content.length() <= 13 ? content.length() : content.length() - 13);
            return Native2AsciiUtils.ascii2native(content);
        }
    }

The crawled content should be cleaned up properly before you parse it with Jsoup.

That said, nothing terrible happens if you hand it to Jsoup directly (even though some of the tags are malformed).
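For the parsing step itself, a minimal sketch of feeding an extracted fragment to Jsoup might look like this. Jsoup's parser is lenient, so mildly malformed tags are tolerated; the WB_text selector is my assumption for illustration, not something taken from the steps above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseFragment {
    public static void main(String[] args) {
        // A stand-in for the HTML string recovered from an FM.view payload.
        String html = "<div class=\"WB_text\">First post</div>"
                + "<div class=\"WB_text\">Second post</div>";

        // Jsoup wraps the fragment in html/body tags and repairs bad markup.
        Document doc = Jsoup.parse(html);

        // Select each post container and print its visible text.
        for (Element e : doc.select("div.WB_text")) {
            System.out.println(e.text());
        }
        // prints:
        // First post
        // Second post
    }
}
```

Once you are at this stage, the usual Jsoup selectors (select, attr, text) work on the fragment exactly as they would on a fetched page.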

This is just one page-crawling strategy; I don't want to write more about it. Just be sure to crawl with your own Sina Weibo cookies.

