Here's a little idea using Jsoup.
I only tried it on my own home page and on other people's microblog pages; I didn't try crawling anything else (sorry, I'm lazy...).
First I used Jsoup to access the page and looked at the Weibo login flow. It turns out to be done via Ajax, so obtaining a cookie purely in code is a bit difficult.
So I took a shortcut: use the IE developer tools to grab the cookie, write it out as a Map, and then run this code:
Java code
Response res = Jsoup.connect("http://weibo.com").cookies(map).method(Method.POST).execute();
String s = res.body();
Running that, I found the cookie has quite a lot of entries.
You can write a small script of your own to print the map.put(xxx, xxx) lines.
I wrote one in Scala; writing the same thing in Java works just as well:
Scala code
s.split(";").foreach(s => { val x = s.split("="); println(s"""map.put("${x(0).trim}", "${x(1)}");""") })
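For completeness, here is what that same cookie-splitting one-liner might look like as plain Java (a sketch; the SUB/SUBP names below are just placeholder cookie keys, not real values):

```java
import java.util.ArrayList;
import java.util.List;

public class CookiePrinter {

    // Turn a raw "k1=v1; k2=v2" cookie header into ready-to-paste map.put(...) lines.
    public static List<String> toPutLines(String cookieStr) {
        List<String> lines = new ArrayList<>();
        for (String pair : cookieStr.split(";")) {
            String[] kv = pair.trim().split("=", 2); // limit 2: values may contain '='
            lines.add("map.put(\"" + kv[0] + "\", \"" + kv[1] + "\");");
        }
        return lines;
    }

    public static void main(String[] args) {
        // Paste the cookie string copied from the browser's developer tools here.
        toPutLines("SUB=abc; SUBP=def").forEach(System.out::println);
    }
}
```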
Finally, look at the body... it's a pile of <script> tags. At the top is the content of Weibo's fixed top bar (the navigation bar).
After some digging I found that what I needed is the <script>FM.view block whose id is pl_content_homefeed; that is the content of the home page.
Then I did some simple processing. I didn't use a regex because... er... I'm bad at writing them:
Java code
String s = res.body();
// System.out.println(s);
String[] ss = s.split("<script>FM.view");
int i = 0;
// pl_content_homefeed
for (String x : ss) {
    System.out.println(i++ + "======================================");
    System.out.println(x.substring(0, x.length() > 100 ? 100 : x.length()));
    System.out.println("===========================================");
}
String content = ss[8].split("\"html\":\"")[1]
        .replaceAll("\\\\n", "").replaceAll("\\\\t", "").replaceAll("\\\\\"", "\"");
content = content.substring(0, content.length() <= 13 ? content.length() : content.length() - 13);
System.out.println(content);
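To make the string surgery above easier to follow, here is a self-contained sketch that runs the same split/replaceAll steps on a made-up miniature of an FM.view block. The payload below is invented for illustration (the real response is much larger), and treating the trailing 13 characters as the "});</script> wrapper is an assumption:

```java
public class ExtractDemo {

    // Invented miniature of the response (assumption, for illustration only):
    // the real page embeds many <script>FM.view({...});</script> blocks.
    public static final String PAGE =
            "<html>head</html>"
            + "<script>FM.view({\"ns\":\"pl.content.homeFeed.index\","
            + "\"html\":\"<div class=\\\"WB_feed\\\">\\n\\thello<\\/div>\"});</script>";

    public static String getHtml(String s) {
        String content = s.split("\"html\":\"")[1]
                .replaceAll("(\\\\t|\\\\n|\\\\r)", "")  // drop escaped whitespace
                .replaceAll("\\\\\"", "\"")             // \" -> "
                .replaceAll("\\\\/", "/");              // \/ -> /
        // Assumed: the trailing "});</script> wrapper is 13 characters, hence -13.
        return content.substring(0,
                content.length() <= 13 ? content.length() : content.length() - 13);
    }

    public static void main(String[] args) {
        String chunk = PAGE.split("<script>FM.view")[1];
        System.out.println(getHtml(chunk)); // prints <div class="WB_feed">hello</div>
    }
}
```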
The output is the microblog content shown on the home page.
However, the Unicode escapes in this output are not translated into Chinese characters; you need something like the native2ascii tool. I found a page for that:
http://soulshard.iteye.com/blog/346807
I tested it and it works:
Java code
System.out.println(Native2AsciiUtils.ascii2native(content));
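If you would rather not copy the utility class from the linked post, a minimal hand-rolled ascii2native can look like this (a sketch only; the real Native2AsciiUtils from the link may handle edge cases differently):

```java
public class UnicodeDecoder {

    // Replace every \uXXXX escape in the string with the character it encodes.
    public static String ascii2native(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (i + 5 < s.length() && s.charAt(i) == '\\' && s.charAt(i + 1) == 'u') {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6; // skip the whole \uXXXX sequence
            } else {
                sb.append(s.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(ascii2native("\\u5fae\\u535a")); // prints 微博
    }
}
```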
Note that the code above is hard-wired to my own home page, which is why it uses index 8 directly when slicing.
Changing the POST method to GET also lets you fetch other people's Weibo pages.
Here is a way to print out all of the HTML content you've fetched (tried on a few of the pages available to me):
Java code
package jsouptest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Connection.Method;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

public class JsoupTest {

    public static void main(String[] args) throws IOException {
        Map<String, String> map = new HashMap<>();
        // map.put(...): fill in according to your own Weibo cookie
        Response res = Jsoup.connect("http://weibo.com/u/someone else's homepage id")
                .cookies(map).method(Method.GET).execute();
        String s = res.body();
        System.out.println(s);
        String[] ss = s.split("<script>FM.view");
        int i = 0;
        // pl_content_homefeed
        // Pl.content.homeFeed.index
        List<String> list = new ArrayList<>();
        for (String x : ss) {
            System.out.println(i++ + "======================================");
            System.out.println(x.substring(0,
                    x.length() > 200 ? 200 : x.length()));
            System.out.println("===========================================");
            if (x.contains("\"html\":\"")) {
                String value = getHtml(x);
                list.add(value);
                System.out.println(value);
            }
        }
    }

    public static String getHtml(String s) {
        String content = s.split("\"html\":\"")[1]
                .replaceAll("(\\\\t|\\\\n|\\\\r)", "")
                .replaceAll("\\\\\"", "\"")
                .replaceAll("\\\\/", "/");
        content = content.substring(0,
                content.length() <= 13 ? content.length() : content.length() - 13);
        return Native2AsciiUtils.ascii2native(content);
    }
}
The crawled content should really be formatted properly before you parse it with Jsoup.
That said, nothing went wrong when I tried parsing it directly (even though there are some broken tags).
This is just a page-crawling strategy; I don't want to write much more about it. Use your own Sina Weibo cookie when you crawl.
When you crawl a microblog's data, the page delivers its HTML markup through those FM.view calls, so how should Jsoup parse it?