Here's a little idea using Jsoup.
I only tried it on my own home page and on other people's microblog pages; I didn't try crawling anything else (sorry, I'm lazy...).
First I used Jsoup to access the page and looked at the Weibo login flow. It turns out to be done via Ajax, so obtaining a cookie purely in code is a bit difficult.
So I took a shortcut: use the IE developer tools to grab the cookie, write it out as a Map, and then run this code:
Java code
Response res = Jsoup.connect("http://weibo.com").cookies(map).method(Method.POST).execute();
String s = res.body();
Running that, I found the cookie has quite a lot of entries.
You can write a small script of your own to print the map.put(xxx, xxx) lines.
I wrote one in Scala; writing the same thing in Java works just as well:
Scala code
s.split(";").foreach(s => { val x = s.split("="); println(s"""map.put("${x(0).trim}", "${x(1)}");""") })
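For completeness, here is what that same cookie-splitting one-liner might look like as plain Java (a sketch; the SUB/SUBP names below are just placeholder cookie keys, not real values):

```java
import java.util.ArrayList;
import java.util.List;

public class CookiePrinter {

    // Turn a raw "k1=v1; k2=v2" cookie header into ready-to-paste map.put(...) lines.
    public static List<String> toPutLines(String cookieStr) {
        List<String> lines = new ArrayList<>();
        for (String pair : cookieStr.split(";")) {
            String[] kv = pair.trim().split("=", 2); // limit 2: values may contain '='
            lines.add("map.put(\"" + kv[0] + "\", \"" + kv[1] + "\");");
        }
        return lines;
    }

    public static void main(String[] args) {
        // Paste the cookie string copied from the browser's developer tools here.
        toPutLines("SUB=abc; SUBP=def").forEach(System.out::println);
    }
}
```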
Finally, look at the body... it's a pile of <script> tags. At the top is the content of Weibo's fixed top bar (the navigation bar).
After some digging I found that what I needed is the <script>FM.view block whose id is pl_content_homefeed; that is the content of the home page.
Then I did some simple processing. I didn't use a regex because... er... I'm bad at writing them:
Java code
String s = res.body();
// System.out.println(s);
String[] ss = s.split("<script>FM.view");
int i = 0;
// pl_content_homefeed
for (String x : ss) {
    System.out.println(i++ + "======================================");
    System.out.println(x.substring(0, x.length() > 100 ? 100 : x.length()));
    System.out.println("===========================================");
}
String content = ss[8].split("\"html\":\"")[1]
        .replaceAll("\\\\n", "").replaceAll("\\\\t", "").replaceAll("\\\\\"", "\"");
content = content.substring(0, content.length() <= 13 ? content.length() : content.length() - 13);
System.out.println(content);
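To make the string surgery above easier to follow, here is a self-contained sketch that runs the same split/replaceAll steps on a made-up miniature of an FM.view block. The payload below is invented for illustration (the real response is much larger), and treating the trailing 13 characters as the "});</script> wrapper is an assumption:

```java
public class ExtractDemo {

    // Invented miniature of the response (assumption, for illustration only):
    // the real page embeds many <script>FM.view({...});</script> blocks.
    public static final String PAGE =
            "<html>head</html>"
            + "<script>FM.view({\"ns\":\"pl.content.homeFeed.index\","
            + "\"html\":\"<div class=\\\"WB_feed\\\">\\n\\thello<\\/div>\"});</script>";

    public static String getHtml(String s) {
        String content = s.split("\"html\":\"")[1]
                .replaceAll("(\\\\t|\\\\n|\\\\r)", "")  // drop escaped whitespace
                .replaceAll("\\\\\"", "\"")             // \" -> "
                .replaceAll("\\\\/", "/");              // \/ -> /
        // Assumed: the trailing "});</script> wrapper is 13 characters, hence -13.
        return content.substring(0,
                content.length() <= 13 ? content.length() : content.length() - 13);
    }

    public static void main(String[] args) {
        String chunk = PAGE.split("<script>FM.view")[1];
        System.out.println(getHtml(chunk)); // prints <div class="WB_feed">hello</div>
    }
}
```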
The output is the microblog content shown on the home page.
However, the Unicode escapes in this output are not translated into Chinese characters; you need something like the native2ascii tool. I found a page for that:
http://soulshard.iteye.com/blog/346807
I tested it and it works:
Java code
System.out.println(Native2AsciiUtils.ascii2native(content));
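If you would rather not copy the utility class from the linked post, a minimal hand-rolled ascii2native can look like this (a sketch only; the real Native2AsciiUtils from the link may handle edge cases differently):

```java
public class UnicodeDecoder {

    // Replace every \uXXXX escape in the string with the character it encodes.
    public static String ascii2native(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (i + 5 < s.length() && s.charAt(i) == '\\' && s.charAt(i + 1) == 'u') {
                sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6; // skip the whole \uXXXX sequence
            } else {
                sb.append(s.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(ascii2native("\\u5fae\\u535a")); // prints 微博
    }
}
```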
Note that the code above is hard-wired to my own home page, which is why it uses index 8 directly when slicing.
Changing the POST method to GET also lets you fetch other people's Weibo pages.
Here is a way to print out all of the HTML content you've fetched (tried on a few of the pages available to me):
Java code
package jsouptest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Connection.Method;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

public class JsoupTest {

    public static void main(String[] args) throws IOException {
        Map<String, String> map = new HashMap<>();
        // map.put(...): fill in according to your own Weibo cookie
        Response res = Jsoup.connect("http://weibo.com/u/someone else's homepage id")
                .cookies(map).method(Method.GET).execute();
        String s = res.body();
        System.out.println(s);
        String[] ss = s.split("<script>FM.view");
        int i = 0;
        // pl_content_homefeed
        // Pl.content.homeFeed.index
        List<String> list = new ArrayList<>();
        for (String x : ss) {
            System.out.println(i++ + "======================================");
            System.out.println(x.substring(0,
                    x.length() > 200 ? 200 : x.length()));
            System.out.println("===========================================");
            if (x.contains("\"html\":\"")) {
                String value = getHtml(x);
                list.add(value);
                System.out.println(value);
            }
        }
    }

    public static String getHtml(String s) {
        String content = s.split("\"html\":\"")[1]
                .replaceAll("(\\\\t|\\\\n|\\\\r)", "")
                .replaceAll("\\\\\"", "\"")
                .replaceAll("\\\\/", "/");
        content = content.substring(0,
                content.length() <= 13 ? content.length() : content.length() - 13);
        return Native2AsciiUtils.ascii2native(content);
    }
}
The crawled content should really be formatted properly before you parse it with Jsoup.
That said, nothing went wrong when I tried parsing it directly (even though there are some broken tags).
This is just a page-crawling strategy; I don't want to write much more about it. Use your own Sina Weibo cookie when you crawl.
When you crawl a microblog's data, the page delivers its HTML markup through those FM.view calls, so how should Jsoup parse it?