Crawling Sina Weibo Home Page Data in Java (Crawler)


I've been planning a project recently, and accumulating data is the first step of the preparation, so writing a crawler to pull useful information off web pages was unavoidable. Since I want the data to be organized around topics and the volume to be fairly large, and since Weibo is visited frequently and its topics are likely to meet the follow-up requirements, I decided to crawl Sina Weibo data.

That raises two questions:

Question 1: which language to use. Java is what I use most; with HttpURLConnection and the like it is easy to fetch a URL's data with a GET or POST request, so I'll do it in Java first and consider trying Python later.

Question 2: crawling Sina Weibo requires logging in, and the login flow takes some fairly involved analysis. The write-ups I found online had all stopped working. Since I was eager to start crawling, I decided to set the cookie by hand first, bypass the login flow, get the crawl running, and study the login procedure later. (A Sina Weibo cookie is generally valid for about 30 minutes.)

With the two questions above solved (temporarily), the first step is fetching the page content, starting with the home page.

Home page address: http://weibo.com/u/1596867051/home?wvr=5&lf=reg

Use HttpFox to monitor every request during the login process.

Note the first request, which is the URL of the home page. It is a GET request (obviously), and the request headers are shown in the lower-left corner of HttpFox. The one that matters is Cookie: as long as the crawler sets its Cookie header to this value, the request will go through. What we get here is the first screen of the home page, before any of the content that loads when you scroll down.
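In code this amounts to replaying the browser's Cookie header on an ordinary GET request. Here is a minimal sketch, assuming you have already copied a valid cookie string out of HttpFox (the class and method names are my own, just for illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieFetch {
    // Fetch a page by replaying the browser's Cookie header.
    // The cookie value is a placeholder; paste the real one from HttpFox.
    public static String fetch(String pageUrl, String cookie) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setRequestProperty("Cookie", cookie); // bypasses the login flow
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }
}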

Second Stage:

How to get the information you need from the source of the page.

1. Look at the requirements (stating the obvious, I know).

What I care about is the list on the right side of the page: there you can see the "Asian New Song Chart" real-time trend, "Hot Topics", the hot movie list, friends' activity, and the star power list. The plan is to crawl the "Hot Topics" list periodically, parse the page, store the required data locally, and then build an interface that an app can call; the periodic part is sketched right after this paragraph.
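The periodic crawl itself can be a plain scheduled task. A minimal sketch, assuming a hypothetical crawlHotTopics() method that will hold the fetch-and-parse logic developed in the rest of this article:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CrawlScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Crawl every 30 minutes (also roughly the cookie's lifetime, so the
        // cookie may need refreshing); crawlHotTopics() is a placeholder.
        scheduler.scheduleAtFixedRate(CrawlScheduler::crawlHotTopics, 0, 30, TimeUnit.MINUTES);
    }

    private static void crawlHotTopics() {
        // fetch the hot-topic page, parse it, store the results locally
    }
}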

Hot Topic page: http://d.weibo.com/100803?refer=index_hot_new

Same as the home page: a GET request with the Cookie header set as above gets us the source. Content structure: the page has a 1-hour ranking and a 24-hour ranking; take the 1-hour ranking as the example. Each entry includes a rank (top 1, 2, 3, ...); a topic name in the #话题# form; a category/label (star, variety show, TV series, ... a student doing data-mining work at Weibo seems to be responsible for these); a topic summary (e.g. "the large fashion reality show China Supermodel, from May 21, every Thursday 21:20 on Chongqing Satellite TV"); a read count; and a host. The problem that follows is extracting these elements from the HTML.
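Before parsing, it helps to write down the record we are aiming for. A minimal sketch of a holder class for one entry (the class and field names are mine, not anything taken from the page):

public class HotTopic {
    int rank;        // position in the 1-hour ranking
    String name;     // topic name, e.g. "#中国超模#"
    String category; // category/label: star, variety show, TV series, ...
    String summary;  // short topic description
    long readCount;  // read count
    String host;     // topic host
}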

Page structure analysis: view the element structure in Firefox. The hot-topic content lives under a <div class="m_wrap clearfix"> tag, in a <ul class="pt_ul clearfix"> list where each <li> is one topic. So the idea was to locate that element first and then parse further. Looks easy, doesn't it? Well, I was wrong. Unfortunately, what a direct crawl returns is not the tidy, well-formed HTML you see in the inspector: the fetched content is wrapped in JS. So we have to analyze how the JS block for each module is put together. Fortunately the JS structure is fairly easy to recognize; each page module is a JavaScript snippet like this:

<script>
FM.view({
    "ns": "pl.content.miniTab.index",
    "domid": "Pl_Discover_Pt6Rank__5",
    "css": ["style/css/module/discover/DSC_picText_b.css?version=156d8fd88e66bcf3"],
    "js": "page/js/pl/content/miniTab/index.js?version=5ea4401897bd1150",
    "html": "xxxxxx" // a very long string that carries all the information about the topics, along with plenty of useless data
})
</script>

It is messy, I have to admit, but we only have to extract the relevant part. The basic idea: relying on the symmetric structure of the HTML tags, take the content of each topic (for a snippet like the one above, match the FM.view call, take the value of its html attribute, and then pull the rank, topic name, read count, and the other fields out of that).
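A minimal sketch of that idea, under the assumption that html is the last attribute in the FM.view payload and that only the simple \" and \/ escaping seen above occurs (a real JSON parser would be more robust; the names here are mine):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FmViewExtractor {
    // Pull the value of the "html" attribute out of one FM.view({...}) block.
    public static String extractHtml(String scriptBody) {
        Pattern p = Pattern.compile("\"html\":\\s*\"(.*)\"\\s*\\}", Pattern.DOTALL);
        Matcher m = p.matcher(scriptBody);
        if (!m.find()) {
            return null;
        }
        return m.group(1)
                .replace("\\\"", "\"") // unescape quotes
                .replace("\\/", "/")   // unescape slashes
                .replace("\\n", "\n")  // unescape newlines
                .replace("\\t", "\t");
    }
}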

2.1 Extracting the topic content

(1) Get all the <script> code segments that meet the requirements, using a regular expression or a string-matching algorithm here. Given that every <script>xxx</script> is a symmetric structure, a regex is certainly a good choice. The regular expression used is:

Pattern p = Pattern.compile("<script>FM\\.view(.*?)</script>");
Matcher m = p.matcher(buf);
while (m.find()) {
    String t_rs = m.group(1);
    if (t_rs.contains("html") && t_rs.contains("pt_li S_line2")) {
        rsList.add(t_rs);
    }
}

On top of the first matching step, the code adds the condition that the segment must also contain the pt_li S_line2 class. After this filter, only one script segment qualifies: the one containing the code for the 15 hot topics.

The complete code follows. It is unoptimized and the efficiency is a little low; optimization will come later.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestLogin {

    public static void main(String[] args) throws IOException {
        String strUrl = "http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.18)";
        strUrl = "http://weibo.com/u/1596867051/home?wvr=5&lf=reg";
        strUrl = "http://d.weibo.com/100803?refer=index_hot_new";
        URL url = new URL(strUrl);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();

        // replay the browser's request headers captured with HttpFox
        httpConn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0");
        httpConn.setRequestProperty("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        httpConn.setRequestProperty("Accept-Language", "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3");
        // do not send Accept-Encoding: gzip, deflate -- the body would come back
        // compressed and could not be read as plain UTF-8 text below
        // httpConn.setRequestProperty("Accept-Encoding", "gzip, deflate");

        String cookie = "your cookie"; // paste the Cookie value captured with HttpFox
        httpConn.setRequestProperty("Cookie", cookie);
        httpConn.setRequestProperty("Connection", "keep-alive");
        httpConn.setRequestProperty("Charset", "UTF-8");

        // read the whole response into a single string (no newlines kept,
        // so the regexes below can match across the full document)
        InputStreamReader input = new InputStreamReader(httpConn.getInputStream(), "UTF-8");
        BufferedReader bufReader = new BufferedReader(input);
        String line;
        StringBuilder contentBuf = new StringBuilder();
        while ((line = bufReader.readLine()) != null) {
            contentBuf.append(line);
        }
        String buf = contentBuf.toString();

        // string matching / regular expressions: find all <script></script> segments
        // System.out.println(buf);

        // scratch test of the regex approach:
        // String testStr = "12315<text>show me</text> <text>show me</text>";
        // Pattern pt = Pattern.compile("<text>(.*)</text>");
        // Matcher mt = pt.matcher(testStr);
        // while (mt.find()) {
        //     System.out.println(mt.group(1));
        // }

        Pattern p = Pattern.compile("<script>(.*?)</script>");
        Matcher m = p.matcher(buf);
        List<String> rsList = new ArrayList<String>();
        while (m.find()) {
            String t_rs = m.group(1);
            // keep only the segment that carries the hot-topic list
            if (t_rs.contains("html") && t_rs.contains("pt_li S_line2")) {
                rsList.add(t_rs);
            }
        }

        if (rsList.isEmpty()) {
            System.out.println("Crawl abnormal!");
            return;
        }
        String topics = rsList.get(0);
        System.out.println(topics);

        // split the segment into <li> blocks, one per topic; the quotes and
        // slashes inside the html attribute are escaped, hence the \\" and <\/li>
        p = Pattern.compile("<li class=(.*?)li>");
        m = p.matcher(topics);
        while (m.find()) {
            if (m.group(1).startsWith("\\\"pt_li S_line2\\\"")) {
                String li = m.group(1);

                // topic URL
                String regex = "http.*?faxian_huati";
                Pattern p1 = Pattern.compile(regex);
                Matcher m1 = p1.matcher(li);
                if (m1.find()) {
                    System.out.println(m1.group(0));
                }

                // picture URL
                regex = "http:\\\\/\\\\/ww3.sinaimg.cn.*?\\.jpg";
                p1 = Pattern.compile(regex);
                m1 = p1.matcher(li);
                if (m1.find()) {
                    System.out.println(m1.group(0));
                }

                // top rank
                regex = "<span class=\\\\\"DSC_topicon\\\\\">(.*?)<\\\\/span>";
                p1 = Pattern.compile(regex);
                m1 = p1.matcher(li);
                if (m1.find()) {
                    System.out.println(m1.group(1));
                }

                // topic name, in #话题# form
                regex = "#(.*?)#";
                p1 = Pattern.compile(regex);
                m1 = p1.matcher(li);
                if (m1.find()) {
                    System.out.println(m1.group(0));
                }

                // category label
                regex = "<\\\\/span>(.*?)<\\\\/a>";
                p1 = Pattern.compile(regex);
                m1 = p1.matcher(li);
                if (m1.find()) {
                    System.out.println(m1.group(1));
                }
            }
        }
    }
}
