Java newbie notes (8): a program for crawling B2B website information

Source: Internet
Author: User
Tags: coding standards


Some time ago my girlfriend found a sales job. She went to work happily on the first day, but by the second day she was frowning: the sales work was too tedious. Every day she had to go to several B2B websites and look up customer information, several hundred entries a day. At first I comforted her: "It's okay, I'll help you look them up." On my first day of taking over I was very diligent and found 80 entries in less than an hour, but by the second day I was bored stiff. It was so dull my eyes hurt, and then it suddenly hit me: how could a programmer be stuck doing such boring, repetitive work? Yes, I am a programmer, and repetitive work like this is exactly what code is for. So I spent two days working out how to write a crawler-like tool to scrape B2B website information. The features are simple and there is nothing technically deep here, but I'd still like to share it. OK, that is the background of the project; the details of the program follow.

First, write a method to obtain the page source code:

// Fetch a page through its URL (and, optionally, POST data). This is the
// core method: its return value holds the source of the HTML page.
public static String getResponseDataById(String url, String postData) {
    String content = null;
    try {
        URL dataUrl = new URL(url);
        HttpURLConnection con = (HttpURLConnection) dataUrl.openConnection();
        // System.out.println(con.getResponseCode());
        // System.out.println(con.getContentLength());
        // con.setRequestMethod("POST");
        // con.setRequestProperty("Proxy-Connection", "keep-alive");
        // con.setDoOutput(true);
        // con.setDoInput(true);
        // OutputStream os = con.getOutputStream();
        // DataOutputStream dos = new DataOutputStream(os);
        // dos.write(postData.getBytes());
        // dos.flush();
        // dos.close();
        String line;
        InputStream is = con.getInputStream();
        StringBuffer stringBuffer = new StringBuffer();
        Reader reader = new InputStreamReader(is, "GBK");
        // Wrap the reader in a buffer
        BufferedReader bufferedReader = new BufferedReader(reader);
        while ((line = bufferedReader.readLine()) != null) {
            stringBuffer.append(line + "\n");
        }
        if (bufferedReader != null) {
            bufferedReader.close();
        }
        content = stringBuffer.toString();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return content;
}

PS: a few words about the B2B sites I crawl. Most of them provide a search function: enter a company keyword into the search box and you get a list of matching companies. My original plan was to submit that search as a POST request from code and then fetch the resulting page. After writing the code, however, I found that some B2B sites refuse POST requests submitted this way, so I settled for entering the search manually and then passing the URL of the result list to this method to fetch its source. The postData parameter is therefore unused in this tool; I kept it rather than deleting it in case I extend the tool later. OK, with this method we can get the list of companies we want.
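For sites that do accept a direct POST, the unused postData parameter would need a properly form-encoded body. Below is a minimal, hypothetical sketch of building such a body with the standard library; the field name "keyword" is invented, since every B2B site names its search fields differently.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PostDataBuilder {
    // Build an application/x-www-form-urlencoded body from key/value pairs,
    // the format an HTML search form submits. GBK is used to match the
    // page encoding assumed elsewhere in this tool.
    public static String buildPostData(String[][] params)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (String[] pair : params) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(pair[0], "GBK"));
            sb.append('=');
            sb.append(URLEncoder.encode(pair[1], "GBK"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints keyword=machine+parts&page=1
        System.out.println(buildPostData(new String[][] {
            {"keyword", "machine parts"},
            {"page", "1"}
        }));
    }
}
```

The resulting string is what would be written through the commented-out DataOutputStream in getResponseDataById.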



Once the company list comes back, each company in it corresponds to a URL, so the next step is to traverse those URLs. Opening a company's URL lands on its introduction page, which does not contain the information I want; to reach it, you have to follow a "More details" link on that page. So next we need to parse the current page and extract that link. While I was at it, I wrote a general method for extracting the links from an HTML page; the same method is also used to extract the company links from the search results above. It uses the open-source package htmlparser, which is quite powerful and worth studying on its own. The link-extraction code is below. I call it in several places, each with a different filter, so I added a flag parameter to tell the cases apart.




public static Set<String> getHref(String f, int flag) {
    Set<String> set = new HashSet<String>();
    try {
        Parser parser = new Parser(f);
        parser.setEncoding("UTF-8");
        NodeFilter filter = new NodeClassFilter(LinkTag.class);
        NodeList links = new NodeList();
        for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
            e.nextNode().collectInto(links, filter);
        }
        for (int i = 0; i < links.size(); i++) {
            LinkTag linkTag = (LinkTag) links.elementAt(i);
            if (flag == 0 && linkTag.getLink().length() > 12
                    && !linkTag.getLink().substring(0, 18).equals("http://www.product")
                    && linkTag.getLinkText().contains("Jinan")) {
                set.add(linkTag.getLink());
            } else if (flag == 1 && (linkTag.getLinkText().equals("More>")
                    // the two variants differed in the original (likely
                    // full-width vs half-width ">")
                    || linkTag.getLinkText().equals("More >"))) {
                set.add(linkTag.getLink());
            } else if (flag == 2 && linkTag.getLinkText().equals("Next")) {
                set.add(linkTag.getLink());
            }
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
    // System.out.println(set);
    return set;
}
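For readers who don't want to pull in the htmlparser jar, the same idea can be sketched with only the standard library: scan the HTML for anchor tags and keep the hrefs whose link text matches. This is a fragile, assumption-laden sketch (regexes break easily on real-world HTML), not a replacement for htmlparser, which handles malformed markup far more robustly.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Match <a ... href="...">text</a> pairs; group 1 is the href,
    // group 2 is the link text.
    private static final Pattern LINK = Pattern.compile(
        "<a[^>]+href=\"([^\"]+)\"[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collect the targets of all links whose visible text equals `text`,
    // mirroring the flag == 1 / flag == 2 cases of getHref.
    public static Set<String> linksWithText(String html, String text) {
        Set<String> result = new HashSet<String>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            if (m.group(2).trim().equals(text)) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/c/1.html\">More></a> <a href=\"/about\">About</a>";
        // prints [/c/1.html]
        System.out.println(linksWithText(html, "More>"));
    }
}
```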

OK. The next step is to pull the information out of the detail page. There is nothing technical here; it is mainly string truncation and replacement.

public static String getPart(String source, String type) {
    if (source == null)
        return null;
    if (source.indexOf(type) != -1) {
        source = source.substring(source.indexOf(type));
        source = source.substring(0, source.indexOf("</dl>"));
        source = source.replace("</dt><dd>", "");
        source = source.replace("</dd>", "");
        source = source.replace("/p", "");
        source = source.replace("\n", "");
        source = source + "\r\n";
        System.out.println(source);
        outFile(source);
        return source.trim();
    } else
        return null;
}
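To make the truncate-and-replace approach concrete, here is a self-contained demo of the same logic run on a made-up <dl> fragment (the real markers depend on the target site's markup, and the file-writing call is left out):

```java
public class GetPartDemo {
    // Simplified version of getPart: cut the string down to the fragment
    // between the field marker and </dl>, then strip the tag debris.
    public static String getPart(String source, String type) {
        if (source == null || source.indexOf(type) == -1) {
            return null;
        }
        source = source.substring(source.indexOf(type));       // drop everything before the marker
        source = source.substring(0, source.indexOf("</dl>")); // keep only up to the closing </dl>
        source = source.replace("</dt><dd>", "");              // strip leftover tags
        source = source.replace("</dd>", "");
        return source.trim();
    }

    public static void main(String[] args) {
        String html = "<dl><dt>Phone:</dt><dd>0531-12345678</dd></dl>";
        // prints Phone:0531-12345678
        System.out.println(getPart(html, "Phone"));
    }
}
```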

After obtaining the relevant information, we can output it to a file.

private static void outFile(String source) {
    try {
        File file = new File("D:" + File.separator + ".txt");
        Writer out = null; // declare the character output stream
        out = new FileWriter(file, true); // true means append mode
        out.write(source); // write the data
        out.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
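One caveat with outFile: if write() throws, the stream is never closed, because close() sits inside the same try block. On Java 7 and later, a try-with-resources variant avoids that; the sketch below uses a temporary file as a stand-in for the hard-coded D:\ path.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class AppendFileDemo {
    // try-with-resources closes the writer automatically, even when
    // write() throws mid-way.
    public static void appendLine(File file, String source) throws IOException {
        try (Writer out = new FileWriter(file, true)) { // true = append mode
            out.write(source);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("crawl", ".txt");
        appendLine(tmp, "first\r\n");
        appendLine(tmp, "second\r\n"); // second call appends, not overwrites
        System.out.println(tmp.length() > 0);
        tmp.delete();
    }
}
```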

To make it feel more like a small piece of software, I also built a Swing interface. I have to admit my interfaces are still ugly. I won't post that code; it is very simple. Take a look at the interface. Haha.




The starting address is the URL of the page you want to start from. Since the information you need differs from day to day, remember which page you reached on a given day. The best approach is to fetch all of your search results in one go, even if there are tens of thousands of them. The information count is the number of entries you want to fetch.
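The overall loop behind those two inputs — fetch a result page, collect its entries, follow the "Next" link, and stop once the requested count is reached — can be sketched offline by replacing the network fetch with an in-memory map. The page names and entries below are invented, and the map stands in for getResponseDataById plus getHref.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CrawlFlowDemo {
    // Fake "site": page URL -> list of entries, where a "next:" entry
    // plays the role of the flag == 2 "Next" link in getHref.
    static Map<String, List<String>> fakeSite = new HashMap<String, List<String>>();

    public static List<String> crawl(String startUrl, int limit) {
        List<String> collected = new ArrayList<String>();
        String page = startUrl;
        while (page != null && collected.size() < limit) {
            List<String> entries = fakeSite.get(page); // stand-in for fetch + parse
            if (entries == null) {
                break;
            }
            String next = null;
            for (String e : entries) {
                if (e.startsWith("next:")) {
                    next = e.substring(5);     // the "Next" page link
                } else if (collected.size() < limit) {
                    collected.add(e);          // a company entry
                }
            }
            page = next;                       // null when the last page is reached
        }
        return collected;
    }

    public static void main(String[] args) {
        fakeSite.put("page1", Arrays.asList("companyA", "companyB", "next:page2"));
        fakeSite.put("page2", Arrays.asList("companyC"));
        // prints [companyA, companyB, companyC]
        System.out.println(crawl("page1", 10));
    }
}
```

The real tool does the same thing, except the map lookup is a network round-trip and the entries come out of getHref with the appropriate flag.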

Let's take a look at the crawled information:



That basically covers the details. Of course, the tool has many shortcomings; the biggest is that the code has to be modified for every B2B website, because each site's page structure is different. The changes are small, though, and the overall steps stay the same. Pressed for time, I did not pay much attention to coding standards; this is just a bare-bones implementation. This article only shows part of the code; if you want the source of the whole project, leave your email address below.


PS: Instructor Feng, I did not take any time away from the promotion work. I wrote this little tool in the evenings after coming back from the library. I swear. Haha. Thank you very much; I am genuinely touched by your daily supervision.

Miss Sun, where have you been lately? I haven't seen you since the semester started. Are you off playing your own games too? A day without friends is hard indeed. Haha. Comrade Sun, let's get together when you have time.

Li Moumou, good work, down-to-earth as always. As long as you keep at it, heaven rewards the diligent, and I believe heaven will not let down those who try. Come on, I will always be your most solid backing. Whatever storms lie ahead, we are in the same boat; there is a small place for you at mine. Finally, my sincere thanks for the past seven years together. I believe the next seventy will be even happier.


Finally, to all the girls in the world: find a programmer and marry him.


