Submitting a multipart/form-data form with HttpURLConnection
I wrote a small program that simulates an HTTP POST request to fetch data from a website, and parses the returned HTML with Jsoup (http://jsoup.org).
Jsoup wraps its own HttpConnection and can submit requests to the server by itself. However, after analyzing how the target website (http://rapdb.dna.affrc.go.jp/tools/converter/run) submits its data, I decided to simulate the form submission in code with HttpURLConnection, using Content-Type multipart/form-data.
1. HttpURLConnection: a URLConnection with support for HTTP-specific features.
connection.setRequestMethod("POST");
connection.setConnectTimeout(5 * 60 * 1000);
connection.setReadTimeout(5 * 60 * 1000);
connection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36");
connection.addRequestProperty("Content-Type", "multipart/form-data; boundary=--testsssssss");
// Set to true to send a request body to the server. The default value is false.
connection.setDoOutput(true);
// A POST request should not use caches, so set this to false.
connection.setUseCaches(false);
// Connect to the server.
connection.connect();
output = connection.getOutputStream();
// Send the POST data to the server.
output.write(bodyStr.getBytes());
Once the request is sent, the server should receive data similar to the following:
POST /test HTTP/1.1
Accept-Language: zh-CN,zh;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Content-Type: multipart/form-data; boundary=--testsssssss
Cache-Control: no-cache
Pragma: no-cache
Host: localhost
Connection: keep-alive
// Size of the HTTP request body. If it is not set manually, is it generated automatically by the program?
Content-Length: 224

--HrOGHuIjDhR_gtUesEBnpWxVp9JH209p
Content-Disposition: form-data; name="keyword"

test
--HrOGHuIjDhR_gtUesEBnpWxVp9JH209p
Content-Disposition: form-data; name="submit"

Convert
--HrOGHuIjDhR_gtUesEBnpWxVp9JH209p--
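The snippets above stop once the body has been written. To get the HTML that is parsed later, the response stream still has to be read. The following is only a minimal sketch of that step: the class name ResponseReader and the method readAll are hypothetical helpers, not part of the original program, and the charset is assumed to be known by the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper (not from the original program): drains a response stream into a String.
final class ResponseReader {
    private ResponseReader() {}

    /** Reads the stream to its end, closes it, and decodes the bytes with the given charset. */
    static String readAll(InputStream in, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try (InputStream input = in) {
            int n;
            while ((n = input.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            // Wrapped so callers are not forced to handle a checked exception in this sketch.
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), charset);
    }
}
```

With the connection from the earlier snippet, a call could look like String html = ResponseReader.readAll(connection.getInputStream(), StandardCharsets.UTF_8); assuming the page is served as UTF-8.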
2. Common ways of sending request data to the server:
- GET: when the form is submitted, the request parameters are appended to the URL as a query string and separated with &.
- POST: when the form is submitted, the request parameters are carried in the request body, so a large volume of data can be transmitted. The body can be encoded as many media types (http://en.wikipedia.org/wiki/Internet_media_type); the two most commonly used for forms are:
  - application/x-www-form-urlencoded: the default submission method. As with GET, the parameters are assembled into key=value pairs separated with &, but the data is placed in the request body.
  - multipart/form-data: generally used to upload files or large volumes of data.
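For comparison with the multipart body built later, here is a rough sketch of how the same two fields would look as application/x-www-form-urlencoded. The class FormUrlEncoder is a hypothetical helper introduced only for illustration, and UTF-8 is an assumed encoding. The same string serves as the query string of a GET request or as the body of a urlencoded POST.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper: turns form fields into key=value pairs joined with '&'.
final class FormUrlEncoder {
    private FormUrlEncoder() {}

    /** Encodes the fields in iteration order, percent-encoding keys and values. */
    static String encode(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(entry.getKey())).append('=').append(enc(entry.getValue()));
        }
        return sb.toString();
    }

    private static String enc(String value) {
        try {
            // URLEncoder implements the HTML form encoding: a space becomes '+'.
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

For a GET request the result is appended to the URL after a ?; for a urlencoded POST it is written to the output stream with Content-Type set to application/x-www-form-urlencoded.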
This website submits its form with POST, and the MIME type is multipart/form-data, so the request body has to be assembled by hand.
When submitting data of this type, you must add a boundary parameter to the Content-Type header of the HTTP request. The parts of the body are delimited by this value:
// The boundary is --testsssssss
connection.addRequestProperty("Content-Type", "multipart/form-data; boundary=--testsssssss");
In the body of the HTTP request, the fields must be separated by the boundary:
String mimeBoundary = "--testsssssss";
StringBuffer sb = new StringBuffer();
// Each delimiter line is the boundary prefixed with two hyphens.
sb.append("--").append(mimeBoundary);
sb.append("\r\n");
sb.append("Content-Disposition: form-data; name=\"keyword\"");
// There must be two CRLFs (that is, an empty line) before the submitted data.
sb.append("\r\n\r\n");
sb.append(queryText);
sb.append("\r\n");
// The second submitted parameter.
sb.append("--").append(mimeBoundary);
sb.append("\r\n");
sb.append("Content-Disposition: form-data; name=\"submit\"");
sb.append("\r\n\r\n");
sb.append("Convert");
sb.append("\r\n");
// At the end of the body, put two hyphens both before and after the boundary, followed by a final CRLF.
sb.append("--").append(mimeBoundary).append("--").append("\r\n");
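The manual assembly above repeats the same pattern for every field, so it can be folded into a loop. The following is a sketch under assumptions, not the original program: the class MultipartFormBody and its build method are hypothetical names, and it handles plain text fields only.

```java
import java.util.Map;

// Hypothetical helper: builds a multipart/form-data body for plain text fields.
final class MultipartFormBody {
    private static final String CRLF = "\r\n";

    private MultipartFormBody() {}

    /**
     * @param boundary the exact value of the boundary parameter in the Content-Type header
     * @param fields   form fields, written in iteration order
     */
    static String build(String boundary, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Delimiter line: two hyphens followed by the boundary.
            sb.append("--").append(boundary).append(CRLF);
            sb.append("Content-Disposition: form-data; name=\"")
              .append(field.getKey()).append('"').append(CRLF);
            // Empty line between the part headers and the value.
            sb.append(CRLF);
            sb.append(field.getValue()).append(CRLF);
        }
        // Closing delimiter: two hyphens before and after the boundary.
        sb.append("--").append(boundary).append("--").append(CRLF);
        return sb.toString();
    }
}
```

Passing a LinkedHashMap keeps the fields in insertion order, so keyword is written before submit as in the hand-written version. The boundary must not occur inside any field value.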
If the submitted data is a file or an image, you also need to read the file content and write it into the body. See http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.2
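A file part differs from a text field in two ways: its Content-Disposition header carries a filename, and the part has its own Content-Type. Because file content is binary, the part is best assembled as bytes rather than as a String. The sketch below illustrates this; the class MultipartFilePart and its parameters are hypothetical and were not part of the original program.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds one file part of a multipart/form-data body as raw bytes.
final class MultipartFilePart {
    private static final String CRLF = "\r\n";

    private MultipartFilePart() {}

    /** Returns the delimiter line, part headers, empty line, file bytes and trailing CRLF. */
    static byte[] build(String boundary, String fieldName, String fileName,
                        String contentType, byte[] content) {
        String header = "--" + boundary + CRLF
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"" + CRLF
                + "Content-Type: " + contentType + CRLF
                + CRLF;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        append(out, header.getBytes(StandardCharsets.UTF_8));
        // The file bytes are copied untouched; no text encoding is applied to them.
        append(out, content);
        append(out, CRLF.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    private static void append(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }
}
```

The file bytes could come from Files.readAllBytes(Paths.get(path)). The caller still has to write the closing delimiter (two hyphens, the boundary, two hyphens, CRLF) after the last part.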
3. Parsing data with Jsoup:
Jsoup exposes HTML through a DOM-style API similar to the one used in JavaScript.
You can build a Document from the HTML string received through HttpURLConnection:
Document doc = Jsoup.parse(html);
// Get the element whose id is tools_converter.
// Assume the HTML is <a id="tools_converter" href="http://localhost">test</a>
Element element = doc.getElementById("tools_converter");
// Get the text: test
String text = element.text();
// Get the href attribute: http://localhost
String attr = element.attr("href");
// You can also request the data source directly through the HttpConnection wrapped by Jsoup:
Document document = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36").get();
Some methods provided by Jsoup are as follows:
getElementsByTag(String): get elements by tag name: Elements divs = document.getElementsByTag("div");
getElementById(String): get an element by its id: Element idEle = document.getElementById("blogId");
getElementsByClass(String): get elements by CSS class name: Elements divs = document.getElementsByClass("redClass");
children(): returns Elements, all direct child elements of an element.
child(int index): returns the Element at the given zero-based index among an element's children.
For other methods, refer to the Jsoup API.