Getting an attribute value of an HTML tag without a third-party framework

Source: Internet
Author: User
Tags: finally block

Often we want to extract the value of some attribute of a tag from a page's HTML source. Pulling in a third-party framework just for that is overkill, like using a sledgehammer to kill a chicken. In such cases a simple regular expression is enough to extract the data we want.
For example, suppose I want to download a TV series. Most sites present the episodes as a long list of links that have to be clicked one at a time, which is tedious. Instead, we can fetch the page's HTML as a string over HTTP and then use a regular expression to collect the href attribute values of all the a tags in one pass.
Let's take a download page on a movie website as an example. The demo is as follows:
Utility method httpSendGet:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static String httpSendGet(String url, String param, String charsetName) {
    StringBuilder result = new StringBuilder();
    BufferedReader in = null;
    try {
        String urlNameString = url + "?" + param;
        URL realUrl = new URL(urlNameString);
        // Open the connection to the URL
        URLConnection connection = realUrl.openConnection();
        // Set the common request headers
        connection.setRequestProperty("accept", "*/*");
        connection.setRequestProperty("connection", "Keep-Alive");
        connection.setRequestProperty("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");
        // Establish the actual connection
        connection.connect();
        // Read the response through a BufferedReader, decoding with the given charset
        in = new BufferedReader(new InputStreamReader(connection.getInputStream(), charsetName));
        String line;
        // Lines are concatenated without line separators
        while ((line = in.readLine()) != null) {
            result.append(line);
        }
    } catch (Exception e) {
        System.out.println("Exception while sending GET request! " + e);
        e.printStackTrace();
    }
    // Use a finally block to close the input stream
    finally {
        try {
            if (in != null) {
                in.close();
            }
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
    return result.toString();
}
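As a side note on the finally block used to close the stream: since Java 7, a try-with-resources statement can do the same cleanup with less code. The following is only a minimal sketch of that idea, not the method used in this article. The class name ReadAllDemo and the helper readAll are hypothetical, and a StringReader stands in for the network stream so the example runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadAllDemo {
    // Hypothetical helper: reads every line from the reader and concatenates
    // the lines without separators, like the read loop in httpSendGet.
    public static String readAll(Reader source) {
        StringBuilder result = new StringBuilder();
        // The BufferedReader (and the reader it wraps) is closed automatically
        // when the try block exits, whether normally or because of an exception.
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                result.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // A StringReader replaces the HTTP response stream for this demo.
        System.out.println(readAll(new StringReader("<html>\n<body></body>\n</html>")));
    }
}
```

In httpSendGet, the same pattern would wrap the InputStreamReader created from connection.getInputStream(), which makes the explicit finally block unnecessary.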

Utility method match:

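import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static List<String> match(String source, String element, String attr) {
    List<String> result = new ArrayList<String>();
    // Matches <element ... attr="value" ...>; the value may be quoted with ' or ", or unquoted
    String reg = "<" + element + "[^<>]*?\\s" + attr + "=['\"]?(.*?)['\"]?(\\s.*?)?>";
    Matcher m = Pattern.compile(reg).matcher(source);
    while (m.find()) {
        // Group 1 holds the attribute value
        String r = m.group(1);
        result.add(r);
    }
    return result;
}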
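To see what this pattern captures without touching the network, it can be run against a small hand-written HTML string. The sketch below repeats the match method, using plain ASCII quotes in the character classes, inside a hypothetical class MatchDemo. The sample HTML is made up for illustration and covers double-quoted, single-quoted, and unquoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchDemo {
    // Same pattern as the match method above.
    public static List<String> match(String source, String element, String attr) {
        List<String> result = new ArrayList<String>();
        String reg = "<" + element + "[^<>]*?\\s" + attr + "=['\"]?(.*?)['\"]?(\\s.*?)?>";
        Matcher m = Pattern.compile(reg).matcher(source);
        while (m.find()) {
            result.add(m.group(1)); // group 1 is the attribute value
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample: double-quoted, single-quoted (with extra
        // attributes around it), and unquoted href values.
        String html = "<p><a href=\"/one.html\">One</a> "
                + "<a class=\"x\" href='/two.html' target=\"_blank\">Two</a> "
                + "<a href=/three.html>Three</a></p>";
        System.out.println(match(html, "a", "href"));
        // prints [/one.html, /two.html, /three.html]
    }
}
```

Because the pattern only requires the tag to start with the element name, an element such as abbr that carries an href attribute would also match when element is "a". For well-behaved pages this is usually acceptable, but it is one reason the results may need filtering.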

Call Demo:

public static void main(String[] args) {
    String url = "https://www.dy2018.com/i/99671.html";
    String params = "";
    String html = httpSendGet(url, params, "gb2312");
    List<String> links = match(html, "a", "href");
    System.out.println(links);
}

Pay attention to the charsetName parameter of httpSendGet: it must match the page's actual encoding (gb2312 for this site), otherwise the HTML text you get back will be garbled.
Finally, here are the results (they are of course not clean yet and still need filtering):
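Instead of hard-coding the charset, one option is to read it from the Content-Type response header (available through URLConnection.getContentType()) or from a meta tag in the page. The sketch below is an assumption about how that could be done, not part of the original demo. The class CharsetDemo and the helper findCharset are hypothetical names, and the helper only does string parsing, so it runs without a network connection.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetDemo {
    // Looks for "charset=..." in a Content-Type header value or in HTML text.
    private static final Pattern CHARSET =
            Pattern.compile("(?i)charset\\s*=\\s*[\"']?([\\w.:-]+)");

    // Hypothetical helper: returns the declared charset name,
    // or the fallback when the text is null or declares none.
    public static String findCharset(String text, String fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : fallback;
    }

    public static void main(String[] args) {
        // Header-style input, as getContentType() might return it
        System.out.println(findCharset("text/html; charset=gb2312", "UTF-8"));
        // Meta-tag input taken from page source
        System.out.println(findCharset("<meta charset=\"utf-8\">", "gb2312"));
        // No declaration: the fallback is used
        System.out.println(findCharset("text/html", "UTF-8"));
    }
}
```

To use this with httpSendGet, the header would have to be read after connection.connect() and before the InputStreamReader is created, with the caller's charsetName serving as the fallback.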
[/,/2/,/0/,/3/,/1/,/4/,/8/,/5/,/7/,/14/,/15/,/html/tv/hytv/index.html,/html/tv/oumeitv/index.html,/html/tv/ri Hantv/index.html,/html/zongyi2013/index.html,/html/dongman/index.html,/support/guestbook.php, #, index.html,/,/ html/tv/,/html/tv/hytv/, Javascript:window.external.addFavorite (' http://www.dy2018.com/', ' dy2018.com-movie Heaven ') Class= "Style11,/webplay/play-id-99671-collection-37.html,/webplay/play-id-99671-collection-36.html,/webPlay/ Play-id-99671-collection-35.html,/webplay/play-id-99671-collection-34.html,/webplay/ Play-id-99671-collection-33.html,/webplay/play-id-99671-collection-32.html,/webplay/ Play-id-99671-collection-31.html,/webplay/play-id-99671-collection-30.html,/webplay/ Play-id-99671-collection-29.html,/webplay/play-id-99671-collection-28.html,/webplay/ Play-id-99671-collection-27.html,/webplay/play-id-99671-collection-26.html,/webplay/ Play-id-99671-collection-25.html,/webplay/play-id-99671-collection-24.html,/webplay/play-id-99671-collection-23.html,/webplay/play-id-99671-collection-22.html,/webplay/play-id-99671-collection-21.html,/webPlay/ Play-id-99671-collection-20.html,/webplay/play-id-99671-collection-19.html,/webplay/ Play-id-99671-collection-18.html,/webplay/play-id-99671-collection-17.html,/webplay/ Play-id-99671-collection-16.html,/webplay/play-id-99671-collection-15.html,/webplay/ Play-id-99671-collection-14.html,/webplay/play-id-99671-collection-13.html,/webplay/ Play-id-99671-collection-12.html,/webplay/play-id-99671-collection-11.html,/webplay/ Play-id-99671-collection-10.html,/webplay/play-id-99671-collection-9.html,/webplay/ Play-id-99671-collection-8.html,/webplay/play-id-99671-collection-7.html,/webplay/ Play-id-99671-collection-6.html,/webplay/play-id-99671-collection-5.html,/webplay/ Play-id-99671-collection-4.html,/webplay/play-id-99671-collection-3.html,/webplay/ Play-id-99671-collection-2.html,/webplay/play-id-99671-collection-1.html,/webplay/ Play-id-99671-collection-0.html, Ftp://g:[email protected]:2166/1001 
Night 35.mp4, ftp://g:[email protected]:2166/1001 Night 34.mp4, ftp://g:[email  protected]:2166/1001 Night 33.mp4, ftp://g:[email protected]:2166/1001 Night 32.mp4, ftp://g:[email protected] : 2166/1001 Nights 31.mp4, ftp://g:[email protected]:2166/1001 Nights 30.mp4, ftp://g:[email protected]:2166/ 1001 Nights 29.mp4, ftp://g:[email protected]:2166/1001 Nights 28.mp4, ftp://g:[email protected]:2166/1001 Nights 27.mp4, ftp://g:[email protected]:2166/1001 Night 26.mp4, ftp://g:[email protected]:2166/1001 Night 25.mp4, Ftp://g:[email  protected]:2166/1001 Night 24.mp4, ftp://g:[email protected]:2166/1001 Night 23.mp4, ftp://g:[email  protected]:2166/1001 Night 22.mp4, ftp://g:[email protected]:2166/1001 Night 21.mp4, ftp://g:[email protected] : 2166/1001 Nights 20.mp4, ftp://g:[email protected]:2166/1001 Nights 19.mp4, ftp://g:[email protected]:2166/ 1001 Nights 18.mp4, ftp://g:[email protected]:2166/1001 Nights 17.mp4, ftp://g:[email protected]:2166/1001 Nights 16.mp4, Ftp://g:[email protected]:2166/1001 Nights 15.mp4, ftp://g:[email protected]:2166/1001 Nights 14.mp4, ftp://g:[email protected]:2166/1001 Nights 13.mp4 , ftp://g:[email protected]:2166/1001 Nights 12.mp4, ftp://g:[email protected]:2166/1001 Nights 11.mp4, ftp://g:[ email protected]:2166/1001 Night 10.mp4, ftp://g:[email protected]:2166/1001 Night 09.mp4, ftp://g:[email  protected]:2166/1001 Night 08.mp4, ftp://g:[email protected]:2166/1001 Night 07.mp4, ftp://g:[email protected] : 2166/1001 Nights 06.mp4, ftp://g:[email protected]:2166/1001 Nights 05.mp4, ftp://g:[email protected]:2166/ 1001 Nights 04.mp4, ftp://g:[email protected]:2166/1001 Nights 03.mp4, ftp://g:[email protected]:2166/1001 Nights 02.mp4, ftp://g:[email protected]:2166/1001 Night 01.mp4,/i/99743.html,/i/99734.html,/i/99733.html,/i/99725.html,/i/ 99720.html,/i/99719.html,/i/99716.html,/i/99708.html,/i/99704.html,/i/99695.html,/i/97129.html,/i/97575.html,/ 
I/97041.html,/i/92091.html,/i/97637.html,/i/92020.html,/i/95187.html,/i/92000.html,/i/98343.html,/i/97363.html]
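The filtering step mentioned above can be as simple as keeping only the links that start with a given prefix and dropping duplicates. The following is a minimal sketch under that assumption. The class LinkFilterDemo and the helper filterByPrefix are hypothetical, and the ftp links in the sample use a placeholder host rather than the real addresses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LinkFilterDemo {
    // Hypothetical helper: keeps links that start with the prefix
    // (case-insensitive), trims whitespace, and removes exact duplicates
    // while preserving the original order.
    public static List<String> filterByPrefix(List<String> links, String prefix) {
        Set<String> kept = new LinkedHashSet<>();
        for (String link : links) {
            String trimmed = link.trim();
            if (trimmed.regionMatches(true, 0, prefix, 0, prefix.length())) {
                kept.add(trimmed);
            }
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // Made-up sample in the shape of the raw results
        List<String> links = Arrays.asList(
                "/", "#", "index.html",
                "/webplay/play-id-99671-collection-37.html",
                "ftp://example.invalid/episode-01.mp4",
                "ftp://example.invalid/episode-01.mp4",
                "/i/99743.html");
        System.out.println(filterByPrefix(links, "ftp://"));
        // prints [ftp://example.invalid/episode-01.mp4]
    }
}
```

In the call demo, the list returned by match would be passed in place of the sample list.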

If you would rather not write your own regular expressions, you can use a third-party crawler framework instead. There are plenty of them to be found online, so I won't cover them here.
