Web crawlers
/* Web crawler */
import java.io.*;
import java.net.*;
import java.util.regex.*;

class RegexTest2 {
    public static void main(String[] args) throws Exception {
        getMails();
    }

    public static void getMails() throws Exception {
        // Placeholder: replace with the URL of the page to be crawled.
        URL url = new URL("content to be crawled");
        URLConnection conn = url.openConnection();
        BufferedReader bufr = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line = null;
        // Placeholder: replace with the regular expression to match.
        String mailReg = "Regular Expression";
        Pattern p = Pattern.compile(mailReg);
        // Read the page line by line and print every match found.
        while ((line = bufr.readLine()) != null) {
            Matcher m = p.matcher(line);
            while (m.find()) {
                System.out.println(m.group());
            }
        }
        bufr.close();
    }
}
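The matching loop above can be tried without a network connection. The following is a minimal sketch, not the original author's code: the class name MailExtractor, the helper extract, the sample text, and the e-mail pattern are all assumptions made for illustration, and the pattern is deliberately much simpler than a full address grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MailExtractor {
    // Assumed, simplified e-mail pattern: local part, '@', then dot-separated labels.
    private static final Pattern MAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    // Returns every substring of text that matches the pattern, in order of appearance.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = MAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample line standing in for one line of a downloaded page.
        String sample = "Contact alice@example.com or bob.smith@mail.example.org for details.";
        for (String mail : extract(sample)) {
            System.out.println(mail);
        }
    }
}
```

Keeping the matching in its own method means the same code can be applied to each line read from a BufferedReader, as in the crawler above, or to any other text source.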
Web Crawler tools
Installing and using Heritrix, a powerful crawler tool: download.csdn.net/detail/aklakl/4082490
Which is more suitable for writing a web crawler, C++ or Java?
As far as the language itself goes, I think Python is an ideal choice for web crawlers. Crawling and document analysis can often be done in one go.
From a performance standpoint, C++ still holds its ground. If you need to crawl massive amounts of data and you are comfortable with C++, choose it.
If you do not want to start from scratch and would rather build on top of an existing framework, consider Java.
If you only need some simple data capture and analysis, you do not have to work at the "language" level at all. In some cases an existing crawler tool will do better than a crawler you write yourself.