I just looked this question up online; my findings are summarized below.
The main development languages for crawlers are Java, Python, and C++.
For general information-gathering needs, the choice of language makes little difference.
C, C++
Search engines generally build their crawlers in C/C++. My guess is that this is because a search engine crawler has to collect a huge number of sites while its page-parsing requirements are modest; some of these crawlers support JavaScript.
Python
Strong networking support, simulated login, and JavaScript handling; its weakness is web page parsing.
Programs are really convenient to write in Python, and well-known Python crawlers include Scrapy.
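To give a feel for how little code a basic crawler needs in Python, here is a minimal sketch that uses only the standard library. It is not how Scrapy works internally; the names LinkExtractor, fetch, and crawl are made up for this illustration, and a real crawler would also need politeness delays, robots.txt handling, and retries.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its text; no retries, illustration only."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl; returns the URLs visited, in visiting order."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links and drop the "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing the download function in as fetch_page keeps the crawl loop separate from the network code, so the loop can be tried out against a dictionary of canned pages before it is pointed at a real site.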
Java
Java has many parsers and very good support for web page parsing; its weakness is the networking part.
There are many open-source Java crawlers; well-known ones include Nutch and, in China, WebMagic.
Excellent Java parsers include htmlparser and Jsoup.
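To make "page parsing" concrete, the sketch below shows the kind of work a parser does: turning raw HTML into a small data structure. It is written in Python with the standard library's html.parser so that all examples here share one language; it is not Jsoup's API, and the names PageSummary and summarize are invented for this illustration. A dedicated parser such as Jsoup offers CSS-selector queries that make this job much shorter.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the page title plus the text and URL of every link."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False
        self._current_link = None  # dict being filled while inside an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current_link = {"url": href, "text": ""}

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._current_link is not None:
            # Text inside nested tags such as <b> is appended as well.
            self._current_link["text"] += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current_link is not None:
            self._current_link["text"] = self._current_link["text"].strip()
            self.links.append(self._current_link)
            self._current_link = None


def summarize(html):
    """Parse an HTML string and return {"title": ..., "links": [...]}."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}
```

Even this small case needs manual state tracking (am I inside a title, am I inside a link), which is the tedium a full parser library takes off your hands.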
For general requirements, both Java and Python are capable.
If you need to simulate login or get around anti-scraping measures, Python is more convenient. If you need to handle complex web pages, parse page content into structured data, or analyze page content in detail, Java is a good choice.
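As a rough illustration of simulated login in Python, the sketch below posts a login form and keeps the session cookie for later requests, using only the standard library. The form field names "username" and "password", the User-Agent string, and the function names are all assumptions made for this example; real sites often use different field names and add CSRF tokens or captchas.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def make_session():
    """Return an opener that stores cookies between requests, plus its jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build the POST request for a login form (field names are assumed)."""
    form = urlencode({"username": username, "password": password}).encode("ascii")
    headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"}
    return Request(login_url, data=form, headers=headers, method="POST")


def login_and_fetch(login_url, page_url, username, password):
    """Log in, then fetch a page that requires the session cookie."""
    opener, _jar = make_session()
    with opener.open(build_login_request(login_url, username, password), timeout=10) as response:
        response.read()  # cookies set by the server are now in the jar
    with opener.open(page_url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

The key point is that both requests go through the same opener, so whatever cookie the server sets at login is sent back automatically on the second request.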
Clearly, to really become a crawler engineer you need both Python and Java. There are currently more Python teaching resources online, so learn Python first.
-------------------------------------------
You are welcome to join the Crawler Engineer Exchange Group: 494343497. Friends doing crawler work in Chengdu are especially welcome to get in touch; my QQ number is 2487872782.