First, skill list
1. Proficient in Java, especially network programming; read Li Gang's Java fundamentals book at least three times
2. Familiar with HTML, JS, Ajax, Firebug
3. Web page deduplication; identifying the distinguishing features of a site
4. Distributed
5. Multithreading
6. Relational databases: MySQL, Oracle, SQL Server; MyBatis
7. Regular expressions, CSS selectors, XPath
8. DNS Cache
9. TCP/IP and HTTP protocols, HTTP 2.0
10. Web login protocols: SSO and OAuth principles
11. Anti-crawling strategies
12. Familiar with HttpClient, OkHttp3 ...
13. Familiar with extraction tools such as Jsoup, Selenium WebDriver ...
14. Search technology: familiar with Lucene, Nutch, Heritrix, Solr, Elasticsearch
15. Familiar with XML, JSON, and the SOAP protocol
16. MongoDB, Redis, HBase, Hadoop
17. Text analysis, machine learning, data mining, natural language processing [NLP]
18. Accurate extraction of data from web pages, Weibo, Tieba, forums, and similar sources
19. RPC protocol
20. Netty, NIO
21. HtmlUnit, PhantomJS, SlimerJS, CasperJS
22. Proxy deployment: HTTP/SOCKS
23. Nginx, Squid, Jetty
24. hack iOS
25. CAPTCHA recognition, OCR, Tess4J
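
Items 3 and 7 (page deduplication and regular expressions) can be illustrated with a small sketch. It uses only the JDK, and the class name `LinkDeduplicator` and its methods are hypothetical names chosen for this example, not part of any library in the list. It assumes links are absolute and written as double-quoted `href` attributes; a real crawler would more likely use an HTML parser such as Jsoup and a persistent store (for example Redis) for the seen set.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class LinkDeduplicator {
    // Naive pattern for double-quoted href attributes (an assumption of this sketch).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Normalized URLs that have already been returned once.
    private final Set<String> seen = new HashSet<>();

    // Returns a canonical form of an absolute URL, or null if the link is unusable.
    static String normalize(String url) {
        try {
            URI uri = new URI(url.trim()).normalize(); // removes "." and ".." segments
            if (uri.getScheme() == null || uri.getHost() == null) {
                return null; // relative, opaque, or malformed links are skipped here
            }
            String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost().toLowerCase(Locale.ROOT);
            String port = (uri.getPort() == -1) ? "" : ":" + uri.getPort();
            String rawPath = uri.getRawPath();
            String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
            String query = (uri.getRawQuery() == null) ? "" : "?" + uri.getRawQuery();
            return scheme + "://" + host + port + path + query; // fragment is dropped
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Extracts links from the HTML and returns only those not seen before.
    List<String> extractNew(String html) {
        List<String> fresh = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            String normalized = normalize(matcher.group(1));
            if (normalized != null && seen.add(normalized)) {
                fresh.add(normalized);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        LinkDeduplicator dedup = new LinkDeduplicator();
        String html = "<a href=\"http://example.com/a\">1</a>"
                + "<a HREF=\"HTTP://EXAMPLE.com/a#top\">2</a>"
                + "<a href=\"http://example.com/b\">3</a>";
        System.out.println(dedup.extractNew(html));
    }
}
```

The second link differs from the first only in letter case and fragment, so after normalization it is treated as a duplicate. This covers URL-level deduplication only; content-level deduplication (near-identical pages at different URLs) needs a separate technique.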
Second, crawler tools
1. PhantomJS
2. BerserkJS (an improved version based on PhantomJS)
3. SlimerJS
4. CasperJS
5. Selenium
Third, Java-related
Common IDEs: IntelliJ IDEA, Eclipse, NetBeans
Web development: Tomcat, Resin, Jetty, WebLogic, etc.; common components: Struts, Spring, Hibernate
Netty: an asynchronous, event-driven network application framework for high-concurrency network programming (a NIO framework)
MINA: simplifies development of high-performance, high-reliability network applications (also a NIO framework); many mobile game servers are built with it
jOOQ: Java ORM framework
Activiti: workflow engine, similar to jBPM and Snaker
Prefuse: a user interface toolkit for presenting structured and unstructured data as interactive visualizations
Gephi: open source software for complex network analysis, used mainly for interactive visualization and exploration of networks, complex systems, and dynamic and hierarchical graphs
Nutch: a well-known crawler project; Hadoop grew out of this project
Web-Harvest: web data extraction tool
POM tools: Maven + Artifactory
Netflix Curator: a ZooKeeper client library open-sourced by Netflix that simplifies ZooKeeper client programming
Akka: a concurrency framework based on the actor model
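
For readers unfamiliar with the actor model mentioned for Akka, the idea can be sketched in plain Java. This is not Akka's API: the class `PageCounterActor` and its methods are hypothetical, written only to show the core idea that an actor owns its state and processes messages from a mailbox one at a time, so the state itself needs no locking.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical class illustrating the actor idea; this is not how Akka is used.
class PageCounterActor {
    // Poison pill compared by identity, so a normal message "stop" is still counted.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private final Thread worker = new Thread(this::processMailbox, "page-counter-actor");
    private int pagesSeen; // touched only by the worker thread while it runs

    // Creates the actor and starts its single worker thread.
    static PageCounterActor start() {
        PageCounterActor actor = new PageCounterActor();
        actor.worker.start();
        return actor;
    }

    // Any thread may send a message; senders never touch pagesSeen directly.
    void tell(String url) {
        mailbox.add(url);
    }

    // Asks the actor to stop after draining earlier messages, then returns the count.
    int stopAndGetCount() {
        mailbox.add(STOP);
        try {
            worker.join(); // join makes the worker's writes visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while stopping actor", e);
        }
        return pagesSeen;
    }

    // Handles messages strictly one at a time, in arrival order.
    private void processMailbox() {
        try {
            while (true) {
                String message = mailbox.take();
                if (message == STOP) {
                    return;
                }
                pagesSeen++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        PageCounterActor actor = PageCounterActor.start();
        actor.tell("http://example.com/a");
        actor.tell("http://example.com/b");
        System.out.println(actor.stopAndGetCount());
    }
}
```

A framework such as Akka adds supervision, routing, and distribution on top of this basic mailbox idea; the sketch only shows why message passing avoids shared mutable state.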
Java crawler engineer skill list (reposted)