Introduction to Java Development, Web Crawlers, Natural Language Processing, and Data Mining

Last Update:2016-06-09 Source: Internet

Author: User

Tags java web java se


I. Java development

(1) Application development, that is, Java SE development. This is not where Java's strengths lie, so its market share is very low and its prospects are not promising.

(2) Web development, that is, Java Web development. This mainly means building systems on in-house or mature third-party frameworks such as SSH, Spring MVC, SpringSide, and Nutz. There are many mature cases in fields such as OA, finance, and education. This is currently the largest market, hence the phrase "Java for the Web". Its drawback is that the barrier to entry is low, so pay is moderate and room for advancement is limited and slow.

(3) Mobile development (Android). This is the current trend, but the mobile side usually only plays the role of a client, so its technical difficulty and complexity are relatively low. Because the field is hot at the moment, the market pays well for these skills. Long-term room for growth is limited, although pay rises somewhat faster than in PC-side web development.

II. Web crawlers

A web crawler, also called a spider, originated with search engines such as Baidu and Google. With the rise of big data in recent years, crawler applications have reached unprecedented prominence. In big data terms, the data a company owns or its users generate is usually very limited. Only platforms such as e-commerce sites and Weibo can be largely self-sufficient. Many data analysis and mining companies instead use web crawlers to collect data from different sources and build their own integrated big data platforms. Public opinion monitoring, financial and stock analysis, and advertising data mining all fall into this category. The technical side is described below.

(1) Traditional crawlers, such as Nutch and Heritrix. These handle simple pages well, that is, pages without complex requests. But with the rise of Web 2.0, more and more websites use dynamic interaction technologies such as Ajax to improve the user experience, or require users to log in before pages can be accessed. Traditional crawlers cannot handle these cases, or the cost of secondary development is too high, so many people have given up using them.

(2) Custom crawlers. Big data platforms such as Weibo, e-commerce sites, and review sites have complex page interactions and pages that are accessible only after login, so crawlers often have to be custom developed for them: a Weibo crawler for Weibo, a custom crawler for Dianping, a crawler for Douban reviews. These are typical custom crawlers. They are harder to build than traditional crawlers. They require matching analysis tools and skills, very solid programming ability, efficiency optimization, and ways of coping with anti-crawling measures such as CAPTCHAs and refused requests before they can crawl efficiently. The mainstream approach is still HttpClient + jsoup, which handle network download and page parsing respectively.

(3) New-style crawlers. These combine mature third-party tools such as WebKit, HtmlUnit, PhantomJS, and CasperJS. What they have in common is that they simulate browser operation as closely as possible to solve the problems that (1) and (2) handle poorly, such as simulated login, obtaining complex parameters, and complex page interaction. These tools often solve such problems easily. Their biggest drawback is that, because they drive a real browser, they are less efficient, so they usually need to be combined with HttpClient to be both efficient and practical. A Baidu meta-search crawler built on PhantomJS also bears this out. As a next step, the same tool can simulate the Weibo login to obtain the cookie, after which HttpClient + jsoup handles the bulk data crawling. This makes a very good Weibo crawler solution.
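
The cookie handoff described in (3) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: it assumes the cookies have already been exported from a browser-simulation tool such as PhantomJS, and it swaps in the JDK's built-in java.net.http API (Java 11+) for Apache HttpClient. The class name CookieHandoff, the cookie names, and the URL are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class CookieHandoff {
    // Serialize cookies (for example, exported from a simulated browser login)
    // into the value of a Cookie request header: "name1=value1; name2=value2".
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    // Build a plain GET request that carries the logged-in session's cookies.
    // Sending it is left to the caller (HttpClient.send), so nothing here
    // touches the network.
    static HttpRequest authenticatedGet(String url, Map<String, String> cookies) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Cookie", toCookieHeader(cookies))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical cookie names and values, for illustration only.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("sid", "abc");
        cookies.put("uid", "42");

        HttpRequest request = authenticatedGet("http://example.test/feed", cookies);
        System.out.println(request.headers().firstValue("Cookie").orElse(""));
    }
}
```

The slow browser simulation runs once per session to get past the login, and every later page is fetched with a lightweight request like this one.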

Because crawler work requires more knowledge, it pays better than web development, and pay rises further and faster.
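
The download-and-parse loop behind the crawlers in this section can be sketched with the JDK alone. This is a simplified illustration, not a production crawler: the page fetcher is passed in as a function so the example runs without network access, link extraction uses a naive regular expression where a real project would use an HTML parser such as jsoup, and the class name MiniCrawler and the example.test URLs are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniCrawler {
    // Naive extractor for double-quoted href attributes.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first crawl from the seed. Each URL is visited at most once,
    // and the crawl stops after maxPages URLs have been visited.
    static Set<String> crawl(String seed, Function<String, String> fetch, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already seen
            }
            String html = fetch.apply(url);
            if (html == null) {
                continue; // page unavailable; it still counts as visited
            }
            for (String link : extractLinks(html)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // An in-memory stand-in for the web, so the sketch runs offline.
        Map<String, String> fakeWeb = Map.of(
                "http://example.test/",
                "<a href=\"http://example.test/a\">A</a><a href=\"http://example.test/b\">B</a>",
                "http://example.test/a", "<a href=\"http://example.test/b\">B</a>",
                "http://example.test/b", "<a href=\"http://example.test/\">home</a>");
        System.out.println(crawl("http://example.test/", fakeWeb::get, 10));
    }
}
```

Keeping the fetcher as a parameter leaves room to plug in a real HTTP client later, along with request throttling and retry logic.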

III. Natural language processing

Natural language processing is abbreviated NLP. The same abbreviation stands for many other terms, which causes a lot of confusion. It has three typical parts: word segmentation, part-of-speech tagging, and syntactic analysis.

(1) Word segmentation: mainstream tools include the open source ANSJ segmenter, ICTCLAS, HIT's LTP, the "Massive" segmenter, the Fudan segmenter, and others. Building on ANSJ, I also refactored and developed the Dawn segmenter, and have joined NLPchina, the open source Chinese natural language processing organization led by the ANSJ author: https://github.com/NLPchina/.

(2) Part-of-speech tagging: the mainstream tag sets used to be those of ICTCLAS and Peking University. Now there are also the HIT LTP platform and the tag set from the natural language processing laboratory at Dalian University of Technology, among others. They are all much the same.

(3) Syntactic analysis: this is harder than the previous two. Within China, as far as I know, HIT's LTP does Chinese syntactic analysis well. Stanford's parser is acceptable for English syntax, but it does less well on Chinese.

This field is specialized and difficult, and the workload is heavy, yet open source segmenters are plentiful and practical. As a result, the people who specialize in it are usually at large companies or are strong individual experts. Naturally, it pays better than the first two fields above.
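
To make the word segmentation task concrete, here is a sketch of forward maximum matching, a classic dictionary-based baseline. It is not the algorithm used by ANSJ, ICTCLAS, or LTP, which rely on more sophisticated statistical models. The class name ForwardMaxMatch and the toy dictionary are hypothetical. The demo uses unspaced English so that it reads the same in any file encoding, and Chinese text would be handled the same way given a Chinese dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ForwardMaxMatch {
    // Scan left to right. At each position, take the longest dictionary word
    // (up to maxWordLen characters) that starts there. If nothing matches,
    // emit a single character and move on.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            // Shrink the window until it is a dictionary word or one character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Toy dictionary; real segmenters ship dictionaries with many thousands of entries.
        Set<String> dict = Set.of("the", "tab", "table", "down", "there", "here");
        // "table" wins over the shorter entry "tab" because the longest match is tried first.
        System.out.println(segment("thetabledownthere", dict, 5)); // prints [the, table, down, there]
    }
}
```

Greedy longest matching is fast and simple, but it cannot resolve ambiguous splits, which is one reason the mainstream tools go beyond it.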

IV. Data mining

Data mining is the current trend. It usually builds on NLP and combines it with typical data mining algorithms, such as classification, clustering, and neural network algorithms, to develop data mining applications and products.

(1) Self-developed mining algorithms: starting from a solid foundation in mathematics and computer science, independently researching, developing, and tuning the relevant algorithms. This is relatively difficult and is usually done by experts or algorithm R&D engineers.

(2) Using third-party open source components: Weka, Mahout, LIBSVM, and others provide packaged implementations of many data mining algorithms that application developers can call directly. You only need to learn the API, then supply input and read output as the documentation describes.
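
As an illustration of the clustering algorithms mentioned above, here is a small sketch of k-means (Lloyd's algorithm) on one-dimensional data, written against the JDK only. It does not reproduce the API of Weka, Mahout, or LIBSVM, which provide tested implementations for real use. The class name KMeans1D, the sample points, and the starting centroids are hypothetical.

```java
import java.util.Arrays;

class KMeans1D {
    // Index of the centroid closest to point p (ties go to the lower index).
    static int nearest(double p, double[] centroids) {
        int best = 0;
        for (int k = 1; k < centroids.length; k++) {
            if (Math.abs(p - centroids[k]) < Math.abs(p - centroids[best])) {
                best = k;
            }
        }
        return best;
    }

    // Alternate between assigning points to their nearest centroid and moving
    // each centroid to the mean of its points, until nothing changes or
    // maxIterations is reached. Returns the final centroids.
    static double[] cluster(double[] points, double[] initialCentroids, int maxIterations) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                int k = nearest(p, centroids);
                sum[k] += p;
                count[k]++;
            }
            boolean changed = false;
            for (int k = 0; k < centroids.length; k++) {
                if (count[k] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double updated = sum[k] / count[k];
                if (updated != centroids[k]) {
                    changed = true;
                }
                centroids[k] = updated;
            }
            if (!changed) {
                break; // converged
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        double[] result = cluster(points, new double[] {0, 5}, 100);
        System.out.println(Arrays.toString(result)); // prints [2.0, 11.0]
    }
}
```

Writing even a simple algorithm by hand shows what the libraries take care of, such as centroid initialization, empty clusters, and convergence checks.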

V. How the four relate

Java Web development can be thought of as the portal: it lets users see what happens in the back end more clearly and directly.

Web crawlers are the means of acquiring big data, in preparation for NLP and data mining.

NLP is the middleware that links the data gathered by web crawlers to data mining.

Data mining is the ultimate goal, and also the core of the transition.

These four form a sequence in which each stage builds on the one before it. Someone who masters all four is a true expert.

This was written rather casually. Corrections and discussion are welcome.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
