Data Mining and Web development

Source: Internet
Author: User
Tags ibm db2

(0) Intro

This post starts from a real-life example. Something similar may be happening to someone you know.

My little brother has been working for five years and has been feeling lost lately.

His last job was at a fairly large portal site, doing both web development and mobile Internet data mining (the team was short-staffed, so he did both at once). He then moved to one of the BAT companies (Baidu, Alibaba, Tencent) to do data mining.

The data volume there is quite large, but the work does not feel very meaningful: it is only analyzing logs and producing reports.

His earlier experience in high-performance web development goes completely unused. He feels that development work would at least keep him close to the business.

But data mining and big data are very hot right now, so he is torn.

In fact, many people are stuck on the same question, and I am confused about it myself. Readers, please offer your suggestions!

Thank you!


(1) Big data and its uneasy relationship with mathematics
There are two situations. In the first, the company has no such business yet and now wants to get some machine learning running (starting from scratch). In the second, the company already has a foundation when you take over, and your job is to tune performance (from poor to excellent). In the first case mathematics hardly matters: getting a system up with other people's modules and code comes first. The second depends on the specific problem, but mathematics is still unnecessary in most cases. Stronger mathematical skills may be needed in more research-oriented places, such as parts of Google X, or deep learning (machine learning) institutes like Baidu's Phoenix Nest Research Institute or Microsoft Research Asia.

For a typical data analysis or data mining team, though, and especially for a given classifier, results mostly depend on how good the features are: finding one excellent feature beats grinding away at parameter tuning for ten thousand years. (Inside a company, KPIs come first and the prevailing habit is to take what exists and use it, so lean on existing open-source libraries.) Learn linear algebra, statistics, and convex optimization well, then go out and gain experience on real problems. Accumulating systems experience and practical tricks is the way to go.
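The claim that one good feature beats endless tuning can be shown with a toy sketch. Everything here is an assumption made for illustration: the grid of points, the labeling rule, and the helper names are not from the original post. The label depends on the sign of x * y, so no amount of threshold tuning on x alone helps, while the engineered feature x * y separates the classes at once.

```python
# Toy illustration (assumed data): the label is 1 when x and y share a sign.

def make_points():
    """Build a small grid of (point, label) pairs."""
    values = [-3, -2, -1, 1, 2, 3]
    return [((x, y), 1 if x * y > 0 else 0) for x in values for y in values]

def accuracy(points, feature, threshold):
    """Accuracy of the rule 'predict 1 when feature(point) > threshold'."""
    hits = sum(1 for p, label in points if (feature(p) > threshold) == (label == 1))
    return hits / len(points)

def best_threshold_accuracy(points, feature, thresholds):
    """'Tuning': try every candidate threshold and keep the best accuracy."""
    return max(accuracy(points, feature, t) for t in thresholds)

points = make_points()
thresholds = [t / 2 for t in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5

# Tuning a threshold on the raw feature x.
raw_acc = best_threshold_accuracy(points, lambda p: p[0], thresholds)

# Same tuning, but on the engineered feature x * y.
engineered_acc = best_threshold_accuracy(points, lambda p: p[0] * p[1], thresholds)

print(raw_acc, engineered_acc)
```

On this toy data, every threshold on the raw feature is stuck at chance level, while the engineered feature classifies all points correctly. Real data is never this clean, but the direction of the argument is the same.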

Of course, I am not saying mathematics is useless. My point is only that if you are heading into industry and have already learned linear algebra, statistics, and convex optimization, the same time spent learning how computer systems are built and how to think in terms of systems pays off more than studying further mathematics. In most ML research, though, calculus, linear algebra, and probability and statistics remain the most important foundations.


(2) Jobs that grew out of big data

Data development engineers lean toward engineering. I do not know this area well; as I understand it, the work is mainly data warehouse development.

Data analysts focus on analysis, mainly doing data analysis driven by business needs: identifying a problem, analyzing it, and proposing a solution.

Data mining engineers focus on mining, mainly using data mining or machine learning algorithms for classification, prediction, and similar tasks, for example churn, default, and recommendation.

Data product managers lean toward the product manager role and are responsible for the PD work related to data products.

Data products are products built on data analysis or mining.

If I had to draw distinctions: data development engineers need solid development skills, and their work leans toward building data systems.

Data analysts are closer to traditional BI. Data mining engineers do mining work aimed at specific needs, for example mining the preferences of a user group. Data product managers sit on the product side, as product managers for data businesses or data products.

In the end, apart from the product manager, the other three positions have very similar responsibilities, and the actual work depends on what the department requires.

To quote Wikipedia:

Data development engineer: build infrastructure that lets big data be stored, processed, and computed within the required time and at a reasonable cost.

Data analyst: identify problems, analyze them, and draw conclusions that support decision-making.

Data mining engineer: build models to predict and classify the objects of interest.

[Figures: a few diagrams illustrating the roles described above.]







(3) DM and ML

1. DM is more application-oriented, while ML leans toward research and algorithms (which is why companies typically have data mining engineers and machine learning researchers).

2. An ML problem is usually clearly defined, with both a dataset and a target (and the dataset is fixed). DM usually defines only the target, and sometimes not even that (you are handed a pile of data and asked to find something valuable and interesting in it).

Even when the target is defined, DM may work with a data source that is not fixed.

3. ML is only one of the methods DM uses. DM can also use others, such as statistics or simply looking at the data directly.

4. DM is an interdisciplinary field. ML is an important foundation for it, but DM draws on other foundational disciplines too, most importantly statistics and databases.

5. DM's focus is on data. People who do DM may spend 80% of their time wrangling data in various ways and only 20% on algorithms. For ML it is the reverse: 80% of the time goes to reading papers and testing algorithms, and 20% to processing data.


(4) Text mining within data mining

Data mining is one step in knowledge discovery in databases (Knowledge-Discovery in Databases, KDD). It usually refers to the process of using algorithms to search large amounts of data for hidden information.

Data mining is usually associated with computer science and achieves its goals through statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb), pattern recognition, and other methods.

Text mining is sometimes also called text data mining and is roughly equivalent to text analysis. It generally refers to the process of deriving high-quality information from text. High-quality information is usually produced through classification and prediction, for example by pattern recognition.

Text mining usually involves processing the input text (typically parsing it, adding some derived linguistic features, removing noise, and then inserting the result into a database), producing structured data, and finally evaluating and interpreting the output. "High quality" in text mining generally refers to some combination of relevance, novelty, and interest.
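The processing step described above (parse the text, remove noise, derive features, end up with structured data) can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the stopword list, the sample documents, and the record layout are all hypothetical, and storing the records in a database is left out.

```python
import re
from collections import Counter

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def to_record(doc_id, text):
    """Turn raw text into a structured record: id, token count, term frequencies."""
    # Dropping stopwords and punctuation is a crude form of noise removal.
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return {"id": doc_id, "length": len(tokens), "terms": Counter(tokens)}

# Made-up input documents.
docs = {
    1: "The price of the phone is great, and the screen is great!",
    2: "Shipping to the city was slow.",
}

records = [to_record(doc_id, text) for doc_id, text in docs.items()]

for record in records:
    print(record["id"], record["length"], record["terms"].most_common(2))
```

Each record is now a row of structured data that later steps (classification, clustering, reporting) can work on without touching the raw text again.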

Typical text mining tasks include text classification, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning the relations between named entities).

To borrow a line from Gauss and adapt it for everyone who does data mining and text mining:

"Ignorance in data mining and text mining is not a lack of knowledge, but relying on data mining and text mining so heavily that everything else is ignored."

Text data mining (text mining) refers to computer processing techniques that extract valuable information and knowledge from text data. As the name implies, text data mining is data mining performed on text. In this sense, it is a branch of data mining.

Text mining methods:

1. Text classification: a typical machine learning method, generally divided into two stages, training and classification.

2. Text clustering: a typical unsupervised machine learning method; the choice of clustering method depends on the data type.

3. Information extraction.

4. Summarization.

5. Compression.

Of these, text classification and clustering are the two most important and central mining functions.
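The two stages of text classification listed above, training and classification, might look like the following sketch. It uses a small naive Bayes classifier with add-one smoothing, which is my choice for illustration and not a method named in the post; the training sentences and labels are made up.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Training stage: count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocabulary.update(words)
    return word_counts, class_counts, vocabulary

def classify(model, text):
    """Classification stage: return the class with the highest log-probability."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for illustration.
training_data = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]

model = train(training_data)
print(classify(model, "buy cheap money"))
print(classify(model, "monday project meeting"))
```

In practice, as section (1) suggests, you would more likely reach for an existing open-source library than write this yourself; the point here is only to make the two stages visible.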

Mining tools:

1. IBM DB2 Intelligent Miner.

2. SAS Text Miner.

3. SPSS Text Mining.

4. DMC TextFilter (a general-purpose library for plain-text extraction).

Applications: the traditional business applications of text mining are mainly enterprise competitive intelligence, CRM, e-commerce sites, and search engines; it has now spread to the medical, insurance, and consulting industries.


(5) The past and present of artificial intelligence, machine learning, statistics, and data mining
I assume the original poster wants a clear picture with sharp dividing lines between the fields.

So here I will try to explain it in the simplest way I can.

Machine learning is a science concerned with developing self-learning algorithms. Such algorithms are generic by nature and can be applied to problems in many related domains.

Data mining is the practice of applying algorithms (mostly machine learning algorithms) to data produced in a particular domain in order to solve problems related to that domain.

Statistics is the science of collecting, organizing, analyzing, and interpreting numerical information in data. It can be divided into two categories: descriptive statistics and inferential statistics.

Descriptive statistics involves organizing, summarizing, and characterizing the information in the data.

Inferential statistics involves using sampled data to draw conclusions about the whole population.
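The split between the two kinds of statistics (summarizing the data you have versus judging the whole from a sample) is easy to see in a short sketch. The sample values are made up, and the confidence interval uses a simple normal approximation, which is an assumption on my part; for a sample this small a t-based interval would be somewhat wider.

```python
import math
import statistics

# Hypothetical sample: daily page views (in thousands) on 10 sampled days.
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

# Descriptive statistics: summarize the data we actually have.
mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)  # sample standard deviation

# Inferential statistics: use the sample to say something about the whole.
# Rough 95% confidence interval for the population mean (z = 1.96).
standard_error = stdev / math.sqrt(len(sample))
ci_low = mean - 1.96 * standard_error
ci_high = mean + 1.96 * standard_error

print(f"mean={mean}, median={median}, stdev={stdev:.2f}")
print(f"95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first three numbers only describe these ten days. The interval is a statement about all days, including the ones that were never sampled, and that step from sample to whole is what makes it inference.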

Machine learning uses statistics (mostly inferential statistics) to develop self-learning algorithms.
Data mining applies statistics (mostly descriptive statistics) to the results obtained from those algorithms in order to solve the problem at hand.
As a discipline, data mining developed to solve problems in a wide range of industries, especially business, and doing so draws on techniques and practices from different fields of study.
In the 1960s, practitioners used the term "data fishing" to refer to their work.

In 1989, Gregory Piatetsky-Shapiro used the term "knowledge discovery in databases" (KDD). In 1990, a company used the term "data mining" in a trademark to describe its work. Today the terms data mining and KDD are used interchangeably.

The goal of artificial intelligence as a science is to develop systems or software that simulate how humans react and behave in a given environment.

Because the field is extremely broad, AI breaks its goal into multiple sub-goals, and each sub-goal has developed into an independent branch of research.

Here is a list of the main goals of AI (also known as AI problems):
1. Reasoning
2. Knowledge representation
3. Automated planning and scheduling
4. Machine learning
5. Natural language processing
6. Computer vision
7. Robotics
8. General intelligence, or strong AI
As the list shows, machine learning is a research area that grew out of one of AI's sub-goals: helping machines and software teach themselves to solve the problems they encounter.
Natural language processing is another research area that grew out of an AI sub-goal: helping machines communicate with real people.

Computer vision grew out of the AI goal of recognizing and identifying the objects a machine can see.
Robotics also grew out of an AI goal: giving a machine a physical form so that it can carry out physical actions.
Is there a hierarchy among them? What does it look like?
One way to explain the hierarchy among these sciences and research areas is to look at their history.


The origins of each science and research area

Statistics: 1749
Artificial intelligence: 1940
Machine learning: 1946
Data mining: 1980
The recognized history of statistics dates back to about 1749, when it was used to characterize information. Researchers used statistics to describe a country's economic level and the material resources available for military use.

The use of statistics was later extended to the analysis and organization of data.


The history of artificial intelligence falls into two categories: classical and modern. Classical AI can be found in ancient stories and writings. Modern artificial intelligence, however, emerged in 1940, when people described the idea of machines imitating humans.


In 1946, machine learning originated as a branch of AI. Its goal is to let machines reach their objectives by teaching themselves, without being explicitly programmed or hard-wired.

Is it fair to say that these are four fields that use different approaches to solve similar problems?

It can be said that statistics, artificial intelligence, and machine learning are highly interdependent fields: none of them could exist alone, without guidance and help from the others.

It is pleasing to view these three fields as one holistic field rather than three separate ones.
Because these three fields form one whole, each contributes its own strengths toward shared goals. The resulting solutions therefore suit many different domains, because the underlying core problems are the same.
Next comes data mining, which takes these holistic solutions and applies them to different domains (business, military, medicine, space) to solve problems that share the same underlying nature.

This is also when data mining gained its popularity.

I hope my explanation has answered all of the original poster's questions. I believe it will help anyone who wants to understand the key points of these four fields. If you have anything to say or share about the topic, please write down your thoughts in the comments.

(6) Summary


Related articles

High-speed Python and error-prone (text processing)

Python Text processing and JAVA/C

10 minutes Learn the basic types of Python

High Speed Learning Python (actual combat)

The way to Big data processing (10 minutes to learn Python)


