Mahout and Hadoop: Fundamentals of machine learning


Computing is often used to analyze data, but understanding data relies on machine learning. For many years, machine learning seemed remote and elusive to most developers.

It is now probably one of the most profitable and popular technologies. For a developer, machine learning is without doubt a skill worth acquiring.

Figure 1: The composition of machine learning

Machine learning is a natural extension of simple data retrieval and storage. It combines a variety of components to make computers learn and behave more intelligently.

Machine learning makes it possible to mine historical data and predict future trends. You may not realize it yet, but you already use machine learning and benefit from it. Examples include search engine results, online recommendations, advertising, fraud detection, and spam filtering.

Machine learning relies on data for decision making. Intuition, though important, rarely beats empirical data.

All aspects of machine learning

Once you start delving into machine learning, you will encounter the following topics:

1. Supervised and unsupervised learning

2. Classification

3. Markov models, Bayesian networks, etc.

Mahout and Hadoop

The purpose of the Apache Mahout project is to build a scalable machine learning library.

There is a certain degree of overlap between big data analysis and Hadoop.

Together with Hadoop, Mahout gives you a complete open-source machine learning project for free. For more information, see:

http://mahout.apache.org/

Mahout has built-in clustering, classification, and collaborative filtering algorithms. These include:

1. Recommender systems based on matrix factorization

2. K-means and fuzzy K-means clustering

3. Latent Dirichlet allocation (LDA)

4. Singular value decomposition

5. Logistic regression classifier

6. (Complement) naive Bayes classifier

7. Random forest classifier
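
To make one item in the list above concrete, here is a minimal sketch of K-means clustering in plain Python. It is not Mahout's API (Mahout is a Java library that runs on Hadoop); the function name `kmeans`, the sample points, and the starting centroids are all assumptions chosen for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Cluster 2-D points with basic K-means, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable, so stop early
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, so real libraries usually try several initializations or use a smarter seeding strategy.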

I went to the University of California at Berkeley and found that they had a lot of good classes.

I wish I had more time. After some thought, I decided to start with the MIT online course at the following address:

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/index.htm

Azure democratizes machine learning

Machine learning once required sophisticated software, high-end computers, and data scientists. Today, machine learning (that is, predictive analytics) needs only a fully managed cloud service.

Welcome to ML Studio

With drag-and-drop and simple data flow diagrams, you can set up experiments and draw on a large library of algorithms, much as you would by writing code.

Data scientists write code in R

R is a popular open-source project for statistics and data mining. The good news is that R can be easily integrated into ML Studio. I have many friends who use functional languages such as F# for machine learning. But clearly, R still dominates this field.

Polls and surveys of data miners show that the popularity of R has been increasing in recent years. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which John Chambers is a member. The name R comes partly from the first-name initials of its two original authors. R is a GNU project, written mainly in C and Fortran.

Data analysis

The following framework offers a way to understand predictions made with machine learning. In general, it concerns how to use limited resources to support decisions that increase revenue or limit costs, for example forecasting consumption patterns or optimizing a supply chain.

How to analyze data

The best way to understand machine learning is to decompose the analysis into 3 questions:

1. What happened?

a) A historical view

2. What will happen?

a) Predicting the future

3. What should be done next?

a) Rules and guidelines for action
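
The three questions above can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the monthly sales figures, the stock level, and the function names `describe`, `forecast_next`, and `recommend` are assumptions, and the straight-line trend stands in for whatever predictive model you would really use.

```python
import math


def describe(history):
    """What happened? Summarize the historical values."""
    return {"total": sum(history), "mean": sum(history) / len(history)}


def forecast_next(history):
    """What will happen? Fit a least-squares line and extend it one step.

    Needs at least two observations, otherwise the slope is undefined.
    """
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
             / sum((x - x_mean) ** 2 for x in range(n)))
    intercept = y_mean - slope * x_mean
    return intercept + slope * n


def recommend(predicted_demand, stock):
    """What should be done next? Order enough units to cover the forecast."""
    return max(0, math.ceil(predicted_demand - stock))


sales = [100, 110, 120, 130]       # hypothetical monthly unit sales
summary = describe(sales)          # question 1
prediction = forecast_next(sales)  # question 2
order = recommend(prediction, stock=120)  # question 3
print(summary, prediction, order)
```

Each function answers one question and feeds the next, which is the point of splitting the analysis this way.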

What role do you play in the analysis process?

1. Information workers

a) Often use self-service tools. Power BI for Office 365 is a self-service business intelligence solution that lets information workers analyze data, gain deeper business insight, and visualize predictions through Excel and Office 365.

2. IT experts

a) Data transformation, data warehousing, building analysis cubes, and data modeling

3. Data scientists

a) Deep technical skills, including coding, math, statistics, and probability

b) Use a range of techniques to forecast in terms of probabilities (for example, the probability of a price increase in the next 18 hours is 42%)

c) Techniques such as Monte Carlo simulation and model parameterization

d) The qualities of a data scientist

i. Domain knowledge

ii. Clear understanding of the scientific method: objectives, hypotheses, validation, transparency

iii. Good at math and statistics

iv. Curiosity and strong reasoning ability

v. Visualization and communication skills

vi. Advanced computing and data management capabilities
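
As an illustration of the probabilistic forecasts and Monte Carlo simulation mentioned in the list above, the following sketch estimates the chance that a price ends higher after 18 hours. The random-walk price model, the `drift` and `volatility` values, and the function name `probability_of_increase` are assumptions made for this example, not a method taken from the article.

```python
import random


def probability_of_increase(hours=18, drift=0.0, volatility=0.01,
                            trials=10_000, seed=0):
    """Estimate P(price after `hours` > starting price) by simulation.

    Each hour the price is multiplied by (1 + r), where r is drawn from a
    normal distribution with mean `drift` and standard deviation `volatility`.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    start = 100.0
    ups = 0
    for _trial in range(trials):
        price = start
        for _hour in range(hours):
            price *= 1.0 + rng.gauss(drift, volatility)
        if price > start:
            ups += 1
    return ups / trials


estimate = probability_of_increase()
print(f"Estimated probability of a price increase: {estimate:.1%}")
```

The estimate is only as good as the assumed model. In practice the drift and volatility would be fitted to historical data, which is the model parameterization step.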

Academic background

If you want to study at a university to become a data scientist, consider the following fields:

1. Applied Mathematics

2. Computer Science

3. Economics

4. Statistics

5. Engineering

Industries benefiting from data science include:

1. Financial Services

2. Telecommunications

3. Information Technology

4. Manufacturing

5. Public Utilities

6. Public Health

7. Marketing

[Original content from TechTarget China, copyrighted and published with authorization by China Big Data. Reprinting is not permitted; TechTarget China reserves the right to pursue legal liability.]
