Important aspects of machine learning



Machine learning sounds like a wonderful, fully automated concept, but in practice some of its processes are not automated at all. When designing a solution, manual work is often required, and that work is a crucial part of getting good results. Some of these aspects are:

What kind of machine learning algorithm should I use?

Supervised or unsupervised?

Do you have labeled data, that is, inputs paired with their corresponding outputs? If so, you can use a supervised learning algorithm. If not, an unsupervised algorithm may be able to solve the problem.

Classification, regression or clustering?

It depends mainly on the kind of problem you want to solve. If you want to assign data to discrete categories, classification may be the right choice. If instead you want to predict a number, such as a score, regression is your best option. And if you want to recommend similar products to a user based on their current browsing on an e-commerce site, clustering is a good fit.
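The three problem types above can be sketched with scikit-learn (my choice of library for illustration; the data here is synthetic):

```python
# Minimal sketch of the three problem types using scikit-learn.
from sklearn.datasets import make_classification, make_regression, make_blobs
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Classification: predict a discrete label.
Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(Xc, yc)

# Regression: predict a continuous number, such as a score.
Xr, yr = make_regression(n_samples=100, n_features=4, random_state=0)
reg = LinearRegression().fit(Xr, yr)

# Clustering: group unlabeled data, e.g. similar products.
# Note that no y is passed to fit() here.
Xu, _ = make_blobs(n_samples=100, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xu)
```

The key difference is visible in the `fit` calls: the supervised estimators receive both inputs and outputs, while the clustering estimator receives inputs only.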

Deep learning, SVM, naive Bayes, decision tree... Which is best?

My answer is: there is no best. Deep learning and support vector machines have proven to be among the most powerful and flexible algorithms across many applications, but depending on the specific problem, other machine learning algorithms may work better. Analyze their respective strengths and use the one that fits!
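One practical way to act on "there is no best" is to cross-validate several candidate algorithms on your own data and compare them empirically. A sketch, using the built-in iris dataset as a stand-in for your data:

```python
# Compare several candidate algorithms by cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds, per algorithm.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
```

Which algorithm wins depends entirely on the dataset, which is the point: measure, don't assume.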

Feature engineering

Feature engineering is the process of extracting and selecting the most important features used to represent training examples for a machine learning algorithm. It is one of the most important aspects of machine learning, and it is sometimes not given enough attention.

Please note: if you do not provide good features to the algorithm, the results will be poor even if you use the best machine learning algorithm available. It is like trying to learn to read with the naked eye in the dark: no matter how smart you are, you cannot do it.

Feature extraction

To feed data into a machine learning algorithm, you usually need to convert the raw data into something the algorithm can "understand". This process is called feature extraction; typically, we convert raw data into feature vectors.

Take an image as an example: how do we feed it into a machine learning algorithm?

A straightforward way is to convert the image into a vector whose components are the gray values of its pixels. Each component, or feature, is then a value from 0 to 255, with 0 representing black, 255 representing white, and 1 to 254 the shades of gray in between.
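This pixel-to-vector conversion is a one-liner with NumPy (the 28x28 image here is randomly generated just to show the shape of the operation):

```python
# Flatten a grayscale image into a feature vector of pixel intensities
# (0 = black, 255 = white), as described above.
import numpy as np

# A fake 28x28 grayscale image standing in for real input data.
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# One vector component per pixel: 784 features in total.
feature_vector = image.flatten()
```

Most libraries expect exactly this representation: one flat numeric vector per example, stacked into a 2-D matrix of shape (n_examples, n_features).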

This approach may work, but it might work better if we provide higher-level features:

Does the image contain a human face?

What is skin color?

What color is the eye?

Does the face have hair on it?

...

These higher-level features provide more knowledge to the algorithm than the raw gray value of each pixel (and they can themselves be computed by other machine learning algorithms). By providing higher-level features, we give the algorithm better information with which to decide whether my face, or someone else's, appears in an image.

If we implement better feature extraction:

Our algorithm is more likely to learn and produce the expected results.

We may not need as many training examples.

In this way, we can significantly reduce the time required to train the model.

Feature selection

Sometimes (in fact, in most cases) the features we choose to feed into the algorithm are not all useful. For example, when labeling the sentiment of a tweet, we might use features such as the tweet's length, the time it was published, and so on. These features may or may not be useful, and there are automatic ways to find out. Intuitively, a feature selection algorithm scores each feature and then returns the most important ones according to those scores.
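The score-and-keep-the-best idea can be sketched with scikit-learn's `SelectKBest` (the ANOVA F-test used here is one common scoring choice, not the only one; the dataset is synthetic, with only 5 of 20 features actually informative):

```python
# Score each feature and keep only the k highest-scoring ones.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 20 features, of which only 5 carry real signal.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Score every feature with an ANOVA F-test, then keep the top 5.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_reduced = selector.transform(X)
```

After `fit`, `selector.scores_` holds one score per original feature, which is exactly the per-feature ranking described above.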

Another point to remember is to avoid using massive feature sets. It is tempting to add every possible feature to the model and let the algorithm sort it out, but this is not a good idea. As we add features to represent each instance, the dimensionality of the space increases and the data becomes sparser: with more features, we need examples covering far more combinations of feature values. This is the so-called curse of dimensionality: as model complexity grows, the number of training examples needed grows exponentially. Believe me, this is a hard problem.

Training examples

You must feed training examples to the machine learning algorithm. Depending on the problem you want to solve, you may use hundreds, thousands, millions, or even hundreds of millions of training examples. Maintaining sample quality is also critical: if you feed wrong examples to the algorithm, the chances of getting good results are reduced.

Collecting a large amount of high-quality data to train machine learning algorithms is often a labor-intensive task. Unless you already have labeled data, you will need to label it yourself or hire someone to do it. Some tools on crowdsourcing platforms try to solve this problem. You can also make labeling more efficient by using your own machine learning model as an assistant.

The general rule for training samples is: the more high-quality training data you collect, the better your training results are likely to be.

Test samples and performance indicators

After we have trained a machine learning model, we need to test its performance. This is very important; otherwise, you will not know whether your model has learned anything!

The concept is very simple: we use a test set, a collection of examples not included in the training set. We feed each test example into the model and check whether it produces the expected result. In the case of supervised classification, we feed in each test example and check whether the model's output matches the expected label. If the model is correct on 95% of the test samples, we say its accuracy is 95%.
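This evaluation loop maps directly onto scikit-learn's train/test utilities (the iris dataset and k-nearest-neighbors classifier here are stand-ins for your data and model):

```python
# Hold out a test set the model never sees during training,
# then measure accuracy on it.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 70% of the examples go to training, 30% to testing; no overlap.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier().fit(X_train, y_train)

# Fraction of test examples the model classifies correctly.
accuracy = accuracy_score(y_test, model.predict(X_test))
```

The split happens before training, which is what guarantees the non-overlap the next paragraph insists on.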

It is important to keep in mind that the training and test sets must not overlap; this is the only way to test the model's generalization and predictive power. You may get high accuracy on your training data but poor accuracy on a separate test set. This is overfitting: the algorithm fits the training samples too closely and loses predictive power. The usual ways to avoid overfitting are to use fewer features, simpler models, and a larger, more representative training set.

Accuracy is the most basic metric, but you should also look at other metrics, such as precision and recall, which tell you how well the algorithm performs on each category (when using supervised classification). The confusion matrix is a good tool for seeing where a classification algorithm gets confused.
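Per-class precision, recall, and the confusion matrix are one function call each in scikit-learn (the label arrays below are small illustrative values, not real predictions):

```python
# Per-class precision and recall, plus a confusion matrix,
# for a three-class classification problem.
from sklearn.metrics import precision_score, recall_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]  # expected labels
y_pred = [0, 1, 1, 1, 1, 2, 2, 0, 2, 0]  # model outputs

# average=None returns one score per class instead of a single average.
precision = precision_score(y_true, y_pred, average=None)
recall = recall_score(y_true, y_pred, average=None)

# Rows are true classes, columns are predicted classes; off-diagonal
# entries show exactly which classes get confused with which.
cm = confusion_matrix(y_true, y_pred)
```

In this example, class 1 has perfect recall (all three true 1s were found) but imperfect precision (one 0 was mislabeled as 1), a distinction plain accuracy would hide.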

For regression and clustering problems, there are other indicators to measure the performance of the algorithm.

Performance

In practice, if you want to ship a solution, you must build one that is robust and performs well. This can be a complex task in machine learning applications.

First, you need to choose a machine learning framework. This is not an easy task, because not all programming languages have strong tooling. Python, together with libraries such as scikit-learn, is a good example of an ecosystem for building powerful machine learning solutions.

After choosing a framework, consider performance. Depending on the amount of data, its complexity, and the algorithm's design, running the training algorithm can consume a lot of computation time and memory. You may need to run the training many times before getting good results, and you will often retrain the model on new examples to improve accuracy.

To train many models and get results quickly, we usually use machines with large memory and multi-core processors so that models can be trained in parallel. These are mostly practical issues, but if you want to deploy a machine learning solution in a real application, they are very important to consider.
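In scikit-learn, spreading training across cores is often a single parameter. A sketch using a hyperparameter grid search, which is a common case of "train many models" (the model, dataset, and grid values here are illustrative choices of mine):

```python
# Train many candidate models in parallel across all CPU cores.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, 4]},
    cv=3,
    n_jobs=-1,  # -1 means: use all available CPU cores
).fit(X, y)

best = search.best_params_  # the best-scoring hyperparameter combination
```

Here 4 parameter combinations x 3 cross-validation folds means 12 model fits, which `n_jobs=-1` distributes across cores, exactly the multi-core training pattern described above.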

Conclusion

That's it: I have briefly outlined what machine learning is. Many practical applications, algorithms, and concepts are not covered in this article; we leave those for the reader to study on their own. Machine learning is very powerful, but training models is also difficult, and the difficulties described here are just the tip of the iceberg. A background in computer science, and especially in machine learning, is often necessary to achieve good results, and the many obstacles can be discouraging before you get on the right track.

This is why we created MonkeyLearn, a machine learning platform for text analysis. It lets every software developer or entrepreneur get practical results quickly without reinventing the wheel. Our main job is to abstract all of these issues away from the end user, from machine learning complexity to practical scalability, delivering plug-and-play machine learning.
