Data mining modeling Evaluation Data discovery

Source: Internet
Author: User

Data mining works only when the data meets certain criteria. The following sections describe the aspects of the data, and of how it will be used, that deserve your attention.

Whether the data is available

This may seem like an obvious question, but even when the data is available, it might not be in a form that is easy to use. You can import data from a database (through ODBC) or from a file. However, the data may be stored on your computer in some other form that cannot be accessed directly, in which case you need to download or dump it into a suitable form before you can use it. The data may also be scattered across several databases and sources that have to be brought together. It may not even be online: if the data exists only on paper, it has to be entered before data mining can begin.

Whether the data contains relevant attributes

The purpose of data mining is to determine which attributes are relevant, so this may seem like a strange question. However, it is useful to look at what data is available and try to identify factors that are likely to be relevant but have not been recorded. For example, when you try to predict ice cream sales, you may have plenty of information about retail outlets or sales history, but no information about the weather and the temperature, which is likely to be important. Missing attributes do not necessarily mean that data mining cannot produce useful results, but they can limit the accuracy of the resulting predictions.
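As a rough illustration of the ice cream example, the sketch below checks how closely a candidate attribute tracks the target once that attribute has been collected. The sales and temperature figures are invented for illustration, and `pearson` is a hypothetical helper written here, not part of any data mining product.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Invented daily figures (assumption): units sold and noon temperature in Celsius.
daily_sales = [120, 135, 160, 210, 260, 300, 280]
temperature_c = [14, 16, 19, 24, 29, 33, 31]

r = pearson(temperature_c, daily_sales)
print(f"correlation between temperature and sales: {r:.3f}")
```

A value close to 1 or -1 suggests the attribute is worth adding to the dataset. A value near 0 only rules out a linear relationship, so it does not prove the attribute is irrelevant.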

A quick way to assess the situation is to perform a comprehensive audit of the data. Before you go further, connect a Data Audit node to the data source and execute it to generate a complete report. See Data audit nodes for more information.

Whether the data contains noise

Data often contains errors, or subjective judgments that vary from one person to the next. These phenomena are collectively called noise. Some noise is normal: there may well be an underlying rule, but it may not hold for 100% of observations.
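The idea of a rule that holds for most, but not all, observations can be sketched in a few lines. The rule, the temperature range, and the way noise is injected (flipping every tenth label) are all assumptions made up for this example.

```python
def rule(temperature_c):
    """Hypothetical underlying rule: sales are 'high' at 25 degrees Celsius or more."""
    return temperature_c >= 25

# Clean records follow the rule exactly: (temperature, sales_were_high).
clean = [(t, rule(t)) for t in range(10, 40)]  # 30 records

# Simulated noise (assumption): flip the label of every 10th record.
noisy = [
    (t, (not label) if i % 10 == 0 else label)
    for i, (t, label) in enumerate(clean)
]

def rule_coverage(records):
    """Fraction of records whose label agrees with the rule."""
    agree = sum(1 for t, label in records if rule(t) == label)
    return agree / len(records)

print(rule_coverage(clean))  # 1.0
print(rule_coverage(noisy))  # 0.9
```

The rule is still valid for the noisy data; it simply no longer covers every record, which is what a mining algorithm has to tolerate.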

In general, the more noise there is in the data, the more difficult it is to get accurate results. However, Clementine's machine learning methods can handle noisy data and have been used successfully on datasets containing nearly 50% noise.

Whether the data is sufficient

In data mining, the size of the dataset is not necessarily very important. The representativeness of the dataset, and its coverage of possible results and variable combinations, is much more important.

In general, the more attributes you consider, the more records you need in order to cover a representative range of cases.
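One rough way to think about coverage is to count how many of the possible attribute-value combinations actually appear in the records. The sketch below does this for categorical attributes; `combination_coverage`, the attribute names, and the sample records are hypothetical and chosen only for illustration.

```python
from itertools import product

def combination_coverage(records, domains):
    """Fraction of possible attribute-value combinations that appear in records.

    records: list of dicts mapping attribute name -> value
    domains: dict mapping attribute name -> list of possible values
    """
    names = list(domains)
    possible = set(product(*(domains[name] for name in names)))
    if not possible:
        raise ValueError("every attribute needs at least one possible value")
    seen = {tuple(record[name] for name in names) for record in records}
    return len(seen & possible) / len(possible)

# Hypothetical attributes: 4 seasons x 2 day types = 8 possible combinations.
domains = {
    "season": ["spring", "summer", "autumn", "winter"],
    "day_type": ["weekday", "weekend"],
}
records = [
    {"season": "summer", "day_type": "weekday"},
    {"season": "summer", "day_type": "weekend"},
    {"season": "summer", "day_type": "weekday"},  # duplicate adds no coverage
    {"season": "winter", "day_type": "weekday"},
    {"season": "spring", "day_type": "weekend"},
]

print(combination_coverage(records, domains))  # 0.5 (4 of 8 combinations seen)
```

Each extra attribute multiplies the number of possible combinations, so the same five records would cover at most a quarter of the space if a two-valued attribute were added. That is why more attributes call for more, and more varied, records.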

If the data is representative and there is a general underlying rule, a sample of a few thousand (or even a few hundred) records may give results as good as a million records, and you will get those results much faster.

Whether expertise on the data is available

In many cases you are working on your own data, so you are familiar with its content and meaning. However, if you are working with data from another department in your organization, or with a customer's data, it is very helpful to have access to experts who understand that data. They can guide you in identifying the relevant attributes, help you interpret the results of the data mining, and help you separate the "gold" from the "sand" in the information, or spot the "treasures" hidden among the outliers in the dataset.
