[Introduction to Data Mining] Data quality
Data quality
The data used in data mining is usually collected for other purposes, or without a specific purpose in mind. As a result, data quality usually cannot be controlled at the source. Because such problems cannot be prevented, data mining focuses on two things: detecting and correcting data quality problems (data cleaning), and using algorithms that can tolerate low-quality data.
Measurement and data collection problems
Perfect data almost never exists in practice. To deal with data quality problems, we first define measurement error and data collection error, and then consider several aspects of measurement error: noise, artifacts, bias, precision, and accuracy. We then discuss data quality problems that can involve both measurement and data collection: outliers, missing and inconsistent values, and duplicate data.
Measurement error refers to any problem caused by the measurement process, for example a recorded value that differs from the actual value. Data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including other data objects. For example, data for a particular species may be mixed with data for similar species. Both measurement errors and data collection errors may be systematic or random.
Noise is the random component of measurement error. For example, Figure 2-5 shows a time series disturbed by random noise. If there is a lot of noise, it may even mask the shape of the original data.
Figure 2-6 shows three groups of data points before and after noise points are added.
The term noise is usually used for data that has a temporal or spatial component. In such cases, image or signal processing techniques can be used to reduce the noise, but eliminating it completely is very difficult. Data mining therefore relies on robust algorithms, that is, algorithms that still produce acceptable results in the presence of noise. Data errors may also result from more deterministic phenomena, such as the same error appearing at the same place in a group of data. This kind of deterministic distortion is called an artifact.
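As a rough illustration of noise reduction on data with a temporal component, here is a minimal sketch that smooths a noisy time series with a centered moving average. The sine-wave signal, the noise level, and the window size are assumptions chosen for the example, and the moving average is only one simple signal processing technique among many.

```python
import math
import random
from statistics import mean


def moving_average(values, window=5):
    """Smooth a sequence with a centered moving average.

    Near the two ends the window is truncated to the available points.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(mean(values[lo:hi]))
    return smoothed


def rms_error(a, b):
    """Root-mean-square difference between two equally long sequences."""
    return math.sqrt(mean((x - y) ** 2 for x, y in zip(a, b)))


# Assumed example data: a clean sine wave plus Gaussian noise.
random.seed(0)
clean = [math.sin(2 * math.pi * t / 50) for t in range(200)]
noisy = [x + random.gauss(0, 0.3) for x in clean]
smoothed = moving_average(noisy, window=9)

print("error before smoothing:", rms_error(noisy, clean))
print("error after smoothing: ", rms_error(smoothed, clean))
```

The smoothed series is closer to the clean signal than the noisy one, but the error does not drop to zero, which matches the point that noise can be reduced but is hard to remove completely.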
Precision: The closeness of repeated measurements of the same quantity to one another.
Bias: The systematic deviation of the measured values from the quantity being measured.
Suppose we have a standard weight of 1 gram. To evaluate the precision and bias of a new balance, we weigh it 5 times and get {1.015, 0.990, 1.013, 1.001, 0.986}. The mean of these values is 1.001, so the bias is 0.001. Precision is measured by the standard deviation, which here is 0.013.
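The arithmetic in the balance example can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module; it assumes that the quoted precision is the sample standard deviation (dividing by n - 1), which is the variant that reproduces 0.013.

```python
from statistics import mean, stdev

true_value = 1.0  # the standard weight, in grams
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]

average = mean(measurements)
bias = average - true_value      # systematic deviation from the true value
precision = stdev(measurements)  # sample standard deviation (n - 1)

print(round(average, 3), round(bias, 3), round(precision, 3))
# prints: 1.001 0.001 0.013
```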
Accuracy: The closeness of the measured value to the actual value. Accuracy depends on both precision and bias. Another important aspect is the use of significant digits: a measurement or calculation result should be reported using only as many digits as the precision of the data justifies.
An outlier is either a data object whose characteristics differ from most of the other data objects in the dataset, or an attribute value that is unusual relative to the typical values of that attribute. Outliers are also called anomalous objects or anomalous values. It is important to distinguish between noise and outliers: an outlier can be a legitimate data object or value. Therefore, unlike noise, an outlier may itself be an object of interest.
Missing values
It is common for an object to be missing one or more attribute values. Sometimes the information was simply not collected completely. Whatever the reason, missing values must be taken into account during data analysis. There are several ways to deal with them:
- Delete data objects or attributes
- Estimate the missing values
- Ignore missing values during analysis
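The three strategies above can be sketched in plain Python. The records, the field names `age` and `income`, and the choice of mean imputation as the estimation method are all assumptions made for illustration; the text does not prescribe a particular estimator.

```python
from statistics import mean

# Hypothetical records; None marks a missing attribute value.
records = [
    {"age": 25, "income": 3200.0},
    {"age": None, "income": 4100.0},
    {"age": 41, "income": None},
    {"age": 33, "income": 3900.0},
]

# Strategy 1: delete data objects that have any missing value.
complete = [r for r in records if all(v is not None for v in r.values())]


# Strategy 2: estimate missing values (here: fill with the attribute mean).
def impute_mean(rows, key):
    known = [r[key] for r in rows if r[key] is not None]
    fill = mean(known)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]


imputed = impute_mean(impute_mean(records, "age"), "income")


# Strategy 3: ignore missing values during the analysis itself.
def mean_ignoring_missing(rows, key):
    return mean(r[key] for r in rows if r[key] is not None)


print(len(complete))                          # objects left after deletion
print(imputed[1]["age"])                      # estimated age for the second object
print(mean_ignoring_missing(records, "age"))  # mean over known ages only
```

Deletion is the simplest option but discards half of this small dataset, while estimation and ignoring keep every object at the cost of some distortion or extra care in each analysis step.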
Inconsistent values
The data may contain inconsistent values, for example an account number or password that was mistyped during entry. Whatever the cause of the inconsistency, it is important to detect such values and, where possible, correct them.
Duplicate data
A dataset may contain duplicate data objects. Duplicates are usually detected and deleted, but two issues must be handled first. First, if two objects actually represent the same object, their corresponding attribute values may differ, and these inconsistent values must be resolved. Second, care must be taken not to merge two data objects that are similar but not duplicates. The term deduplication is commonly used for this process.
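The two deduplication issues can be illustrated with a small sketch. The customer records, the matching key (normalized name plus email), and the rule for resolving inconsistent values (keep the first non-empty value) are hypothetical choices for this example; real matching and merging rules depend on the data.

```python
def matching_key(record):
    """Hypothetical rule for deciding that two records are the same object:
    same name and email, ignoring case and extra whitespace."""
    name = " ".join(record["name"].lower().split())
    email = record["email"].strip().lower()
    return (name, email)


def deduplicate(records):
    merged = {}
    for rec in records:
        key = matching_key(rec)
        if key not in merged:
            merged[key] = dict(rec)
            continue
        kept = merged[key]
        for field, value in rec.items():
            # Resolve inconsistent values: keep the first non-empty value.
            if not kept.get(field) and value:
                kept[field] = value
    return list(merged.values())


customers = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
    {"name": "ann  lee ", "email": "Ann@Example.com", "phone": "555-0100"},
    # Similar name but a different email: treated as a different object.
    {"name": "Ann Lee", "email": "ann.lee@example.org", "phone": "555-0199"},
]

unique = deduplicate(customers)
for row in unique:
    print(row)
```

The first two records are merged and the missing phone number is filled in from the duplicate, while the third record is kept separate because it only resembles the others.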