I. Issues related to Data
- The quality of the data
- Data preprocessing to make data more suitable for analysis
- Analyzing data in terms of its relationships: find links between data objects and use those relationships for the rest of the analysis
II. Terminology
- Data set: a collection of data objects
- Attribute: a property or characteristic of an object
- Measurement scale: a rule that associates a numeric or symbolic value with an attribute of an object
Characteristics of the data set
- Dimensionality: the number of attributes the objects in the data set have
- Sparsity: only a small proportion of the entries are non-zero, so storing only the non-zero entries saves time and space.
- Resolution: the scale at which the data is recorded affects its properties and the patterns that can be found in it
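The sparsity point can be illustrated with a minimal sketch. It assumes a row of data is a plain Python list and stores only the non-zero entries in a dictionary keyed by position; the helper names to_sparse and sparse_dot are hypothetical and chosen only for this example.

```python
def to_sparse(row):
    """Keep only the non-zero entries of a dense row as {index: value}."""
    return {i: v for i, v in enumerate(row) if v != 0}


def sparse_dot(a, b):
    """Dot product of two sparse rows; only indices present in both contribute."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter row
    return sum(v * b[i] for i, v in a.items() if i in b)


dense_a = [0, 0, 3, 0, 0, 0, 1, 0]
dense_b = [0, 2, 4, 0, 0, 0, 0, 0]

sparse_a = to_sparse(dense_a)  # two stored entries instead of eight
sparse_b = to_sparse(dense_b)
product = sparse_dot(sparse_a, sparse_b)  # only index 2 overlaps
```

The work done by sparse_dot grows with the number of non-zero entries rather than with the full dimensionality, which is where the time saving comes from.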
Data cleansing: remove unrealistic or duplicate objects (for example, a person recorded with a height of 2 m and a weight of 2 kg)
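As a rough sketch of such a cleansing step, the code below drops exact duplicates and records whose height and weight are implausible together. The field names and the body-mass-index range used as the plausibility test are assumptions made for this example, not part of the notes.

```python
def is_plausible(record, bmi_range=(10.0, 80.0)):
    """Assumed plausibility test: weight / height^2 must fall in bmi_range."""
    if record["height_m"] <= 0:
        return False
    bmi = record["weight_kg"] / record["height_m"] ** 2
    return bmi_range[0] <= bmi <= bmi_range[1]


def clean(records):
    """Drop exact duplicates and implausible records, keeping original order."""
    seen = set()
    cleaned = []
    for r in records:
        key = (r["name"], r["height_m"], r["weight_kg"])
        if key in seen or not is_plausible(r):
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned


records = [
    {"name": "A", "height_m": 1.70, "weight_kg": 65.0},
    {"name": "B", "height_m": 2.00, "weight_kg": 2.0},   # implausible pair
    {"name": "A", "height_m": 1.70, "weight_kg": 65.0},  # exact duplicate
    {"name": "C", "height_m": 1.60, "weight_kg": 50.0},
]
cleaned = clean(records)
```

In practice the plausibility rule depends on the domain; the point is only that the check looks at attribute values jointly, since 2 m and 2 kg are each possible on their own.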
Issues related to measurement error:
Noise, artifacts, bias, precision, accuracy
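Bias and precision can be made concrete with a small sketch. It assumes one common convention: bias is the difference between the mean of repeated measurements and the known true value, and precision is the standard deviation of those measurements. The measurement values are made up for illustration.

```python
import statistics

# Hypothetical repeated measurements of a reference mass known to be 1.000 g.
true_value = 1.000
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]

# Bias: systematic offset of the measurements from the true value.
bias = statistics.mean(measurements) - true_value

# Precision: spread of the repeated measurements (sample standard deviation).
precision = statistics.stdev(measurements)

print(f"bias = {bias:.3f} g, precision = {precision:.3f} g")
```

Accuracy, the closeness of measurements to the true value, depends on both quantities: a measuring device can be precise but biased, or unbiased but imprecise.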
Issues related to data quality:
Outliers, missing values, inconsistent values, duplicate data
Data collection errors: omitting data objects, or incorrectly including data objects, such as objects that are similar to the intended data but do not belong in it
Outliers: objects that differ markedly from most other objects in the data set
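One simple way to flag such objects for a single numeric attribute is the interquartile-range rule, sketched below. The notes do not prescribe a detection method; the 1.5 multiplier is a common heuristic, and the data values are invented for the example.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (a heuristic rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


readings = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(readings)
```

A flagged value is not automatically an error; an outlier can be a legitimate and interesting object, so whether to remove it is a separate decision.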
Missing values: an object lacks one or more attribute values (for example, someone is unwilling to reveal their name or age)
Aggregation: merging two or more objects into a single object (for example, table 1 holds student ID and name, table 2 holds student ID and score; after aggregation they become one table with student ID, name, and score)
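The table example can be sketched with plain dictionaries keyed by student ID. The IDs, names, and scores are placeholders, and keeping only IDs present in both tables is an assumption of this sketch.

```python
# Table 1: student ID -> name (placeholder data)
names = {"S001": "Li", "S002": "Wang", "S003": "Zhao"}

# Table 2: student ID -> score (placeholder data)
scores = {"S001": 88, "S002": 92}

# Combine the two tables into one, one row per student ID found in both.
combined = [
    {"student_id": sid, "name": names[sid], "score": scores[sid]}
    for sid in sorted(names.keys() & scores.keys())
]

for row in combined:
    print(row["student_id"], row["name"], row["score"])
```

Student S003 has no score and is dropped here; a different policy, such as keeping the row with a missing score, would be equally reasonable depending on the analysis.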
Sampling: selecting a subset of the data objects for analysis. In data mining, sampling is used to reduce the time and cost of processing the data.
Principle of effective sampling: the more representative the sample, the closer the results obtained from it are to the results obtained from the entire data set
Sampling methods: simple random sampling (with replacement or without replacement), stratified sampling (for populations made up of different types of objects whose counts vary greatly)
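The three methods can be sketched with the standard random module. The population, the group labels, and the choice of drawing an equal number of objects from each stratum are assumptions for this example; drawing in proportion to stratum size is another common variant.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 95 objects of a common type, 5 of a rare type.
population = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]

# Simple random sampling without replacement: no object is picked twice.
without_replacement = rng.sample(population, k=10)

# Simple random sampling with replacement: the same object may be picked again.
with_replacement = rng.choices(population, k=10)


def stratified_sample(items, key, per_stratum, rng):
    """Draw up to per_stratum objects, without replacement, from each group."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


stratified = stratified_sample(population, key=lambda obj: obj[0], per_stratum=3, rng=rng)
```

With simple random sampling, a sample of 10 can easily contain no rare objects at all, whereas the stratified sample is guaranteed to include them, which is why stratification suits populations whose types differ greatly in size.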
Data Mining (i)