Data acquisition and preprocessing

Source: Internet
Author: User

Copyright Note: Content from the internet and books




First, data collection methods

1. Data sources

Primary data: data from a direct source.
Secondary (second-hand) data: data from an indirect source.
Internal sources: for example financial statements, including accounting calculations, cost information, etc.
External sources: for example statistical yearbooks and data published by survey agencies.



2. Sampling methods

1. Classification




By how the sample is drawn, sampling is divided into sampling with replacement and sampling without replacement.
2. Non-probability sampling
3. Classification of probability sampling methods

Probability sampling is mainly divided into simple random sampling, systematic sampling, stratified sampling, cluster sampling, and multi-stage sampling. Most sample surveys in practice use probability sampling methods to draw their samples.
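
As a rough illustration of three of the methods named above, the following sketch uses only Python's standard `random` module. The function names, the toy population, and the choice of "even/odd" as strata are assumptions made for this example, not part of the original notes.

```python
import random

def simple_random_sample(population, n, replace=False, seed=None):
    """Draw n items, with replacement if replace=True, otherwise without."""
    rng = random.Random(seed)
    if replace:
        return [rng.choice(population) for _ in range(n)]
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Pick a random start, then take every k-th item, where k = N // n."""
    rng = random.Random(seed)
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(population, key, fraction, seed=None):
    """Draw the same fraction from every stratum defined by key(item)."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

# Toy population of 100 numbered units (hypothetical data).
population = list(range(100))
srs = simple_random_sample(population, 10, seed=1)
sys_s = systematic_sample(population, 10, seed=1)
# Strata here are simply "even" and "odd" units, chosen only for illustration.
strat = stratified_sample(population, key=lambda x: x % 2, fraction=0.1, seed=1)
```

Passing `replace=True` to `simple_random_sample` switches from sampling without replacement to sampling with replacement, so the same unit may appear more than once.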







4. Sample size for simple random sampling
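
The original notes do not preserve the formula used here. One commonly used approach, shown as a sketch under that caveat, estimates the sample size for a proportion as n0 = z^2 * p * (1 - p) / e^2 and optionally applies a finite population correction. The function name and default values (95% confidence, p = 0.5) are assumptions for this example.

```python
import math

def sample_size_for_proportion(margin, z=1.96, p=0.5, population=None):
    """Sample size for estimating a proportion under simple random sampling.

    margin: acceptable margin of error e (e.g. 0.05)
    z: z-value for the confidence level (1.96 is about 95%)
    p: expected proportion (0.5 is the most conservative choice)
    population: population size N; if given, apply the finite population correction
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

n_large = sample_size_for_proportion(0.05)                   # 385
n_small = sample_size_for_proportion(0.05, population=1000)  # 278
```

The required sample shrinks noticeably when the population itself is small, which is why the correction term is worth applying in that case.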


Second, market research

Market research: the process of planning, collecting, and analyzing data related to marketing decisions and communicating the results to managers.

1. The market research process
   1. Define the marketing research objectives (identify the problem and what information is needed)
   2. Create the research design (descriptive research or causal research)
   3. Select the research method (survey, experiment, observation)
   4. Select a sampling method
   5. Collect the data
   6. Analyze the data
   7. Write and submit the report
   8. Follow up

2. Questionnaires

1. Questionnaire classification
(1) Structured questionnaires: closed questionnaires, open questionnaires, semi-closed questionnaires
(2) Unstructured questionnaires: no form, question wording, or standardized answer options are prepared in advance; questions are asked only around the survey topic and content, and respondents answer freely.

2. Closed and open questionnaires

3. Questionnaire design principles: these include the principle of purpose, the principle of order, and the principle of acceptability.
4. Question types
5. Questionnaire coding

Questionnaire items can be divided into four types: single-choice, multiple-choice, ranking, and open-ended. Their variables are defined and processed differently, as the following examples show:
1  Single-choice questions: only one option can be chosen
   Example 1: Does your organization currently have an organization-oriented career planning system?
   a. Yes    b. Being set up    c. No    d. Had one, but it was discontinued
Encoding: Define a single variable whose values 1, 2, 3, 4 represent options a, b, c, d respectively.
Input: Enter the value corresponding to the chosen option; for example, enter 3 if c was chosen.

2  Multiple-choice questions: more than one option can be chosen; the number of selections may be fixed or indefinite.
   (1) Method one (dichotomy):
   Example 2: Which groups does your career planning system apply to? Please tick all that apply.
   a. Monthly-paid employees    b. Daily-paid employees    c. Hourly-paid employees
Encoding: Define one variable per option; each variable takes the value "0" (not selected) or "1" (selected).
Input: Enter 1 for each option the respondent selected and 0 for each option not selected; if the respondent selected a and c, the three variables are entered as 1, 0, 1.
   (2) Method two:
   Example 3: Which three items do you think are the most important goals of the education campaign on maintaining the advanced nature of Party members?
   1 (   )     2 (   )     3 (   )
   A. Improving the quality of Party members    B. Strengthening grassroots organizations    C. Upholding democracy
   D. Stimulating entrepreneurial enthusiasm    E. Serving the people    F. Promoting all areas of work
Encoding: Define three variables, one for each of the brackets 1, 2, 3 in the question. The three variables share the same value labels, defined by the options: "1" A, "2" B, "3" C, "4" D, "5" E, "6" F.
Input: Enter the values 1, 2, 3, 4, 5, 6 for options A, B, C, D, E, F, placing each in the variable for the corresponding bracket. If the respondent filled the three brackets with A, C, F, the three variables are entered as 1, 3, 6.
[Note: Multiple-choice questions with a fixed number of selections can be coded with method two or with the dichotomy, but questions with an indefinite number of selections can only be coded with the dichotomy; that is, method one is the general method for multiple-choice questions.]
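
The two coding methods for multiple-choice questions can be sketched as follows. The function and constant names are hypothetical; the inputs and results reproduce the worked answers from Example 2 and Example 3.

```python
OPTIONS_EX2 = ["a", "b", "c"]
OPTIONS_EX3 = ["A", "B", "C", "D", "E", "F"]

def encode_dichotomy(selected, options):
    """Method one: one variable per option, 1 = selected, 0 = not selected."""
    return [1 if opt in selected else 0 for opt in options]

def encode_fixed_choices(selected, options):
    """Method two: one variable per bracket, holding the 1-based option code."""
    return [options.index(opt) + 1 for opt in selected]

ex2 = encode_dichotomy({"a", "c"}, OPTIONS_EX2)            # [1, 0, 1]
ex3 = encode_fixed_choices(["A", "C", "F"], OPTIONS_EX3)   # [1, 3, 6]
```

Method one always produces as many variables as there are options, which is why it also works when the number of selections is not fixed.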

3  Ranking questions: rank the options by importance
   Example 4: When buying a product, you consider ① brand ② popularity ③ quality ④ practicality ⑤ price. Your order of priority among them is (please fill in the codes in order):
   First ___    Second ___    Third ___    Fourth ___    Fifth ___
Encoding: Define five variables representing the first through fifth positions; each variable's values are defined as: "1" brand, "2" popularity, "3" quality, "4" practicality, "5" price.
Input: Enter 1, 2, 3, 4, 5 for the five options; for example, if the respondent ranks quality first, enter "3" in the variable for the first position.


4  Select-and-rank questions:
   Example 5: The question in Example 3 is changed to "Which three items do you think are the most important goals of the education campaign on maintaining the advanced nature of Party members? Rank them from most to least important"; the options are unchanged.
Encoding: Define 6 variables, one for each of the six options A to F; each variable's values are defined as: "1" not selected, "2" ranked first, "3" ranked second, "4" ranked third.
Input: Enter the value of each variable. For example, if the three brackets are filled with E, C, F in that order, the six variables are entered as: 1 (option A not selected), 1, 3 (option C ranked second), 1, 2, 4.
[Note: This method handles questions that combine multiple choice with ranking. It also applies to ordinary ranking questions (Example 4), but the two approaches are analyzed differently (Example 4 with frequency analysis, Example 5 with descriptive analysis), and their outputs reflect the importance of the items from different angles: the former looks at the frequencies within each rank position, the latter looks at the ranking through each option's variable.]
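
A minimal sketch of the select-and-rank coding, assuming the respondent's answer is given as a list of chosen options in rank order. The function name is hypothetical; the result reproduces the E, C, F example.

```python
def encode_select_and_rank(ranked, options):
    """One variable per option: 1 = not selected, 2 = ranked first,
    3 = ranked second, 4 = ranked third (and so on)."""
    codes = {opt: 1 for opt in options}
    for position, opt in enumerate(ranked):
        codes[opt] = position + 2
    return [codes[opt] for opt in options]

ex5 = encode_select_and_rank(["E", "C", "F"], list("ABCDEF"))  # [1, 1, 3, 1, 2, 4]
```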

5  Open-ended numerical questions: respondents fill in a value or a score themselves
   Example 6: Your age (actual age): ___
Encoding: Define one variable with no value labels.
Input: Enter the actual value given by the respondent.

6  Open-ended text questions:
   Where possible, group answers with similar meanings, assign codes to the groups, and convert them into closed options for analysis. If the answers are too rich to classify, analyze the answers directly.


Third, data preprocessing methods

1. Basic steps of data preprocessing

(1) Data cleaning: removing noise and irrelevant data

(2) Data integration: combining data from multiple data sources into a consistent data store

(3) Data transformation: transforming raw data into a form suitable for data mining

(4) Data reduction: the main methods include data cube aggregation, dimensionality reduction, data compression, numerosity reduction, discretization, and concept hierarchy generation.


2. Data Cleansing

(1) Handling missing values
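
The original notes do not list specific techniques here. Two widely used options, shown only as an illustrative sketch, are to drop records with missing values or to fill them with the mean or median of the observed values. Representing a missing value as `None`, and the function names, are assumptions for this example.

```python
from statistics import mean, median

def drop_missing(values):
    """Remove missing entries (represented here as None)."""
    return [v for v in values if v is not None]

def fill_missing(values, strategy="mean"):
    """Replace missing entries with the mean or median of the observed values."""
    observed = drop_missing(values)
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

ages = [23, None, 31, 27, None, 39]            # hypothetical survey answers
dropped = drop_missing(ages)                   # [23, 31, 27, 39]
filled_mean = fill_missing(ages)               # missing values become 30
filled_median = fill_missing(ages, "median")   # missing values become 29.0
```

Dropping is simplest but loses records; filling keeps the sample size at the cost of understating the variable's spread.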


(2) Smoothing noisy values

Binning: smooths the values of ordered data by examining neighboring values

Clustering
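
Binning can be sketched as follows, assuming equal-frequency bins and smoothing by bin means (one of several possible variants; smoothing by bin medians or bin boundaries works similarly). The function name and the price data are made up for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        current_bin = ordered[i:i + bin_size]
        bin_mean = sum(current_bin) / len(current_bin)
        smoothed.extend([bin_mean] * len(current_bin))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
# Bins [4, 8, 15], [21, 21, 24], [25, 28, 34] have means 9, 22, 29.
```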

(3) Identifying and handling outliers

Outlier: an individual value in a sample that deviates markedly from the remaining observations of the sample it belongs to; also called an anomalous value.

Identification method:








Handling method: capping
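
Capping replaces values beyond a lower or upper limit with the limit itself. The sketch below assumes the limits come from the interquartile range rule (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR); the original notes do not preserve which identification rule they used, so this choice, the function names, and the data are assumptions.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits based on the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap(values, lower, upper):
    """Pull every value outside [lower, upper] back to the nearest limit."""
    return [min(max(v, lower), upper) for v in values]

data = [10, 12, 11, 13, 12, 11, 95]   # 95 is an obvious outlier
lower, upper = iqr_bounds(data)       # (8.0, 16.0)
capped = cap(data, lower, upper)      # 95 is replaced by 16.0
```

Unlike deleting the record, capping keeps the observation in the sample while limiting its influence on means and variances.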


(4) Resolution of inconsistencies


3. Data integration

(1) Entity identification problem

Data analysis tasks usually involve data integration. Data integration merges data from multiple data sources and stores it in a consistent data store, such as a data warehouse. The sources may include multiple databases, data cubes, or flat files.

There are many issues to consider in data integration. Schema integration and object matching can be tricky. How can equivalent real-world entities from multiple data sources be "matched"? This is the entity identification problem. For example, how can a data analyst or a computer be sure that customer_id in one database and cust_number in another refer to the same attribute? The metadata for each attribute includes its name, meaning, data type, permitted range of values, and the rules for handling blank, zero, or null values. Such metadata can be used to help avoid errors in schema integration. Metadata can also be used to transform data (for example, the encoding of pay_type may be "H" and "S" in one database, and 1 and 2 in another). This step is therefore also related to the data cleaning described earlier.

During integration, when attributes of one database are matched to attributes of another, particular attention must be paid to the structure of the data. The aim is to ensure that the functional dependencies and referential constraints in the source system match those in the target system. For example, in one system a discount may apply to the whole order, while in another it applies to each item in the order. If this is not discovered before integration, items in the target system may be discounted incorrectly.
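
As a small sketch of how metadata can drive entity matching and value recoding, the example below maps customer_id and cust_number to one key and converts both pay_type encodings to a single form. The metadata layout, the function names, the sample records, and the assumption that 1 corresponds to "H" and 2 to "S" are all hypothetical.

```python
# Hypothetical metadata: how each source names the customer key
# and how its pay_type codes map to a common encoding.
SOURCE_A = {"key": "customer_id", "pay_type_map": {"H": "H", "S": "S"}}
SOURCE_B = {"key": "cust_number", "pay_type_map": {1: "H", 2: "S"}}  # assumed mapping

def harmonize(record, meta):
    """Rename the key attribute and recode pay_type using the source metadata."""
    return {
        "customer_id": record[meta["key"]],
        "pay_type": meta["pay_type_map"][record["pay_type"]],
    }

def integrate(records_a, records_b):
    """Merge both sources into one store keyed by customer_id (first record wins)."""
    merged = {}
    for records, meta in ((records_a, SOURCE_A), (records_b, SOURCE_B)):
        for record in records:
            row = harmonize(record, meta)
            merged.setdefault(row["customer_id"], row)
    return merged

db_a = [{"customer_id": 7, "pay_type": "H"}]
db_b = [{"cust_number": 7, "pay_type": 1}, {"cust_number": 8, "pay_type": 2}]
store = integrate(db_a, db_b)
```

Customer 7 appears in both sources but is stored once, and customer 8's pay_type of 2 is stored as "S".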

(2) Redundancy and correlation analysis

Redundancy is another important issue in data integration. An attribute (for example, annual revenue) may be redundant if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancy in the resulting data set.

Some redundancy can be detected by correlation analysis. Given two attributes, such analysis measures how strongly one attribute implies the other, based on the available data. For nominal data, we use the χ² (chi-square) test. For numeric attributes, we use the correlation coefficient and the covariance, both of which assess how one attribute's values vary with those of another.
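
These measures can be sketched with plain Python as follows. The function names and the sample data are assumptions for illustration; the covariance here is the population form (dividing by n), and the chi-square function returns only the test statistic, not a p-value.

```python
from math import sqrt

def covariance(xs, ys):
    """Population covariance of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation coefficient, between -1 and 1."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

def chi_square(table):
    """Chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Annual revenue is exactly 12 times monthly revenue, so it is redundant.
monthly = [1, 2, 3, 4]
annual = [12, 24, 36, 48]
cov = covariance(monthly, annual)
r = correlation(monthly, annual)

# Hypothetical counts for two nominal attributes with two categories each.
stat = chi_square([[250, 200], [50, 1000]])
```

A correlation close to 1 or -1 suggests one numeric attribute can be dropped; a large chi-square statistic suggests two nominal attributes are not independent.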

(3) Tuple duplication

In addition to detecting redundancy between attributes, duplication should also be detected at the tuple level. The use of denormalized tables is another source of data redundancy. Inconsistencies often arise between the duplicates, due to inaccurate data entry or because some, but not all, occurrences of the data were updated.

4. Data transformation
(1) Generalization and aggregation of data
(2) Data normalization: min-max normalization, z-score normalization, decimal scaling

Min-max normalization

Also called range normalization, this is a linear transformation of the original data that maps the results into the interval [0, 1]. The conversion function is:

$$x^{*} = \frac{x - \min}{\max - \min}$$

where max is the maximum of the sample data and min is the minimum of the sample data.


Z-score normalization (zero-mean normalization)

Also known as standard deviation normalization. The processed data has mean 0 and standard deviation 1 (it follows the standard normal distribution only if the original data is normally distributed). The conversion function is:

$$x^{*} = \frac{x - \mu}{\sigma}$$

where μ is the mean of all sample data and σ is the standard deviation of all sample data.
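
The three normalization methods listed in this section can be sketched as follows. The function names and sample values are assumptions for illustration; `z_score` uses the population standard deviation, and none of the functions guard against constant input (which would divide by zero).

```python
from statistics import mean, pstdev

def min_max(values):
    """Min-max normalization: (x - min) / (max - min), results in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Z-score normalization: (x - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |x| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

scaled = min_max([10, 20, 30])          # [0.0, 0.5, 1.0]
standardized = z_score([10, 20, 30])    # mean 0, standard deviation 1
shifted = decimal_scaling([-986, 917])  # divides by 1000
```

Min-max normalization is sensitive to extreme values because they define the range, while z-score normalization is usually preferred when the minimum and maximum are unknown or unstable.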


5. Data reduction
(1) Dimensionality reduction: principal component analysis, variable cluster analysis
(2) Discretization (for continuous variables)
(3) Category regrouping (for discrete variables)
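
Discretization of a continuous variable can be illustrated with equal-width binning, which is only one of several possible schemes (equal-frequency or cluster-based bins are alternatives). The function name and the age data are assumptions for this sketch, and it expects at least two distinct values.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1,
    using bins of equal width between the minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 22, 25, 31, 40, 47, 58, 60]
age_groups = equal_width_bins(ages, 3)  # [0, 0, 0, 0, 1, 2, 2, 2]
```

With a range of 18 to 60 and three bins, each bin is 14 years wide, so ages up to 31 fall in the first group.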
