Kaggle Data Mining Competition Preliminary -- Titanic <Data Transformation>

Source: Internet
Author: User


Full code: https://github.com/cindycindyhi/kaggle-Titanic

Feature Engineering Series:

Raw data analysis and data processing in the Titanic Series

Data Transformation of Titanic Series

Derived attributes & Dimension Reduction of Titanic Series

After filling in the missing values, we still need to process attributes in other formats. For example, the values of the Sex and Embarked attributes are strings, while the models in scikit-learn can only handle numeric data, so the original string values must be converted into numbers. All data can be divided into two types: quantitative and qualitative. Quantitative (numeric) attributes are measurable and can be ordered; in the Titanic dataset, Age is a quantitative attribute. The values of a qualitative attribute (nominal, ordinal, or binary) are names of symbols or things; each value represents a category, code, or state. Such a value is not a measurable quantity and has no inherent ordering, for example Embarked (port of embarkation).

Data Transformation of a qualitative attribute

For string-valued qualitative attributes, simply replacing the values with numbers, for example mapping the three Embarked values S, Q, and C to 1, 2, and 3, makes the model treat the attribute as an ordered numeric one. Algorithms that classify based on distance will then draw inaccurate conclusions from this spurious ordering. So how should qualitative attributes be converted into numbers?
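To see the problem concretely, here is a minimal sketch (the integer codes, the one-hot tuples, and the sq_dist helper are illustrative assumptions, not part of the original code): integer codes give the ports a spurious ordering, while one-hot vectors keep all distinct ports equidistant.

```python
# Integer-coding a nominal attribute invents an ordering.
label_codes = {'S': 1, 'Q': 2, 'C': 3}

# |S - C| = 2 but |S - Q| = 1, so a distance-based algorithm would
# treat S as "closer" to Q than to C, which is meaningless for ports.
assert abs(label_codes['S'] - label_codes['C']) == 2
assert abs(label_codes['S'] - label_codes['Q']) == 1

# With one-hot (dummy) encoding, every pair of distinct ports is
# equidistant, which is what a nominal attribute requires.
one_hot = {'S': (1, 0, 0), 'Q': (0, 1, 0), 'C': (0, 0, 1)}

def sq_dist(a, b):
    # squared Euclidean distance between two encoded values
    return sum((x - y) ** 2 for x, y in zip(a, b))

assert sq_dist(one_hot['S'], one_hot['Q']) == sq_dist(one_hot['S'], one_hot['C'])
```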

(1) Dummy variables (virtual attributes?)

What are dummy variables? For example, the Embarked attribute has three values, S, Q, and C, representing three ports of embarkation. Dummy encoding adds three new attributes to the dataset, named Embarked_S, Embarked_Q, and Embarked_C. If a passenger boarded at S, these three attributes take the values (1, 0, 0); if at Q, they take (0, 1, 0). Each new attribute is binary: 1 means yes and 0 means no. Dummy encoding therefore suits attributes with a relatively small set of values.

    import pandas as pd
    # create dummy variables from the raw data
    dummies_df = pd.get_dummies(df.Embarked)
    # rename the columns to Embarked_S...
    dummies_df = dummies_df.rename(columns=lambda x: 'Embarked_' + str(x))
    df = pd.concat([df, dummies_df], axis=1)

In this way, three dummy attributes are added to the dataset; use df.info() to check them.
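As a sanity check, here is the same recipe run on a toy frame (not the real Titanic file; the four-row Embarked column is an illustrative assumption) so the resulting column names are visible:

```python
import pandas as pd

# Toy data standing in for the Titanic dataset.
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Same steps as above: one-hot encode, rename, and attach the columns.
dummies_df = pd.get_dummies(df.Embarked)
dummies_df = dummies_df.rename(columns=lambda x: 'Embarked_' + str(x))
df = pd.concat([df, dummies_df], axis=1)

print(list(df.columns))
# -> ['Embarked', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
```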

(2) Factorizing

Dummy encoding handles nominal attributes with a small value range, such as Embarked, but it is awkward for a nominal attribute like Cabin (cabin number, e.g. A43, B55). Pandas provides a factorize() function that maps each string value of a nominal attribute to a number, mapping equal strings to the same number. Unlike dummy encoding, this mapping produces only one attribute. For the Cabin attribute, we can split each value into two parts, letter + number, and create two attributes. The letter part (A-E & U) can be turned into numbers with factorize().

    import re
    df['CabinLetter'] = df['Cabin'].map( lambda x: re.compile("([a-zA-Z]+)").\
                        search(x).group() )
    df['CabinLetter'] = pd.factorize(df.CabinLetter)[0]

In the previous step we only extracted the letter part of the cabin number into a new attribute; the numeric part should also be extracted as a new attribute.

    import re

    def getCabinNumber(cabin):
        match = re.compile("([0-9]+)").search(cabin)
        if match:
            return match.group()
        else:
            return 0

    # plus one for the Laplace assumption
    df['CabinNumber'] = df['Cabin'].map( lambda x: getCabinNumber(x) ).\
                        astype(int) + 1

Data Transformation of Quantitative Attributes

(1) Data Standardization

Data normalization compresses data into a specific range (usually [0, 1] or [-1, 1]) so that all attributes carry equal weight. Normalization is particularly useful for classification algorithms involving neural networks, and for classification and clustering based on distance measures. There are many normalization methods, such as rescaling and logarithmic normalization, and implementations of them are easy to find. In some cases, however, normalization is not required. For example, when an algorithm uses a similarity function instead of a distance function, as a random forest does, it never compares one feature with another, so normalization is unnecessary; for more information see www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html.
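As an illustration of one such method, here is a minimal sketch of min-max rescaling to [0, 1]; the rescale() helper and the toy ages are assumptions of this sketch, not code from the original post.

```python
def rescale(values):
    # Min-max rescaling: map the smallest value to 0 and the largest to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative ages; any numeric attribute could be rescaled this way.
ages = [22.0, 38.0, 26.0, 35.0]
scaled = rescale(ages)

# After rescaling, every value lies in [0, 1].
assert min(scaled) == 0.0 and max(scaled) == 1.0
```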

If you normalize the Age attribute (check which classification algorithm will be used later to decide whether to normalize; if you do normalize it, process the other attributes the same way), the code is as follows:

    from sklearn import preprocessing

    if keep_scaled:
        scaler = preprocessing.StandardScaler()
        # newer scikit-learn versions expect a 2D input, hence df[['Age']]
        df['Age_Scaled'] = scaler.fit_transform(df[['Age']])

Note that StandardScaler does not compress values into [-1, 1]; it standardizes them to zero mean and unit variance, computing z = (x - mean(x)) / std(x). Compressing into [-1, 1] via (2x - max(x) - min(x)) / (max(x) - min(x)) is a different method, min-max rescaling.
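Here is a minimal pure-Python sketch of the standardization that StandardScaler performs, z = (x - mean) / std, using the population standard deviation (ddof = 0) as scikit-learn does; the toy ages are illustrative assumptions.

```python
import math

def standardize(values):
    # z-score standardization: subtract the mean, divide by the std.
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

ages = [22.0, 38.0, 26.0, 35.0]
z = standardize(ages)

# The standardized values have (approximately) zero mean and unit variance.
assert abs(sum(z) / len(z)) < 1e-9
assert abs(sum(v * v for v in z) / len(z) - 1.0) < 1e-9
```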

(2) Binning

Just as a histogram's bins divide data into several groups, we can divide a numeric attribute into several bins; this is a method of discretizing continuous data. We use the pandas.qcut() function, which splits the data at quantiles so that the resulting bins have equal sizes. The following uses Fare as an example; other continuous attributes such as Age and SibSp can be binned the same way.

    import numpy as np

    def processFare():
        global df
        df['Fare'][df.Fare.isnull()] = df.Fare.dropna().mean()
        # replace zero values -- Laplace
        df['Fare'][np.where(df['Fare']==0)[0]] = df['Fare'][df.Fare.\
                            nonzero()[0] ].min() / 10
        df['Fare_bin'] = pd.qcut(df.Fare, 4)

The resulting df['Fare_bin'] values look like this:

    0     [0.401, 7.91]
    1    (31, 512.329]
    2    (7.91, 14.454]
    3    (31, 512.329]
    4    (7.91, 14.454]
    5    (7.91, 14.454]
After binning, the attribute values are intervals indicating which bin each record falls into. To use them in a model, we need to convert these intervals back into numeric data, again with factorize().

    from sklearn import preprocessing

    df['Fare_bin_id'] = pd.factorize(df.Fare_bin)[0] + 1
    scaler = preprocessing.StandardScaler()
    df['Fare_bin_id_scaled'] = scaler.fit_transform(df[['Fare_bin_id']])

 
