[Machine Learning] Data preprocessing: converting data of different types into numerical values
Before performing data analysis in Python, you must first preprocess the data. Sometimes the data is non-numeric, and what I want to talk about today is how to handle that kind of data.
Three methods are available; a short sketch of the first two follows the list:
1. Use LabelEncoder for fast conversion;
2. Use a mapping to map each category to a value, although this method has limited applicability;
3. Use t…
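A minimal sketch of the first two methods, assuming a made-up "size" column for illustration:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# hypothetical non-numeric category column
df = pd.DataFrame({'size': ['S', 'M', 'L', 'M', 'S']})

# Method 1: LabelEncoder assigns an integer to each distinct category
le = LabelEncoder()
df['size_encoded'] = le.fit_transform(df['size'])

# Method 2: an explicit mapping; only works when every category is known in advance
size_mapping = {'S': 1, 'M': 2, 'L': 3}
df['size_mapped'] = df['size'].map(size_mapping)

print(df)

Note that LabelEncoder orders the labels alphabetically, while the mapping lets you control the numeric order yourself, which is why method 2, despite its limited applicability, is handy for ordinal categories.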
import numpy as np
import pandas as pd

# 'dates' comes from the earlier examples in this series; a 6-day range is assumed here
dates = pd.date_range('20130101', periods=6)
data = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(data)
# print the rows where column A is greater than 0
print(data[data.A > 0])
# print the values greater than 0; values less than or equal to 0 are filled with NaN
print(data[data > 0])
# copy data
data2 = data.copy()
print(data2)
tag = ['A'] * 2 + ['B'] * 2 + ['C'] * 2
# add a TAG column to data2 and assign tag to it
data2['TAG'] = tag
print(data2)
# print the rows whose TAG is 'A' or 'C'
print(data2[data2.TAG.isin(['A', 'C'])])
8. Some DataFrame operations (6)
import numpy as np
import pandas as pd
for each in concil_set:
    if each in ans_attend_set:
        concil_attend_set.add(each)
    elif each in ans_notatt_set:
        concil_notatt_set.add(each)
    else:
        concil_notans_set.add(each)

# 3. Display the result
def disp(ss, cap, num=True):
    # ss: the list/set to display
    # cap: the opening caption
    print(cap, '({})'.format(len(ss)))
    for i in range(np.ceil(len(ss) / 5).astype(int)):
        pre = i * 5
        nex = (i + 1) * 5
        # adjust the display format
        dd = ''
        for each in list(ss)[pre:nex]:
            if len(each) == 2:
                dd = dd + '  ' + each
            elif len(each) == 3:
                dd = dd + ' ' + each
names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
Use the zip function to merge the two lists together.
# Check the zip function's help
zip?
BabyDataSet = list(zip(names, births))
BabyDataSet
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]
We have completed the creation of a basic dataset. We now use Pandas to export this data to a CSV file.
df is a DataFrame object.
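A minimal sketch of that export; the column names and the output file name are illustrative choices, not taken from the original:

import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
BabyDataSet = list(zip(names, births))

# build the DataFrame from the list of tuples, then write it out as CSV
df = pd.DataFrame(data=BabyDataSet, columns=['Names', 'Births'])
df.to_csv('births.csv', index=False, header=False)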
Data import:
Import CSV:
from pandas import read_csv
df = read_csv('D://pa//4.1//1.csv')
Import text (convert it to UTF-8 without BOM format first):
from pandas import read_table
df = read_table('D://pa//4.1//2.txt')
Import Excel:
from pandas import read_excel
df = read_excel('c:/pa/4.1/3.xlsx')
Data export:
from pandas import DataFrame
Machine learning: Predicting Google stock using Scikit-learn's linear regression
This is the first article in the Machine Learning series. It uses Python and scikit-learn's linear regression to predict Google's stock trend. Please do not expect this example to make you a stock master. Here's how to do it step by step.

Preparing data

The data used in this article comes from the www.quandl.com site. Using the appropriate Python quandl library, you can get the data we want with a few simple lines of code.
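A minimal sketch of that data pull; the "WIKI/GOOGL" dataset code is an assumption (quandl's free WIKI tables were a common choice for tutorials like this), not confirmed by the original text:

import quandl

# fetch daily Google/Alphabet price history; the dataset code is assumed
df = quandl.get("WIKI/GOOGL")
print(df.head())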
Generally, Unix administrators have a set of common tools, techniques, and systems for assisting process management. This article provides a variety of key utilities, command-line chains, and scripts used to simplify each process. Some of these tools come from the operating system, and most of the skills come from long-term experience and the need to reduce system administrators' workload. This series of articles focuses on maximizing the use of tools available in a variety of UNIX environments.
Python pandas common functions
This article focuses on pandas common functions.
1. Import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import re

2. File reading
df = pd.read_csv(path + 'file.csv')
Parameters:
header = None: use the default column names (0, 1, 2, 3...)
names = ['A', 'B', 'C'...]: custom column names
index_col = 'A' | ['A', 'B'...]: specifies the column(s) to use as the index; giving multiple columns creates a MultiIndex
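Putting those parameters together, a minimal sketch (the file name and column names are illustrative):

import pandas as pd

# headerless CSV, custom column names, column 'A' as the index
df = pd.read_csv('file.csv', header=None, names=['A', 'B', 'C'], index_col='A')
print(df.head())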
Create a tbs_usage table on the data host to record how much of each tablespace's data files is used. tbs_timeid is the primary key of the table and uniquely identifies a database tablespace on the current day; it is formed as df.tablespace_name || '-' || (sysdate).
1. The pansky user is responsible for daily management and is currently mainly used to monitor tablespace data volume.
SQL> create user pansky identified by ...
[Data cleansing] Clean "dirty" data in Pandas (3)
Preview data
This time, we use Artworks.csv and select 100 rows of data to work through this content. Procedure:
DataFrame is the built-in tabular data structure of Pandas, and it displays very quickly. With a DataFrame, we can quickly preview and analyze data. The code is as follows:
import pandas as pd
df = pd.read_csv('../data/Artworks.csv').head(100)
df.head(10)
Statistics on the date data
Let's take a closer look at the date data.
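The original snippet is cut off here. As a minimal sketch, assuming the 'Date' column of Artworks.csv is the target (which the heading suggests but the original does not confirm), a first look could be:

import pandas as pd

df = pd.read_csv('../data/Artworks.csv').head(100)
# preview the raw date values before cleaning them
print(df['Date'].head(10))
# count how many rows share each distinct date value
print(df['Date'].value_counts().head())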
Most students who do data analysis start with Excel, the most highly rated tool in the Microsoft Office series. But when the amount of data is very large, Excel is powerless; the Python third-party package pandas greatly extends Excel's functionality. Getting started takes a little time, but it really is an essential tool for big data!

1. Read data from a file

Pandas supports reading data in multiple formats; the most common are Excel files, CSV files, and txt files.
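A minimal sketch of reading each of those formats (the file names and sheet name are illustrative):

import pandas as pd

# Excel: requires an engine such as openpyxl to be installed
df_xlsx = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# CSV
df_csv = pd.read_csv('data.csv')
# tab-separated txt
df_txt = pd.read_csv('data.txt', sep='\t')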
Basic operations:
Get the Spark version number (Spark 2.0.0, for example) at run time:

from pyspark.sql import SparkSession

sparksn = SparkSession.builder.appName("Pythonsql").getOrCreate()
print(sparksn.version)
Create and convert formats:
The DataFrames of Pandas and Spark can be converted to each other:

pandas_df = spark_df.toPandas()
spark_df = sqlContext.createDataFrame(pandas_df)

Convert to a Spark RDD:

rdd_df = df.rdd
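A self-contained sketch of the round trip; in Spark 2.x the SparkSession can stand in for the sqlContext shown above, and the names here are illustrative:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("ConversionDemo").getOrCreate()

# pandas -> Spark DataFrame
pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
spark_df = spark.createDataFrame(pandas_df)

# Spark DataFrame -> pandas (collects to the driver, so keep it small)
pandas_df2 = spark_df.toPandas()

# Spark DataFrame -> RDD of Row objects
rdd_df = spark_df.rdd
print(rdd_df.collect())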