import numpy as np
import pandas as pd

# string common methods - strip
s = pd.Series(['Jack', 'Jill', 'Jease', 'Feank'])
df = pd.DataFrame(np.random.randn(3, 2), columns=['Column A', 'Column B'], index=range(3))
print(s)
print(df.columns)
print('----')
print(s.str.lstrip().values)   # remove whitespace on the left
print(s.str.rstrip().values)   # remove whitespace on the right
df.columns = df.columns.str.strip()
print(df.columns)
Result (truncated): 0 Jack 1
Capturing dividend data: some dividend records fit on a single page, but much of the data spans multiple pages, and the number of pages is not uniform. This crawl therefore has to solve two problems: first, splicing the data from different years together; second, determining the oldest available year so the crawler knows when to stop. Crawler program. Operating environment: Windows 10; Python 3; Sublime Text editor. (1) First, the program itself; the relevant explanations are in the code comments.
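The two problems above (splicing per-year tables and stopping at the oldest year) can be sketched as a loop that walks backwards through the years until a page request comes back empty. This is a minimal sketch, not the article's actual program: `fetch_year` is a hypothetical stand-in for the real page request, and the fake two-year data inside it exists only so the loop is runnable.

```python
import pandas as pd

def fetch_year(year):
    """Hypothetical stand-in for the real page request; returns None
    when the year is older than any available data (the stop signal)."""
    fake_site = {2019: [{"date": "2019-06-30", "dividend": 0.5}],
                 2018: [{"date": "2018-06-29", "dividend": 0.4}]}
    rows = fake_site.get(year)
    return pd.DataFrame(rows) if rows else None

def crawl_dividends(start_year):
    frames = []
    year = start_year
    while True:
        page = fetch_year(year)
        if page is None:          # oldest year reached: stop crawling
            break
        frames.append(page)
        year -= 1
    # splice the per-year tables into one DataFrame
    return pd.concat(frames, ignore_index=True)

result = crawl_dividends(2019)
print(result)
```

The stop condition here is an empty page; a real site might instead signal the end with an error page or a "no data" message, which the check would have to match.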
Three main kinds of preprocessing were applied to the data:
1. Interval scaling
The overall flow: read the data, process the data, store the results.
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']   # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False     # display the minus sign correctly

filename = 'Hits persecond_t20m_130.csv'
data_f = pd.read_csv(filename)   # two-dimensional DataFrame
# print(data_f)
pandas, the famous data analysis library in Python
The pandas library is a NumPy-based tool created for data analysis tasks. It is built around two core data structures, Series and DataFrame, which correspond to one-dimensional sequences and two-dimensional tables respectively.
Pandas provides a large number of functions and methods that let us process data efficiently.
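The two core structures named above can be shown in a few lines; this is a minimal illustration, with made-up sample values:

```python
import pandas as pd

# A Series is a one-dimensional labelled sequence
s = pd.Series([1.0, 3.0, 5.0], index=["a", "b", "c"])

# A DataFrame is a two-dimensional table; each column is itself a Series
df = pd.DataFrame({"x": s, "y": s * 2})

print(s["b"])      # label-based access -> 3.0
print(df.shape)    # (3, 2)
```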
The world is messy, and data from the real world is just as messy. A recent survey shows that data scientists spend 60% of their time collating data. Unfortunately, 57% of people think it's the most frustrating part of the job.
Organizing data is time-consuming, but a number of tools have been developed to make this critical step a little more bearable. The Python community provides many libraries to make data clean and orderly, from formatting
Python for Data Analysis study notes - 1
This section describes how to process the MovieLens 1M dataset. The book introduces this dataset from GroupLens Research (http://www.groupLens.org/node/73); the link jumps directly to the download page, where the 1M dataset is available.
The downloaded and decompressed folder is as follows:
All three .dat tables are used in the example. The Chinese edition of Python for Data Analysis (PDF) I read is the 2014 first edition, and all the examples are based on it.
This article describes how to read and write CSV files in Python.
In data analysis, you often need to read data from CSV files and write data back into them. It is convenient and easy to read the data in a CSV file directly as the dict type or as a DataFrame. The following code takes the iris data as an example. Reading a CSV file as a dict:
Code
# -*- coding: utf-8 -*-
import csv
with open('E:/iris.csv') as csvfile:
    rea
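The snippet above is cut off in the original. A complete, runnable version of the same idea (CSV as dicts via `csv.DictReader`, and as a DataFrame via `pd.read_csv`) is sketched below; a small in-memory string stands in for the article's `E:/iris.csv` so the example is self-contained:

```python
import csv
import io
import pandas as pd

# small stand-in for iris.csv (the article reads it from E:/iris.csv)
raw = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n4.9,3.0,setosa\n"

# read each row as a dict
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["species"])   # setosa

# read the same data as a DataFrame
df = pd.read_csv(io.StringIO(raw))
print(df.shape)             # (2, 3)
```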
allows users to set and get all Spark and Hadoop configurations related to Spark SQL, including reading config values.
listenerManager: public ExecutionListenerManager listenerManager() - an interface for registering custom QueryExecutionListeners to listen for execution metrics.
experimental: public ExperimentalMethods experimental() - a collection of features considered experimental, used to access advanced features of the query planner.
udf: public UDFR
, title, tick labels, and annotations, because creating a chart typically requires several objects. pandas saves a lot of this trouble: it uses the DataFrame's object features to provide high-level plotting methods for standard charts. The author says the best learning resource, the online pandas documentation, may be outdated in places.
Line chart:
# -*- encoding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
s = Series(np.random.ra
size:
for name, group in grouped2:
    print(name)
    print(group.shape)
Standardize the data (to prevent any one value from being too large). For a numeric column, subtract the column mean and divide by the column standard deviation:
zscore = lambda s: (s - s.mean()) / s.std()
grouped1.transform(zscore)
Filter (some groups have too many samples):
# assume each group's sample count is less than 10
cond1 = lambda s: len(s)
Previously: set the index:
pok1 = pokemon.set_index(['Type 1', 'Type 2'])
To group by index
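The `transform` and `filter` fragments above are truncated in the original; here is a minimal self-contained sketch of both groupby operations, with made-up data and group names:

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "a", "b", "b"],
                   "val": [1.0, 2.0, 3.0, 10.0, 20.0]})
grouped = df.groupby("grp")["val"]

# standardize within each group: subtract the group mean, divide by the group std
zscore = lambda s: (s - s.mean()) / s.std()
df["val_z"] = grouped.transform(zscore)

# filter: keep only the groups with at least 3 samples
big_groups = df.groupby("grp").filter(lambda g: len(g) >= 3)
print(big_groups["grp"].unique())
```

`transform` returns a result aligned with the original rows, while `filter` drops whole groups; that is why the z-score goes back into a column but the size condition produces a smaller frame.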
some coincidences, so that the content you actually want is not extracted while other content also matches the pattern. Therefore, first take out the key blocks, and then extract the specific information from each block.
import re

re_books = re.compile('
Check the source code of the webpage, find the matching rules for the main information, and obtain all the intermediate content. The rest is to extract each item of information for each book with regular expressions, which requires observing their r
process
Task 172: Spark task submission process, drawing summary
Task 173: BlockManager in-depth analysis
Task 174: CacheManager in-depth analysis
Chapter 6: Spark SQL
Task 175: Description of the default number of partitions
Task 176: SparkCore official case demo
Task 177: Spark's past and present
Task 178: Release notes for Spark
Task 179: What is DataFrame
Task 180: DataFrame first experience
Task 181: RDD to DataFram
# coding: utf-8
__author__ = 'Weekyin'

import numpy as np
import pandas as pd

# first create a time index; the index is the ID of each row of data,
# a unique value identifying each row
datas = pd.date_range('20140729', periods=6)
print datas
# for a quick start, create 6x4 data: the randn function creates random
# numbers, its parameters give the number of rows and columns, and datas
# is the index column created in the previous step
df = pd.
There are two existing solutions on the Internet for this problem.
Scenario description: there is a data file, saved as text, with three columns: user_id, plan_id, mobile_id. The goal is a new file containing only mobile_id and plan_id.
Solution one: use Python's open to read the file, process the data in a for loop, and write it to the new file. The code is as follows:
def readwrite1(input_file, output_file):
    f = open(input_file, 'r')
    out = open(
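The second solution the text alludes to is the pandas one: read the file into a DataFrame, select the two wanted columns, and write them back out. This is a minimal sketch with an in-memory stand-in for the text file (the real file path and separator are not given in the excerpt):

```python
import io
import pandas as pd

# stand-in for the text file with user_id, plan_id, mobile_id columns
raw = "user_id,plan_id,mobile_id\nu1,p1,m1\nu2,p2,m2\n"

df = pd.read_csv(io.StringIO(raw))
out = df[["mobile_id", "plan_id"]]     # keep only the two wanted columns
csv_text = out.to_csv(index=False)     # in practice: out.to_csv(output_file, index=False)
print(csv_text)
```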
import numpy as np
import pandas as pd
Stack
Stack rotates the column index into the row index, producing a hierarchical (multi-level) index.
In the following example, we first create a 5x2 DataFrame.
It is then stacked: the original row index becomes the outer level, and the original column index becomes the inner level.
df_obj = pd.DataFrame(np.random.randint(0, 10, (5, 2)), columns=['data1', 'data2'])
print df_obj
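The stack step itself is missing from the excerpt; a runnable sketch of the same example (using fixed values instead of random ones so the result is reproducible) looks like this:

```python
import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.arange(10).reshape(5, 2),
                      columns=["data1", "data2"])
stacked = df_obj.stack()

# the column labels are now the inner level of a two-level row index
print(stacked.index.nlevels)     # 2
print(stacked.loc[0, "data2"])   # 1
```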
1 concat
The concat function is a top-level pandas method that performs simple concatenation of data along different axes.
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False)
Parameter description:
objs: a Series, DataFrame, or a list of Panel objects
axis: the axis along which to concatenate; 0 is rows, 1 is columns
join: the way to connect i
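A small sketch of the two axis settings described above, with made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})

# axis=0 stacks rows; ignore_index=True rebuilds a clean 0..n-1 index
rows = pd.concat([df1, df2], axis=0, ignore_index=True)
print(rows["a"].tolist())    # [1, 2, 3, 4]

# axis=1 places the frames side by side; keys labels each source frame
cols = pd.concat([df1, df2], axis=1, keys=["left", "right"])
print(cols.shape)            # (2, 2)
```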
and did not see this method, so I looked at the sklearn GBDT API, and sure enough there is an apply() method that returns leaf indices:
The code differs because XGBoost has both its own native interface and a scikit-learn interface. At this point, the implementation of combining features with the GBDT (XGBoost) tree structure is basically understood; next, practice with both interfaces. 2. Practice of combining features with the GBDT structure
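A minimal sketch of the sklearn side of this idea, using synthetic data: `apply()` returns the leaf each sample lands in for every tree, and one-hot encoding those indices yields the combined features. The dataset, estimator count, and random seed here are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

gbdt = GradientBoostingClassifier(n_estimators=10, random_state=0)
gbdt.fit(X, y)

# apply() returns, for each sample, the leaf index it falls into in every
# tree; for a binary problem the shape is (n_samples, n_estimators, 1)
leaves = gbdt.apply(X)
print(leaves.shape)

# one-hot encode the leaf indices to use them as combined features,
# e.g. as input to a downstream linear model
leaf_features = OneHotEncoder().fit_transform(leaves.reshape(200, -1))
print(leaf_features.shape[0])
```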
Let's get started ~
(1) The code can achieve many of the functions of Java; similar to FP, with immutability and lazy computation, the distributed in-memory object RDD can be realized and pipelining can be achieved at the same time. (2) Scala is good at borrowing strength: its original design included support for the JVM, so it can make perfect use of the Java ecosystem. Spark is similar: many things it does not write itself but uses and references directly, such as deploying directly on YARN, Mesos, EC2, usin
missing data; because the whole sample is large, I directly delete the missing rows. In addition, since the original data is not all separated by commas, the columns need to be separated with the following code:
# delete missing data
feature_set2 = feature_set[feature_set[1] != -1]   # keep only the rows that are not -1
# print(feature_set2)   # no problem
feature_set2 = feature_set2.reset_index(drop=True)
print(feature_set2.head())
# column 0 contains both the date and the time, so split it into two columns
need_split_col = feature_set2[0
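The splitting step is cut off in the excerpt; one common way to separate a combined date-and-time column is `str.split` with `expand=True`. This is a sketch with made-up values standing in for column 0:

```python
import pandas as pd

# stand-in for the data: column 0 holds both date and time in one field
feature_set2 = pd.DataFrame({0: ["2019-06-01 08:30:00", "2019-06-01 09:00:00"],
                             1: [12, 15]})

# split the combined field on the space into separate date and time columns
parts = feature_set2[0].str.split(" ", expand=True)
feature_set2["date"] = parts[0]
feature_set2["time"] = parts[1]
print(feature_set2[["date", "time"]])
```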