new_titanic_survival = titanic_survival.dropna(subset=['age', 'body', 'home.dest'])

Multi-line index. This is the original titanic_survival DataFrame. After deleting the rows whose body column is NaN, the data becomes the following:

new_titanic_survival = titanic_survival.dropna(subset=["body"])

As you can see, in the new_titanic_survival table the row index stays the same as before and is not renumbered from 0. From the previous article, Pandas (i), you know that pandas uses loc[m] to index rows by label.
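A minimal sketch of this behavior, assuming a small made-up frame in place of the real Titanic data; reset_index(drop=True) is what would renumber the surviving rows from 0:

import numpy as np
import pandas as pd

# toy stand-in for titanic_survival, with only two of the real columns
titanic_survival = pd.DataFrame({"age": [22, 38, np.nan, 35],
                                 "body": [np.nan, 1, 2, np.nan]})

new_titanic_survival = titanic_survival.dropna(subset=["body"])
print(new_titanic_survival.index)    # keeps the original labels [1, 2]
print(new_titanic_survival.loc[1])   # loc still addresses the original label 1

renumbered = new_titanic_survival.reset_index(drop=True)
print(renumbered.index)              # RangeIndex, renumbered from 0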
Merge the split-out table back into the original df_inner table on their shared index:

df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)
V. Data Extraction

The functions mainly used are loc, iloc, and ix: loc extracts by label, iloc extracts by position, and ix can extract by label and position at the same time (ix is deprecated in newer versions of pandas).

1. Extract a single row by index
df_inner.loc[3]
2. Extract
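The item above is cut off in the source; a minimal sketch of label-based versus position-based extraction, assuming a small df_inner with a non-default index:

import pandas as pd

# hypothetical df_inner; any DataFrame with a non-default index shows the difference
df_inner = pd.DataFrame({"city": ["Beijing", "Shanghai", "Guangzhou"],
                         "price": [1200, 3299, 4433]},
                        index=[3, 4, 5])

print(df_inner.loc[3])      # by label: the row whose index label is 3
print(df_inner.iloc[0])     # by position: the first row (same row here)
print(df_inner.loc[3:4])    # label slices include both ends
print(df_inner.iloc[0:2])   # position slices exclude the right end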
A lot of the programming in data analysis and modeling goes into data preparation: loading, cleaning, transforming, and reshaping. Sometimes the data stored in a file or database does not meet the requirements of your data processing application. Many people choose to convert data between formats using a general-purpose programming language such as Python, Perl, R, or Java, or UNIX text-processing tools such as sed or awk. Fortunately, pandas and the Python standard library provide a set of high-level, flexible, and fast tools for getting data into the right form.
Readers who simply browse this article's table of contents will, I believe, already have grasped 10%-20% of pandas. The purpose of this article is to establish an approximate knowledge structure. While reading source code for data mining in Python, I consulted pandas material on and off, and got a general sense from those sources of how convenient pandas makes data cleaning. I have sorted the material I consulted, together with the methods commonly used in practice, into these study notes.
Pandas has two main data structures: Series and DataFrame. A Series is an object similar to a one-dimensional array, consisting of a set of data and an associated set of data labels. Take a look at how it is used:

In [1]: from pandas import Series, DataFrame
In [2]: import pandas as pd
In [3]: obj = Series([4, 7, -5, 3])
In [5]: obj
Out[5]:
0    4
1    7
2   -5
3    3
dtype: int64

The object generated by Series shows the index on the left and the corresponding value on the right.
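To make the "data labels" part concrete, here is a short added sketch (not from the original) of a Series with an explicit index:

from pandas import Series

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2['a'])     # -5, values can be selected by label
print(obj2.index)    # Index(['d', 'b', 'a', 'c'], dtype='object')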
anything, but some languages can only do certain things within a certain field. SQL is such a language: it can only describe data operations. It is nevertheless classified as a programming language in the broad sense, since it requires lexical analysis and syntax analysis; those who do not know this process can read up on it.

0x02 Prepare data
Because the data has already been prepared this time, all we need to do is write a small script to read it out; I will package what we need.
# -*- coding: utf-8 -*-
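The script itself is cut off in the source; a minimal sketch of such a read-and-package script, assuming a hypothetical data.csv and that "packaging" simply means selecting the needed columns:

# -*- coding: utf-8 -*-
import pandas as pd

def load_data(path="data.csv"):    # file name is a placeholder
    df = pd.read_csv(path)
    # keep only the columns needed downstream (column names assumed)
    return df[["id", "value"]]

if __name__ == "__main__":
    print(load_data().head())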
In addition to Series and DataFrame, the two commonly used data structures in the pandas library, there is also a Panel data structure, which can typically be created from a dictionary of DataFrame objects or from a three-dimensional array.

# Created on Sat Mar 18:01:05
# @author: Jeremy

import numpy as np
from pandas import Series, DataFrame
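A short sketch of the two creation routes just mentioned. Note that Panel was deprecated in pandas 0.20 and removed in 1.0, so this only runs on old pandas versions; the shapes and keys below are made up:

import numpy as np
import pandas as pd    # requires an old pandas (< 1.0) that still has pd.Panel

# route 1: from a three-dimensional array (2 items x 4 rows x 3 columns)
p1 = pd.Panel(np.random.randn(2, 4, 3),
              items=['Item1', 'Item2'],
              major_axis=pd.date_range('2017-03-01', periods=4),
              minor_axis=['A', 'B', 'C'])

# route 2: from a dictionary of DataFrame objects
p2 = pd.Panel({'df1': pd.DataFrame(np.random.randn(4, 3)),
               'df2': pd.DataFrame(np.random.randn(4, 2))})

print(p1.shape, p2.shape)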
[spark-1.5.1-bin-hadoop2.4]$ ./bin/run-example streaming.NetworkWordCount 192.168.19.131 9999

Then, in the first window, type something such as: hello world, world of hadoop world, spark world, flume world, hello world

and check whether the word counts appear in the second window.
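For reference, the "first window" in the official walkthrough of this example is a netcat session feeding the port that NetworkWordCount connects to; a sketch of the two-terminal setup:

# terminal 1: start a simple text server on port 9999
$ nc -lk 9999
hello world hello spark

# terminal 2: run the streaming example against that host and port
$ ./bin/run-example streaming.NetworkWordCount 192.168.19.131 9999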
1. Spark SQL and DataFrames

A. What is Spark SQL?

Spark SQL is the module Spark uses to process structured data. It provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine.
References:
https://spark.apache.org/docs/latest/sql-programming-guide.html#overview
http://www.csdn.net/article/2015-04-03/2824407

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

1) In Spark, a DataFrame is a distributed data set based on an RDD, similar to a two-dimensional table in a traditional database. The main difference between a DataFrame and an RDD is that the DataFrame carries schema information, i.e., each column has a name and a type.
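A minimal PySpark sketch of that difference (rows and names invented): printSchema shows the per-column metadata that a plain RDD does not carry.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-demo").getOrCreate()
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

df.printSchema()       # id: long, name: string; the schema travels with the data
print(df.rdd.first())  # dropping to the underlying RDD yields plain Row objects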
We have used pandas for some basic operations; next, let's get a further understanding of operating on the data.
Data cleansing has always been a very important part of data analysis.
Data merge
In pandas, you can merge data through merge.
import numpy as np
import pandas as pd

data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'number': [1, 3, 5, 7]})
data2 = pd.DataFrame({'level': ['a', 'b', 'c
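Since the snippet is cut off, here is a self-contained sketch of the merge; data2's values are invented to complete the example:

import pandas as pd

data1 = pd.DataFrame({'level': ['a', 'b', 'c', 'd'], 'number': [1, 3, 5, 7]})
data2 = pd.DataFrame({'level': ['a', 'b', 'c', 'e'], 'number': [2, 3, 6, 10]})  # made-up values

# inner join on the shared 'level' column; only levels present in both survive
merged = pd.merge(data1, data2, on='level', suffixes=('_1', '_2'))
print(merged)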
Pandas is a very important data processing library in Python. It provides a very rich set of data processing functions, which helps with data preprocessing before machine learning and data mining.
The following is a short summary of recent usage: 1. pandas reads a CSV file to obtain a DataFrame object, which supports rich data processing; missing values are handled with dropna() or fillna(). 2.
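A quick sketch of the missing-value calls from point 1, on a tiny made-up frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

print(df.dropna())                # drop every row that contains a NaN
print(df.dropna(subset=["a"]))    # only consider NaNs in column 'a'
print(df.fillna(0))               # replace NaNs with a constant
print(df.fillna(df.mean()))       # or with per-column means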
Translation of http://spark.apache.org/docs/latest/ml-guide.html, the machine learning library MLlib guide.
MLlib is a machine learning library that runs on Spark and makes machine learning convenient from the Scala language (among others). It provides the following features:

ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML pipelines
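A compact sketch of those three pieces working together in pyspark.ml, adapted in spirit from the official guide; the data and parameters are made up:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
train = spark.createDataFrame([("spark is great", 1.0),
                               ("hadoop mapreduce", 0.0)],
                              ["text", "label"])

# featurization (tokenize + hash) feeding a classification algorithm,
# wired together as a Pipeline
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)
model.transform(train).select("text", "prediction").show()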
training = pd.read_csv('train.csv', index_col=0)    # reconstructed; this line is garbled in the source
test = pd.read_csv('test.csv', index_col=0)

# convert gender to 0/1 (the concrete values were lost in the source; 1/0 assumed)
SexCode = pd.DataFrame([1, 0], index=['female', 'male'], columns=['sexcode'])

training = training.join(SexCode, how='left', on=training.sex)
# drop variables that do not take part in modeling: name, ticket number,
# port of embarkation, cabin number, and the original sex column
training = training.drop(['name', 'ticket', 'embarked', 'cabin', 'sex'], axis=1)
test = test.join(SexCode, how='left', on=test.sex)
one_year = data['1990']    # the read_csv call above this line is garbled in the source
one_year.plot()

One problem with this approach is that a column of object dtype cannot be plotted; see "pandas read csv file TypeError: Empty 'DataFrame': no numeric data to plot". In addition, plot styles can be browsed in the documentation to pick a favorite (document link).

(2) Histogram and density plot

A histogram, as you know, has no time ordering; it simply counts how often a variable's values fall within each range. For example, the data is divided into 10 bins, and we
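A hedged sketch of one common fix for that error: coerce the object column to numeric before plotting. The column name '1990' comes from the snippet above; the file name is assumed:

import pandas as pd

data = pd.read_csv("data.csv")    # hypothetical file
one_year = pd.to_numeric(data["1990"], errors="coerce")    # object to float; bad cells become NaN
one_year.plot()    # needs matplotlib; now there is numeric data to plot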
Motivation
We spend a lot of time migrating data from common interchange formats (such as CSV) into efficient computation formats such as arrays, databases, or binary storage. Worse, many people never migrate their data to efficient formats because they do not know how to (or cannot) manage the particular migration method for their tools.
The format you choose for your data is very important: it strongly affects the performance of your program (rules of thumb suggest differences of about 10x in processing time).
Lesson 10: from DataFrame to Excel, from Excel to DataFrame, from DataFrame to JSON, from JSON to DataFrame
import pandas as pd
import sys

print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
Python version 3.6.1 | packaged by conda-forge | (default, Mar 2017, 21:57:00)
[GCC 4.2.
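The lesson's actual conversion code is not included in this excerpt; a minimal sketch of the four round trips the title names, assuming a throwaway DataFrame and local file paths:

import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "score": [90, 85]})

# DataFrame to Excel and back (needs an engine such as openpyxl installed)
df.to_excel("lesson10.xlsx", sheet_name="sheet1", index=False)
df_from_excel = pd.read_excel("lesson10.xlsx", sheet_name="sheet1")

# DataFrame to JSON and back
df.to_json("lesson10.json")
df_from_json = pd.read_json("lesson10.json")

print(df_from_excel.equals(df), df_from_json.equals(df))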
2 DataFrames

Similar to pandas DataFrames in Python, PySpark also has DataFrames, which are processed much faster than unstructured RDDs.
Spark 2.0 replaced SQLContext with SparkSession. The various Spark contexts, including HiveContext, SQLContext, StreamingContext, and SparkContext, are all merged into SparkSession, which is used as the single entry point for reading data.
2.1 Creating DataFrames

Preparatory work:
>>> import pyspark
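The excerpt stops here; a minimal sketch of the usual next steps, creating a SparkSession and a small DataFrame (names and rows invented):

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("demo").getOrCreate()
>>> df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
>>> df.show()    # prints the two rows as an ASCII table
>>> df.count()
2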
...Features: master, worker, and executor all run in separate JVM processes.

4. YARN cluster: the ApplicationMaster role in the YARN ecosystem is played by the Spark ApplicationMaster developed by Apache; the NodeManager role in the YARN ecosystem corresponds to the worker role in the Spark ecosystem, and the NodeManager is responsible for starting executors.

5. Mesos cluster: not studied in detail.

II. About Spark SQL

Brief introduction

It is primarily used for structured data processing and for executing SQL-like queries.