Catalogue
1. Connect Spark
2. Create DataFrame
  2.1. Create from variables (explicit schema)
  2.2. Create from variables (type inference)
  2.3. Read JSON
  2.4. Read CSV
  2.5. Read MySQL
  2.6. Create from pandas.DataFrame
  2.7. Read from column-stored Parquet
  2.8. Read from Hive
3. Save data
  3.1. Write to CSV
  3.2. Save to Parquet
  3.3. Write to Hive
  3.4. Write to HDFS
  3.5. Write to MySQL
1. Connect Spark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('my_first_app_name') \
    .getOrCreate()
2. Create DataFrame
2.1. Create from variables (explicit schema)
# Generate comma-delimited data
stringCSVRDD = spark.sparkContext.parallelize([
    (123, 'Katie', 19, 'brown'),
    (234, 'Michael', 22, 'green'),
    (345, 'Simone', 23, 'blue')
])

# Specify the schema, StructField(name, dataType, nullable)
# where:
#   name: the field name
#   dataType: the field's data type
#   nullable: whether the field's value may be null
from pyspark.sql.types import StructType, StructField, LongType, StringType  # import the types

schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
    StructField("eyeColor", StringType(), True)
])

# Apply the schema to the RDD and create the DataFrame
swimmers = spark.createDataFrame(stringCSVRDD, schema)

# Create a temporary view from the DataFrame
# (in Spark 2.x, swimmers.createOrReplaceTempView("swimmers") is the preferred equivalent)
swimmers.registerTempTable("swimmers")

# Count the number of rows in the DataFrame
swimmers.count()
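Because the DataFrame has been registered as a temporary view, it can also be queried with plain SQL. A minimal sketch using the swimmers view created above:

spark.sql("select id, name, age from swimmers where age > 20").show()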
2.2. Create from variables (type inference)
# Use automatic type inference to create the DataFrame
data = [(123, 'Katie', 19, 'brown'),
        (234, 'Michael', 22, 'green'),
        (345, 'Simone', 23, 'blue')]
df = spark.createDataFrame(data, schema=['id', 'name', 'age', 'eyeColor'])
df.show()
df.count()
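To check what types the automatic inference produced, inspect the schema. A small sketch:

# Print the inferred schema as a tree
df.printSchema()
# Or get it as a list of (column, type) pairs
print(df.dtypes)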
2.3. Read JSON
# Read the sample data shipped with Spark
file = r"D:\hadoop_spark\spark-2.1.0-bin-hadoop2.7\examples\src\main\resources\people.json"
df = spark.read.json(file)
df.show()
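spark.read.json can also parse an RDD of JSON strings, which is convenient for quick tests without a file. A minimal sketch with made-up records:

# Hypothetical in-memory JSON records, parsed the same way as a file
json_rdd = spark.sparkContext.parallelize([
    '{"name": "Katie", "age": 19}',
    '{"name": "Michael", "age": 22}'
])
df2 = spark.read.json(json_rdd)
df2.show()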
2.4. Read CSV
# First create a CSV file with pandas
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(5, 5), columns=['a', 'b', 'c', 'd', 'e']).\
    applymap(lambda x: int(x * 10))
file = r"D:\hadoop_spark\spark-2.1.0-bin-hadoop2.7\examples\src\main\resources\random.csv"
df.to_csv(file, index=False)

# Then read the CSV file back with Spark
monthlySales = spark.read.csv(file, header=True, inferSchema=True)
monthlySales.show()
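inferSchema=True costs an extra pass over the file; if the column types are known in advance you can pass an explicit schema instead. A sketch assuming the five integer columns written above:

from pyspark.sql.types import StructType, StructField, IntegerType

csv_schema = StructType([StructField(c, IntegerType(), True)
                         for c in ['a', 'b', 'c', 'd', 'e']])
monthlySales = spark.read.csv(file, header=True, schema=csv_schema)
monthlySales.printSchema()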
2.5. Read MySQL
# The MySQL JDBC driver jar must first be copied into spark-2.2.0-bin-hadoop2.7\jars
# (this works in a standalone environment, but not in a cluster environment),
# then re-run the job.
df = spark.read.format('jdbc').options(
    url='jdbc:mysql://127.0.0.1',
    dbtable='mysql.db',
    user='root',
    password='123456'
).load()
df.show()

# A SQL statement can also be passed in as the dbtable option
sql = "(select * from mysql.db where db = 'wp230') t"
df = spark.read.format('jdbc').options(
    url='jdbc:mysql://127.0.0.1',
    dbtable=sql,
    user='root',
    password='123456'
).load()
df.show()
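If you would rather not copy the driver jar into the jars directory, the connector can be handed to the session when it is created via the spark.jars option. A sketch; the jar path below is an assumption:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('my_first_app_name') \
    .config('spark.jars', r'D:\jars\mysql-connector-java-5.1.40-bin.jar') \
    .getOrCreate()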
2.6. Create from pandas.DataFrame
# If no schema is specified, pandas's column names are used
df = pd.DataFrame(np.random.random((4, 4)))
spark_df = spark.createDataFrame(df, schema=['a', 'b', 'c', 'd'])
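The reverse conversion also works: a (small) Spark DataFrame can be collected back to the driver as a pandas DataFrame. A minimal sketch:

# toPandas() brings every row to the driver, so only use it on small results
pdf = spark_df.toPandas()
print(pdf.head())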
2.7. Read from column-stored Parquet
# Read the example Parquet file below
file = r"D:\apps\spark-2.2.0-bin-hadoop2.7\examples\src\main\resources\users.parquet"
df = spark.read.parquet(file)
df.show()
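Because Parquet is a columnar format, selecting only the columns you need lets Spark skip the rest of the data. A small sketch assuming the name and favorite_color columns of the bundled users.parquet example:

# Only the requested columns are read from the Parquet file
df.select("name", "favorite_color").show()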
2.8. Read from Hive
# If Spark has been configured with the Hive connection parameters, Hive data can be read directly
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .master("spark://172.31.100.170:7077") \
    .appName("my_first_app_name") \
    .getOrCreate()

df = spark.sql("select * from hive_tb_name")
df.show()
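For reading a whole table, spark.table is an equivalent shortcut to the select statement above. A minimal sketch:

df = spark.table("hive_tb_name")
df.show()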
3. Save data
3.1. Write to CSV
# Create the DataFrame
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random((4, 4)), columns=['a', 'b', 'c', 'd'])
spark_df = spark.createDataFrame(df)

# Write to CSV
file = r"D:\apps\spark-2.2.0-bin-hadoop2.7\examples\src\main\resources\test.csv"
spark_df.write.csv(path=file, header=True, sep=",", mode='overwrite')
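Note that Spark writes a directory of part files rather than a single CSV. If one file is required, coalesce the DataFrame to a single partition first; a sketch that is only reasonable for small data:

# Force a single partition so the output directory contains one part file
spark_df.coalesce(1).write.csv(path=file, header=True, sep=",", mode='overwrite')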
3.2. Save to Parquet
# Create the DataFrame
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random((4, 4)), columns=['a', 'b', 'c', 'd'])
spark_df = spark.createDataFrame(df)

# Write to Parquet
file = r"D:\apps\spark-2.2.0-bin-hadoop2.7\examples\src\main\resources\test.parquet"
spark_df.write.parquet(path=file, mode='overwrite')
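Parquet output can also be partitioned by one or more columns, producing one subdirectory per value. A sketch using column 'a' of the example DataFrame purely for illustration:

# Writes .../test.parquet/a=<value>/part-*.parquet subdirectories
spark_df.write.partitionBy("a").parquet(path=file, mode='overwrite')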
3.3. Write to Hive
# Enable dynamic partitioning
spark.sql("set hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("set hive.exec.dynamic.partition = true")

# Write to the partitioned table using ordinary Hive SQL
spark.sql("""
    insert overwrite table ai.da_aipurchase_dailysale_hive
    partition (saledate)
    select productid, propertyid, processcenterid, saleplatform, sku, poa, salecount, saledate
    from szy_aipurchase_tmp_szy_dailysale
    distribute by saledate
""")

# Or rebuild the partitioned table on every write
jdbcDF.write.mode("overwrite").partitionBy("saledate").insertInto("ai.da_aipurchase_dailysale_hive")
jdbcDF.write.saveAsTable("ai.da_aipurchase_dailysale_hive", None, "append", partitionBy='saledate')

# Without a partitioned table, simply import into the Hive table
jdbcDF.write.saveAsTable("ai.da_aipurchase_dailysale_for_ema_predict", None, "overwrite", None)
3.4. Write to HDFS
# Save the data to HDFS in CSV format
jdbcDF.write.mode("overwrite").options(header="true").csv("/home/ai/da/da_aipurchase_dailysale_for_ema_predict.csv")
3.5. Write to MySQL
# Fields are aligned automatically, i.e. the columns of spark_df do not have to match all columns of the MySQL table
# overwrite clears the table and then imports the data
spark_df.write.mode("overwrite").format("jdbc").options(
    url='jdbc:mysql://127.0.0.1',
    user='root',
    password='123456',
    dbtable="test.test",
    batchsize="1000"  # example batch size
).save()

# append mode appends to the existing data
spark_df.write.mode("append").format("jdbc").options(
    url='jdbc:mysql://127.0.0.1',
    user='root',
    password='123456',
    dbtable="test.test",
    batchsize="1000"  # example batch size
).save()
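An equivalent form uses DataFrameWriter.jdbc directly, passing the connection details as a properties dict. A sketch; the driver class name shown is an assumption for MySQL Connector/J 5.x:

spark_df.write.jdbc(
    url='jdbc:mysql://127.0.0.1',
    table='test.test',
    mode='append',
    properties={'user': 'root', 'password': '123456',
                'driver': 'com.mysql.jdbc.Driver'}  # assumed driver class for Connector/J 5.x
)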