CSV Sample Data
[hadoop@ip-10-0-52-52 ~]$ cat test.csv
id,name,address
1,zhang san,china shanghai
2,li si,"china
beijing"
3,tom,china shanghai
Reading CSV in Spark Versions Before 2.2
Versions before 2.2 misread quoted fields that contain line breaks: the record is split in two at the line break.
scala> val df1 = spark.read.option("header", true).csv("file:///home/hadoop/test.csv")
df1: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]
scala> df1.count
res4: Long = 4
scala> df1.show
+--------+---------+--------------+
|      id|     name|       address|
+--------+---------+--------------+
|       1|zhang san|china shanghai|
|       2|    li si|         china|
|beijing"|     null|          null|
|       3|      tom|china shanghai|
+--------+---------+--------------+
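The four rows above are what one would expect if the input is split into lines before each line is parsed as CSV. The following is a minimal sketch of that effect using only Python's standard csv module; it is not Spark's implementation. The function names are hypothetical, and the sample string is an assumption reconstructed from the outputs shown in this post.

```python
import csv
import io

# Assumed sample: the address of row 2 contains a line break inside quotes.
DATA = (
    'id,name,address\n'
    '1,zhang san,china shanghai\n'
    '2,li si,"china\nbeijing"\n'
    '3,tom,china shanghai\n'
)

def parse_line_by_line(text):
    """Split on line breaks first, then parse every line on its own."""
    lines = text.splitlines()
    # Skip the header; each remaining line becomes one record.
    return [next(csv.reader([line])) for line in lines[1:]]

def parse_whole_text(text):
    """Give the whole text to a quote-aware parser in one piece."""
    reader = csv.reader(io.StringIO(text, newline=''))
    next(reader)  # skip the header
    return list(reader)

print(len(parse_line_by_line(DATA)))  # 4: the quoted field is cut in two
print(len(parse_whole_text(DATA)))    # 3: the line break stays in the field
```

The second function keeps the quoted line break inside the address field, which is the behavior the rest of this post is after.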
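On these versions, one workaround is to read each file as binary and parse it with Python's csv module. It is not a good solution, because binaryFiles loads every file whole as a single record, but it works, as in the following PySpark implementation: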
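import csv
import io

def spark_read_csv_bf(spark, path, schema=None, encoding='utf8'):
    """
    :param spark: Spark 2.0 SparkSession
    :param path: csv path
    :param schema: optional schema for the resulting DataFrame
    :param encoding: encoding of the csv file
    :return: DataFrame
    """
    # binaryFiles yields (path, bytes) pairs. Decode each file, then let
    # csv.DictReader parse it so quoted line breaks stay inside their field.
    # Written for Python 3, where csv.DictReader needs text rather than bytes.
    rdd = spark.sparkContext.binaryFiles(path).values() \
        .flatMap(lambda x: csv.DictReader(io.StringIO(x.decode(encoding), newline=''))) \
        .map(dict)
    if schema:
        return spark.createDataFrame(rdd, schema)
    else:
        return rdd.toDF()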
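The per-file parsing step of this workaround can be tried locally without a cluster. The sketch below is an illustration only: `parse_csv_bytes` is a hypothetical helper name, it targets Python 3, and the sample bytes are an assumption reconstructed from the outputs shown in this post.

```python
import csv
import io

def parse_csv_bytes(content, encoding="utf8"):
    """Parse the raw bytes of one whole CSV file into a list of dicts."""
    # newline="" leaves line breaks untouched, so the csv module can
    # keep the ones that sit inside quoted fields.
    text = io.StringIO(content.decode(encoding), newline="")
    return [dict(row) for row in csv.DictReader(text)]

# Assumed sample data: row 2 has a line break inside a quoted address.
SAMPLE = (
    b'id,name,address\n'
    b'1,zhang san,china shanghai\n'
    b'2,li si,"china\nbeijing"\n'
    b'3,tom,china shanghai\n'
)

rows = parse_csv_bytes(SAMPLE)
print(len(rows))                 # 3
print(repr(rows[1]["address"]))  # 'china\nbeijing'
```

Because the parser sees the whole file at once, the quoted line break ends up inside the address value instead of starting a new record.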
Reading CSV in Spark 2.2 and Later
Spark 2.2 fixes this bug. Adding the multiLine option to the read call resolves the problem. For the implementation, see:
[SPARK-19610] [SQL] Support parsing multiline CSV files
[SPARK-20980] [SQL] Rename wholeFile to multiLine for both CSV and JSON
scala> val df2 = spark.read.option("header", true).option("multiLine", true).csv("file:///home/hadoop/test.csv")
df2: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]
scala> df2.count
res6: Long = 3
scala> df2.show
+---+---------+--------------+
| id|     name|       address|
+---+---------+--------------+
|  1|zhang san|china shanghai|
|  2|    li si| china
beijing|
|  3|      tom|china shanghai|
+---+---------+--------------+
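The same read from PySpark might look like the sketch below. `read_multiline_csv` is a hypothetical wrapper name introduced for illustration; `option` and `csv` are the standard DataFrameReader methods, and the multiLine option assumes Spark 2.2 or later.

```python
def read_multiline_csv(spark, path):
    """Read a CSV file whose quoted fields may contain line breaks.

    `spark` is assumed to be an existing SparkSession (Spark 2.2+).
    """
    return (
        spark.read
        .option("header", True)      # first line holds the column names
        .option("multiLine", True)   # keep quoted line breaks inside one record
        .csv(path)
    )

# Usage, assuming a running session named `spark`:
# df2 = read_multiline_csv(spark, "file:///home/hadoop/test.csv")
# df2.count()
```

Keeping the session as a parameter lets the function be exercised with a stub reader in unit tests, without starting Spark.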