Spark CSV reading: parsing cells whose values span multiple lines

CSV Sample Data
[hadoop@ip-10-0-52-52 ~]$ cat test.csv 
id,name,address
1,zhang san,china shanghai
2,li si,"china
beijing"
3,tom,china shanghai
Reading CSV with versions before Spark 2.2

The file is read incorrectly: the quoted cell that spans two lines is split into two records, so the count is 4 instead of 3.

scala> val df1 = spark.read.option("header", true).csv("file:///home/hadoop/test.csv")
df1: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]

scala> df1.count
res4: Long = 4

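scala> df1.show
+--------+---------+--------------+
|      id|     name|       address|
+--------+---------+--------------+
|       1|zhang san|china shanghai|
|       2|    li si|         china|
|beijing"|     null|          null|
|       3|      tom|china shanghai|
+--------+---------+--------------+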
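
The difference between the two record counts can be reproduced without Spark. The following is a minimal sketch using only Python's standard csv module. It assumes the sample file has a header line id,name,address and that the address of record 2 is a quoted cell containing a line break, as the DataFrame output suggests. The names SAMPLE, count_by_lines and count_by_quotes are made up for this illustration.

```python
import csv
import io

# Assumed contents of test.csv: header line plus three records,
# where the address of record 2 is a quoted cell with a line break.
SAMPLE = (
    'id,name,address\n'
    '1,zhang san,china shanghai\n'
    '2,li si,"china\nbeijing"\n'
    '3,tom,china shanghai\n'
)

def count_by_lines(text):
    """Treat every physical line after the header as one record."""
    return len(text.splitlines()) - 1

def count_by_quotes(text):
    """Let a quote-aware parser decide where each record ends."""
    rows = list(csv.reader(io.StringIO(text, newline='')))
    return len(rows) - 1

print(count_by_lines(SAMPLE))   # 4, the count seen before Spark 2.2
print(count_by_quotes(SAMPLE))  # 3, the number of logical records
```

Splitting on line breaks first and parsing each line afterwards is what produces the extra row: the second half of the quoted cell becomes a record of its own.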

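One workaround is to read each file whole as a binary file and parse it with Python's csv module, although this is not a good approach because every file has to fit in memory on a single executor. A PySpark implementation looks like this: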

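import csv
import io

def spark_read_csv_bf(spark, path, schema=None, encoding='utf8'):
    '''
    :param spark:    Spark 2.0 SparkSession
    :param path:     csv path
    :param schema:   optional schema passed to createDataFrame
    :param encoding: encoding used to decode the file contents
    :return: DataFrame
    '''
    # binaryFiles yields (path, bytes) pairs, one per file, so a quoted
    # line break stays inside its cell when the csv module parses it.
    # Written for Python 3: decode first, then parse the text.
    rdd = spark.sparkContext.binaryFiles(path).values() \
        .flatMap(lambda x: csv.DictReader(io.StringIO(x.decode(encoding), newline=''))) \
        .map(dict)
    if schema:
        return spark.createDataFrame(rdd, schema)
    return rdd.toDF()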
Reading CSV with Spark 2.2 and later

Spark 2.2 fixed this problem. Adding the multiLine option to the read call resolves it. For the implementation, refer to these issues:

[SPARK-19610] [SQL] Support parsing multiline CSV files

[SPARK-20980] [SQL] Rename wholeFile to multiLine for both CSV and JSON

scala> val df2 = spark.read.option("header", true).option("multiLine", true).csv("file:///home/hadoop/test.csv")
df2: org.apache.spark.sql.DataFrame = [id: string, name: string ... 1 more field]

scala> df2.count
res6: Long = 3

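scala> df2.show
+---+---------+--------------+
| id|     name|       address|
+---+---------+--------------+
|  1|zhang san|china shanghai|
|  2|    li si| china
beijing|
|  3|      tom|china shanghai|
+---+---------+--------------+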
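
For PySpark users, the same read can be written as the sketch below. It assumes an existing SparkSession is passed in as spark and that Spark is version 2.2 or later. read_multiline_csv is a hypothetical helper name chosen for this example, not part of the Spark API.

```python
def read_multiline_csv(spark, path):
    """Read a CSV whose quoted cells may contain line breaks (Spark 2.2+).

    `spark` is an existing SparkSession. The helper name is hypothetical;
    only the DataFrameReader calls option() and csv() come from Spark.
    """
    return (spark.read
            .option("header", True)      # first line holds the column names
            .option("multiLine", True)   # keep quoted line breaks inside one cell
            .csv(path))
```

A call such as read_multiline_csv(spark, "file:///home/hadoop/test.csv") should then return a DataFrame with 3 rows for the sample file. Note that with multiLine enabled Spark cannot split a single file across tasks, so very large files are read more slowly.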

