First, reading a local CSV file:
The easiest way is via pandas:
import pandas as pd
lines = pd.read_csv(file)
lines_df = sqlContext.createDataFrame(lines)
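A minimal runnable sketch of the pandas route, using an in-memory CSV instead of a file on disk (the column names and rows here are made up for illustration; the `createDataFrame` call is shown as a comment since it needs a live SQLContext):

```python
import io
import pandas as pd

# Simulate a small local CSV file with two string columns (hypothetical data).
csv_text = "HWMC,Code\nwidget,001\ngadget,002\n"
pdf = pd.read_csv(io.StringIO(csv_text), dtype=str)  # dtype=str keeps "001" intact
print(pdf.shape)  # two data rows, two columns

# With a SQLContext in hand, the pandas frame converts directly:
# lines_df = sqlContext.createDataFrame(pdf)
```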
Or use Spark to read the file directly as an RDD and then convert it:
lines = sc.textFile('file')
If your CSV file has a header, you need to remove the first line:
header = lines.first()  # the first line
lines = lines.filter(lambda row: row != header)  # drop the header line
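The filter logic above can be sketched on a plain Python list in place of an RDD; the lambda is identical in both cases (the sample rows are invented for illustration):

```python
# Stand-in for an RDD of raw CSV lines (hypothetical data).
rows = ["HWMC,Code", "widget,001", "gadget,002"]

header = rows[0]  # RDD equivalent: lines.first()
# RDD equivalent: lines.filter(lambda row: row != header)
data = list(filter(lambda row: row != header, rows))
print(data)  # header line removed, data rows kept
```

Note the caveat with this pattern on a real RDD: it removes every line equal to the header, not only the first one, so it can also drop data rows that happen to match the header exactly.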
At this point lines is an RDD of strings. To convert it to a DataFrame, split each line into fields and apply a schema:
parts = lines.map(lambda l: l.split(','))
schema = StructType([StructField('HWMC', StringType(), True), StructField('Code', StringType(), True)])
lines_df = sqlContext.createDataFrame(parts, schema)
Second, reading a CSV file on HDFS:
1. Read it as an RDD first and then convert, as above.
2. Use sqlContext.read.format(); this has a prerequisite: the com.databricks.spark.csv dependency (the spark-csv package) must be available in advance.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('file')
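The spark-csv dependency is usually supplied when launching Spark rather than in code. A hedged example of passing it via --packages (the Scala/package version numbers below are assumptions and should match your Spark build):

```shell
# Launch PySpark with the spark-csv package pulled in automatically.
# Coordinate format is groupId:artifactId:version.
pyspark --packages com.databricks:spark-csv_2.10:1.5.0
```

On Spark 2.x and later this external package is unnecessary: CSV support is built in as spark.read.csv('file', header=True, inferSchema=True).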