When writing a Spark program, querying fields from a CSV file is usually done in one of the following ways:
(1) Query the DataFrame directly
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)
  .load("cars.csv")

val selectedData = df.select("year", "model")
Reference: https://github.com/databricks/spark-csv
The code above reads a CSV file the Spark 1.x way; in Spark 2.x the syntax is different:
val df = sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()
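As a side note, Spark 2.x also ships a built-in CSV data source, so the external spark-csv package is no longer strictly required. A minimal sketch, assuming a SparkSession instance named sparkSession and a local people.csv with a header row:

// Built-in CSV reader available since Spark 2.0
val df2 = sparkSession.read
  .option("header", "true")         // treat the first line as the header
  .option("mode", "DROPMALFORMED")  // drop rows that cannot be parsed
  .csv("people.csv")
  .cache()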
(2) Build a case class
case class Person(name: String, age: Long)

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+
This is the example from the Spark 2.2.0 documentation.
Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html
The two approaches above work fine when you are just testing a small file whose header has no more than a few dozen fields, for example when I only need to query a user's name, age, and sex.
In practice, however, you run into these problems:
**(1) I do not know in advance which fields to query;
(2) I do not know in advance how many fields to query.**
The examples above are not enough for that, so there is a third method (3):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
The example above, also from the Spark website, still uses a DataFrame, but it builds the field structure programmatically with StructField and StructType, so each queried field is accessed by index rather than by a concrete field name such as name or age. In practice, though, method (3) behaves much like methods (1) and (2): it still cannot solve the problem raised above and needs further improvement.
Example (4):
val df = sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()

var schemaString = "name,age"

// Register a temp table
df.createOrReplaceTempView("people")

// SQL query
var dataDF = sparkSession.sql("select " + schemaString + " from people")

// Convert to an RDD
var dfRDD = dataDF.rdd

val fields = schemaString.split(",").map(fieldName => StructField(fieldName, StringType, nullable = true))
var schema = StructType(fields)

// Convert the RDD back to a DataFrame
var newDF = sparkSession.createDataFrame(dfRDD, schema)
This approach solves the problems raised above, because schemaString can be assembled at runtime instead of being hard-coded.
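For example, here is a minimal sketch of building schemaString from a field list that is only known at runtime; the requestedFields value is a hypothetical input used for illustration, everything else reuses the names from example (4):

// Hypothetical: the fields to query arrive at runtime (from a config file, a UI, etc.)
val requestedFields: Array[String] = Array("name", "age")

// Keep only the fields that actually exist in the CSV header
val availableFields = df.columns.toSet
schemaString = requestedFields.filter(availableFields.contains).mkString(",")

// The rest of example (4) then runs unchanged against this dynamic field list
dataDF = sparkSession.sql("select " + schemaString + " from people")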
DataFrames are very fast, especially in the newer versions. Of course, in a production environment we may still use RDDs to transform the data into the shape we need. In that case, you can write it like this:
Extract the desired fields from each full CSV row and collect them into an array. For example, to query these fields:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

val queryArr = Array("name", "age")

// rowRDD is assumed to be an RDD[Array[String]], e.g. each CSV line split on ","
val rowRDD2 = rowRDD.map(attributes => {
  val myAttributes: Array[String] = attributes
  // Indexes of the columns of the fields you want to query (e.g. the nth column),
  // taken from a broadcast variable prepared on the driver
  val myColumnsNameIndexArr: Array[Int] = colsNameIndexArrBroadcast.value
  var myColumnsNameDataArrB: ArrayBuffer[String] = new ArrayBuffer[String]()
  for (i <- 0 until myColumnsNameIndexArr.length) {
    myColumnsNameDataArrB += myAttributes(myColumnsNameIndexArr(i)).toString
  }
  val myColumnsNameDataArr: Array[String] = myColumnsNameDataArrB.toArray
  myColumnsNameDataArr
}).map(x => Row(x)).cache() // each Row wraps the array of selected values
In this way, each row of the returned RDD is an array of the selected field values, and from these arrays you can then convert the rows back into columns.
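A minimal sketch of the two pieces the snippet above leaves implicit: building the broadcast of column indexes from the CSV header, and turning the per-row arrays back into a DataFrame with one column per queried field. The header value is a hypothetical example (in practice it would come from df.columns); sparkSession, queryArr, and rowRDD are the names used above.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical CSV header, in file order; in practice take it from df.columns
val header: Array[String] = Array("name", "age", "sex")

// Map each requested field name to its position in the header and broadcast the indexes
val colsNameIndexArr: Array[Int] = queryArr.map(col => header.indexOf(col))
val colsNameIndexArrBroadcast = sparkSession.sparkContext.broadcast(colsNameIndexArr)

// To get one DataFrame column per queried field, spread each array across a Row
// with Row.fromSeq instead of wrapping the whole array in a single column via Row(x)
val selectedRowRDD = rowRDD.map((attributes: Array[String]) =>
  Row.fromSeq(colsNameIndexArrBroadcast.value.map(i => attributes(i))))

val selectedSchema = StructType(queryArr.map(f => StructField(f, StringType, nullable = true)))
val selectedDF = sparkSession.createDataFrame(selectedRowRDD, selectedSchema)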