Spark: query any field and output the results with a DataFrame

Source: Internet
Author: User
Tags: databricks

When writing a Spark program, querying fields from a CSV file is usually done in one of the following ways:
(1) Query directly with a DataFrame

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use the first line of all files as the header
  .schema(customSchema)
  .load("cars.csv")
val selectedData = df.select("year", "model")
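
The spark-csv README continues this example by writing the selected columns back out to a new CSV file; a short sketch for completeness:

selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")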

Reference: https://github.com/databricks/spark-csv

The above reads a CSV file in the Spark 1.x style; the Spark 2.x syntax is different:

val df = sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()
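
As a side note, Spark 2.x also ships a built-in CSV data source, so the external com.databricks.spark.csv package is not strictly required. A minimal sketch, assuming spark is an existing SparkSession:

// Spark 2.x built-in CSV reader; "spark" is assumed to be an existing SparkSession
val df2 = spark.read
  .option("header", "true")          // Treat the first line as column names
  .option("mode", "DROPMALFORMED")   // Drop rows that cannot be parsed
  .csv("people.csv")
  .cache()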

(2) Build a case class.

case class Person(name: String, age: Long)

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toLong))
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql method provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+
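
The same documentation example also shows that the columns of a row can be accessed by field name instead of by index, using getAs:

// Access a column by field name rather than positional index
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+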

The example above is taken from the Spark 2.2.0 documentation.

Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html

Both of the approaches above work fine for testing small files whose headers hold no more than a few dozen fields, for example when I only need to query a user's name, age, and sex.

But in practice you run into these problems:
**(1) I do not know in advance which fields to query;
(2) I do not know how many fields to query.**

The examples above are not enough, so here is a third method (3):

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+

The example above, also from the Spark documentation, still uses a DataFrame, but it builds the queried field structure with StructField and StructType, and each queried field is accessed by index rather than by the concrete name or age field name. In effect, however, method (3) behaves much like examples (1) and (2): it still cannot solve the problem raised above and needs to be improved.

Example (4):

val df = sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()
var schemaString = "name,age"
// Register a temp view
df.createOrReplaceTempView("people")
// SQL query
var dataDF = sparkSession.sql("SELECT " + schemaString + " FROM people")
// Turn into an RDD
var dfRDD = dataDF.rdd
val fields = schemaString.split(",")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
var schema = StructType(fields)
// Convert the RDD back to a DataFrame
var newDF = sparkSession.createDataFrame(dfRDD, schema)

This makes it possible to solve the issues raised above: the fields to query are carried in the schemaString variable instead of being hard-coded.
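
Since the SQL string is assembled from schemaString, the field list can be decided at runtime. A minimal sketch, where the hypothetical queryFields array stands in for user input (e.g. command-line arguments or a config file):

// Hypothetical runtime input: the caller decides which fields to query
val queryFields: Array[String] = Array("name", "age")
val schemaString2 = queryFields.mkString(",")
val dataDF2 = sparkSession.sql("SELECT " + schemaString2 + " FROM people")
dataDF2.show()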

DataFrame is very fast, especially in the newer versions. Of course, in a production environment we may still use an RDD to transform the data into the desired shape. In that case, you can write this:

Extract the desired fields from each full CSV row and turn them into an array. For example, to query:

val queryArr = Array("name", "age")

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

val rowRDD2 = rowRDD.map(attributes => {
  val myAttributes: Array[String] = attributes
  // Array holding the positions of the columns you want to query, e.g. the nth column
  val myColumnsNameIndexArr: Array[Int] = colsNameIndexArrBroadcast.value
  var myColumnsNameDataArrB: ArrayBuffer[String] = new ArrayBuffer[String]()
  for (i <- 0 until myColumnsNameIndexArr.length) {
    myColumnsNameDataArrB += myAttributes(myColumnsNameIndexArr(i)).toString
  }
  val myColumnsNameDataArr: Array[String] = myColumnsNameDataArrB.toArray
  myColumnsNameDataArr
}).map(x => Row(x)).cache() // Row(x) wraps the whole array in a single-column Row

In this way, each row of the returned RDD is an array of the selected fields, and based on the number of fields you can then convert the rows into columns.
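
Note that colsNameIndexArrBroadcast and rowRDD are assumed to already exist in the snippet above. A sketch of how they might be prepared (not the author's original code): read the CSV header once, map each name in queryArr to its column position, and broadcast the index array so every executor can reuse it:

// Assumptions: "spark" is the active SparkSession and people.csv has a header line
val lines = spark.sparkContext.textFile("people.csv")
val headerLine = lines.first()
val header: Array[String] = headerLine.split(",")

// Map each requested field name to its position in the header
val colsNameIndexArr: Array[Int] = queryArr.map(name => header.indexOf(name))

// Broadcast the index array so every executor can look it up cheaply
val colsNameIndexArrBroadcast = spark.sparkContext.broadcast(colsNameIndexArr)

// rowRDD: every data row split into an Array[String], with the header dropped
val rowRDD = lines.filter(_ != headerLine).map(_.split(","))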
