When writing a Spark program, querying fields from a CSV file is usually done in one of the following ways:
(1) Query the DataFrame directly
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)
  .load("cars.csv")

val selectedData = df.select("year", "model")
Reference: https://github.com/databricks/spark-csv
The code above reads a CSV file the Spark 1.x way; in Spark 2.x the syntax is different:
val df = sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()
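As a side note, Spark 2.x also ships a built-in CSV data source, so the external spark-csv package is no longer strictly required. A minimal sketch, assuming a SparkSession instance named sparkSession and a local people.csv with a header row:

// Built-in CSV reader available since Spark 2.0
val df2 = sparkSession.read
  .option("header", "true")         // treat the first line as the header
  .option("mode", "DROPMALFORMED")  // drop rows that cannot be parsed
  .csv("people.csv")
  .cache()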
(2) Build a case class
case class Person(name: String, age: Long)

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+
This is the example from the Spark 2.2.0 documentation.
Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html
The two approaches above work fine when you are just testing a small file whose header has no more than a few dozen fields, for example when I only need to query a user's name, age, and sex.
In practice, however, you run into these problems:
**(1) I do not know in advance which fields to query;
(2) I do not know in advance how many fields to query.**
The examples above are not enough for that, so there is a third method (3):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
The example above, also from the Spark website, still uses a DataFrame, but it builds the field structure programmatically with StructField and StructType, so each queried field is accessed by index rather than by a concrete field name such as name or age. In practice, though, method (3) behaves much like methods (1) and (2): it still cannot solve the problem raised above and needs further improvement.
Example (4):
val df = sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("people.csv")
  .cache()

var schemaString = "name,age"

// Register a temp table
df.createOrReplaceTempView("people")

// SQL query
var dataDF = sparkSession.sql("select " + schemaString + " from people")

// Convert to an RDD
var dfRDD = dataDF.rdd

val fields = schemaString.split(",").map(fieldName => StructField(fieldName, StringType, nullable = true))
var schema = StructType(fields)

// Convert the RDD back to a DataFrame
var newDF = sparkSession.createDataFrame(dfRDD, schema)
This approach solves the problems raised above, because schemaString can be assembled at runtime instead of being hard-coded.
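For example, here is a minimal sketch of building schemaString from a field list that is only known at runtime; the requestedFields value is a hypothetical input used for illustration, everything else reuses the names from example (4):

// Hypothetical: the fields to query arrive at runtime (from a config file, a UI, etc.)
val requestedFields: Array[String] = Array("name", "age")

// Keep only the fields that actually exist in the CSV header
val availableFields = df.columns.toSet
schemaString = requestedFields.filter(availableFields.contains).mkString(",")

// The rest of example (4) then runs unchanged against this dynamic field list
dataDF = sparkSession.sql("select " + schemaString + " from people")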
DataFrames are very fast, especially in the newer versions. Of course, in a production environment we may still use RDDs to transform the data into the shape we need. In that case, you can write it like this:
Extract the desired fields from each full CSV row and collect them into an array. For example, to query these fields:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

val queryArr = Array("name", "age")

// rowRDD is assumed to be an RDD[Array[String]], e.g. each CSV line split on ","
val rowRDD2 = rowRDD.map(attributes => {
  val myAttributes: Array[String] = attributes
  // Indexes of the columns of the fields you want to query (e.g. the nth column),
  // taken from a broadcast variable prepared on the driver
  val myColumnsNameIndexArr: Array[Int] = colsNameIndexArrBroadcast.value
  var myColumnsNameDataArrB: ArrayBuffer[String] = new ArrayBuffer[String]()
  for (i <- 0 until myColumnsNameIndexArr.length) {
    myColumnsNameDataArrB += myAttributes(myColumnsNameIndexArr(i)).toString
  }
  val myColumnsNameDataArr: Array[String] = myColumnsNameDataArrB.toArray
  myColumnsNameDataArr
}).map(x => Row(x)).cache() // each Row wraps the array of selected values
In this way, each row of the returned RDD is an array of the selected field values, and from these arrays you can then convert the rows back into columns.
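A minimal sketch of the two pieces the snippet above leaves implicit: building the broadcast of column indexes from the CSV header, and turning the per-row arrays back into a DataFrame with one column per queried field. The header value is a hypothetical example (in practice it would come from df.columns); sparkSession, queryArr, and rowRDD are the names used above.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical CSV header, in file order; in practice take it from df.columns
val header: Array[String] = Array("name", "age", "sex")

// Map each requested field name to its position in the header and broadcast the indexes
val colsNameIndexArr: Array[Int] = queryArr.map(col => header.indexOf(col))
val colsNameIndexArrBroadcast = sparkSession.sparkContext.broadcast(colsNameIndexArr)

// To get one DataFrame column per queried field, spread each array across a Row
// with Row.fromSeq instead of wrapping the whole array in a single column via Row(x)
val selectedRowRDD = rowRDD.map((attributes: Array[String]) =>
  Row.fromSeq(colsNameIndexArrBroadcast.value.map(i => attributes(i))))

val selectedSchema = StructType(queryArr.map(f => StructField(f, StringType, nullable = true)))
val selectedDF = sparkSession.createDataFrame(selectedRowRDD, selectedSchema)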