Spark SQL InferSchema Implementation Rationale (Python)

Using Spark SQL involves registering a number of tables, and an important part of that is the schema. Spark SQL gives users two options:

(1) applySchema: the user code must explicitly specify the schema. Advantage: the data types are unambiguous. Disadvantage: with many tables there is a fair amount of boilerplate code to write.

(2) inferSchema: the user code does not specify a schema; the system infers it automatically. The code is more concise, but because the schema is inferred, it can be inferred incorrectly (i.e., it may not match the data types the user expects). We therefore need a clear understanding of the inference process in order to apply it well in practice.

This article covers Python only (spark-1.5.1). The inference process depends on SQLContext (HiveContext is a subclass of SQLContext).

InferSchema implementation: SQLContext.inferSchema has been deprecated since version 1.3 and replaced by createDataFrame. inferSchema can still be called in 1.5.1, but what actually runs is SQLContext.createDataFrame. Note the parameter samplingRatio, whose default value is None; its specific role is discussed later.

Here we only consider the case where the data type is inferred from an RDD, i.e., isinstance(data, RDD) is True, so execution proceeds to SQLContext._createFromRDD. From the call logic it can be seen that when schema is None, execution enters SQLContext._inferSchema, whose main flow is roughly three steps:

Step 1: take the first row of the RDD and require that it not be empty (note: empty, not merely None).
Step 2: if first is of type dict, output a warning: the recommended RDD element type is Row (pyspark.sql.Row); dict is deprecated.
Step 3: if samplingRatio is None, infer the schema directly from first (the first row of the RDD); if samplingRatio is not None, "filter" data according to that value to take part in the inference.

We will focus on the implementation logic of the third step.

1. samplingRatio is None

_infer_schema infers the schema from a single record, row (the first row of the RDD), in roughly four steps:

(1) If row's data type is dict, its items are taken. The items are in effect a list of key-value pairs, where each key-value pair can be understood as (column name, column value); a sort (sorted) is applied to ensure a consistent column-name order, since dict.items() does not guarantee the order of its elements.

(2) If row's data type is tuple or list, there are three sub-cases:
a. row is a Row;
b. row is a namedtuple;
c. row is any other tuple or list.

(3) If row is some other kind of object, its attribute dictionary is used.

From (1), (2), (3) it can be seen that the final logic is the same: convert the record row into a list of key-value pairs. If none of (1), (2), (3) matches, inference is considered impossible and an exception is thrown. A simplified sketch of this conversion follows below.
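To make cases (1)-(3) concrete, here is a simplified paraphrase of how one record might be reduced to an ordered list of (column name, column value) pairs before the schema itself is built in step (4). It is a sketch of the logic described above rather than the pyspark source; the synthesized _1, _2, ... names for plain tuples reflect the column names that inferred schemas end up with.

```python
# A simplified paraphrase of how _infer_schema turns one record into
# (column name, column value) pairs, following cases (1)-(3) above;
# this is a sketch of the described logic, not the pyspark source itself.
def record_to_items(row):
    if isinstance(row, dict):
        # (1) dict: sort by key so the column order is deterministic.
        return sorted(row.items())
    if isinstance(row, (tuple, list)):
        if hasattr(row, "__fields__"):       # (2a) pyspark.sql.Row
            return list(zip(row.__fields__, tuple(row)))
        if hasattr(row, "_fields"):          # (2b) namedtuple
            return list(zip(row._fields, tuple(row)))
        # (2c) any other tuple/list: synthesize column names _1, _2, ...
        return [("_%d" % i, v) for i, v in enumerate(row, 1)]
    if hasattr(row, "__dict__"):             # (3) an ordinary object
        return sorted(row.__dict__.items())
    raise TypeError("Can not infer schema for type: %s" % type(row))
```

For instance, record_to_items(("alice", 5)) yields [("_1", "alice"), ("_2", 5)], while a Row or namedtuple yields its values paired with its own field names.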
(4) Creating the schema (StructType): each key-value pair in items corresponds to a StructField, which describes one column and receives three parameters: the column name, the column type, and whether the column may contain None (nullable). The column name is the key; the column type has to be inferred from the value (_infer_type); nullable is left at its default, i.e., the column may contain None. Iterating over the key-value pairs in items yields a list of StructFields, and finally the schema is created through StructType.

This is the process of creating a schema from one row of the RDD. It does not yet cover how the concrete data types are inferred, so we also need to look at _infer_type. _infer_type infers a type from the passed-in obj and returns a type instance; there are six situations to handle:

(1) If obj is None, the type is NullType.

(2) A case the author honestly did not understand, so no explanation is given.

(3) Try to look up the corresponding Spark SQL type (dataType) directly in _type_mappings based on type(obj). _type_mappings is a dictionary that records, in advance, the correspondence between certain Python types and Spark SQL data types. If the looked-up dataType is not None, an instance of that type is returned directly; DecimalType is handled specially: since real data may have inconsistent precision and scale, it is uniformly treated as precision 38, scale 18. If dataType is None, obj is a composite data type (array, dictionary, or struct).

(4) If obj's type is dict, its key type and value type are inferred (recursive calls to _infer_type), and a MapType instance is constructed and returned. To infer the key and value types, only one key-value pair is used: one whose key and value are both not None. If there are several such pairs, the choice is effectively random (it depends on dict.items()); if no such pair can be found, the key and value types are taken to be NullType.

(5) If obj's type is list or array, one element that is not None is selected to infer the element type (recursive call to _infer_type); if no non-None element can be found, the element type is NullType. Finally an ArrayType instance is constructed and returned.

(6) If (1) through (5) cannot complete the inference, obj is assumed to possibly (only possibly) be a struct type (StructType), and _infer_schema is used to infer its type.

2. samplingRatio is not None

With samplingRatio set to None, only the first row of the RDD takes part in the inference, which places high demands on the "quality" of that row; in some cases it simply cannot represent the whole dataset. In that situation we can explicitly set samplingRatio to "filter" enough data into the inference process.

If the value of samplingRatio is less than 0.99, the RDD sample API is used to select a portion of the data (based on samplingRatio) to take part in the inference; otherwise, all records of the RDD participate. The inference process can then be understood as two steps: (1) each row of the RDD has a schema inferred by _infer_schema (map); (2) these schemas are aggregated (reduce). A small usage sketch follows below.
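As a small illustration of how these pieces surface through the public API, the sketch below assumes an existing SparkContext named sc and the Spark 1.5.x SQLContext interface; the mapping comments restate the rules above rather than output copied from a run.

```python
# A minimal sketch, assuming an existing SparkContext `sc` and the Spark 1.5.x
# SQLContext API: composite Python values are inferred as ArrayType/MapType
# columns, and samplingRatio controls how many rows take part in the inference.
from decimal import Decimal
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)
rows = sc.parallelize([
    Row(name=u"alice",            # str/unicode -> StringType
        age=30,                   # int         -> LongType
        balance=Decimal("1.5"),   # Decimal     -> DecimalType(38, 18)
        tags=[u"a", u"b"],        # list        -> ArrayType(StringType)
        props={u"k": 1}),         # dict        -> MapType(StringType, LongType)
    Row(name=u"bob", age=20, balance=Decimal("2.0"),
        tags=[u"c"], props={u"k": 2}),
])

# samplingRatio left at None: only the first row is inspected by _infer_schema.
sqlContext.createDataFrame(rows).printSchema()

# samplingRatio >= 0.99: every row of the RDD is mapped through _infer_schema
# and the per-row schemas are reduced with _merge_type.  A value below 0.99
# would instead go through rdd.sample(False, samplingRatio) first, so on a
# tiny RDD like this one the sample could even come back empty.
sqlContext.createDataFrame(rows, samplingRatio=1.0).printSchema()
```

The second call only shows where samplingRatio enters the picture; on real data, a value well below 1.0 is what makes sampling worthwhile.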
We now focus on the implementation logic of the aggregation, which is carried out by the method _merge_type. There are six situations to handle (a simplified paraphrase of these rules is sketched below):

(1) If a is an instance of NullType, the type of b is returned.

(2) If a is not an instance of NullType and b is an instance of NullType, the type of a is returned.

(3) If a and b are of different types, an exception is thrown.

The remaining cases assume a and b are of the same type.

(4) If a's type is StructType (struct), the fields of a are used as the template and each is merged with the corresponding field of b (recursive call to _merge_type); the fields (types) present in b but not in a (the difference set b - a) are then appended.

(5) If a's type is ArrayType (array), the element types of the two are merged (recursive call to _merge_type).

(6) If a's type is MapType (dictionary), the key types of the two are merged, and so are the value types (recursive calls to _merge_type).

Personally, I feel the current type-merging logic is too simplistic to be of much use in practice.
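For readers who prefer code to prose, here is a simplified paraphrase of the six merge rules, written against the public pyspark.sql.types classes; it is a sketch of the logic summarized above, not the actual library implementation.

```python
# A simplified paraphrase of the _merge_type rules summarized above; a sketch
# of the described logic, not the pyspark source itself.
from pyspark.sql.types import (NullType, StructType, StructField,
                               ArrayType, MapType, LongType)

def merge_type(a, b):
    # (1)/(2) NullType always gives way to the other side.
    if isinstance(a, NullType):
        return b
    if isinstance(b, NullType):
        return a
    # (3) Two different concrete types cannot be merged.
    if type(a) is not type(b):
        raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
    # (4) Structs: merge fields with the same name, then append the fields
    #     that only exist in b.
    if isinstance(a, StructType):
        b_types = dict((f.name, f.dataType) for f in b.fields)
        fields = [StructField(f.name,
                              merge_type(f.dataType,
                                         b_types.get(f.name, NullType())),
                              True)
                  for f in a.fields]
        a_names = set(f.name for f in a.fields)
        fields.extend(f for f in b.fields if f.name not in a_names)
        return StructType(fields)
    # (5) Arrays: merge the element types.
    if isinstance(a, ArrayType):
        return ArrayType(merge_type(a.elementType, b.elementType), True)
    # (6) Maps: merge the key types and the value types.
    if isinstance(a, MapType):
        return MapType(merge_type(a.keyType, b.keyType),
                       merge_type(a.valueType, b.valueType), True)
    # Same simple type on both sides: keep it.
    return a

# e.g. a NullType column seen in one row gives way to the LongType seen in
# another row once the two per-row schemas are merged:
print(merge_type(StructType([StructField("age", NullType(), True)]),
                 StructType([StructField("age", LongType(), True)])))
```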
