PySpark histogram explained

Source: Internet
Author: User
Tags: pyspark

I have recently been learning Spark and mainly program against the PySpark API.

There is little Chinese-language explanation online, and the official API documentation is not easy to follow, so I am recording my own understanding here, both as a reference for others and for my own review.

This post introduces pyspark.RDD.histogram.

histogram(buckets)

The input parameter buckets can be a number or a list.

The output is a tuple of two lists: the bucket boundaries and the number of elements in each bucket.

For example, take an RDD holding the integers 0 through 50.

When buckets is a number:

>>> rdd = sc.parallelize(range(51))
>>> rdd.histogram(2)
([0, 25, 50], [25, 26])
With buckets set to 2, the output has two parts: [0, 25, 50] are the bucket boundaries, and [25, 26] are the number of elements that fall in each bucket.

When buckets is a list:

>>> rdd.histogram([0, 5, 25, 50])
([0, 5, 25, 50], [5, 20, 26])
With buckets set to [0, 5, 25, 50], the output again has two parts: [0, 5, 25, 50] are the bucket boundaries, and [5, 20, 26] are the number of elements in each bucket.

>>> rdd = sc.parallelize(["AB", "AC", "B", "BD", "EF"])
>>> rdd.histogram(("A", "B", "C"))
(('A', 'B', 'C'), [2, 2])
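The counts in these examples can be checked without a Spark cluster. The following is a minimal pure-Python sketch, not PySpark's implementation: local_histogram is a hypothetical helper, and it assumes the interval rule from the summary (every bucket is closed on the left and open on the right, except the last, which is closed on both ends) and that values outside the bucket range are not counted, which is consistent with "EF" being left out of the string example.

```python
from bisect import bisect_right

def local_histogram(values, buckets):
    """Hypothetical helper: count values into sorted bucket boundaries.

    Each bucket is [lo, hi), except the last one, which is [lo, hi].
    """
    buckets = list(buckets)
    counts = [0] * (len(buckets) - 1)
    lowest, highest = buckets[0], buckets[-1]
    for v in values:
        if v < lowest or v > highest:
            continue  # assumption: out-of-range values are skipped
        # Index of the last boundary that is <= v.
        i = bisect_right(buckets, v) - 1
        # A value equal to the final boundary belongs to the last bucket.
        if i == len(counts):
            i -= 1
        counts[i] += 1
    return buckets, counts

print(local_histogram(range(51), [0, 5, 25, 50]))
print(local_histogram(["AB", "AC", "B", "BD", "EF"], ("A", "B", "C")))
```

Strings are compared lexicographically, so "AB" and "AC" fall in the first bucket, "B" and "BD" in the second, and "EF" lies beyond the last boundary.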

To summarize:

1. histogram computes the distribution of the RDD's elements over the given buckets. The buckets parameter can be a number or a list.

2. Every bucket is closed on the left and open on the right, except the last bucket, which is closed on both ends.

For example, the boundaries [0, 25, 50] define the buckets [0, 25) and [25, 50], i.e. 0 <= x < 25 and 25 <= x <= 50.

3. The bucket boundaries must be sorted, must not contain duplicates, and must have at least two elements.

For example, [0, 25, 50] is sorted in ascending order and has at least two elements.

4. If buckets is a number, histogram generates that many buckets, evenly spaced between the minimum and maximum values of the RDD.

For example, if the minimum value is 0, the maximum value is 100, and buckets is 2, the resulting buckets are [0, 50) and [50, 100].

buckets must be at least 1. If the RDD contains infinity, an exception is raised.

If the elements in the RDD do not vary (min equals max), a single bucket is always returned.
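Points 3 and 4 can be illustrated with a short pure-Python sketch. This is not PySpark's source code: check_buckets and even_buckets are hypothetical helpers, and the error messages are made up for illustration. The sketch only mirrors the rules listed above.

```python
from math import isinf

def check_buckets(buckets):
    """Hypothetical helper: enforce the constraints from point 3."""
    if len(buckets) < 2:
        raise ValueError("buckets need at least two boundaries")
    if list(buckets) != sorted(buckets):
        raise ValueError("buckets must be sorted")
    if len(set(buckets)) != len(buckets):
        raise ValueError("buckets must not contain duplicates")

def even_buckets(minv, maxv, n):
    """Hypothetical helper: n + 1 boundaries evenly spaced from minv to maxv."""
    if n < 1:
        raise ValueError("number of buckets must be at least 1")
    if isinf(minv) or isinf(maxv):
        raise ValueError("cannot generate buckets with an infinite value")
    if minv == maxv:
        return [minv, maxv]  # elements do not vary: a single bucket
    step = (maxv - minv) / n
    boundaries = [minv + i * step for i in range(n)]
    boundaries.append(maxv)  # reuse maxv itself to avoid floating-point drift
    return boundaries

print(even_buckets(0, 100, 2))
check_buckets(even_buckets(0, 100, 2))  # passes silently
```

With a minimum of 0, a maximum of 100, and 2 buckets, the boundaries land at 0, 50, and 100, matching the example in point 4.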
