I have been learning Spark recently, programming mainly with the PySpark API.
There is not much Chinese-language material online, and the official API documentation is not always easy to follow, so I am recording my own understanding here, both as a reference for others and for my own review.
This post introduces pyspark.RDD.histogram.
histogram(buckets)
The input parameter buckets can be a number or a list.
The output is a tuple with two parts: the bucket boundaries and the histogram (the count in each bucket).
For example, take an RDD holding the integers 0 to 50.
When buckets is a number:
>>> rdd = sc.parallelize(range(51))
>>> rdd.histogram(2)
([0, 25, 50], [25, 26])
The parameter buckets is 2. The output has two parts: [0, 25, 50] holds the bucket boundaries, and [25, 26] is the frequency, i.e. the number of elements that fall in each bucket.
When buckets is a list:
>>> rdd.histogram([0, 5, 25, 50])
([0, 5, 25, 50], [5, 20, 26])
The parameter buckets is [0, 5, 25, 50]. The output again has two parts: [0, 5, 25, 50] holds the bucket boundaries, and [5, 20, 26] is the frequency within each bucket.
Buckets can also be strings:
>>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
>>> rdd.histogram(("a", "b", "c"))
(('a', 'b', 'c'), [2, 2])
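The counts in the list and string examples above can be checked without a Spark cluster. Below is a minimal pure-Python sketch of the bucketing rule, not PySpark's actual implementation. The name histogram_sketch is hypothetical, and the code assumes that values outside the overall bucket range are simply skipped, which is consistent with "ef" not being counted in the string example.

```python
from bisect import bisect_right


def histogram_sketch(values, buckets):
    """Count values per bucket (hypothetical helper, not part of PySpark).

    Every bucket is [lo, hi) except the last one, which is [lo, hi].
    """
    buckets = list(buckets)
    if len(buckets) < 2:
        raise ValueError("buckets needs at least two elements")
    if buckets != sorted(buckets):
        raise ValueError("buckets must be sorted")
    if len(set(buckets)) != len(buckets):
        raise ValueError("buckets must not contain duplicates")

    counts = [0] * (len(buckets) - 1)
    lowest, highest = buckets[0], buckets[-1]
    for v in values:
        if v < lowest or v > highest:
            continue  # assumption: out-of-range values are ignored
        # bisect_right locates the bucket whose left edge is <= v;
        # a value equal to the final edge is folded into the last bucket.
        idx = min(bisect_right(buckets, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


numeric_counts = histogram_sketch(range(51), [0, 5, 25, 50])
string_counts = histogram_sketch(["ab", "ac", "b", "bd", "ef"], ("a", "b", "c"))
print(numeric_counts)
print(string_counts)
```

The two calls mirror the RDD examples above, so the printed counts can be compared directly with the second element of the tuples returned by histogram.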
To summarize:
1. histogram computes the histogram of an RDD according to the given buckets parameter, which can be either a number or a list.
2. Every bucket in the result is open on the right, except the last one, which is closed.
For example, the buckets [0, 25, 50] mean the intervals [0, 25) and [25, 50], i.e. 0 <= x < 25 and 25 <= x <= 50.
3. The buckets must be sorted, contain no duplicate elements, and have at least two elements.
For example, [0, 25, 50] is sorted in ascending order, has no duplicates, and contains at least two elements.
4. If the parameter buckets is a number, histogram generates that many buckets, evenly spaced between the minimum and maximum values of the RDD.
For example, if the minimum is 0, the maximum is 100, and buckets is 2, the resulting buckets are [0, 50) and [50, 100].
The number of buckets must be at least 1, and an exception is raised if the RDD contains infinity.
If the elements of the RDD do not vary (min equals max), a single bucket is always used.
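The way evenly spaced buckets follow from a bucket count can be sketched in plain Python. This is only an illustration of the rule described in point 4, not Spark's code. The name even_buckets is hypothetical, and returning [min, max] as the single bucket when min equals max is an assumption made for this sketch.

```python
import math


def even_buckets(min_value, max_value, n):
    """Return n evenly spaced bucket boundaries (hypothetical helper)."""
    if n < 1:
        raise ValueError("the number of buckets must be at least 1")
    if math.isinf(min_value) or math.isinf(max_value):
        raise ValueError("cannot generate buckets with an infinite value")
    if min_value == max_value:
        # assumption: one bucket whose two edges coincide
        return [min_value, max_value]
    width = (max_value - min_value) / n
    edges = [min_value + i * width for i in range(n)]
    edges.append(max_value)  # use the exact maximum as the final edge
    return edges


two_buckets = even_buckets(0, 100, 2)
single_bucket = even_buckets(7, 7, 3)
print(two_buckets)
print(single_bucket)
```

Appending the maximum directly, rather than computing it as min + n * width, keeps the last edge exact even when the width is a rounded float.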