http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html
How to shard data in Spark Streaming
Level of Parallelism in Data Processing
Cluster resources can be under-utilized if the number of parallel tasks used in any stage of the computation is not high enough. For example, for distributed reduce operations like reduceByKey and reduceByKeyAndWindow, the default number of parallel tasks is controlled by the spark.default.parallelism configuration property. You can pass the level of parallelism as an argument (see the PairDStreamFunctions documentation), or set the spark.default.parallelism configuration property to change the default.
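To see why the number of partitions bounds the parallelism of a reduce, here is a minimal, Spark-free sketch of the underlying idea: keys are hash-partitioned into numPartitions buckets, and each bucket can then be reduced by an independent task. (This is an illustration of the technique, not Spark's actual implementation; the class and method names are made up for the example.)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionedReduce {

    // Partition (word, count) pairs by key hash, summing counts per key
    // within each partition. Every occurrence of a key lands in the same
    // partition, so each partition's reduce is independent of the others --
    // with numPartitions buckets there can be up to numPartitions
    // parallel reduce tasks.
    static Map<Integer, Map<String, Integer>> reduceByKey(
            List<Map.Entry<String, Integer>> pairs, int numPartitions) {
        Map<Integer, Map<String, Integer>> partitions = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            int part = Math.floorMod(pair.getKey().hashCode(), numPartitions);
            partitions.computeIfAbsent(part, p -> new HashMap<>())
                      .merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("spark", 1),
                Map.entry("streaming", 1),
                Map.entry("spark", 1));
        Map<Integer, Map<String, Integer>> result = reduceByKey(pairs, 5);
        // Both "spark" entries were merged inside a single partition.
        int sparkCount = result.values().stream()
                .mapToInt(m -> m.getOrDefault("spark", 0)).sum();
        System.out.println("spark=" + sparkCount);  // prints "spark=2"
    }
}
```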
For example: SparkConf sparkConf = new SparkConf().setAppName("NAME").set("spark.default.parallelism", "5");
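The other route mentioned above is to pass the number of partitions directly to the shuffle operation instead of changing the global default. A sketch against the Java streaming API (the words stream and the variable names are illustrative; the reduceByKey(func, numPartitions) overload is documented in PairDStreamFunctions/JavaPairDStream):

```java
// Count words in a DStream, requesting 5 parallel reduce tasks for this
// operation only, instead of relying on spark.default.parallelism.
JavaPairDStream<String, Integer> pairs =
        words.mapToPair(w -> new Tuple2<>(w, 1));

JavaPairDStream<String, Integer> counts =
        pairs.reduceByKey((a, b) -> a + b, 5);
```

This keeps the global default untouched, which is useful when one heavily-skewed stage needs more reduce tasks than the rest of the job.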
Spark Learning Notes: Spark Streaming