1. Background overview
A business requirement calls for inner-joining the data arriving from the middleware against an existing dimension table in real time, for use in later statistics. The dimension table is huge, with nearly 30 million records (about 3 GB of data), and the cluster's resources are tight, so we want to squeeze as much performance and throughput out of Spark Streaming as possible.
Spark Streaming 1.2 provides a WAL-based fault-tolerance mechanism (see the previous blog post http://blog.csdn.net/yangbutao/article/details/44975627), which guarantees that each piece of data is processed at least once.
However, it does not guarantee exactly-once processing. For example, after the Kafka receiver writes data to the WAL, writing the offset to ZooKeeper
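A toy simulation can make the at-least-once behaviour concrete. The sketch below is not Spark or Kafka code: it assumes a failure that happens after the WAL write but before the offset commit, and every name in it (run_receiver, crash_before_commit) is hypothetical.

```python
# Toy model: a receiver persists records to a write-ahead log (WAL) first,
# then commits the consumed offset. All names here are hypothetical.

def run_receiver(source, start_offset, wal, crash_before_commit=False):
    """Write source[start_offset:] to the WAL, then return the committed offset.

    If crash_before_commit is True, the WAL write succeeds but the offset
    commit is lost, so the old offset is returned unchanged.
    """
    records = source[start_offset:]
    wal.extend(records)                  # step 1: data is durable in the WAL
    if crash_before_commit:
        return start_offset              # step 2 (offset commit) never happened
    return start_offset + len(records)   # step 2: offset commit succeeded


source = ["a", "b", "c"]
wal = []

# First attempt: crash after the WAL write, before the offset commit.
committed = run_receiver(source, 0, wal, crash_before_commit=True)

# After the restart the receiver resumes from the last committed offset
# (still 0), so the same records are written, and later processed, again.
committed = run_receiver(source, committed, wal)

print(wal)        # each record appears at least once, here twice
print(committed)  # 3
```

Nothing is lost, but every record is seen twice, which is why the downstream computation has to tolerate duplicates unless offsets and results are committed together.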
We must find a good balance between the two parameters, because we do not want the data blocks to be too large, nor do we want to wait too long for data locality. We want all tasks to complete within a few seconds.
Therefore, we lowered the locality wait from 3 s to 1 s and set the block interval to 1.5 s.
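To see what the block interval trades off, recall that with a receiver each block interval yields one block, and each block becomes one task of the batch. The arithmetic below is a rough sketch; the 6 s batch interval is an assumption for illustration only, since the text does not state the one actually used.

```python
def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int) -> int:
    """Approximate number of blocks (and thus tasks) one receiver yields per batch."""
    return batch_interval_ms // block_interval_ms


BATCH_INTERVAL_MS = 6000  # assumed value; not given in the text

default_tasks = blocks_per_batch(BATCH_INTERVAL_MS, 200)   # Spark's default block interval
tuned_tasks = blocks_per_batch(BATCH_INTERVAL_MS, 1500)    # the 1.5 s block interval

print(default_tasks)  # 30
print(tuned_tasks)    # 4
```

A larger block interval means fewer, larger tasks per batch, which is why it has to be balanced against how long the scheduler waits for a data-local slot.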
--conf "spark.locality.wait=1s" --conf "spark.streaming.blockInterval=1500ms" \
2.6 Merge temporary files
On the ext4 file system, we recommend that you enable
includes Spark, Mesos, Akka, Cassandra, and Kafka, with the following features:
Contains lightweight toolkits that are widely used in big data processing scenarios
Powerful community support with open source software that is well-tested and widely used
Ensures scalability and data backup at low latency.
A unified cluster management platform for managing applications with diverse workloads
1. Demystifying the Spark Streaming operating mechanism
In the last lesson we talked about searching for the "dragon vein" of the technology industry. As in traditional feng shui, every area has its own dragon vein: Spark is where the dragon vein lies, and its dragon's lair, or key point, is Spark Streaming. This was one of the clear conclusions of the last lesson. And in the last lesson
-snapshot.tar.gz
cd /var/lib/tomcat7/webapps
cp /srv/jstorm/jstorm-ui-0.9.6.2.war ./
mv ROOT ROOT.old
ln -s jstorm-ui-2.0.4-snapshot ROOT
2. zookeeper-web-ui
2.1. Download
3. JStorm integration with Apache
3.1 Loading the AJP module in Apache
Apache 2.2 and above can use AJP, which is simple and convenient.
Execute the following command to view the modules that Apache has loaded:
apachectl -t -D DUMP_MODULES
Execute the following command to load the proxy_ajp module:
a2enmod proxy_ajp
You can use the view command to view the modules that
Architecture background
Spark parameter optimization
Increase executor-cores
Resize executor-memory
num-executors
Set first deal
Decompression policy
x Message queue bug bypass
PHP end limit
Processing action 1: processing speed increased from 1 to 10
Peak period
Non-peak status description
Increased from 10 to 50
Peak
Off-peak status description
Use pipeline to raise the QPS of Redis
50 to a full-scale PM period
Peak state analysis
Architecture
back
, on the Hadoop monitor page you can see it stuck in the pending state, although I repeatedly deleted and re-created it multiple times.
6. After a while, we finally got a new environment and, following the document (https://kyligence.gitbooks.io/kap-manual/content/zh-cn/quickstart/quickstart_cdh.cn.html), partially configured YARN as follows. Then I submitted the command and finally saw the expected results! (Cost: 1 day)
Command:
curl -X PUT --user admin:kylin -H "Content-type:application/json;charset=utf-8" -d '{"S
I followed the Spark and Kafka tutorials step by step, but running the KafkaWordCount example never produced the expected output. When it works, the output looks roughly like this:
......
-------------------------------------------
Time: 1488156500000 ms
-------------------------------------------
(4,5)
(8,12)
(6,14)
(0,19)
(2,11)
(7,20)
(5,10)
(9,9)
(3,9)
(1,11)
...
In fact, only:
......
----------------------
To better understand how the Spark Streaming sub-framework processes data, you first need to grasp its most basic concepts.
1. Discretized stream (DStream): this is Spark Streaming's abstraction for a continuous real-time data stream. A real-time data stream that we are working on, in
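The idea of discretizing a continuous stream can be sketched in a few lines. This is a plain-Python illustration, not the Spark API: it groups timestamped events into consecutive micro-batches, which is roughly how a DStream is represented as a sequence of per-interval datasets. The function name and the sample events are made up for the example.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches.

    Batch k holds the events with
    k * batch_interval <= timestamp < (k + 1) * batch_interval.
    Intervals without events produce an empty batch.
    """
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // batch_interval)].append(value)
    if not batches:
        return []
    return [batches.get(k, []) for k in range(max(batches) + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
micro_batches = discretize(events, batch_interval=1.0)

for k, batch in enumerate(micro_batches):
    print(k, batch)
```

Each inner list stands for the data of one batch interval; the third interval received nothing, so its batch is empty.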
1. Joining data streams from different time slices
After the first attempt, I looked at the logs in the Spark web UI and found that, because Spark Streaming runs every second to process the data in real time, the program had to read HDFS every second to fetch the data for the inner join.
Spark Streaming would normally cache the data it is processing to reduce I/O and incr
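The cost of re-reading the dimension data for every batch, and the effect of loading it once, can be shown with a small stand-alone sketch. This is plain Python rather than Spark code; load_from_storage is a hypothetical stand-in for the HDFS read, and in Spark itself a cached RDD or a broadcast variable would typically play the role of the cached table.

```python
class DimensionTable:
    """Loads the dimension data once and reuses it for every micro-batch."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical function standing in for an HDFS read
        self._cache = None
        self.loads = 0          # counts how often the expensive read happened

    def get(self):
        if self._cache is None:
            self._cache = dict(self._loader())
            self.loads += 1
        return self._cache


def inner_join(batch, dim):
    """Keep only the records whose key exists in the dimension table."""
    return [(key, value, dim[key]) for key, value in batch if key in dim]


def load_from_storage():
    # Stand-in for the expensive read of the dimension table.
    return [(1, "phone"), (2, "tv")]


table = DimensionTable(load_from_storage)
batches = [[(1, 10), (3, 5)], [(2, 7)], [(1, 1), (2, 2)]]
results = [inner_join(batch, table.get()) for batch in batches]

print(results)
print(table.loads)  # 1
```

Three batches are joined, but the dimension data is read only once; the record with key 3 is dropped because an inner join keeps matching keys only.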
The contents of this lesson:
1. Spark Streaming job architecture and operating mechanism
2. Spark Streaming job fault-tolerance architecture and operating mechanism
Understanding the entire architecture and operating mechanism of the Spark s
Spark Streaming and Storm are currently popular real-time stream computing frameworks that are widely used in real-time computing scenarios. Spark Streaming is a Spark-based extension that appeared later than Storm. This chapter
First, development in Java
1. Pre-development preparation: assume that you have already set up the Spark cluster.
2. The development environment is an Eclipse Maven project, to which the Spark Streaming dependency must be added.
3. Spark Streaming is calculated based on
Contents of this issue:
Spark Streaming dynamic resource allocation
Spark Streaming dynamic control of the consumption rate
Why dynamic processing is required: by default Spark uses coarse-grained resource allocation, that is, resources are allocated before the computation starts. Coarse granularity has a benefit
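The idea behind dynamically controlling the consumption rate can be illustrated with a deliberately simple proportional rule. This is a sketch for intuition only, not the estimator Spark actually uses; next_rate and its parameters are hypothetical names.

```python
def next_rate(current_rate, processing_time_s, batch_interval_s, min_rate=1.0):
    """Scale the ingestion rate (records/s) so a batch takes about one batch interval."""
    if processing_time_s <= 0:
        return current_rate  # no measurement yet, keep the current rate
    return max(min_rate, current_rate * batch_interval_s / processing_time_s)


# Batches take 2 s to process but arrive every 1 s: the job is falling behind.
slowed = next_rate(1000.0, processing_time_s=2.0, batch_interval_s=1.0)

# Batches take 0.5 s to process: there is headroom to ingest more.
raised = next_rate(1000.0, processing_time_s=0.5, batch_interval_s=1.0)

print(slowed)  # 500.0
print(raised)  # 2000.0
```

Spark 1.5 and later ship a backpressure setting (spark.streaming.backpressure.enabled) that adjusts the receiving rate automatically along similar lines.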
Spark version is 1.0
Kafka version is 0.8
Let's take a look at the architecture diagram of Kafka. For more information, please refer to the official
I have three machines for Kafka log collection:
A 192.168.1.1 for server
B 192.168.1.2 for producer
C 192.168.1.3 for consumer
First, execute the following command in the Ka
Contents of this issue:
1. Case review and demonstration: dynamically computing the most popular products per category online
2. Working through the Spark Streaming runtime source code based on the case
First, the case code
Dynamically compute the hottest product rankings in different e-commerce categories, such as the three hottest phones in the phone category, the three hottest TVs in the TV category, and so on.
package com.dt.sp
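The ranking described above, the hottest three products within each category, can be sketched independently of Spark. The following plain-Python example is not the lesson's code; the click data and the function name are invented for illustration.

```python
from collections import Counter, defaultdict


def top_n_per_category(clicks, n=3):
    """clicks: iterable of (category, product) pairs.

    Returns {category: [(product, count), ...]} with at most n entries per
    category, ordered by descending count and then by product name.
    """
    counts = defaultdict(Counter)
    for category, product in clicks:
        counts[category][product] += 1
    return {
        category: sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        for category, counter in counts.items()
    }


clicks = [
    ("phone", "p1"), ("phone", "p2"), ("phone", "p1"), ("phone", "p3"),
    ("phone", "p4"), ("phone", "p2"), ("phone", "p1"),
    ("tv", "t1"), ("tv", "t2"), ("tv", "t1"),
]
ranking = top_n_per_category(clicks)

print(ranking["phone"])
print(ranking["tv"])
```

In a streaming job the same count-then-rank step would be applied to each batch or window of click events.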
) The exception occurs because, when Kafka reads the log at the specified offsets (here 264245135 to 264251742), the log is too large and its total size exceeds the value set for fetch.message.max.bytes (default 1024*1024), which causes this error. The workaround is to increase the value of fetch.message.max.bytes in the Kafka client parameters.
For example:
//
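As a hedged illustration of the workaround, the consumer parameters could be assembled like this. The address, the group name, and the tenfold increase are placeholder assumptions; only fetch.message.max.bytes and its default come from the text above.

```python
DEFAULT_FETCH_MESSAGE_MAX_BYTES = 1024 * 1024  # default stated above

kafka_params = {
    "zookeeper.connect": "localhost:2181",  # placeholder address
    "group.id": "example-group",            # placeholder consumer group
    # Assumed tenfold increase; choose a value larger than the biggest fetch.
    "fetch.message.max.bytes": str(10 * DEFAULT_FETCH_MESSAGE_MAX_BYTES),
}

print(kafka_params["fetch.message.max.bytes"])  # 10485760
```

The value has to be at least as large as the biggest message the consumer is expected to fetch.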