For network-related big data analysis, is a Kafka + Spark + Hadoop architecture better, or an ELK stack? Machine learning aside, the main uses are Spark SQL and Spark Streaming for time-series processing and aggregation queries. I have found that ELK can deliver the same functionality, and ELK is comparatively lightweight and easier to deploy and maintain.
These are really not the same field.
ELK is mainly for search and logs; it is not well suited to big data statistics. Of course, if the data volume is small, or your use case can be served on top of the data you already have, it will hold up, but for statistical analysis it cannot compare with Hadoop, Spark, and stream processing. The technology stacks and tool chains are also worlds apart.
When comparing the two stacks, consider two points:
1. Data volume: a Spark deployment can potentially process far larger volumes, but most business requirements are nowhere near the performance limits of either Spark Streaming or ELK.
2. Complexity of the computation: for complex computations the Spark API offers far more expressive power, and it is easier to write adequate unit tests against it than against ELK's configuration language. A sketch of that point follows.
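To make the unit-testing point concrete, here is a minimal sketch (mine, not the answer's): if the aggregation lives in a plain function from DataFrame to DataFrame, a test can exercise it against a tiny local SparkSession. The column names (`ts`, `host`, `bytes`) are hypothetical.

```python
# Minimal sketch: a unit-testable Spark aggregation. Column names are
# hypothetical; the point is that the logic is an ordinary function.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def bytes_per_host_per_minute(logs: DataFrame) -> DataFrame:
    """Aggregate traffic volume per host over one-minute windows."""
    return (logs
            .groupBy(F.window("ts", "1 minute"), "host")
            .agg(F.sum("bytes").alias("total_bytes")))

if __name__ == "__main__":
    spark = SparkSession.builder.master("local[2]").appName("demo").getOrCreate()
    # A tiny in-memory DataFrame stands in for the real log stream,
    # which is exactly how a unit test would drive this function.
    sample = spark.createDataFrame(
        [("2024-01-01 00:00:10", "a", 100), ("2024-01-01 00:00:50", "a", 200)],
        ["ts_str", "host", "bytes"],
    ).withColumn("ts", F.to_timestamp("ts_str"))
    bytes_per_host_per_minute(sample).show()  # one row: host "a", 300 bytes
```

An equivalent ELK pipeline would spread the same logic across Logstash filters and an aggregation query, which is harder to test in isolation.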
Author: Yinfeng Qin
Link: https://www.zhihu.com/question/35214783/answer/128798385
Source: Zhihu
Copyright belongs to the author. For commercial reprints please contact the author for authorization; for non-commercial reprints please credit the source.
Author: Yen Yun Cloud Computing
Link: https://www.zhihu.com/question/35214783/answer/150224381
Source: Zhihu
Copyright belongs to the author. For commercial reprints please contact the author for authorization; for non-commercial reprints please credit the source.
Rather than pitting Spark against ELK, it is better to let them join hands and complement each other.
Here is a combination of Spark and Lucene for reference: it gives Spark a qualitative performance improvement, and it makes Lucene's functionality more complete.
A cheaper implementation of Spark-based sorting (with Spark performance tests attached)
Sorting is a hard requirement for many log systems (for example, ordering by time). If a big data system cannot sort, it is basically unusable; sorting is a "must-have" for a big data system. Whether the system is built on Hadoop, Spark, Impala, or Hive, sorting is essential, and so is sorting performance testing.
The global Sort Benchmark ranking, something like the Olympics of computing, is held every year, and every year the giants pour enormous investment into sorting; that is how important sorting speed is. For most enterprises, however, a hardware investment of hundreds of millions is out of reach, even far beyond the project budget. Is there a cheaper approach than the big data world's brute-force sorting? There is.
Here we introduce a new, inexpensive sorting method, which we call Blocksort.
For 500 GB of data (30 billion records), using only four virtual machines (16 cores, 32 GB of memory, and a gigabit NIC each), sorting completes in 2 to 15 seconds (you can sort the whole table, or sort after applying any combination of filters).
First, the basic idea is as follows, as shown in the figure below:
1. Pre-partition the data by value range, for example into three blocks: large, medium, and small.
2. If you want the largest values, you only need to look inside the "large" block.
3. The structure is hierarchical: if a block still holds too much data, descend into its sub-blocks; the data can be divided across multiple levels for sorting.
4. In this way, even for tens of billions of records (of long type, say), the worst extreme case needs only 2048 file seeks to filter out the result.
[Figure: hierarchical block partitioning for Blocksort]
See how simple the principle is? However large the data volume grows, the number of sort-and-lookup operations stays fixed. A toy sketch of the idea follows.
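Below is a toy sketch of the Blocksort idea (mine, not YDB's actual code): values are range-partitioned into nested blocks once, at ingest time, so a top-N query only ever touches the few highest blocks instead of sorting everything. The fan-out of 4 and depth of 2 are chosen for readability, not taken from YDB.

```python
# Toy sketch of the Blocksort idea; fan-out and depth are illustrative,
# not YDB's real parameters.

FANOUT = 4  # blocks per level

def build_blocks(values, depth=2):
    """Range-partition the data into nested blocks (done once, at ingest)."""
    values = sorted(values)
    if depth == 0 or len(values) <= FANOUT:
        return values                       # leaf block: small and sorted
    step = len(values) // FANOUT
    cuts = [i * step for i in range(FANOUT)] + [len(values)]
    return [build_blocks(values[cuts[i]:cuts[i + 1]], depth - 1)
            for i in range(FANOUT)]

def top_n(block, n):
    """Find the n largest values, touching only the highest-range blocks.
    The number of blocks visited is bounded by depth * fan-out, so the
    lookup cost stays fixed no matter how large the dataset grows."""
    if not block or not isinstance(block[0], list):
        return block[-n:]                   # leaf: take its tail
    out = []
    for child in reversed(block):           # highest value range first
        out = top_n(child, n - len(out)) + out
        if len(out) >= n:
            break
    return out[-n:]

blocks = build_blocks(range(1000))
print(top_n(blocks, 5))  # -> [995, 996, 997, 998, 999]
```

The real system presumably works against files on disk, so each block visit costs one file seek; that is where a fixed bound like the 2048 seeks mentioned above comes from.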
Second, here are our earlier Spark-based performance tests, for your reference.
On sorting, YDB has an absolute advantage: whether over the full table or under any combination of filters, it beats Spark in basically any storage format within seconds.
Test results (time in seconds):
<img src="https://pic4.zhimg.com/v2-2d00349f4ddd2ebe24b0f2051631530f_b.png" data-rawwidth="847" data-rawheight="429" Class="origin_image zh-lightbox-thumb" width="847" Data-original="https://pic4.zhimg.com/v2-2d00349f4ddd2ebe24b0f2051631530f_r.png">
Third, beyond sorting, our other performance numbers are also much higher than Spark's; the following should give everyone a feel for this.
1. Search performance comparison with Spark over txt files.
Note: the figure below is nothing special in itself. Because YDB has its own index and does not scan as brutally as Spark does, its scan performance is naturally much higher than Spark's; the high numbers are not surprising.
The figure below shows YDB's speedup multiple relative to Spark over txt. [Figure: YDB speedup factors vs. Spark txt]
2. Comparisons against the Parquet format (time in seconds). [Figures: YDB vs. Parquet query times across several test scenarios]
3. Comparison with Oracle performance
Against a traditional database there is no contest: Oracle is simply not built for big data, and any big data tool far exceeds Oracle's performance at this scale. [Figure: YDB vs. Oracle query times]
4. Performance test for an audit and monitoring scenario.
<img src="https://pic2.zhimg.com/v2-691e92dcb3a49e4a32126a3dd7a44f79_b.png" data-rawwidth="666" data-rawheight="349" Class="origin_image zh-lightbox-thumb" width="666" Data-original="https://pic2.zhimg.com/v2-691e92dcb3a49e4a32126a3dd7a44f79_r.png">
Fourth, how YDB makes Spark faster.
YDB is a real-time, multidimensional, interactive query, statistics, and analysis engine built on the Hadoop distributed architecture, with second-level response at the scale of trillions of rows and enterprise-grade stability and reliability.
YDB is a fine-grained, exact-granularity index. Data can be imported instantly, indexes are generated instantly, and the index locates the relevant data efficiently. YDB is deeply integrated with Spark: Spark analyzes YDB search result sets directly, and in the right scenario this speeds Spark up a hundredfold, as sketched after the figure below.
<img src="https://pic2.zhimg.com/v2-2c6106f29375369c380167a23a50426d_b.png" data-rawwidth="690" data-rawheight="361" Class="origin_image zh-lightbox-thumb" width="690" Data-original="https://pic2.zhimg.com/v2-2c6106f29375369c380167a23a50426d_r.png">
Fifth, which users are a good fit for YDB?
1. Users whose traditional relational databases can no longer hold their data, and whose query efficiency is suffering badly.
2. Users currently running Solr or ES for full-text search who find that Solr and ES provide too few analysis functions to implement complex business logic, or who find that once the data churns heavily Solr and ES become unstable, caught in a vicious circle of dropped shards and rebalancing, unable to restore service automatically, so that operations staff must regularly get up in the middle of the night to restart the cluster.
3. Users who analyze massive data but whose existing offline computing platform no longer meets the business's speed and response-time requirements.
4. Users who need multidimensional, targeted analysis of user-profile behavior data.
5. Users who need to search large volumes of UGC (User-Generated Content) data.
6. Users who need fast, interactive queries over large datasets.
7. Users who need real data analysis, not just simple key-value storage.
8. Users who want to analyze data in real time, as it is generated. PS: after all of the above, plainly the best fit is trace analysis: the data volume is large, the data must be real-time, and queries must be fast. That is the point.
Video links (to watch, open them in Tencent Video for HD playback):
https://v.qq.com/x/page/q0371wjj8fb.html
https://v.qq.com/x/page/n0371l0ytji.html
Interested readers can also read the YDB Programming Guide at http://url.cn/42R4CG8, or follow it to install Yen Yun YDB themselves for testing.
Handling common log aggregation and analysis is no problem, and even petabyte-scale data volumes are not difficult.
ELK is simple, lightweight, and easy to scale, but in data capacity, secondary extraction, and the surrounding ecosystem it is still not as good as Hadoop.
Once you have mastered some simple regular expressions, ELK can extract arbitrary fields (it does require semi-formatted data) and let you work with the extracted data; Spark, by contrast, requires learning a programming language. An example of that kind of extraction follows.
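As a concrete illustration of that last point (my example, not the answer's): the extraction an ELK grok filter performs on semi-formatted data is essentially one regular expression with named groups, as in this Python sketch over an access-log line.

```python
# The kind of field extraction ELK's grok filters perform: one pattern
# with named groups turns a raw access-log line into structured fields.
import re

LINE = '10.0.0.1 - - [01/Jan/2024:00:00:10 +0000] "GET /index.html HTTP/1.1" 200 512'

PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

match = PATTERN.match(LINE)
if match:
    print(match.groupdict())
    # {'client': '10.0.0.1', 'ts': '01/Jan/2024:00:00:10 +0000',
    #  'method': 'GET', 'path': '/index.html', 'status': '200', 'bytes': '512'}
```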
Author: Zhihu user
Link: https://www.zhihu.com/question/35214783/answer/72938116
Source: Zhihu
Copyright belongs to the author. For commercial reprints please contact the author for authorization; for non-commercial reprints please credit the source.