The previous article covered nine types of technology and their fields. Now, where there are ingredients, there must be cooking: any big data technology stack you select must contain at least three kinds of components (a source, computation, and storage).
The simplest data processing architecture:
This is the smallest possible unit of a data processing scheme. Of course it is not the best one. Why not? The problems:
1. In streaming computation, every micro-batch writes its result to HDFS, and those files may be only 20 MB or even 100 KB. Storing computation results this way is a huge waste of storage space: HDFS is not suited to storing large batches of small files (not suited, which is not the same as not able); see the baseline sketch after this list.
2. When the data volume is large, the system cannot keep up (the receivers take in data until they collapse). You might say, "fine, I'll just bring up a few more receivers", but what if the volume suddenly swings between large and small? Keep pushing your program to its limits? Too many receivers is wasteful, too few cannot cope, and there is no single optimal number.
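To make the baseline concrete, below is a minimal sketch of the "source → compute → store" pipeline just described, written against the Spark Streaming DStream API. The socket host, port and HDFS output prefix are placeholders of my own, not part of the original text; the point is only that every micro-batch writes its own (often tiny) files to HDFS.

```scala
// Minimal sketch of the naive "source -> compute -> store" baseline.
// Host names, ports and paths are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NaivePipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("naive-pipeline")
    val ssc  = new StreamingContext(conf, Seconds(5)) // one micro-batch every 5 seconds

    // Source: a plain socket stream, standing in for any receiver / ingestion point
    val lines = ssc.socketTextStream("ingest-host", 9999)

    // Compute: a trivial per-batch aggregation
    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1L)).reduceByKey(_ + _)

    // Store: every micro-batch writes its own (often tiny) files to HDFS,
    // which is exactly the small-file problem called out above.
    counts.saveAsTextFiles("hdfs:///data/naive/wordcount")

    ssc.start()
    ssc.awaitTermination()
  }
}
```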
Given these two problems, what do we do? (It means the technical architecture is inadequate, so we change it.) When we learned I/O, there was a buffered pipe: a big pipe that acts as a buffer pool and greatly improves efficiency. Following that idea, we insert a buffer layer between the components. Which technology meets that requirement? A message queue, of course.
To absorb swings in data volume, we bring in Kafka as the data buffering layer.
The data processing architecture then becomes the following:
With the architecture in this form, we can control the data flow rate, hand the buffering over to Kafka, and add receivers (streaming data ingestion points) as reasonably needed.
By tuning the Kafka configuration file (rather than just accepting the defaults) we can mitigate data skew and make sure the Spark cluster receives data at a measured rate. The processed results are sent back to Kafka, accumulated there, and then persisted to HDFS in larger batches, which solves the storage waste caused by small files.
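Below is a minimal sketch, under my own assumptions, of what this Kafka-buffered version could look like with the spark-streaming-kafka-0-10 integration: backpressure plus a per-partition rate cap keep ingestion measured, and results are pushed back to a Kafka topic so a downstream consumer can accumulate them before persisting larger files to HDFS. Broker addresses, topic names and the group id are placeholders.

```scala
// Sketch of the Kafka-buffered pipeline: Kafka -> Spark Streaming -> Kafka.
// Brokers, topics and group id are placeholders.
import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object BufferedPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-buffered-pipeline")
      // Let Spark adapt the ingestion rate instead of overrunning the consumers
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")

    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG        -> "kafka-1:9092,kafka-2:9092",
      ConsumerConfig.GROUP_ID_CONFIG                 -> "stream-compute",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG   -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
    )

    // Source side of the buffer: pull from the raw-events topic at a controlled rate
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("raw-events"), kafkaParams))

    // Compute: per-batch aggregation over the message payloads
    val counts = stream.map(record => (record.value, 1L)).reduceByKey(_ + _)

    // Sink side of the buffer: push results back to a results topic; a separate
    // consumer accumulates them and persists larger files to HDFS.
    counts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val props = new Properties()
        props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092")
        props.put("key.serializer", classOf[StringSerializer].getName)
        props.put("value.serializer", classOf[StringSerializer].getName)
        val producer = new KafkaProducer[String, String](props)
        partition.foreach { case (key, count) =>
          producer.send(new ProducerRecord[String, String]("result-events", key, count.toString))
        }
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```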
Is this the best we can do? For stream computation it really is quite good, but only "quite good". Why? Because so far this only covers data processing. (Data processing? In everyday usage the term is broad, but in practice "data processing" refers to everything from the data source up to the point where the data is ready to be analyzed. From there to chart presentation is the data analysis stage: multidimensional analysis, comparison, extracting value curves, and presenting them. That accounts for half of big data, and data mining needs sound analysis to extract value and present it to customers.) Does this seem to have little to do with technology selection?
At first glance you might think we do not need these new big data technologies here: traditional data analysis is built on relational databases such as MySQL and Oracle. But in real production projects they are being gradually replaced, whether for data interaction and coupling, connecting the back end to the front end, or integrating one database's data with another's. If we want to replace the traditional database, a distributed, scalable, memory-oriented big data database can come in handy. The data processing architecture changes accordingly, for example:
Of course, when doing offline processing we can fetch data directly from memory-backed storage, or even keep the data source itself in memory. Whether the computation is offline or streaming, long-period aggregation has to pull historical data from disk, and fetching that data from memory instead is bound to improve computational efficiency. Memory-based data management therefore becomes particularly important, and Tachyon, as a distributed in-memory file management system, solves exactly this kind of problem. Its structure:
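As a rough sketch of that idea, the snippet below assumes Tachyon's Hadoop-compatible filesystem client is on the classpath and configured, so a Spark job can read historical data from, and write aggregated results back to, tachyon:// paths instead of disk-backed HDFS. The master address, paths and record layout are my own placeholders.

```scala
// Sketch of using Tachyon as the in-memory storage tier between jobs.
// Master host/port, paths and the CSV layout are assumptions.
import org.apache.spark.{SparkConf, SparkContext}

object MemoryTierExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tachyon-tier"))

    // A long-period aggregation reads historical data from memory rather than disk
    val history = sc.textFile("tachyon://tachyon-master:19998/warehouse/events/2016")

    val dailyTotals = history
      .map(_.split(","))
      .map(cols => (cols(0), cols(2).toLong)) // assumed layout: date,key,value,...
      .reduceByKey(_ + _)

    // Results land back in the memory tier so downstream jobs, offline or
    // streaming, can pick them up without touching HDFS disks.
    dailyTotals.map { case (date, total) => s"$date,$total" }
      .saveAsTextFile("tachyon://tachyon-master:19998/warehouse/daily_totals")

    sc.stop()
  }
}
```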
With this, the data processing and analysis side of the business is basically covered, and the optimization greatly improves operational efficiency. But big data has another scenario: when the data volume is very large, retrieving the target data becomes the bottleneck of processing and analysis. How do we solve that? HBase, a distributed column-oriented non-relational database, is really more like a data search engine. Why? Because it does not retrieve by column values; it retrieves by row key, and that lookup speed is frightening. If you ignore the row key and just stuff in a UUID, then coming to big data is a joke. HBase is not a traditional database, and precisely because of this, it is more apt to understand it as a search engine.
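A small sketch of the point about row keys, using the HBase 1.x-style client API (the table name, column family and key layout are my own assumptions): a lookup that encodes the query dimension into the row key is a direct Get, while a query that ignores the row key has to fall back to a filtered full-table Scan.

```scala
// Sketch: row-key lookup vs. full-table scan in HBase.
// Table, column family and key layout are assumptions.
import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Scan}
import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes

object RowKeyLookup {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("user_events"))

    // Fast path: the row key encodes the query dimension (e.g. "userId_timestamp"),
    // so a point lookup goes straight to the right region and block.
    val result = table.get(new Get(Bytes.toBytes("u10086_20240101")))
    val event  = Option(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("event")))
      .map(Bytes.toString).getOrElse("<missing>")
    println(s"point lookup: $event")

    // Slow path: a random UUID row key carries no meaning, so filtering on a
    // column value degenerates into scanning the whole table.
    val scan = new Scan()
    scan.setFilter(new SingleColumnValueFilter(
      Bytes.toBytes("d"), Bytes.toBytes("userId"),
      CompareFilter.CompareOp.EQUAL, Bytes.toBytes("u10086")))
    val scanner = table.getScanner(scan)
    println(s"full scan matched: ${scanner.iterator().asScala.size} rows")
    scanner.close()

    table.close()
    conn.close()
  }
}
```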
As the saying goes, two fists cannot beat four hands, and even a hero cannot hold off a crowd. When the number of rows to search reaches the millions, the rate drops. Is there any way to improve it? Yes: the way to improve search efficiency is indexing; with an index, queries get faster. Why? (Anyone who has migrated relational data to HBase knows this.) So we use Lucene and Solr to build a full-text search engine, index the bulk data, and then use the fuzzy-match results to do exact lookups against HBase and get the final results. By turning large-batch queries into precise, small-batch row-key lookups, HBase query rates become much faster, and we can offer users both a search experience over a vast amount of raw data and processing and analysis (exact computation) over the indexed data. Its structure:
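The sketch below illustrates that pattern under assumed names (the Solr URL, collection, field names and HBase table are mine, not the author's): a fuzzy query against a Solr index returns the matching HBase row keys, and those keys are then fetched from HBase with precise multi-gets.

```scala
// Sketch: Solr as a secondary index in front of HBase.
// Solr URL/collection, field names and HBase table are assumptions.
import scala.collection.JavaConverters._
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object SolrThenHBase {
  def main(args: Array[String]): Unit = {
    // Step 1: fuzzy full-text search in Solr; each document stores the HBase row key.
    val solr = new HttpSolrClient.Builder("http://solr-host:8983/solr/event_index").build()
    val query = new SolrQuery("content:*error*")
    query.setRows(100)
    val rowKeys = solr.query(query).getResults.asScala
      .map(_.getFieldValue("rowkey").toString)
    solr.close()

    // Step 2: precise, small-batch lookups in HBase by the returned row keys.
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("user_events"))
    val gets  = rowKeys.map(key => new Get(Bytes.toBytes(key))).asJava
    table.get(gets).filterNot(_.isEmpty).foreach { result =>
      println(Bytes.toString(result.getRow))
    }
    table.close()
    conn.close()
  }
}
```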
That is how technology selection and the pairing of functions work across a whole big data processing business. You have to look at the essence of the pairing, not simply use whatever others use; real selection is based on real business scenarios and how they dock with production.
Next: Enterprise-Class Big Data Processing Solutions - 02: The Environment Determines the Requirements, Performance Determines the Selection.