We're in the middle of a hype cycle around big data and distributed computing, and it's time for the big data bubble to burst.
Yes, moving through a hype cycle is how a technology crosses the chasm from early adopters to the broader mass market. And, at the very least, it signals that a technology has advanced beyond academic conversations and pilot projects. But the broader audience adopting the technology may simply be following the herd and missing some important cautionary points.
Follow the trend
In any hype cycle, there is usually a group of vendors that follows the trend, rushing to implement a fashionable technology in an effort to stay relevant and not get lost in the shuffle. But the products from these companies can confuse the market, because the technology ends up being applied inappropriately.
Projects that use these products run the risk of failure: customers can spend significant resources and effort yet see little or no return on investment, and then they begin to question the technology itself. The Hadoop stack is facing this situation right now.
Bursting the big data bubble starts with recognizing some subtle differences among the products and usage patterns. Below are some important factors, grouped into three areas of focus, that you should understand before considering a technology built on Hadoop's distributed infrastructure.
Hadoop is not an RDBMS killer
Hadoop runs on commodity hardware and storage, which makes it much cheaper than a traditional relational database management system (RDBMS), but it is not a database replacement. Hadoop was built to exploit sequential access to large blocks of data (write once, read many times) rather than access to individual records. Because of this, Hadoop is optimized for analytic workloads, not for the transaction processing that an RDBMS handles.
Frankly, low-latency reads and writes are not efficient on the Hadoop Distributed File System (HDFS). Merely coordinating the write or read of a single byte of data requires multiple TCP/IP connections to HDFS, which adds very high latency to transactional operations.
However, in a well tuned Hadoop cluster, the throughput of reading and writing large chunks of data is very high.
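A rough cost model makes the contrast between latency and throughput concrete. The sketch below assumes every request pays a fixed overhead (connection setup and coordination round trips) plus a transfer time proportional to its size. The function name and the numbers (5 ms of overhead, 100 MB/s of bandwidth) are illustrative assumptions, not measurements of any HDFS cluster.

```python
def effective_throughput(request_bytes, overhead_s=0.005, bandwidth_bytes_per_s=100e6):
    """Bytes per second actually delivered when each request pays a fixed overhead.

    overhead_s and bandwidth_bytes_per_s are hypothetical values for illustration.
    """
    transfer_s = request_bytes / bandwidth_bytes_per_s
    return request_bytes / (overhead_s + transfer_s)


# One 100-byte record per request: the fixed overhead dominates.
small = effective_throughput(100)

# One 128 MB block per request: the overhead is amortized over the transfer.
large = effective_throughput(128 * 1024 * 1024)

print(f"100 B requests:  {small / 1e6:.3f} MB/s")
print(f"128 MB requests: {large / 1e6:.1f} MB/s")
```

Under these assumed numbers, record-sized requests deliver only a tiny fraction of the available bandwidth, while block-sized requests come close to saturating it. That is the shape of the trade-off described above, whatever the real constants are on a given cluster.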
Hive and its trade-offs
Hive allows developers to query data stored in Hadoop using a language similar to Structured Query Language (SQL). Far more people know SQL than can write Hadoop's native MapReduce code, which makes Hive an attractive and cheaper alternative to recruiting new talent or having developers learn Java and MapReduce programming patterns.
However, before deciding on Hive as your big data solution, there are some very important trade-offs to note:
- HiveQL (Hive's SQL dialect) only allows you to query structured data.
- Hive itself does not include an extract/transform/load (ETL) tool. So while you may save money by using Hadoop and Hive as your data store, and by staffing the work with in-house developers who already have SQL skills, you will pay to maintain custom load scripts and to prepare data as requirements change.
- Under the hood, Hive uses HDFS and Hadoop MapReduce. For the reasons already discussed, this means that end users accustomed to normal SQL response times from a traditional RDBMS may be disappointed by the somewhat clumsy batch approach Hive takes to "querying".
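To see why a SQL-like query on top of MapReduce behaves like a batch job, it helps to trace a simple aggregation through the map, shuffle, and reduce phases. The following is a minimal in-memory sketch in plain Python, not Hive's actual query planner; the `sales` table, its rows, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical table of (region, amount) rows; in Hive these would live in HDFS.
rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]


def map_phase(records):
    """Emit one (key, value) pair per input row."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Group every value by key; reduce cannot start until all pairs are grouped."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Aggregate the complete list of values for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Roughly: SELECT region, SUM(amount) FROM sales GROUP BY region
totals = reduce_phase(shuffle_phase(map_phase(rows)))
print(totals)  # {'east': 17, 'north': 3, 'west': 6}
```

Even this one-line query needs a full pass over the input and a complete grouping step before any result appears, which is why response times feel like batch processing and not like an interactive database.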
Is it real-time Hadoop? Not really.
Let's explore some of the technical factors that make Hadoop unsuitable for real-time applications. Hadoop's MapReduce computation follows a map preprocessing step with a reduce step that aggregates and refines the data. While it is possible to apply the map operation to real-time streaming data, the same is not true of reduce.
This is because the reduce step requires all of its input data to have been mapped and sorted first for each unique key. There is a workaround for this that involves buffers, but even the workaround does not operate in real time, and the buffers can hold only small amounts of data.
Some NoSQL products also use MapReduce for analytic workloads. So while these data stores can serve near-real-time data queries, they are not tools for real-time analytics either.
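The asymmetry between map and reduce can be sketched in a few lines of Python. Map is applied record by record, so it works on a live stream. Reduce needs every value for a key, so on a stream it can only be applied to a bounded buffer at a time. The stream contents, function names, and buffer size below are hypothetical, and the windowing shown is a generic illustration of the buffer workaround, not a specific Hadoop feature.

```python
from collections import Counter
from itertools import islice


def event_stream():
    """Hypothetical stream of keys; a real stream would never end."""
    yield from ["a", "b", "a", "c", "a", "b", "c", "a"]


def mapped(stream):
    """Map works one record at a time, so it can run on a live stream."""
    for key in stream:
        yield key.upper(), 1


def windowed_reduce(pairs, buffer_size):
    """Buffer workaround: reduce each filled buffer, then discard it.

    Every result covers one buffer only, never the whole stream.
    """
    pairs = iter(pairs)
    while True:
        window = list(islice(pairs, buffer_size))
        if not window:
            return
        counts = Counter()
        for key, value in window:
            counts[key] += value
        yield dict(counts)


partial_counts = list(windowed_reduce(mapped(event_stream()), buffer_size=3))
print(partial_counts)
```

Each dictionary is a count for a single buffer. A true total for a key requires seeing the entire input, which an unbounded stream never provides, and results are still delayed until a buffer fills.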
While there are other big data myths that need busting, Hadoop's inability to replace a relational database management system, Hive's shortcomings, and MapReduce's poor fit for real-time streaming data applications are the biggest stumbling blocks in our observations.
In the end, realizing the promise of big data requires looking past the hype and understanding where the technology is appropriately applied. Information technology (IT) organizations must break through the big data bubble and focus their Hadoop efforts on the areas where it provides real, differentiated value.