There is little doubt that anyone who follows technology is aware of the potential value of "big data" for business: the goal is to relieve the pain caused by the growth of business data as an enterprise develops.
The reality, however, is that many problems hinder the development and practical application of big data technologies. A successful technology needs some standard against which to be measured, and today we can assess big data technologies against a few basic elements: stream processing, parallelization, summary indexing, and visualization.
Who uses big data?
A year ago, the main users of big data technology were large web companies, such as Facebook and Yahoo, that needed to analyze streaming data. Today, big data technology has moved beyond the web and is within reach of any business with large amounts of data to process: banks, utilities, intelligence agencies, and others are all climbing aboard.
In fact, some big data technologies were first put to use by companies at the technological cutting edge, such as those building web services driven by social media, and those companies have made important contributions to big data projects.
In other vertical industries, companies are realizing that the value they can draw from information services is far greater than they previously imagined, so big data technology has quickly caught their attention. Combined with falling hardware and software costs, these companies find themselves in a perfect storm of opportunity for a major business transition.
Three major challenges in big data processing: volume, variety of formats, and speed
Volume (terabytes, petabytes, even exabytes): ever more business data generated by people and machines puts pressure on IT systems and on the storage and security of data, and makes future access to and use of that data difficult.
Variety of formats: massive data arrives in an ever wider range of formats, each requiring its own processing methods, from simple emails, data logs, and credit card records to instrument-collected scientific research data, medical data, financial data, and rich media (photos, music, video, and so on).
Speed: the rate at which data must move from the endpoint to processing and storage.
What does big data technology cover?
I. Stream Processing
As the pace of business quickens and business processes grow more complex, attention is increasingly focused on "data streams" rather than "datasets".
Decision-makers want to keep a finger on the pulse of their organization and get results in real time. What they need is a framework that can handle data streams as they occur, and current database technology is not well suited to stream processing.
For example, to compute the average of a fixed set of data, a traditional script will do. For a moving average over data that keeps arriving, growing, or advancing one unit at a time, more efficient incremental algorithms exist. If you want to build a data warehouse and run arbitrary analyses and statistics, the open source product R or a commercial product such as SAS can do the job. But if what you want is a streaming statistic that incrementally adds and removes blocks of data as a moving-average calculation proceeds, the databases for that either do not exist or are immature.
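As a minimal sketch of the idea, in plain Python rather than any particular streaming product, the moving average below updates in constant time as each value arrives instead of rescanning the whole dataset:

```python
from collections import deque

class MovingAverage:
    """Sliding-window average that is updated incrementally as values
    arrive, rather than recomputed over the full dataset each time."""

    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.total = 0.0

    def add(self, value):
        # Append the new value; evict the oldest once the window is full.
        self.window.append(value)
        self.total += value
        if len(self.window) > self.window_size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)

# Values arriving one unit after another from a stream.
stream = [3, 5, 7, 4, 9, 2]
avg = MovingAverage(window_size=3)
for reading in stream:
    print(avg.add(reading))
```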
The ecosystem around data streams is still underdeveloped. In other words, if you are negotiating a big data project with a vendor, you need to know whether stream processing matters to your project and whether the vendor is actually able to provide it.
II. Parallelization
There are many definitions of big data; one relatively useful one is based on scale. "Small data" resembles a desktop environment, with disk storage of roughly 1 GB to 10 GB; "medium data" falls between 100 GB and 1 TB; and "big data" is stored in distributed fashion across multiple machines and ranges from 1 TB to many petabytes.
If you work in a distributed data environment and want to process the data in a very short time, you need distributed processing.
Parallel processing comes into its own with distributed data, and Hadoop is the best-known example of distributed/parallel processing: it includes a large distributed file system that supports distributed, parallel queries.
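The sketch below is not Hadoop itself, but it mimics the map/reduce pattern that Hadoop popularized: each partition of the data is processed independently (map) and the partial results are merged (reduce). The partition contents and helper names such as map_partition are made up for illustration.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_partition(lines):
    # Map step: count words within one partition of the data.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def merge_counts(a, b):
    # Reduce step: merge partial counts from two partitions.
    a.update(b)
    return a

if __name__ == "__main__":
    # Pretend each partition lives on a different machine or disk.
    partitions = [
        ["big data big value", "streams of data"],
        ["data moves fast", "big systems scale out"],
    ]
    with Pool() as pool:
        partial = pool.map(map_partition, partitions)
    total = reduce(merge_counts, partial, Counter())
    print(total.most_common(3))
```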
III. Summary Indexing
A summary index is built by precomputing an approximate summary of the data so that queries run faster. The drawback is that you have to plan in advance for the queries that will be executed, so its usefulness is limited.
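A toy sketch of the idea in plain Python, with made-up field names: a small per-region summary is maintained so a planned query never touches the raw data, while an unplanned query still has to scan it.

```python
from collections import defaultdict

# Raw events: (region, amount). In practice far too large to scan per query.
events = [("east", 120.0), ("west", 75.5), ("east", 30.0), ("north", 50.0)]

# Build the summary index once (or maintain it incrementally as events arrive).
summary = defaultdict(lambda: {"total": 0.0, "count": 0})
for region, amount in events:
    summary[region]["total"] += amount
    summary[region]["count"] += 1

# A planned-for query ("total sales by region") reads the small summary...
print(summary["east"]["total"])

# ...but an unplanned query (amounts above a threshold, say) must still
# fall back to scanning the raw events -- the limitation noted above.
print([amount for region, amount in events if amount > 60])
```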
Data is growing fast, and the need for summary indexes, both short-term and long-term, is not going away, so a vendor must have a clearly defined strategy for how its summary indexing will develop.
IV. Data Visualization
There are two broad categories of visualization tools.
Exploratory visualization tools help decision-makers and analysts dig into the relationships between different data and gain visual insight. Tableau, TIBCO, and QlikView are tools of this kind.
Narrative visualization tools are designed to present data in a specific, predesigned way. For example, if you want to view an enterprise's sales performance as a time series, the visual format is created in advance: data is displayed monthly by region and sorted according to predefined formulas. The vendor Perceptive Pixel belongs to this category.
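A minimal sketch of such a predefined roll-up, in plain Python with invented sales records: the monthly, per-region layout and the sort order are fixed ahead of time, and the data is simply poured into that shape.

```python
from collections import defaultdict

# Hypothetical sales records: (month, region, amount).
sales = [
    ("2024-01", "east", 100), ("2024-01", "west", 80),
    ("2024-02", "east", 120), ("2024-02", "west", 95),
]

# Roll the data up month by month per region -- the layout a narrative
# visualization would render.
by_month = defaultdict(dict)
for month, region, amount in sales:
    by_month[month][region] = by_month[month].get(region, 0) + amount

# Within each month, rank regions by the predefined formula (descending sales).
for month in sorted(by_month):
    ranked = sorted(by_month[month].items(), key=lambda kv: -kv[1])
    print(month, ranked)
```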
V. Ecosystem Strategy
Many of the largest and most successful companies spend heavily on building ecosystems around their products. These ecosystems rest on product features and business models and interoperate with partners' products and technologies. A product without a rich, strategic ecosystem will struggle to adapt to customer requirements.
(Responsible editor: Lu Guang)