Today, "Bigdata" (large data) the advent of the era has been no doubt, especially in the telecommunications, financial and other industries, almost to the "data is the business itself". One of the more important trends in this is the socialization of data (socialization of 挟裹). From the blog forum to the gaming community to the microblog, from the Internet to the mobile internet to the IoT, the real-time networking of human beings and various physical entities has been and will continue to produce incalculable data. For companies that always focus on the direction of the market, the data they need to focus on is clearly not limited to the business data in the enterprise's internal database, but also to the relevant data records generated by various network activities on the Internet (and the future of IoT).
Clearly, big data is an innovation, and it knows where you are at every moment. If this data can be put to effective use, the market demand for doing so is already visible.
At the Big Data World Forum recently held in Beijing, in-memory computing, real-time query, efficient storage management, and intelligent mining and analysis became the focus for many IT professionals.
User requirements: massive data + real-time analysis
Benjamin Woo, vice president of IDC's worldwide storage and big data research, said global data volume was expected to surge 44-fold to roughly 35.2 ZB by 2020. What does 35 ZB mean? Since 1 ZB = 1024 EB = 1,048,576 PB = 1,073,741,824 TB, 35 ZB comes to about 37,580,963,840 TB; in other words, the world would need roughly 37.6 billion 1 TB hard drives just to hold it.
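As a quick sanity check of this arithmetic, the figures above can be reproduced with a few lines of Python (binary prefixes assumed, as in the parenthetical):

```python
# Back-of-envelope check of the 35 ZB figure quoted above (binary units assumed).
TB_PER_ZB = 1024 ** 3          # 1 ZB = 1024 EB = 1024^2 PB = 1024^3 TB
zettabytes = 35
terabytes = zettabytes * TB_PER_ZB
print(f"{zettabytes} ZB = {terabytes:,} TB")          # 37,580,963,840 TB
print(f"~{terabytes / 1e9:.1f} billion 1 TB drives")  # ~37.6 billion
```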
Another requirement comes from the demand for real-time query access to massive data. From the very first day of data warehousing, a recurring topic has been how to break a large query down into small tasks, each of which can be completed by a single machine.
"Our requirement is that the analysis of billions of data can be done in 5 seconds," Yianyang, general manager of China National Securities Information Technology Department, said that because of the volatility of development in the securities industry, the investment of securities companies was also based on their income, using a database (Data Warehouse) of MPP (large-scale parallel processing) architecture. The basic feature is that multiple SMP servers are interconnected through nodes, and each node can only access local resources, is a completely unshared structure, so extensibility is good. When developing applications based on this database, it is not necessary to consider how many nodes the server is composed of or to consider the load problem.
The advantage of MPP is that it is better suited to complex data aggregation, analysis, and processing than the NUMA (non-uniform memory access) architecture. Of course, it also requires an MPP-aware database system that hides the complexity of load balancing and scheduling between nodes from the application.
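To make the shared-nothing idea concrete, here is a minimal Python sketch of the pattern, not any vendor's implementation: a query is split into per-partition tasks, each task touches only its own data, and a coordinator merges the partial results. The partition count and data are purely hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical partitions: in a real MPP system each node scans only its local data.
partitions = [list(range(i, 1_000_000, 4)) for i in range(4)]

def local_aggregate(rows):
    """Per-node work: compute a partial sum and count over local rows only."""
    return sum(rows), len(rows)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(local_aggregate, partitions))
    # The coordinator merges the partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    print("average =", total / count)
```

Because each task reads only its own partition, adding nodes adds both storage and compute, which is the scalability property described above.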
Yianyang also offered criteria for selecting BI and ETL tools: compatibility, operability, functionality, presentation capability, and maintainability. In the data center, he said, every transformation must guarantee data integrity and accuracy, while security management and control are achieved through the development tools plus a moderate amount of custom programming. If the development team's understanding and mastery of the tools is poor, the data warehouse will end up full of garbage data, or even large amounts of incorrect data. Tool selection should therefore weigh all five of these aspects.
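As one illustration of the "integrity and accuracy at every transformation" point, the following is a hypothetical sketch of a row-level validation step in an ETL flow; the field names and rules are invented for the example and are not tied to any of the tools discussed here.

```python
def validate(row):
    """Return a list of integrity problems for one row (empty list means the row is clean)."""
    errors = []
    if row.get("account_id") is None:
        errors.append("missing account_id")
    if not isinstance(row.get("amount"), (int, float)):
        errors.append("amount is not numeric")
    return errors

rows = [{"account_id": 1, "amount": 10.5}, {"account_id": None, "amount": "n/a"}]
clean = [r for r in rows if not validate(r)]
rejected = [(r, validate(r)) for r in rows if validate(r)]
print(len(clean), "rows loaded,", len(rejected), "rows quarantined")
```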
Big data does not mean IT refactoring
Another message from the conference was that, for enterprises with genuine big data needs, adopting big data does not mean rebuilding their IT from scratch.
In fact, not all data needs to be analyzed. For the enterprise, the core task is to identify the key data: structured data can be handled by an MPP system, while unstructured data can be processed with Hadoop.
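As a minimal illustration of the Hadoop-style approach to unstructured data, here is a single-machine Python sketch of the two map/reduce phases; a real Hadoop job distributes the same phases across a cluster, and the sample log lines are hypothetical.

```python
from collections import Counter
from itertools import chain

# Hypothetical unstructured input: free-text log lines.
logs = [
    "user 42 viewed product 7",
    "user 42 purchased product 7",
    "user 99 viewed product 3",
]

def map_phase(line):
    """Emit (key, 1) pairs, here one per action word found in the line."""
    return [(word, 1) for word in line.split() if word in ("viewed", "purchased")]

def reduce_phase(pairs):
    """Sum the counts for each key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

print(reduce_phase(chain.from_iterable(map_phase(l) for l in logs)))
# Counter({'viewed': 2, 'purchased': 1})
```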
On the other hand, big data infrastructure needs to be forward-looking: as data grows, users must consider, at both the hardware and software levels, what architecture to adopt. What the future inevitably requires is a highly scalable file system well suited to storing files.
At present, IBM, EMC, and SAP have all launched their own end-to-end big data solutions, and the differences are clear from a comparison.
IBM: the data warehouse appliance
IBM has turned the data warehouse into a hardware product: standard Power servers and storage, plus the newly acquired Netezza. Cheng, channel manager for the Netezza data warehouse appliance, said, "Anyone who has worked with cloud computing or Hadoop will find the Netezza environment a piece of cake." The Netezza appliance is understood to be a compact, highly integrated parallel processing environment. Its architecture is based on MPP: two host systems accept query requests and break each query into dozens or hundreds of fragments, which are passed to a second layer of identical units (internally called S-Blades, each with its own partition of memory). Each fragment is then handled like a machine in an MPP environment, with the interconnecting network forming the MPP fabric.
When data is loaded, it is distributed evenly across all nodes, with Netezza itself keeping the nodes balanced. On disk the data is described by "zone maps": for each slice of a table, the ranges of values held in every field are recorded. With zone maps, the FPGA scan reads only those regions that could contain matching records, which greatly accelerates disk access. When data is modified, new records are appended, and the periodic reorganization and compression of the data are managed by background processes.
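The zone-map idea can be illustrated with a small, hypothetical sketch: keep the minimum and maximum value of each block, and skip any block whose range cannot contain the value being searched for. Netezza's actual on-disk format and FPGA scanning are of course far more elaborate.

```python
# Hypothetical table column split into fixed-size blocks ("slices").
blocks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]

# Zone map: record the min and max value held in each block.
zone_map = [(min(b), max(b)) for b in blocks]

def scan_equal(target):
    """Read only blocks whose (min, max) range could contain the target value."""
    hits, blocks_read = [], 0
    for (lo, hi), block in zip(zone_map, blocks):
        if lo <= target <= hi:
            blocks_read += 1
            hits.extend(v for v in block if v == target)
    return hits, blocks_read

values, read = scan_equal(432)
print(values, f"- scanned {read} of {len(blocks)} blocks")  # [432] - scanned 1 of 10 blocks
```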
EMC: self-service data warehouse
Greenplum Chorus is EMC's data cloud platform, offered alongside a stack of big data analysis tools (business analytics, BI, statistics, and so on).
Taupo, chief technology officer of EMC's China Research and Development Center, says Greenplum Chorus consists primarily of self-service provisioning and collaborative analysis over data services. Once users have deployed Chorus, VMware, and Greenplum, they can provision services for themselves, requesting either a server or a sandbox, built on a virtualized IT cloud platform or directly on a Greenplum database. Users can also import other raw data from the enterprise cloud into their own sandbox to work on. At the same time, developers can create a collaborative environment that shares data while controlling authorization and preventing improper use of private data: for example, write a piece of R code, share it, see other people's comments on it, and carry the work on anytime, anywhere.
SAP: in-memory computing + open platform
At the conference, SAP presented HANA, its open in-memory computing platform. According to Zhang Zhiqi, general manager of SAP's Solutions and Architecture Department, HANA will become SAP's platform of the future: all future applications will be released on HANA and used for real-time analysis. HANA will mainly serve software that processes large volumes of transaction data in real time, covering data modeling, data and lifecycle management, security, and operational tools, with support for multiple industry-standard interfaces.
"Memory is 1 million times times faster than disk calculation, traditional database disk read speed is 5 milliseconds, memory database disk read speed is 5 nanoseconds." Users can take advantage of high-speed memory performance, faster access to data, summary data, analysis of data. In addition, a large data age, a server may not be enough, the user can divide and conquer the dynamic server to different nodes, rapid data dispersion calculation, data dispersed summary, to obtain faster results. Each server also has more CPUs, which can be more power in total memory calculations. All data can be distributed through the environment. ”
Zhang Zhiqi said, "HANA will be an open platform: SAP will develop applications on HANA together with business partners and others, and provide broader support. The SAP HANA platform will support more in-memory applications, such as reports, operational reporting, and analysis accelerators, bringing even more value. HANA is not just an analytical application but a business application. We want companies to get at their information without having to involve IT staff, and that capability is exactly what the SAP HANA platform provides."
Data analysts: a hot profession of the future
With the advent of big data, it is foreseeable that data analysis will be one of the hottest professions of the future. Many sectors, including government, energy, and the Internet industry, need data scientists who can keep raising sharp, business-critical questions. Alibaba now has more than 100 data analysts; spread across different departments, they share their analyses with data managers and business managers to form a collaborative data analysis practice.
Zheng, director of enterprise data integration product management at Informatica, says the need for big data analysts will keep growing, whether they sit inside the IT organization or within the business units. From an Internet perspective, a great deal of user value can be uncovered from the data, with a direct impact on sales in areas such as car retailing. Across the world, and especially in industries where data practices are more mature, there will be heavy demand for data analysts and data scientists.
Zhou Junling, chief customer officer at Teradata, said that companies must have scientists dedicated to data research, and once data scientists are on board, the enterprise needs to give them relevant training in data science and help them understand the business.
Steven Sitze, director of IBM's big data project at the Silicon Valley Laboratory, notes that some large companies already have data analysts, but those analysts are not yet working over truly large volumes or varieties of data, which is a great opportunity for tool builders. For companies, having a big data platform is not enough; they also need tools that help them use it and that let them analyze and present big data effectively.
Observation: the big data industry is only just getting started
For data storage vendors, big data still poses many challenges. The first is that they must improve the efficiency of the relational database, adding stronger data management and data compression capabilities.
Relational database products have not been fast at processing very large volumes of data, so new technologies such as Hadoop are needed to speed up data queries. In addition, storage vendors are starting to look beyond the traditional hard disk for storing data; flash-based databases, for example, are gradually emerging.
Another challenge is that the traditional relational database cannot analyze unstructured data. Acquiring vendors that can analyze unstructured data, along with data management vendors, is therefore the direction in which today's storage vendors are expanding.
At the same time, data management must also take data security into account. Wunai, a software market analyst at IDC, says big data likewise affects storage technology and the security of resources. Technologies such as snapshots and data deduplication become important in the big data era, and from them arises the question of managing data access rights.
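For readers unfamiliar with deduplication, the core idea can be sketched in a few lines: hash each chunk of data and store a chunk only if its hash has not been seen before. This is a toy illustration, not any vendor's implementation; real systems use kilobyte-scale, often content-defined chunks.

```python
import hashlib

CHUNK = 8   # toy chunk size for illustration only
store = {}  # hash -> chunk: each unique chunk is kept only once

def deduplicated_write(data: bytes) -> int:
    """Store only chunks not already present; return how many new chunks were written."""
    new_chunks = 0
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk
            new_chunks += 1
    return new_chunks

print(deduplicated_write(b"AAAAAAAABBBBBBBB"))  # 2 -> both chunks are new
print(deduplicated_write(b"AAAAAAAACCCCCCCC"))  # 1 -> the "A" chunk is reused
```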
For example, the data models at an enterprise's back end and front end are now different, and when the enterprise handles unstructured data it has to work out whether the IT department or the business unit is the data's owner. Because this involves not only technical questions but also the formulation of corporate policy, defining who manages the data is currently the biggest headache for enterprises. In all of this, big data, in both its technology and its applications, is only just getting started.