How to implement big data projects on the cloud


Cloud computing and big data are both hot topics, and combining the two to deliver big data projects on the cloud is a new area of practice. Senior data expert David Gillman, drawing on his own experience, lists the basic elements to consider in cloud big data scenarios, including real-time indexing of data, free-form search and analysis, monitoring of data, and real-time alerting, to help users better evaluate and select solutions.

When it comes to implementing big data projects on the cloud, David highlights three real-time elements: real-time indexing, real-time data, and real-time monitoring. In particular, real-time indexing refers to creating a generic real-time index for all machine data:

This is the core of what most people consider big data, and it is often equated with the open-source project Hadoop. Companies may be overwhelmed by incoming data from radio-frequency ID (RFID) readers, mobile devices, web clicks, and other sources of semi-structured data. If you know how this data will be used, and how it will be queried and accessed in the future, it is worthwhile to invest in processing it.

Hadoop provides a solution even when the potential future uses of the data are unknown. By taking incoming data as-is, big data defers the data-definition step until an analysis is executed. Without limiting the future uses of the data, Hadoop distributes it across many servers and keeps track of where each piece is located.
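To make that "store as-is, define at analysis time" pattern concrete, here is a minimal Python sketch, not tied to any particular Hadoop distribution; the event fields and the in-memory store are invented for illustration. Ingestion imposes no schema; the data definition happens only when a query runs.

```python
import json

# Ingest: append raw events exactly as they arrive; no schema is imposed.
def ingest(raw_line, store):
    store.append(raw_line)

# Analyze: the "data definition" happens here, at query time (schema-on-read).
def count_clicks_by_page(store):
    counts = {}
    for line in store:
        try:
            event = json.loads(line)       # interpret the bytes only now
        except json.JSONDecodeError:
            continue                       # malformed records are simply skipped
        if event.get("type") == "web_click":
            page = event.get("page", "unknown")
            counts[page] = counts.get(page, 0) + 1
    return counts

store = []
ingest('{"type": "web_click", "page": "/home"}', store)
ingest('{"type": "rfid_read", "tag": "A13"}', store)
ingest('{"type": "web_click", "page": "/home"}', store)
print(count_clicks_by_page(store))   # {'/home': 2}
```

Note that the RFID record is stored untouched even though the click-counting query ignores it; a future analysis can define and use it without any re-ingestion.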

Real-time data refers to "free-form search and analysis of real-time and historical data", and storing the data is only part of the road to that goal. The information also has to be relatively easy to find, and the quickest way to achieve that (quick in terms of implementation, not response time) is to provide a search function. You therefore need tools that support text search over unstructured data. Getting an answer back directly from a search gives rough confirmation that the information is being stored correctly and is accessible. The administrative step in this process is indexing the content of the data stored on the distributed nodes. A search query then accesses the indexes on the distributed nodes in parallel to deliver a faster response.
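As a rough sketch of that indexing-and-parallel-search idea, the Python snippet below builds a tiny inverted index on each of several simulated nodes and fans a query out to them in parallel. Real deployments delegate this to a search engine over the distributed store; the shards, documents, and search term here are all made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "node" indexes the documents it stores: word -> set of document ids.
def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search_node(index, word):
    return index.get(word.lower(), set())

# Simulated distributed nodes, each holding its own shard of the data.
shards = [
    {"doc1": "disk failure on rack 7", "doc2": "login failure for user"},
    {"doc3": "rack 7 temperature alert", "doc4": "routine backup completed"},
]
indexes = [build_index(shard) for shard in shards]

# The query is sent to all node indexes in parallel; results are merged.
with ThreadPoolExecutor() as pool:
    partial = pool.map(lambda idx: search_node(idx, "failure"), indexes)
hits = set().union(*partial)
print(sorted(hits))   # ['doc1', 'doc2']
```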

Real-time monitoring refers to "monitoring data and providing real-time alerts":

Look for tools that can monitor the data inside a big data store. Some tools can create a continuously processed query that watches for conditions to be met. I cannot list all the possible uses for real-time monitoring of data entering Hadoop, but if most of the incoming data is unstructured and not suited to a relational database, real-time monitoring may be one of the most practical ways to examine individual data elements.
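A continuously processed query can be pictured as a predicate evaluated against every record as it arrives; when the condition is met, an alert fires. The minimal sketch below illustrates the shape of this; the sensor names and threshold are invented for the example.

```python
# A "continuous query": a condition checked against every incoming record.
def temperature_too_high(record):
    return record.get("sensor") == "temp" and record.get("value", 0) > 90

def monitor(stream, condition, alert):
    for record in stream:        # in production this loop never ends
        if condition(record):
            alert(record)

incoming = [
    {"sensor": "temp", "value": 72},
    {"sensor": "rpm",  "value": 1500},
    {"sensor": "temp", "value": 95},   # this one should trigger the alert
]
monitor(incoming, temperature_too_high,
        lambda r: print(f"ALERT: {r}"))
```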

In addition to the three "real-time" elements, David also listed seven other points, which can be summed up as:

Automatic discovery of valuable information in data

Relying on manual search and manual reporting also hurts analysis efficiency. Data mining and predictive analytics tools are rapidly moving toward the ability to use big data as the source database for analysis, or as a database to monitor continuously for changes. All data mining tools follow this pattern: someone identifies the purpose of the analysis, looks at the data, and then develops a statistical model that provides insights or predictions. Those statistical models then need to be deployed in the big data environment to perform continuous evaluation, and that part of the operation should be automated.
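One way to read "deploy the statistical model for continuous evaluation" is sketched below: a model is fitted once by an analyst, then packaged as a plain scoring function that is applied automatically to every new batch of data. The model and the numbers are deliberately trivial stand-ins.

```python
# Offline step: an analyst fits a model (here, just a mean and a cutoff).
def fit(history):
    mean = sum(history) / len(history)
    return {"mean": mean, "cutoff": mean * 1.5}

# Deployed step: the fitted model scores new batches with no human involved.
def score_batch(model, batch):
    return [x for x in batch if x > model["cutoff"]]

model = fit([10, 12, 9, 11, 10])           # analyst's one-time work
for batch in ([10, 25, 11], [9, 10, 40]):  # arriving data, batch after batch
    outliers = score_batch(model, batch)
    if outliers:
        print("unusual values:", outliers)
```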

Provides strong, specific reporting and analysis

Similar to knowledge discovery and automated data mining, analysts need access to retrieve and summarize the information in the cloud big data environment. The number of vendors offering big data reporting tools seems to grow every day. Cloud-based big data providers should support both Pig and HQL statements from external requesters, so that the big data store can be queried by people using the tools of their choice (even tools that have not been created yet).
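As an illustration of an "external requester" submitting an HQL statement, here is a sketch using the third-party pyhive package, which is one common way to reach a HiveServer2 endpoint from Python; the host, port, table, and credentials are placeholders, not values from the article.

```python
# Sketch: submit an HQL statement from an external tool via pyhive
# (pip install pyhive). All connection details below are placeholders.
from pyhive import hive

conn = hive.Connection(host="bigdata.example.com", port=10000,
                       username="analyst")
cursor = conn.cursor()

# Any tool that can emit HQL can summarize the store the same way.
cursor.execute("""
    SELECT page, COUNT(*) AS clicks
    FROM web_clicks
    GROUP BY page
    ORDER BY clicks DESC
    LIMIT 10
""")
for page, clicks in cursor.fetchall():
    print(page, clicks)
conn.close()
```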

Provides the ability to quickly build custom dashboards and views

As in the evolution of traditional business intelligence projects, once people can query big data and generate reports, they want to automate that function and create dashboards full of pretty pictures that they can view again and again. Unless people write their own Hive statements and use only the Hive shell, most tools can build dashboard-like views from query statements. It is too early to cite many dashboard examples from big data deployments, but one prediction, based on the history of business intelligence, is that dashboards will become an important internal delivery vehicle for aggregated big data. And judging from how business intelligence developed historically, having good big data dashboards is essential to gaining and maintaining top-level leadership support.
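In that spirit, a dashboard panel is essentially a saved query plus a rendering step. The sketch below renders a text-only panel from aggregated query results; the page names and counts are invented, and the "query result" is hard-coded where a real dashboard would re-run the query.

```python
# A dashboard panel: a saved query's results plus a rendering step.
def render_panel(title, rows):
    print(title)
    scale = max(count for _, count in rows)
    for label, count in rows:
        bar = "#" * int(20 * count / scale)
        print(f"  {label:<12} {bar} {count}")

query_result = [("/home", 812), ("/search", 407), ("/checkout", 129)]
render_panel("Clicks by page (last hour)", query_result)
```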

Efficient scaling on commodity hardware to support any volume of data

This consideration matters less when you use a cloud big data service: procuring, provisioning, and deploying the hardware that stores the data is the service provider's responsibility. The hardware choice should be an easy one, and it is reassuring that big data is designed to run on ordinary, commodity hardware. A "higher-quality" server is useful for a few nodes in the architecture, but most nodes in a big data architecture (the ones that store the data) can sit on "lower-quality" hardware.

Provides granular, role-based security and access control

When unstructured data was kept in relational systems, the complexity of getting at it could keep people from ever acquiring the data, and common reporting tools did not work on it. Moving to big data is a valid step toward simplifying that access. Unfortunately, the security settings of existing relational systems usually do not migrate to big data systems, and the more you use big data, the more important good security becomes. Initially, security may be minimal because no one yet knows how to handle big data, but as a company develops more analyses on big data, the results (especially reports and dashboards) need to be protected, much as reports from the current relational systems are. Start using cloud-based big data, and decide when you will need to apply security.
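As a minimal illustration of role-based access control over reports and dashboards, the sketch below maps roles to the reports they may see; the roles and report names are invented for the example.

```python
# Role-based access control: each role maps to the reports it may see.
PERMISSIONS = {
    "executive": {"revenue_dashboard", "ops_summary"},
    "analyst":   {"ops_summary", "raw_event_counts"},
}

def fetch_report(role, report):
    if report not in PERMISSIONS.get(role, set()):
        raise PermissionError(f"role '{role}' may not view '{report}'")
    return f"contents of {report}"      # stand-in for the real report

print(fetch_report("analyst", "ops_summary"))      # allowed
try:
    fetch_report("analyst", "revenue_dashboard")   # denied
except PermissionError as e:
    print(e)
```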

Support for multi-tenancy and flexible deployment

Using the cloud introduces the concept of multi-tenancy, which obviously is not a consideration in an in-house big data environment. Many people are uneasy about putting critical data in the cloud, but what matters is that the cloud provides the low cost and rapid deployment needed to get big data projects started. It is precisely because the cloud provider puts the data on an architecture with shared hardware resources that the cost drops so significantly. It is a fair trade-off: you can instead put the data on your own servers and have someone else manage the setup, and that works well too. However, when big data needs are intermittent, that is not a cost-effective business model. The result will be higher spending, because the company pays for a lot of idle time, especially during the first projects, while analysts are exploring, experimenting with, and learning big data.
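The cost argument can be made concrete with back-of-the-envelope arithmetic; every number below is an invented assumption for the sake of the comparison, not a quoted price.

```python
# Illustrative cost comparison: pay-per-use cloud vs. dedicated hardware.
# All figures are invented assumptions for the sake of the arithmetic.
hours_in_month  = 730
busy_hours      = 80         # intermittent use: analysts explore occasionally
cloud_rate      = 4.00       # $/hour for the cluster while it runs
dedicated_month = 2000.00    # $/month to own and operate similar capacity

cloud_cost = busy_hours * cloud_rate
print(f"cloud:     ${cloud_cost:.2f}/month")       # $320.00
print(f"dedicated: ${dedicated_month:.2f}/month")  # paid even when idle
idle_share = 1 - busy_hours / hours_in_month
print(f"dedicated hardware sits idle {idle_share:.0%} of the time")
```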

Provides APIs for integration and extension

Big data is designed to be accessed by custom applications, and the common access method is a RESTful application programming interface (API). These APIs can be used by every application in the big data environment for administrative control, for storing data, and for reporting on data. Because all the underlying components of big data are open source, these APIs are fully documented and widely used. Expect a cloud-based big data provider to allow access to all current and future APIs, with appropriate security.
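For example, Hadoop's HDFS ships with the WebHDFS REST API. The sketch below lists a directory through it; the hostname and path are placeholders, and the port (9870) is the usual Hadoop 3 NameNode HTTP default, so adjust both for a real cluster.

```python
# Sketch: REST-style access to a big data store via WebHDFS, the REST API
# that ships with Hadoop's HDFS. Host and path below are placeholders.
import json
import urllib.request

url = ("http://namenode.example.com:9870"
       "/webhdfs/v1/data/incoming?op=LISTSTATUS")
with urllib.request.urlopen(url) as resp:
    listing = json.load(resp)

for status in listing["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["length"], status["type"])
```

The same style of request works for the other operations the API documents (opening, creating, and renaming files), which is what makes REST access attractive for custom applications.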
