Ten factors to consider when setting up a big data environment in the cloud


Big data is now a widely recognized concept in IT. As with many technologies in this field, it was first adopted by large enterprises, with small and medium-sized businesses following later in the adoption curve. Big data appears to have gone through the same process.

As big data continues to evolve in the real world, it is increasingly being applied to data that is not especially big. Datasets that are small by most standards are now routinely handled with big data tools, in ways that are specific to big data architectures.

Still, there is a consensus that there will be more data in the future, not less: more data sources will feed data to businesses, and the flow of data will keep growing. That is where big data comes from. One open question is where that big data should live (on premises or in the cloud) and when you should consider using cloud services for it.

Defining a cloud-based big data solution

As with most cloud solutions, a precise definition can be tricky. Big data solutions in the cloud come in many different forms, and no single definition is universal (though some definitions are better than others).

First, some definitions. A big data condition is reached when the volume, variety, and velocity of incoming data become too great to process and use in real time with a conventional relational database. Deploying big data technologies is an attempt to handle that condition and put the data to productive use, which means using different hardware and organizing the data in new ways for fast storage and fast reads. That is the nature of big data.

It is also the reason projects and products such as Apache Hadoop and MapReduce exist. A cloud-based big data environment needs to be able to reference external data sources, such as enterprise resource planning systems and other internal databases, and refresh that data periodically. ("External" here means external to the big data sandbox.)
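To make the MapReduce idea concrete, here is the canonical word-count job in Java: the mapper emits a count of 1 for each word it sees, and the reducer sums the counts per word across the whole dataset. This is a standard illustration of the programming model, not anything specific to a particular scenario; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum all the 1s emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The point of the model is that the framework, not the programmer, handles distributing the work across the cluster and moving intermediate results between the map and reduce phases.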

That step covers storing the data. Next, you need a way to analyze how the data affects your business processes and to present the results of that analysis.

A big data service needs to be able to see a variety of data sources outside the data center, pull new data into the data center, accommodate new data elements that were never anticipated, and provide a way to analyze and report on all of it. These requirements for scale, flexibility, and elasticity make cloud services especially well suited to big data environments.

Getting started with a cloud-based big data project

The following considerations cover the basic criteria for evaluating a big data project. Start small, experiment, and keep learning: the more clearly you define the information you want to get out of big data, the more targeted your experiments will be and the faster you will build up the necessary skills.

1. Establish a common real-time index for all machine data

This is the core of what most people mean by big data, and it is often equated with the open-source project Hadoop. Do not confuse indexing in Hadoop with indexing in a relational database: a Hadoop index is a file index, which is why Hadoop can ingest many different types of data.

Companies may already be inundated with feeds from radio-frequency identification (RFID) readers, mobile devices, site clicks, and other potentially structured data (structured, that is, if IT staff took the time to convert the feeds and load them into relational databases). If you knew how the data would be used, and how it would be queried and accessed in the future, that conversion work would be a worthwhile investment.

Hadoop provides a solution when the future uses of the data are unknown. By ingesting incoming data as-is, a big data system defers the data-definition step until the analysis is actually run. Without limiting how the data can be used later, Hadoop distributes it across many servers and keeps track of where each piece is located.
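As a minimal sketch of that "store first, define later" approach, the following Java snippet writes an incoming feed into HDFS exactly as received, imposing no schema. The cluster URI and target path are hypothetical placeholders, not anything prescribed by Hadoop.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RawFeedLoader {
  public static void main(String[] args) throws Exception {
    // Cluster URI and landing path are illustrative placeholders.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    Path target = new Path("/landing/rfid/feed.log");

    // Copy the raw feed (here, standard input) into HDFS byte for byte.
    // HDFS splits the file into blocks and replicates them across data
    // nodes; the NameNode keeps track of where each block lives.
    FSDataOutputStream out = fs.create(target);
    IOUtils.copyBytes(System.in, out, 4096, true);  // 'true' closes both streams
    fs.close();
  }
}
```

Nothing here decides what the feed "means"; that interpretation is left to whatever analysis job reads the file later.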

2. Freely search and analyze real-time and historical data

Storing the data is only part of the journey; the information also needs to be reasonably easy to find. The quickest way to get there (quick in terms of implementation, not response time) is to provide a search function, so you need tools that support text search over unstructured data. Apache Lucene is a common choice for text indexing and search in big data environments.
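As an illustration of the kind of free-text search Lucene enables, here is a minimal sketch (roughly the Lucene 8.x API) that indexes a few unstructured log lines in memory and then runs a text query against them. The field name and sample lines are invented for the example; a real deployment would index files stored across the cluster rather than an in-memory directory.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class FreeTextSearch {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();  // in-memory index for the demo
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // Indexing: one document per raw line, no schema beyond a text field.
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      for (String line : new String[] {
          "RFID tag 4711 read at dock 3",
          "sensor 12 temperature above threshold",
          "page /checkout clicked from mobile"}) {
        Document doc = new Document();
        doc.add(new TextField("body", line, Field.Store.YES));
        writer.addDocument(doc);
      }
    }

    // Searching: parse a free-text query and print matching lines.
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = new QueryParser("body", analyzer).parse("temperature");
      for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
        System.out.println(searcher.doc(hit.doc).get("body"));
      }
    }
  }
}
```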

Getting responses back from such a tool gives you at least rough confirmation that the information has been stored correctly and is accessible. The administrative step in this process is indexing the content of the data stored on the distributed nodes; search queries then hit those indexes on the distributed nodes in parallel, which provides faster responses.

3. Automatically find useful information in the data

This is an important business reason for adopting big data. Just as it is inefficient to migrate all semi-structured data into a relational database, relying on manual search and manual reporting also limits analysis efficiency.

Data mining and predictive analytics tools are rapidly gaining the ability to use big data as the data source for analysis, or as a database to monitor continuously for changes. All data mining tools follow the same pattern: someone identifies the purpose of the analysis, examines the data, and then develops a statistical model that yields insights or predictions. That statistical model then needs to be deployed in the big data environment for continuous evaluation, and this part of the operation should be automated.
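As a hedged illustration of that deployment step, the sketch below applies a pre-trained logistic-regression model to each incoming record as it arrives. The feature names and weights are invented, standing in for coefficients exported from an offline data-mining tool; the point is only that scoring runs automatically inside the data flow rather than by hand.

```java
import java.util.Map;

public class ChurnScorer {
  // Coefficients assumed to come from the offline modeling step (hypothetical).
  private static final double INTERCEPT = -1.2;
  private static final Map<String, Double> WEIGHTS = Map.of(
      "visits_last_week", 0.4,
      "support_tickets", 0.9);

  /** Scores one incoming record; returns a probability in [0, 1]. */
  static double score(Map<String, Double> record) {
    double z = INTERCEPT;
    for (var w : WEIGHTS.entrySet()) {
      z += w.getValue() * record.getOrDefault(w.getKey(), 0.0);
    }
    return 1.0 / (1.0 + Math.exp(-z));  // logistic link function
  }

  public static void main(String[] args) {
    // One sample record; in practice this call sits inside the ingest path.
    double p = score(Map.of("visits_last_week", 1.0, "support_tickets", 3.0));
    if (p > 0.8) {
      System.out.println("flag record for follow-up, p=" + p);
    }
  }
}
```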

4. Monitor data and provide real-time alerts

Look for a tool that can monitor the data arriving in your big data environment. Some tools let you create a query that is processed continuously, watching for specified conditions to be met.

It is impossible to list all the potential uses of real-time monitoring of data entering Hadoop. If most of the incoming data is unstructured and unsuitable for a relational database, real-time monitoring may be one of the most thorough ways to examine individual data elements.

For example, you can raise a warning whenever RFID tags attached to frozen foods are detected in a non-frozen area. The warning can be sent directly to a warehouse worker's mobile device so the goods can be moved before they spoil. A sketch of this idea follows below.
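Here is a minimal sketch of that RFID alert, treated as a standing condition evaluated against each incoming reading. The event fields, zone names, and the plain loop (standing in for a real event stream and stream processor) are all illustrative assumptions, not a prescription for any particular product.

```java
import java.util.List;
import java.util.Set;

public class ColdChainMonitor {
  // Shape of one RFID reading; fields are invented for the example.
  record RfidReading(String tagId, String productClass, String zone) {}

  private static final Set<String> FROZEN_ZONES = Set.of("freezer-1", "freezer-2");

  /** The "continuous query": frozen goods seen outside a frozen zone. */
  static boolean violates(RfidReading r) {
    return r.productClass().equals("frozen") && !FROZEN_ZONES.contains(r.zone());
  }

  public static void main(String[] args) {
    List<RfidReading> incoming = List.of(
        new RfidReading("tag-01", "frozen", "freezer-1"),
        new RfidReading("tag-02", "frozen", "aisle-7"));  // misplaced item

    for (RfidReading r : incoming) {  // stand-in for a real event stream
      if (violates(r)) {
        // In practice this would push to a worker's mobile device or pager.
        System.out.printf("ALERT: %s (%s) detected in %s%n",
            r.tagId(), r.productClass(), r.zone());
      }
    }
  }
}
```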

You could also track customers as they move around a store and play advertisements, on strategically placed monitors, to shoppers standing in front of a particular product. (This is very trendy, and perhaps a bit Big Brother, but it is entirely possible.)
