Development Experience with Big Data Processing


Big data is not a new topic, but how to optimize and tune applications for large-scale data processing remains an important question in day-to-day development and architecture work. Recently, consultants Fabiane Nardon and Fernando Babadopulos shared their experience on the subject in the "Java Magazine" electronic newsletter.

The authors first emphasize the importance of the big data revolution:

The big data revolution is underway and it is time to get involved. The amount of data generated by enterprises grows every day, and publicly available data sets that can be reused to discover new information are increasingly common. Combined with cheap, on-demand cloud computing, this puts you in a new world of seemingly unlimited possibilities. It is not hard to imagine that many disruptive applications will emerge by leveraging the power of big data technologies in the cloud. Many startups now offer exciting new applications that could not have attracted enough funding just a few years ago. As Java developers, we are well positioned to take part in this revolution, because many of the most popular big data tools are Java-based. However, to build truly scalable and powerful applications while keeping hosting costs under control, we have to rethink our architectures without getting lost in the myriad of available tools.

The article then introduces Apache Hadoop, a framework for the distributed processing of large data sets and probably the best known of these tools. Besides providing a robust MapReduce implementation and a reliable distributed file system, the Hadoop Distributed File System (HDFS), Hadoop is the foundation of a whole ecosystem of big data tools, including the following (a minimal MapReduce job is sketched after the list):

• Apache HBase, a distributed database for large tables.

• Apache Hive, a data warehouse infrastructure that allows ad hoc, SQL-like queries over data stored in HDFS.

• Apache Pig, a high-level platform for creating MapReduce programs.

• Apache Mahout, a machine learning and data mining library.

• Apache Crunch and Cascading, frameworks for creating MapReduce pipelines.
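The original article ships its own code samples; as a stand-in, here is a minimal sketch of the classic Hadoop word-count job, written against the org.apache.hadoop.mapreduce API, purely to make the MapReduce programming model concrete (class names and input/output paths are illustrative, not taken from the article):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every word in every input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a JAR, a job like this would be submitted with something like "hadoop jar wordcount.jar WordCount <input dir> <output dir>", either on a cluster or on a single machine in local mode.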

Although these tools are powerful, they also add a lot of overhead that you cannot offset unless your data set is really large. For example, try running the code examples provided in the article on a very small data set, such as a file with only one line: you will see that the processing time is much longer than you would expect. How do you determine whether you really have a big data problem? The article points out that although there is no fixed threshold, several indicators can help you decide whether your data is big enough:

All of your data does not fit on a single machine, which means you need a cluster of servers to process it within an acceptable time.

You are mostly dealing with terabytes rather than gigabytes of data.

The amount of data you are processing keeps growing and is likely to double every year.

If your data is not really big, keep things simple. You will probably save time and money by processing it with a traditional Java application, or even with simpler tools such as grep or awk (a plain Java sketch of this approach follows).
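As a hypothetical illustration of that advice, the plain Java program below scans a single log file and counts the lines containing a given term, roughly what "grep -c" does; the search term and file name are just example arguments, not anything from the article:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Roughly the Java equivalent of: grep -c ERROR app.log
    public class GrepCount {
        public static void main(String[] args) throws IOException {
            String term = args[0];   // e.g. "ERROR"
            String file = args[1];   // e.g. "app.log"
            long count = 0;
            try (BufferedReader reader = Files.newBufferedReader(
                    Paths.get(file), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.contains(term)) {
                        count++;
                    }
                }
            }
            System.out.println(count);
        }
    }

For data that fits comfortably on one machine, a program like this finishes in seconds and has none of the job-setup overhead a Hadoop cluster would impose.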

The authors stress that if you do decide to use Hadoop to analyze your data set, you will want to avoid performance bottlenecks as the data grows. There is plenty of configuration tuning you can apply to a Hadoop cluster, and you can always add more nodes if your application is not processing data as fast as you need. However, keep in mind that nothing will have a greater impact on your big data application than making your own code faster.

When you implement a big data application, your code is typically executed millions or billions of times in a single processing cycle. Consider, for example, that you have to process a 10 GB log file whose lines are 500 bytes long: the code that analyzes each line will run 20 million times. If you can make that code process each line just 10 microseconds faster, you shave 20 million × 10 µs = 200 seconds, about 3.3 minutes, off the processing of that file. Since you may have to process many 10 GB files every day, those minutes add up over time to significant savings in resources and time.

The lesson here is that every microsecond counts. Choose the fastest Java data structures for your problem, use caches where possible, avoid unnecessary object instantiation, use efficient string manipulation, and apply your Java programming skills to produce the most efficient code you can (see the mapper sketch below).
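As a hypothetical illustration of these micro-optimizations inside a Hadoop mapper, the sketch below parses a tab-delimited log line with indexOf and substring instead of String.split (which compiles a regex and allocates an array on every call), and reuses its output key and value objects instead of allocating new ones per record. The field layout and class names are assumptions made for the example:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Extracts the first tab-delimited field of each log line (assumed to be a user id)
    // and emits (userId, 1), avoiding per-record allocations where possible.
    public class LogLineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);  // reused for every record
        private final Text userId = new Text();                       // reused for every record

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // indexOf/substring instead of line.split("\t"): no regex, no array allocation.
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                return;  // malformed line: skip it so one bad record does not fail the whole job
            }
            userId.set(line.substring(0, tab));
            context.write(userId, ONE);
        }
    }

Each saving looks tiny in isolation, but multiplied by the 20 million invocations in the example above it is exactly where the minutes come from.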

In addition to producing efficient code, understanding how Hadoop works is important for avoiding common mistakes.

Deploying big data applications in the cloud has many advantages. The article points out that as the volume of data grows, you can buy more machines on demand and be ready for spikes. However, if you want to use cloud services without the costs growing just as fast, you have to keep the specifics of the cloud platform in mind when building your application. First of all, more efficient code means less processing time and therefore lower hosting costs. Every node you add to your cluster adds cost, so it pays to make sure your code runs as efficiently as possible.

When deploying big data applications in the cloud, it is important to consider a shared-nothing architecture. A shared-nothing architecture is essentially a set of independent machines connected only by the network, sharing no disks and no memory. That is what makes this architecture so scalable: there is no bottleneck caused by contention for disk access or for resources held by another process. Each machine takes care of its own share of the work and is independent and self-sufficient.

Hadoop's fault tolerance opens up the possibility of using even cheaper cloud machines, such as Amazon Spot Instances (which you can lose whenever the spot price rises above your bid). When you use such machines to run TaskTrackers, for example, you can afford to lose them at any moment: Hadoop detects that one or more working nodes have been lost and reruns the affected work on other nodes.

In fact, in many big data applications it is even acceptable to lose a small amount of data. If you are doing statistical processing, a small portion of the data set going unprocessed will usually not affect the final result, and you can use this to your advantage when designing your architecture.

You can also host your application on a cloud service with built-in Hadoop support; Amazon EMR is a good example of such a service. Using a managed Hadoop service relieves you of the burden of installing and maintaining your own Hadoop cluster. If you need more flexibility, however, you can still install your own Hadoop setup in the cloud.

The article points out that another benefit of running Hadoop in the cloud is that you can monitor the behavior of your jobs and automatically add or remove nodes as needed, even while a job is running; Hadoop's fault tolerance ensures that everything keeps working. The trick is to preconfigure the master node with a range of IP addresses for the slave nodes, which is done through the conf/slaves file in the Hadoop installation directory. With this configuration in place, you can start a new node on one of the pre-configured IP addresses and it will automatically join the cluster (an illustrative slaves file is shown below).
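For illustration, a conf/slaves file on the master node might simply list the addresses you plan to launch worker nodes on, one per line; the addresses below are placeholders, not values from the article:

    # conf/slaves: one worker hostname or IP address per line (placeholder addresses)
    10.0.1.10
    10.0.1.11
    10.0.1.12
    10.0.1.13

A new machine brought up on one of these addresses, with Hadoop installed and pointed at the same master, can then be picked up by the cluster without reconfiguring the master node.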
