Reposted from: http://www.infoq.com/cn/news/2015/02/apache-samza-top-project
Apache Samza is an open-source distributed stream processing framework. It uses the open-source distributed messaging system Apache Kafka for messaging, and the resource manager Apache Hadoop YARN for fault tolerance, processor isolation, security, and resource management. According to the official Apache blog, after an 18-month incubation period Samza has graduated to become an Apache top-level project. LinkedIn open-sourced Samza in September 2013 and contributed it to Apache as an incubator project.
Kevin Scott, senior vice president of engineering and operations at LinkedIn, said in a blog post:
It is exciting to see Samza widely adopted and become an Apache top-level project. Samza was developed to help address the high-performance challenges of stream processing at LinkedIn, and it has become a core part of LinkedIn's business architecture.
Garry Turkington, CTO of Improve Digital, said in a blog post:
Improve Digital has accumulated a wealth of experience with Samza, which has allowed it to build a powerful streaming data processing platform on top of it. It is also great to see Samza graduate to an Apache top-level project.
Samza is well suited to real-time stream processing (much like Apache Storm), for example data tracking, log services, and real-time services. It helps developers process messages at high speed while also providing good fault tolerance. In a Samza data flow, each Kafka cluster is connected to a cluster that runs YARN and executes the Samza jobs.
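As a rough illustration of that flow, the sketch below routes keyed messages into partitions and lets one logical task handle each partition in order. It is an in-memory stand-in only: it does not use Kafka or YARN, and `NUM_PARTITIONS`, `partition_for`, and `run_job` are hypothetical names chosen for this example rather than anything from Samza.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 2  # hypothetical partition count for this sketch


def partition_for(key: str) -> int:
    """Pick a partition from the message key (stable hash, illustrative only)."""
    return crc32(key.encode("utf-8")) % NUM_PARTITIONS


def run_job(messages, process):
    """Route keyed messages into partitions, then process each partition in order.

    `messages` is an iterable of (key, value) pairs standing in for an input topic.
    `process` is applied to every value; the returned list stands in for an output topic.
    """
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key)].append((key, value))

    output = []
    # One logical task per partition; order within a partition is preserved.
    for partition_id in sorted(partitions):
        for key, value in partitions[partition_id]:
            output.append((key, process(value)))
    return output


page_views = [("alice", "home"), ("bob", "jobs"), ("alice", "profile")]
results = run_job(page_views, str.upper)
alice_pages = [page for user, page in results if user == "alice"]
print(alice_pages)  # ['HOME', 'PROFILE']
```

Because messages with the same key always land in the same partition, the two events for "alice" are seen in the order they were written, whichever partition that turns out to be.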
The main features of Samza are as follows:
- Simple API: Samza provides a simple callback-based message processing API comparable to MapReduce.
- State management: Samza provides a LevelDB-based key/value store for historical data, enabling stateful message processing.
- Fault tolerance: whenever a machine in the cluster fails, YARN transparently migrates the affected tasks to other machines.
- Durability: Samza uses Kafka to guarantee that messages are processed in order and persisted to partitions, so that no messages are lost.
- Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, appendable, fault-tolerant streams; YARN provides a distributed environment in which Samza containers run.
- Pluggable: Samza provides a pluggable API that lets it work not only with Kafka and YARN but also with other messaging systems and execution environments.
- Resource isolation: by using YARN, Samza supports the Hadoop security model and resource isolation.
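To make the first two features more concrete, here is a minimal sketch of a callback-style task that keeps per-key state in a local key/value store. Samza's real API is written in Java and its store is LevelDB-based; every class and method name below (`InMemoryStore`, `Collector`, `PageViewCounter`, `process`) is a hypothetical Python stand-in, and the store is a plain dictionary.

```python
class InMemoryStore:
    """Hypothetical stand-in for a task-local key/value store."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class Collector:
    """Hypothetical sink that gathers the messages a task emits downstream."""

    def __init__(self):
        self.sent = []

    def send(self, stream, key, value):
        self.sent.append((stream, key, value))


class PageViewCounter:
    """A task: the framework calls process() once per incoming message."""

    def __init__(self, store):
        self.store = store

    def process(self, key, value, collector):
        # Read, update, and write back the running count for this key.
        count = self.store.get(key, 0) + 1
        self.store.put(key, count)
        collector.send("page-view-counts", key, count)


store, collector = InMemoryStore(), Collector()
task = PageViewCounter(store)
# This loop plays the role of the framework driving the callback.
for user, page in [("alice", "home"), ("bob", "jobs"), ("alice", "profile")]:
    task.process(user, page, collector)

print(collector.sent[-1])  # ('page-view-counts', 'alice', 2)
```

The task itself only implements `process`; reading input, invoking the callback, and delivering output are left to the surrounding framework, which is the division of labor the callback-based design is meant to provide.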
Leading companies such as LinkedIn, Microsoft, Confluent, Oracle, Hortonworks, Uber, and Improve Digital are contributing code to Samza. Samza is widely used in business intelligence (BI), financial services, healthcare, security services, mobile applications, software development, and other industries. Its users include enterprise mobile application provider DoubleDutch, Europe's leading real-time advertising technology provider Improve Digital, financial services company Jack Henry & Associates, mobile commerce solutions provider MobileAware, cloud-based microservices provider Quantiply, social media business intelligence solution provider VinTank, and more.
Besides Samza, real-time and stream computing frameworks include Google Dremel, Apache Drill, Apache Storm, and Apache S4. Interested readers can try Samza through the official Hello Samza project, or see the background page for more information. Readers can also read a blog post by LinkedIn senior SRE Jon Bringhurst, which focuses on how LinkedIn scales with Samza, YARN, and Kafka, to get to know Samza in more depth.