From LinkedIn and Apache Kafka to the Unix Philosophy


Original link: http://www.confluent.io/blog/apache-kafka-samza-and-the-unix-philosophy-of-distributed-data
Author: Martin Kleppmann
Translator: Jay Micro Magazine (macsokolot@gmail.com)

While researching my book, I realized that modern software engineering still has a lot to learn from the 1970s. In such a fast-moving field, we tend to assume that old ideas are useless, and as a result we end up learning the same lessons over and over again. Even though computers have become faster, data volumes have grown, and requirements have become more complex, many old ideas are still highly relevant.


In this article I want to highlight one old idea that deserves more attention today: the Unix philosophy. I'll show how this philosophy differs from the way mainstream databases are designed, and explore what it would look like if modern distributed data systems learned a thing or two from Unix.



In particular, I see many parallels between Unix pipes and Apache Kafka, and these parallels give large-scale applications good architectural properties. But before we get to that, let me briefly review the basics of the Unix philosophy. You have probably seen the power of Unix tools before, but I'll start with a concrete example that we can all refer to.
Suppose you have a web server that writes an entry to a log file every time it serves a request. If you use nginx's default access log format, one line of that log might look like this:


216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1" 200
3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"


(This is really a single line; it has only been wrapped onto multiple lines here for readability.) This log line says that on 27 February 2015 at 17:55:11, the server received a request for the file /css/typography.css from the client IP address 216.58.210.78. It also records various other details, including the browser's user-agent string.


Many tools can take log files like this and produce reports about your website traffic, but for the sake of practice, let's build our own out of basic Unix tools, and use it to find the five most popular URLs on our website. First, we need to extract the requested URL path, and for that we can use awk.


awk doesn't know about nginx's log format; it simply treats the log file as plain text. By default, awk processes one line of input at a time, splits it on whitespace, and makes the whitespace-separated fields available as variables. In the nginx log example, the requested URL path is the seventh whitespace-separated field.
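A minimal sketch of that extraction step, assuming the log shown above is saved in a file called access.log:

# Print only the seventh whitespace-separated field of each line, i.e. the URL path.
awk '{print $7}' access.log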



Now that we can extract the path, we can determine the five most popular pages on the site.
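One way to build that pipeline, again assuming the log file is called access.log:

# awk extracts the URL path (field 7); sort and uniq -c count how often each path
# occurs; sort -rn orders the paths by count, largest first; head keeps the top five.
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -n 5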


When this sequence of commands runs, its output looks like this:


4189 /favicon.ico
3631 /2013/05/24/improving-security-of-ssh-private-keys.html
2124 /2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
1369 /
915 /css/typography.css



If you're not familiar with Unix tools, this chain of commands may look a bit obscure, but it is remarkably powerful. It can process gigabytes of log files in a matter of seconds, and you can easily modify the analysis to suit your needs. For example, if you want to count the most frequent client IP addresses rather than the most popular pages, you only need to change the awk argument to '{print $1}'.
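That variation, using the same hypothetical access.log file:

# $1 is the client IP address, the first whitespace-separated field of each log line.
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -n 5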


By combining tools such as awk, sed, grep, sort, uniq and xargs as needed, ad hoc analysis of large amounts of data can be put together in minutes, and its performance is surprisingly good. This is no coincidence; it is a result of the Unix design philosophy.



The Unix philosophy is a set of design principles that emerged during the design and implementation of Unix in the late 1960s and early 1970s. The Unix philosophy has been elaborated in many ways, but two points stand out, as described by Doug McIlroy, Elliot Pinson and Berk Tague in 1978:


1. Make each program do one thing well. To do a new job, build a new program rather than complicating an old program by adding new "features".


2. Expect the output of every program to become the input to another, as yet unknown, program.


These principles are the foundation for connecting programs into pipelines that can accomplish complex processing tasks. The key idea is that a program neither knows nor cares where its input comes from or where its output goes: it may be a file, another program that is part of the operating system, or a program written by someone else entirely.



The tools that ship with the operating system are general-purpose, but they are designed so that they can be combined into larger programs that perform specific tasks.


The benefits that the Unix designers derived from this approach sound much like the ideas of the agile and DevOps movements that appeared decades later: scripting and automation, rapid prototyping, incremental iteration, being friendly to experimentation, and breaking large projects down into manageable chunks. Plus ça change.



When you join two commands with the pipe character in your shell, the shell starts both programs at the same time and connects the output of the first to the input of the second. The connection mechanism is provided by the operating system through the pipe system call.

Note that this wiring is not done by the programs themselves but by the shell, which keeps the programs loosely coupled: they do not need to worry about where their input comes from or where their output goes.
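As a minimal illustration (the file name is hypothetical), the shell below starts sort and head as two separate processes and connects them with a pipe; neither program knows anything about the other:

# The shell, not the programs, creates the pipe and wires stdout to stdin.
sort scores.txt | head -n 3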



Pipes were invented by Doug McIlroy, who first described them in 1964 in an internal Bell Labs memo: "We should have some ways of coupling programs like garden hose: screw in another segment when it becomes necessary to massage data in another way." Dennis Ritchie later wrote up this vision in a memo.



They also realized early on that the mechanism for inter-process communication (pipes) could look very similar to the mechanism for reading and writing files. We now call this input redirection (using the contents of a file as input to a program) and output redirection (writing a program's output to a file).
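A small illustration of both kinds of redirection, with hypothetical file names:

sort < unsorted.txt               # input redirection: read from a file instead of the terminal
sort < unsorted.txt > sorted.txt  # output redirection: write the result to a file instead of the terminal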


The reason Unix programs can be composed so flexibly is that they all conform to the same interface: most programs have one stream for input data (stdin) and two output streams (stdout for ordinary output data, and stderr for errors and diagnostic messages).
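Because the two output streams are separate file descriptors, they can be routed independently, for example (file names hypothetical):

# Ordinary output continues down the pipeline; errors go to a separate file.
grep 'GET' access.log 2> errors.log | wc -l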



Programs can of course do other things besides reading stdin and writing stdout, such as reading and writing files, communicating over the network, or drawing a graphical user interface. However, stdin/stdout communication is considered the main means of getting data to flow from one Unix tool to another.


The great thing is that anyone can implement the stdin/stdout interface easily, in any programming language. You can develop your own tools, and as long as they follow this interface, they can participate in data processing just as effectively as any of the standard tools that ship with the operating system.



For example, say you want to analyze a web server's log file to find out how much of your traffic comes from each country. The log doesn't tell you the country, only the IP address, but you can use an IP geolocation database to translate IP addresses into countries. That database isn't part of your operating system by default, but you can write your own tool that reads IP addresses from stdin and writes the corresponding countries to stdout.
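A rough sketch of such a tool, saved as an executable script; the script name ip-to-country and the lookup file geoip.csv are hypothetical, and a real geolocation database would match address ranges rather than exact addresses:

#!/bin/sh
# ip-to-country: read one IP address per line on stdin,
# write the corresponding country name on stdout.
# geoip.csv is assumed to contain lines of the form "ip,country".
awk -F',' 'NR == FNR { country[$1] = $2; next } { print country[$1] }' geoip.csv -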


Once you've written that tool, you can drop it into the data processing pipeline we discussed earlier, and it will just work. If you've been using Unix for a while this may seem obvious, but I'd like to stress how great it is: your own code runs on equal terms with the tools that come with the operating system.
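Used in the earlier pipeline, it might look like this (assuming the hypothetical script above is on your PATH):

# Top countries by number of requests.
awk '{print $1}' access.log | ip-to-country | sort | uniq -c | sort -rn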


Graphical user interfaces and web applications cannot be extended and strung together so easily. You can't pipe Gmail into a separate search engine app and post the results to a wiki. Today, programs that work together as smoothly as Unix tools do are the exception, not the rule.



Changing topic: at around the same time Unix was being developed, the relational data model was proposed, which soon turned into SQL and made its way into many mainstream databases. Many of those databases actually run on Unix systems. Does that mean they also follow the Unix philosophy?



The dataflow in most database systems looks very different from Unix tools. Instead of using stdin and stdout as communication channels, there is a database server and several clients. Clients send queries to read or write data on the server, the server processes the queries, and it sends responses back to the clients. This relationship is fundamentally asymmetric: client and server are distinct roles.



What do composability and extensibility mean for databases? Clients can do anything they like (because they are application code), but database servers are mostly in the business of storing and retrieving data; running arbitrary code that you have written is not their top priority.


That said, many databases do provide ways of extending the database server with your own code. For example, many relational databases let you write stored procedures in a procedural language such as PL/SQL (and some let you run code in a general-purpose language such as JavaScript). However, what you can do inside a stored procedure is limited.


Other ways of extending a database include support for user-defined data types (an early design goal of Postgres) and pluggable storage engines. Essentially, these are plug-in interfaces: you can run your own code inside the database server, provided that your module conforms to a plug-in interface for one particular purpose defined by the server.


That kind of extensibility is not the same as the composability we saw with Unix tools. The plug-in interface is entirely controlled by, and subordinate to, the database server. The extension code you write is a guest in the database server's home, not an equal partner.



The result of this design is that you can't just pipe one database into another, even if they have the same data model. Nor can you insert your own code into the database's internal processing pipeline (unless the server explicitly provides an extension point for it, such as triggers).


I find this database design very self-centered. A database seems to assume that it is the center of your universe: the only place where you could possibly want to store and query your data, the authoritative source of the data, and the destination of every query. The closest thing to piping that you get is bulk loading and bulk dumping (backup) operations, but those don't really use any of the database's features, such as query planning and indexes.


If databases had followed the Unix design philosophy, they would be based on a small set of core primitives that you could easily combine, extend and replace as needed. Instead, a database is an extremely complex, monolithic beast. Unix likewise acknowledges that the operating system will not do everything you might want out of the box, but it encourages you to extend it with programs of your own, and you may well be able to implement all the functionality of your database system that way.



For a simple application that uses only one database, this design may work fine.


However, many complex applications need to work with their data in a variety of different ways: OLTP requires fast random access, analytics requires large sequential scans, full-text search requires inverted indexes, graph data requires indexes on the connections between records, recommendation engines require machine learning systems, message notifications require push mechanisms, fast reads require the data cached in several different representations, and so on.


A general-purpose database may try to cram all of these features into a single product ("one size fits all"), but nine times out of ten it will not perform as well as a tool that is specialized for one particular function. In practice, you often get the best results by combining several different data storage and retrieval systems: for example, you might keep the same data in a relational database for random access, in Elasticsearch for full-text search, in a columnar format in Hadoop for analytics, and cached in a denormalized form in memcached.


When it comes to integrating different databases, the lack of Unix-style composability is a severe limitation. (I have used pipes to get data out of Postgres and into other applications, but there is a long way to go before we can simply pipe data from any database into any other database.)
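As a sketch of the kind of piping that is possible today (database and table names are hypothetical, and it only works because both ends happen to be Postgres and agree on the CSV format, which is exactly the limitation described above):

# Stream a table out of one Postgres database and straight into another.
psql -d sourcedb -c "COPY events TO STDOUT WITH (FORMAT csv)" |
  psql -d targetdb -c "COPY events FROM STDIN WITH (FORMAT csv)"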



We said that Unix tools are composable because they all implement the same interface: stdin, stdout and stderr. These are file descriptors, that is, byte streams that can be read and written like files. The interface is simple enough that anyone can easily implement it, yet powerful enough that you can use it for anything.


Because all Unix tools implement the same interface, it is called a uniform interface. That is why you can pipe the output of gunzip into wc without a second thought, even though the authors of those two tools probably never spoke to each other. It's like Lego bricks: they all implement the same pattern of studs and sockets, so you can stack them together however you like, regardless of their shape, size or colour.
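For instance (the compressed file name is hypothetical):

# Count the lines in a compressed log without writing the decompressed data to disk;
# gunzip and wc share nothing but the byte-stream interface.
gunzip -c access.log.gz | wc -l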



The uniform interface of file descriptors in Unix applies not just to the input and output of processes; it is a very broadly applicable pattern. If you open a file on the filesystem, you get a file descriptor. Pipes and Unix sockets are also file descriptors, providing communication channels to other programs on the same machine. On Linux, the virtual files under /dev are the interfaces to device drivers, so you can talk to a USB port or even a GPU through them. The virtual files under /proc are an API to the kernel, but because they are exposed as files, you can access them with the same tools you use for ordinary files.
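For example, on Linux you can read kernel state with the same tools you would use on an ordinary text file:

# /proc/cpuinfo is a virtual file provided by the kernel, not a file on disk.
grep 'model name' /proc/cpuinfo | uniq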


Even a TCP connection to a process on another machine is a file descriptor, although the BSD sockets API (most commonly used to establish TCP connections) is arguably not very Unix-like. Plan 9 shows that even the network could be integrated cleanly into the same uniform interface.

The full article is available at: http://www.jointforce.com/jfperiodical/article/1036?f=jf_tg_bky
