How to use coprocessors, non-volatile memory, and interconnects for cloud scaling and storage

Keywords: cloud computing, cloud scaling, coprocessing

Breakthroughs in device technology are enabling a shift from "compute-centric" to more balanced "data-centric" computing infrastructures. This article investigates storage-class memory and shows how it can fill the long-standing performance gap between RAM and rotating-disk storage; details the use of I/O bus coprocessors to process data closer to where it flows; explains how to build low-cost, high-performance interconnection networks with InfiniBand; and discusses scalable storage for unstructured data.

Computer systems engineering has historically been dominated by scaling processors and the dynamic RAM (DRAM) interface to memory, leaving a huge gap between the rate at which data can be accessed and the rate at which it can be computed upon. Interest in data-centric computing is growing rapidly, along with novel system-design software and hardware to support data transformation over very large datasets.

Data-focused applications are without doubt the current center of attention: video analytics, sensor networks, social networks, computer vision and augmented reality, intelligent transportation, and machine-to-machine data initiatives such as IBM's Smarter Planet and Smarter Cities.

At present, the focus is on collecting, processing, transforming, and mining large datasets:

With non-volatile memory (storage-class memory, or SCM), device-level breakthroughs are pushing the focus toward data: much larger datasets can now sit close to the processor, ready to be worked on. At the same time, I/O bus coprocessors move processing closer to the data itself. Finally, low-latency, high-bandwidth, off-the-shelf interconnects such as InfiniBand let researchers quickly build 3D torus and fat-tree clusters of a kind once limited to the most exotic and expensive custom high-performance computing (HPC) designs.

So far, system software, and even system design, remains shaped by outdated bottlenecks and assumptions. Consider, for example, threading and multiprogramming. The whole idea stems from slow disk-drive access: while one program waits for data, what can the machine do except run another program? Of course we have Redundant Array of Independent Disks (RAID) scaling and NAND flash solid-state disks (SSDs), but as an IBM Almaden Research study illustrates, the access-time gaps across the storage hierarchy are enormous when expressed in human terms.
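To make the multiprogramming idea concrete, here is a minimal sketch in C using POSIX threads: one thread blocks on a disk read while the main thread keeps computing. The file name, chunk size, and busy-work loop are illustrative assumptions, not part of the original article.

/* Minimal sketch of hiding disk latency with threads (the core idea of
 * multiprogramming): a reader thread blocks on the disk while the main
 * thread keeps computing. Compile with: cc -pthread overlap.c */
#include <pthread.h>
#include <stdio.h>

#define CHUNK (1 << 20)                 /* 1 MiB per read, illustrative */

static char buf[CHUNK];

static void *read_chunk(void *arg)      /* blocks on disk I/O */
{
    FILE *f = fopen((const char *)arg, "rb");   /* hypothetical data file */
    size_t n = f ? fread(buf, 1, CHUNK, f) : 0;
    if (f) fclose(f);
    return (void *)n;                   /* smuggle the byte count back */
}

int main(void)
{
    pthread_t io;
    /* Start the blocking read; the CPU is now free for other work. */
    pthread_create(&io, NULL, read_chunk, "dataset.bin");

    double acc = 0.0;                   /* useful work overlapped with I/O */
    for (long i = 1; i < 50000000L; i++)
        acc += 1.0 / (double)i;

    void *nread;
    pthread_join(io, &nread);
    printf("computed %f while reading %zu bytes\n", acc, (size_t)nread);
    return 0;
}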

For each class of device, the access-time gap between CPU, RAM, and storage can be quantified with typical performance figures, but the gap is perhaps easier to grasp as a human-scale analogy (as IBM Almaden Research has illustrated).

If a typical CPU operation is like something a human can do in seconds, then a RAM access, roughly 100 times slower, takes minutes of human time. By the same comparison, a random disk access, roughly 100,000 times slower than RAM, takes months (about 100 days). (See Figure 1.)

Figure 1. Data access gap
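To see where these human-scale figures come from, the short program below scales typical device latencies so that a 1 ns CPU operation maps to one human second. The latency values are order-of-magnitude assumptions for illustration, not measurements.

/* Scale device access times so a 1 ns CPU op = 1 human second.
 * Latencies are order-of-magnitude assumptions for illustration. */
#include <stdio.h>

int main(void)
{
    const double cpu_ns  = 1.0;        /* ~1 ns per CPU operation   */
    const double ram_ns  = 100.0;      /* ~100 ns DRAM access       */
    const double disk_ns = 1.0e7;      /* ~10 ms random disk access */

    double ram_s  = ram_ns  / cpu_ns;          /* human seconds */
    double disk_s = disk_ns / cpu_ns;

    printf("RAM : %.0f s  (~%.1f minutes)\n", ram_s, ram_s / 60.0);
    printf("Disk: %.0f s  (~%.0f days)\n", disk_s, disk_s / 86400.0);
    return 0;
}

Run as written, this prints roughly 1.7 minutes for RAM and about 116 days for disk, which is where the "minutes" and "months" in the analogy come from.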

Many experienced computer engineers have never really stopped to consider what 100 to 200 random I/O operations per second (IOPS) means: it is the mechanical limit of a disk drive. (Sequential access can of course reach hundreds of megabytes per second, but random access remains much as it was more than 50 years ago, bounded by 15K RPM seek and rotational latency.)
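That 100-to-200 IOPS ceiling follows directly from the mechanics. Here is a quick back-of-the-envelope sketch; the average seek time is an assumed typical value.

/* Back-of-the-envelope random-IOPS bound for a 15K RPM disk.
 * The average seek time is an assumed typical value. */
#include <stdio.h>

int main(void)
{
    double rpm        = 15000.0;
    double rot_ms     = 0.5 * 60000.0 / rpm;  /* avg wait: half a revolution */
    double seek_ms    = 3.5;                  /* typical average seek, assumed */
    double service_ms = rot_ms + seek_ms;     /* per random request */

    printf("avg service time %.1f ms -> ~%.0f random IOPS\n",
           service_ms, 1000.0 / service_ms);  /* about 180 IOPS */
    return 0;
}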

Finally, as Almaden points out, tape is glacially slow. So why do we keep using these devices? Capacity, of course. But how do we handle the data, or make processing it more efficient?

Let's look at Figure 1 again. Improvements to NAND flash for mobile devices, and more recently SSDs, help narrow the gap; however, as many systems researchers have pointed out, NAND flash technology is widely expected to hit its limits soon. The floating-gate transistor technology it uses has reached its scaling limit, and shrinking it further lowers reliability. So although NAND flash is a useful stopgap on the road to data-centric computing, it is probably not the destination.

Instead, the solution may come from several new non-volatile RAM (NVRAM) device technologies, including:

Phase-change RAM (PCRAM): This memory uses a heating element to switch a chalcogenide material between a crystalline state and an amorphous glass state, storing two programmable, readable states that persist even without power. For M-class synchronous non-volatile memory (NVM), PCRAM currently appears closest to delivering on the technology's promise.

Resistive RAM (RRAM): RRAM is most often described as a memristor, a circuit element distinct from the capacitor, inductor, and resistor that exhibits a unique voltage-current relationship: unlike devices that store charge or magnetic flux, its resistance depends on the current that has previously flowed through it. Materials with this memristive property have been known for decades, but engineers mostly avoided them because of their nonlinear behavior and the lack of applications. Leon Chua of the IEEE described the element in his paper "Memristor: The Missing Circuit Element." Its behavior can be summarized simply: current in one direction increases the resistance, and current in the other direction decreases it, so a non-volatile state can be programmed and later read back.

Spin-transfer torque RAM (STT-RAM): A spin-polarized current is created by passing current through a magnetic layer; when directed at a second magnetic layer, it transfers angular momentum and can change that layer's orientation. This effect can be used to excite oscillations and to flip the orientation of nanoscale magnetic devices. The main drawback is that flipping the orientation requires a relatively high current.

From a systems perspective, as these devices mature, where each is deployed and how well each fills the access gap will depend on the following device characteristics:

Cost; scalability (to beat cache, the device cell must scale below transistor size, to under 20 nm); program and read latency; device reliability; and, perhaps most important, endurance (the number of program/erase cycles a cell tolerates before it becomes unreliable).

Based on these device performance considerations, IBM grouped SCM into two broad categories:

S-class: Accessed asynchronously through an I/O controller. Threading or multiprogramming is used to hide the device's I/O latency.

M-class: Accessed synchronously through a memory controller. An access can be thought of as a RAM access with wait states, during which the CPU core stalls.

In addition, NAND SSDs are considered fast storage in this taxonomy, accessed through block-oriented storage controllers (delivering much higher I/O rates than disk, but similar bandwidth).
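For S-class devices, the asynchronous access pattern survives. The sketch below shows that pattern with POSIX AIO; the device path is hypothetical, and on Linux the program links with -lrt. An M-class access, by contrast, would be an ordinary load instruction with wait states.

/* S-class access pattern: start an asynchronous read through the I/O
 * stack, stay free to compute, then collect the result.
 * The device path is hypothetical. Link with -lrt on Linux. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[4096];
    int fd = open("/dev/scm0", O_RDONLY);   /* hypothetical S-class device */
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    while (aio_error(&cb) == EINPROGRESS)
        ;                          /* in real code: do useful work here */

    ssize_t n = aio_return(&cb);
    printf("read %zd bytes asynchronously\n", n);
    close(fd);
    return 0;
}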

This may sound like the end of asynchronous I/O for data processing (except, of course, for archive access or cluster scaling), and a panacea for data-centric computing. In one sense it is, but system designers and software developers will have to change their habits. The need to hide I/O latency within each node largely, though not completely, disappears. Clusters built on InfiniBand will still use Message Passing Interface (MPI) or MapReduce patterns to handle node-to-node data-movement latency; within an SCM-equipped node, you can expect near-RAM performance, except at startup or when a node's working data exceeds its RAM size.

So, at scale, it is still necessary to hide I/O latency across the cluster interconnect and between the nodes of a cluster.
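As one concrete illustration of hiding node-to-node latency, here is a minimal two-rank MPI sketch in C that starts a nonblocking transfer, computes while the data is in flight, and then waits for completion. Run it with mpirun -np 2; the array sizes are illustrative.

/* Hide internode latency with nonblocking MPI: start the transfer,
 * compute while it is in flight, then wait. Compile with mpicc;
 * run with: mpirun -np 2 ./overlap_mpi */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(int argc, char **argv)
{
    int rank;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                     /* producer starts the send */
        for (int i = 0; i < N; i++) a[i] = i;
        MPI_Isend(a, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {              /* consumer posts the receive */
        MPI_Irecv(a, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    double local = 0.0;                  /* local work overlapped with transfer */
    for (int i = 0; i < N; i++) local += b[i] = i * 0.5;

    if (rank <= 1) MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 1) printf("rank 1 received %d doubles (local=%g)\n", N, local);

    MPI_Finalize();
    return 0;
}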

Coprocessors move processing closer to the data

Fast access to big data sounds ideal and looks promising, but some applications will always benefit from the alternative approach: moving the processing closer to the data interface. There are many examples, such as graphics processing units (GPUs), network processors, protocol-offload engines (such as the TCP/IP offload engine, RAID-on-chip, and cryptographic coprocessors), and the recently emerging concept of a computer-vision coprocessor. My research involves computer vision and graphics coprocessors, both in clusters at scale and in embedded systems. The work I am currently involved in centers on what might be called a computer vision processing unit, and with Khronos's 2012 announcement of OpenVX, it promises to become more than a niche coprocessor.

In the embedded world, this approach might be described as a smart sensor or smart camera, in which raw-data preprocessing is provided at the sensor interface by an embedded logic device, a microprocessor, or even a multi-core system-on-chip (SoC).

In the scalable world, it typically involves placing a coprocessor on a bus or channel adapter (such as PCI Express [PCIe], Ethernet, or InfiniBand) so that data is processed between the data source (the network side) and the node's I/O controller (the host side).

Whether processing is better done on the I/O path or on the host CPU cores is a hotly debated question, but the approach has clearly proven useful in its established forms (GPUs and network processors), which are far more common than the newer coprocessor-based designs. So let's take a quick look at several of these methods:

Single-program, multiple-data (SPMD) vector processing: Currently provided by GPUs, general-purpose GPUs (GP-GPUs), and accelerated processing units (APUs). The idea is that data is either transformed on its way to an output device (such as a monitor), or sent to a GP-GPU/APU and transformed during a round trip between host and device. "General-purpose" implies support for more complex operations, for example double-precision arithmetic rather than the single-precision arithmetic that suffices for much graphics work. (A sketch of the SPMD pattern appears after this list.)

Many-core coprocessors: Traditional multi-core processor cards are available from several vendors. The rationale is to reduce cost and power by putting a large number of simple cores on an I/O bus card and offloading work to it, rather than scaling up an expensive, power-hungry many-core host. A many-core coprocessor card may in fact provide more cores than the host, and it typically includes Gigabit or 10G Ethernet, or other network interfaces of its own.

I/O bus field-programmable gate arrays (FPGAs): Early in development, FPGA cards are often used to prototype a new coprocessor; they can also serve as low-volume coprocessor solutions in their own right.

Embedded SoCs: Multi-core solutions that can be placed in I/O devices to create smart devices such as stereo-ranging or time-of-flight cameras.

Interface FPGAs/configurable programmable logic devices: Digital-logic state machines that provide buffering and continuous I/O data transformation, such as digital video encoding.
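As promised above, here is a minimal CPU-side sketch of the SPMD pattern using OpenMP in C: one program applies the same transform to every data element in parallel, which is the same pattern a GP-GPU kernel runs after a host-to-device copy. The transform and array size are illustrative assumptions.

/* Single-program, multiple-data sketch: the same transform runs over
 * every element in parallel. Compile with: cc -fopenmp spmd.c -lm */
#include <math.h>
#include <stdio.h>

#define N (1 << 20)

static float in[N], out[N];

int main(void)
{
    for (int i = 0; i < N; i++) in[i] = (float)i / N;

    /* On a GP-GPU this loop body would be the kernel; here each
     * CPU thread applies the identical program to its own slice. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        out[i] = sqrtf(in[i]) * 0.5f + 0.25f;  /* illustrative transform */

    printf("out[42] = %f\n", out[42]);
    return 0;
}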
