This article is excerpted from the book "Big Talk Storage 2" by the well-known storage expert donggutou, which will be published in April.
100 Mb/s. What does this rate mean? Some people say it means that 10 MB of data can be transmitted per second (with 8b/10b encoding). That is usually true. However, if the sender and the receiver are very far apart, say several hundred or even a thousand kilometers, you will find that it is not true at all. Let's analyze why.
We all know that light and electrical signals travel at a fixed speed, approximately 300,000 km per second. (In practice they fall well short of that: signals travel at only about 200,000 km per second in optical fiber and about 210,000 km per second in copper cable, roughly 2/3 of the speed of light in a vacuum or in air.) If two points are 1000 km apart, a signal needs 1000 ÷ 300,000 × 2 ≈ 6.6 ms to make the round trip (out to the peer, and then back again carrying the peer's ACK). What does that mean? It means that sending even 1 bit of data to a location a thousand kilometers away takes at least 6.6 ms. How long, then, does it take to transmit 10 bits, 100 bits, 1 KB, or 1 MB? The first thing that comes to mind is that it can only be slower than 1 bit. But how much slower? Look at this formula: round-trip transmission time = (data volume ÷ link rate × 2) + (distance ÷ transmission speed × 2).
During transmission, data is first serialized and then placed onto the circuit or optical path by the encoding circuit. The encoding rate is the link bandwidth: the difference between a 100 Mbit/s link and a 1000 Mbit/s link is that the latter can encode ten times as much data per unit of time. But no matter how much bandwidth there is, once the data has been encoded, its travel time along the line is the same for links of every rate, because propagation time has nothing to do with the link encoding rate (bandwidth). After the data arrives, the receiver also has to decode it, which likewise depends on the link bandwidth; that is why the encoding time is multiplied by 2.
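
The round-trip formula above can be checked with a few lines of Python. This is only a sketch under the article's round figures: the function name `round_trip_time_s` and its parameters are made up for illustration, the signal speed is taken as 300,000 km/s, and the 100 Mbit/s link is treated as 100,000,000 raw bits per second with no encoding overhead.

```python
SIGNAL_SPEED_KM_PER_S = 300_000  # the article's round figure; fiber is closer to 200,000


def round_trip_time_s(data_bits, link_rate_bps, distance_km,
                      speed_km_per_s=SIGNAL_SPEED_KM_PER_S):
    """Round-trip time = (data / link rate) * 2 + (distance / speed) * 2."""
    codec_time = data_bits / link_rate_bps * 2            # encode at sender, decode at receiver
    propagation_time = distance_km / speed_km_per_s * 2   # out to the peer and back with the ACK
    return codec_time + propagation_time


# Compare payload sizes on an assumed 100 Mbit/s link over 1000 km.
for label, bits in [("1 bit", 1), ("1 KB", 8 * 1024), ("1 MB", 8 * 1024 * 1024)]:
    t = round_trip_time_s(bits, 100e6, 1000)
    print(f"{label}: {t * 1000:.3f} ms")
```

For tiny payloads the propagation term dominates completely; the encoding term only becomes visible once the payload reaches the megabyte range.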
Therefore, when the two points are very close together, for example 1 km apart, the propagation latency is ≈ 0.0066 ms, which can basically be ignored. The formula then becomes: transmission time = data volume ÷ link rate. So a higher link rate means faster encoding; it does not mean faster propagation, which is fixed at the speed of light.
Here is an analogy. A long-distance bus has 50 passengers who must queue to board. Boarding takes 120 s, the drive takes 60,000 s, and getting off in line takes another 120 s. The 50 people queuing to board are like data being serialized and placed onto the circuit; the drive is like the signal travelling from one end of the circuit to the other; and the 50 people queuing to get off are like decoding at the receiving end. But it is not over yet. When the bus arrives, the driver must report back to the starting point, just as TCP sends an ACK to the source after receiving data. The driver can drive back with an empty bus (send a standalone ACK packet), or pick up passengers heading back the other way (TCP can piggyback the ACK on reverse traffic to improve efficiency). In a disaster recovery system, however, data always flows from the source to the target, or from the target to the source during recovery. In short, the payload flows in only one direction, so the returning ACK is a standalone packet (it is small, so its encoding and decoding time is ignored).
In addition, a bus can carry only a limited number of passengers per trip. This is like the maximum amount of data TCP sends at a time, that is, the TCP sliding window size. TCP transfers user data in batches, each no larger than the sliding window, and after each batch the other side has to send an ACK (ACK merging and other special cases are not considered here). Each batch may be split further at lower layers: for example, the TCP MSS (maximum segment size) is generally derived from the MTU of the underlying link, and the underlying link fragments by its MTU. But once these lower-layer pieces reach the peer, the peer's lower-layer protocols do not need to acknowledge them. Only after the peer's TCP has received the whole batch sent by the sender's TCP does it return an ACK.
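
To make the batching concrete, the sketch below counts how many MSS-sized segments one window is cut into. It assumes a 16 KB window, a 1500-byte Ethernet MTU, and 40 bytes of IPv4 and TCP headers without options; none of these figures comes from the paragraph above, and `segments_per_window` is a hypothetical helper name. In the simplified model used here, all of those segments are answered by a single ACK.

```python
import math


def segments_per_window(window_bytes, mtu_bytes=1500, header_bytes=40):
    """Number of MSS-sized segments needed to carry one window of data."""
    mss = mtu_bytes - header_bytes  # assumed: MSS = MTU minus IPv4 and TCP headers
    return math.ceil(window_bytes / mss)


print(segments_per_window(16 * 1024))  # 12
```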
Now we can calculate how many round trips fit into one second between two points 1000 km apart: 1000 ms ÷ 6.6 ms ≈ 151. With a typical TCP sliding window of 16 KB (each send transmits 16 KB and then waits for the acknowledgment, ignoring special cases such as delayed or merged ACKs), the throughput is only 151 × 16 KB = 2416 KB per second, that is, about 2.4 MB per second. Does that sound exaggerated?
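
The same arithmetic as a small Python sketch. The function name is invented for illustration, and the model is the article's simplified one: one window per round trip, with encoding time ignored.

```python
def single_stream_throughput(window_kb, rtt_ms):
    """Return (round trips per second, KB per second) for a stop-and-wait stream."""
    rounds_per_second = int(1000 / rtt_ms)  # whole round trips that fit into one second
    return rounds_per_second, rounds_per_second * window_kb


rounds, kb_per_s = single_stream_throughput(window_kb=16, rtt_ms=6.6)
print(rounds, kb_per_s)  # 151 2416
```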
Of course, the calculation above ignores the time spent on encoding and decoding and the processing delay added by the relay, forwarding, and protocol conversion devices along the link. Once these are counted, the throughput is even lower. A more accurate formula for the actual throughput is: V = TCP window size ÷ [2 × (TCP window size ÷ link bandwidth + distance ÷ speed of light + link device processing latency)]. In short, the greater the distance, the lower the actual throughput, and this must be taken into account in practical applications.
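
A minimal sketch of this more accurate formula follows. The function and parameter names are assumptions, the speed of light is taken as 300,000 km/s as elsewhere in the article, and the two link rates are expressed as 10 MB/s and 100 MB/s of payload.

```python
def throughput_bytes_per_s(window_bytes, link_bytes_per_s, distance_km,
                           device_latency_s=0.0, speed_km_per_s=300_000):
    """V = W / (2 * (W / B + d / c + t_dev))."""
    one_way_s = (window_bytes / link_bytes_per_s   # encode or decode one window
                 + distance_km / speed_km_per_s    # propagation in one direction
                 + device_latency_s)               # relays, forwarding, protocol conversion
    return window_bytes / (2 * one_way_s)


window = 16 * 1024
for name, link in [("100 Mbit/s", 10e6), ("1000 Mbit/s", 100e6)]:
    v = throughput_bytes_per_s(window, link, distance_km=1000)
    print(f"{name}: {v / 1e6:.2f} MB/s")
```

Under these assumptions the two links come out at roughly 1.6 MB/s and 2.3 MB/s: ten times the bandwidth buys less than half again as much throughput over 1000 km.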
When the distance is short, the latency it causes can be ignored, and obviously whoever has more bandwidth transmits faster. When the distance is long, however, extra bandwidth no longer helps much, because most of the time is eaten up by the distance. In addition, even with the same underlying link bandwidth and the same distance, different transport protocols produce different latencies. But imagine that, no matter how far the link spans, there is always data in flight on it. Then the sender and the receiver can send and receive at the native rate of the link, and the only cost is latency. It is just like satellite TV: the transmission rate is not discounted at all. If this could be achieved, it would be ideal for a disaster recovery system, which would at worst lose only a few milliseconds of data.
However, this is not the case. What long-distance transmission fears is stalls in the data stream. A stall or two does no harm, but if the stream stalls frequently, the link bandwidth cannot be used at all. This is like disk seeks. The head can read and write data on the platter at very high speed, but it has no choice but to change tracks, and those track changes drag down the speed seen from outside. Coincidentally, the average seek time of a 15K RPM SAS disk is about 5.5 ms, and the transmission latency over a thousand kilometers is 6.6 ms. The two values are interestingly close.
A transport protocol cannot avoid these stalls, because it always has to pause and wait for the other side to call back and confirm receipt. TCP, for example, leaves time slots on the underlying link idle for nothing. On top of that, the high propagation latency of a long-distance link wastes a great deal of time. Therefore, even on a Gigabit link, a single stream over a distance of 1000 km can in theory carry only a few megabytes per second, as calculated above, and the actual value will be lower.
In addition, running protocols such as iSCSI over long distances is wasteful. We all know that the SCSI layer has its own transport assurance mechanism with its own set of ACKs, and the TCP layer underneath does the same thing all over again. In principle, given the SCSI layer's mechanism, the protocol stack beneath it should be a stateless, link-layer-like protocol that simply passes the data straight through. In reality, it sends for a while, stops, waits for the other side to say "OK", and only then continues. Not only that, SCSI on top also stops and waits in the same way, which adds further overhead. Therefore, running SCSI-based protocols such as FCP and iSCSI over long-distance FC or TCP/IP links will be a nightmare.
Reducing unnecessary ACKs and enlarging the sliding window are both WAN acceleration techniques, and they raise the transmission rate to a certain extent. The ultimate solution, however, is to shorten the distance between the two sites as much as possible or to develop dedicated, optimized protocols.
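
How far would the window have to grow? A common rule of thumb, not spelled out in the article, is the bandwidth-delay product: the window must hold as much data as the link can carry during one round trip. The sketch below uses hypothetical helper names, a 6.6 ms round trip, and an assumed payload rate of 100 MB/s for a Gigabit link.

```python
def window_limited_throughput(window_bytes, rtt_s, link_bytes_per_s):
    """Throughput of one stream: one window per round trip, capped by the link."""
    return min(window_bytes / rtt_s, link_bytes_per_s)


def window_to_fill_link(link_bytes_per_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to keep the link busy."""
    return link_bytes_per_s * rtt_s


rtt = 0.0066   # round trip over 1000 km at 300,000 km/s
link = 100e6   # assumed payload rate of a Gigabit link, in bytes per second

print(f"16 KB window: {window_limited_throughput(16 * 1024, rtt, link) / 1e6:.2f} MB/s")
print(f"window needed to fill the link: {window_to_fill_link(link, rtt) / 1e3:.0f} KB")
```

With these figures the window would have to grow from 16 KB to roughly 660 KB before the link, rather than the round trip, becomes the limit.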
Since we have mentioned private protocols, let's say a bit more about them. Everything above assumes that only one TCP connection, a single stream, is established between the two points, in which case the link bandwidth cannot be fully utilized. As also mentioned, if the underlying link is never idle, its bandwidth is used far more effectively. So what should we do? Obviously, by increasing the number of concurrent connections, we can make full use of the time slots on the underlying link. The idea is similar to how a disk array controller makes full use of the bandwidth of its back-end FC-AL loop. For details, refer to question 5 in Appendix 1.
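
As a rough illustration of why concurrency helps, the sketch below treats every stream as an independent stop-and-wait sender and caps the total at the link capacity. This is an idealized model with invented function names; it ignores congestion control and any contention between streams. The 2416 KB/s per-stream figure is the single-stream result computed earlier, and the 100 MB/s link capacity is an assumption.

```python
import math


def aggregate_throughput(per_stream_kb_s, streams, link_kb_s):
    """Total throughput of several independent streams, capped by the link."""
    return min(per_stream_kb_s * streams, link_kb_s)


def streams_to_saturate(per_stream_kb_s, link_kb_s):
    """Smallest number of streams whose combined rate reaches the link capacity."""
    return math.ceil(link_kb_s / per_stream_kb_s)


per_stream = 2416   # KB/s for one 16 KB-window stream over 1000 km
link = 100 * 1024   # assumed link capacity: 100 MB/s expressed in KB/s

for n in (1, 8, 32, 64):
    print(n, aggregate_throughput(per_stream, n, link))
print("streams needed:", streams_to_saturate(per_stream, link))
```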
We all know that iSCSI has the concept of multiple connections per session. With Microsoft's software iSCSI initiator this can be configured, so that the initiator establishes several concurrent TCP connections to the iSCSI target to improve the efficiency of long-distance transmission. Of course, the feature also has to be supported by the iSCSI target. FCP, by contrast, has no such design for concurrent connections. A private protocol designed around concurrent connections can greatly improve the efficiency of long-distance data transmission.
Now that multi-stream concurrency has come up, let's take the discussion a little further. An asynchronous disaster recovery replication system must at least guarantee data consistency at the disaster recovery site. Data consistency has several layers, and the lowest is so-called "time sequence consistency": the disaster recovery site should at least ensure that each I/O is flushed into its data set in the order in which it was executed at the source. With a single TCP/IP stream the order is guaranteed, but transmission efficiency is very low. With multiple concurrent streams, however, the streams are not related to one another, so an I/O executed earlier at the source may arrive at the peer later. More complex logic is then needed to ensure that the replicated data is applied in order. There are two ways to approach this. The first preserves RPO: keep strong consistency across the streams by tying them together so that the sending and receiving order is enforced, in which case the disaster recovery site can flush each received I/O into the underlying data set immediately. The second sacrifices RPO: use end-to-end consistency group technology between the primary and secondary sites and guarantee ordering between batches of data rather than between individual I/Os. In that case the disaster recovery site cannot flush data as soon as it arrives; it can do so only after a whole batch has been received. This may lose a whole batch of data rather than a few I/Os, but it makes data consistency easy to guarantee.
This article is excerpted from "Big Talk Storage 2", which has not yet been published. Corrections and suggestions are welcome. Thank you!
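
The two ordering approaches from the multi-stream discussion can be sketched on the receiving side as follows. This is a toy model, not the design of any particular replication product: the class names, the per-I/O sequence numbers, and the per-batch sizes are all assumptions introduced for illustration.

```python
class OrderedApplier:
    """First approach: every I/O carries a global sequence number and is
    applied strictly in that order, whichever stream delivered it."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}   # seq -> I/O that arrived ahead of its turn
        self.applied = []   # stands in for the disaster recovery data set

    def receive(self, seq, io):
        self.pending[seq] = io
        # Flush every I/O whose predecessors have all been applied.
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


class BatchApplier:
    """Second approach: ordering is kept between batches only. A batch is
    applied once it is complete and all earlier batches have been applied."""

    def __init__(self):
        self.next_batch = 0
        self.batches = {}   # batch id -> (expected size, I/Os received so far)
        self.applied = []

    def receive(self, batch_id, batch_size, io):
        _, ios = self.batches.setdefault(batch_id, (batch_size, []))
        ios.append(io)
        while self.next_batch in self.batches:
            expected, ready = self.batches[self.next_batch]
            if len(ready) < expected:
                break       # current batch incomplete: nothing may be applied yet
            self.applied.extend(ready)
            del self.batches[self.next_batch]
            self.next_batch += 1


ordered = OrderedApplier()
ordered.receive(1, "write B")   # arrived early on a faster stream, held back
ordered.receive(0, "write A")
print(ordered.applied)          # ['write A', 'write B']
```

The first class can flush as soon as the gap in front of an I/O closes, while the second holds everything back until a whole batch is present, which is the RPO trade-off described above.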