Data Deduplication and Encryption in Cloud Storage Systems

Source: Internet
Author: User
Keywords: cloud storage, data security, data encryption
1. Background


Large-scale cloud storage systems often face two conflicting requirements: on the one hand, the system needs to reduce the amount of data it stores in order to save space; on the other hand, users want their data stored encrypted for security and privacy. Currently, a very effective and widely used data-reduction technique is deduplication: the system identifies redundant data blocks, stores only one copy of each, and replaces the remaining copies with pointer-like references to it.
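As a rough illustration of the idea, the following is a minimal sketch of block-level deduplication. It assumes fixed-size blocks and uses a SHA-256 digest as the "pointer"; the class name `DedupStore`, the tiny `BLOCK_SIZE`, and the in-memory dictionary are hypothetical choices made for this example, not details from the article.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, unrealistically small so the example is easy to follow


class DedupStore:
    """Toy block store: each unique block is kept once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes (the single stored copy)

    def put(self, data: bytes) -> list:
        """Split data into fixed-size blocks and return the list of digests (the pointers)."""
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep only the first copy of a block
            pointers.append(digest)
        return pointers

    def get(self, pointers: list) -> bytes:
        """Rebuild the original data by following the pointers."""
        return b"".join(self.blocks[p] for p in pointers)


store = DedupStore()
data = b"AAAABBBBAAAAAAAA"  # four blocks, but only two distinct ones
recipe = store.put(data)
print(len(recipe), "pointers,", len(store.blocks), "blocks actually stored")
```

The store can only recognize a duplicate because equal blocks produce equal digests, which is exactly the property that encryption tends to destroy.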

But deduplication directly conflicts with the goal of data encryption. The reason lies in the nature and goals of encryption itself. People have used encryption to protect data for thousands of years; the earliest known ciphers can be traced back to ancient Egypt around 1900 BC. In modern times, and especially during the two world wars, the needs of the warring parties greatly advanced cryptography. In the computer age, cryptography has become even more powerful. The rise of the Internet in the 1990s, and in particular the rapid growth of e-commerce, not only placed higher demands on cryptographic technology (further driving its development) but also brought it into every aspect of daily life. Today, from online banking to online shopping, from email to social networks and even games, cryptography is everywhere.


However, the understanding of cryptographic security did not make substantial progress until the 1980s. Before then, there was no rigorous definition of what it means for a cipher to be secure, and therefore no way to measure it. When can an encryption algorithm be considered secure? This seemingly simple question requires careful thought about the nature of encryption. An encryption algorithm does not exist in isolation; it is used in different environments and under different conditions. Cryptographers want to characterize precisely, and prove rigorously, the properties of encryption algorithms so as to avoid security vulnerabilities. This is very difficult. It depends not only on the understanding of cryptography itself but also on progress in related disciplines such as information theory.



1.1 Information-theoretic Security

In 1949, Claude Shannon, the founder of information theory, studied from an information-theoretic perspective how much information about the plaintext (the original text) a ciphertext leaks, and proposed the concept of information-theoretic security.

Simply put, a ciphertext produced by an information-theoretically secure encryption algorithm reveals no information about the plaintext to anyone who does not hold the corresponding secret key.

As a consequence, the ciphertext cannot be broken even by an adversary with unlimited computing power and time. This is obviously a very strong notion, but it is hard to use in practice because it requires a random key at least as long as the plaintext, and each key can be used only once. The standard example of an algorithm with this property is the one-time pad (OTP).
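To make the key-length requirement concrete, here is a minimal one-time pad sketch. It assumes the pad is XORed byte by byte with the message; the function names `otp_encrypt` and `otp_decrypt` and the sample messages are hypothetical and chosen only for this illustration.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def otp_encrypt(plaintext: bytes):
    """Return (key, ciphertext); the key is fresh, random, and as long as the plaintext."""
    key = secrets.token_bytes(len(plaintext))
    return key, xor_bytes(plaintext, key)


def otp_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """XOR is its own inverse, so decryption repeats the same operation."""
    return xor_bytes(ciphertext, key)


message = b"attack at dawn"
key, ciphertext = otp_encrypt(message)
recovered = otp_decrypt(key, ciphertext)

# For any other message of the same length there is a key that "explains"
# the same ciphertext, which is why the ciphertext alone reveals nothing.
other = b"retreat at six"
other_key = xor_bytes(ciphertext, other)
explained = otp_decrypt(other_key, ciphertext)
```

Because every plaintext of the same length is consistent with the ciphertext under some key, extra computing power does not help an adversary. The price is visible in `otp_encrypt`: the key is as long as the data and must never be reused.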

In reality, OTP is rarely practical, because it is difficult to securely distribute secret keys as long as the plaintext. So cryptographers fell back on computational security: we assume that the adversary's computing power is limited, and an encryption algorithm is considered secure as long as the adversary cannot break it within a feasible amount of time. Under this model, we need a new way to measure the information leaked by the ciphertext.



1.2 Semantic Security

In 1982, Shafi Goldwasser and Silvio Micali [39], who were then graduate students at UC Berkeley, proposed the concept of semantic security.

The formulation of this concept can be based on comparing the probabilities of two events:

1. Given a ciphertext and the length of the plaintext, a polynomial-time adversary correctly computes some partial information about the plaintext (for example, whether the plaintext, read as a number, is odd or even);

2. Given only the length of the plaintext (and no ciphertext), a polynomial-time algorithm correctly computes the same partial information about the plaintext.

If the probabilities of events 1 and 2 are close enough, the encryption algorithm is considered semantically secure. Let's try to understand this definition. In case 1, the adversary gets the ciphertext and the length of the plaintext. In case 2, the adversary gets only the length of the plaintext and no ciphertext at all. If the adversary's probability of learning something about the plaintext is nearly the same in both cases, then the ciphertext does not help in obtaining any information about the plaintext, because having it makes little difference compared with not having it. This is clearly a secure state.



Later, Goldwasser and Micali gave an equivalent definition based on ciphertext indistinguishability (IND) [40]. IND means that, given a ciphertext produced by encrypting one of two plaintexts m0 and m1 chosen at random, the adversary cannot tell which one was encrypted with probability noticeably better than guessing (1/2). This second definition is the more widely adopted because it is easier to work with.
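The indistinguishability game can be sketched in code. The sketch below assumes the chosen-plaintext flavor of the game, in which the adversary may also request encryptions of messages it picks (as anyone holding a public key could); the two toy ciphers, the function names, and the messages are hypothetical and are not secure constructions. It shows that an adversary always wins against a deterministic scheme, which is the kind of scheme deduplication would like to have.

```python
import hashlib
import secrets


def deterministic_encrypt(key: bytes, message: bytes) -> bytes:
    """Toy cipher (messages up to 32 bytes): same key and message give the same ciphertext."""
    keystream = hashlib.sha256(key).digest()
    return bytes(m ^ s for m, s in zip(message, keystream))


def randomized_encrypt(key: bytes, message: bytes) -> bytes:
    """Toy cipher with a fresh random nonce, so repeated encryptions differ."""
    nonce = secrets.token_bytes(16)
    keystream = hashlib.sha256(key + nonce).digest()
    return nonce + bytes(m ^ s for m, s in zip(message, keystream))


def ind_game(encrypt, adversary) -> bool:
    """Play one round and return True if the adversary guesses the hidden bit."""
    key = secrets.token_bytes(16)

    def oracle(message: bytes) -> bytes:
        return encrypt(key, message)

    m0, m1 = b"yes", b"no!"            # two equal-length candidate plaintexts
    b = secrets.randbelow(2)           # hidden bit chosen by the challenger
    challenge = oracle((m0, m1)[b])    # encryption of m_b
    return adversary(oracle, m0, m1, challenge) == b


def replay_adversary(oracle, m0, m1, challenge) -> int:
    """Re-encrypt m0 and compare it with the challenge ciphertext."""
    return 0 if oracle(m0) == challenge else 1


ROUNDS = 200
det_wins = sum(ind_game(deterministic_encrypt, replay_adversary) for _ in range(ROUNDS))
rand_wins = sum(ind_game(randomized_encrypt, replay_adversary) for _ in range(ROUNDS))
print("deterministic:", det_wins, "of", ROUNDS)
print("randomized:   ", rand_wins, "of", ROUNDS)
```

Against the deterministic cipher the replay adversary wins every round, while against the randomized one it does no better than a coin flip. This is the tension with deduplication in miniature: the equality of ciphertexts that lets a storage system spot duplicates is the same signal the adversary exploits.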



This was milestone work: it gave a clear and rigorous mathematical definition of security, which gave the design and analysis of ciphers a clear direction, and its significance can hardly be overstated. As ACM said of the two, their work "helped make cryptography a precise science." Partly because of this pioneering work, the two authors received the 2012 ACM Turing Award (announced in 2013), 30 years later.



2. Deduplication and encryption in cloud storage systems



Coming back to cloud storage, the main challenges of using encryption that supports deduplication in a cloud storage system are as follows:



1. The ciphertext needs to preserve the redundancy of the plaintext; that is, identical plaintext data blocks must still be recognizable as identical after encryption, so that deduplication can work. ("Identical" here does not necessarily mean that the ciphertexts themselves are identical; it is enough for the system to have some means of identifying ciphertexts that contain the same content.)

2. Cross-user decryption (CUD): a data block encrypted and uploaded by one user should be decryptable by every user with read permission, even if that user is not the original uploader and encryptor.
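The two requirements above can be illustrated with convergent encryption, in which the key is derived from the data block itself. The text does not name a scheme at this point, so this is only one well-known way to meet both requirements, shown as a sketch. The keystream built from SHA-256 in counter mode is a toy stand-in for a real cipher, and all function names are hypothetical.

```python
import hashlib


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Illustration only, not a vetted cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(block: bytes):
    """Derive the key from the block itself, so equal blocks give equal ciphertexts."""
    key = hashlib.sha256(block).digest()
    ciphertext = bytes(b ^ s for b, s in zip(block, _keystream(key, len(block))))
    return key, ciphertext


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Anyone holding the block's key can decrypt, whoever uploaded the ciphertext."""
    return bytes(c ^ s for c, s in zip(ciphertext, _keystream(key, len(ciphertext))))


# Two users encrypt the same block independently, without sharing any secret.
block = b"shared report, page 1"
alice_key, alice_ct = convergent_encrypt(block)
bob_key, bob_ct = convergent_encrypt(block)

# Requirement 1: the ciphertexts match, so the store can deduplicate them.
same_ciphertext = alice_ct == bob_ct

# Requirement 2: Bob can decrypt the copy that Alice uploaded.
bob_reads = convergent_decrypt(bob_key, alice_ct)
```

The design choice has a cost that follows from the definitions in section 1: a scheme like this is deterministic, so it cannot be semantically secure, and anyone who can guess a block's content can confirm the guess by encrypting it and comparing.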