Website data storage needs to be planned carefully at an early stage. Otherwise, both management and performance problems appear once the data volume grows. This matters most for websites that store large numbers of files and images. Comparing the text content of a webpage with its image content, the images clearly occupy far more storage space than the text. From another point of view, the bandwidth consumed by serving images also far exceeds the bandwidth consumed by text.
For Internet users, the bottleneck in system operation is not the I/O of internal computing but network bandwidth. The image storage servers and the network entry point for accessing them should therefore be independent of the rest of the site, with higher network bandwidth and an independent domain name provided as conditions permit. This helps scalability and overall performance: computing and storage can scale, and bandwidth can be allocated, without affecting each other.
In addition, the storage system should meet these requirements:
1. Use cheap (old) machines to build a distributed network storage environment with storage capacity at the terabyte scale and above,
2. Synchronize data online and support file replication, with no obvious single point of failure and quick recovery of faulty nodes,
3. Work as a general-purpose file system that can be used without modifying upper-layer applications (FUSE is supported), so that web servers can read the distributed storage directly without intermediate conversion,
4. Allow storage space to be expanded without downtime,
5. Efficient random read/write and support for efficient read/write of massive small files (5 kb,
6. Allow storage usage to be monitored at runtime, preferably through a web interface.
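Requirement 3 means that application code keeps using ordinary file APIs. The following is a minimal sketch of that idea, assuming the distributed file system is already mounted at a local path. The helper names `save_image` and `load_image` and the mount point in the usage note are hypothetical and are not part of any product's API.

```python
from pathlib import Path


def save_image(mount_root: Path, relative_name: str, data: bytes) -> Path:
    """Write image bytes under a mounted root using plain file I/O."""
    target = mount_root / relative_name
    # Create intermediate directories exactly as on a local file system.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target


def load_image(mount_root: Path, relative_name: str) -> bytes:
    """Read image bytes back; a web server could read the same path directly."""
    return (mount_root / relative_name).read_bytes()
```

For example, `save_image(Path("/mnt/mfs"), "avatars/42.png", data)` would write through the mount, where `/mnt/mfs` is only an assumed mount point.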
Some solutions are available, but none of them is fully satisfactory. We have higher expectations:
1. Avoid repeated writes of the same image file. Build an index for image storage and look up the unique image file name before deciding whether to write the data,
2. Store small images, signature photos, small portraits, and emoticon images in a cache for reading, bringing the data closer to the CPU. The golden rule of the design is that any data that fits in the Redis cache should be read from Redis.
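The two expectations above can be sketched together as a toy in-memory model. This is only an illustration under stated assumptions: the dictionaries stand in for the image index, the cache, and the bulk storage, and the class name `ImageStore`, the `small_limit` threshold, and the `/images/` URL prefix are hypothetical.

```python
from typing import Dict, Optional


class ImageStore:
    """Toy image store in which an index lookup guards every write."""

    def __init__(self, small_limit: int) -> None:
        self.index: Dict[str, str] = {}    # stand-in for the image index (name -> URL path)
        self.cache: Dict[str, bytes] = {}  # stand-in for the cache of small images
        self.disk: Dict[str, bytes] = {}   # stand-in for the bulk storage unit
        self.small_limit = small_limit     # hypothetical size threshold in bytes
        self.writes = 0                    # counts writes that were actually performed

    def put(self, name: str, data: bytes) -> str:
        # Query the index first and skip the write if the name is already stored.
        existing = self.index.get(name)
        if existing is not None:
            return existing
        url_path = f"/images/{name}"
        # Small images go to the cache, everything else to bulk storage.
        if len(data) <= self.small_limit:
            self.cache[name] = data
        else:
            self.disk[name] = data
        self.index[name] = url_path
        self.writes += 1
        return url_path

    def get(self, name: str) -> Optional[bytes]:
        # Try the cache first, then fall back to bulk storage.
        if name in self.cache:
            return self.cache[name]
        return self.disk.get(name)
```

Calling `put` twice with the same name performs a single write, and the second call returns the URL path already recorded in the index.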
Architecture
1. Load balancer: HAProxy uses the round-robin load-balancing algorithm to distribute frontend user requests across the web image servers,
2. Web service: Nginx-0.9.6 acts as the image web server and reads the site's large, medium, and small images, while the Redis module of Nginx reads micro (avatar) images from the cache,
3. Cache server: stores the website's micro images, signature photos, small portraits, and emoticons. It is read directly through the Redis module of Nginx and written by a program that uses the Redis Java API,
4. Storage unit: uses MooseFS to store large, medium, and small images, and provides a monitoring and management interface for viewing the running status of the storage,
5. Image indexing: the image name and the image URL path are stored in HBase as key/value pairs, which supports lookups that avoid storing the same image twice and makes future management easier,
6. Application server: all image write operations are performed by the Java application server.
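The round-robin idea behind the load balancer in item 1 can be sketched in a few lines. This is an unweighted simplification for illustration, not HAProxy's actual implementation, and the server names are hypothetical.

```python
from itertools import cycle
from typing import Iterator, List


def round_robin(servers: List[str]) -> Iterator[str]:
    """Return an endless iterator that visits the servers in rotating order."""
    if not servers:
        raise ValueError("at least one backend server is required")
    return cycle(servers)


# Hypothetical backend names; each incoming request takes the next server.
backends = round_robin(["web1", "web2", "web3"])
assigned = [next(backends) for _ in range(5)]
```

Each request is handed to the next server in the list, and the rotation wraps around to the first server after the last one.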
[Figure: our current image storage system architecture]
I will not cover HBase here. For more information, refer to the articles I wrote earlier.
The MooseFS system:
1. Master (management) server: manages the data storage servers, schedules file reads and writes, reclaims and recovers file space, and manages multi-node copies.
2. Metalogger (metadata log) server: backs up the change log files of the master server. The files are named changelog_ml.*.mfs, and this server can be used to take over when the master server has a problem.
3. Chunk (data storage) server: connects to the master server, follows its scheduling, provides storage space, and transfers data to and from clients.
4. Client: mounts the data storage managed by the remote master server through the FUSE kernel interface. The shared file system then appears to behave the same as a local Unix file system.
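The master's job of tracking multi-node copies can be illustrated with a toy model. This is a sketch under simplifying assumptions, not MooseFS's real placement or recovery logic: the class `ToyMaster`, its rotating placement rule, and the server names in the usage note are all hypothetical.

```python
from typing import Dict, List, Set


class ToyMaster:
    """Toy master that records which chunk servers hold a copy of each file."""

    def __init__(self, chunk_servers: List[str], copies: int) -> None:
        self.servers: List[str] = list(chunk_servers)
        self.copies = copies                      # desired number of copies per file
        self.locations: Dict[str, Set[str]] = {}  # file name -> servers holding it
        self._next = 0                            # rotating start position

    def place(self, filename: str) -> Set[str]:
        n = len(self.servers)
        if self.copies > n:
            raise ValueError("not enough chunk servers for the requested copies")
        # Pick `copies` distinct servers, rotating the starting point per file.
        chosen = {self.servers[(self._next + i) % n] for i in range(self.copies)}
        self._next = (self._next + 1) % n
        self.locations[filename] = chosen
        return chosen

    def fail(self, server: str) -> List[str]:
        # Drop a failed server and report the files that lost a copy.
        self.servers.remove(server)
        degraded = []
        for name, holders in self.locations.items():
            if server in holders:
                holders.discard(server)
                degraded.append(name)
        return sorted(degraded)
```

With three servers and two copies per file, `ToyMaster(["cs1", "cs2", "cs3"], copies=2)` places each file on two servers, and `fail("cs1")` lists the files that now need another copy.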
The most representative success story
Douban.com (Douban Inc.), 155 TB of storage space:
Master server: Gentoo Linux/ReiserFS 3.6
24 chunk servers: Gentoo Linux/ReiserFS/XFS
3. metalogger (s): Gentoo Linux/ReiserFS 3.6
37 client machines: Gentoo Linux
In fact, choosing a particular technology is only one specific means of implementation. External benchmarks may show that other products such as MogileFS, Ceph, and FastDFS have higher read/write performance, and they may even meet every one of the requirements above, but our priorities are different: the solution needs to be simple, easy to use, and sufficient. So here we use MooseFS as a reference.
Original article: "Website image server (NGINX) under the shanzhai technology". Thanks to the original author for sharing it.