Most of the articles seem a bit dated; I don't know what the architectures of Facebook, Tumblr, Pinterest and Twitter look like now.
1. Clustering vs. sharding? Automatic vs. manual (sharding means removing joins and adding a cache; NoSQL doesn't seem to be as mature as MySQL?). But then again, HBase/Cassandra seem to be able to do it.
2. Technology serves the business and architecture serves the application, so innovation lies in discovering genuinely valuable problems (demand).
3. Use an application-specific database? Materialized "data items", lock-free transactions, append-only storage; for large-scale designs: general FS, Ceph/... (distributed object database)
4. LB: shorten the path between the user and the "content".
5. How to protect data? How to use it?
6. User table (the table storing information for users) is not sharded.
7. Shard with generous capacity planning (i.e. "hash big") <-- add a timestamp to the hash key?
8. Mapping (Shard/Storage) & Reverse-mapping (query)
9. Cache: memcached/Redis (Redis supports somewhat richer data structures); not sure whether memcached's feature set is complete by now?
10. Scripting: sharding filter scheme, migrating data (not so good)
11. Pyres: Python over Redis? (Resque--)
12. Dev: everyone has access to everything, so be careful (unified global view). Git may not be appropriate for small teams, and Git vs. SVN sometimes comes down only to performance when working with a large repo.
13. SOA: a DB proxy is actually a service too!
14. Keep it simple & fun
15. The architecture is doing the right thing if growth can be handled by adding more of the same stuff (horizontal scaling).
16. Don't be afraid (?) of losing part of the data, depending on the nature of the data (CAP/BASE).
17. Master-slave lag (the drawback of master-slave replication). Of course, master-master replication introduces distributed consistency problems; the first step should be to shard writes (how do you really achieve a join-free design? By adding redundancy?).
18. Keep load at <= 50% (live capacity must stay under control; i.e. "leave headroom").
19. Use tools, not frameworks (the former are small and composable; the latter are really an intrusive design, such as the disgusting Spring).
20. To avoid (distributed) joins: denormalize? Design to be extensible ("stretchable") from the start.
21. Turn the website into a service (API): one of the practices behind Twitter's early success.
22. Prevent abuse
23. Cache vs. log: note the similarity between the two. A cache holds only the recent hot data, and logs can be deleted once analyzed, so neither runs out of storage space.
24. Facebook 2011: batching I/O, avoid HBase hot keys?
Java <--> Thrift <--> PHP
Sharding plan: manual sharding? This is supposed to be before the data center.
10,000 writes per second per server
25. Dropbox 2011: Python for both backend and client (TortoiseHg/Mercurial are written in Python and actually perform well; isn't Sublime written in Python too?); but it can't be used on Android (-_-)
Memory Fragmentation Issues
PS: rsync performs poorly when syncing a large number of deeply nested small files; better to compress them and download everything in one go.
26. Anti-spam (Mollom 2011)
To protect the ML algorithm, users cannot submit wrongly rejected data
Input from free users helps improve ML training
Disks: SSDs for Cassandra, RAID 10 (stripe & mirror), for heavy writes & row caching; aging mechanism (just for privacy, the "right to oblivion" in European legislation)
You can design your own client-side data store with HTML5 local storage to hold all of your users' data ~
Client-side LB: first request the list of available servers, then try them in order, moving on to the next one when one fails (requests must not be random here!)
What is the reputation mechanism for IP addresses?
AWS Virtual Server: IO is the bottleneck, scale up
Core dump analysis of a 16 GB leak? Crazy.
27. Redis: LPUSH/LTRIM; LREM; ZADD/ZREVRANGE/ZRANK/ZRANGE; SADD; pub/sub; the usual GET/SET
28. Reliability: MTTR metric <-- MTBF
Capacity Planning & Expect Failure (k), Alone to defeat! )
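PS (items 18 and 28): a rough sketch of the arithmetic behind these two notes, not taken from any of the articles. Steady-state availability is commonly written as MTBF / (MTBF + MTTR), which is why shortening repair time matters as much as spacing failures out. The headroom rule from item 18 turns into a simple server count. The function names and sample numbers are made up for illustration; only the 10,000 writes per second per server figure comes from the notes above.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def servers_needed(peak_load: float, per_server_capacity: float,
                   max_utilization: float = 0.5) -> int:
    """Servers required so that each one stays at or below max_utilization."""
    if not 0 < max_utilization <= 1:
        raise ValueError("max_utilization must be in (0, 1]")
    return math.ceil(peak_load / (per_server_capacity * max_utilization))


# Hypothetical numbers: one failure per 1000 hours on average.
slow_repair = availability(1000.0, 10.0)  # 10 hours to recover
fast_repair = availability(1000.0, 1.0)   # 1 hour to recover
print(f"slow repair: {slow_repair:.4%}, fast repair: {fast_repair:.4%}")

# Hypothetical peak of 45,000 writes/sec at 10,000 writes/sec per server,
# keeping every server at <= 50% load.
print(servers_needed(45_000, 10_000))
```

Cutting MTTR by a factor of ten here buys almost a full extra "nine" without touching MTBF, which is one way to read "expect failure".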
Reading notes on the articles collected under "All Time Favorites" on the High Scalability website.
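PS (items 7 and 8): a minimal sketch of one way to read "hash big" plus mapping/reverse-mapping, under my own assumptions. Keys are hashed onto a fixed, generously sized set of logical shards; a mapping table says which physical host stores each logical shard, and a reverse table lists the shards per host. The shard count, the host names and all function names are hypothetical, and the timestamp-in-key idea from item 7 is not covered.

```python
import hashlib
from collections import defaultdict

NUM_LOGICAL_SHARDS = 4096               # "hash big": far more shards than servers (assumed)
HOSTS = ["db0", "db1", "db2", "db3"]    # hypothetical physical servers


def logical_shard(key: str) -> int:
    """Stable hash of a key onto a fixed logical shard number."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS


# Mapping: logical shard -> physical host (contiguous ranges, by assumption).
shard_to_host = {
    shard: HOSTS[shard * len(HOSTS) // NUM_LOGICAL_SHARDS]
    for shard in range(NUM_LOGICAL_SHARDS)
}

# Reverse mapping: physical host -> logical shards it stores.
host_to_shards = defaultdict(list)
for shard, host in shard_to_host.items():
    host_to_shards[host].append(shard)


def host_for(key: str) -> str:
    """Route a key to the host that currently stores its logical shard."""
    return shard_to_host[logical_shard(key)]


def move_shard(shard: int, new_host: str) -> None:
    """Rebalance by remapping one logical shard; keys are never rehashed."""
    old_host = shard_to_host[shard]
    host_to_shards[old_host].remove(shard)
    host_to_shards[new_host].append(shard)
    shard_to_host[shard] = new_host


print(host_for("user:42"))
```

Because the key-to-logical-shard hash never changes, adding hardware only means moving whole logical shards and updating the two tables.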