Youku Video Site Architecture
I. Overview of the site's basic statistics
According to 2010 statistics, Youku averaged 89 million unique visitors (UV) and 1.7 billion page views (PV) per day, numbers that put it at the top of Google's ranking of domestic video sites.
On the hardware side, Youku uses Dell servers, mainly the PowerEdge 1950 and PowerEdge 860, with Dell MD1000 storage arrays. Figures from 2007 show Youku already had more than 1,000 servers deployed across the country's major provinces and cities; the number should be higher by now.
II. Website front-end framework
From the beginning, Youku built its own CMS to handle front-page rendering. The separation between modules is clean, the front end is very extensible, and the UI is decoupled, which makes development and maintenance simple and flexible. The diagram below shows the Youku front-end module call relationship:
In this way, specifying a module, a method, and params is enough to invoke relatively independent modules, which keeps things concise. A diagram of part of Youku's front-end architecture is attached below:
III. Database architecture
Youku's database architecture has gone through many twists and turns: from a single MySQL server (just running), to simple MySQL master-slave replication, to SSD optimization, to vertical partitioning, and on to horizontal sharding. Only by living through such a sequence can one appreciate it deeply; like MySpace's architectural evolution, the architecture grew and matured step by step.
1. Simple MySQL master-slave replication
MySQL master-slave replication separates database reads from writes and improves read performance considerably. The original diagram is as follows:
The master-slave replication process works as follows:
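To make the read/write separation concrete, here is a minimal application-side routing sketch. It is an illustration only, not Youku's code; the connection objects are assumed to be thin wrappers around a MySQL driver that expose an execute() method.

```python
import random

class ReadWriteRouter:
    """Route writes to the master and reads to a randomly chosen replica."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas or [master]

    def execute(self, sql, params=()):
        # Rough classification: SELECTs go to a replica, everything else
        # (INSERT/UPDATE/DELETE/DDL) goes to the master.
        if sql.lstrip().upper().startswith("SELECT"):
            conn = random.choice(self.replicas)
        else:
            conn = self.master
        return conn.execute(sql, params)
```

The application only ever talks to the router, so adding or removing read replicas does not touch business code.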
However, master-slave replication also introduces several performance bottlenecks:
- Writes cannot be scaled out
- Writes cannot be cached
- Replication lag
- The table-lock rate rises
- As tables grow, the cache hit rate drops
These problems had to be solved, which led to the optimizations below.
2. MySQL vertical partitioning
If the business lines are sufficiently independent, putting different business data on different database servers is a good choice: if one business crashes it does not affect the others, and the load is split, greatly improving database throughput. The database architecture after vertical partitioning looks like this:
However, even when business lines are largely independent, some data is more or less shared between them; user data, for example, is referenced by nearly every business line. This partitioning scheme also does nothing about single-table data explosion, so why not try horizontal sharding?
3. MySQL horizontal sharding
This is a very good idea: group users by some rule (for example, a hash of the user ID) and store each group's data in a database shard. As the number of users grows, you simply add and configure another server. The schematic is as follows:
To determine which shard a user belongs to, you can maintain a user-to-shard mapping table; each request first looks up the user's shard ID in this table and then queries the relevant data from the corresponding shard, as shown below:
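A minimal sketch of this lookup is shown below. The DSN strings, the in-memory stand-in for the mapping table, and the assignment rule are all illustrative assumptions.

```python
# Shard DSNs; in practice these would point at real MySQL instances.
SHARD_DSNS = {
    0: "mysql://shard0.internal/app",
    1: "mysql://shard1.internal/app",
}

# user_id -> shard_id; stands in for the user-to-shard mapping table.
USER_SHARD_MAP = {1001: 0, 1002: 1}

def connection_for_user(user_id):
    shard_id = USER_SHARD_MAP.get(user_id)
    if shard_id is None:
        # New user: assign a shard (here by a simple modulo) and remember it.
        shard_id = user_id % len(SHARD_DSNS)
        USER_SHARD_MAP[user_id] = shard_id
    return SHARD_DSNS[shard_id]

# Every query for user 1001 is then sent to the DSN returned here.
print(connection_for_user(1001))
```

In a real deployment the mapping itself would be cached aggressively, since it sits on the critical path of every request.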
But how does Youku handle cross-shard queries? That is the hard part. According to their presentation, Youku avoids cross-shard queries wherever possible; when that fails they fall back on a multi-dimensional shard index or a distributed search engine, and the last resort is a distributed database query (which is cumbersome and performs poorly).
IV. Caching strategy
Seemingly every large system leans on "caching", from HTTP caches to memcached in-memory data caches, yet Youku says it does not use an in-memory data cache, for the following reasons:
- It avoids duplicating data in memory and avoids memory locks
- If the authorities order a video taken down, a cached copy makes removal more troublesome
- Squid's write() consumes user process space, and Lighttpd 1.5's AIO (asynchronous I/O) reads files into user memory, both of which are inefficient
So why does watching Youku feel so smooth, with video loading noticeably faster than on Tudou? The reason is that Youku has built a fairly complete content delivery network (CDN). It is distributed across the country and, through various means, makes sure users are served from nearby: when a user clicks a video, Youku returns, based on the user's location, the address of the video server that is closest to the user and in the best service condition, so the user gets a fast playback experience. This is the advantage of a CDN, serving every user from the nearest node; for more on CDNs, a quick web search will turn up plenty of material.
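The "closest, healthiest server" decision can be pictured with a small sketch like the one below. The node list, coordinates, health flags, and weighting are all hypothetical; a production CDN would use real geo/ISP databases and live health checks.

```python
import math

# Hypothetical edge nodes with rough coordinates, health, and load.
NODES = [
    {"name": "bj-edge-1", "lat": 39.9, "lon": 116.4, "healthy": True,  "load": 0.35},
    {"name": "sh-edge-1", "lat": 31.2, "lon": 121.5, "healthy": True,  "load": 0.80},
    {"name": "gz-edge-1", "lat": 23.1, "lon": 113.3, "healthy": False, "load": 0.10},
]

def pick_node(user_lat, user_lon):
    def score(node):
        # Rough geographic distance combined with current load; lower is better.
        dist = math.hypot(node["lat"] - user_lat, node["lon"] - user_lon)
        return dist + 10 * node["load"]
    healthy = [n for n in NODES if n["healthy"]]
    return min(healthy, key=score)["name"]

print(pick_node(39.1, 117.2))  # a viewer near Tianjin is sent to the Beijing edge node
```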
The complete slide deck is available as a PDF: http://www.blogkid.net/qconppt/youkuqiudanqconbeijing-090423080809-phpapp01.pdf
Source: http://www.kaiyuanba.cn/html/1/131/147/7541.htm
YouTube Site Architecture
YouTube is growing fast, with more than 100 million video views per day, yet only a handful of people maintain the site and keep it scalable. This is similar to PlentyOfFish, where a very small team maintains a huge system. How do they manage it? Rest assured, it is not down to luck or to working alone; let's look at YouTube's overall architecture.
Platform
1. Apache
2. Python
3. Linux (SuSE)
4. MySQL
5. Psyco, a dynamic Python-to-C compiler
6. Lighttpd, used instead of Apache for video serving
Status
1. Supports more than 100 million video views per day
2. Founded in February 2005
3. Reached 30 million video views per day in March 2006
4. Reached 100 million video views per day in July 2006
5. 2 system administrators, 2 scalability software architects
6. 2 software development engineers, 2 network engineers, 1 DBA
Web Server
1. NetScaler is used for load balancing and caching static content
2. Apache runs with mod_fastcgi
3. Requests are routed to a Python application server for handling
4. The application server talks to several databases and other information sources to fetch data and render HTML pages
5. The web tier can usually be scaled by adding more machines
6. The Python web-tier code is rarely the bottleneck; most of the time is spent blocked on RPC
7. Python allows rapid, flexible development and deployment
8. A page is typically served in under 100 milliseconds
9. Psyco (a dynamic Python-to-C compiler, similar to a JIT compiler) is used to optimize inner loops
10. CPU-intensive work such as encryption is pushed into C extensions
11. Pre-generated, cached HTML is used for some expensive page blocks
12. Row-level caching is used in front of the database
13. Fully formed Python objects are cached
14. Some values are computed once and sent to every application server, so they are cached in local process memory; this is an underused strategy
The fastest cache is inside the application server itself, and pushing pre-computed values to all servers takes little time: just have an agent watch for changes, pre-compute, and push.
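A minimal sketch of that idea, under stated assumptions: the publish/subscribe channel, key names, and recompute function below are hypothetical, and the point is only that request handlers read from a local dictionary instead of making an RPC.

```python
import time

LOCAL_CACHE = {}  # per-process, in-memory cache of pre-computed values

def agent(recompute, publish, interval=5.0):
    """Watch for changes, pre-compute, and push the result to all app servers."""
    while True:
        value = recompute()             # e.g. an expensive aggregate query
        publish("top_videos", value)    # fan out to every application server
        time.sleep(interval)

def on_message(key, value):
    """Each application server applies pushed updates to its local cache."""
    LOCAL_CACHE[key] = value

# Request handlers then read LOCAL_CACHE["top_videos"] with no RPC at all.
```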
Video Services
1. Costs include bandwidth, hardware, and power consumption
2. Each video is hosted by a mini-cluster, i.e. served by more than one machine
3. Using a cluster means:
- More disks holding the content means more speed
- Failover: if one machine fails, the others keep serving
- Online backups
4. Lighttpd is used as the web server for video:
- Apache had too much overhead
- epoll is used to wait on many file descriptors
- It was switched from a single-process to a multi-process configuration to handle more connections
5. The most popular content is moved to a CDN:
- The CDN replicates content in many places, so content is more likely to be close to the user
- CDN machines serve mostly from memory, because the content is so popular that little of it churns in and out of memory
6. Less popular content (1-20 views per day) is served by YouTube servers in various colo sites
- Long-tail effect: an individual video may have only a few plays, but many videos are being played, so random disk blocks get accessed
- Caching does not help much in this case, so spending money on more cache may not make sense
- Tune the RAID controllers and pay attention to other low-level issues
- Tune the memory on each machine: not too much and not too little
Key points of video serving
1. Keep it simple and cheap
2. Keep the network path simple; do not put too many devices between the content and the user
3. Use commodity hardware; with expensive hardware it is harder to find help and documentation
4. Use simple, common tools; most of the tools used come with or run on Linux
5. Handle random seeks well (SATA, tweaks)
Thumbnail service
1. Surprisingly hard to do efficiently
2. There are about 4 thumbnails per video, so there are far more thumbnails than videos
3. Thumbnails are hosted on just a few machines
4. Problems encountered when serving huge numbers of small objects:
- Lots of disk seeks, plus inode and page cache problems at the OS level
- Per-directory file limits, especially on ext3; they later moved to a multi-level directory structure (see the sketch after this list). Recent improvements in the 2.6 kernel may let ext3 handle large directories, but storing a huge number of files in a file system is still not a great idea
- A very high request rate, since a web page may display 60 thumbnails
- Apache performed badly under this load
- Squid was put in front of Apache; it worked for a while, but as load increased performance collapsed, from 300 requests per second down to 20
- Lighttpd was tried, but its single-threaded design caused trouble, and multi-process mode had problems because each process kept its own separate cache
- With so many images, bringing a new machine into service took over 24 hours
- After a restart it took 6-10 hours for the cache to warm up
5. To solve all these problems, YouTube started using Google's BigTable, a distributed data store:
- It avoids the small-file problem by clustering files together
- It is fast and fault-tolerant
- Latency is lower because it uses a distributed multi-level cache that works across multiple colocation sites
- See the Google Architecture, Google Talk Architecture, and BigTable write-ups for more information
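The multi-level directory layout mentioned in item 4 above can be sketched as follows: hash the thumbnail name into nested directories so that no single directory ever holds millions of files. The root path and the two-level, 256-way fan-out are illustrative assumptions.

```python
import hashlib
import os

ROOT = "/var/thumbs"  # hypothetical storage root

def thumb_path(video_id, index):
    name = f"{video_id}_{index}.jpg"
    digest = hashlib.md5(name.encode()).hexdigest()
    # Two levels of 256-way fan-out: /var/thumbs/ab/cd/<name>
    return os.path.join(ROOT, digest[:2], digest[2:4], name)

print(thumb_path("dQw4w9WgXcQ", 1))
```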
Database
1. Early days
- MySQL was used to store metadata such as users, tags, and descriptions
- Data was served from a monolithic RAID 10 volume of 10 disks
- They were living off credit cards, so they leased hardware
- YouTube went through the usual evolution: a single server, then a single master with multiple read slaves, then database partitioning, and finally sharding
- They suffered from replica lag: the master is multi-threaded and runs on a large machine so it can handle a lot of work, while the slaves apply changes in a single thread, usually run on smaller machines, and replicate asynchronously, so they can lag far behind the master
- Updates cause cache misses that go to disk, where slow I/O slows down replication
- With a replication architecture you have to spend a lot of money for incremental gains in write performance
- One of their solutions was to prioritize traffic by splitting the data into two clusters: a video-watch pool and a general cluster
2. Later
- Database partitioning
- The data is divided into shards, with different users assigned to different shards
- Reads and writes are spread out
- Better cache locality means less I/O
- A 30% reduction in hardware
- Replica lag reduced to zero
- The database can now be scaled almost arbitrarily
Data Center Strategy
1. They were living off credit cards, so initially they could only use managed hosting
2. Managed hosting could not scale with them: they could not control the hardware or negotiate favorable network agreements
3. YouTube moved to colocation instead; now they can customize everything and negotiate their own contracts
4. They use 5 or 6 data centers plus a CDN
5. Video can come from any data center; there is no nearest-match selection. If a video is popular enough it moves to the CDN
6. Video serving depends on bandwidth rather than on latency, so it can come from any colo
7. For images latency matters a lot, especially when a page displays 60 of them
8. Images are replicated to different data centers using BigTable, and code looks at various metrics to decide which copy is nearest
Lessons Learned
1. Stall for time: creative and risky tricks can get you through the short term while you work out longer-term solutions
2. Prioritize: figure out what is core to your service and prioritize your resources accordingly
3. Pick your battles: don't be afraid to hand parts of your core service to others. YouTube uses a CDN to distribute its most popular content; building their own network would have taken too long and cost too much
4. Keep it simple! Simplicity lets you re-architect quickly in response to problems
5. Shard: sharding helps isolate storage, CPU, memory, and I/O, and is not just about gaining write performance
6. Constantly iterate on bottlenecks:
- Software: DB, cache
- OS: disk I/O
- Hardware: memory, RAID
7. You succeed as a team: have a cross-discipline team that understands the whole system and what lies beneath it, people who can set up printers, machines, networks, and so on.
With a good team, all things are possible.
Source: http://www.kaiyuanba.cn/html/1/131/147/7540.htm
Twitter Site Architecture
I. Overview of Twitter's basic statistics
As of April 2011, Twitter had about 175 million registered users, growing by roughly 300,000 new registrations per day. Its truly active user base is far smaller than that, since most registered users have no followers and follow no one, which does not compare well with Facebook's 600 million active users.
Twitter gets about 180 million unique visitors per month, and 75% of its traffic comes from outside Twitter.com. The API handles 3 billion requests per day, an average of 55 million tweets are posted daily, 37% of active users use Twitter on their phones, and about 60% of tweets come from third-party applications.
Platform: Ruby on Rails, Erlang, MySQL, Mongrel, Munin, Nagios, Google Analytics, AWStats, Memcached
The overall architecture of Twitter is shown below:
II. Twitter's platform
The Twitter platform roughly comprises twitter.com, mobile phones, and third-party applications, as shown below:
The main sources of traffic are mobile phones and third-party applications.
Ruby on Rails: web application framework
Erlang: general-purpose, concurrency-oriented programming language; open-source project: http://www.erlang.org/
AWStats: real-time log analysis system; open-source project: http://awstats.sourceforge.net/
Memcached: distributed memory caching system
Starling: lightweight message queue written in Ruby
Varnish: high-performance open-source HTTP accelerator
Kestrel: message middleware written in Scala; open-source project: http://github.com/robey/kestrel
Comet Server: Comet is an Ajax long-connection technique that lets the server push data to the browser proactively, avoiding the overhead of client polling
Libmemcached: a memcached client
MySQL: database server
Mongrel: Ruby HTTP server tailored to Rails; open-source project: http://rubyforge.org/projects/mongrel/
Munin: server monitoring tool; project: http://munin-monitoring.org/
Nagios: network monitoring system; project: http://www.nagios.org/
III. Cache
Caching plays an important role in any large web project; after all, the closer data is to the CPU, the faster it can be accessed. Twitter's cache architecture diagram is shown below:
- Memcached is used heavily for caching
- For example, if computing a count is slow, the count can be stuffed into memcached and retrieved in about a millisecond
- Getting friends' statuses is complicated, with security and other concerns, so a friend's status is updated in the cache rather than recomputed with a query; it never touches the database
- ActiveRecord objects are large, so they are not cached directly; the critical attributes are stored in a hash and the rest are loaded lazily on access
- 90% of requests are API requests, so no page or fragment caching is done on the front end; pages are too time-sensitive for it to help, but API requests are cached
Twitter's memcached strategy has seen several improvements:
1. A write-through vector cache was created that holds arrays of tweet IDs, each ID a serialized 64-bit integer; its hit rate is 99% (see the sketch after this list)
2. A write-through row cache was added that holds database records (users and tweets); it has a 95% hit rate.
3. A read-only fragment cache was introduced that holds serialized versions of tweets as accessed by API clients, packaged as JSON, XML, or Atom; it also has a 95% hit rate.
4. A separate cache pool was created for the page cache; it uses a generational key scheme instead of direct invalidation.
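A minimal sketch of the write-through vector cache from item 1, under stated assumptions: the plain dict stands in for a memcached client, and the key names, pack format, and truncation length are illustrative.

```python
import struct

cache = {}  # stands in for a memcached client

def timeline_key(user_id):
    return f"timeline:{user_id}"

def cache_timeline(user_id, tweet_ids):
    # Serialize the vector of 64-bit tweet IDs and write it to the cache.
    cache[timeline_key(user_id)] = struct.pack(f"<{len(tweet_ids)}Q", *tweet_ids)

def append_tweet(user_id, tweet_id):
    # Write-through: the cached vector is updated at write time, not on read.
    packed = cache.get(timeline_key(user_id), b"")
    ids = list(struct.unpack(f"<{len(packed) // 8}Q", packed))
    ids.insert(0, tweet_id)
    cache_timeline(user_id, ids[:800])  # keep only the newest entries

def read_timeline(user_id):
    packed = cache.get(timeline_key(user_id), b"")
    return list(struct.unpack(f"<{len(packed) // 8}Q", packed))

cache_timeline(42, [111, 110])
append_tweet(42, 112)
print(read_timeline(42))  # [112, 111, 110]
```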
IV. Message queues
- Messaging is used heavily: producers put messages on queues, and the messages are then delivered to consumers. Twitter's main job is to act as a message bridge between different channels (SMS, web, IM, and so on)
- They first used DRb (distributed Ruby), a library that lets you send and receive messages to and from remote Ruby objects over TCP/IP, but it was somewhat fragile
- They moved to Rinda, a shared queue based on the tuplespace model, but its queues are not persistent, so messages are lost on failure
- They tried Erlang
- They then moved to Starling, a distributed queue written in Ruby
- The distributed queues survive system crashes by writing messages to disk; other large sites use this simple approach too
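The "survive crashes by writing to disk" idea can be sketched with a toy journal-backed queue like the one below. The file format, paths, and single-consumer offset file are illustrative assumptions; Starling and Kestrel do this far more robustly.

```python
import json
import os

class DiskQueue:
    """Toy persistent FIFO: messages are appended to a journal file."""

    def __init__(self, path):
        self.path = path
        self.offset_path = path + ".offset"

    def put(self, message):
        with open(self.path, "a") as f:
            f.write(json.dumps(message) + "\n")   # durable append

    def get(self):
        offset = 0
        if os.path.exists(self.offset_path):
            with open(self.offset_path) as f:
                offset = int(f.read() or 0)
        with open(self.path) as f:
            lines = f.readlines()
        if offset >= len(lines):
            return None                            # queue drained
        with open(self.offset_path, "w") as f:
            f.write(str(offset + 1))               # advance the consumer cursor
        return json.loads(lines[offset])

q = DiskQueue("/tmp/tweets.journal")
q.put({"channel": "sms", "text": "hello from a phone"})
print(q.get())
```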
V. Summary
1. The database must be properly indexed
2. To understand your system as quickly as possible, you need the flexibility to use a variety of tools
3. Cache, cache, cache: cache everything that can be cached and let your application fly
Partly from: http://timyang.net/architecture/twitter-cache-architecture/
A translated copy is available here: http://hideto.iteye.com/blog/130044
Justin.tv Website Architecture
Justin.tv gets 30 million unique visitors per month and has overtaken YouTube in game-video uploads, adding 30 hours of video per minute versus YouTube's 23.
Below are the platform used by Justin.tv's live video system, its architectural details, and the lessons worth taking from it.
Platform
- Twice: caching proxy system, used mainly to reduce application server load
- XFS: file system
- HAProxy: TCP/HTTP load balancing
- LVS stack and ldirectord: high availability
- Ruby on Rails: application server
- Nginx: web server
- PostgreSQL: database for user data and metadata
- MongoDB: database used for internal analytics
- MemcacheDB: database for frequently modified data
- Syslog-ng: logging service
- RabbitMQ: job system
- Puppet: used to build servers
- Git: source code control
- Wowza: Flash/H.264 video server, plus many custom modules written in Java
- Usher: a custom server that controls the logic of video-stream playback
- S3: used to store small images
Some Justin.tv statistics
- 4 data centers across the United States
- More than 2,000 simultaneous incoming streams at any given time
- 30 hours of video added every minute
- 30 million unique visitors per month
- Live network traffic averages about 45 gigabits per second
Live video architecture details
Live video architecture
1. Use of peering and CDNs
Most people assume you just need to keep adding bandwidth, pull incoming data into memory, and keep receiving streams. That is not true. Live video must never be interrupted, which means you cannot run your bandwidth at overload. YouTube can let the player buffer and thereby serve a 10G demand over an 8G channel, but with live video you cannot buffer: if the traffic on a channel exceeds its capacity even for a moment, every viewer watching at that moment stalls, and if you push even slightly past the limit, everyone immediately drops into a buffering state.
Justin.tv uses a peering structure to address this, and they also have a better escape valve: CDNs (content delivery networks). When user traffic exceeds Justin.tv's own capacity, Justin.tv quietly spills the excess over to a CDN. Usher controls this logic: as soon as an overload of viewer requests is detected, Usher forwards the new viewers to the CDN.
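A minimal sketch of that overflow logic, under stated assumptions: the per-server capacities, hostnames, and CDN URL below are illustrative, not Justin.tv's real values.

```python
EDGE_CAPACITY = {"edge-1": 1800, "edge-2": 1800}   # concurrent viewers per server
current_load = {"edge-1": 0, "edge-2": 0}
CDN_URL = "http://cdn.example.com/live"

def route_viewer(stream_id):
    # Pick the least-loaded local edge server that still has headroom.
    candidates = [s for s in EDGE_CAPACITY if current_load[s] < EDGE_CAPACITY[s]]
    if candidates:
        server = min(candidates, key=lambda s: current_load[s])
        current_load[server] += 1
        return f"rtmp://{server}.example.com/{stream_id}"
    # All local servers are full: spill this viewer over to the CDN.
    return f"{CDN_URL}/{stream_id}"

print(route_viewer("starcraft_finals"))
```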
2. The conflict between 100% uptime and maintenance
A live video system has to be available 100% of the time while still allowing machines to be maintained. Unlike an ordinary website, where only a few people notice a maintenance hiccup, on a live video site users notice any maintenance problem immediately and word spreads fast, so there is no hiding it from them; given how picky users are, maintenance problems simply have to be avoided. When a server needs maintenance you cannot just kill users' sessions; you have to wait until every viewer on that server has finished on their own before work can start, which makes the process very slow.
3. Usher and load balancing
Justin.tv's biggest headache is flash crowds: when a huge number of users want to watch the same program at the same time, sudden network congestion appears. To handle this they built Usher, a real-time scheduling system for servers and data centers.
The Justin.tv system does a lot to cope with these sudden peaks. Their network can absorb a very large number of incoming connections per second, and users themselves take part in load balancing, which is one reason Justin.tv requires viewers to use its own player. As for TCP, since typical stream rates are in the hundreds of kbps, there was no need to modify the protocol.
They seem to run with fewer video servers than their traffic would suggest, because Usher squeezes the most out of each server and load balancing ensures traffic never exceeds the limits. The working set is mostly in memory, so the system can drive the network to its limit. The servers were bought from Rackable (an SGI server line), chosen simply from the preset configurations.
Usher is custom software developed by Justin.tv to manage load balancing, user authentication, and the rest of the stream-playback logic. Usher allocates resources by calculating how many servers each stream needs in order to keep the whole system in an optimal state, which is what sets their system apart. Usher typically decides which servers should carry a stream from the following metrics (see the sketch further below):
- The load in each data center
- The load on each server
- Latency-optimization considerations
- The list of servers currently able to serve this stream
- The user's country (derived from the IP address)
- Whether the user has a peering network available (found by looking up the IP address in a routing database)
- Which data center the request came from
Usher uses these metrics either to optimize cost by placing streams on relatively idle servers, or to place streams on servers closer to the user for lower latency and better performance. It offers many selectable modes that give very fine-grained control.
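A minimal sketch of this kind of metric-driven selection is shown below. The metric names, weights, and server data are illustrative assumptions rather than Usher's actual algorithm.

```python
SERVERS = [
    {"name": "sfo-7",  "dc_load": 0.6, "srv_load": 0.4, "latency_ms": 20, "has_stream": True},
    {"name": "ord-3",  "dc_load": 0.3, "srv_load": 0.2, "latency_ms": 55, "has_stream": False},
    {"name": "iad-12", "dc_load": 0.8, "srv_load": 0.9, "latency_ms": 70, "has_stream": True},
]

def choose_server(optimize_for="latency"):
    def cost(server):
        score = 2.0 * server["srv_load"] + 1.0 * server["dc_load"]
        if optimize_for == "latency":
            score += server["latency_ms"] / 100.0
        if not server["has_stream"]:
            score += 0.5  # choosing this server forces an extra copy of the stream
        return score
    return min(SERVERS, key=cost)["name"]

print(choose_server())                    # latency-optimized placement
print(choose_server(optimize_for="cost")) # cost-optimized placement
```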
Any server in the Justin.tv system can act as an edge server, serving the video stream directly to users, and any server can act as an origin server, feeding the stream to other servers. This makes the load topology of a video stream dynamic and constantly changing.
4. Servers form a weighted tree
The connections created by copying a video stream between servers resemble a weighted tree. The system constantly samples and counts the number of viewers per stream; if the audience for a stream surges, the system copies it to a number of other servers. This process repeats, eventually forming a tree structure that can span all the servers in the network. The whole journey of a Justin.tv video stream, from the origin server, through copies to other servers, and on to the user, happens in memory; disk paths are never involved.
5. RTMP and HTTP
Justin.tv uses Flash as much as possible because it speaks the RTMP protocol, and the system keeps a separate session for every video stream, which makes the protocol fairly expensive. Multicast and peer-to-peer techniques cannot be used because ISPs do not support them downstream. Justin.tv did want to use multicast to copy streams between its own servers, but since their control already covers the whole internal network and cheap internal bandwidth is plentiful, multicast would not have bought much; and because their optimization algorithm minimizes the number of stream copies per server, working at a finer granularity would be more trouble than it is worth.
Justin.tv's Usher controls the service topology by using HTTP requests to decide which server carries which video stream. For the stream data itself, the problem with HTTP is that it lacks the latency and real-time behavior they need. Some people define "real time" as 5-30 seconds of delay, but that clearly does not work when thousands of people are broadcasting live and also need to discuss and interact in real time, which means the delay cannot exceed about a quarter of a second.
6. From AWS to their own data centers
Justin.tv started on AWS, then migrated to Akamai, and finally to its own data centers.
The reasons for leaving AWS for Akamai were, first, cost, and second, that the network was not fast enough for their needs. Live video is extremely sensitive to bandwidth, so a fast, reliable, consistent, low-latency network is critical. On AWS you control none of that: it is a shared network, often overloaded, and they could not get more than about 300 Mbps out of it. They valued the elasticity and the cloud APIs highly, but those did not make up for the performance and cost problems.
Three years ago Justin.tv worked out their cost per user: $0.135 on a CDN, $0.0074 on AWS, and $0.001 in their own data center. Today their CDN cost has come down, but their data center cost is about the same.
The key to running multiple data centers is being close to all the major peering points. They picked the best locations in the country for reaching the largest number of peers and built the data centers there, right next to those other networks, which eliminated the transit costs they used to pay and improved performance. They connect directly to the so-called "eyeball" networks that contain large numbers of cable/DSL users, and to "content" networks similar to Justin.tv's own; the "eyeball" traffic comes mostly from end users and in most cases the peering is free, costing nothing beyond hooking it up. Justin.tv also has a backbone network for moving video streams between data centers, because finding a peer willing to carry that traffic to a usable node is often difficult.
7. Storage
Video streams are not served from disk, but they are saved to disk: the origin server copies an incoming video stream to local disk and then uploads the file to long-term storage, so every second of broadcast video is recorded and archived.
The storage hardware is, as at YouTube, an array of disks using the XFS file system, and it records the broadcasts as they pass through the servers. By default a stream is kept for 7 days; users can change this, and can even keep it forever (as long as the company survives).
8. Real-time transcoding
Real-time transcoding was added: it can take any incoming stream, unpack it down to the transport and codec level, and re-encode it as streaming media in a new format. There is a transcoding cluster that handles this conversion work, and transcoding sessions are managed through the job system. If demand for transcoding exceeds the cluster's capacity, every server can be pressed into service as a transcoding server.
Web architecture
1. The Justin.tv front end uses Ruby on Rails
2. Caching with Twice
Every page of the system is cached for every user by their custom Twice caching system, which plays the combined role of a lightweight reverse proxy and a template system: each page is cached per user, and updates are then merged back into each page. With Twice, each process can handle 150 requests per second versus 10-20 heavyweight page requests per second handled in the background, which expands the number of pages the servers can handle by 7-10x. Most dynamic page accesses complete within 5 ms. Twice has a plug-in architecture, so it can support application-specific features such as adding geographic information.
Data such as the user name can be cached automatically without touching the application server.
Twice was custom-built for Justin.tv's needs and environment; if you are developing a new Rails application, using Varnish is probably a better idea.
3. Web traffic is served from one data center; the other data centers serve video.
4. Everything at Justin.tv is monitored. Every click, page view, and action is recorded so the service can keep improving. Log messages from the front end, from network calls, and from the application servers are converted to syslog messages and forwarded via syslog-ng. They scan all the data, load it into MongoDB, and run queries against Mongo.
5. Justin.tv's API is served by the same application servers as the website; it uses the same caching engine, so the API is scaled by scaling the website.
6. PostgreSQL is their most important database, in a simple master-slave setup with one master and several read slaves.
Because of the kind of site they run, the write load is light, and the caching system keeps most reads away from the databases. They found that PostgreSQL does not handle heavy writes well, so Justin.tv uses MemcacheDB for frequently written data such as counters.
7. There is a dedicated chat server cluster for the chat feature. When a user enters a channel, five different chat servers are assigned to serve them. Scaling chat is simpler than scaling video: users can be divided into rooms, and the rooms are spread across servers. They also do not let 100,000 people chat together; they cap each room at 200 people so that the group can have a meaningful conversation. That also helps scaling, which is a genuinely smart strategy (see the sketch after this list).
8. AWS is used to store images. They did not build a special system for storing lots of small images; they just use S3. It is convenient and cheap, and saves their time for other things. Their images are fetched very frequently, they are all cacheable, and there were no loose ends left to deal with.
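The room-capping idea from item 7 can be sketched as follows: viewers of a channel are split into rooms of at most 200, and rooms are spread across a pool of chat servers. The server names and the assignment rule are illustrative assumptions.

```python
ROOM_CAPACITY = 200
CHAT_SERVERS = ["chat-1", "chat-2", "chat-3", "chat-4", "chat-5"]

def assign_room(channel, viewer_index):
    room_number = viewer_index // ROOM_CAPACITY              # 0, 1, 2, ...
    server = CHAT_SERVERS[room_number % len(CHAT_SERVERS)]   # spread rooms over servers
    return {"room": f"{channel}:{room_number}", "server": server}

# The 450th viewer of a channel lands in room 2, which is hosted on chat-3.
print(assign_room("starcraft_finals", 450))
```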
Network topology design
The network topology is simple and flat. At the top of each rack there is a pair of 1 Gb cards, and each rack has multiple 10 Gb interfaces that connect straight to the external core routers. They use Dell PowerEdge switches, which are not fully featured for L3 (TCP/IP) but handle L2 (Ethernet) well; each switch carries on the order of 20 Gb of traffic all day long and is cheap. The core routers are Cisco 6500 series. Justin.tv wants to minimize hops, which reduces latency and per-packet processing time. Usher handles all the access control and other logic, not the network hardware.
Using multiple data centers lets them take advantage of peering and move traffic as close to users as possible. They interconnect with a great many other networks and nodes, which gives them several alternative transit options so the best path can always be used; if they hit congestion, they pick a different path. They can identify the ISP involved from the IP address and timing data.
Development and deployment
They build hosts with Puppet and have about 20 different classes of server, everything from database to cache; Puppet can turn a box into whatever they need.
They have two software teams: a product team and a hardware/infrastructure team. The teams are very small, probably 7-8 people each, and each team has a product manager. They hire generalists, plus specialists in networking and databases.
They use a web-based deployment process, so each new change can go out in a matter of minutes; it must pass QA before becoming part of the product, which usually takes 5-10 minutes.
Justin.tv uses Git to manage source code. What they like about Git is that you can write a 20-30 line change on a branch, and that branch can then be merged with what everyone else is working on. The work stays independent and modular, and when a submitted change has to be withdrawn, the code is easy to modify or revert. Every few days everyone tries to merge their branch into the main line to flush out conflicts. They make 5-15 changes to the software every day, ranging from one-line bug fixes to larger-scale tests.
Database schema changes are applied by hand. Migrating the replicated database copies together produces a version with the latest live records, and changes are tested in many different environments before they finally reach production.
Puppet manages the configuration files. Each small change is essentially an experiment; they track the impact of every change against the core files and the previous version. These tests matter because they reveal which changes actually cause the problems they care about.
The future of Justin.tv
Their goal is to grow by another order of magnitude. The first step is to shard the video metadata system: with the big increase in streams and servers, metadata load has also grown in bursts, so it needs large-scale partitioning, and for the web database they will use Cassandra to split it. The second step is to replicate the core data center for disaster recovery.
Lessons Learned
- Build or buy. They have made plenty of wrong calls on this question; for example, they should have bought a video server at the start instead of building one. Software engineers like to customize software and then rely on the open-source community to maintain it, which has many benefits, so they came up with a better process for making this decision: 1. Is this project alive, actively developed and maintained, or just getting bug patches? 2. Does anyone else use it? Is there someone you can ask about adapting it? 3. Extensibility: will we need to make changes? 4. If we build it ourselves, can we get it faster or better, or get more of the features we need? As with Usher, they considered whether they could build a new layer of features outside an existing system and have it interact with that system; making Usher the core of video scalability on top of a relatively dumb video server is an example of this kind of decision working out very well.
- Pay attention to what you are doing, not to what others are doing. Their goal is to have the best available system, the most uptime, and the best scalability; they spent three years developing technology that can manage millions of concurrent broadcast viewers.
- Do not outsource. The core value you gain is experience, not code or hardware.
- Treat everything as an experiment. Measure, test, and track everything; it pays off. Do it from the start, with good measurement tools. For example, they append a tag to copied URLs so they can tell whether you shared the link. They went from barely measuring to measuring everything. By rewriting the broadcast process they increased their session count by 700%. They want the site to run faster, respond faster, and load pages faster, and the video service to be better; every millisecond of latency squeezed out of the system brings in more broadcasters. They have 40 experiments running; when they want to turn a user into a broadcaster, they watch post-broadcast retention, broadcast availability, and session rates, and then make an informed decision about each change.
- The most important thing is to understand how your site is shared and how to optimize for it: by reducing the depth of the share links in the menu, they increased the share rate by 500%.
- Using common building blocks and infrastructure means the team can immediately identify what matters and act on it. Network capability is important and is something they should have paid attention to from the start.
- Keep the system busy. Use all of the system's capacity; why leave money on the table? Build a system that can shift resources to wherever the demand is.
- Do not waste time on unimportant things. If something off the shelf is convenient and cheap, there is no need to spend your own time on it; using S3 to store images is a typical example.
- Support what users actually want to do rather than what you think they should be doing. Justin.tv's ultimate goal seems to be to turn everyone into a broadcast point. By staying out of users' way while they experiment, they try to make the process as simple as possible, and in doing so they discovered that gaming is a huge force: users love getting the Xbox out, sharing it and discussing it, and there is a good chance that sort of thing was never in your business plan.
- Design for peak load. If you design only for the steady state, the site will break down at peaks. For live video this is usually a big deal, and once you hit that kind of trouble people quickly start spreading the word against you. Designing for peak load requires effort at every level of the technology stack.
- Keep the network architecture simple: use multiple data centers and direct peering connections.
- Do not be afraid to split things into more scalable chunks. For example, instead of one 100,000-person channel, divide users into rooms that are both more social and more scalable.
- A real-time system cannot hide anything from its users, which makes it hard to convince them that your site is reliable. Because users have a constant connection to the live system, every problem and fault is visible to everyone; there is no hiding it. Everyone notices, everyone tells each other what is happening, and soon users get the feeling that your site has a lot of problems. At that point communicating with your users becomes critical: build a trustworthy, high-quality, scalable, high-performance system from the start, and design the experience to be as simple and comfortable for users as possible.
Architectures of several video sites: Youku, YouTube, Twitter, and Justin.tv