The history of Internet development, seen through the technology of famous websites

Source: Internet
Author: User
Keywords: Facebook, Alexa rankings, Finagle, Kestrel, Scala, SPDY, WebP, entrepreneurship

In recent years, as the Internet industry has continued to innovate and grow, wave after wave of websites has either been eliminated or risen to prominence. Most of the successful ones have existed for ten years or more, and over such a long period they have faced many challenges, not only in business but also in technology. Below, several sites near the top of the Alexa rankings (as of April 21, 2012) are examined; by analyzing how they coped technically with the challenges of business growth, we can gain a deeper understanding of how the Internet industry has developed in recent years.

Google is currently ranked 1st on Alexa. It was born in 1997 as a research project. At the time the index was rebuilt monthly and distributed across multiple index servers by sharding (shard by doc); the actual web page data was likewise sharded across multiple doc servers. When a user submitted a query, a front-end server sent it to the index servers, which looked up their shards of the inverted index; specific page information (page title, snippets matching the search keywords, etc.) was then fetched from the doc servers and shown to the user.
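The document-sharded serving flow described above can be sketched in a few lines. This is an illustrative toy, not Google's code: each index shard holds an inverted index for a subset of documents, a front end fans the query out to every shard, merges the hits, and fetches snippets from the owning doc shard. The modulo routing and 40-character "snippet" are invented for the example.

```python
# Toy sketch of document-sharded search serving (illustrative only).
NUM_SHARDS = 3

# Inverted indexes and page data, sharded by document id (doc_id % NUM_SHARDS).
index_shards = [dict() for _ in range(NUM_SHARDS)]
doc_shards = [dict() for _ in range(NUM_SHARDS)]

def add_document(doc_id, text):
    shard = doc_id % NUM_SHARDS
    doc_shards[shard][doc_id] = text
    for word in set(text.lower().split()):
        index_shards[shard].setdefault(word, []).append(doc_id)

def search(word):
    word = word.lower()
    hits = []
    for shard in range(NUM_SHARDS):          # fan the query out to every index shard
        hits.extend(index_shards[shard].get(word, []))
    # fetch a "snippet" for each hit from the doc shard that owns it
    return [(doc_id, doc_shards[doc_id % NUM_SHARDS][doc_id][:40])
            for doc_id in sorted(hits)]

add_document(1, "search engine architecture")
add_document(2, "distributed search at scale")
add_document(5, "web page ranking")
```

Adding capacity means adding shards, which is exactly the scaling property the next paragraph describes.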

As the number of indexed pages grew, this structure could store more index and page data simply by adding index servers and doc servers, but it still faced many other problems, so over the following decade and more Google made many improvements to it.

In 1999, Google added a cache cluster to cache query results and document snippets, and turned the index servers and doc servers into replicated clusters. These two changes improved the site's response time, the traffic it could support, and its availability, at the cost of more hardware. Google's style has never been to buy expensive high-end hardware, but rather to guarantee reliability and performance at the software level, so in the same year it began using servers of its own design to reduce cost. In 2000, Google began designing its own data centers, using various techniques (such as alternative cooling in place of air conditioning) to optimize PUE (power usage effectiveness), and also did a great deal of work on server design. In 2001, Google changed its index format to hold the entire index in memory; this greatly improved response times and the traffic the site could support. In 2003, Google published its paper on the Google cluster architecture, describing a structure of hardware load balancers + index clusters + doc clusters + large numbers of low-cost servers (IDE disks, cost-effective CPUs, etc.); parallel processing plus sharding kept responses fast while lowering hardware requirements. The same year, Google published its paper on the Google File System (GFS had gone into production in 2000); GFS plus large numbers of inexpensive servers let Google store massive amounts of data without expensive storage hardware. In 2004, Google changed its index format again, further improving response times.
In the same year, Google published its paper on MapReduce; MapReduce plus large numbers of low-cost servers could quickly complete computing tasks that previously required expensive minicomputers, midrange systems, or even mainframes, which clearly helped Google build its index quickly. In 2006, Google published its paper on BigTable (work on which began in 2003), letting analysis of massive data sets meet the needs of online systems, a great help in improving the site's response times.
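The MapReduce programming model mentioned above can be shown with a minimal single-process sketch (the framework itself distributes this same flow across thousands of machines; the function names here are invented for illustration): map emits (key, value) pairs, the framework groups pairs by key, and reduce aggregates each group.

```python
# Minimal single-process sketch of the MapReduce model (illustrative only).
from collections import defaultdict

def map_phase(documents, mapper):
    pairs = []
    for doc in documents:
        pairs.extend(mapper(doc))        # mapper emits (key, value) pairs
    return pairs

def shuffle(pairs):
    groups = defaultdict(list)           # group all values by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count, the canonical example from the MapReduce paper.
def word_mapper(doc):
    return [(word, 1) for word in doc.split()]

def count_reducer(word, counts):
    return sum(counts)

docs = ["the quick fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs, word_mapper)), count_reducer)
```

In the real system, each phase runs in parallel on cheap machines, which is what let tasks previously run on mainframes complete quickly on commodity hardware.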

These three papers completely changed the industry's approach to storing, analyzing, and retrieving massive amounts of data (rumor has it that Google has already built internal replacements for GFS, MapReduce, and BigTable), and established Google's position of technical leadership in the industry.

In some scenarios Google also uses MySQL to store data, and likewise it has made many changes to MySQL; information about its MySQL work can be found at https://code.google.com/p/google-mysql/.

In 2007, Google cut index-building time down to minutes, so a new page became searchable within a few minutes of appearing; the index cluster served its data via protocol buffers. Each of Google's search products (web pages, images, news, books, etc.) is backed not only by the index cluster but also by many other services, such as advertising and spell checking. A single Google search may involve more than 50 internal services, written mainly in C++ or Java. In 2009, the article "How Google uses Linux" revealed the great effort Google puts into improving machine utilization, for example by co-locating workloads with different resource-consumption profiles on the same machine.

After that, Google developed Colossus (a next-generation GFS-like file system), Spanner (a next-generation BigTable-like storage and computation architecture), and real-time search (built on Colossus), mainly to improve the freshness of search results and to store more data. Beyond innovations in massive-data technology, Google has also been pushing traditional industry technologies forward, for example by increasing TCP's initial congestion window, proposing the SPDY protocol to improve on HTTP, and introducing the new image format WebP.

Throughout Google's development, its technical changes have revolved around four aspects: scalability, performance, cost, and availability. Its refusal to adopt expensive high-end hardware, together with data volumes far beyond other sites, has meant that its technical evolution is largely a story of innovating on traditional software and hardware technology itself.

Facebook is currently ranked 2nd on Alexa. It was built on LAMP, and as the business grew it too has made many technical changes.

As a first step, Facebook added memcached to the LAMP stack to cache data, dramatically improving response times and the traffic the system could support. It then split general-purpose features such as News Feed and search out into services, which the front-end PHP system accesses via Thrift. Facebook writes different services in different languages, mainly to pick the right language for each scenario, e.g. C++, Java, and Erlang.
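The way a LAMP front end typically uses memcached is the cache-aside pattern: read from the cache, fall back to the database on a miss, and invalidate on writes. A minimal sketch, with plain dicts standing in for a memcached client and a MySQL connection (the key names and TTL are invented for illustration):

```python
# Cache-aside pattern sketch (illustrative; not Facebook's code).
import time

db = {"user:1": {"name": "alice"}}      # stands in for MySQL
cache = {}                              # stands in for memcached: key -> (value, expiry)
TTL = 60                                # seconds

def get_user(key):
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                 # cache hit
    value = db.get(key)                 # cache miss: read from the database
    if value is not None:
        cache[key] = (value, time.time() + TTL)
    return value

def update_user(key, value):
    db[key] = value
    cache.pop(key, None)                # invalidate rather than update the cache
```

Invalidating on write (instead of writing the new value into the cache) avoids races where a concurrent stale read repopulates the cache with old data; the next read simply misses and refills from the database.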

Heavy use of memcached combined with rising traffic eventually produced so much network traffic to memcached that the switches could not keep up, so Facebook moved memcached access to UDP to reduce the network overhead per request. There were other modifications as well; details can be found at http://on.fb.me/8R0C.

As a scripting language, PHP's advantage is that it is simple and productive to develop in; its drawback is that it consumes more CPU and memory. Once Facebook's traffic grew to a certain scale, this drawback became pronounced, and starting in 2007 Facebook tried a number of ways to solve the problem. HipHop, a product born at a Facebook hackathon, was the one that succeeded. HipHop automatically compiles PHP into C++ code; after adopting it, Facebook served 6 times as many requests on equivalently configured machines, with average CPU utilization down 50%, saving a large number of hosts. Facebook later planned to improve HipHop further by compiling PHP to bytecode, executing it in the HipHop VM, and having the VM JIT-compile it to machine code, much like the JVM.

Facebook developed BigPipe in 2009, and with it managed to make its pages load twice as fast. As traffic rose, collecting execution logs from its many servers began to pose a challenge, and Facebook developed Scribe to solve it. For data stored in MySQL, Facebook supports ever-growing data volumes by splitting databases vertically and tables horizontally. Since MySQL is an important part of Facebook's technology stack, Facebook has also made many optimizations and improvements to it, such as online schema change; more information is available at http://www.facebook.com/MySQLAtFacebook.
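Horizontal table splitting means spreading the rows of one logical table across several physical databases by a shard key. A minimal sketch of the routing idea (the modulo scheme, shard count, and table shape are invented; real deployments also need directory services, resharding, and cross-shard queries):

```python
# Horizontal sharding sketch (illustrative only).
NUM_SHARDS = 4

def shard_for(user_id):
    return user_id % NUM_SHARDS               # simple modulo routing on the shard key

# Each dict stands in for one physical MySQL instance.
shards = {n: {} for n in range(NUM_SHARDS)}

def insert_post(user_id, post_id, text):
    shards[shard_for(user_id)].setdefault(user_id, []).append((post_id, text))

def posts_for(user_id):
    # a single-user query touches exactly one shard
    return shards[shard_for(user_id)].get(user_id, [])

insert_post(7, 1, "hello")
insert_post(7, 2, "again")
insert_post(12, 1, "hi")
```

Because all of one user's rows land on one shard, the common per-user queries stay on a single machine, which is what makes the scheme scale with added shards.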

Early in its development, Facebook used high-end storage devices (such as NetApp) and Akamai's CDN to store and serve images; as the number of images grew, costs rose dramatically, so in 2009 Facebook developed Haystack for image storage. Haystack runs on inexpensive PC servers, dramatically reducing cost.
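The core idea behind Haystack is to avoid one-file-per-photo (and its filesystem metadata overhead) by appending many small images into one large volume file and keeping an in-memory index of photo id to (offset, size), so a read costs a single seek. A toy sketch of that idea, with an in-memory buffer standing in for the on-disk volume (the names and format are invented; the real system adds needle headers, checksums, and replication):

```python
# Toy haystack-style append-only photo store (illustrative only).
import io

store = io.BytesIO()        # stands in for one large on-disk volume file
needle_index = {}           # photo id -> (offset, size), kept in memory

def put_photo(photo_id, data):
    offset = store.seek(0, io.SEEK_END)   # append-only write at the end of the volume
    store.write(data)
    needle_index[photo_id] = (offset, len(data))

def get_photo(photo_id):
    offset, size = needle_index[photo_id]
    store.seek(offset)                    # one seek, one read
    return store.read(size)

put_photo(1, b"jpeg-bytes-a")
put_photo(2, b"jpeg-bytes-bb")
```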

Besides MySQL, Facebook has been exploring new storage approaches in recent years. In 2008 it developed Cassandra as a new store for Messages inbox search. In 2010, however, Facebook abandoned Cassandra in favor of HBase as its Messages store, and in 2011 applied HBase to more Facebook projects (such as Puma and ODS). It is said that Facebook is now trying to migrate its user and relationship data from MySQL to HBase.

Since 2009, Facebook has designed its own data centers and servers to reduce operating costs, and has open-sourced the technology behind its data center, which achieves a PUE of only 1.07. Facebook's technical principle is: "Build with open source, optimize it, and feed it back to the community." This principle is visible throughout Facebook's technical history, and its technical changes likewise revolve around the four aspects of scalability, performance, cost, and availability.

Twitter currently ranks 8th on Alexa. Born in 2006, it was built on Ruby on Rails + MySQL, and in 2007 added memcached as a cache layer to improve response times. Ruby on Rails gave Twitter the ability to grow quickly, but as traffic increased its CPU and memory consumption became painful, and Twitter put much effort into remodeling it, for example writing an optimized version of the Ruby GC.

In 2008, Twitter decided to migrate to the JVM, choosing Scala as its main development language (citing "the difficulty of selling Java to a roomful of Ruby programmers"). It adopted Thrift as its primary communication framework and developed Finagle as its service framework, so that back-end functionality could be exposed as services without the front-end systems caring about the underlying protocols (for example, the same calling style can be used to access memcached, Redis, or a Thrift server). It also developed Kestrel as its message-queue middleware (replacing Starling, which was written in Ruby).

Twitter has always used MySQL as its data store. An interesting episode: when Facebook open-sourced Cassandra, Twitter planned to adopt it but ultimately gave up and stayed with MySQL; Twitter's version of MySQL has been open-sourced (https://github.com/twitter/mysql). To support its large data volumes, Twitter caches tweets in memcached, while timeline information has been migrated to Redis for caching.

Twitter built its first datacenter, in Salt Lake City, in 2010, mainly to increase controllability. Across Twitter's development, six years of technical change have focused mainly on scalability and availability.

As an employee of an e-commerce site myself, allow me to introduce the technology evolution of this famous e-commerce site, ranked 21st on Alexa.

eBay was born in 1995, built on CGI with GDBM as its database, and could hold at most 50,000 items online. In 1997, eBay migrated its operating system from FreeBSD to Windows NT and its database from GDBM to Oracle. In 1999, it turned the front-end system into a cluster (previously a single host), using Resonate for load balancing, upgraded the back-end Oracle machine to a Sun E10000 minicomputer, and in the same year added a standby database machine to improve availability. The front-end machines could cope with rising traffic, but in November 1999 the database machine hit its limits (no more CPU or memory could be added), so eBay began splitting the database into multiple libraries by business line. In 2001-2002, eBay split its data tables horizontally, for example storing items by category, and deployed Oracle on Sun A3500 minicomputers. In 2002 the whole site was migrated to Java; a DAL framework was introduced to shield applications from the effects of database splitting, and a development framework was designed to let developers focus on feature development. Across eBay's development, technical change has revolved mainly around two points: scalability and availability.

Tencent is ranked 9th on Alexa. The earliest QQ IM used a single access server to handle user login and presence, but by the time 1 million users were online simultaneously, that server could no longer cope. So QQ IM turned each single server into a cluster and added a status-synchronization server to synchronize presence across the cluster; user information was stored in MySQL, split across databases and tables, and friend relationships were stored in a file store of Tencent's own implementation. To improve the efficiency of inter-process communication, Tencent implemented its own IPC mechanism for user status. Tencent then turned the status-synchronization server into a synchronization cluster to support the ever-growing number of online users. After several rounds of transformation, the system could basically support tens of millions of users online, but availability was poor, so Tencent reworked QQ IM again, achieving same-city cross-IDC disaster tolerance and strengthening its monitoring and operations systems. Later (roughly from 2009 to the present), Tencent decided to completely rewrite the QQ IM architecture, mainly to increase flexibility, support cross-city IDCs, and support much larger friend lists. In this major overhaul, Tencent's data is no longer stored in MySQL but entirely in storage systems of its own design.

From the technology evolution of QQ IM, its technical changes have revolved mainly around scalability and availability.

Taobao was born in 2003, created by directly purchasing a commercial PHPAuction product and modifying it. In 2004, the system was migrated from PHP to Java and from MySQL to Oracle (on minicomputers and high-end storage devices), with WebLogic as the application server. During 2005-2007, JBoss replaced WebLogic, the database was split into multiple databases, a distributed cache was built on BDB, a distributed file system (TFS) was developed to store small files, and Taobao built its own CDN. In 2007-2009, the application systems were split vertically, with each split system exposing its functionality as services, and the data was split both vertically and horizontally.

After the vertical and horizontal splits of the data, Oracle became more and more expensive, so over the following years Taobao began migrating data from Oracle to MySQL and trying new storage schemes, such as using HBase to support the storage and retrieval of historical orders. In recent years, Taobao has begun customizing the Linux kernel, the JVM, Nginx, and other software, and has designed low-power servers, reducing cost through combined software and hardware optimization.

Across Taobao's development, technical change has revolved mainly around the two points of scalability and availability, and it is now also beginning to focus on performance and cost. Taobao is currently ranked 14th on Alexa.

Summary

Looking at the technology evolution of these highly ranked Alexa sites: because each site has a different business, team composition, and working style, each supports business growth in different ways at different stages, but the work basically revolves around the four points of scalability, availability, performance, and cost. These sites have much in common in their technical structures, and those structures will continue to evolve as the sites grow larger.
