Stack Overflow Architecture Update - Now at 95 Million Page Views a Month

Document directory
  • The stats
  • Data Centers
  • Hardware
  • Dev tools
  • Software and technologies used
  • External bits
  • Developers and System Administrators
  • Content
  • More architecture and lessons learned

A lot has happened since my first article on the Stack Overflow architecture. Contrary to the theme of that last article, which lavished attention on Stack Overflow's dedication to a scale-up strategy, Stack Overflow has both grown up and out in the last few years.

Stack Overflow has grown up by more than doubling in size to over 16 million users and multiplying its number of page views nearly 6 times to 95 million page views a month.

Stack Overflow has grown out by expanding into the Stack Exchange Network, which includes Stack Overflow, Server Fault, and Super User for a grand total of 43 different sites. That's a lot of fruitful multiplying going on.

What hasn't changed is Stack Overflow's openness about what they are doing, and that's what prompted this update. A recent series of posts talks a lot about how they've been handling their growth: "Stack Exchange's Architecture in Bullet Points", "Stack Overflow's New York Data Center", "Designing for Scalability of Management and Fault Tolerance", "Stack Overflow Search - Now 81% Less Crappy", "Stack Overflow Network Configuration", "Does Stack Overflow use caching and if so, how?", and "Which tools and technologies build the Stack Exchange Network?".

Some of the more obvious differences across time are:

  • Just more. More users, more page views, more datacenters, more sites, more developers, more operating systems, more databases, more machines. Just a lot more of more.
  • Linux. Stack Overflow was known for their Windows stack; now they are using a lot more Linux machines for HAProxy, Redis, Bacula, Nagios, logs, and routers. All support functions seem to be handled by Linux, which has required the development of parallel release processes.
  • Fault tolerance. Stack Overflow is now being served by two different switches on two different Internet connections, they've added redundant machines, and some functions have moved to a second datacenter.
  • NoSQL. Redis is now used as a caching layer for the entire network. There wasn't a separate caching tier before, so this is a big change, as is using a NoSQL database on Linux.

Unfortunately, I couldn't find any coverage on some of the open questions I had last time, like how they were going to deal with multi-tenancy across so many different properties, but there's still plenty to learn from. Here's a roll up of a few different sources:

The stats
  • 95 million page views a month
  • 800 HTTP requests a second
  • 180 DNS requests a second
  • 55 megabits per second
  • 16 million users: traffic to Stack Overflow grew 131% in 2010, to 16.6 million global monthly uniques.
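
To put these numbers in proportion, here is a small back-of-the-envelope calculation. It is only a sketch: the sources do not say whether the request and bandwidth figures are averages or peaks, and the 30-day month is an assumption, so treat the derived ratios as rough orders of magnitude.

```python
# Back-of-the-envelope arithmetic on the published figures.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month

PAGE_VIEWS_PER_MONTH = 95_000_000
HTTP_REQUESTS_PER_SECOND = 800
BANDWIDTH_BITS_PER_SECOND = 55_000_000

# Average page views per second if traffic were spread evenly over the month.
page_views_per_second = PAGE_VIEWS_PER_MONTH / SECONDS_PER_MONTH

# How many HTTP requests (pages, static assets, API calls) go with each page view.
requests_per_page_view = HTTP_REQUESTS_PER_SECOND / page_views_per_second

# Average payload per HTTP request at the stated bandwidth.
bytes_per_request = BANDWIDTH_BITS_PER_SECOND / 8 / HTTP_REQUESTS_PER_SECOND

# Total transfer if the stated bandwidth were sustained for the whole month.
terabytes_per_month = BANDWIDTH_BITS_PER_SECOND / 8 * SECONDS_PER_MONTH / 1e12

print(f"{page_views_per_second:.1f} page views/s on average")
print(f"{requests_per_page_view:.1f} HTTP requests per page view")
print(f"{bytes_per_request / 1000:.1f} kB per request")
print(f"{terabytes_per_month:.1f} TB per month")
```

Under those assumptions, 95 million page views a month works out to roughly 37 page views a second, about 22 HTTP requests per page view, and under 9 kB per request.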
Data Centers
  • 1 rack with Peak Internet in OR (hosts our chat and Data Explorer)
  • 2 racks with Peer 1 in NY (hosts the rest of the Stack Exchange Network)
Hardware
  • 10 Dell R610 IIS web servers (3 dedicated to Stack Overflow):
    • 1x Intel Xeon Processor E5640 @ 2.66 GHz quad core with 8 threads
    • 16 GB RAM
    • Windows Server 2008 R2
  • 2 Dell R710 database servers:
    • 2x Intel Xeon Processor X5680 @ 3.33 GHz
    • 64 GB RAM
    • 8 spindles
    • SQL Server 2008 R2
  • 2 Dell R610 HAProxy servers:
    • 1x Intel Xeon Processor E5640 @ 2.66 GHz
    • 4 GB RAM
    • Ubuntu Server
  • 2 Dell R610 Redis servers:
    • 2x Intel Xeon Processor E5640 @ 2.66 GHz
    • 16 GB RAM
    • CentOS
  • 1 Dell R610 Linux backup server running Bacula:
    • 1x Intel Xeon Processor E5640 @ 2.66 GHz
    • 32 GB RAM
  • 1 Dell R610 Linux management server for Nagios and logs:
    • 1x Intel Xeon Processor E5640 @ 2.66 GHz
    • 32 GB RAM
  • 2 Dell R610 VMware ESXi domain controllers:
    • 1x Intel Xeon Processor E5640 @ 2.66 GHz
    • 16 GB RAM
  • 2 Linux routers
  • 5 Dell PowerConnect switches
Dev tools
  • C#: Language
  • Visual Studio 2010 Team Suite: IDE
  • Microsoft ASP.NET (version 4.0): Framework
  • ASP.NET MVC 3: Web framework
  • Razor: View engine
  • jQuery 1.4.2: Browser framework
  • LINQ to SQL, some raw SQL: Data access layer
  • Mercurial and Kiln: Source control
  • Beyond Compare 3: Compare tool
Software and technologies used
  • Stack Overflow uses a WISC stack via BizSpark
  • Windows Server 2008 R2 x64: Operating system
  • SQL Server 2008 R2 running Microsoft Windows Server 2008 Enterprise Edition x64: Database
  • Ubuntu Server
  • CentOS
  • IIS 7.0: Web server
  • HAProxy: For load balancing
  • Redis: Used as the distributed caching layer
  • CruiseControl.NET: For builds and automated deployment
  • Lucene.NET: For search
  • Bacula: For backups
  • Nagios (with n2rrd and drraw plugins): For monitoring
  • Splunk: For logs
  • SQL Monitor: From Red Gate, for SQL Server monitoring
  • BIND: For DNS
  • Rovio: A little robot (a real robot) allowing remote developers to visit the office "virtually."
  • Pingdom: An external monitor and alert service
External bits

Code that is not included as part of the development tools:

  • reCAPTCHA
  • DotNetOpenId
  • WMD: Now developed as open source. See the GitHub network graph.
  • Prettify
  • Google Analytics
  • CruiseControl.NET
  • HAProxy
  • Cacti
  • MarkdownSharp
  • Flot
  • Nginx
  • Kiln
  • CDN: None. All static content is served off sstatic.net, which is a fast, cookieless domain intended for static content delivered to the Stack Exchange family of websites.
Developers and System Administrators
  • 14 developers
  • 2 System Administrators
Content
  • License: Creative Commons Attribution-Share Alike 2.5 Generic
  • Standards: OpenSearch, Atom
  • Host: Peak Internet
More architecture and lessons learned
  • HAProxy is used instead of Windows NLB because HAProxy is cheap, easy, free, and works great as a 512 MB VM "device" on a network via Hyper-V. It also works in front of the boxes, so it's completely transparent to them, and it is easier to troubleshoot as a different networking layer instead of being intermixed with all your Windows configuration.
  • A CDN is not used because even "cheap" CDNs like Amazon's are very expensive relative to the bandwidth they get bundled into their existing host's plan. The least they could pay is $1K/month based on Amazon's CDN rates and their bandwidth usage.
  • Backup is to disk for fast retrieval and to tape for historical archiving.
  • Full text search in SQL Server is very badly integrated, buggy, deeply incompetent, so they went to Lucene.
  • Mostly interested in peak HTTP request figures as this is what they need to make sure they can handle.
  • All properties now run on the same Stack Exchange platform. That means Stack Overflow, Super User, Server Fault, Meta, WebApps, and Meta Web Apps are all running on the same software.
  • There are separate Stack Exchange sites because people have different sets of expertise that shouldn't cross over to different topic sites. You can be the greatest chef in the world, but that doesn't qualify you for fixing a server.
  • They aggressively cache everything.
  • All pages accessed by (and subsequently served to) anonymous users are cached via output caching.
  • Each site has 3 distinct caches: local, site, global.
  • Local Cache: Can only be accessed from 1 server/site pair
    • To limit network latency they use a local "L1" cache, basically HttpRuntime.Cache, of recently set/read values on a server. This reduces the cache lookup overhead to 0 bytes on the network.
    • Contains things like user sessions, and pending view count updates.
    • This resides purely in memory, no network or DB access.
  • Site Cache: Can be accessed by any instance (on any server) of a single site
    • Most cached values go here; hot question ID lists and user acceptance rates are good examples.
    • This resides in Redis (in a distinct DB, purely for easier debugging).
    • Redis is so fast that the slowest part of a cache lookup is the time spent reading and writing bytes to the network.
    • Values are compressed before sending them to Redis. They have plenty of CPU and most of their data are strings, so they get a great compression ratio.
    • The CPU usage on their Redis machines is 0%.
  • Global Cache: Which is shared amongst all sites and servers
    • Inboxes, API usage quotas, and a few other truly global things live here
    • This resides in Redis (in DB 0, likewise for easier debugging)
  • Most items in the cache expire after a timeout period (a few minutes usually) and are never explicitly removed. When a specific cache invalidation is required they use Redis messaging to publish removal notices to the "L1" caches.
  • Joel Spolsky is not a Microsoft loyalist, he doesn't make the technical decisions for Stack Overflow, and he considers Microsoft licensing a rounding error. Consider yourself corrected, Hacker News commenter.
  • For their I/O system they selected a RAID 10 array of Intel X25 solid state drives. The RAID array eased any concerns about reliability, and the SSD drives performed really well in comparison to FusionIO at a much cheaper price.
  • The full boat cost for their Microsoft licenses would be approximately $242K. Since Stack Overflow is using BizSpark they are not paying near the full sticker price, but that's the max they could pay.
  • Intel NICs are replacing Broadcom NICs on their primary production servers. This solved problems they were having with connectivity loss, packet loss, and corrupted ARP tables.
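
The tiered caching described above (a local "L1" cache in front of a shared Redis tier, compressed values, timeout-based expiry, and published removal notices) can be sketched roughly as follows. This is an illustration only, not Stack Overflow's code: their implementation is in C# on top of HttpRuntime.Cache and Redis, whereas this Python sketch replaces Redis with a hypothetical in-process `FakeSharedStore` so it runs on its own, and every class and method name here is invented for the example.

```python
import time
import zlib


class FakeSharedStore:
    """In-process stand-in for the shared Redis tier (hypothetical)."""

    def __init__(self):
        self._data = {}         # key -> (expires_at, compressed bytes)
        self._subscribers = []  # callbacks that receive removal notices

    def set(self, key, blob, ttl):
        self._data[key] = (time.monotonic() + ttl, blob)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._data.pop(key, None)  # expired entries are simply dropped
            return None
        return entry[1]

    def delete(self, key):
        self._data.pop(key, None)

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key):
        for callback in self._subscribers:
            callback(key)


class TieredCache:
    """Local "L1" dict in front of a shared store; one instance per web server."""

    def __init__(self, shared, ttl_seconds=120):
        self.shared = shared
        self.ttl = ttl_seconds
        self.local = {}  # key -> (expires_at, value)
        shared.subscribe(self._on_removal_notice)

    def _on_removal_notice(self, key):
        self.local.pop(key, None)

    def set(self, key, value):
        # Strings compress well, so compress before the value crosses the network.
        self.shared.set(key, zlib.compress(value.encode("utf-8")), self.ttl)
        self.local[key] = (time.monotonic() + self.ttl, value)

    def get(self, key):
        entry = self.local.get(key)
        if entry is not None and entry[0] >= time.monotonic():
            return entry[1]  # L1 hit: zero bytes on the network
        blob = self.shared.get(key)
        if blob is None:
            self.local.pop(key, None)
            return None  # miss in both tiers
        value = zlib.decompress(blob).decode("utf-8")
        self.local[key] = (time.monotonic() + self.ttl, value)
        return value

    def invalidate(self, key):
        self.shared.delete(key)
        self.shared.publish(key)  # every server drops its L1 copy


shared = FakeSharedStore()
web1 = TieredCache(shared)
web2 = TieredCache(shared)

web1.set("hot-question-ids", "11,42,97")
first_read = web2.get("hot-question-ids")  # filled from the shared tier
web1.invalidate("hot-question-ids")
after_invalidation = web2.get("hot-question-ids")
```

In this sketch most entries just age out through the timeout, and `invalidate` is the rare explicit path: it deletes the shared copy and publishes a removal notice so that no server keeps serving a stale L1 value.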
Related Articles
  • Hacker News thread on this post / Reddit thread
  • Stack Exchange's Architecture in Bullet Points / Hacker News thread
  • Stack Overflow's New York Data Center: hardware of the various machines?
  • Designing for Scalability of Management and Fault Tolerance
  • Stack Overflow Blog
  • Stack Overflow Search - Now 81% Less Crappy: Lucene is now running on an underused cluster.
  • State of the Stack 2010 (a message from your CEO)
  • Stack Overflow Network Configuration
  • Does Stack Overflow use caching and if so, how?
  • Meta Stack Overflow
  • How does Stack Overflow handle cache invalidation?
  • Which tools and technologies build the Stack Exchange Network?
  • How does Stack Overflow handle Spam?
  • Our storage demo-
  • How are "hot" questions selected?
  • How are "related" questions selected? The title, the question body, and the tags.
  • Stack Overflow and DVCS: Stack Overflow selects Mercurial for source code control.
  • Server Fault chat room
  • C# Redis client
  • Broadcom, die mutha
Todd Hoff

Reader comments (13)

Did they explain why they use Redis instead of memcached for caching? I've heard of quite a few people using Redis for cache, just wondered what does Redis do that memcached doesn't?

If I remember correctly Redis is not a distributed database, right? With memcached if I add new nodes the client will automatically redistribute the cache to take advantage of the additional capacity. Redis doesn't do that. So why Redis?

March 3, 2011 | John

Backup is to disk for fast retrieval and to tape for historical archiving.

Really? People still do this? I know some organizations invested a tremendous amount in automation, robotic tape backup, but seriously, a site founded in 2008 is backing up to tape?

March 3, 2011 | James

Why would anybody use Windows/ASP over Linux/anything else?
It really surprises me people still do such things...

March 3, 2011 | Joe

"Why would anybody use Windows/ASP over Linux/anything else?
It really surprises me people still do such things..."

Because .NET is one of the best development frameworks out there. And Linux for networks is cheap, so the combination makes sense.

March 4, 2011 | pal

@John

One of the advantages of using something like Redis or Membase instead of memcached is that the cache can be persisted to disk. This can avoid the cache storm issue if the cache goes offline and is then brought back up.

I guess what we don't know is what configuration the Redis boxes are in, e.g. are they sharding, doing master/slave replication, etc.

Andy

March 4, 2011 | Andy

@Joe the logic is easy enough if you know your shit: Joel was on the MS Excel team, which wrote VBA and OLE automation.

March 4, 2011 | Root

@Joe - that's one of the least intelligent comments I've seen on this site.

March 6, 2011 | Sosh

James: backing up to tape means offline/archival backup. This is often worth the expense and hassle, especially for a large important dataset. After the issues a week or three ago, I can tell you that the Gmail guys are *very* glad they backed up to tape. If all your replicas are online, there's always the possibility that a single bug or slip of the fingers can wipe them simultaneously.

March 11, 2011 | Defenestrator

Technically, "IIS 7.0: Web server" is incorrect. Under Windows Server 2008 R2, it's actually IIS 7.5.

March 18, 2011 | Jason

@Sosh - please take it easy and don't elevate yourself in support of Microsoft products. There is no technical reason to run MS stuff among the best and latest of open-source companies and their communities. In fact, to really drive this point, the Stack Overflow team should be using more *paid/licensed* MS products everywhere to drive their point home. There is also the perspective of using the best combination of tools for the job, so points there. The answer is really easy: the Stack Overflow team knows MS products, Visual Studio, C#, and .NET, therefore it was cheapest and fastest (for this team) to deliver the Stack Exchange family of sites.

March 23, 2011 | simpleweb

Do they have any stated performance goals? How do they monitor site performance under load? These would seem to be important questions to ask of any site that gets profiled at highscalability.com...

March 24, 2011 | Mike duy

Yes, most people with serious data still use tape. Also, they are on Windows because the founder is an old Microsoft guy!

March 25, 2011 | fatherlinux

You can avoid software license and network hardware costs by just using a better app server:

Server                    Requests per second
---------------------------------------------
G-WAN Web server          142,000
Lighttpd Web server        60,000
Nginx Web server           57,000
Varnish Cache server       28,000

 
