Serendip is a social music service for sharing music between friends. Based on the idea that like-minded people cluster together, users have a good chance of discovering friends who share their taste in music. Serendip is built on AWS, using a stack that includes Scala (and some Java), Akka (for concurrency), the Play framework (for the web and API front-end), MongoDB, and Elasticsearch.
Stack selection
The biggest challenge for Serendip is that it has to deal with a large amount of data from day one, because its core feature is collecting music-related information from Twitter and other music services. Therefore, the overriding concern when choosing a technology stack was scalability.
1. JVM
The JVM is a good fit for Serendip's requirements: its performance has been validated in many contexts, and many open source systems provide native JVM clients.
2. Scala, Akka, and the Play framework
Within the JVM ecosystem, Scala stands out as a modern programming language that keeps good interoperability with Java. Akka, which pairs naturally with Scala, became a large part of the infrastructure for stream processing. When the service started being built in 2011, the Play web framework had only just become popular and its reliability was still being proven, so the technology was quite cutting-edge at the time; reassuringly, Scala and Akka merged into Typesafe at the end of 2011, with Play joining shortly thereafter.
3. MongoDB
There were many reasons to choose MongoDB: developer friendliness, ease of use, its feature set, and scalability (via automatic sharding). However, we soon found that the application's data usage and query patterns required a large number of indexes on MongoDB, and the system quickly ran into performance and memory bottlenecks. As a result, we changed how we use MongoDB: it now serves mainly as a key-value document store, and we also rely on its atomic increments for several features that require counters.
With this usage strategy, MongoDB performance has been very stable. MongoDB is also easy to maintain, although a large part of that comes from avoiding sharding and running only a single replica set (MongoDB's sharding architecture is quite complex).
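As an illustration of the counter pattern described above, here is a minimal sketch of using MongoDB's atomic $inc operator from Scala. The driver choice (the official mongo-scala-driver), database, collection, and field names are assumptions for the example, not Serendip's actual code.

```scala
import org.mongodb.scala.MongoClient
import org.mongodb.scala.model.{Filters, UpdateOptions, Updates}
import scala.concurrent.Await
import scala.concurrent.duration._

object CounterExample extends App {
  // Hypothetical database and collection names, for illustration only.
  val client     = MongoClient("mongodb://localhost:27017")
  val collection = client.getDatabase("serendip").getCollection("counters")

  // Atomically increment a per-song share counter; upsert(true) creates the
  // document the first time the counter is touched.
  def incrementShares(songId: String): Unit = {
    val update = collection.updateOne(
      Filters.equal("_id", songId),
      Updates.inc("shares", 1),
      UpdateOptions().upsert(true)
    )
    // The Scala driver is asynchronous; blocking here is only for the example.
    Await.result(update.toFuture(), 5.seconds)
  }

  incrementShares("song-123")
}
```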
4. Elasticsearch
To query the data effectively, we needed a system with proven search capabilities. Among the available open source search solutions, Elasticsearch is arguably one of the most scalable and cloud-friendly. With dynamic index mappings and its many search and faceting options, Elasticsearch could clearly support many of our features, so it naturally became a core component of the system architecture.
We decided not to use a managed solution for MongoDB and Elasticsearch, for two reasons: first, we wanted full control over both systems and did not want to depend on another party for upgrades and downgrades; second, the amount of data we process makes a hosted service considerably more expensive than managing it ourselves on EC2.
System statistics
Serendip's "pumps" handle about 5 million messages a day, pulling data from Twitter, Facebook, and other services. These messages pass through a series of "filters" that fetch and resolve music links from services such as YouTube, SoundCloud, and Bandcamp, and attach metadata to each message. The pumps and filters are implemented as Akka actors, and the whole process runs on a single dedicated m1.large EC2 instance; it can be scaled out on demand and distributed across a cluster using Akka remote actors.
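The article does not show the actual pipeline code, but a minimal sketch of the pump/filter idea with classic Akka actors might look like the following. The message types, actor names, and link-matching logic are illustrative placeholders, not Serendip's real implementation.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Illustrative message types: a raw social post and an enriched version with resolved music links.
case class RawMessage(source: String, text: String)
case class EnrichedMessage(source: String, text: String, musicLinks: Seq[String])

// Stand-in for the component that writes to Elasticsearch and MongoDB.
class Indexer extends Actor {
  def receive: Receive = {
    case msg: EnrichedMessage => println(s"indexing: $msg")
  }
}

// A "filter" actor: extracts music links (e.g. YouTube, SoundCloud, Bandcamp) and adds metadata.
class LinkFilter(indexer: ActorRef) extends Actor {
  private val MusicHosts = Seq("youtube.com", "soundcloud.com", "bandcamp.com")

  def receive: Receive = {
    case RawMessage(source, text) =>
      val links = text.split("\\s+").filter(word => MusicHosts.exists(word.contains)).toSeq
      if (links.nonEmpty) indexer ! EnrichedMessage(source, text, links)
  }
}

// A "pump" actor: receives messages from an external service and pushes them into the filter chain.
class Pump(filter: ActorRef) extends Actor {
  def receive: Receive = {
    case msg: RawMessage => filter ! msg
  }
}

object Pipeline extends App {
  val system  = ActorSystem("serendip-pipeline")
  val indexer = system.actorOf(Props(new Indexer), "indexer")
  val filter  = system.actorOf(Props(new LinkFilter(indexer)), "link-filter")
  val pump    = system.actorOf(Props(new Pump(filter)), "twitter-pump")

  pump ! RawMessage("twitter", "check this out https://soundcloud.com/some-track")
}
```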
From this stream, the system extracts approximately 850,000 valid messages per day, i.e. messages that contain relevant music links. All of these are indexed in Elasticsearch and also stored in MongoDB for backup and counting. Because each message updates several objects, this results in an indexing rate of roughly 40 operations per second on Elasticsearch.
In Elasticsearch, messages (tweets and posts) are indexed by month; each monthly index holds approximately 25 million messages and has 3 shards. The cluster runs 4 nodes based on m2.2xlarge instances, plus an m1.xlarge replica node.
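For illustration, creating such a monthly index with 3 shards could look roughly like the sketch below, which calls the standard Elasticsearch create-index API over HTTP. The index naming scheme, replica count, and host are assumptions for the example.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object MonthlyIndex extends App {
  // Derive a per-month index name, e.g. "posts-2013-09" (naming is illustrative).
  val month     = LocalDate.now.format(DateTimeFormatter.ofPattern("yyyy-MM"))
  val indexName = s"posts-$month"

  // 3 primary shards, as described above; the replica count here is an assumption.
  val settings =
    """{ "settings": { "number_of_shards": 3, "number_of_replicas": 1 } }"""

  val request = HttpRequest.newBuilder()
    .uri(URI.create(s"http://localhost:9200/$indexName"))
    .header("Content-Type", "application/json")
    .PUT(HttpRequest.BodyPublishers.ofString(settings))
    .build()

  val response = HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode()} ${response.body()}")
}
```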
Building the feed
We wanted the feed to be dynamic and to respond to user actions and signals. If a user marks a song "rock-on" or "airs" an artist, we want that action to be reflected in the feed immediately. If a user shows no love for an artist, that artist's music should no longer be recommended.
We also wanted the feed to combine multiple sources, such as music shared by friends, by favorite artists, and by people with similar taste. This means the commonly used "fan-out-on-write" approach would not be appropriate; the service needed a more real-time way of building the feed that takes full advantage of the signals collected from the user, and Elasticsearch's feature set makes this possible.
The feed algorithm combines several strategies that dynamically adjust the weight of the different sources, with each strategy focusing on the user's most recent actions and signals. The combination of strategies is translated into several queries over the real-time data indexed in Elasticsearch. Because the data is indexed by month and the feed cares about recent activity, the system only needs to query a small fraction of all the data.
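A hedged sketch of what such a combined feed query might look like is shown below, as a bool query with weighted terms clauses run against the recent monthly indexes. The field names, boost values, and index names are illustrative assumptions, not Serendip's actual schema.

```scala
object FeedQuery {
  private def quoted(values: Seq[String]): String =
    values.map(v => "\"" + v + "\"").mkString(", ")

  // Build a query that mixes friends' shares, soulmates' shares, and favorite artists,
  // each with a different weight. All names and boosts here are illustrative.
  def feedQuery(friendIds: Seq[String],
                soulmateIds: Seq[String],
                favoriteArtists: Seq[String]): String =
    s"""{
       |  "query": {
       |    "bool": {
       |      "should": [
       |        { "terms": { "sharedBy": [${quoted(friendIds)}], "boost": 2.0 } },
       |        { "terms": { "sharedBy": [${quoted(soulmateIds)}], "boost": 1.5 } },
       |        { "terms": { "artistId": [${quoted(favoriteArtists)}], "boost": 1.0 } }
       |      ],
       |      "minimum_should_match": 1
       |    }
       |  },
       |  "sort": [ { "timestamp": "desc" } ]
       |}""".stripMargin

  // The query would be run only against the most recent monthly indexes
  // (e.g. /posts-2013-09,posts-2013-08/_search), so only a small fraction
  // of the data ever needs to be scanned.
}
```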
Luckily, Elasticsearch is very good at this kind of search. It also offers a known path for scaling: indexes can be split into more shards, and searches can be scaled out by adding more replicas and physical nodes.
The "Music soulmates" search process is also a full application of elasticsearch, and as part of an uninterrupted social exchange process, the system will compute the most-shared artists it collects for social network users.
Whenever a Serendip user sends a signal (by airing music or interacting with the feed), it can trigger a recalculation of that user's "music soulmates". The matching relies on the constantly updated list of favorite artists to find users with similar taste, and also takes popularity and the number of shares into account. Another set of algorithms filters out spammers and outliers.
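The exact matching algorithm is not described in detail, but a simplified Scala sketch of the idea, scoring users by the weighted overlap of their favorite-artist share counts and filtering out obvious outliers, could look like this. The similarity measure and the spam threshold are assumptions for illustration.

```scala
object Soulmates {
  // Each user's "favorite artists" as a map of artistId -> number of shares (illustrative model).
  type ArtistShares = Map[String, Int]

  /** Taste similarity as a weighted overlap of shared artists (a Dice-like score),
    * normalized so that users who share everything do not dominate. */
  def similarity(a: ArtistShares, b: ArtistShares): Double = {
    val common = a.keySet intersect b.keySet
    if (common.isEmpty) 0.0
    else {
      val overlap = common.toSeq.map(artist => math.min(a(artist), b(artist))).sum.toDouble
      val total   = (a.values.sum + b.values.sum).toDouble
      2.0 * overlap / total
    }
  }

  /** Recompute a user's soulmates against a set of candidates, dropping outliers
    * with implausibly many shares (a crude spam filter; the threshold is illustrative). */
  def soulmates(user: ArtistShares,
                candidates: Map[String, ArtistShares],
                top: Int = 10): Seq[(String, Double)] =
    candidates
      .filter { case (_, shares) => shares.values.sum < 10000 }
      .map    { case (id, shares) => id -> similarity(user, shares) }
      .toSeq
      .sortBy(-_._2)
      .take(top)
}
```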
After a long period of running in production, we have found that this approach works very smoothly, with no need for an additional system running more complex clustering or recommendation algorithms.
Monitoring and deployment
Serendip uses Server Density for monitoring and alerting. Server Density is a paid service that provides MongoDB and server monitoring out of the box, and it is also used extensively to display internal system statistics through custom metrics.
An internal stats collection mechanism records events for every action in the system and saves them in a MongoDB collection; a timed job reads this data from MongoDB every minute and reports it to Server Density. Server Density is also used to monitor and alert on Elasticsearch.
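A minimal sketch of such a timed job is shown below: it counts the events recorded in MongoDB over the last minute and hands the number to the monitoring side. The collection name and timestamp field are assumptions, and the actual Server Density reporting call is deliberately left as a placeholder.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import org.mongodb.scala.MongoClient
import org.mongodb.scala.model.Filters
import scala.concurrent.Await
import scala.concurrent.duration._

object MetricsJob extends App {
  // Hypothetical events collection; each document is assumed to carry a "ts" timestamp in millis.
  val events = MongoClient("mongodb://localhost:27017")
    .getDatabase("serendip")
    .getCollection("events")

  // Placeholder for the reporting call; the real Server Density API call is not shown here.
  def pushToMonitoring(metric: String, value: Long): Unit =
    println(s"$metric = $value")

  // Run once a minute: count the events written since the previous run and report them.
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable {
    def run(): Unit = {
      val oneMinuteAgo = System.currentTimeMillis() - 60000
      val count = Await.result(
        events.countDocuments(Filters.gte("ts", oneMinuteAgo)).toFuture(),
        10.seconds
      )
      pushToMonitoring("events_per_minute", count)
    }
  }, 0, 1, TimeUnit.MINUTES)
}
```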
Server and deployment management relies on Amazon Elastic Beanstalk, AWS's constrained PaaS solution, which is very easy to get started with. Although it is not a full PaaS, its basic functionality is enough for the common use cases: it provides simple auto-scaling configuration while still allowing full access to the underlying EC2 instances.
The application is built on a Jenkins instance running on EC2. The Play web application is packaged as a WAR, and a post-build script pushes the WAR to Elastic Beanstalk as a new application version. New versions are not deployed to the servers automatically; this is done manually, usually after testing the version in a staging environment before promoting it to production.
Lessons learned
The following summarizes several key lessons from building Serendip:
1. Know how to scale. You may not need to scale on day one, but you need to know how far each component of the system can scale, and give yourself enough lead time to scale when the moment comes.
2. Prepare for spikes. Especially in the early stages of a product, leave enough headroom to cope with sudden load or rapid growth.
3. Choose a language that doesn't slow you down. Make sure the technologies you plan to use have native clients in that language, or at least actively maintained ones, and don't let your application get stuck waiting for library updates.
4. Believe the hype. You want a technology that is going to last, and a loud, vibrant community discussing and reviewing a technology is good evidence of its vitality.
5. Don't believe the hype too much. Look for the critical reviews of the technology; they will tell you where its weak points are. But don't take them too seriously either, because people tend to get emotional when things disappoint them.
6. Have fun. Choose a technology that excites you, so that you can stay motivated.