[Reprint] Prismatic: analyzing user interests with machine learning takes 10 seconds

[Date: 2013-01-03] Source: CSDN Author: Todd Hoff

Http://www.chinacloud.cn/show.aspx? Id = 11857 & cid = 17

About Prismatic

First, a few things need explaining. The founding team is small: only four computer scientists, three of them young and promising PhDs from Stanford and Berkeley. They are applying their expertise to the problem of information overload. These PhDs also work as programmers: they develop the website backend, the iOS app, and the big data and machine learning code. The highlight of Prismatic's system architecture is that it uses machine learning to process social media streams in real time. Because of trade secrets, the team did not disclose its machine learning techniques, but we can get a sense of them through the architecture. Bradford Cross, one of Prismatic's founders, briefly described the system as: "It is a comprehensive system that provides large-scale, real-time, and dynamic personalized information ranking, classification, and grouping functions." The architecture of this system is described below.

Prismatic's main function is to discover what interests us and find things for us to read.

What do you want to read today? Modern people face this dilemma every day. We usually rely on a scattering of sources to find what we want to read: Twitter, RSS, Facebook, Pinterest, G+, email, Techmeme, and others. Prismatic sets out to answer the "What should I read today?" question, and Jason Wolfe generously agreed to describe the solution they use in detail. His description contains many fashionable technical terms, such as machine learning, social graphs, big data, functional programming, and real-time requirements. The result is an approach that stays largely out of sight, but unlike some similar applications, it discovers your interests no matter how deeply they are buried in your data.

As you might expect, their approach is a little unusual. They chose Clojure as their development language: a modern Lisp that compiles to Java bytecode. The main aim is to take full advantage of a functional language to build fine-grained, freely composable abstractions that meet the particular needs of each problem. Two examples show how they benefit from a functional language. First, they use their graph library almost everywhere; for each user, the graph computation amounts to a low-latency, pipelined set of map-reduce jobs. Second, they use subgraphs to implement modular service configuration.

They put the focus of system design on functional development and avoid giant frameworks such as Hadoop in favor of small libraries, which makes the system more reliable, easier to debug, easier to extend, and easier to understand. One difficulty with Prismatic's machine learning approach is that it requires a very long training period. Two points are worth noting. First, it does not take much time to start getting high-quality content. Second, a machine-learning-based recommendation system has to be trained on all of a user's data, from the very beginning up to the present; this scalable digital simulation of thought plays the role of information collector and producer.

Figure: more and more applications are paying attention to users' reading recommendations

Statistical data

    • Tens of millions of articles, blog posts, and other pieces of content are published every day through Twitter, Facebook, Google Reader, and the wider Internet.
    • Even so, within a few seconds of a user logging in, Prismatic can use their previous reading history to analyze their interests and recommend reading. It also uses the user's activity on social networks to recommend articles published by contacts that suit the user's taste. At the same time, Prismatic records the user's actions within the app and uses them as a further source for analysis.
    • Prismatic takes dozens of seconds to visit the user's personal homepage and compare it with the user interest model to determine what the user is most interested in.
    • Currently, millions of excellent articles are successfully shared with users through Prismatic every week.

Platform

    • Prismatic's servers are hosted on AWS EC2 and run Linux.
    • 99.9% of Prismatic's backend pipelines and API services are written in Clojure.
    • All of Prismatic's heavyweight processing runs on the JVM.
    • Prismatic uses MongoDB to store server parameters and user analysis data.
    • Prismatic uses MySQL.
    • Prismatic uses S3.
    • Prismatic uses Dynamo.
    • Prismatic focuses on writing code tailored to the specific problems it needs to solve.

Data storage and input/output

    • Unlike a typical architecture, Prismatic's services are designed around the data and can read and write it in real time.
    • Most data is passed directly between backend pipeline stages and never touches disk.
    • Prismatic keeps data as close to the CPU as possible, so API requests incur almost no I/O latency. Because the API is CPU-bound, the system scales out gracefully.
    • Prismatic uses many existing solutions: MongoDB, MySQL, and Amazon S3 and Dynamo. Each was chosen carefully based on actual needs, including scale, access patterns, and other characteristics of the data being stored.

Service

Seen from outside, the Prismatic system is divided into 10 independent service modules, which fall roughly into five types: data collection, new user management, API, other client-facing services, and batch processing. Each service is designed for one function, scales horizontally in its own way, and is usually bounded by one or two specific resources (I/O, CPU, or RAM).

    • Data collection (backend)

Every day a large amount of content (articles, blog posts, and web pages) is produced, and Prismatic wants to capture as much of it as possible. So that users can comment on each item and share it with their friends, Prismatic must identify each item's author, the people who shared it, and any valuable comments on it. The first step is to collect and analyze the content and relationship data.

First, at the entrance to the collection pipeline, Prismatic finds new articles in RSS feeds and uses the Twitter and Facebook APIs to collect the latest comments and activity. The services that do this are very simple and stateless. The only state, and the only interesting part, is determining which information has already been collected and deciding intelligently what to collect next. The former is the harder of the two, because people may post the same information in multiple places.
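The "already collected" check described above can be illustrated with a set keyed by a normalized URL, so that the same link arriving from several places is only processed once. This is a minimal Python sketch, not Prismatic's code (which is written in Clojure); the normalization rules, the `TRACKING_PARAMS` list, and the `SeenFilter` name are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters that only carry tracking data.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def canonicalize(url):
    """Normalize a URL so that trivially different copies compare equal."""
    parts = urlsplit(url.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment, sort the remaining query.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       urlencode(sorted(query)), ""))


class SeenFilter:
    """Remembers which canonical URLs have already been collected."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

In a real deployment the set would have to live in shared storage rather than in one process's memory; the sketch only shows the check itself.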

Then the information collected from RSS, Twitter, Facebook, and other sources is fed into a filter that decides which items should go into the analysis pipeline. Spam and other junk are discarded first, and the remaining URLs are sent into the pipeline. This is where many of the interesting things happen. Prismatic builds a queue of URLs and sends each one through a graph computation that describes it in detail: fetching its HTML, using machine learning algorithms to extract the article text, identifying the best image, and extracting the author, tags, and topics. The team has invested a lot of time in making these steps as fast as possible, keeping them in memory, and making them perform well. Naturally, the URLs are processed in a highly parallel way.

Finally, the document controller receives the fully described articles and the social content processed above and matches them up. It organizes the document set into a series of stories, decides which documents are to be processed, and manages the directory of machines associated with the APIs.

    • New user management (backend)

The new user management module collects information about newly registered users. It draws on Twitter, Facebook, and Google Reader to find their favorite topics and article sources, and analyzes their social graph to find the articles shared by their favorite friends.

The services that perform these functions are also highly concurrent. Social graph analysis is very complex, and the key is how these services achieve such low latency and reliable throughput: for active users of Twitter, Facebook, and Google Reader, it takes 15 seconds or less to find hundreds of topics that the user likes. Once the user grants authorization, Prismatic can analyze their interests quickly, often before the user has finished the registration process (entering a password and clicking confirm). Here is what happens in those 15 seconds: Prismatic fetches the content the user has posted on Twitter and Facebook and the items they have liked in Google Reader, which generally takes about 10 seconds. It then collects 1,000 URLs shared by the user and their contacts and sends them to the document controller mentioned above, through a pipeline that uses the same machine learning stack as collection and analysis. The results are aggregated and post-processed, and finally stored in DynamoDB and S3.

This process has very strict latency requirements, so its steps cannot be run serially and must execute concurrently, using pipelining and parallel processing wherever possible. It also needs high throughput, because the work per user is not small, and when many users register at the same time, even more must be done to keep latency down. This is one of Prismatic's distinctive features: their streaming and aggregation library has been pared down and tuned for performance, so that each user effectively gets a low-latency map-reduce job. With many users, this concurrent processing extracts almost all of the performance the machines have to offer.
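The idea of a small per-user map-reduce job can be sketched as follows: analyze each shared URL in parallel (map), then fold the per-URL results into topic counts (reduce). This Python sketch only illustrates the pattern, not Prismatic's Clojure library; `analyze_url` is a hypothetical stand-in that fakes a topic from the URL path so the example runs without network access.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def analyze_url(url):
    """Map step (hypothetical): return the topics found for one URL.

    A real system would fetch and analyze the page; here the topic is
    simply the last path segment of the URL.
    """
    return [url.rstrip("/").split("/")[-1]]


def user_interests(shared_urls, workers=8):
    """Run the map step in parallel, then reduce the results into topic counts."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order as they become available.
        for topics in pool.map(analyze_url, shared_urls):
            totals.update(topics)  # reduce step
    return totals
```

A thread pool suits this sketch because the real map step would be dominated by waiting on network I/O; CPU-heavy analysis would call for a different executor.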

API (client-facing)

The results of data collection and new user management all end up on the API machines. The main design goals and challenges of this part of the system are as follows. Recent articles must be indexed and served with low latency. The indexes are not small (often many gigabytes) and must be updated in real time so that the latest articles reach users. This calls for balancing users across machines, and for the load to be split quickly and simply onto new machines (and, of course, merged back when machines are shut down). An index alone is not enough to find what a user really wants; Prismatic also needs the user's "fingerprint" (their interests, social relationships, recently read articles, and so on). This fingerprint data is very large and changes constantly.

The document controller handles the first few of these problems (as mentioned above). It organizes and preprocesses the current document set and writes a freshly prepared set of index files to S3 every few minutes. When a new API machine starts, the document controller first reads these files, loads them into memory, and then hands the machine the latest copy of the index. The document controller is also responsible for pushing all index changes (new documents, new comments, deletions) to every running API machine in real time. Some other data common to the API machines is likewise read from S3 by the document controller and periodically written back to S3. If the data exceeds the S3 storage size limit, it is stored in DynamoDB.

The remaining problem is how to fetch and update the user's fingerprint on every request while keeping latency low. Prismatic's approach is to use sticky sessions that bind a user to one API machine. When a user first logs in, their data is loaded into a write-back cache with an expiry time and kept in memory for interest analysis for the lifetime of the session. During that time, all of the user's requests are handled by the same API machine. Less critical parts of the fingerprint are written back in batches every few minutes, or when the session expires or is closed. More critical parts are updated synchronously, or at least through a write-through cache.
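The combination of sticky sessions and a write-back cache can be sketched as below. This is a simplified Python illustration under stated assumptions, not Prismatic's implementation: `pick_api_machine` uses a plain hash-modulo scheme, the `store` dictionary stands in for a durable key-value store, and all names are hypothetical.

```python
import zlib


def pick_api_machine(user_id, machines):
    """Pin a user to one API machine with a stable hash of the user id."""
    return machines[zlib.crc32(user_id.encode("utf-8")) % len(machines)]


class WriteBackCache:
    """Keeps fingerprints in memory and writes changes to the store in batches."""

    def __init__(self, store):
        self.store = store   # stands in for a durable key-value store
        self.cache = {}      # user_id -> in-memory fingerprint
        self.dirty = set()   # users whose fingerprint has unsaved changes

    def get(self, user_id):
        if user_id not in self.cache:
            # Load once per session; later reads are served from memory.
            self.cache[user_id] = dict(self.store.get(user_id, {}))
        return self.cache[user_id]

    def update(self, user_id, key, value):
        self.get(user_id)[key] = value
        self.dirty.add(user_id)  # defer the write to the next flush

    def flush(self):
        """Call every few minutes, or when a session expires or closes."""
        for user_id in self.dirty:
            self.store[user_id] = dict(self.cache[user_id])
        self.dirty.clear()
```

Hash-modulo assignment reshuffles most users whenever the machine list changes, so a production system would more likely use consistent hashing or an explicit session table. A write-through variant for the critical fields would simply write to `store` inside `update`.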

Other services (client-facing)

    • The public-interest feed is mainly for users who are not logged in. It gathers what it needs through the regular API and arranges the results on different pages by age.
    • User functions: account creation, login, and related management. This is usually backed by a SQL database that holds users' core data, which needs periodic snapshots and backups.
    • URL shortening service.

Batch processing and other services

    • There are also other services for machine learning training, data archiving, and event tracking and analysis.

Graph library

The graph library is a clean, declarative way to describe graph computations, and it plays to Clojure's strengths. A graph is simply a Clojure map of key-value pairs: the keys are keywords, and the values are functions that compute the value for that key from the values of other keys and from external parameters.
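To make the idea concrete, here is a minimal sketch of such a graph in Python rather than Clojure. It is an illustration of the concept only, not Prismatic's library: a graph is a dictionary whose values are functions, each function's argument names declare which other nodes or external parameters it depends on, and the hypothetical `run_graph` helper resolves those dependencies.

```python
import inspect


def run_graph(graph, **params):
    """Compute every node, resolving dependencies from argument names."""
    results = dict(params)  # external parameters are available to all nodes

    def compute(name, stack=()):
        if name in results:
            return results[name]
        if name not in graph:
            raise KeyError(f"missing node or parameter: {name}")
        if name in stack:
            raise ValueError(f"dependency cycle at: {name}")
        fn = graph[name]
        deps = inspect.signature(fn).parameters  # argument names = dependencies
        results[name] = fn(**{d: compute(d, stack + (name,)) for d in deps})
        return results[name]

    for name in graph:
        compute(name)
    return results


# A small statistics graph: each value is a function of other nodes
# (n, mean, mean_sq) or of the external parameter xs.
stats_graph = {
    "n": lambda xs: len(xs),
    "mean": lambda xs, n: sum(xs) / n,
    "mean_sq": lambda xs, n: sum(x * x for x in xs) / n,
    "variance": lambda mean, mean_sq: mean_sq - mean * mean,
}
```

Calling `run_graph(stats_graph, xs=[1, 2, 3, 6])` returns a dictionary with every node filled in. Because the graph is plain data, a node can be swapped for a stub before running it, for example `dict(stats_graph, mean=lambda: 0.0)`.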

This approach is used almost everywhere. For example, the document analysis pipeline is a graph in which each piece of a document's analysis may depend on others (to determine a document's topic, for instance, its text must be extracted first). The feed generation process is a graph made up of two steps, query and ranking. Each feed generation service is itself a graph in which each resource (such as data storage, memory, and HTTP handlers) is a node, and the nodes depend on one another. This makes the following tasks easier:

    • Describe service configurations concisely and modularly (for example, by using subgraphs).
    • Draw diagrams of the dependencies between services.
    • Measure the computation time and errors at each node, whether for a complex computation such as document analysis or for the activity on each resource of a complex service (such as the API).
    • Intelligently schedule nodes onto different threads (or, in theory, even onto different machines).
    • Test entire production services simply, by replacing some nodes in the graph with mocks.

Prismatic applies machine learning techniques to both documents and users.

Machine learning for documents

Processing HTML documents means extracting the core text (rather than the sidebar, footer, comments, and so on), the title, the author, and useful images from the page, and determining features of the document (for example, the topics of the article). These tasks follow a typical pattern. The models are trained by large batch jobs on other machines, which read data from S3 and store the learned parameter files back on S3. The content extraction pipeline then reads the models from S3 (and refreshes them periodically). All data output by the system is also fed back into the pipeline, which helps it understand users' interests better and learn from its mistakes over time.
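The "read the model, then refresh it periodically" pattern can be sketched as a small wrapper that reloads parameters once they are older than a given interval. This Python sketch is an assumption about how such a refresh could look, not Prismatic's code; `load_params` is a hypothetical callable that would, in practice, download a parameter file from S3.

```python
import time


class RefreshingModel:
    """Serves model parameters that are reloaded periodically."""

    def __init__(self, load_params, refresh_seconds=300.0, clock=time.monotonic):
        self._load_params = load_params  # hypothetical loader, e.g. reads from S3
        self._refresh_seconds = refresh_seconds
        self._clock = clock              # injectable so the class can be tested
        self._params = load_params()
        self._loaded_at = clock()

    def params(self):
        """Return the current parameters, reloading them if they are stale."""
        if self._clock() - self._loaded_at >= self._refresh_seconds:
            self._params = self._load_params()
            self._loaded_at = self._clock()
        return self._params
```

Reloading lazily on access keeps the sketch free of background threads; a long-running service might prefer a timer so that no request pays the reload cost.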

The part most interesting to software engineers is the "flop" library that Prismatic developed, which implements state-of-the-art machine learning training and inference code. It looks much like ordinary, elegant Clojure code, but it compiles (through macro magic) down to low-level loops over arrays, which brings it as close to the metal as Java gets, without needing JNI.

The resulting code is more concise and easier to read than the equivalent heavyweight Java, and runs at essentially the same speed as Java. Prismatic put a great deal of effort into building a fast-running component for grouping documents into stories.

Machine Learning for users

Prismatic infers the user's interests from social network data, then refines that inference using explicit signals from within the app (the user adding or removing articles).

Using these explicit signals is an interesting problem, because user input must be reflected in the feed very quickly. If a user removes five articles in a row from a recommended publication, Prismatic should stop showing that user other articles from the publication immediately, not the next day. This means there is no time to run another batch job over all users. The solution is online learning: as soon as the user provides an explicit signal, that user's model is updated.
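Online learning of this kind can be illustrated with a tiny per-user logistic model that takes one gradient step per explicit signal. The source does not say which algorithm Prismatic uses, so this Python sketch is an assumption chosen for simplicity; the feature names and the learning rate are hypothetical.

```python
import math
from collections import defaultdict


class OnlineInterestModel:
    """Per-user logistic model updated one interaction at a time."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.weights = defaultdict(float)  # feature name -> weight

    def score(self, features):
        """Estimated probability that the user wants an item with these features."""
        z = sum(self.weights[f] for f in features)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, liked):
        """One gradient step on the log loss for a single explicit signal."""
        error = (1.0 if liked else 0.0) - self.score(features)
        for f in features:
            self.weights[f] += self.learning_rate * error
```

After a user removes several articles described by features such as `"publisher:example-news"`, the score for further articles from that publisher drops at once, with no batch job involved, while articles with unrelated features keep their neutral score of 0.5.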

The raw stream of events generated as the user interacts with the app must be saved. That way the raw data can be replayed later to rerun machine learning on the user's interests, and data is not lost when a fragile cache causes errors while it is being uploaded. Introducing online learning allows the models to be corrected and computed more accurately.

What I learned

  1. Understand your own situation. Carefully consider the entire pipeline and all the data flowing through it. Work through the many challenges and devise a specific solution for each problem. Each service scales in its own way, and communication between services is designed to scale easily, so that no service puts too much pressure on the others. Prismatic does not use Hadoop to build its system. All data is stored as raw data in distributed databases and file systems.
  2. Discover and make full use of opportunities for high concurrency.
  3. Do work in parallel while the user waits. In the new user registration process, for example, the onboarding pipeline must run concurrently while the user is waiting, so that registration completes within a few seconds.
  4. Use a functional language to build fine-grained, freely composable abstractions. These abstraction layers can be combined to form the logic for specific problems.
  5. Avoid heavyweight, large-scale frameworks such as Hadoop. The result is a smaller codebase that is, other things being equal, easier to understand and extend. The reason for building your own libraries is simple: most open-source functionality is locked inside heavyweight frameworks, which makes the code hard to reuse and hard to debug when scaling or when problems arise.
  6. Finding the right people is very important to the system. Currently, Prismatic's backend development team consists of three computer science PhDs who are responsible for all the development work, from machine learning algorithm research to low-level web and iPhone client engineering.
  7. Put all code into production as early as possible. The early investment often goes into building tooling and debugging tools, but this makes creating and debugging production services simple and enjoyable.
  8. Keep it simple. Do not spend too much effort on complex libraries or frameworks. When a simpler solution is good enough, use it and stop deliberating. For example, use plain HTTP as the communication protocol instead of a fashionable framework. Where they do the job, be happy to pay for ready-made managed solutions such as S3 or Dynamo.
  9. Invest effort in building powerful development tools and libraries. For example, Prismatic's "flop" library lets them write numerical machine learning algorithms that run as fast as Java with only one tenth of the code; "store" abstracts away many unimportant details of key-value storage, so that caching, batching, and streaming can be handled uniformly across environments; "graph" makes it easy to write, test, and manage distributed stream-processing services.
  10. Do not expect a single general-purpose I/O and storage solution to fit every data type.

Finally, I will share my own impressions:

When I opened the Prismatic homepage, its simple, elegant style had a strong technical feel even before I started using it (at that point the system recommends topics of general interest). When I clicked "register", the interface offered to let me log in with my Facebook account, and I entered my user name and password (by then it had already analyzed my data on Facebook, Twitter, and Google Reader). What followed were articles about cloud computing and big data, along with some mobile news (the claim of millisecond-level real-time response is no exaggeration).

This reminded me of applications in China. The first is Youdao Reading. Anyone who has used it knows that it only offers what Prismatic gives unregistered users: it recommends many popular interests and asks users to tick the ones they want. The second is NetEase Mail. When the keywords you subscribe to are more precise, most of the articles it recommends are closer to the user's taste. It is a pity that Prismatic currently supports only social tools from outside China, such as Facebook, Twitter, and Google Reader, and that it captures content only from well-known foreign publications. I very much look forward to comparable applications emerging in China that integrate Weibo, QQ, Renren, and other domestic social networks, so that reading recommendations differ for each person and every article suits the user's taste.
