Prismatic: machine learning that works out a user's interests in just 10 seconds


A few things about Prismatic first. The founding team is small, just four computer scientists, three of them young PhDs from Stanford and Berkeley. They are applying serious brainpower to the problem of information overload, but these PhDs also work as programmers: they build the web site, the iOS app, the big-data backend, and the machine-learning services. The highlight of the Prismatic architecture is that it applies machine learning to social-media streams in real time. They have not disclosed their machine-learning techniques, citing trade secrets, but we can glimpse them through the architecture. Prismatic co-founder Bradford Cross describes the system succinctly as "a comprehensive system that provides large-scale, real-time, dynamic, personalized ranking, classification, and grouping of information." The rest of this article walks through that architecture.

Prismatic's main function is to discover our interests and recommend things for us to read

What should you read today? Every modern reader faces this question daily, and most of us fall back on ad-hoc ways of finding something to read: Twitter, RSS, Facebook, Pinterest, Google+, email, Techmeme. Prismatic set out to answer "What should I read today?", and Prismatic's Jason Wolfe generously agreed to describe their solution in detail, complete with fashionable terms such as machine learning, social graphs, big data, functional programming, and real-time processing. The approach is, as you would expect, somewhat involved, but unlike other similar applications, Prismatic sets out to discover your interests, no matter how deeply they are buried in your data.

As you might expect, their approach is somewhat special. They chose Clojure as their development language, a modern Lisp that compiles to Java bytecode. The main goal is to exploit a functional language to build fine-grained, freely structured abstractions that fit the particular shape of each problem. Two examples of how they take advantage of this: first, they use graphs almost everywhere; each user's feed is computed by a low-latency, pipelined set of map-reduce-style jobs. Second, they use subgraphs to implement modular service configuration.

They put the focus of system design on small functional components, avoiding monolithic frameworks like Hadoop in favor of smaller pieces, which makes the system more reliable, easier to fix, easier to extend, and easier to understand. The hard part of Prismatic's machine learning is that it would normally demand a very long training cycle. Two points are worth noting: first, acquiring high-quality source content does not take much time; second, a machine-learning recommendation system would ideally be trained on everything from a user's earliest history to their present activity, which is not available up front. This way of thinking about scale treats the system as both a collector and a producer of information.

Figure: more and more applications are focusing on recommending reading to their users

Statistical data

Tens of millions of articles, blog posts, and other pieces of content are published every day through Twitter, Facebook, Google Reader, and the web at large. In just the few seconds around a user's login, Prismatic can analyze the user's interests from their past reading, recommend articles, and use the user's activity on social networks to surface articles shared by their contacts. Prismatic also records the user's actions inside the product itself and feeds them back in as another source for analysis. Within about 10 seconds of reading a user's personal feed, Prismatic can compare it against its interest models and work out what the user cares about most. Millions of high-quality articles are now successfully shared with users through Prismatic every week.

Platform

The Prismatic service is hosted on AWS EC2 instances running Linux. 99.9% of Prismatic's back-end pipeline and API services are written in Clojure, and all of the heavyweight components run inside the JVM. Prismatic uses MongoDB to store server parameters and user profiling data, along with MySQL, Amazon S3, and DynamoDB. The team focuses on writing code that addresses the needs of each specific problem.

Data storage and I/O

Unlike a typical architecture, Prismatic's services are designed around the data, providing real-time read and write access to it. Most data flows directly between the back-end pipeline stages and never touches disk. Prismatic keeps data as close to the CPU as possible, so API requests see almost no I/O latency; the API is CPU-bound, which makes the system much easier to scale. Prismatic uses a number of off-the-shelf storage solutions: MongoDB, MySQL, and Amazon's S3 and DynamoDB. Each was chosen after careful study of the actual requirements, including scale, access patterns, and other characteristics of the data to be stored.

Services

Externally, the Prismatic system is divided into about 10 separate service modules, which fall roughly into 5 groups: data acquisition, new-user onboarding, API, other client-facing services, and batch processing. Each service is designed around a single function, scales horizontally in its own way, and is typically constrained by one or two specific resources (I/O, CPU, RAM).

Data acquisition (back end)

A huge amount of content (articles, blog posts, web pages) is produced every day, and Prismatic wants to capture as much of it as possible. To let users comment on each piece of content and share it with like-minded friends, Prismatic must also identify, for each piece of content, its author, who shared it, and the most valuable comments on it. The first step is therefore to collect and analyze both the content and the relationship data around it.

The ingest pipeline starts with the following steps: poll RSS feeds for new articles, and collect the latest shares, comments, and updates from users' Twitter and Facebook accounts through those services' APIs. The services that do this are simple and stateless; the only state, and the only really interesting part, is deciding what has already been collected and intelligently deciding what to collect next. The hardest part is the former, because the same item may be posted in several different places.
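To make the deduplication problem concrete, here is a minimal sketch (my own illustration, not Prismatic's code) of how a stateless collector might decide which incoming items are genuinely new, by normalizing URLs before checking them against the set of keys it has already fetched. The namespace and function names are assumptions.

```clojure
(ns ingest.dedup
  (:require [clojure.string :as str])
  (:import [java.net URI]))

;; Normalize a URL so that the same article seen from several places
;; (tracking parameters, trailing slashes, mixed-case hosts) maps to one key.
;; Real canonicalization is much more involved; this keeps only host + path.
(defn canonical-key [url]
  (let [uri  (URI. (str/trim url))
        host (some-> (.getHost uri) str/lower-case)
        path (or (.getPath uri) "")]
    (str host (str/replace path #"/+$" ""))))

;; seen is the set of canonical keys already collected; keep only the
;; genuinely new items, in their original order.
(defn new-items [seen items]
  (remove #(contains? seen (canonical-key (:url %))) items))

(new-items #{"example.com/story"}
           [{:url "http://EXAMPLE.com/story/?utm_source=tw"}
            {:url "http://example.com/other"}])
;; => ({:url "http://example.com/other"})
```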

Then the data collected from RSS, Twitter, and Facebook is fed into a filter that decides what enters the analysis pipeline. Spam and other junk are discarded first; the remaining URLs are sent into the pipeline, where most of the interesting work happens. Prismatic maintains a queue of URLs, and each URL is run through a graph computation that describes it in detail: fetch its HTML, use machine-learning algorithms to extract the article text, identify the best image, and pull out the author, tags, topics, and so on. A lot of time went into making these steps fast enough to run in memory with good performance, and the URLs are of course processed with a high degree of concurrency.
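As a rough sketch of the shape of that pipeline (again my own illustration; Prismatic's real steps are substantial, machine-learned graph computations), each URL flows through a chain of enrichment functions and the queue is processed concurrently:

```clojure
(ns ingest.analyze)

;; Placeholder steps; in the real system each of these hides a substantial,
;; partly machine-learned component (text extraction, best image, author,
;; tags, topics, ...). The names here are illustrative only.
(defn fetch-html    [url] {:url url :html (slurp url)})
(defn extract-text  [doc] (assoc doc :text  "..."))
(defn extract-image [doc] (assoc doc :image nil))
(defn classify      [doc] (assoc doc :topics #{}))

;; One URL flows through the whole chain of enrichment steps.
(defn analyze-url [url]
  (-> url fetch-html extract-text extract-image classify))

;; Process a batch of queued URLs concurrently so a slow fetch never blocks
;; the whole queue. pmap is the simplest built-in way to express this; a
;; production pipeline would use an explicit, bounded thread pool.
(defn analyze-batch [urls]
  (doall (pmap analyze-url urls)))
```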

Finally comes the document controller, which receives the detailed articles and the social data processed above, matches them up, organizes articles into stories, decides which documents are currently in play, and manages the document indexes used by the API machines.

New user management (back end)

The new-user management module is responsible for gathering information about newly registered users. It has two main parts: going through the user's Twitter, Facebook, and Google Reader accounts to find the topics and article sources they like, and analyzing their social graph to find the contacts whose shared articles they are most likely to enjoy.

The services that do this are also highly parallel. The social-graph analysis is complex, and the key question is how these services achieve such low latency with reliable throughput: for a user who is active on Twitter, Facebook, and Google Reader, Prismatic needs only about 15 seconds, often less, to pin down hundreds of topics the user likes. Once the user grants access, Prismatic can analyze their interests so quickly that the work is usually done by the time the user has typed a password and clicked confirm. Here is what happens in those 15 seconds: Prismatic fetches the user's recent tweets and Facebook posts, plus whatever they have marked as liked in Google Reader, which typically takes about 10 seconds; it then collects roughly 1,000 URLs shared by the user and their contacts and feeds them into the document controller described above, using the same machine-learning stack as the collection and analysis pipeline. The results are aggregated, post-processed, and finally stored in DynamoDB and S3.

This process has strict latency requirements, so the steps cannot run serially; they must run concurrently, using as much pipelining and parallelism as possible. It also needs high throughput, because none of the steps is cheap, and when many users register at the same time there is even more work to do without adding latency. This is where Prismatic is unusual: its pipelined, aggregated design boils the work down to lightweight, low-latency map-reduce style jobs per user, and with many users the concurrency in the system drives the machines close to their full capacity.
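A minimal Clojure sketch of that fan-out/fan-in pattern, assuming hypothetical per-network fetch functions, shows how the slow API calls can run while the user is still filling in the signup form and how a deadline keeps onboarding within a few seconds:

```clojure
(ns onboarding.core
  (:require [clojure.set :as set]))

;; Hypothetical profile fetchers; each stands in for a real network call to
;; Twitter, Facebook, or Google Reader that can take several seconds.
(defn twitter-interests  [user] #{:clojure :systems})
(defn facebook-interests [user] #{:travel})
(defn reader-interests   [user] #{:machine-learning})

;; Fan out the slow calls while the user is still typing a password,
;; then fan in with a deadline so onboarding finishes within seconds.
(defn onboard [user]
  (let [calls      [(future (twitter-interests user))
                    (future (facebook-interests user))
                    (future (reader-interests user))]
        timeout-ms 10000]
    {:user      user
     :interests (->> calls
                     (map #(deref % timeout-ms #{}))  ; drop sources that time out
                     (apply set/union))}))

(onboard "alice")
;; => {:user "alice", :interests #{:clojure :systems :travel :machine-learning}}
```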

API (client-facing)

Data acquisition and new-user management feed the API machines. The main design goals and challenges of the API are: recent articles must be indexed and served with low latency; the index is not small (often many gigabytes) and must be updated in real time so that users see the most recently published articles; the load must be balanced across machines, with the ability to quickly and easily bring new machines into rotation (and, of course, take them out again); and serving what a user actually wants requires more than the index alone, it also needs the user's "fingerprint" (their interests, social relationships, recently read articles, and so on), and this fingerprint data is large and constantly changing.

The first few problems are handled by the document controller mentioned above. The document controller organizes the current set of documents, preprocesses them, and stores a freshly built index set to S3 every few minutes. When a new API machine starts, it first reads these files, loads them into memory, and thereby gets an up-to-date copy of the index. The document controller is also responsible for streaming all index changes (new documents, new comments, deletions, and so on) to all running API machines in real time. Other data commonly needed by the API machines is likewise read from S3 into memory and periodically written back to S3; data that outgrows this pattern is stored in DynamoDB.

The remaining question is how to fetch and update each user's "fingerprint" on every request within the latency budget. Prismatic's approach is to use sticky sessions to bind each user to a particular API machine. When a user logs in, their data is loaded into a write-back cache with an expiration time, and it stays in memory for interest analysis for the lifetime of the session. All of the user's actions during that session are handled by the same API machine. The less critical parts of the fingerprint are written back in batches every few minutes, or when the session expires or is closed; the more critical parts are written through directly, or at least via a write-back cache with a very short delay.
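The following is a minimal sketch of that idea, assuming a hypothetical backing store behind load-fingerprint and write-fingerprint: fingerprints for users bound to this machine live in an in-memory map, and a periodic sweep writes back and drops sessions that have gone idle.

```clojure
(ns api.fingerprint)

;; In-memory store of fingerprints for users currently "stuck" to this API
;; machine: user-id -> {:fingerprint ... :touched-at ...}.
(defonce sessions (atom {}))

(def session-ttl-ms (* 30 60 1000))

;; Stand-ins for the real backing store and the batched write-back.
(defn load-fingerprint  [user-id]    {:interests {} :recent-reads []})
(defn write-fingerprint [user-id fp] nil)

(defn fingerprint
  "Return the user's fingerprint, loading and caching it on first access."
  [user-id]
  (let [now (System/currentTimeMillis)]
    (-> (swap! sessions update user-id
               (fn [entry]
                 (-> (or entry {:fingerprint (load-fingerprint user-id)})
                     (assoc :touched-at now))))
        (get-in [user-id :fingerprint]))))

(defn expire-sessions!
  "Write back and drop sessions idle longer than the TTL; run periodically."
  []
  (let [now     (System/currentTimeMillis)
        expired (for [[uid {:keys [touched-at]}] @sessions
                      :when (> (- now touched-at) session-ttl-ms)]
                  uid)]
    (doseq [uid expired]
      (write-fingerprint uid (get-in @sessions [uid :fingerprint]))
      (swap! sessions dissoc uid))))
```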

Other services (client-facing)

Public feeds: mainly for users who are not logged in, served through the regular API, cached, and broken up by age onto different pages. User functions: account creation, login, and so on, typically backed by a SQL database holding the users' primary data, which needs regular snapshots and backups. URL shortening. Batch and other services: machine-learning training, data archiving, and event tracking and analysis.

Graph

Graph is a declarative way of describing graph-structured computation, and a good showcase of Clojure's strengths. A graph is simply a Clojure map of keyword keys to keyword functions; each function computes its key's value from the values of other keys in the map, named by its arguments, which also gives the effect of externally supplied parameters.

This approach is applied almost everywhere. For example, the document-analysis pipeline is a graph in which each step depends on parts of the document's core content (to determine the topic of a document, you first need to extract its text); the feed-generation process is a graph of query and ranking steps; and each feed-generation service is itself a graph in which every resource (data store, memory, HTTP handler, and so on) is a node that depends on other nodes. This facilitates the following:

• Modularity: service configurations can be described concisely (for example, by using subgraphs).
• Visibility: the interdependencies between computations and services can be drawn as a graph, and the time spent and errors raised at each node can be measured, whether in a complex operation such as document analysis or for each resource of a complex service such as the API.
• Scheduling: the nodes of a computation can be intelligently assigned to different threads (and, in principle, to different machines).
• Testing: by swapping mock implementations in for some of the nodes, every production service can be tested simply.
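Prismatic later open-sourced Graph as part of its plumbing library; the sketch below follows the example published in that library's documentation. It shows how a graph is just a map of keywords to keyword functions (fnks) whose argument names declare their dependencies, and how the external parameter :xs is supplied at call time.

```clojure
(ns example.graph-demo
  (:require [plumbing.core :refer [fnk]]
            [plumbing.graph :as graph]))

;; A graph is just a Clojure map: keyword -> keyword function (fnk).
;; Each fnk declares, by its argument names, which other keys it depends on.
(def stats-graph
  {:n  (fnk [xs]   (count xs))                          ; number of samples
   :m  (fnk [xs n] (/ (reduce + xs) n))                 ; mean
   :m2 (fnk [xs n] (/ (reduce + (map #(* % %) xs)) n))  ; mean of squares
   :v  (fnk [m m2] (- m2 (* m m)))})                    ; variance

;; Compile the declarative map into an executable function; dependency
;; order is worked out from the fnk argument names.
(def stats (graph/compile stats-graph))

(into {} (stats {:xs [1 2 3 6]}))
;; => {:n 4, :m 3, :m2 25/2, :v 7/2}
```

Because the graph is plain data, a subgraph can be reused or overridden (for example, by assoc-ing a mock node in place of a real one) before compiling, which is exactly the modularity and testability described above.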

Prismatic applies machine learning in two areas: documents and users.

Machine Learning for documents

For HTML documents the tasks are: extract the core text (not the sidebars, footers, or comments), the title, the author, and the useful images; and identify features of the document, such as what the article is about and its topics. These are fairly standard machine-learning tasks. Models are trained in large batches on separate machines, reading data from S3 and writing parameter files back to S3; the content-extraction pipeline then reads the models from S3 and refreshes them periodically. All the data flowing out of the system is also fed back into this pipeline, which helps the system learn users' interests and learn from its mistakes over time.
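As a sketch of the "train in batch, refresh at runtime" pattern (my own illustration; fetch-model is a hypothetical stand-in for reading the parameter file from S3), the serving side can keep the current model in an atom that a background task refreshes periodically:

```clojure
(ns docml.model
  (:import [java.util.concurrent Executors TimeUnit]))

;; Hypothetical loader standing in for "read the parameter file from S3".
(defn fetch-model []
  {:version (System/currentTimeMillis) :weights {}})

;; The extraction pipeline reads the current model from this atom for every
;; document, while a background task swaps in fresh parameters periodically.
(defonce current-model (atom (fetch-model)))

(defonce refresher
  (doto (Executors/newSingleThreadScheduledExecutor)
    (.scheduleAtFixedRate
      (fn [] (reset! current-model (fetch-model)))
      15 15 TimeUnit/MINUTES)))

(defn classify-document [doc]
  (let [model @current-model]
    ;; ... apply the model's weights to the document's features ...
    (assoc doc :model-version (:version model))))
```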

The piece of Prismatic's homegrown tooling most interesting to software engineers is probably the "flop" library, which implements state-of-the-art machine-learning training and inference code. It reads much like ordinary, tidy Clojure code, but compiles (via macros) into low-level loops over primitive arrays, getting about as close to the metal as Java allows without resorting to JNI.

The resulting code is simpler and easier to read than the equivalent heavyweight Java while running at essentially the same speed. Prismatic spends much of this effort on building a fast story-clustering component.
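The flop API itself is not shown in the article, but the underlying idea can be illustrated with plain Clojure: with primitive type hints and areduce, a numeric loop over double arrays compiles down to a tight primitive loop and runs at close to Java speed, with no JNI involved.

```clojure
;; Plain-Clojure illustration (not the actual flop API): a type-hinted
;; dot product over double[] arrays that avoids boxing entirely.
(defn dot ^double [^doubles xs ^doubles ys]
  (areduce xs i acc 0.0
           (+ acc (* (aget xs i) (aget ys i)))))

(dot (double-array [1.0 2.0 3.0]) (double-array [4.0 5.0 6.0]))
;; => 32.0
```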

Machine Learning for users

Prismatic uses social-network data to infer what a user is interested in, and uses explicit signals inside the application (saving or deleting articles) to refine those inferences.

Using these explicit signals is interesting because the user's input has to be reflected in their feed very quickly. If a user deletes 5 articles from a recommended publication, stop showing them more articles from that publication immediately, not the next day. This means there is no time to run another batch job over all users; the solution is online learning: the user's model is updated the moment they submit one of these explicit signals.
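Here is a deliberately simplified sketch of such an online update (not Prismatic's actual model): each user's interests are a map of topic weights, and each explicit action nudges the weights of the affected topics immediately, so the very next feed request already reflects it.

```clojure
;; Minimal online-update sketch; topics, weights, and learning rate are
;; illustrative assumptions, not Prismatic's real features.
(def learning-rate 0.1)

(defn apply-signal
  "Update a user's interest weights from one explicit action.
   signal is +1.0 for a save/like, -1.0 for a delete."
  [weights article-topics signal]
  (reduce (fn [w topic]
            (update w topic (fnil + 0.0) (* learning-rate signal)))
          weights
          article-topics))

;; Deleting 5 articles tagged :celebrity-gossip quickly drives that topic down:
(reduce (fn [w _] (apply-signal w [:celebrity-gossip] -1.0))
        {:clojure 0.8 :celebrity-gossip 0.2}
        (range 5))
;; => {:clojure 0.8, :celebrity-gossip -0.3}  (up to floating-point rounding)
```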

The raw stream of events generated as users interact with the application is kept. This makes it possible to rerun the raw event stream later when re-learning a user's interests, and it also protects against losing data to a flaky cache during uploads. Being able to replay the events lets the online-learning model be corrected and recomputed more accurately.

Lessons learned

• Find your niche. Think carefully about the whole pipeline and all the data that flows through it, put real work into each challenge, and build a specific solution for each problem. Each service scales in its own way, and services communicate through easily extended interfaces that put little pressure on one another. Prismatic does not use Hadoop; all data lives as raw data in distributed databases and file systems.
• Discover and exploit opportunities for heavy parallelism. Run work in parallel while the user is waiting: the new-user onboarding pipeline, for example, runs while the user is still registering, so registration can finish in seconds.
• Use a functional language to build fine-grained, freely composable abstractions, and compose those abstractions into the logic layer for each specific problem.
• Avoid heavyweight, large-scale frameworks such as Hadoop. The result is a smaller code base that, all else being equal, is less error-prone, easier to understand, and easier to extend. The rationale for building their own libraries is simple: most open-source functionality is locked into heavyweight frameworks, which makes the code hard to reuse, hard to debug, and hard to scale.
• Finding the right people matters. The Prismatic back-end team currently consists of 3 computer-science PhDs who do all the development work, from machine-learning algorithm research to low-level systems engineering for the web and iPhone clients.
• Put all code into production early. Although this means investing early in build and debugging tools, it makes creating and debugging production services easy and even fun.
• Keep it simple. Don't spend too much time on complicated libraries or frameworks; use simple things, and don't dream of a grander solution when a simple one is good enough. For example, use plain HTTP as the communication protocol rather than reaching for a trendy framework, and happily buy off-the-shelf managed solutions, such as S3 or DynamoDB, when they do the job.
• Invest in powerful development tools and libraries. For example, Prismatic's "flop" library lets them write numerical machine-learning code that runs as fast as Java with roughly one-tenth of the code; "store" abstracts away the unimportant details of key-value storage and provides uniform caching, batch, and streaming access across a variety of backends; and "graph" makes writing, testing, and managing distributed stream-processing services easy and practical.
• Be careful about each data type, and don't expect a one-size-fits-all solution for I/O and storage.

Finally, some impressions from using it:

When I first landed on the Prismatic home page, its simple, clean style already felt technically polished before I had even signed in (for visitors, the system recommends popular public interests). When I clicked register, the page offered sign-in with a Facebook account; by the time I had typed a username and password, it had already analyzed the data in my Facebook, Twitter, and Google Reader accounts, and my feed filled with cloud-computing and big-data stories, plus, of course, some mobile news (and the 250 ms responsiveness is no exaggeration).

This reminds me of some domestic applications. The first is a reading app (those who have used it will know the one I mean) that really only offers what Prismatic shows to unregistered users: a pile of popular interests for the user to pick through. The second is NetEase Mail's subscription feature: when you subscribe to a sufficiently precise keyword, the articles it recommends mostly match your taste. Unfortunately, Prismatic currently only supports foreign social tools such as Facebook, Twitter, and Google Reader, and its content is limited to well-known foreign publications. So I am looking forward to a comparable domestic application that integrates Weibo, QQ, Renren, and other domestic social tools and makes reading recommendations that are different for every person, with every article matching the user's taste.
