#R&D Solution Introduction #recsys-evaluate (recommendation evaluation)

Source: Internet
Author: User
Tags: piwik

Written by Zheng, based on Liu's document; last updated 2014/12/1.
Keywords: RecSys, recommendation evaluation, evaluation of recommender systems, Piwik, Flume, Kafka, Storm, Redis, MySQL
Intended readers: people who research and develop recommendation systems and whose work goes beyond the recommendation algorithms themselves.

Let's be clear: we work in industry. Many of the novel algorithms that perform well in academic papers are simply not feasible in industry. When we built semantic aggregation, for the word segmentation, clustering, similarity calculation, entity recognition, sentiment analysis, and so on, we ultimately adopted mature algorithms that industry has relied on for more than ten years, some for decades. If the algorithm does not decide the outcome, what does? Algorithm + rule base + manual intervention (collating corpora, labeling, tuning parameters, and so on), which is mostly dirty, tiring work. Or call it: features + algorithm + manual intervention, where features are used to narrow the data range or reduce its dimensionality. I wrote in 2009:
In the world of semantics, you could roughly say that everything is feature extraction: find the right features and the rest falls into place. ...... Do you expect to accomplish it all in one stroke? In real natural language processing applications it is hard to find a single all-purpose feature for a scenario. Features are layered one on top of another, and each layer strips out part of the junk data; after enough rounds you finally get a result. Pay attention to methodology.
Liang Bin said on Weibo in 2012:
Statistics are coarse and rough, a sledgehammer; rules are fine and precise, small hammers. Swing the sledgehammer first, then do the fine carving.
Where does the rule base come from? You have to build peripheral systems to observe the features, establish the rules, tune the parameters, and watch the effects. In the same way, if you run a recommendation service, you need to evaluate the recommendation effect.

Recommended Application Scenarios

The e-commerce recommendation scenario has very specific indicators:
    1. recommendation impressions and the number of recommended items delivered;
    2. click-through rate of the recommendation slots and click-through rate of the recommended products;
    3. most importantly, the order conversion rate and the payment conversion rate, the two toughest metrics.
The recommendation evaluation system should therefore provide the following functions:
  1. Count the various display indicators in real time (or at least near real time)
      • Differentiate between website-side and mobile-client recommendation impressions
      • Further differentiate between clients, such as iOS and Android
  2. Data Overview
  3. Summarize the various indicators by recommendation slot type or by recommendation algorithm
      • People who viewed this also viewed
      • Users who viewed this product also purchased
      • You may be interested in the following items (guess what you like)
      • Products near this product (note: local life service products only)
      • Recommendations in the registration pop-up window
      • Products near this store
      • Food near this store
      • Nearby food, drink, and entertainment
      • ......
  4. Support the two common experimental methods for evaluating recommendation effect
      1. Offline test:
        • Practice: obtain user behavior data from the log system, split it into training data and test data, for example 80% training data and 20% test data (cross-validation also works), then train the user interest model on the training set and test it on the test set
        • Pros: it does not require actual user participation
        • Cons: an offline experiment can only measure a narrow slice of the data, mainly the accuracy of the algorithm's predictions or ratings
        • Objective: filter out poorly performing algorithms in advance
      2. A/B test:
        • Practice: split users randomly into groups according to certain rules and serve each group with a different recommendation algorithm, so that the actual online performance indicators of the different algorithms can be compared fairly (one common way to bucket users is sketched after this list)
  5. A test page for the recommendation service interface
      • Expose the interface so that requests can be submitted by hand and the returned results inspected
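As mentioned in the A/B test item above, one common way to split users randomly but consistently into groups is to hash a stable user identifier into a fixed number of buckets. The sketch below assumes that approach; the class name and bucketing scheme are illustrative and not the original team's method.

    // Hypothetical bucketing helper: the same userId always lands in the same bucket,
    // so each user consistently sees the recommendation algorithm assigned to that bucket.
    public class AbTestBucketer {
        private final int bucketCount;

        public AbTestBucketer(int bucketCount) {
            this.bucketCount = bucketCount;
        }

        public int bucketOf(String userId) {
            // mask the sign bit instead of Math.abs (which overflows for Integer.MIN_VALUE)
            return (userId.hashCode() & 0x7fffffff) % bucketCount;
        }

        public static void main(String[] args) {
            AbTestBucketer bucketer = new AbTestBucketer(2);
            System.out.println("user-42 -> group " + bucketer.bucketOf("user-42"));
        }
    }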
Recommendation Evaluation: Technology Selection

When it comes to real-time log collection and processing, the de facto combination is Flume + Kafka + Storm, so the technology selection is: Piwik + Flume + Kafka + Storm + Redis + MySQL.

Recommendation Evaluation: Data Flow
  1. Data reporting: Piwik
      1. The main site already runs the open-source traffic statistics system Piwik, so all we need to do is add tracking tags to the various recommendation slots on the website according to agreed rules (a sketch of parsing these tags appears after this list)
        • Example: the a element of the first item in the "Users who viewed this product also purchased" recommendation slot gets a wwe attribute: wwe="t:goods,w:rec,id:ae45c145d1045c9d51c270c066018685,rec:101_01_103"
      2. When the page has fully loaded in the browser, the Piwik JavaScript tracker sends the tracking data to the server side
      3. When the Piwik server receives it, it writes the data to a log file on disk
  2. Data collection: Flume
      1. A Flume agent is deployed on each server in the Piwik cluster
      2. The agents pick out the recommendation-related data and push the logs to the Flume cluster; they can be configured, for example, to push every time a new log line is appended, or to push once every 5 minutes (a configuration sketch appears after this list)
      3. The tracking logs from the mobile clients are stored in MySQL on the wireless server side, so a script reads that data every minute and drops it into the directory that Flume monitors
  3. Data buffering: Kafka
      1. Because the data collection rate and the data processing rate do not necessarily match, the message middleware LinkedIn Kafka is added as a buffer
      2. The Flume data flow is Flume source --> Flume channel --> Flume sink, so we write a custom Kafka Sink as the message producer: it sends the log data it receives from the channel on to the message consumers
  4. Stream computation: Storm
      1. Storm is responsible for real-time computation over the collected data
      2. A Storm spout continuously reads data from an external system, assembles it into tuples, and emits them; once emitted, the tuples travel through the topology
      3. So we write a Kafka Spout as the message consumer to pull the log data
      4. We then write several Storm bolts to process the data
        • (figure: the structure of the topology)
  5. Data output: Redis
      1. After the Storm bolts have analyzed the data in real time, they write the statistics to Redis (a Jedis-based sketch of such a write appears after the bolt code below)
  6. Data reporting: MySQL
      1. The evaluation system reads the real-time data directly from Redis; the order-conversion data tracked by querying the master database is synchronized to MySQL, which serves as the data source for report presentation
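The wwe attribute described in step 1 is a simple comma- and colon-delimited string. Below is a minimal sketch of how such a value might be parsed on the processing side; the class and method names are hypothetical and not part of the original code.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical helper: parse a tag such as
    // "t:goods,w:rec,id:ae45c145d1045c9d51c270c066018685,rec:101_01_103"
    // into a key/value map, e.g. {t=goods, w=rec, id=..., rec=101_01_103}.
    public class WweAttribute {
        public static Map<String, String> parse(String wwe) {
            Map<String, String> fields = new HashMap<String, String>();
            if (wwe == null || wwe.isEmpty()) {
                return fields;
            }
            for (String part : wwe.split(",")) {
                int idx = part.indexOf(':');
                if (idx > 0) {
                    fields.put(part.substring(0, idx), part.substring(idx + 1));
                }
            }
            return fields;
        }
    }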
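For step 2, here is a minimal sketch of what one Flume agent's configuration might look like. The agent name, file paths, and the fully qualified class name of the custom sink are assumptions for illustration; only the general source --> channel --> sink layout and the "push every new log line" option come from the text above.

    # names and paths below are illustrative assumptions
    piwik-agent.sources  = rec-logs
    piwik-agent.channels = mem-channel
    piwik-agent.sinks    = kafka-sink

    # tail the Piwik recommendation log so every new line is pushed
    # (a spooldir source could instead watch the directory the mobile-log script writes to)
    piwik-agent.sources.rec-logs.type     = exec
    piwik-agent.sources.rec-logs.command  = tail -F /data/piwik/logs/rec.log
    piwik-agent.sources.rec-logs.channels = mem-channel

    piwik-agent.channels.mem-channel.type     = memory
    piwik-agent.channels.mem-channel.capacity = 10000

    # the custom KafkaSink shown later in this article
    piwik-agent.sinks.kafka-sink.type    = com.example.flume.KafkaSink
    piwik-agent.sinks.kafka-sink.channel = mem-channel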
In short, the data flows along the following path:
    1. Piwik JavaScript
    2. Piwik Servers
    3. Flume Agent
    4. Custom Kafka Sink
    5. Custom Kafka Spout
    6. Custom Storm Bolts
    7. Redis
    8. Evaluation System Calculation
    9. MySQL
    10. Evaluation System Report display
Flume + Kafka + Storm: Frequently Asked Questions

Although our real-time traffic statistics and recommendation evaluation systems both adopted the Flume + Kafka + Storm scheme, note that this scheme has some pitfalls. Some third-party conclusions are summarized below:
    • If Flume is configured to collect every new log line, data can flow from Flume into Kafka faster than the Storm spout can consume the Kafka messages, and writing to HBase after the tuples are emitted into the stream adds further delay (note: HBase performance is genuinely worrying and not well suited to this kind of real-time processing, especially once more indexes are added);
      • A reference figure: a single Storm pipeline can process roughly 20,000 tuples/s (with each tuple around 1,000 bytes);
      • Too many tuples will trigger GC exceptions, because each Kafka message needs a new String() to decode it;
      • A large number of tuples piling up in the stream causes timeouts, which automatically call back fail();
      • The tuple structure can be optimized by packing multiple log lines into a single tuple, so that one emit carries a batch of logs rather than a single line (see the sketch after this list)
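A minimal sketch of that batching idea, assuming the spout accumulates raw log lines and emits one tuple per batch; the class name and the emit convention are illustrative, not the original implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: collect up to batchSize log lines so the spout can
    // emit one tuple per batch instead of one tuple per line.
    public class LogBatcher {
        private final int batchSize;
        private final List<String> buffer = new ArrayList<String>();

        public LogBatcher(int batchSize) {
            this.batchSize = batchSize;
        }

        /** Add one log line; returns a full batch when ready, otherwise null. */
        public List<String> add(String logLine) {
            buffer.add(logLine);
            if (buffer.size() < batchSize) {
                return null;
            }
            List<String> batch = new ArrayList<String>(buffer);
            buffer.clear();
            return batch;
        }
    }

When add() returns a non-null batch, the spout would emit it as a single tuple, for example collector.emit(new Values(batch)), and the downstream bolt iterates over the lines in the batch.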
Kafka Sink message producer code fragment
KafkaSink.java
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    ...

    public class KafkaSink extends AbstractSink implements Configurable {
        ...
        private Producer<String, byte[]> producer;
        ...
        @Override
        public Status process() throws EventDeliveryException {
            Channel channel = getChannel();
            Transaction tx = channel.getTransaction();
            try {
                tx.begin();
                Event event = channel.take();
                if (event == null) {
                    tx.rollback();
                    return Status.BACKOFF;
                }
                // forward the Flume event body to Kafka as the message payload
                producer.send(new KeyedMessage<String, byte[]>(topic, event.getBody()));
                tx.commit();
                return Status.READY;
            } catch (Exception e) {
                ...
Kafka Spout (message consumer) code snippet

There are many kinds of spouts; let's pick the Kafka spout and look at it.
KafkaSpout.java
    public abstract class KafkaSpout implements IRichSpout {
        ...
        @Override
        public void activate() {
            ...
            for (final KafkaStream<byte[], byte[]> stream : streamList) {
                executor.submit(new Runnable() {
                    @Override
                    public void run() {
                        ConsumerIterator<byte[], byte[]> iterator = stream.iterator();
                        while (iterator.hasNext()) {
                            if (spoutPending.get() <= 0) {
                                // back off while the topology does not want more tuples
                                // (the sleep interval literal is garbled in the source)
                                sleep(sleepIntervalMillis);
                                continue;
                            }
                            MessageAndMetadata<byte[], byte[]> next = iterator.next();
                            byte[] message = next.message();
                            List<Object> tuple = null;
                            try {
                                tuple = generateTuple(message);
                            } catch (Exception e) {
                                e.printStackTrace();
                            }
                            if (tuple == null || tuple.size() != outputFieldsLength) {
                                continue;
                            }
                            collector.emit(tuple);
                            spoutPending.decrementAndGet();
                        }
                    }
                });
            }
        }
        ...
Storm Bolt code snippet

There are multiple custom bolts; let's pick one and look at it.
EvaluateBolt.java
    public class EvaluateBolt extends BaseBasicBolt {
        ...
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            ...
            if (LogWebsiteSpout.PAGE_EVENT_BROWSE.equals(event)) {
                if (LogWebsiteSpout.PAGE_TYPE_GOODS.equals(pageType)) {
                    // total product-page views
                    incrBaseStatistics(baseKeyMap, BROWSE_ALL, 1);
                } else if (LogWebsiteSpout.PAGE_TYPE_PAY1.equals(pageType)) {
                    // total order (payment step 1) page views
                    incrBaseStatistics(baseKeyMap, ORDER_ALL, 1);
                }
                String recDisplay = input.getStringByField(LogWebsiteSpout.FIELD_REC_DISPLAY);
                recDisplayStatistics(recDisplay, time, pageType, baseKeyMap);
            } else if (LogWebsiteSpout.PAGE_EVENT_CLICK.equals(event)) {
                String recType = input.getStringByField(LogWebsiteSpout.FIELD_REC_TYPE);
                ...
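The bolt above delegates to incrBaseStatistics to accumulate counters, and step 5 of the data flow says the statistics are written to Redis. Below is a minimal sketch of what such a counter write could look like with the Jedis client; the key layout, field names, and connection handling are assumptions, not the original implementation.

    import redis.clients.jedis.Jedis;

    // Hypothetical counter writer: one Redis hash per day, one field per metric,
    // e.g. key "rec:stat:20141201", field "browse_all".
    public class StatisticsWriter {
        private final Jedis jedis;

        public StatisticsWriter(String host, int port) {
            this.jedis = new Jedis(host, port);
        }

        /** Atomically add delta to a metric counter for the given day. */
        public void incr(String day, String metric, long delta) {
            jedis.hincrBy("rec:stat:" + day, metric, delta);
        }

        public void close() {
            jedis.close();
        }
    }

In a real topology one would typically share a JedisPool per worker rather than hold a single connection per bolt instance.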
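For completeness, here is a minimal sketch of how a spout and the bolt above might be wired into a topology and run locally. LogWebsiteSpout is assumed to be a concrete, instantiable spout (the snippets above only reference its constants); the component ids, parallelism hints, and topology name are illustrative.

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;

    public class EvaluateTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // the spout pulls log lines and emits them as tuples
            builder.setSpout("log-spout", new LogWebsiteSpout(), 1);
            // the bolt computes the evaluation statistics from those tuples
            builder.setBolt("evaluate-bolt", new EvaluateBolt(), 2)
                   .shuffleGrouping("log-spout");

            Config conf = new Config();
            conf.setDebug(false);

            // local test run; production would use StormSubmitter.submitTopology(...)
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("recsys-evaluate", conf, builder.createTopology());
        }
    }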
Evaluation Indicator Definition:
    • Click-through rate: recommendation-driven views / recommended products delivered
    • Impression click-through rate: recommendation-driven views / recommendation impressions
    • Recommendation impression rate: recommendation impressions / total views
    • Recommendation-driven views: the number of page views generated through recommendations
    • Recommended products delivered: the number of recommended products delivered (for example, if a user browses a product and the recommendation slot on that page shows 5 recommended products, the delivered count increases by 5)
    • Recommendation impressions: incremented by 1 each time a recommended product is actually displayed
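As a small worked example of these definitions, the sketch below computes the three rates from raw counters; the counter values are made up purely for illustration.

    public class IndicatorExample {
        public static void main(String[] args) {
            // illustrative counts for one day
            long totalViews     = 1000000L;  // total page views
            long recImpressions = 200000L;   // recommendation impressions
            long recDelivered   = 900000L;   // recommended products delivered
            long recViews       = 18000L;    // views generated through recommendations

            double clickThroughRate    = (double) recViews / recDelivered;     // 0.02
            double impressionClickRate = (double) recViews / recImpressions;   // 0.09
            double impressionRate      = (double) recImpressions / totalViews; // 0.2

            System.out.printf("CTR=%.4f, impression CTR=%.4f, impression rate=%.4f%n",
                    clickThroughRate, impressionClickRate, impressionRate);
        }
    }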
-over-
