Compared with offline machine learning, online machine learning performs better in terms of model update timeliness, model iteration cycle, and business experiment results. Migrating machine learning from offline to online has therefore become an effective way to improve business metrics.
In online machine learning, samples are a key component. This article gives a detailed introduction to how Weibo uses Flink to generate online samples.
Why choose Flink for online sample generation?
Online sample generation places extremely high requirements on the timeliness and accuracy of samples, and there are also strict requirements on job stability and disaster recovery.
Therefore, we decided to use Flink as a real-time stream computing framework for online sample generation.
How is it implemented?
To briefly describe a typical business scenario for online sample generation: user exposure data and click data are correlated in real time, and the correlated data is output to Kafka for downstream online training tasks.
First, we must determine the time window for associating the two data streams. In this step, it is generally recommended to join the logs of the two data streams offline over different time ranges to determine the time window needed online. For example, if the minimum correlation ratio accepted by the business is 85%, and offline testing confirms that 85% of the data in the two streams can be correlated within 20 minutes, then 20 minutes can be used as the time window. The correlation ratio and window length here are essentially a trade-off between accuracy and real-time performance.
After determining the time window, we did not use Flink's built-in time windows to join the data streams; instead we chose a union + timer approach. There are two main considerations: first, Flink's built-in join operation does not support joining more than two data streams at once; second, implementing the join with timer + state is more customizable, less restrictive, and more convenient.
Next, we subdivide the sample generation process into:
① Input data stream
Generally, our data sources include Kafka, Trigger, MQ, etc. Flink needs to read logs from the data source in real time.
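A minimal sketch of reading the input log streams from Kafka in Flink (the topic names, broker address, and consumer group below are illustrative, not the actual Weibo configuration):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka-broker:9092");
props.setProperty("group.id", "online-sample-job");

// One Kafka source per input log stream; each record is a raw log line
DataStream<String> exposeLog = env.addSource(
        new FlinkKafkaConsumer<>("expose_log_topic", new SimpleStringSchema(), props));
DataStream<String> clickLog = env.addSource(
        new FlinkKafkaConsumer<>("click_log_topic", new SimpleStringSchema(), props));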
② Formatting and filtering of input data stream
After reading the log, format the data and filter out unnecessary fields and data.
Specify the key of the sample join. For example: user id and content id as key.
The output format is generally Tuple2(K, V), where K is the key participating in the join and V contains the fields used in the sample.
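A minimal sketch of this step, continuing from the streams read above. The log layout is an assumption for illustration: a tab-separated line whose first two fields are user id and content id.

// Format one raw log stream into Tuple2<joinKey, payload> and drop malformed records
DataStream<Tuple2<String, String>> exposeFormatted = exposeLog
        .flatMap((String line, Collector<Tuple2<String, String>> out) -> {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                return;                                   // filter out logs without the required fields
            }
            String key = fields[0] + "_" + fields[1];     // user id + content id as the join key
            out.collect(Tuple2.of(key, line));            // keep the fields the sample needs as the value
        })
        .returns(Types.TUPLE(Types.STRING, Types.STRING));
// The click stream is formatted the same way into clickFormatted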
③ Union of input data stream
Use Flink's union operation to merge multiple input streams into a single DataStream.
Specify a distinguishable alias or add a distinguishable field for each input stream.
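A minimal sketch of the union, assuming clickFormatted was produced the same way as exposeFormatted above; the second tuple field is the distinguishing alias for each stream:

// Tag every record with its source stream, then merge the streams into one DataStream.
// Tuple3 = (joinKey, sourceTag, payload)
DataStream<Tuple3<String, String, String>> exposeTagged = exposeFormatted
        .map(t -> Tuple3.of(t.f0, "expose", t.f1))
        .returns(Types.TUPLE(Types.STRING, Types.STRING, Types.STRING));

DataStream<Tuple3<String, String, String>> clickTagged = clickFormatted
        .map(t -> Tuple3.of(t.f0, "click", t.f1))
        .returns(Types.TUPLE(Types.STRING, Types.STRING, Types.STRING));

// union accepts any number of streams, so more than two input streams also work
DataStream<Tuple3<String, String, String>> unioned = exposeTagged.union(clickTagged);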
④ Aggregation of input data stream: keyby operation
Do the keyby operation on the join key. Following the above example, it means to join multiple data streams based on user id and content id.
If the key has data skew, it is recommended to first add a random number (salt) to the key and pre-aggregate, then remove the random number and aggregate again.
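A minimal sketch of the keyby, continuing from the unioned Tuple3 stream above:

// Key the unioned stream by the join key (user id + content id);
// all exposure and click records for the same key land on the same subtask
KeyedStream<Tuple3<String, String, String>, String> keyed = unioned.keyBy(t -> t.f0);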
⑤ Data storage: state + timer
Define a Value State.
After keyby, in the process function we override the processElement method. In processElement, if the value state is empty, we create a new state object, write the data into the value state, and register a timer for this piece of data (Flink automatically deduplicates timers by key + timestamp). Here we use ProcessingTime, which means onTimer() is triggered when the system time reaches the timestamp set by the timer. If the state is not empty, we update the existing result according to the splicing strategy. For example, if the first log for user id1 and content id1 in the time window has no click behavior, this field is 0; after the second piece of data (a click) arrives, the field is updated to 1. Besides updates, there are also operations such as counting, accumulation, and averaging. To distinguish whether the data in the process method comes from exposure or click, use the alias defined in step ③ above.
Override the onTimer method, which defines the logic executed when the timer fires: read the stored data from the value state, emit it, and then call state.clear().
There are two conditions under which a sample is emitted from the window: first, the timer expires; second, all the data the business needs has been spliced into the sample.
Reference pseudo code here:
public class StateSampleFunction extends KeyedProcessFunction<String, Tuple2<String, Object>, ReturnSample> {
    /**
     * This state is maintained by the process function, using ValueState
     */
    private ValueState<ReturnSample> state;

    /** Window length in milliseconds */
    private final Long timer;

    public StateSampleFunction(String time) {
        timer = Long.valueOf(time);
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // Get the keyed state handle
        state = getRuntimeContext().getState(
                new ValueStateDescriptor<>("state", TypeInformation.of(new TypeHint<ReturnSample>() {})));
    }

    @Override
    public void processElement(Tuple2<String, Object> value, Context context, Collector<ReturnSample> collector) throws Exception {
        if (value.f0 == null) {
            return;
        }

        Object sampleValue = value.f1;
        Long time = context.timerService().currentProcessingTime();

        ReturnSample returnSample = state.value();
        if (returnSample == null) {
            // First log for this key: create the sample and register a processing-time timer.
            // Timers are automatically deduplicated by Flink per key + timestamp.
            returnSample = new ReturnSample();
            returnSample.setKey(value.f0);
            returnSample.setTime(time);
            context.timerService().registerProcessingTimeTimer(time + timer);
        }

        // Update click data into the state (exposure and other log types are handled the same way)
        if (sampleValue instanceof ClickLog) {
            ClickLog clickLog = (ClickLog) sampleValue;
            returnSample = (ReturnSample) clickLog.setSample(returnSample);
        }
        state.update(returnSample);
    }

    /**
     * Triggered when the processing-time timer fires: emit the joined sample and clear the state.
     */
    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<ReturnSample> out) throws Exception {
        ReturnSample value = state.value();
        state.clear();
        out.collect(value);
    }
}
⑥ Formatting and filtering of logs after splicing
The spliced data needs to be formatted according to the requirements of the online training job, for example into JSON, CSV, or other formats.
Filtering: determine what kind of data qualifies as a sample. For example, only exposures that were actually read count as usable samples.
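A minimal sketch of this step, assuming the joined stream (joinedSamples) emits ReturnSample objects, and that ReturnSample exposes getKey() and getTime() plus illustrative isRead() and getLabel() accessors; the filter condition and JSON layout are examples, not the actual format:

// Keep only qualified samples and format them as JSON strings for the training job
DataStream<String> formattedSamples = joinedSamples
        .filter(sample -> sample != null && sample.isRead())     // e.g. only exposures that were actually read
        .map(sample -> "{\"key\":\"" + sample.getKey() + "\","
                + "\"label\":" + sample.getLabel() + ","
                + "\"time\":" + sample.getTime() + "}");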
⑦ Output
The samples are finally output to the real-time data queue for the online training tasks to consume.
(Figure: the actual job topology and runtime status)
Selection of StateBackend
Advantages and suggestions of using RocksDB/Gemini as the state backend:
We compared the memory, RocksDB, and Gemini state backends under large data volumes. The results show that RocksDB and Gemini are better than the memory backend in terms of data processing capability, job stability, and resource usage, and Gemini has the most obvious advantage.
In addition, for states with a large amount of data, it is recommended to use Gemini together with SSDs.
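Gemini is configured through the platform, so only the open-source RocksDB backend is shown here; a minimal sketch with incremental checkpoints (the checkpoint path and interval are illustrative):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

// Incremental checkpoints (second argument = true) keep checkpoint sizes manageable for large state
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints/online-sample-job", true));
env.enableCheckpointing(60_000);   // checkpoint every 60 seconds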
Monitoring of samples
1. Abnormal monitoring of Flink jobs
Job failure monitoring
Failover monitoring
Checkpoint failure monitoring
RocksDB usage monitoring
Monitoring of the job's Kafka consumer lag
Monitoring of job back pressure
2. Monitoring of Kafka consumption delay at the sample input
3. Monitoring of the Kafka write volume at the sample output
4. Sample monitoring
Splicing rate monitoring
Positive sample monitoring
Monitoring of output sample format
Whether the value corresponding to the output label is in the normal range
Whether the value corresponding to the input label is null
Whether the value corresponding to the output label is empty
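A minimal sketch of how a few of the sample checks above can be exposed as Flink metrics inside the job; the Sample type, its getLabel() accessor, and the expected label range of [0, 1] are assumptions for illustration:

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public class SampleMetricFilter extends RichFilterFunction<Sample> {
    private transient Counter totalCounter;      // all samples seen
    private transient Counter positiveCounter;   // positive samples
    private transient Counter badLabelCounter;   // labels outside the expected range

    @Override
    public void open(Configuration parameters) {
        totalCounter = getRuntimeContext().getMetricGroup().counter("sample_total");
        positiveCounter = getRuntimeContext().getMetricGroup().counter("sample_positive");
        badLabelCounter = getRuntimeContext().getMetricGroup().counter("sample_bad_label");
    }

    @Override
    public boolean filter(Sample sample) {
        totalCounter.inc();
        double label = sample.getLabel();
        if (label < 0.0 || label > 1.0) {        // label outside the normal range
            badLabelCounter.inc();
            return false;                        // drop the abnormal sample
        }
        if (label > 0.0) {
            positiveCounter.inc();
        }
        return true;
    }
}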
Sample verification
How to verify the accuracy of the data after the sample is generated
Online and offline mutual verification
Dump the online samples from the output Kafka topic to HDFS for offline storage, partitioned according to the time window of the online join.
Compare offline samples and online samples generated under the same conditions
Full process verification of whitelisted users
Store the logs and sample results of the whitelisted users in a real-time data warehouse such as ES for verification.
Troubleshooting
Sample abnormalities have a great impact on online model training. When an abnormality is detected, the first thing to do is to send a sample-abnormality alarm to the online model training job. After receiving the alarm, the training job stops updating the model, so as to avoid affecting the model's online performance.
For ordinary businesses, after the fault is resolved, the abnormal data is discarded and all input log streams are consumed from the latest point in time to generate new samples. For important businesses, the Kafka offsets of the input log streams need to be reset so that sample data is regenerated from the point of failure.
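A minimal sketch of resetting the input streams for an important business using the Kafka consumer's start-position API (the timestamp is illustrative and would be the time the fault started):

FlinkKafkaConsumer<String> exposeConsumer =
        new FlinkKafkaConsumer<>("expose_log_topic", new SimpleStringSchema(), props);

long failureTimestampMs = 1_600_000_000_000L;    // epoch millis of the point of failure
exposeConsumer.setStartFromTimestamp(failureTimestampMs);

// For ordinary businesses that simply discard the abnormal data, start from the latest offsets instead:
// exposeConsumer.setStartFromLatest();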
Platformization
It is very important to standardize the sample generation process through platformization. In the process of platformization, it is necessary to provide simple and general development templates to improve job development efficiency; provide platform-level job monitoring and sample metric monitoring frameworks to avoid reinventing the wheel; and provide general strategies for landing sample output and for online/offline verification, so as to serve the business side more conveniently.
With platform-based development, users only need to care about the business logic. What the user needs to develop:
Data cleaning logic corresponding to input data
Data cleaning logic before sample output
The rest can be achieved by configuring on the UI, specifically:
Kafka configuration information for the sample input and the UDF class for data cleaning
Time window for sample splicing
Aggregation method for fields within the window
Kafka configuration information for sample output and UDF class for data cleaning and formatting before output
Resources are reviewed and configured by the platform. Once this is done, the job is automatically generated and submitted.