Compared with offline machine learning, online machine learning performs better in terms of model update timeliness, model iteration cycle, and business experiment results. Migrating machine learning from offline to online has therefore become an effective way to improve business metrics.
In online machine learning, samples are a key component. This article gives a detailed introduction to how Weibo uses Flink to generate online samples.
Why choose Flink for online sample generation?
Online sample generation places extremely high requirements on the timeliness and accuracy of samples, and there are also strict requirements on job stability and disaster recovery.
Therefore, we decided to use Flink as a real-time stream computing framework for online sample generation.
How is it implemented?
To briefly describe a typical business scenario for online sample generation: user exposure data and click data are correlated in real time, and the correlated data is output to Kafka for downstream online training tasks.
First, we must determine the time window for associating the two data streams. In this step, it is generally recommended to join the logs of the two data streams offline over different time ranges to determine the time window needed online. For example, if the minimum correlation ratio accepted by the business is 85%, and offline testing confirms that 85% of the data in the two streams can be correlated within 20 minutes, then 20 minutes can be used as the time window. The correlation ratio and window length here are essentially a trade-off between accuracy and real-time performance.
After determining the time window, we did not use Flink's built-in time windows to join the data streams; instead we chose a union + timer approach. There are two main considerations: first, Flink's built-in join operation does not support joining more than two data streams at once; second, implementing the join with timer + state is more customizable, less restrictive, and more convenient.
Next, we subdivide the sample generation process into:
① Input data stream
Generally, our data sources include Kafka, Trigger, MQ, etc. Flink needs to read logs from the data source in real time.
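A minimal sketch of reading the input log streams from Kafka in Flink (the topic names, broker address, and consumer group below are illustrative, not the actual Weibo configuration):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka-broker:9092");
props.setProperty("group.id", "online-sample-job");

// One Kafka source per input log stream; each record is a raw log line
DataStream<String> exposeLog = env.addSource(
        new FlinkKafkaConsumer<>("expose_log_topic", new SimpleStringSchema(), props));
DataStream<String> clickLog = env.addSource(
        new FlinkKafkaConsumer<>("click_log_topic", new SimpleStringSchema(), props));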
② Formatting and filtering of input data stream
After reading the log, format the data and filter out unnecessary fields and data.
Specify the key of the sample join. For example: user id and content id as key.
The output format is generally Tuple2(K, V), where K is the key participating in the join and V contains the fields used in the sample.
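A minimal sketch of this step, continuing from the streams read above. The log layout is an assumption for illustration: a tab-separated line whose first two fields are user id and content id.

// Format one raw log stream into Tuple2<joinKey, payload> and drop malformed records
DataStream<Tuple2<String, String>> exposeFormatted = exposeLog
        .flatMap((String line, Collector<Tuple2<String, String>> out) -> {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                return;                                   // filter out logs without the required fields
            }
            String key = fields[0] + "_" + fields[1];     // user id + content id as the join key
            out.collect(Tuple2.of(key, line));            // keep the fields the sample needs as the value
        })
        .returns(Types.TUPLE(Types.STRING, Types.STRING));
// The click stream is formatted the same way into clickFormatted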
③ Union of input data stream
Use Flink's union operation to merge multiple input streams into a single DataStream.
Specify a distinguishable alias or add a distinguishable field for each input stream.
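A minimal sketch of the union, assuming clickFormatted was produced the same way as exposeFormatted above; the second tuple field is the distinguishing alias for each stream:

// Tag every record with its source stream, then merge the streams into one DataStream.
// Tuple3 = (joinKey, sourceTag, payload)
DataStream<Tuple3<String, String, String>> exposeTagged = exposeFormatted
        .map(t -> Tuple3.of(t.f0, "expose", t.f1))
        .returns(Types.TUPLE(Types.STRING, Types.STRING, Types.STRING));

DataStream<Tuple3<String, String, String>> clickTagged = clickFormatted
        .map(t -> Tuple3.of(t.f0, "click", t.f1))
        .returns(Types.TUPLE(Types.STRING, Types.STRING, Types.STRING));

// union accepts any number of streams, so more than two input streams also work
DataStream<Tuple3<String, String, String>> unioned = exposeTagged.union(clickTagged);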
④ Aggregation of input data stream: keyby operation
Do the keyby operation on the join key. Following the above example, it means to join multiple data streams based on user id and content id.
If the key has data skew, it is recommended to first add a random number (salt) to the key and pre-aggregate, then remove the random number and aggregate again.
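A minimal sketch of the keyby, continuing from the unioned Tuple3 stream above:

// Key the unioned stream by the join key (user id + content id);
// all exposure and click records for the same key land on the same subtask
KeyedStream<Tuple3<String, String, String>, String> keyed = unioned.keyBy(t -> t.f0);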
⑤ Data storage: state + timer
Define a Value State.
After keyby, in the process function we override the processElement method. In processElement, if the value state is empty, we create a new state object, write the data into the value state, and register a timer for this piece of data (Flink automatically deduplicates timers by key + timestamp). Here we use ProcessingTime, which means onTimer() is triggered when the system time reaches the timestamp set by the timer. If the state is not empty, we update the existing result according to the splicing strategy. For example, if the first log for user id1 and content id1 in the time window has no click behavior, this field is 0; after the second piece of data (a click) arrives, the field is updated to 1. Besides updates, there are also operations such as counting, accumulation, and averaging. To distinguish whether the data in the process method comes from exposure or click, use the alias defined in step ③ above.
Override the onTimer method, which defines the logic executed when the timer fires: read the stored data from the value state, emit it, and then call state.clear().
There are two conditions under which a sample is emitted from the window: first, the timer expires; second, all the data the business needs has been spliced into the sample.
Reference pseudo code here:
public class StateSampleFunction extends KeyedProcessFunction<String, Tuple2<String, Object>, ReturnSample> {
    /**
     * This state is maintained by the process function, using ValueState
     */
    private ValueState<ReturnSample> state;

    /** Window length in milliseconds */
    private final Long timer;

    public StateSampleFunction(String time) {
        timer = Long.valueOf(time);
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // Get the keyed state handle
        state = getRuntimeContext().getState(
                new ValueStateDescriptor<>("state", TypeInformation.of(new TypeHint<ReturnSample>() {})));
    }

    @Override
    public void processElement(Tuple2<String, Object> value, Context context, Collector<ReturnSample> collector) throws Exception {
        if (value.f0 == null) {
            return;
        }

        Object sampleValue = value.f1;
        Long time = context.timerService().currentProcessingTime();

        ReturnSample returnSample = state.value();
        if (returnSample == null) {
            // First log for this key: create the sample and register a processing-time timer.
            // Timers are automatically deduplicated by Flink per key + timestamp.
            returnSample = new ReturnSample();
            returnSample.setKey(value.f0);
            returnSample.setTime(time);
            context.timerService().registerProcessingTimeTimer(time + timer);
        }

        // Update click data into the state (exposure and other log types are handled the same way)
        if (sampleValue instanceof ClickLog) {
            ClickLog clickLog = (ClickLog) sampleValue;
            returnSample = (ReturnSample) clickLog.setSample(returnSample);
        }
        state.update(returnSample);
    }

    /**
     * Triggered when the processing-time timer fires: emit the joined sample and clear the state.
     */
    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<ReturnSample> out) throws Exception {
        ReturnSample value = state.value();
        state.clear();
        out.collect(value);
    }
}
⑥ Formatting and filtering of logs after splicing
The spliced data needs to be formatted according to the requirements of the online training job, for example into JSON, CSV, or other formats.
Filtering: determine what kind of data qualifies as a sample. For example, only exposures that were actually read count as usable samples.
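A minimal sketch of this step, assuming the joined stream (joinedSamples) emits ReturnSample objects, and that ReturnSample exposes getKey() and getTime() plus illustrative isRead() and getLabel() accessors; the filter condition and JSON layout are examples, not the actual format:

// Keep only qualified samples and format them as JSON strings for the training job
DataStream<String> formattedSamples = joinedSamples
        .filter(sample -> sample != null && sample.isRead())     // e.g. only exposures that were actually read
        .map(sample -> "{\"key\":\"" + sample.getKey() + "\","
                + "\"label\":" + sample.getLabel() + ","
                + "\"time\":" + sample.getTime() + "}");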
⑦ Output
The samples are finally output to the real-time data queue for the online training tasks to consume.
(Figure: the actual job topology and runtime status)
Selection of StateBackend
Advantages and suggestions of using RocksDB/Gemini as the state backend:
We compared the memory, RocksDB, and Gemini state backends under large data volumes. The results show that RocksDB and Gemini are better than the memory backend in terms of data processing capability, job stability, and resource usage, and Gemini has the most obvious advantage.
In addition, for states with a large amount of data, it is recommended to use Gemini together with SSDs.
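Gemini is configured through the platform, so only the open-source RocksDB backend is shown here; a minimal sketch with incremental checkpoints (the checkpoint path and interval are illustrative):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

// Incremental checkpoints (second argument = true) keep checkpoint sizes manageable for large state
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints/online-sample-job", true));
env.enableCheckpointing(60_000);   // checkpoint every 60 seconds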
Monitoring of samples
1. Abnormal monitoring of Flink jobs
Job failure monitoring
Failover monitoring
Checkpoint failure monitoring
RocksDB usage monitoring
Monitoring of the job's Kafka consumer lag
Monitoring of job back pressure
2. Monitoring of Kafka consumption delay at the sample input
3. Monitoring of the Kafka write volume at the sample output
4. Sample monitoring
Splicing rate monitoring
Positive sample monitoring
Monitoring of output sample format
Whether the value corresponding to the output label is in the normal range
Whether the value corresponding to the input label is null
Whether the value corresponding to the output label is empty
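A minimal sketch of how a few of the sample checks above can be exposed as Flink metrics inside the job; the Sample type, its getLabel() accessor, and the expected label range of [0, 1] are assumptions for illustration:

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public class SampleMetricFilter extends RichFilterFunction<Sample> {
    private transient Counter totalCounter;      // all samples seen
    private transient Counter positiveCounter;   // positive samples
    private transient Counter badLabelCounter;   // labels outside the expected range

    @Override
    public void open(Configuration parameters) {
        totalCounter = getRuntimeContext().getMetricGroup().counter("sample_total");
        positiveCounter = getRuntimeContext().getMetricGroup().counter("sample_positive");
        badLabelCounter = getRuntimeContext().getMetricGroup().counter("sample_bad_label");
    }

    @Override
    public boolean filter(Sample sample) {
        totalCounter.inc();
        double label = sample.getLabel();
        if (label < 0.0 || label > 1.0) {        // label outside the normal range
            badLabelCounter.inc();
            return false;                        // drop the abnormal sample
        }
        if (label > 0.0) {
            positiveCounter.inc();
        }
        return true;
    }
}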
Sample verification
How to verify the accuracy of the data after the sample is generated
Online and offline mutual verification
Dump the online samples from the output Kafka topic to HDFS for offline storage, partitioned according to the time window of the online join.
Compare offline samples and online samples generated under the same conditions
Full process verification of whitelisted users
Store the logs and sample results of the whitelisted users in a real-time data warehouse such as ES for verification.
Troubleshooting
Sample abnormalities have a great impact on online model training. When an abnormality is detected, the first thing to do is to send a sample-abnormality alarm to the online model training job. After receiving the alarm, the training job stops updating the model, so as to avoid affecting the model's online performance.
For ordinary businesses, after the fault is resolved, the abnormal data is discarded and all input log streams are consumed from the latest point in time to generate new samples. For important businesses, the Kafka offsets of the input log streams need to be reset so that sample data is regenerated from the point of failure.
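A minimal sketch of resetting the input streams for an important business using the Kafka consumer's start-position API (the timestamp is illustrative and would be the time the fault started):

FlinkKafkaConsumer<String> exposeConsumer =
        new FlinkKafkaConsumer<>("expose_log_topic", new SimpleStringSchema(), props);

long failureTimestampMs = 1_600_000_000_000L;    // epoch millis of the point of failure
exposeConsumer.setStartFromTimestamp(failureTimestampMs);

// For ordinary businesses that simply discard the abnormal data, start from the latest offsets instead:
// exposeConsumer.setStartFromLatest();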
Platformization
It is very important to standardize the sample generation process through platformization. In the process of platformization, it is necessary to provide simple and general development templates to improve job development efficiency; provide platform-level job monitoring and sample metric monitoring frameworks to avoid reinventing the wheel; and provide general strategies for landing sample output and for online/offline verification, so as to serve the business side more conveniently.
With platform-based development, users only need to care about the business logic. What the user needs to develop:
Data cleaning logic corresponding to input data
Data cleaning logic before sample output
The rest can be achieved by configuring on the UI, specifically:
Kafka configuration information for the sample input and the UDF class for data cleaning
Time window for sample splicing
Aggregation method for fields within the window
Kafka configuration information for sample output and UDF class for data cleaning and formatting before output
Resources are reviewed and configured by the platform. Once this is done, the job is automatically generated and submitted.