Kafka + Flink: A Quasi-Real-Time Exception Detection System


1. Background
Exception detection can be defined as "making decisions based on the actions of actors (people or machines)". The technique applies to many industries: transaction screening and loan screening in financial scenarios, production-line alerting in industrial scenarios, and *** detection in security scenarios.

Stream computing plays different roles depending on the business requirement: it can detect online fraud, and it can also support near-real-time result analysis, global alerting, and rule adjustment after decisions are made.

This article introduces a quasi-real-time exception detection system.

"Quasi-real-time" here means the latency must stay within milliseconds. For example, a bank checks every transaction in real time to determine whether it is legitimate: if a user's credentials are stolen, the system can detect the risk the moment the attacker initiates a transaction and decide whether to freeze it.

Such scenarios have strict latency requirements, since a slow check would block normal transactions; hence the term quasi-real-time.

Because actors adjust their behavior in response to the system's decisions, the rules must also be updated over time. Stream computing and offline processing are used to study whether the rules need to be updated, and how.

2. System Architecture and Module Overview
To solve this problem, we design the following system architecture:

The online system provides the online detection function as a web service:
Detecting a single event;
Checking against global context, such as a global blacklist;
Checking against user profiles or recent history, such as the time and location of the last 20 transactions.

Kafka sends the events, the detection results, and the reasons for those results downstream.

Flink performs near-real-time processing:
Updating user attributes in near real time, such as the latest transaction time and location;
Aggregating and comparing global detection status, for example when the interception rate of a rule suddenly changes sharply, or the global pass rate suddenly rises or falls.

MaxCompute/Hadoop provides storage and offline analysis: historical records are kept so that business analysts can explore whether new patterns exist, and the resulting user profiles are saved to HBase.

3. Key Modules
3.1 Online Detection System

Transaction exception detection is implemented in this system. It can be a web server or a component embedded in the client; in this article we assume it is a web server. Its main task is to review each incoming event and return approval or rejection.

Three levels of detection can be performed on each incoming event:

Event-level detection
Detection that can be completed using only the event itself, such as format checks or basic rule validation (attribute A must be greater than 10 and less than 30, attribute B cannot be blank, and so on).

Global context detection
Detection against global information, for example checking whether the user is on a global blacklist, or whether an attribute deviates sharply from the global average.

Profile content detection
Cross-record analysis of the actor itself. For example, if a user's previous 100 transactions all occurred in Hangzhou, and a new transaction occurs in Beijing only 10 minutes after the last one, there is reason to raise an exception signal.

Therefore, this system must maintain at least three things: the detection pipeline itself, the judgment rules, and the global data they require. It must also decide whether to cache user profiles locally as needed.
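The three levels above can be sketched as a single check function. This is a minimal Python illustration; the attribute names (amount, user_id, city) and the sample rules are hypothetical, not prescribed by the article:

```python
from dataclasses import dataclass, field

@dataclass
class Detector:
    # Hypothetical state: a global blacklist (level 2) and a cache of each
    # user's recent transaction cities (level 3).
    blacklist: set = field(default_factory=set)
    profiles: dict = field(default_factory=dict)  # user_id -> list of recent cities

    def check(self, event: dict) -> tuple:
        # Level 1: event-level rules, using only the event itself
        if not (10 < event.get("amount", 0) < 30):
            return False, "amount out of range"
        if not event.get("user_id"):
            return False, "user_id is blank"
        # Level 2: global context, e.g. a global blacklist
        if event["user_id"] in self.blacklist:
            return False, "user is blacklisted"
        # Level 3: profile-based cross-record check, e.g. a sudden location jump
        recent = self.profiles.get(event["user_id"], [])
        if recent and event.get("city") not in recent:
            return False, "unusual location vs. recent history"
        return True, "ok"
```

A user whose recent transactions were all in Hangzhou would be flagged if a new transaction arrives from Beijing, while an unknown user with no history passes the profile check by default; a production system would tune that default to the business.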

3.2 Kafka

Kafka carries data such as the detected events, the detection results, and the reasons for rejection or approval downstream, where stream computing and offline computing consume them.
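As a sketch of what such a downstream message might look like, the helper below serializes one detection result into a Kafka-style (key, value) byte pair. The field names and the topic are assumptions; the real schema depends on the Flink and offline consumers:

```python
import json
import time

def build_detection_record(event: dict, passed: bool, reason: str) -> tuple:
    """Serialize one detection result into a (key, value) pair for Kafka.

    Keying by user_id keeps one user's events in order within a partition.
    """
    key = event["user_id"].encode("utf-8")
    value = json.dumps({
        "event": event,
        "passed": passed,
        "reason": reason,
        "detected_at": int(time.time() * 1000),  # epoch milliseconds
    }).encode("utf-8")
    return key, value

# With a real cluster, the pair would be sent with a client such as
# kafka-python (topic name is an assumption):
#   producer.send("detection-results", key=key, value=value)
```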

3.3 Flink Near-Real-Time Processing

The system above completes exception detection and sends each decision to Kafka. Next, we use that data to run a new round of defensive checks on the current policy itself.

Even after known cheating behaviors have been entered into the model and rule repository, there are always "smart people" attempting fraud. They study the current system, guess its rules, and adjust their behavior, and these new behaviors may be beyond our current understanding. We therefore need a system that detects anomalies of the detection system as a whole and discovers new rules.

That is to say, our goal is not to check whether a single event is faulty, but whether the logic used to detect events is faulty.

Therefore, we must look at the problem at a level higher than individual events. If something changes at that higher level, we have reason to consider adjusting the rules and logic.

Specifically, the system should track macro indicators such as totals, averages, and the behavior of particular groups. When these indicators change, it often means some rule has become invalid.

For example:

The interception rate of a rule was 20% and suddenly dropped to 5%;

On the day a rule went live, a large number of normal users were intercepted;

A person's spending on electronic products suddenly increased 100-fold, but many users show similar behavior, which may have a reasonable explanation (such as a new iPhone going on sale);

Each of a person's actions is normal on its own, but the frequency is not, for example buying the same product 100 times in one day (window analysis);

A combination of individually normal behaviors is abnormal. Buying a kitchen knife is normal, buying a ticket is normal, buying a rope is normal, and refueling at a gas station is normal, but doing all of these within a short period of time is not. Such behavior patterns can be found through global analysis.
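The first example, a rule's interception rate shifting sharply, can be monitored with a simple tumbling window. The sketch below is plain Python standing in for what a Flink windowed aggregation would compute; the window size and alert threshold are illustrative choices:

```python
from collections import deque

class RuleRateMonitor:
    """Alert when a rule's interception rate shifts sharply between two
    adjacent tumbling windows (e.g. 20% dropping to 5%)."""

    def __init__(self, window: int = 100, max_shift: float = 0.10):
        self.window = window
        self.max_shift = max_shift
        self.current = deque(maxlen=window)  # decisions in the open window
        self.previous_rate = None

    def observe(self, intercepted: bool) -> bool:
        """Record one decision; return True when an alert should fire."""
        self.current.append(1 if intercepted else 0)
        if len(self.current) < self.window:
            return False
        rate = sum(self.current) / self.window
        alert = (self.previous_rate is not None
                 and abs(rate - self.previous_rate) > self.max_shift)
        self.previous_rate = rate
        self.current.clear()  # start the next tumbling window
        return alert
```

In a real deployment the same comparison would be expressed as a keyed window per rule id in Flink, rather than kept in process memory.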

Based on the near-real-time results produced by stream computing, business personnel can promptly discover whether the rules are still correct and adjust them accordingly.

In addition, stream computing can update user profiles in near real time, for example aggregating a user's behavior over the past 10 minutes or keeping the last 10 login locations.
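Keeping "the last N login locations" is a bounded-buffer update that a Flink job could maintain per user. A minimal sketch, with N=10 mirroring the example above and the method names being hypothetical:

```python
from collections import defaultdict, deque

class ProfileUpdater:
    """Maintain each user's last N login locations, the kind of
    near-real-time profile the stream job writes back to storage."""

    def __init__(self, n: int = 10):
        # deque(maxlen=n) silently evicts the oldest entry once full
        self.locations = defaultdict(lambda: deque(maxlen=n))

    def on_login(self, user_id: str, location: str) -> None:
        self.locations[user_id].append(location)

    def recent_locations(self, user_id: str) -> list:
        return list(self.locations[user_id])
```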

3.4 MaxCompute/Hadoop Offline Storage and Exploratory Analysis

In this step, scripts, SQL, or machine learning algorithms can be used for exploratory analysis to discover new patterns, for example clustering, training models on labeled behaviors, or periodically recomputing user profiles. The details depend heavily on the specific business, so we will not expand on them here.
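As one deliberately simple example of the exploratory statistics this stage might run (a stand-in for the clustering mentioned above, not the article's method), a z-score scan over historical amounts flags values far from the mean:

```python
import statistics

def zscore_outliers(amounts: list, threshold: float = 3.0) -> list:
    """Return the values whose z-score exceeds the threshold.

    Uses population standard deviation; a constant series has no outliers.
    """
    mean = statistics.fmean(amounts)
    stdev = statistics.pstdev(amounts)
    if stdev == 0:
        return []
    return [x for x in amounts if abs(x - mean) / stdev > threshold]
```

In practice this would run as SQL or a batch job over MaxCompute/Hadoop tables rather than in-memory Python, and flagged patterns would feed back into the rule repository after human review.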

3.5 HBase User Profiles

HBase stores the user profiles generated by stream computing and offline computation, for use by the detection system. HBase is chosen because it meets the real-time query requirement.
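Serving point lookups at detection time makes the HBase row-key design matter. One common scheme, shown below as an illustrative sketch rather than the article's design, salts the user id so that sequential ids spread across regions instead of hotspotting one:

```python
def profile_row_key(user_id: str, buckets: int = 16) -> bytes:
    """Build a salted HBase row key for a user-profile table.

    The two-digit salt prefix spreads writes/reads across `buckets`
    regions; the scheme and bucket count are assumptions.
    """
    salt = sum(user_id.encode("utf-8")) % buckets
    return f"{salt:02d}|{user_id}".encode("utf-8")

# With a running cluster, the detection system could then fetch the profile
# with a client such as happybase (table and host names are assumptions):
#   table = happybase.Connection("hbase-host").table("user_profiles")
#   row = table.row(profile_row_key("alice"))
```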

4. Summary
The above is a conceptual design for a quasi-real-time exception detection system. Although the business logic shown here is simple, the overall system is complete and scales well, and it can be further refined on this basis.
