Preface
When log analysis comes up, most people treat it as an after-the-fact activity: a hacker succeeds and the website is compromised, the operations team notices, and only then do security personnel step in to analyze the cause of the intrusion, often tracing back through logs from the previous few days or even longer.
Processing workflow:
I personally think that the log analysis process is divided into three stages:
• Past:
In the past, many websites did not produce much log data: a few GB, perhaps tens or even hundreds of GB at most. When an attack occurred, grep or a Perl/Python script was enough to work through it, but this is essentially post-incident response. At this primitive stage, anomalies are found by grepping for keywords, which can never give real-time results; you usually only get involved after the incident. Later we deployed a Perl script on the server to tail the logs in real time, which did a reasonable job of spotting attackers, but it put extra load on the server, and the operations staff might not be willing to help deploy it, which makes it hard to sustain. So can we get involved earlier? The answer is yes, and the approach described below will get us there step by step.
• Now:
The most fundamental characteristic of data in the big data era is its sheer volume. With the rise of e-commerce, hundreds of millions of log entries a day have become the norm. If you still rely on the earlier scripts or on grep, you cannot finish the analysis at all, let alone do it in real time.
Big data has brought us many options for processing massive data sets, such as Hive (offline analysis), Storm (real-time analysis framework), Impala (real-time computing engine), Hadoop (distributed computing), HBase, and Spark.
So, starting from the data, what do we need to do to support our security data analysis platform?
I think it can be divided into several stages:
• Data collection
• Data processing
• Real-time data computing
• Data storage, split into two parts: offline and real-time
First of all, without the data from step one, there is no point in reading any further.
The foundation of security analysis is data. All of the data sources here are web logs. From a business perspective these are just business logs, but in my eyes this data is a "honeypot".
There are both good and bad actors in the logs; our goal is to find the bad ones.
With so many big data technologies available, the architecture and technology selection came out as follows:
Data collection is handled by Flume, data subscription by Kafka, and real-time computation by Storm. Data storage covers two sides: real-time storage and offline storage.
Flume: Flume provides simple data processing and the ability to write to various data receivers (customizable). It supports data sources such as console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog system, in both TCP and UDP modes), and exec (command execution).
Kafka: Kafka is a distributed message queue developed at LinkedIn for log processing.
Storm: a real-time computing framework that processes data as a continuous stream. Storm offers high throughput and low latency, and is well suited to continuously arriving data sources. The following figure shows the Storm UI:
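As a rough sketch of how these pieces fit together on the Storm side, a topology might subscribe to the Kafka topic that Flume feeds and hand each log line to a rule bolt. The ZooKeeper addresses, topic name, and the stand-in PrintBolt below are all assumptions for illustration:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class WebLogTopology {

        // Stand-in bolt that just prints each log line; the real rule bolts replace this
        public static class PrintBolt extends BaseBasicBolt {
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                System.out.println(tuple.getString(0));
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
            }
        }

        public static void main(String[] args) throws Exception {
            // Subscribe to the Kafka topic that Flume feeds the web logs into
            SpoutConfig spoutConf = new SpoutConfig(
                    new ZkHosts("zk1:2181,zk2:2181"), "weblog", "/weblog", "weblog-spout");
            spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("log-spout", new KafkaSpout(spoutConf), 2);
            builder.setBolt("rule-bolt", new PrintBolt(), 4).shuffleGrouping("log-spout");

            Config conf = new Config();
            conf.setNumWorkers(2);
            StormSubmitter.submitTopology("weblog-analysis", conf, builder.createTopology());
        }
    }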
• Basic log processing:
With these tools in place, we need to look at the log format, understand what each field means, and normalize the logs so that the fields are easy to extract. Here we use regular expressions to do the matching. For example, the nginx log format rule:
    log_format combined '$remote_addr - $remote_user [$time_local] '
                        '"$request" $status $body_bytes_sent '
                        '"$http_referer" "$http_user_agent"';
For malicious attack logs, which fields matter most here? $request, $status, $body_bytes_sent, and $http_user_agent.
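As an illustration, these fields of interest could be pulled out with a Java regular expression along the following lines; the pattern is a rough approximation of the combined format above, not a complete parser:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NginxLogParser {
        // Rough pattern for the nginx "combined" format shown above
        private static final Pattern COMBINED = Pattern.compile(
            "^(\\S+) - (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-) \"([^\"]*)\" \"([^\"]*)\"");

        public static void main(String[] args) {
            String line = "1.2.3.4 - - [10/Oct/2014:14:23:41 +0800] "
                    + "\"GET /phpinfo.php HTTP/1.1\" 200 5123 \"-\" \"Mozilla/5.0\"";
            Matcher m = COMBINED.matcher(line);
            if (m.find()) {
                String remoteAddr = m.group(1);   // $remote_addr
                String request    = m.group(4);   // $request
                String status     = m.group(5);   // $status
                String bodyBytes  = m.group(6);   // $body_bytes_sent
                String userAgent  = m.group(8);   // $http_user_agent
                System.out.println(remoteAddr + " " + request + " " + status
                        + " " + bodyBytes + " " + userAgent);
            }
        }
    }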
After formatting and collating this large amount of data, what we need to do is clear away the noise in front of us.
This noise includes all kinds of scans, all kinds of crawlers, and all kinds of intentional and unintentional intrusion attempts.
For basic filtering, we mainly separate two cases: suspected successful attacks and unsuccessful ones. The HTTP status code gives a first-pass classification.
An HTTP status of 403, 404, 502, or 301 can be treated as an unsuccessful attack.
An HTTP status of 200 or 500 can be treated as a suspected "successful attack". With this basic filter, a large amount of useless data can be discarded.
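A small helper along these lines could tag each parsed record; the groupings simply follow the status codes above:

    public class StatusFilter {
        // 403/404/502/301: the request most likely did not reach anything exploitable
        static boolean looksUnsuccessful(int status) {
            return status == 403 || status == 404 || status == 502 || status == 301;
        }

        // 200/500: the request was served or triggered a server-side error, worth a closer look
        static boolean looksSuspectedSuccessful(int status) {
            return status == 200 || status == 500;
        }

        public static String classify(int status) {
            if (looksSuspectedSuccessful(status)) return "suspected-success";
            if (looksUnsuccessful(status))        return "unsuccessful";
            return "other";
        }
    }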
Our goal: keep in mind that the types of attack we want to catch are hidden inside this mass of noise. We cannot base the analysis on the noise alone; that would be unprofessional and irresponsible.
Rule customization:
Through rule customization, attack-and-defense experience and the problems found in earlier analysis can be turned into rules and added to the Storm real-time analysis job to detect attack behavior and store it in the database. How much is detected depends on the number and accuracy of the rules, including how the regular expressions are written.
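One simple way to keep the rules pluggable is to hold them as name/pattern pairs that the Storm job compiles at start-up. The rule names and patterns below are only examples:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    public class AttackRules {
        // Illustrative rule set: rule name -> case-insensitive pattern
        public static Map<String, Pattern> load() {
            Map<String, Pattern> rules = new LinkedHashMap<String, Pattern>();
            rules.put("phpinfo",        Pattern.compile("phpinfo", Pattern.CASE_INSENSITIVE));
            rules.put("sql-injection",  Pattern.compile("union\\s+select|information_schema", Pattern.CASE_INSENSITIVE));
            rules.put("xss",            Pattern.compile("<script|alert\\(", Pattern.CASE_INSENSITIVE));
            rules.put("path-traversal", Pattern.compile("\\.\\./\\.\\./", Pattern.CASE_INSENSITIVE));
            return rules;
        }
    }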
Storm rule capture:
In Storm, this is implemented by matching keywords with regular expressions, for example phpinfo.
The data flow is that Storm consumes the Kafka topic, and the tuples it receives can be pre-processed first.
This pre-processing is done in the bolt's prepare() method, so the regular expression can be compiled there.
The Storm job is written in Java; the core of the phpinfo match is a regular-expression check.
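A minimal sketch of such a bolt, with the pattern compiled in prepare() and the match applied in execute(); the class name, the tuple field, and the emitted output are assumptions:

    import java.util.Map;
    import java.util.regex.Pattern;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class PhpinfoBolt extends BaseRichBolt {
        private OutputCollector collector;
        private Pattern phpinfo;

        // prepare(): compile the regular expression once, before any tuple arrives
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.phpinfo = Pattern.compile("phpinfo", Pattern.CASE_INSENSITIVE);
        }

        // execute(): check every incoming log line and emit only the ones that match
        public void execute(Tuple tuple) {
            String logLine = tuple.getString(0);      // raw log line from the Kafka spout
            if (logLine != null && phpinfo.matcher(logLine).find()) {
                collector.emit(new Values(logLine));  // matched: pass it on for storage
            }
            collector.ack(tuple);                     // non-matching lines are simply skipped
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("attack_log"));
        }
    }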
After the pre-processing, the data needs to be searched. The matching logic is simple: a tuple that does not match the pattern is skipped, and the match is case-insensitive.
Storm's execute() method carries the processing logic: it checks whether the tuple contains phpinfo, emits the hit if it does, and produces nothing if it does not.
The Storm job is built with mvn and then submitted to Nimbus; in the execution results, the phpinfo keyword is picked up in real time.
Finally, the matched results are written to the database.
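A minimal sketch of that write path over JDBC; the connection settings, table name, and columns are assumptions (the same layout is used for the table structure later on):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class AttackLogWriter {
        // Assumed connection settings; the attack_log table layout is shown further below
        private static final String URL = "jdbc:mysql://dbhost:3306/seclog";

        public static void save(String attackIp, String request, int status,
                                String userAgent, String ruleName) throws Exception {
            Connection conn = DriverManager.getConnection(URL, "secuser", "secret");
            try {
                PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO attack_log (attack_time, attack_ip, request, status, user_agent, rule_name) "
                        + "VALUES (NOW(), ?, ?, ?, ?, ?)");
                ps.setString(1, attackIp);
                ps.setString(2, request);
                ps.setInt(3, status);
                ps.setString(4, userAgent);
                ps.setString(5, ruleName);
                ps.executeUpdate();
                ps.close();
            } finally {
                conn.close();
            }
        }
    }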
Storm real-time computing supports both local and remote debugging. Accessing http://hostname/phpinfo.php locally, the information captured by Storm is:
Information written into the database:
Comparing the two, the test request was made at 14:23:41 and the record was inserted into the database at 14:23:49.
The information matched on the phpinfo keyword is as follows:
• Data visualization:
With the basic data analysis done, the results can be plotted as charts. This extends the window over which attacks can be monitored, instead of being stuck with one-off database queries. Of course, the purpose of visualization is not visualization itself; it has to serve operational needs.
A chart is usually friendlier than a raw table from a user-experience point of view. But is a given visualization actually useful?
Not necessarily. The goal of visualization is to let others see the real meaning of your data analysis at a glance.
• Data storage:
After the data is analyzed, it needs to be stored in a targeted way for later joint analysis. Storage is split into offline and real-time parts, with the real-time side showing the attack trend within the current day.
• Data Analysis: (important)
With the rule-matching results written to the database, you can filter the logs with database queries and extract the attack time, the attacking IP addresses, the number of attacks, the geographic origin of the IPs, and the periods in which attacks are concentrated. From this the hacker's activity track can be drawn as a chart.
You can then judge the attacker's technical ability, whether they are a repeat visitor, and what their motive might be. Realistically, though, even with these characteristics analyzed we still have to take concrete action against the attacks, for example extracting the top 20 attacking IPs and blocking them.
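The top-20 list could come from a simple aggregate query against the assumed attack_log table, for instance:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TopAttackers {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://dbhost:3306/seclog", "secuser", "secret");
            Statement st = conn.createStatement();
            // Rank source IPs by the number of matched attack records
            ResultSet rs = st.executeQuery(
                    "SELECT attack_ip, COUNT(*) AS hits FROM attack_log "
                    + "GROUP BY attack_ip ORDER BY hits DESC LIMIT 20");
            while (rs.next()) {
                System.out.println(rs.getString("attack_ip") + "\t" + rs.getLong("hits"));
            }
            rs.close();
            st.close();
            conn.close();
        }
    }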
Second, can the attacks be analyzed further? If the analysis stops at what happened, it is not enough; the data has to be combined with specific vulnerabilities. Can the Shellshock vulnerability be seen in it? Can the PHP CGI remote code execution vulnerability be found? After a period of analysis, the trends can be summarized.
All of this revolves around features and keywords. Recognizing attackers by keyword is like describing people as fat or thin, tall or short: you first sort them into categories and then analyze each category.
The prerequisite for analysis is to create the table first, which means designing the database table structure, for example:
Here we care about the following information: the attack log, the attack payload, the attack method, the returned status, the attacking IP, and the attacker's browser fingerprint.
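One possible layout covering these fields; the table name attack_log, the column names, and the types are assumptions, kept consistent with the write sketch earlier:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AttackLogSchema {
        // Illustrative DDL covering the fields listed above
        private static final String CREATE_TABLE =
            "CREATE TABLE IF NOT EXISTS attack_log ("
            + "  id BIGINT AUTO_INCREMENT PRIMARY KEY,"
            + "  attack_time DATETIME NOT NULL,"     // when the request was seen
            + "  attack_ip VARCHAR(46) NOT NULL,"    // source IP address
            + "  request TEXT,"                      // raw $request, i.e. the attack payload
            + "  rule_name VARCHAR(64),"             // which rule matched (attack method)
            + "  status SMALLINT,"                   // HTTP return status
            + "  user_agent VARCHAR(512)"            // browser fingerprint
            + ")";

        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://dbhost:3306/seclog", "secuser", "secret");
            Statement st = conn.createStatement();
            st.execute(CREATE_TABLE);
            st.close();
            conn.close();
        }
    }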
• Determine the scope of analysis:
What problems do we want to find? SQL injection, XSS, file inclusion, directory traversal, brute-force attempts, and scans from all kinds of scanners. Once this information is gathered, it is written into the rules. Through Storm real-time computing we accumulate this data over a period of time and end up with a basic sample set for analysis.
Analysis is essentially a summarization process, and it can be done with MySQL.
All of our analysis is done from a security perspective, so let's look at which user_agent values are interesting. Here is the AWVS scanner fingerprint:
Various types of scanning data:
Judging from the data, attackers seem to have a special fondness for Discuz .bak files. Or let's see which phpMyAdmin paths attackers are most keen on:
And all kinds of XSS:
And the familiar Struts2:
And so on; these are all features that can be found, but finding them is not the goal in itself. The real question is whether the analysis can be made intelligent.
A lot of attack data is meaningless, and filtering the genuinely dangerous items out of it is exactly what automated analysis should cover.
By analyzing the data, we can easily see what attackers are interested in and whether a particular exploit is being used at scale in the wild.
• Future:
The future of log analysis will certainly be data-driven, using machine learning and data mining algorithms to predict logs and attack trends.
Finally, log analysis is a process of continuous evolution and continuous cultivation.
The pressure on the database layer is relatively high, and MySQL is not well suited to queries over tens of millions of rows, so HBase will be considered in the future.
We are also considering using Bayesian algorithms to score historical data and adjust policies accordingly.