Webshell Detection: Log Analysis


I have always believed that the ultimate purpose of log analysis is forensics and prediction: telling the complete story of past, ongoing, and future attacks (when, where, who, what happened, why).

This article is about identifying webshells, that is, tracing attacks that have already happened from confirmed attack events. An implanted webshell is without doubt a confirmed attack event: once one has been uploaded, the site has very likely been compromised. The two mainstream detection approaches are anti-virus style webshell scanners and intrusion detection systems. Apart from their technical shortcomings (described later), in practice these methods often cannot obtain complete file samples (deploying a webshell scanner on every host is costly, with system compatibility and performance problems) or complete traffic (deploying traffic mirroring is also expensive). That is why we turn to the low-cost approach of log analysis.

This article focuses on log-analysis methods for webshell detection: how to build the model and how to implement it. At the end, we briefly compare them with traditional detection methods.

I. Analysis

General idea: first locate the anomalous logs, then find the attack logs among them. The whole process has two steps: webshell extraction + webshell validation.

(P.S. Like the introductions to my earlier posts on the web-log length anomaly model and on HMMs and their security applications: first discover the unknown, then confirm the known from the unknown.)

1. webshell Extraction

Based on security experience, we can make the following assumptions:

(1) webshell access features (main features)

1) only a small number of IP addresses access it

2) the total number of visits is small

3) the page is an isolated page

Note that words such as "small", "few", and "isolated" are vague adjectives. We need to quantify these features: how few is few? What exactly counts as an isolated page?

Next, we use common descriptive statistics:

1) total daily access distribution of a single URL

2) distribution of independent IP addresses for a single URL

3) inbound and outbound distribution of a single URL (we can regard the Website access path as a directed graph)
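Concretely, the first two distributions need nothing more than one pass over the parsed logs. Here is a minimal sketch; the record keys "path" and "ip" are my own assumptions about the parsed log format, not a fixed schema.

```python
from collections import defaultdict

def url_stats(records):
    """Per-URL request totals and distinct-IP counts for one day of logs.

    `records` is an iterable of parsed log entries; the 'path' and 'ip'
    keys are assumed field names, not a fixed schema.
    """
    totals = defaultdict(int)   # path -> total requests
    ips = defaultdict(set)      # path -> distinct client IPs

    for r in records:
        totals[r["path"]] += 1
        ips[r["path"]].add(r["ip"])

    return {path: (totals[path], len(ips[path])) for path in totals}

# Usage idea: paths with very few requests from very few IPs are candidates.
# stats = url_stats(parsed_records)
# candidates = [p for p, (cnt, uniq) in stats.items() if cnt <= 10 and uniq <= 2]
```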

Next, let's go over the basic concepts of directed graphs with a small example.

[Figure: example directed graph with vertices 1-8; edges 1->2, 1->3, 4->1, 5->1, 6->5, 7->7; vertex 8 isolated]
Vertices (nodes): 1, 2, 3, 4, 5, 6, 7, 8. Each vertex corresponds to a URL in the access log.

Edges: 1->2, 1->3, 4->1, 5->1, 6->5, 7->7. Each edge corresponds to a jump from URL A to URL B.

In-degree: the number of edges coming into a vertex.

Out-degree: the number of edges leaving a vertex.

Node 1 has in-degree 2 and out-degree 2.

Nodes 2 and 3 have in-degree 1 and out-degree 0.

Nodes 4 and 6 have in-degree 0 and out-degree 1. They are pendant vertices, which are special; for example, a 404 page that redirects to the homepage produces such nodes.

Node 5 has in-degree 1 and out-degree 1.

Node 7 also has in-degree 1 and out-degree 1, but it points to itself: a self-loop. Most of the webshells we confirmed belong to this type.

Node 8 has in-degree 0 and out-degree 0: an isolated vertex.

Nodes 7 and 8 correspond to webshell access feature (3): the page is an isolated page.

(P.S. Graph-based anomaly detection occupies a very, very important place among security detection methods, for example in detecting machines infected by worms.)

Supplement 20151103: some webshells have an in-degree greater than 1, so "isolated" should be understood loosely, as having relatively little interaction with other pages.

When a webshell lists the files in the current directory as links, it will interact with other pages; webshells made up of multiple cooperating scripts also interact with each other.

Of course, not all isolated pages are webshells. The following situations also produce isolated pages:

(1) Normally isolated hidden pages, such as admin backends (mostly legacy pages, or pages without access control)

(Supplement 20151103: some people have asked why a backend would be isolated: doesn't it link to other pages after login? But what if the page is only opened and never logged into?)

(2) Scanner behavior: common vulnerability scans, PoC scans, and webshell scans (scans for common webshell paths plus one-liner payloads show up in the logs all the time). This is the most significant source of interference and must be removed.

For case (1), use a whitelist; for case (2), use scanner identification.

(P.S. Crawler identification, fingerprinting, and scanner identification (which, broadly speaking, derive from human-machine identification) can be considered basic building blocks of web security technology.)

Supplement 20151103: after the model ran for more than a month, it turned out that isolated pages are quite varied. Some sites have their own "security" check pages, upload/compression password-reset tools, and leftover files that were never fully cleaned up.

(2) webshell path features (auxiliary features)

In addition to webshells' distinctive access features, we can also use path features to help with extraction.

Let's look at a batch of real webshell paths.

[Figure: sample webshell paths collected in the wild]
Webshell paths implanted by different means have their own characteristics. Take uploaded webshells: if the upload component has protective measures (see "File Upload Vulnerability Defense: removing webshells written into images"), the file name will be rewritten (the 32-character hexadecimal names in the example) and the path will contain date components. This type of webshell also tends to land in static resource directories (images, styles, configuration files, etc.).

Supplement 20151103: when trojans are written in batches, especially through batch exploitation of the same vulnerability, the script generates file names automatically and writes them into a specific directory; similarity analysis of the paths will reveal this pattern.

(Text similarity is another basic tool in data analysis.)
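As a sketch of what such path similarity analysis could look like (the use of difflib and the threshold value are my own assumptions, not the author's implementation):

```python
import difflib
from itertools import combinations

def similar_path_pairs(paths, threshold=0.8):
    """Return pairs of paths whose character-level similarity exceeds `threshold`.

    Batch-written webshells (auto-generated names dropped into one directory)
    tend to show up as tight clusters of highly similar paths.
    """
    pairs = []
    for a, b in combinations(sorted(set(paths)), 2):
        ratio = difflib.SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs

# Example (hypothetical paths):
# similar_path_pairs(["/upload/2015/10/ab12cd34.php", "/upload/2015/10/ef56ab78.php"])
```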

(3) webshell time features (auxiliary features)

One could treat newly added pages as anomalous, but this approach has obvious defects:

(1) a webshell injected into an existing page will be missed

(2) normal site updates will be flagged as false positives

Therefore, this is used only as an auxiliary feature, mainly to reconstruct how the webshell was implanted. If a protection product such as a WAF is deployed, it can also reveal whether the defense was bypassed.

Supplement 20151103: The time attribute of the file can also be modified.

(4) webshell payload features (auxiliary features)

Traffic-based security detection and defense tools such as WAF and IDS use payload features (especially attack signatures) in network traffic as their main detection method.
For more information, see "Closing the Door on Web Shells" by Anuj Soni.

Here are some payloads actually discovered in the wild (desensitized):

SUPort=43958&SUUser=LocalAdministrator&SUPass=xxx&SUCommand=net+user+spider+%2Fadd+%26+net+localgroup+administrators+spider+%2Fadd&user=spider&password=spider&part=C%3A%5C%5C


whirlwind=%40eval%01%28base64_decode%28%24_POST%5Bz0%5D%29%29%3B&z0=QGluaV9zZXQoImRpc3BsYXlfZXJyb3JzIiwiMCIpO0BzZXRfdGltZV9saW1pdCgwKTtAc2V0X21hZ2ljX3F1b3Rlc19ydW50aW1lKDApO2VjaG8oIi0%2BfCIpOzskRD1kaXJuYW1lKCRfU0VSVkVSWyJTQ1JJUFRfRklMRU5BTUUiXSk7aWYoJEQ9PSIiKSREPWRpcm5hbWUoJF9TRVJWRVJbIlBBVEhfVFJBTlNMQVRFRCJdKTskUj0ieyREfVx0IjtpZihzdWJzdHIoJEQsMCwxKSE9Ii8iKXtmb3JlYWNoKHJhbmdlKCJBIiwiWiIpIGFzICRMKWlmKGlzX2RpcigieyRMfToiKSkkUi49InskTH06Ijt9JFIuPSJcdCI7JHU9KGZ1bmN0aW9uX2V4aXN0cygncG9zaXhfZ2V0ZWdpZCcpKT9AcG9zaXhfZ2V0cHd1aWQoQHBvc2l4X2dldGV1aWQoKSk6Jyc7JHVzcj0oJHUpPyR1WyduYW1lJ106QGdldF9jdXJyZW50X3VzZXIoKTskUi49cGhwX3VuYW1lKCk7JFIuPSIoeyR1c3J9KSI7cHJpbnQgJFI7O2VjaG8oInw8LSIpO2RpZSgpOw%3D%3



N3b31d1=cGhwaW5mbygpOw=


getpwd=admin&go=edit&godir=%2Fhtdocs%2Fbbs%2Fconfig%2F&govar=config_global.php


senv=eval(\"Ex\"%26cHr(101)%26\"cute(\"\"Server.ScriptTimeout%3D3600:On+Error+Resume+Next:Function+bd%28byVal+s%29%3AFor+i%3D1+To+Len%28s%29+Step+2%3Ac%3DMid%28s%2Ci%2C2%29%3AIf+IsNumeric%28Mid%28s%2Ci%2C1%29%29+Then%3AExecute%28%22%22%22%22bd%3Dbd%26chr%28%26H%22%22%22%22%26c%26%22%22%22%22%29%22%22%22%22%29%3AElse%3AExecute%28%22%22%22%22bd%3Dbd%26chr%28%26H%22%22%22%22%26c%26Mid%28s%2Ci%2B2%2C2%29%26%22%22%22%22%29%22%22%22%22%29%3Ai%3Di%2B2%3AEnd+If%22%22%26chr%2810%29%26%22%22Next%3AEnd+Function:Response.Write(\"\"\"\"->|\"\"\"\"):Ex\"%26cHr(101)%26\"cute(\"\"\"\"On+Error+Resume+Next:\"\"\"\"%26bd(\"\"\"\"526573706F6E73652E5772697465282268616F72656E2229\"\"\"\")):Response.Write(\"\"\"\"|<-\"\"\"\"):Response.End\"\")\")"
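These payloads become much less opaque once decoded. The following is a small illustrative sketch (my own addition, not part of the original workflow): it parses a query string and also tries base64-decoding each parameter value, which is how the z0=... style payload above reveals the PHP it carries. The keyword list is an assumption, not an exhaustive rule set.

```python
import base64
import re
from urllib.parse import parse_qs

# Keywords that commonly appear in one-liner webshell payloads (an assumption).
SUSPICIOUS = re.compile(r"eval|assert|base64_decode|Execute|set_time_limit", re.IGNORECASE)

def inspect_query(query_string):
    """Flag parameters whose raw or base64-decoded value looks like code."""
    findings = []
    for name, values in parse_qs(query_string, keep_blank_values=True).items():
        for value in values:
            if SUSPICIOUS.search(value):
                findings.append((name, "raw", value[:60]))
            try:
                padded = value + "=" * (-len(value) % 4)
                decoded = base64.b64decode(padded).decode("utf-8", "ignore")
            except Exception:
                continue
            if SUSPICIOUS.search(decoded):
                findings.append((name, "base64", decoded[:60]))
    return findings

# The z0=... value in the second payload above is base64-encoded PHP source
# (it begins with @ini_set and @set_time_limit calls), which trips the keyword check.
```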


However, in log analysis the payload can only serve as an auxiliary feature, for two reasons:

A. Log fields are often incomplete, so the payload may not be recorded at all (POST bodies in particular).

B. In most cases attackers merely check whether the webshell is still alive, which produces no attack signature, or they encrypt the payload to bypass signature detection. In the example below, the liveness-check payload of WSO 2.5.1 is simply a=RC.

[Figure: WSO 2.5.1 liveness-check request, payload a=RC]

But do not underestimate this intuitive feature: conventional webshells still make up a large proportion, and during webshell validation, especially when the response page has no GUI, the payload plays a role that cannot be ignored.

2. webshell confirmation

Think about how a human analyst determines whether a page is a webshell: open it and see what it looks like, i.e., request replay. Two kinds of problems must be considered when replaying requests.

(1) Whether replay can cause destructive operations. Replay adds load to the site (there are precedents of well-intentioned tools doing harm, such as website speed-monitoring and CC-protection services overwhelming low-bandwidth, low-performance sites), and it may trigger file deletion, account resets, or system reinstallation (e.g. /setup-config.php). This is one reason not to simply replay every access log entry (the sheer volume of logs is another).

(2) Whether replay violates user privacy. Strict log-storage rules forbid storing cookies, POST bodies, and other fields that may contain sensitive user data, or require them to be desensitized before storage and viewed only with the user's authorization. Of course, a more practical reason for not storing them is the huge storage cost.

(P.S. Sometimes it feels like security staff are guarded against as if they were the thieves. Anyone running a scanner (for vulnerability confirmation), especially across the whole network, effectively ends up with their own social engineering database and their own map of weaknesses.)

For case (1): rate-limit the replay, filter the replayed content, strip cookie authentication information, and filter out dangerous operations. Case (2) is a bit more delicate.

With the replay problem solved, the next step is to judge whether a path is a webshell from the replayed response page. Let's first look at the types of response pages.

(1) The response page is not blank (a GET request to the URL returns a page with content)

Example 1: a webshell login page


<form align="center" method="post">Password: <input type="password" name="pass"><input type="submit" value=">"></form>

The login box speaks for itself (a blindingly obvious login box).

Example 2: a file-upload webshell

<form action="?cmd=up" method="post" enctype="multipart/form-data" name="form1"><input type="file" name="file" size="17" class="Input"><input type="submit" name="Submit" value="submit" class="Input"></form>


Example 3: an unauthenticated trojan

[Figure: response page of an unauthenticated webshell]
Example 4: the WSO webshell

a:4:{s:5:"uname";s:81:"Linux li676-178 3.19.1-x86_64-linode53 #1 SMP Tue Mar 10 15:30:28 EDT 2015 x86_64";s:11:"php_version";s:5:"5.6.9";s:11:"wso_version";s:5:"2.5.1";s:8:"safemode";b:0;}

(2) The response page is blank.

If a GET request to the URL returns a blank page, replay it with its payload (desensitized in the example).

The detection scheme for this case relies on two more features:

(5) webshell behavior features

Abstract the critical paths of webshell attack behavior into a mathematically described policy library, and extract anomalies against it.

(6) webshell response page features: content / structure / visual signatures

(For more information, see web page similarity methods.)


My webshell Sample Library: https://github.com/tanjiti/webshellSample

To review: we detect webshells based on features. The features discussed so far are:

(1) webshell access features (main features) - webshell extraction stage

(2) webshell path features (auxiliary features) - webshell extraction stage

(3) webshell time features (auxiliary features) - webshell extraction stage

(4) webshell payload features (auxiliary features) - webshell extraction stage

(5) webshell behavior features - webshell extraction stage

(6) webshell response page features (content / structure / visual) - webshell validation stage

Finally, there is one more feature: (7) webshell attack association features.

"If a website is implanted with webshell, more than one site is implanted"

"If a website is implanted with a webshell, more than one webshell is implanted"

This is where search shines. Webshell visitor features (IP/UA/Cookie), payload features, and time features can all be used for correlated searches, much like what 360 did with the XcodeGhost incident once their base data was in place (here I quote a passage from an article by Chunan that I admire very much: "Always remember that data infrastructure is not about hoarding a pile of garbage; behind it is a complete data lifecycle solution. Collection, ETL, data quality, and fast data interaction are the most important tasks."). Using search to correlate the data and reconstruct events along a timeline tells an interesting story.
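As a toy illustration of that kind of correlated search (plain Python over already-parsed records; a real deployment would sit on top of a search cluster such as elasticsearch, and the field names here are assumptions):

```python
def attacker_timeline(records, ip=None, ua=None, payload_marker=None):
    """Collect log records matching the given visitor/payload features and
    order them by time, so webshell implantation and use read as one story."""
    hits = []
    for r in records:
        if ip and r.get("ip") != ip:
            continue
        if ua and ua not in r.get("user_agent", ""):
            continue
        if payload_marker and payload_marker not in r.get("query", ""):
            continue
        hits.append(r)
    return sorted(hits, key=lambda r: r["timestamp"])

# e.g. attacker_timeline(parsed_records, ip="1.2.3.4", payload_marker="envlpass=")
```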

Supplement 20151103: There are two difficulties in searching:

(1) how to correlate results and display them by timeline and behavior

(2) how to build the underlying data infrastructure, for example with elasticsearch: how long to retain the data, how to build indexes, how to load-balance the cluster
(Speaking of data infrastructure, there were plenty of detours: data-transfer pitfalls caused by Hadoop version fragmentation, pitfalls when log fields changed, and inexplicable cluster failures fixed by restarting. Fortunately, many friends around me provided help; many thanks to the Hadoop folks.)

II. Implementation

1. Data Acquisition

Data source: web access logs

Obtaining method: if the data is stored in HDFS, copy it to the model-computing cluster with distcp.

(P.S. Just getting the data involved plenty of pitfalls, such as transferring data between different Hadoop versions. Hadoop fragmentation is partly a product of engineering culture: engineers love assembling their own complete systems from open-source pieces, which, to be fair, has also produced many full-stack engineers.)

2. Feature Extraction

Http_host
Root_domain
Url
Path
Query string
Referer
Ip
Timestamp
Http_response_code
Http_method
Request_body (optional field)
Cookie (optional field)
User_agent

3. Preprocessing

Before computing statistics, we need to preprocess the data.

Preprocessing 1: split logs by hour (mainly to keep computation time manageable when log volume is large)

Preprocessing 2: extract logs with a response code of 2xx and 3xx

Preprocessing 3: normalization. This is very important; without it the directed graph becomes enormous and MapReduce batch processing cannot handle it.

Host normalization: convert *.xxx.com or .xxx.com to www.xxx.com

Path normalization: collapse repeated "/", replace "\" with "/"

Referer normalization:

(1) restore relative addresses to absolute addresses, e.g. /a.php => www.xxx.com/a.php

(2) set the referer to null when its host is not in the site's root domain, when the field is empty, or when it does not conform to the referer format

(3) remove the query part
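A minimal sketch of these normalization rules (the regexes and helper names are my own approximation, not the exact production code):

```python
import re
from urllib.parse import urlparse

def normalize_host(host):
    """Map *.xxx.com / .xxx.com style hosts onto www.xxx.com."""
    host = host.lower().strip()
    if host.startswith("*.") or host.startswith("."):
        host = "www." + host.lstrip("*.")
    return host

def normalize_path(path):
    """Collapse repeated slashes and turn backslashes into forward slashes."""
    return re.sub(r"/+", "/", path.replace("\\", "/"))

def normalize_referer(referer, host, root_domain):
    """Absolutize relative referers, drop off-domain or malformed ones, strip the query."""
    if not referer:
        return None
    if referer.startswith("/"):                                # relative -> absolute
        referer = host + referer
    parsed = urlparse(referer if "://" in referer else "http://" + referer)
    if not parsed.netloc or root_domain not in parsed.netloc:  # not our root domain
        return None
    return parsed.netloc + normalize_path(parsed.path)         # query part removed
```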

4. Model Creation

1) webshell extraction (fully automated)

Step 1: build a (path, referer) directed graph, then extract the isolated pages (in-degree 0, out-degree 0) and self-loop pages (in-degree 1, out-degree 1, pointing to themselves). [webshell access features; a minimal sketch of this step appears after the step list]

Step 2: remove paths that do not look like valid paths (e.g. that fail to match a pattern such as (?:https?://)?[-./\w]+)

Step 3: remove static resource paths (suffixes such as zip, swf, mp3, ico, pdf, torrent, etc.)

Step 4: remove the whitelist path (for example, the main page index.php, index.asp, index.aspx, index.ashx, index.html)

Step 5: keep only paths whose suffixes webshells can have (asp, aspx, php, jsp, py, cgi, pl, java, sh, war, cfm, phtml)

Step 6: remove scanner paths, using a scanner IP reputation library (a cloud scanner IP reputation library plus a time-sensitive local one) and scanner behavior (a simple approach: cluster by ip + host and drop sources whose requests per unit time exceed M and whose distinct paths exceed N)

Step 7: Remove paths with non-200 Response codes

Step 8: assign webshell confidence based on path features; paths matching webshell path features (e.g. common upload directories, random file names) are marked 1. [webshell path features]

Step 9: assign webshell confidence based on payload features; matching paths are marked 1. This is essentially the same as WAF webshell detection rules, but looser, since false positives are less of a concern here. If the logs carry a WAF detection-result field, that field can also be used to mark confidence (e.g. envlpass=). [webshell payload features]

Step 10: remove paths whose number of distinct visiting IPs and total access count exceed the thresholds
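As promised in Step 1, here is a minimal sketch of building the (path, referer) directed graph and pulling out isolated and self-loop pages. The record field names (path, referer_path) are assumptions, and referers are expected to be normalized already:

```python
from collections import defaultdict

def graph_candidates(records):
    """Build the referer_path -> path directed graph and return the paths that
    match the webshell access features: isolated (in 0 / out 0) or self-loop."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    self_loop = set()
    nodes = set()

    for r in records:
        path = r["path"]
        nodes.add(path)
        ref = r.get("referer_path")      # normalized; None if empty/off-site/invalid
        if ref is None:
            continue
        nodes.add(ref)
        out_deg[ref] += 1
        in_deg[path] += 1
        if ref == path:
            self_loop.add(path)

    isolated = {n for n in nodes if in_deg[n] == 0 and out_deg[n] == 0}
    loops = {n for n in self_loop if in_deg[n] == 1 and out_deg[n] == 1}
    return isolated | loops

# The remaining steps (suffix filters, whitelist, scanner removal, thresholds)
# would then be applied to the returned candidate set.
```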

2) webshell confirmation

Step 1: perform a GET replay (rate-limited) of each extracted path, carrying its parameters if it has any.

(P.S. Some small, tricky webshells report that the page does not exist when replayed without their parameters.)

Step 2: Remove paths with non-200 Response codes

Supplement: changed to keep 401 responses as well, to avoid missing webshells protected by HTTP Basic authentication.

Step 3: remove paths whose response is actually a rewritten 404 page

Method 1: replay two random file names and check whether the response bodies have the same size; if they do, the site rewrites its 404 pages.

Method 2: fuzzy hashing works its magic again. Compute the fuzzy hash of the rewritten 404 response content and set a similarity threshold; responses within the threshold are treated as rewritten 404 pages. The example removes the SafeDog rewrite page. (A fuzzy-hashing sketch appears after this step list.)
Step 4: replay the blank-response pages with their payloads (rate-limited and desensitized)

Step 5: compute the fuzzy hash of each response page and cluster the results

Step 6: load the webshell fuzzy-hash feature library and mark paths whose similarity falls within the threshold as webshells. [response page similarity feature]

Step 7: web page information extraction, including static extraction, dynamic extraction, title extraction, form extraction, and link and image extraction (fully automated)

Step 8: abstract the key paths of webshell behavior into policies and extract webshell anomalies against the policy library

Step 9: confirm attacks automatically against webshell sample signatures (page content/structure, visual); anomalies that do not match any sample signature are confirmed manually, to plug the gaps

Step 10: extract and confirm the visitor features (IP/UA/Cookie), payload features, and time features of the confirmed webshells, run correlated searches, sort the results by time, and reconstruct the attack. [webshell attack association features]
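Here is the fuzzy-hashing sketch referred to in Step 3 (it also covers the comparison idea behind Steps 5 and 6). It assumes the ssdeep Python binding is installed; the thresholds are illustrative, not tuned values from the article:

```python
import ssdeep  # python-ssdeep binding, assumed to be installed

def is_rewritten_404(response_body, rewritten_404_hash, threshold=80):
    """True if the replayed response looks like the site's 404-rewrite page."""
    return ssdeep.compare(ssdeep.hash(response_body), rewritten_404_hash) >= threshold

def match_webshell_samples(response_body, sample_hashes, threshold=60):
    """Return the sample hashes whose similarity to the response exceeds the threshold."""
    h = ssdeep.hash(response_body)
    return [s for s in sample_hashes if ssdeep.compare(h, s) >= threshold]
```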

5. Model Evaluation

Evaluation normally looks at recall and precision. But how do I confirm that all webshells have been detected? I implant webshells on my own sites and check whether they are all detected, but this method has an obvious problem: the access traffic to those sites is not representative. Still to be solved.

III. Model Defects

Areas where the model needs improvement:

Problem 1: referer forgery

Problem 2: image webshells (missed because static file paths are filtered out)

Problem 3: webshells are implanted into existing files (because they are not isolated)

IV. Other Detection Methods

The previous sections described how to find webshells through web log analysis. Now let's review traditional webshell detection products.

(P.S. Commercial detection technologies are always a source of inspiration; their methods may be unglamorous, but they work.)

WAF/IDS/IPS: checks whether HTTP requests comply with webshell communication features (passive detection)

Vulnerability scanner: scans for known implanted backdoors, such as common webshell paths and China Chopper ("caidao") connections (active detection)

Backdoor detection and removal tool: checks whether a webshell malicious file exists in the file system.

Directory monitoring tools: file integrity monitoring of modification time, owner, and permissions (newly added webshell files, and the timestamps of existing files that have a webshell injected, will show changes)

SIEM log analysis (forensics) tools: check for webshell access events (existing ones are generally based on signatures and simple correlation, and rarely use machine-learning methods)

The techniques used by these products fall into static and dynamic detection, the same split used in the anti-virus field.

1. Static Detection

(1) file content detection: checks whether webshell features are included, such as common webshell functions.

Disadvantage: bypassed by all manner of webshell obfuscation.


For details about such obfuscation, see:
http://www.bkjia.com/Article/201509/443305.html

For more information about the detection method, see:
PHP Shell Detector

(2) File content detection: detecting encryption (obfuscation)

Building on the above, this adds a judgment of whether a file is encrypted or obfuscated:

1. Index of Coincidence: essentially probability. In short, meaningful text has a high index of coincidence, while encrypted or obfuscated strings have a low one.

2. Information Entropy: essentially probability. In short, meaningful text has low entropy, while encrypted or obfuscated strings have high entropy.

3. Longest Word: a rough assumption that the longer the longest unbroken string, the more likely the content is encrypted or obfuscated.

4. Compression ratio.

5. Signature: traditional pattern matching.

For more information about the detection method, see:

NeoPi Method
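For intuition, here is a rough sketch of two of those heuristics, information entropy and longest word, in the spirit of NeoPi (my own approximation, not NeoPi's actual code; the thresholds in the usage comment are hypothetical):

```python
import math
import re

def shannon_entropy(data):
    """Bits per byte; encrypted or obfuscated blobs tend to score high."""
    if not data:
        return 0.0
    counts = {}
    for b in data:
        counts[b] = counts.get(b, 0) + 1
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def longest_word(text):
    """Length of the longest unbroken token; very long tokens suggest encoded payloads."""
    words = re.split(r"[\s\"'(){};,=]+", text)
    return max((len(w) for w in words), default=0)

# Hypothetical usage with made-up thresholds:
# data = open("suspect.php", "rb").read()
# if shannon_entropy(data) > 5.5 or longest_word(data.decode("latin-1")) > 300:
#     print("possibly obfuscated")
```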

Disadvantages:

The problems are explained to some extent in the first reference above; the second merely demonstrates them.

Data-analysis methods, and machine learning in particular, lean toward what is statistically likely over large amounts of data; their coverage of special cases is poor. Their advantage on common cases is clear, but they cannot be used on their own.

(3) File hash detection: build a hash library of webshell samples and compare files against it; files within the similarity threshold are flagged as suspicious.
ssdeep webshell detection (fuzzy hashing)

(4) File integrity detection

File creation time (newly added webshell files), modification time (webshells injected into existing files), file permissions, owner.
Disadvantage: operationally impractical for sites that are updated frequently.

2. Dynamic Detection

Sandbox technology: run the script in a dynamic-language sandbox and judge by its runtime behavior.

Disadvantages:

Files it cannot decrypt will not execute, and badly written files (with syntax errors) will not execute either.

Products for reference:

Baidu's webshell Detection Service webdir

V. Conclusion

This article took almost half a month to write. I am a collecting maniac: I like to collect data and collect methods, and everything collected needs to be verified, so it took a lot of time. But the process was quite interesting; it is a real pleasure when the expertise of different fields comes together. You are welcome to discuss; it doesn't matter even if you scold me.
Postscript:

I did in fact get scolded, but not, as expected, for the article being weak or the technical approach having problems; it was a personal attack alleging plagiarism. Apart from writing this technical blog, I write reading notes and share the good material I find. I am not part of the security circle, do not attend conferences, cannot vouch for myself, and so can only put up with it.
