Analyze risk data using the Python tool

Source: Internet
Author: User

With the rapid growth of network security data, applying data analysis techniques to network security has become a research hotspot in the industry. In this short lecture, Xiao An uses Python tools to do a simple analysis of risk data, mainly honeypot log data, to see what people generally use proxy IPs for.

You may ask Xiao An what a honeypot is. Online, some hackers or technical people often do certain "things" and need to hide their identity, so they work through proxy IPs. A honeypot is a newer active-defense security technology: a deception system set up specifically to be attacked or intruded upon. It can be used both to protect production systems and to collect information about hackers, and it is a flexible and versatile network security technique.

Put more simply, the honeypot offers a large number of proxy IPs, lures wrongdoers into using them, and collects information about those users.

Introduction to Data analysis tools

As the saying goes, to do a good job one must first sharpen one's tools. Here Xiao An introduces some of the "magic weapons" of Python data analysis.

pandas, the well-known Python data analysis library

pandas is a NumPy-based tool created for data analysis tasks. It is built around two core data structures, Series and DataFrame, which correspond to a one-dimensional sequence and a two-dimensional table respectively.

pandas provides a large number of functions and methods for processing data quickly and easily. The library is easy to use, its interfaces are well abstracted, and its documentation is excellent. You will soon find that it is one of the main reasons Python is a powerful and efficient data analysis environment.

Visualization of data

Matplotlib is the most commonly used and best-known plotting library in Python. It provides a complete set of command-style APIs similar to MATLAB's, which makes it well suited to interactive plotting.

With these "magic weapons" in hand, Xiao An will now use them to walk through a brief analysis of the honeypot proxy data.

1. Import tools: load the data analysis packages

Start IPython Notebook and load the runtime environment:

%matplotlib inline
import pandas as pd
from datetime import timedelta, datetime
import matplotlib.pyplot as plt
import numpy as np

2. Data preparation

The data Xiao An analyzes is mainly log records of users accessing sites through the proxy IPs, and the raw data is stored as CSV. This is a first chance to introduce pandas.read_csv, a commonly used function that reads data into a DataFrame.

analysis_data = pd.read_csv('./honeypot_data.csv')

That's right, one line of code reads all the data into a two-dimensional DataFrame variable. Simple, isn't it? The IO tools in pandas can also read large files in chunks. In Xiao An's test, a full load of about 21.5 million records took roughly 90 seconds, which is quite good.
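
Chunked reading is mentioned above but not shown, so here is a minimal sketch of it. It assumes the log has timestamp, module and srcip columns (module and srcip appear later in this article, but the sample rows here are invented), and it reads from an in-memory string instead of the real honeypot_data.csv so that it runs anywhere.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for honeypot_data.csv; the rows are invented.
csv_text = "timestamp,module,srcip\n" + "\n".join(
    f"2016-06-0{i % 5 + 1},proxy,10.0.0.{i % 7}" for i in range(20)
)

total_rows = 0
n_chunks = 0
module_counts = Counter()

# With chunksize set, read_csv yields DataFrames of at most that many rows
# instead of loading the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=6):
    n_chunks += 1
    total_rows += len(chunk)
    module_counts.update(chunk["module"].value_counts().to_dict())

print(n_chunks, total_rows, dict(module_counts))
```

For a real file, the path would replace io.StringIO(csv_text). A suitable chunksize depends on available memory.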

3. Data overview

Generally, before analyzing data we need a rough picture of it: how much there is, what variables it has, how those variables are distributed, and whether there are duplicates, missing values, or obvious anomalies. Let's take a first look at the data together.

Use the shape attribute to view the number of rows and columns:

analysis_data.shape
Out: (21524530, 22)  # a DataFrame with 22 dimensions and 21,524,530 records

The head() method shows the first 5 rows by default, and tail() shows the last 5 rows by default. You can also pass a parameter to choose how many rows to show:

analysis_data.head(10)

From this we can see that each record contains the date the proxy IP was used, the proxy header information, the domain name accessed through the proxy, the proxy method, the source IP, the honeypot node information, and so on. Xiao An also has to mention the one method used in every data analysis: describe. The pandas describe() function gives a quick statistical summary of the data:

For numeric data, it computes for each variable:

the count, mean, maximum, minimum, standard deviation, 50th percentile (median), and so on;

For non-numeric data, it gives for each variable:

the number of non-null values, the number of unique values (like DISTINCT in a database), the most frequent value, and that value's frequency.

From head() we can see that the data contains both numeric and non-numeric variables. We can first use the dtypes attribute to check the data type of each DataFrame column, then use the select_dtypes method to split the columns by type. The statistics returned by describe then give a first understanding of the data:

# df is the full DataFrame loaded from the CSV
df.select_dtypes(include=['O']).describe()

df.select_dtypes(include=['float64']).describe()
        proxy_retlength   scan_os_fp   scan_os_sub_fp   scan_scan_mode   dtype_details
count   6.417354e+06      0.0          0.0              0.0
mean    1.671744e+03      NaN          NaN              NaN
std     3.104775e+04      NaN          NaN              NaN
min     0.000000e+00      NaN          NaN              NaN
25%     NaN               NaN          NaN              NaN
50%     NaN               NaN          NaN              NaN
75%     NaN               NaN          NaN              NaN
max     2.829355e+07      NaN          NaN              NaN

Looking at these per-variable statistics, we can see that the average length of the data returned through the proxy is about 1,670 bytes. We can also see that fields such as scan_os_sub_fp and scan_scan_mode are empty. This gives us a rough overall picture of the data.
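
To make the two kinds of describe() output concrete, here is a small sketch on invented toy data. The column names proxy_host and proxy_retlength are borrowed from the article, but the values are made up. It selects columns with include/exclude "number" rather than 'O' and 'float64', an assumption made so the split does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Toy data: values are invented for illustration only.
toy = pd.DataFrame({
    "proxy_host": ["a.example", "b.example", "a.example", None],
    "proxy_retlength": [100.0, 300.0, None, 200.0],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = toy.select_dtypes(include=["number"]).describe()

# Non-numeric columns: count (non-null), unique, top (most frequent), freq.
object_summary = toy.select_dtypes(exclude=["number"]).describe()

print(numeric_summary)
print(object_summary)
```

Note that count excludes nulls in both summaries, which is why a column that is entirely empty shows a count of 0.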

4. Data Cleansing

Source data usually contains null values or even entirely empty columns, which cost time and efficiency during analysis. After previewing the data summary, this invalid data needs to be dealt with.

In general, null data can be removed with the dropna method. On first use, however, a check showed that dropna() had removed almost every row. The pandas user manual explains why: called without arguments, dropna() removes every row that contains any null value.

If you only want to remove columns in which every value is null, add the axis and how parameters:

analysis_data.dropna(axis=1, how='all')

Alternatively, use the subset parameter of dropna to remove rows with null values in specific columns, or set thresh to remove rows that have fewer than thresh non-null values.

analysis_data.dropna(subset=['proxy_host', 'srcip'])  # remove rows where proxy_host or srcip has no value
analysis_data.dropna(thresh=10)  # remove rows with fewer than 10 non-null fields

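
The four dropna variants discussed in this section behave quite differently, so a side-by-side sketch on a tiny invented frame may help. The column names follow the article, the values are made up, and thresh=2 is used instead of 10 only because the toy frame has four columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "proxy_host": ["a.example", None, "c.example"],
    "srcip": ["10.0.0.1", "10.0.0.2", None],
    "scan_os_fp": [np.nan, np.nan, np.nan],      # an entirely empty column
    "proxy_retlength": [120.0, np.nan, 80.0],
})

rows_any = toy.dropna()                                    # every row has a null, so nothing is left
cols_all = toy.dropna(axis=1, how="all")                   # drops only the all-null column
rows_subset = toy.dropna(subset=["proxy_host", "srcip"])   # both fields must be present
rows_thresh = toy.dropna(thresh=2)                         # keep rows with at least 2 non-null values

print(len(rows_any), list(cols_all.columns))
print(list(rows_subset.index), list(rows_thresh.index))
```

None of these calls modifies toy in place. Each returns a new DataFrame, so the result has to be assigned if it is to be kept.
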
5. Statistical analysis

After this first look, we know the raw data has 22 variables. For the purposes of this analysis, I will select a subset of the variables. This is where the pandas slicing method loc comes in.

loc[start_row_index:end_row_index, ['timestamp', 'proxy_host', 'srcip']] is an important pandas slicing method: the part before the comma slices rows, and the part after the comma slices columns, that is, selects the variables to analyze.

As shown below, I select the date, host, and source IP fields:

analysis_data = analysis_data.loc[:, ['timestamp', 'proxy_host', 'srcip']]
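
A short sketch of loc on invented data follows, mainly to show one detail that is easy to miss: row slices in loc are label-based and include both endpoints. The frame and its values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "timestamp": ["2016-06-01", "2016-06-01", "2016-06-02", "2016-06-03"],
    "module": ["proxy", "scan", "proxy", "proxy"],
    "proxy_host": ["a.example", None, "b.example", "a.example"],
    "srcip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
})

# All rows, three chosen columns (the selection used in the article).
all_rows = toy.loc[:, ["timestamp", "proxy_host", "srcip"]]

# Rows labelled 1 through 2 inclusive, two columns.
some_rows = toy.loc[1:2, ["timestamp", "srcip"]]

print(all_rows.shape)
print(some_rows)
```

Because the column selection drops module, any later filter on module has to run before this step or against the full frame.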

First, let's look at how much the honeypot proxies are used each day. We count the records by day to get the daily volume (PV) and plot the trend.

daily_proxy_data = analysis_data[analysis_data.module == 'proxy']
daily_proxy_visited_count = daily_proxy_data.timestamp.value_counts().sort_index()
daily_proxy_visited_count.plot()

As for dropping columns: besides columns of invalid values and columns the analysis does not need, redundant columns in the table itself, such as the DataFrame index number or type descriptions, should also be cleaned up at this stage. Discarding them produces a new, smaller dataset, which improves computational efficiency.

The analysis shows that honeypot proxy usage spiked on June 5, June 19-22, and June 25. So something unusual happened in the data on those days. What exactly? No rush: Xiao An will work through it below to find out which people (source IPs) were doing what "bad things".
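
The daily count above only works as written if the timestamp column already holds one value per day. As a hedged sketch, assuming instead that timestamps carry a time of day, the date part can be taken first. The data below is invented and the plotting call is left out so the example needs only pandas.

```python
import pandas as pd

toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-06-01 08:00", "2016-06-01 21:30", "2016-06-02 10:15",
        "2016-06-02 11:00", "2016-06-02 23:59", "2016-06-03 00:01",
    ]),
    "module": ["proxy", "proxy", "proxy", "scan", "proxy", "proxy"],
})

proxy_rows = toy[toy.module == "proxy"]

# Reduce each timestamp to its calendar date, count rows per date,
# then sort by date so the series reads as a time line.
daily_pv = proxy_rows.timestamp.dt.date.value_counts().sort_index()

print(daily_pv)
```

Calling daily_pv.plot() on the result would draw the same kind of trend chart as in the article.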

Going further, now that we know the data has anomalies, let's look at the daily number of distinct IPs and how it grows. Grouping by day with groupby and then calling nunique() gives the number of distinct IPs per day.

daily_proxy_data = analysis_data[analysis_data.module == 'proxy']
# group by day (the timestamp column) and count the distinct source IPs in each group
daily_proxy_visited_count = daily_proxy_data.groupby(['timestamp']).srcip.nunique()
daily_proxy_visited_count.plot()
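
Here is a minimal sketch of the daily distinct-IP count and its day-over-day growth, on invented data. The day column is an assumed, already-truncated date field, and diff() is one simple way to express growth, not necessarily the one used for the original chart.

```python
import pandas as pd

toy = pd.DataFrame({
    "day": ["2016-06-01", "2016-06-01", "2016-06-01",
            "2016-06-02", "2016-06-02", "2016-06-02"],
    "srcip": ["10.0.0.1", "10.0.0.1", "10.0.0.2",
              "10.0.0.1", "10.0.0.3", "10.0.0.4"],
})

# nunique counts each source IP once per day, however many requests it made.
daily_unique_ips = toy.groupby("day").srcip.nunique()

# Change in distinct IPs relative to the previous day (NaN for the first day).
daily_growth = daily_unique_ips.diff()

print(daily_unique_ips)
print(daily_growth)
```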

So what are most people (source IPs) actually up to? Let's see which hosts are visited the most, that is, how many distinct IPs are associated with each host. Here we look only at the top 10 hosts.

First select the host and IP fields, group by domain name (host) with the groupby method, and then count the unique IPs that access each domain name.

host_associate_ip = proxy_data.loc[:, ['proxy_host', 'srcip']]
grouped_host_ip = host_associate_ip.groupby(['proxy_host']).srcip.nunique()
print(grouped_host_ip.sort_values(ascending=False).head(10))
Proxied host              Source IPs
www.gan**.com             1113
wap.gan**.com             913
webim.gan**.com           710
cgi.** 32T                615
loc.*                     543
(host not recoverable)    515
(host not recoverable)    455
(host not recoverable)    428
(host not recoverable)    405

And what were they all doing? A look at the log data shows they were collecting information such as used-car prices and job postings. Judging from the popular hosts, people mainly use the proxies to fetch information from sites such as Baidu, QQ, Google, and Bing.

Next, let's see who uses the proxy IPs most heavily, that is, which source IP accesses the largest number of different hosts.

host_associate_ip = proxy_data.loc[:, ['proxy_host', 'srcip']]
grouped_host_ip = host_associate_ip.groupby(['srcip']).proxy_host.nunique()
print(grouped_host_ip.sort_values(ascending=False).head())
Source IP      Distinct hosts accessed
64...122       2191
195.*.1        (count not recoverable)
27...202       452
212..**.13     (count not recoverable)
110...39       430
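
The two rankings above are the same groupby applied in opposite directions. This is sketched below on invented data. The host and IP values are placeholders, not values from the honeypot logs.

```python
import pandas as pd

toy = pd.DataFrame({
    "proxy_host": ["a.example", "a.example", "a.example", "b.example", "c.example"],
    "srcip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1", "10.0.0.1"],
})

# Popularity of a host: how many distinct source IPs reached it.
ips_per_host = toy.groupby("proxy_host").srcip.nunique().sort_values(ascending=False)

# Breadth of a user: how many distinct hosts one source IP reached.
hosts_per_ip = toy.groupby("srcip").proxy_host.nunique().sort_values(ascending=False)

print(ips_per_host.head(10))
print(hosts_per_ip.head(10))
```

A host with many distinct IPs is a popular target, while an IP with many distinct hosts looks more like a crawler.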

We found that the user at IP 123.*.155 has a large number of access records. The logs show that he was collecting hotel information in bulk. Now that we roughly know who is doing what, let's see how long they have been using the proxies. The full code is not given here, only the following pseudo-code:

date_ip = analysis_data.loc[:, ['timestamp', 'srcip']]
grouped_date_ip = date_ip.groupby(['timestamp', 'srcip'])
# compute the access dates for each source IP (srcip)
all_srcip_duration_times = ...
# compute the longest run of consecutive days
duration_date_cnt = count_date(all_srcip_duration_times)
Source IP      Consecutive days of use
80...38        32
213.*.128      31
125..*.161     22
120..*.161     22
50.*.67        19
114...97       19
162.*.113      19
192.*.226      17
182...205      17
112.*.108      16
123..*.130     16
61...156       15
61...152       15
58...130       15
216.*.106      14
101...117      14
124...126      14
79.*.254       13
115..*.130     13
61...79        13
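
The pseudo-code above leaves the consecutive-day calculation open. One possible way to do it is sketched here. This is not the author's implementation: longest_run_in_days is a hypothetical helper standing in for count_date, and the data is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-06-01", "2016-06-02", "2016-06-02", "2016-06-03",  # 10.0.0.1: three days in a row
        "2016-06-05",                                            # 10.0.0.1: gap, then a new run
        "2016-06-01", "2016-06-03", "2016-06-04",                # 10.0.0.2: longest run is two days
    ]),
    "srcip": ["10.0.0.1"] * 5 + ["10.0.0.2"] * 3,
})


def longest_run_in_days(timestamps: pd.Series) -> int:
    """Length of the longest streak of consecutive calendar days (hypothetical helper)."""
    days = timestamps.dt.normalize().drop_duplicates().sort_values()
    # A new run starts wherever the gap to the previous active day is not exactly one day.
    run_id = (days.diff() != pd.Timedelta(days=1)).cumsum()
    return int(run_id.value_counts().max())


duration_date_cnt = (
    toy.groupby("srcip").timestamp.apply(longest_run_in_days)
    .sort_values(ascending=False)
)

print(duration_date_cnt)
```

The cumulative sum labels each streak with its own id, so the largest group size is the longest streak. On millions of rows a per-group apply may be slow, so deduplicating (srcip, day) pairs first would be a reasonable refinement.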

Now we also know what those people do, who has used the proxies the longest, and so on. Pulling the access logs for the user at IP 80...38 shows that this user spent a long time fetching images from Sohu.

The honeypot has nodes deployed across the country. Let's count the honeypot nodes scanned by each source IP to understand how widely each IP scans. The results are shown below:

# Total number of honeypot nodes scanned by each source IP
node = df[df.module == 'scan']
node = node.loc[:, ['srcip', 'origin_details']]
grouped_node_count = node.groupby(['srcip']).count()
print(grouped_node_count.sort_values(['origin_details'], ascending=False).head(10))
Source IP      Honeypot nodes scanned
106.*.161      9
45...214       9
94.*.174       8
119...216      7
61...222       7
182...205      6
182..*.75      6
42..*.89       6
123...64       6
42..*.128      6
42..*.106      6
42..*.82       6
114...157      6
80...38        6
42..*.149      6
115...163      6
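
When counting scanned nodes, count() and nunique() answer different questions. The sketch below uses invented data and assumes that origin_details identifies the honeypot node, which the article implies but does not state.

```python
import pandas as pd

toy = pd.DataFrame({
    "module": ["scan", "scan", "scan", "scan", "proxy"],
    "srcip": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "origin_details": ["node-a", "node-a", "node-b", "node-a", "node-c"],
})

scans = toy[toy.module == "scan"]

# count(): number of scan records per source IP (repeat visits to a node count again).
scan_records = scans.groupby("srcip").origin_details.count()

# nunique(): number of different nodes each source IP touched, i.e. its coverage.
distinct_nodes = scans.groupby("srcip").origin_details.nunique()

print(scan_records.sort_values(ascending=False).head(10))
print(distinct_nodes.sort_values(ascending=False).head(10))
```

If the scan log holds one row per (source IP, node) pair the two results coincide. Otherwise nunique() is the one that measures node coverage.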

From the two tables above we can draw some preliminary conclusions: for example, the user at source IP 182...205 has been scanning honeypot nodes over a long period and can be marked as a dangerous user, and so on.


Here Xiao An has given a brief introduction to analyzing data with Python tools, mainly the pandas library. The library is very powerful and this was only a first taste; the rest is for you to use and explore yourselves.
