Analyze risk data using the Python tool

Last Update:2016-08-02 Source: Internet

Author: User

As the volume of network security data grows, applying data analysis techniques to network security has become a research hotspot in the industry. In this short lesson, Xiao An uses Python tools to do a simple analysis of risk data, mainly honeypot log data, to see what proxy IPs are generally being used for.

You may ask Xiao An what a honeypot is. Online, some hackers or technical people often do certain "things" for which they need to hide their identity, so they work through proxy IPs. A honeypot is a newer active-defense security technology: a deception system set up specifically to be attacked or intruded upon. It can be used both to protect production systems and to collect information about hackers, which makes it a flexible and versatile network security technique.

Put more plainly, the honeypot offers a large number of proxy IPs, lures wrongdoers into using them, and collects information about those users in the process.

Introduction to Data analysis tools

As the saying goes, a craftsman who wants to do good work must first sharpen his tools. Here Xiao An introduces some of the "divine weapons" of Python data analysis.

pandas, the well-known Python data analysis library

pandas is a NumPy-based library created for data analysis tasks. It is built around two core data structures, Series and DataFrame, which correspond to a one-dimensional sequence and a two-dimensional table respectively.

pandas provides a large number of functions and methods that let us process data quickly and conveniently. The library has many strengths: it is easy to use, its interfaces are well abstracted, and its documentation is genuinely impressive. You will soon find that it is one of the main reasons Python is a powerful and efficient environment for data analysis.

Visualization of data

Matplotlib is Python's most commonly used and best-known plotting library. It provides a full set of command-style APIs similar to MATLAB's, which makes it well suited to interactive plotting.

With these "divine weapons" in hand, Xiao An will now walk you through a brief analysis of the honeypot proxy data using these Python tools.

1. Import the tools: load the data analysis packages

Start IPython Notebook and load the runtime environment:

%matplotlib inline
import pandas as pd
from datetime import timedelta, datetime
import matplotlib.pyplot as plt
import numpy as np

2. Data preparation

The data Xiao An analyzes consists mainly of access logs recorded when users go through the proxy IPs. The raw data is stored in CSV form. This is a good place to introduce pandas.read_csv, a commonly used method that reads data into a DataFrame.

analysis_data = pd.read_csv('./honeypot_data.csv')

That's right: one line of code reads all the data into a DataFrame, a two-dimensional table structure. Simple, isn't it? The IO tools that pandas provides can also read large files in chunks. In Xiao An's test, loading the full data set of about 21.53 million records took roughly 90 seconds, which is quite good performance.
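The chunked reading mentioned above can be sketched as follows. This is a minimal illustration rather than the original notebook code: the in-memory CSV stands in for './honeypot_data.csv', its rows are made up, and the chunk size of 2 is chosen only so the loop runs more than once.

```python
import io
import pandas as pd

# Stand-in for './honeypot_data.csv'; the rows below are invented for illustration.
csv_source = io.StringIO(
    "timestamp,module,proxy_host,srcip\n"
    "2016-06-05,proxy,www.example.com,10.0.0.1\n"
    "2016-06-05,scan,,10.0.0.2\n"
    "2016-06-06,proxy,www.example.org,10.0.0.1\n"
)

total_rows = 0
proxy_chunks = []
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(csv_source, chunksize=2):
    total_rows += len(chunk)
    # Keep only the rows needed from each chunk to bound memory use.
    proxy_chunks.append(chunk[chunk["module"] == "proxy"])

proxy_rows = pd.concat(proxy_chunks, ignore_index=True)
print(total_rows, len(proxy_rows))
```

For a real log file, a chunk size in the hundreds of thousands of rows is a more typical starting point; the right value depends on available memory.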

3. A first look at the data

Generally speaking, before analyzing data we first need a general understanding of it: how much data there is, what variables it has, how those variables are distributed, and whether there are duplicates, missing values or obvious anomalies. Xiao An will now take you through a first glimpse of this data.

Use the shape attribute to view the number of rows and columns

analysis_data.shape
Out: (21524530, 22) # a DataFrame with 22 columns and 21,524,530 records in total

The head() method shows the first 5 rows by default, and the tail() method shows the last 5 rows by default. Both accept an argument for a custom number of rows.

analysis_data.head(10)

Here we can see that each record includes the date the user used the proxy IP, the proxy header information, the domain accessed through the proxy, the proxy method, the source IP, the honeypot node information and so on. Xiao An must tell you about one method used in every single data analysis: describe. The pandas describe() function produces a quick statistical summary of the data:

For numeric data, it computes for each variable:

the count, mean, maximum, minimum, standard deviation, 50th percentile (median) and so on;

For non-numeric data, it gives for each variable:

the number of non-null values, the number of unique values (equivalent to DISTINCT in a database), the most frequent value, and that value's frequency.

From head() we can see that the data contains both numeric and non-numeric variables. We can first use the dtypes attribute to view the data type of each column in the DataFrame, then use the select_dtypes method to split the columns by data type, and finally use the statistics returned by describe to get a preliminary understanding of the data:

analysis_data.select_dtypes(include=['O']).describe()

analysis_data.select_dtypes(include=['float64']).describe()

proxy_retlength	scan_os_fp	scan_os_sub_fp	scan_scan_mode	dtype_details
count	6.417354e+06	0.0	0.0	0.0
mean	1.671744e+03	NaN	NaN	NaN
std	3.104775e+04	NaN	NaN	NaN
min	0.000000e+00	NaN	NaN	NaN
25%	NaN	NaN	NaN	NaN
50%	NaN	NaN	NaN	NaN
75%	NaN	NaN	NaN	NaN
max	2.829355e+07	NaN	NaN	NaN

From the summary statistics above we can see that the average length of the data retrieved is about 1670 bytes. We can also see that fields such as scan_os_sub_fp and scan_scan_mode contain no values at all. This gives us a general picture of the data as a whole.
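The dtypes, select_dtypes and describe workflow can be tried on a toy frame. This is a sketch under assumptions: the three column names are borrowed from the article, the values are invented, and the non-numeric columns are selected with exclude=['number'] rather than include=['O'] so that the example does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Toy data; only the column names come from the article.
sample = pd.DataFrame({
    "proxy_host": ["www.example.com", "www.example.com", "www.example.org", None],
    "srcip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
    "proxy_retlength": [120.0, 4096.0, None, 512.0],
})

print(sample.dtypes)

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = sample.select_dtypes(include=["number"]).describe()
# Non-numeric columns: count, unique, top (most frequent value), freq.
text_summary = sample.select_dtypes(exclude=["number"]).describe()

print(numeric_summary)
print(text_summary)
```

Note that count ignores missing values, which is why an entirely empty column shows a count of 0 in the output above.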

4. Data Cleansing

Source data usually contains some null values or even entirely empty columns, which cost time and efficiency during analysis. After previewing the data summary, this invalid data needs to be dealt with.

In general, null data can be removed with the dropna method. On first use, however, inspection showed that dropna() had removed almost every row. The pandas user manual explains why: called without arguments, dropna() removes every row that contains any null value.

If you only want to remove columns in which every value is null, add the axis and how parameters:

analysis_data.dropna(axis=1, how='all')

Alternatively, pass the subset parameter to dropna to remove rows that have nulls in the specified columns, or set thresh to remove every row with fewer than thresh non-null values.

analysis_data.dropna(subset=['proxy_host', 'srcip'])  # remove rows where proxy_host or srcip has no value
analysis_data.dropna(thresh=10)  # remove rows that have fewer than 10 non-null fields
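The difference between these dropna variants is easiest to see side by side. The following sketch uses a made-up four-column frame; the column names are taken from the article, and the thresh value of 3 is chosen to suit this small example rather than the real 22-column data.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "module": ["proxy", "proxy", "scan"],
    "proxy_host": ["www.example.com", None, "www.example.org"],
    "srcip": ["10.0.0.1", "10.0.0.2", None],
    "scan_os_fp": [np.nan, np.nan, np.nan],  # an entirely empty column
})

# No arguments: every row has at least one null, so every row is dropped.
no_args = sample.dropna()
# axis=1, how='all': drop only the columns whose values are all null.
no_empty_cols = sample.dropna(axis=1, how="all")
# subset: drop rows with a null in any of the listed columns.
keyed = sample.dropna(subset=["proxy_host", "srcip"])
# thresh: keep rows that have at least this many non-null values.
at_least_three = sample.dropna(thresh=3)

print(len(no_args), list(no_empty_cols.columns), len(keyed), len(at_least_three))
```

Each call returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a variable to take effect.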

5. Statistical analysis

After this preliminary look at the data, we know the original data has 22 variables. Given the purpose of the analysis, I will select a subset of the variables from the original data. This is a good place to introduce loc, the pandas data slicing method.

loc[start_row_index:end_row_index, ['timestamp', 'proxy_host', 'srcip']] is an important pandas slicing method: the part before the comma slices rows, and the part after the comma slices columns, that is, it selects the variables to analyze.

As shown below, I select the date, host and source IP fields:

analysis_data = analysis_data.loc[:, ['timestamp', 'proxy_host', 'srcip']]
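A small, self-contained illustration of loc slicing follows. The data is invented; only the column names come from the article.

```python
import pandas as pd

sample = pd.DataFrame({
    "timestamp": ["2016-06-05", "2016-06-05", "2016-06-06", "2016-06-07"],
    "module": ["proxy", "scan", "proxy", "proxy"],
    "proxy_host": ["www.example.com", None, "www.example.org", "www.example.com"],
    "srcip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"],
})

columns = ["timestamp", "proxy_host", "srcip"]

all_rows = sample.loc[:, columns]      # every row, three columns
first_rows = sample.loc[0:1, columns]  # label slice: rows 0 and 1 (end label included)
proxy_only = sample.loc[sample["module"] == "proxy", columns]  # boolean row mask

print(all_rows.shape, first_rows.shape, proxy_only.shape)
```

Unlike ordinary Python slices, a loc row slice is label-based and includes its end label, which is why 0:1 returns two rows here.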

First, let's look at how much the honeypot proxies are used each day. We aggregate the data by day to get the daily PV (page view) count and plot the trend.

# filter on the module column, so this must run on the log before the three-column selection above
daily_proxy_data = analysis_data[analysis_data.module == 'proxy']
daily_proxy_visited_count = daily_proxy_data.timestamp.value_counts().sort_index()
daily_proxy_visited_count.plot()
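The daily count can be sketched on toy data as below. One assumption to flag: the article does not say whether its timestamp column holds whole days or full date-times, so this example assumes full date-times and truncates them to the day before counting.

```python
import pandas as pd

# Invented log rows; timestamps are assumed to carry a time of day.
log = pd.DataFrame({
    "timestamp": ["2016-06-05 08:00:00", "2016-06-05 21:30:00",
                  "2016-06-06 10:15:00", "2016-06-05 23:59:59"],
    "module": ["proxy", "proxy", "proxy", "scan"],
})

proxy_log = log[log["module"] == "proxy"].copy()
# Truncate each timestamp to midnight so that records group by calendar day.
proxy_log["day"] = pd.to_datetime(proxy_log["timestamp"]).dt.normalize()
# value_counts orders by frequency; sort_index restores chronological order.
daily_pv = proxy_log["day"].value_counts().sort_index()

print(daily_pv)
```

Calling daily_pv.plot() in a notebook with Matplotlib loaded draws the trend line.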

As for discarding columns: besides invalid values and columns the analysis does not need, redundant columns in the table itself, such as the DataFrame's index number or type descriptions, should also be cleaned up at this stage. Discarding them yields a new, smaller data set, which in turn improves computational efficiency.

The analysis shows that usage of the honeypot proxies surged on June 5, June 19-22 and June 25. The data on those days is clearly abnormal. What exactly happened? No rush: Xiao An will take you through finding out just which people (source IPs) did what "bad things".

Going further, now that an anomaly has shown up, let's look at the number of distinct IPs per day and how it grows. Group the data by day with groupby, then use the nunique() method to count the distinct IPs for each day.

daily_proxy_data = analysis_data[analysis_data.module == 'proxy']
# group by day (timestamp) and count the distinct source IPs per day
daily_proxy_visited_count = daily_proxy_data.groupby(['timestamp']).srcip.nunique()
daily_proxy_visited_count.plot()
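A toy version of the per-day distinct-IP count, with its day-over-day change, might look like this. The day column and its values are illustrative assumptions, not the article's data.

```python
import pandas as pd

log = pd.DataFrame({
    "day": ["2016-06-05", "2016-06-05", "2016-06-05", "2016-06-06"],
    "srcip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
})

# nunique counts each source IP once per day, however many requests it made.
daily_unique_ip = log.groupby("day")["srcip"].nunique()
# Difference from the previous day; the first day has no predecessor, so it is NaN.
daily_growth = daily_unique_ip.diff()

print(daily_unique_ip)
print(daily_growth)
```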

So what are most people (source IPs) actually up to? Let's look at which hosts are visited the most, that is, how many distinct IPs are associated with each host. Here we only look at the top 10 hosts.

First select the host and IP fields, then use the groupby method to group by domain name (host), and count the unique IPs that accessed each domain.

host_associate_ip = proxy_data.loc[:, ['proxy_host', 'srcip']]
grouped_host_ip = host_associate_ip.groupby(['proxy_host']).srcip.nunique()
print(grouped_host_ip.sort_values(ascending=False).head(10))

Host accessed via proxy	Number of source IPs
www.gan**.com	1113
wap.gan**.com	913
webim.gan**.com	710
cgi.**.qq.com	32T
www.baidu.com	615
loc.*.baidu.com	543
baidu.com	515
www.google.com	455
www.bing.com	428
12.ip138.com	405

Now let's look at what everyone was doing. The log data shows that they were collecting information such as used-car prices and job postings. Judging from the popular hosts, people mainly use the proxies to fetch information from sites such as Baidu, QQ, Google and Bing.

Next, let's see who uses the proxy IPs most heavily, that is, which source IP has accessed the largest number of distinct hosts.
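The ranking of hosts by distinct source IPs can be reproduced on a few made-up rows. The hosts and IPs below are placeholders, not values from the honeypot logs.

```python
import pandas as pd

proxy_log = pd.DataFrame({
    "proxy_host": ["a.example.com", "a.example.com", "a.example.com",
                   "b.example.com", "b.example.com"],
    "srcip": ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.1", "10.0.0.1"],
})

# Distinct source IPs per host, largest first.
ips_per_host = proxy_log.groupby("proxy_host")["srcip"].nunique()
top_hosts = ips_per_host.sort_values(ascending=False).head(10)

# Swapping the two columns gives distinct hosts per source IP instead.
hosts_per_ip = proxy_log.groupby("srcip")["proxy_host"].nunique()

print(top_hosts)
print(hosts_per_ip)
```

Using nunique rather than count matters here: a single IP hammering one host would otherwise inflate that host's rank.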

host_associate_ip = proxy_data.loc[:, ['proxy_host', 'srcip']]
grouped_host_ip = host_associate_ip.groupby(['srcip']).proxy_host.nunique()
print(grouped_host_ip.sort_values(ascending=False).head())

Source IP	Number of distinct hosts accessed
2850
64...122	2191
710
212..**.14
518
195.*.1	27...202	452
451
212..**.13	110...39	430

We found that the user with source IP 123.*.155 has a large number of access records. Checking the logs shows that he was collecting hotel information in bulk. So now we roughly know who is doing what. Next, let's see how long they have been using the proxies. The full code is not given here, only the following pseudo-code:

date_ip = analysis_data.loc[:, ['timestamp', 'srcip']]
grouped_date_ip = date_ip.groupby(['timestamp', 'srcip'])
# calculate the access dates for each source IP (srcip)
all_srcip_duration_times = ...
# compute the longest run of consecutive days
duration_date_cnt = count_date(all_srcip_duration_times)

Source IP	Consecutive days of use
80...38	32
213.*.128	31
125..*.161	22
120..*.161	22
50.*.67	19
114...97	19
162.*.113	19
192.*.226	17
182...205	17
112.*.108	16
123..*.130	16
61...156	15
61...152	15
58...130	15
216.*.106	14
101...117	14
124...126	14
79.*.254	13
115..*.130	13
61...79	13

Now we also know what these people do, who has used the proxies for the longest, and so on. Pulling out the proxy access logs for source IP 80...38 shows that this user spent a long time fetching images from Sohu.

The honeypot has multiple nodes deployed across the country. Let's count the total number of honeypot nodes each source IP has scanned, to understand how much of the node network each IP covers. The results are shown below:

# total number of honeypot nodes scanned per source IP
node = df[df.module == 'scan']
node = node.loc[:, ['srcip', 'origin_details']]
grouped_node_count = node.groupby(['srcip']).count()
print(grouped_node_count.sort_values(['origin_details'], ascending=False).head(10))
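Since the article gives only pseudo-code for the consecutive-days calculation, here is one possible way to compute it. This is a sketch, not the author's count_date: the helper longest_streak is a hypothetical name, the data is invented, and the approach (numbering runs of days wherever the gap to the previous active day is not exactly one day) is just one of several that would work.

```python
import pandas as pd

def longest_streak(timestamps: pd.Series) -> int:
    """Length of the longest run of consecutive calendar days in `timestamps`."""
    days = timestamps.dt.normalize().drop_duplicates().sort_values()
    # A new run starts wherever the gap to the previous active day is not one day.
    starts_new_run = days.diff() != pd.Timedelta(days=1)
    # The cumulative sum gives every day in the same run the same run number.
    run_id = starts_new_run.cumsum()
    return int(run_id.value_counts().max())

# Invented access log: 10.0.0.1 is active on June 1-3 and June 5.
log = pd.DataFrame({
    "srcip": ["10.0.0.1"] * 5 + ["10.0.0.2"] * 2,
    "timestamp": pd.to_datetime([
        "2016-06-01", "2016-06-02", "2016-06-02", "2016-06-03", "2016-06-05",
        "2016-06-01", "2016-06-03",
    ]),
})

duration_days = log.groupby("srcip")["timestamp"].apply(longest_streak)
print(duration_days.sort_values(ascending=False))
```

Repeated visits on the same day are collapsed first, so an IP's streak reflects how many days in a row it appeared, not how many requests it made.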

Source IP	Number of honeypot nodes scanned
106.*.161	9
45...214	9
94.*.174	8
119...216	7
61...222	7
182...205	6
182..*.75	6
42..*.89	6
123...64	6
42..*.128	6
42..*.106	6
42..*.82	6
114...157	6
80...38	6
42..*.149	6
115...163	6

From the two tables above we can draw some preliminary conclusions: for example, the user with source IP 182...205 has been scanning honeypot nodes for a long time and can be marked as a dangerous user, and so on.
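Cross-referencing the two results can also be done in pandas. The sketch below is an illustration only: the IPs and counts are placeholders, and the two thresholds (MIN_DAYS and MIN_NODES) are assumptions made up for the example, not criteria stated in the article.

```python
import pandas as pd

# Hypothetical per-IP results standing in for the two tables.
duration_days = pd.Series(
    {"10.0.0.1": 17, "10.0.0.2": 3, "10.0.0.3": 32}, name="duration_days")
nodes_scanned = pd.Series(
    {"10.0.0.1": 6, "10.0.0.2": 9}, name="nodes_scanned")

# Align the two results on source IP; an IP missing from one table gets 0.
profile = pd.concat([duration_days, nodes_scanned], axis=1).fillna(0)

# Assumed cut-offs, chosen only for this example.
MIN_DAYS, MIN_NODES = 14, 5
profile["suspicious"] = (
    (profile["duration_days"] >= MIN_DAYS) & (profile["nodes_scanned"] >= MIN_NODES)
)
flagged = profile.index[profile["suspicious"]].tolist()

print(profile)
print(flagged)
```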

Conclusion

Here Xiao An has given you a brief introduction to analyzing data with Python tools, mainly the pandas library. The library is of course far more powerful than this; Xiao An has only given you a first taste, and the rest is for you to use and explore yourself.

Original link: http://www.freebuf.com/sectool/109479.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
