Data-Hack SQL Injection Detection
0x00 Preface
I was planning to translate this series of tutorials, but when I got to the articles I found the way they taught wasn't very friendly, so I kept the author's ideas while making a lot of changes. I also assume the reader is familiar with basic Python and with SQL injection. One more note: I wrote this in an IPython notebook, so the code in this article may need some adjustment to run elsewhere.
The point of this article, in brief, is to offer a way of thinking about "how to identify SQL injection". The idea is to attack the problem with the tools of data science — in other words, machine learning.
To achieve our goal, we need a process:
Collect data → Understand data → Feature Engineering → Machine Learning

0x01 Preparation

1. Tools
This series mainly uses Python, so below are the required Python libraries. I won't teach you how to install them.
sqlparse (a library for parsing SQL into a syntax tree)
scikit-learn (a Python machine learning library)
pandas (for fast processing of moderate amounts of data)
numpy (for scientific computing)
matplotlib (for data visualization)
2. What is machine learning?
Since this article uses supervised learning, let's cover the background needed for supervised learning. Machine learning, as the name suggests, gives machines the ability to learn. Given an algorithm that can learn, how do we teach it? Think of teaching a child to recognize fruit: we put down two piles of fruit and tell him the left pile is apples and the right pile is bananas. Once he has learned from those examples, we can take him to a pile of new fruit and have him identify it. In the same way, we will prepare a pile of data and tell the algorithm which requests are normal SQL and which are SQL injection, let it learn, and finally test it on a batch of unknown data.
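To make the analogy concrete: in scikit-learn the whole supervised workflow is two calls, fit (learn from labeled examples) and predict (identify new ones). A minimal sketch with made-up toy data — the features and the 0/1 labels here are placeholders, not real SQL:

from sklearn.ensemble import RandomForestClassifier

# Toy placeholder data: each row is a "fruit", each label is what we tell the child it is
X_train = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_train = [0, 1, 1, 0]

clf = RandomForestClassifier()
clf.fit(X_train, y_train)       # the learning phase: show it labeled examples
print clf.predict([[1, 0]])     # identify a new, unseen sample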
3. SQL syntax tree
What happens to an SQL statement between being received by the database and acting on the stored content? SQL is a DSL (domain-specific language): languages like Ruby, C, and Java can do almost anything, while some languages only work within one particular domain. SQL is such a language — it can only describe operations on data. Still, it counts as a programming language in the broad classification, so it goes through lexical analysis and syntax analysis. If you don't know that process, it's worth reading up on it.
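If you want to see that analysis happen, the sqlparse library introduced above will show you the tokens the lexer/parser produces. A quick sketch with an arbitrary statement:

import sqlparse

parsed = sqlparse.parse('select user from admin')[0]
for token in parsed.tokens:
    print token.ttype, repr(token.value)   # token type and raw text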
0x02 Prepare Data
The data has already been prepared this time, so all we need is a small script to read it in. I have packaged everything we need:

Download
# -*- coding: utf-8 -*-
import os
import pandas as pd

basedir = '/Users/slay/project/python/datahack/data_hacking/sql_injection/data'
filelist = os.listdir(basedir)
df_list = []

# Read every file under basedir. Files named 'legit' hold legal SQL,
# files named 'malicious' hold malicious SQL statements.
for file in filelist:
    df = pd.read_csv(os.path.join(basedir, file), sep='|', names=['raw_sql'], header=None)
    df['type'] = 'legit' if file.split('.')[0] == 'legit' else 'malicious'
    df_list.append(df)

# Put everything into one dataframe object
dataframe = pd.concat(df_list, ignore_index=True)
dataframe.dropna(inplace=True)

print dataframe['type'].value_counts()
dataframe.head()   # view the first five rows
Now we can clearly understand what kind of data we are facing.
0x03 Feature Engineering

1. Concepts
So, what next? Can we just drop the data into the algorithm and get a fancy SQL firewall out? Let's look at a problem first. Here are two SQL statements, each selecting a column from the admin table:
select user from admin;
select hello from admin;
What the algorithm ultimately receives is a count vector for each statement. If that doesn't mean much to you, written out it looks like this:
{select: 1, user: 1, hello: 0, from: 1, admin: 1}
{select: 1, user: 0, hello: 1, from: 1, admin: 1}
See the problem? To the machine, user and hello are completely different things, yet anyone who knows SQL can tell they play exactly the same role. We therefore need to give things that play the same role the same tag, so the machine can recognize the similarity.
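You can reproduce the problem in a few lines with scikit-learn's CountVectorizer (the same vectorizer used later in this article); this is just a sketch showing that user and hello land in different columns:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
counts = vectorizer.fit_transform(['select user from admin',
                                   'select hello from admin'])
print vectorizer.get_feature_names()   # ['admin', 'from', 'hello', 'select', 'user']
print counts.toarray()                 # the two rows differ in the 'hello' and 'user' columns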
Do you now have a rough idea of what feature engineering is? Doing it well requires a deep understanding of the problem you are facing — "domain knowledge". Here that means bringing your understanding of the SQL language to bear, and using it to process the features so the algorithm can classify better. In the fruit problem, it's telling the child that bananas are long and yellow while apples are red and round. Of course, if you drop the raw representation above straight into the algorithm, the classifier will still work, with an accuracy of about 70%. That may look acceptable to you; I can only tell you it's a disaster. It reminds me of a data mining competition where the gap between first place and thousandth place was 0.01 — that's how much the details matter.
2. Convert data
So what we need now is to convert the raw data into features — this is why we discussed the syntax tree earlier. Let's use the sqlparse module to write a function that processes SQL statements.
import sqlparse

def parse_sql(raw_sql):
    parsed_sql = []
    sql = sqlparse.parse(unicode(raw_sql, 'utf-8'))
    for parse in sql:
        for token in parse.tokens:
            # skip whitespace tokens; keep only the token-type names
            if token._get_repr_name() != 'Whitespace':
                parsed_sql.append(token._get_repr_name())
    return parsed_sql

sql_one = parse_sql("select 2 from admin")
sql_two = parse_sql("INSERT INTO Persons VALUES ('Gates', 'Bill', 'Xuanwumen 10', 'Beijing')")

print "sql one :%s" % (sql_one)
print "sql two :%s" % (sql_two)
Output

sql one :['DML', 'Integer', 'Keyword', 'Keyword']
sql two :['DML', 'Keyword', 'Identifier', 'Keyword', 'Parenthesis']
We can see that both select and insert are identified as DML tokens. Before converting the whole dataset, it's worth observing whether these token-type sequences can actually separate the two classes.
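For intuition, compare the token-type sequence of a plain query against an injected one. The injected string below is a made-up example, and the exact token names depend on your sqlparse version:

# A made-up pair of statements for illustration
print parse_sql("select name from users where id = 1")
print parse_sql("select name from users where id = 1 or 1 = 1 -- ")

With that sanity check done, convert every statement in the dataframe: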
dataframe['parsed_sql'] = dataframe['raw_sql'].map(lambda x: parse_sql(x))
dataframe.head()
3. Other
In theory we could feed these features straight into the algorithm now. But to make a broader point: a classifier's performance depends largely on its features, and if the data doesn't separate well we should consider engineering additional features. For example, if you suspect that SQL injection tends to make statements longer, length can be turned into a feature:
dataframe['len'] = dataframe['parsed_sql'].map(lambda x: len(x))
dataframe.head()
Now we need to observe the data to see if the length can classify the data.
%matplotlib inline
import matplotlib.pyplot as plt

dataframe.boxplot('len', 'type')
plt.ylabel('SQL Statement Length')
0x04 Machine Learning

1. Train & Test
Here I'll just call the Python library directly; the algorithm is troublesome to explain, and at the level we use Random Forests here I think it's better not to go into them. If you're interested in the mathematical principles, see the paper below — the clearest explanation of random forests I know of.
Gilles Louppe, Understanding Random Forests: From Theory to Practice, http://arxiv.org/pdf/1407.7502v1.pdf
Next we process the features one more time, converting them into numeric vectors of token counts. x is our feature data and y holds the labels.
import numpy as np
import string
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
le = LabelEncoder()

# Join each token-type list back into a string so CountVectorizer can count tokens
X = vectorizer.fit_transform(dataframe['parsed_sql'].map(lambda x: string.join(x, ' ')))
x_len = dataframe.as_matrix(['len']).reshape(X.shape[0], 1)  # the length feature, if you want to stack it on
x = X.toarray()
y = le.fit_transform(dataframe['type'].tolist())

print x[:100]
print y[:100]
Output
[[0 0 0 ..., 2 0 0]
 [0 0 0 ..., 1 0 0]
 [0 0 0 ..., 0 0 0]
 ...,
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
[1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
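If you want to know which token type each column of x counts, ask the vectorizer (a quick sketch):

# Map the columns of the feature matrix back to token-type names
print vectorizer.get_feature_names()[:10]   # the first ten vocabulary entries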
Input
import sklearn.ensemble
import sklearn.cross_validation

clf = sklearn.ensemble.RandomForestClassifier(n_estimators=30)
scores = sklearn.cross_validation.cross_val_score(clf, x, y, cv=10, n_jobs=4)
print scores
Output
[ 0.97699497 0.99928109 0.99928058 1. 1. 0.97192225 0.99928006 0.99856012 1. 1. ]
cross_val_score above is a method for evaluating a classifier: it splits the dataset into several folds, repeatedly trains on some and tests on the held-out part, and the multiple scores give a better picture of performance than a single split would. We get a seemingly good result.
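If the principle still feels abstract, here is roughly what cross_val_score does under the hood — for classifiers it stratifies the folds — written out by hand with StratifiedKFold from the same old sklearn.cross_validation module (a sketch over the x and y built above):

from sklearn.cross_validation import StratifiedKFold

scores = []
for train_idx, test_idx in StratifiedKFold(y, n_folds=10):
    clf.fit(x[train_idx], y[train_idx])                   # train on nine folds
    scores.append(clf.score(x[test_idx], y[test_idx]))    # score on the held-out fold
print scores

Next, let's train the classifier on a proper train/test split.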
from sklearn.cross_validation import train_test_split

# Split the data into training and test sets: the training set is used to fit
# the model, the test set to measure the classifier's performance.
X_train, X_test, y_train, y_test, index_train, index_test = train_test_split(x, y, dataframe.index, test_size=0.2)

# Train
clf.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = clf.predict(X_test)
If the raw numbers don't give you an intuitive picture of what you've trained, you need a confusion matrix.
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# confusion_matrix expects (y_true, y_pred)
cm = confusion_matrix(y_test, y_pred)
print cm

# Show confusion matrix in a separate window
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
The confusion matrix lets us read the results directly. Our two labels are 0 (legit) and 1 (malicious). For example, cm[0, 0] = 196 is the number of legit samples classified correctly, cm[0, 1] = 3 is the number of legit samples misclassified as malicious, and the second row gives the same breakdown for the malicious samples.
Classification looks good now — about 99% accuracy. But notice that 3 of the 199 legit samples were misclassified, a false-positive rate of roughly 1.5%. A medium-sized website can easily process hundreds of thousands of SQL statements, so at this rate you could end up blocking thousands of harmless queries. What we need next is to reduce the probability of legit statements being misclassified.
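To put a number on it, the false-positive rate can be read straight off the confusion matrix (using the cm computed above):

# Fraction of legit statements (row 0) classified as malicious (column 1)
fp_rate = float(cm[0, 1]) / (cm[0, 0] + cm[0, 1])
print "false positive rate: %.4f" % fp_rate   # about 3/199 = 0.0151 with the numbers above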
2. Adjustment
Most sklearn models have a method called predict_proba, which returns the predicted probability of each class. predict actually calls predict_proba internally and thresholds it at 50%, so by calling predict_proba yourself you can move the classification threshold.
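You can check the 50% claim yourself with a quick sketch:

# predict() should agree with thresholding predict_proba at 0.5
manual = (clf.predict_proba(X_test)[:, 1] > 0.5).astype(int)
print np.all(manual == clf.predict(X_test))   # expect True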
y_probs = clf.predict_proba(X_test)[:, 1]
thres = 0.7  # classify as malicious only above a 0.7 probability
y_pro = np.zeros(y_probs.shape)
y_pro[y_probs > thres] = 1.

cm = confusion_matrix(y_test, y_pro)
print cm
Output
[[ 197    0]
 [   5 2577]]
The probability of legit statements being misclassified has dropped, but 0.7 was just a number pulled out of thin air. Can we find a simple way to optimize it? Let's define a function f(x) that takes the threshold as input and outputs the two misclassification rates.
def f(s_x):
    loss = np.zeros(2)
    y_probs = clf.predict_proba(X_test)[:, 1]
    thres = s_x  # the threshold to evaluate
    y_pro = np.zeros(y_probs.shape)
    y_pro[y_probs > thres] = 1.
    cm = confusion_matrix(y_test, y_pro)
    counts = sum(cm)     # column sums of the confusion matrix
    count = sum(counts)  # total number of samples
    # loss[0]: legit misclassified as malicious; loss[1]: malicious missed
    if counts[0] > 0:
        loss[0] = float(cm[0, 1]) / count
    else:
        loss[0] = 0.01
    if counts[1] > 0:
        loss[1] = float(cm[1, 0]) / count
    else:
        loss[1] = 0.01
    return loss

# Sweep thresholds from 0.1 to 0.9 and plot both error rates.
# New names xs/ys so we don't clobber the feature matrix x and labels y.
xs = np.linspace(0.1, 0.9, 100)
ys = np.array([f(i) for i in xs])
plt.plot(xs, ys)
plt.show()
Hmm. Looks like 0.7 holds up, so we'll keep using it.
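If you'd rather not eyeball the plot, the same sweep can pick the threshold programmatically, e.g. by minimizing the sum of the two error rates returned by f (a sketch; you could also weight false positives more heavily):

# Choose the threshold with the smallest combined misclassification rate
best = xs[np.argmin(ys.sum(axis=1))]
print "best threshold: %.2f" % best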
0x05 Conclusion
This is the first article of a series, and it's admittedly a bit messy at the start.
Let me close with an old saying — I don't know who first said it, but everyone repeats it all the time:

80% of a data mining project's performance is determined by feature engineering, and the remaining 20% by the model and everything else. Put another way: feature engineering sets the performance ceiling of a project, and the model only determines how closely you approach it.