College Students' Acceptance Prediction: Logistic Regression

Last Update:2016-04-29 Source: Internet

Author: User

Dataset

Every year, high school and college students apply for entry to various universities and institutions. Each student has a unique set of test scores, grades, and background. The admissions committee accepts or rejects each applicant based on this information. A binary classification algorithm can therefore be used to accept or reject an application. Logistic regression is a suitable method, and we will use it to solve this problem.

The dataset admissions.csv contains information about 1000 applicants, with the following columns:

gre - Graduate Record Exam score, a general test for prospective graduate students, continuous between 200 and 800.
gpa - Cumulative grade point average, continuous between 0.0 and 4.0.
admit - Binary variable, 0 or 1, where 1 means the applicant was admitted to the program.

Use Linear Regression to Predict Admission

This is the original data. The value of admit is either 0 or 1.

import pandas
import matplotlib.pyplot as plt

# Load the data and plot admit (0 or 1) against GPA
admissions = pandas.read_csv("admissions.csv")
plt.scatter(admissions["gpa"], admissions["admit"])
plt.show()

These are the values of admit predicted by a linear regression model. The predictions span a wide range and can even be negative, which is not what we want.

# Import the linear regression class
from sklearn.linear_model import LinearRegression

# Initialize a linear regression model
model = LinearRegression()

# Fit the model
model.fit(admissions[["gre", "gpa"]], admissions["admit"])

# Predict admission
admit_prediction = model.predict(admissions[["gre", "gpa"]])

# Plot the estimated function
plt.scatter(admissions["gpa"], admit_prediction)
plt.show()

We therefore want a model that outputs a probability of admission, with values in [0, 1]. We can then classify applicants by choosing a suitable threshold, using the approach described in the article "Bank credit card approval: model evaluation ROC & AUC".
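To see why a straight-line fit overshoots the [0, 1] range, here is a minimal sketch on synthetic data (the values of `x` and `y` are made up for illustration and are not the admissions dataset). It fits an ordinary least-squares line to a 0/1 target with `np.polyfit`.

```python
import numpy as np

# Synthetic, illustrative data: a single feature and a 0/1 label
x = np.linspace(-3, 3, 200)
y = (x > 0).astype(float)

# Ordinary least-squares line: polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept

# The fitted line is unbounded, so predictions leave the [0, 1] range
print(pred.min(), pred.max())
```

Because the line keeps rising with `x`, the smallest prediction is below 0 and the largest is above 1, so the output cannot be read as a probability.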

The Logit Function

Logistic regression is a popular classification method that restricts its output to values between 0 and 1. This output can be interpreted as the probability of an event given a set of inputs, just as with other probabilistic classification methods.

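The logit function is the basis for logistic regression. (The form used here is more commonly called the logistic, or sigmoid, function; strictly speaking, the logit is its inverse.) It has the following form:

$$\sigma(t) = \frac{e^{t}}{1 + e^{t}}$$

Let's look at the shape of this function: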

import numpy as np
import matplotlib.pyplot as plt

# Logistic function
def logit(x):
    # np.exp(x) raises e to the power x, i.e. e^x. e ~= 2.71828
    return np.exp(x) / (1 + np.exp(x))

# linspace is a NumPy function that produces evenly spaced numbers over a specified interval.
# Create an array t with values between -6 and 6
# (the point count is garbled in the source; NumPy's default of 50 is used here)
t = np.linspace(-6, 6, dtype=float)

# Get logistic fits
ylogit = logit(t)

# Plot the logistic function
plt.plot(t, ylogit, label="Logistic")
plt.ylabel("Probability")
plt.xlabel("t")
plt.title("Logistic Function")
plt.show()

a = logit(-10)
b = logit(10)
'''
a: 4.5397868702434395e-05
b: 0.99995460213129761
'''
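As a small aside on naming, the sketch below pairs the function above with its inverse. The names `logistic` and `log_odds` are chosen here for illustration and do not appear in the original; the point is only that mapping a value to a probability and back recovers the input.

```python
import numpy as np

def logistic(t):
    """Map any real number into (0, 1): e^t / (1 + e^t)."""
    return np.exp(t) / (1 + np.exp(t))

def log_odds(p):
    """Inverse of logistic(): map a probability in (0, 1) to log(p / (1 - p))."""
    return np.log(p / (1 - p))

t = np.linspace(-6, 6, 13)
p = logistic(t)          # probabilities strictly between 0 and 1
recovered = log_odds(p)  # should match t up to floating-point error
```

The log-odds view is useful because it is linear in the features, which is what lets a linear model sit inside the logistic function.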

The Logistic Regression

Logistic regression feeds the output of a linear regression into the logit function, and the result is the final probability:

$$P(\text{admit} = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}$$

Here β0 is the intercept and each remaining βi is a slope, that is, the coefficient of feature i.
As with linear models, we want to find the values of βi that minimize the error between the predicted and actual values. The most common approach is maximum likelihood estimation, often carried out with an iterative optimizer such as gradient descent.
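To make the fitting step concrete, here is a minimal sketch of maximum likelihood estimation by plain gradient descent. It runs on synthetic data, and the function names, learning rate, and iteration count are all illustrative assumptions; scikit-learn uses its own, more robust solvers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(beta, X, y):
    """Average negative log-likelihood of labels y under coefficients beta."""
    p = sigmoid(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic_gd(X, y, lr=0.1, n_iter=5000):
    """Minimize the negative log-likelihood with fixed-step gradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (p - y) / len(y)  # gradient of the average NLL
        beta -= lr * grad
    return beta

# Synthetic data: an intercept column plus one feature, true beta = [-0.5, 2.0]
rng = np.random.default_rng(0)
feature = rng.normal(size=500)
X = np.column_stack([np.ones(500), feature])
true_beta = np.array([-0.5, 2.0])
y = (rng.random(500) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_logistic_gd(X, y)
print(beta_hat, neg_log_likelihood(beta_hat, X, y))
```

The estimate will not match `true_beta` exactly because the labels are sampled, but the fitted coefficients should give a lower negative log-likelihood than the all-zero starting point.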

Model Data

Next comes a logistic regression experiment. The sample data must be shuffled before it is split into training and test sets, so that the split is random. The final plot of GRE against the predicted values shows that the higher the GRE score, the higher the predicted probability of admission, which matches expectations.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Randomly shuffle our data for the training and test set
admissions = admissions.loc[np.random.permutation(admissions.index)]

# Split the dataset: train on the first num_train rows, test on the rest
# (the split size is garbled in the source; 700 is an assumption)
num_train = 700
data_train = admissions[:num_train]
data_test = admissions[num_train:]

# Fit logistic regression to admit with gpa and gre as features using the training set
logistic_model = LogisticRegression()
logistic_model.fit(data_train[["gpa", "gre"]], data_train["admit"])

# Print the model's coefficients
print(logistic_model.coef_)
'''
[[0.38004023 0.00791207]]
'''

# Predict the chance of admission for the training and test sets
fitted_vals = logistic_model.predict_proba(data_train[["gpa", "gre"]])[:, 1]
fitted_test = logistic_model.predict_proba(data_test[["gpa", "gre"]])[:, 1]
plt.scatter(data_test["gre"], fitted_test)
plt.show()

Predictive Power

One idiom is worth pointing out: accuracy_train = (predicted == data_train['admit']).mean(). The comparison predicted == data_train['admit'] produces a Boolean array. When mean() is computed on it, True counts as 1 and False as 0, so the mean is the accuracy. This does not work on a plain Python list, because list objects have no mean() method.
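The idiom can be checked on a tiny example. The arrays below are made-up values for illustration only.

```python
import numpy as np

predicted = np.array([1, 0, 1, 1, 0])
actual = np.array([1, 0, 0, 1, 1])

# Element-wise comparison gives a Boolean array
matches = predicted == actual

# True counts as 1 and False as 0, so the mean is the fraction of matches
accuracy = matches.mean()

# A plain list has no mean() method; sum() / len() is the equivalent
matches_list = [p == a for p, a in zip(predicted.tolist(), actual.tolist())]
accuracy_from_list = sum(matches_list) / len(matches_list)
```

Three of the five predictions match, so both `accuracy` and `accuracy_from_list` come out to 0.6.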

# .predict() uses a threshold of 0.50 by default
predicted = logistic_model.predict(data_train[["gpa", "gre"]])

# The average of the Boolean array gives us the accuracy
accuracy_train = (predicted == data_train["admit"]).mean()

# Print the accuracy
print("Accuracy in Training Set = {s}".format(s=accuracy_train))
'''
Accuracy in Training Set = 0.7785714285714286
'''

# Percentage of those admitted
percent_admitted = data_test["admit"].mean() * 100

# Predicted to be admitted
predicted = logistic_model.predict(data_test[["gpa", "gre"]])

# What proportion of our predictions were true
accuracy_test = (predicted == data_test["admit"]).mean()

The classification threshold for logistic regression in scikit-learn is 0.5 by default.
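A short sketch of what the default threshold means in practice, using synthetic data rather than the admissions file (the features, labels, and the alternative cutoff of 0.8 are illustrative assumptions). For a binary problem, thresholding the second column of `predict_proba` at 0.5 should reproduce `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature binary classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]  # probability of class 1

default_labels = model.predict(X)           # built-in 0.5 cutoff
manual_labels = (probs > 0.5).astype(int)   # same cutoff applied by hand
strict_labels = (probs > 0.8).astype(int)   # a stricter, hand-picked cutoff
```

Raising the cutoff can only turn predicted positives into negatives, so `strict_labels` never contains more 1s than `default_labels`.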

Admissions ROC Curve

The predict_proba function of a logistic regression model returns the probability of acceptance rather than the class label, which lets us choose the threshold ourselves. First we plot the ROC curve to look for a suitable threshold:

from sklearn.metrics import roc_curve, roc_auc_score

# Compute the probabilities predicted for the training and test sets.
# predict_proba returns probabilities for each class; we want the second column
train_probs = logistic_model.predict_proba(data_train[["gpa", "gre"]])[:, 1]
test_probs = logistic_model.predict_proba(data_test[["gpa", "gre"]])[:, 1]

# Compute AUC for the training set
auc_train = roc_auc_score(data_train["admit"], train_probs)

# Compute AUC for the test set
auc_test = roc_auc_score(data_test["admit"], test_probs)

# Difference in AUC values
auc_diff = auc_train - auc_test

# Compute ROC curves; roc_curve returns (false positive rates, true positive rates, thresholds)
roc_train = roc_curve(data_train["admit"], train_probs)
roc_test = roc_curve(data_test["admit"], test_probs)

# Plot false positive rate against true positive rate
plt.plot(roc_train[0], roc_train[1])
plt.plot(roc_test[0], roc_test[1])
plt.show()

You can see that the ROC curve starts out very steep and then gradually flattens. The test-set AUC is 0.79, slightly lower than the training-set AUC of 0.82. These results suggest that our model can predict admission from GRE and GPA reasonably well.
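One possible way to turn a ROC curve into a concrete threshold is Youden's J statistic (true positive rate minus false positive rate). This rule is not part of the original article; it is shown here as one common option, on synthetic scores that are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic labels and scores: positives tend to score higher than negatives
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=300)
scores = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=300), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Youden's J: pick the threshold with the largest gap between TPR and FPR
j = tpr - fpr
best = np.argmax(j)
best_threshold = thresholds[best]

print(auc, best_threshold, tpr[best], fpr[best])
```

Other criteria are just as valid, for example fixing a maximum acceptable false positive rate and taking the threshold that achieves it; the right choice depends on the relative cost of wrongly admitting versus wrongly rejecting an applicant.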

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
