College Admission Prediction: Logistic Regression

Source: Internet
Author: User

Dataset

Every year, high school and college students apply to various universities and programs. Each student has a unique set of test scores, grades, and background. An admissions committee decides whether to accept or reject each applicant based on this information. A binary classification algorithm can be used to model this accept-or-reject decision. Logistic regression is a suitable method, and we use it to solve the problem here.

    • The dataset admissions.csv contains information about 1000 applicants, with the following features:

gre - Graduate Record Exam (GRE) score, a general test for prospective graduate students; continuous between 200 and 800.
gpa - Cumulative grade point average; continuous between 0.0 and 4.0.
admit - Binary variable, 0 or 1, where 1 means the applicant was admitted to the program.

Using Linear Regression to Predict Admission
    • This is the original data; the value of admit is 0 or 1.
import pandas
import matplotlib.pyplot as plt

admissions = pandas.read_csv("admissions.csv")
plt.scatter(admissions["gpa"], admissions["admit"])
plt.show()

    • These are the values of admit predicted by a linear regression model. The predictions span a wide range and are sometimes negative, which is not what we want.
# Import the linear regression class
from sklearn.linear_model import LinearRegression

# Initialize a linear regression model
model = LinearRegression()

# Fit the model
model.fit(admissions[["gre", "gpa"]], admissions["admit"])

# Predict admission
admit_prediction = model.predict(admissions[["gre", "gpa"]])

# Plot the estimated function
plt.scatter(admissions["gpa"], admit_prediction)
plt.show()

    • We therefore want a model that outputs a probability of admission, with values in [0, 1]. We then classify applicants by choosing a suitable threshold, following the method described in the article on bank credit card approval (model evaluation with ROC & AUC).
The Logit Function

Logistic regression is a popular classification method that restricts its output to values between 0 and 1. This output can be interpreted as the probability of an event given a set of inputs, as with many other classification methods.

    • The logit function (more precisely, the logistic or sigmoid function) is the basis of logistic regression. It has the following form:

$$\sigma(t) = \frac{e^{t}}{1 + e^{t}}$$

    • Here is what the function looks like:

import numpy as np
import matplotlib.pyplot as plt

# Logistic function
def logit(x):
    # np.exp(x) raises e to the power x, i.e. e^x, where e ~= 2.71828
    return np.exp(x) / (1 + np.exp(x))

# linspace is a NumPy function that produces evenly spaced numbers over a specified interval.
# Create an array of values between -6 and 6 as t
# (the sample count is missing in the source; NumPy's default of 50 is used here)
t = np.linspace(-6, 6, dtype=float)

# Get logistic fits
ylogit = logit(t)

# Plot the logistic function
plt.plot(t, ylogit, label="Logistic")
plt.ylabel("Probability")
plt.xlabel("t")
plt.title("Logistic Function")
plt.show()

a = logit(-10)
b = logit(10)
# a: 4.5397868702434395e-05
# b: 0.99995460213129761
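In statistics, the name logit usually refers to the inverse of the function defined above, that is, the log-odds log(p / (1 - p)). The following minimal sketch, which uses only the standard library, checks that the two functions undo each other. The names logistic and log_odds are illustrative and do not come from the original code.

```python
import math

def logistic(x):
    """Map a real number to (0, 1): e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))

def log_odds(p):
    """Inverse of logistic: map a probability in (0, 1) to a real number."""
    return math.log(p / (1.0 - p))

# Round trip: applying log_odds after logistic recovers the input.
for x in (-6.0, -1.0, 0.0, 2.5, 6.0):
    p = logistic(x)
    assert 0.0 < p < 1.0
    assert math.isclose(log_odds(p), x, abs_tol=1e-9)

print(logistic(0.0))  # 0.5: even odds
```

Read this way, logistic regression is a linear model for the log-odds of admission.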

Logistic Regression
    • Logistic regression feeds the output of a linear regression into the logit function, which produces the final probability:

$$P(\text{admit} = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}$$

    Here β0 is the intercept and each remaining βi is a slope, that is, the coefficient of feature xi.
    • As with linear models, we want to find the values of βi that minimize the error between the predicted and actual values. The most common approach is maximum likelihood estimation, typically carried out with an iterative optimizer such as gradient descent.
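To make the fitting step concrete, here is a minimal sketch of maximum likelihood estimation by gradient ascent on the log-likelihood (equivalently, gradient descent on its negative) for a single feature. The data, the function names, the learning rate, and the iteration count are all assumptions chosen for illustration. scikit-learn's LogisticRegression uses its own solvers and applies regularization by default, so its coefficients will generally differ.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, lr=0.1, n_iter=2000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Gradient of the mean log-likelihood: averages of (y - p) and (y - p) * x.
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += (y - p)
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Toy, made-up data (not from admissions.csv): larger x is more often labelled 1.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print("intercept:", b0, "slope:", b1)
print("P(y=1 | x=2.0):", sigmoid(b0 + b1 * 2.0))
```

The toy labels overlap on purpose: if the two classes were perfectly separable, the unregularized coefficients would keep growing instead of settling.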
Model Data
    • Below is a logistic regression experiment. The data must be shuffled before being split into training and test sets so that the sampling is random. The final plot of GRE against the predicted probabilities shows that the higher the GRE score, the greater the predicted probability of admission, which matches expectations.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Randomly shuffle the data before splitting into training and test sets
admissions = admissions.loc[np.random.permutation(admissions.index)]

# Split the dataset: the first num_train rows for training, the rest for testing
# (the split size is garbled in the source; 700 is inferred from the reported
# training accuracy, 0.77857... = 545/700)
num_train = 700
data_train = admissions[:num_train]
data_test = admissions[num_train:]

# Fit a logistic regression to admit with GPA and GRE as features, using the training set
logistic_model = LogisticRegression()
logistic_model.fit(data_train[["gpa", "gre"]], data_train["admit"])

# Print the model's coefficients
print(logistic_model.coef_)
# [[0.38004023 0.00791207]]

# Predict the chance of admission for the training and test sets
fitted_vals = logistic_model.predict_proba(data_train[["gpa", "gre"]])[:, 1]
fitted_test = logistic_model.predict_proba(data_test[["gpa", "gre"]])[:, 1]

plt.scatter(data_test["gre"], fitted_test)
plt.show()
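The shuffle-then-split step can be sketched without pandas as follows. The helper name shuffle_split and the fixed seed are assumptions for illustration. Seeding the generator makes the split reproducible, whereas an unseeded permutation differs on every run.

```python
import random

def shuffle_split(rows, num_train, seed=0):
    """Return (train, test) after shuffling a copy of rows with a fixed seed."""
    rng = random.Random(seed)   # seeded so the split is reproducible
    shuffled = list(rows)       # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:num_train], shuffled[num_train:]

rows = list(range(10))          # stand-in for ten applicant records
train, test = shuffle_split(rows, num_train=7)

print(len(train), len(test))    # 7 3
```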

Predictive Power
    • One idiom is worth noting: accuracy_train = (predicted == data_train["admit"]).mean(). The comparison predicted == data_train["admit"] yields a Boolean array. When mean() is computed, True counts as 1 and False as 0, so the mean is the fraction of correct predictions. This does not work on a plain Python list, which has no mean() method.
# .predict() uses a threshold of 0.50 by default
predicted = logistic_model.predict(data_train[["gpa", "gre"]])

# The average of the Boolean array gives us the accuracy
accuracy_train = (predicted == data_train["admit"]).mean()

# Print the accuracy
print("Accuracy in Training Set = {s}".format(s=accuracy_train))
# Accuracy in Training Set = 0.7785714285714286

# Percentage of those admitted
percent_admitted = data_test["admit"].mean() * 100

# Predicted to be admitted
predicted = logistic_model.predict(data_test[["gpa", "gre"]])

# What proportion of our predictions were true
accuracy_test = (predicted == data_test["admit"]).mean()
    • In scikit-learn, predict() for logistic regression uses a threshold of 0.5 by default.
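As a rough illustration of what the threshold does, the sketch below turns probabilities into labels and computes accuracy with plain Python lists, where sum() over Booleans divided by the length plays the role of mean(). The probabilities and labels are made up, and classify and accuracy are hypothetical helper names. It is assumed here that a probability must exceed the threshold to be labelled 1.

```python
def classify(probs, threshold=0.5):
    """Label 1 when the predicted probability exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probs]

def accuracy(predicted, actual):
    """Fraction of matching labels; a plain list has no .mean(), so use sum/len."""
    matches = [p == a for p, a in zip(predicted, actual)]  # list of Booleans
    return sum(matches) / len(matches)                     # True counts as 1

# Made-up probabilities and labels, for illustration only.
probs  = [0.10, 0.35, 0.45, 0.55, 0.62, 0.80, 0.90, 0.30]
actual = [0,    0,    1,    0,    1,    1,    1,    0]

print(accuracy(classify(probs), actual))        # 0.75
print(accuracy(classify(probs, 0.4), actual))   # 0.875
```

On this toy data a lower threshold happens to score better, which is why the default of 0.5 is worth questioning rather than taking for granted.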
Admissions ROC Curve
    • In logistic regression, predict_proba returns the probability of admission rather than the class label, which lets us set the threshold ourselves. First we plot the ROC curve to help choose a suitable threshold:
from sklearn.metrics import roc_curve, roc_auc_score

# Compute the probabilities predicted for the training and test sets.
# predict_proba returns probabilities for each class; we want the second column.
train_probs = logistic_model.predict_proba(data_train[["gpa", "gre"]])[:, 1]
test_probs = logistic_model.predict_proba(data_test[["gpa", "gre"]])[:, 1]

# Compute AUC for the training set
auc_train = roc_auc_score(data_train["admit"], train_probs)

# Compute AUC for the test set
auc_test = roc_auc_score(data_test["admit"], test_probs)

# Difference in AUC values
auc_diff = auc_train - auc_test

# Compute ROC curves
roc_train = roc_curve(data_train["admit"], train_probs)
roc_test = roc_curve(data_test["admit"], test_probs)

# Plot true positive rate against false positive rate
plt.plot(roc_train[0], roc_train[1])
plt.plot(roc_test[0], roc_test[1])
plt.show()

The ROC curve rises steeply at first and then gradually flattens. The test-set AUC is 0.79, slightly lower than the training-set AUC of 0.82. Together these suggest that the model can predict admission from GRE and GPA reasonably well.
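For intuition about what roc_curve and roc_auc_score compute, here is a small standard-library sketch that uses every distinct score as a threshold, records the false and true positive rates, and integrates with the trapezoidal rule. The labels and scores are made up, the helper names are hypothetical, and this is not scikit-learn's implementation.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists, one point per distinct threshold, strictest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    fpr, tpr = [0.0], [0.0]
    for t in sorted(set(scores), reverse=True):
        # Everything scoring at least t is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
               for i in range(1, len(fpr)))

# Made-up labels and scores, for illustration only.
labels = [0, 0, 1, 0, 1, 1, 1, 0]
scores = [0.10, 0.35, 0.45, 0.55, 0.62, 0.80, 0.90, 0.30]

fpr, tpr = roc_points(labels, scores)
print(auc(fpr, tpr))  # 0.9375
```

The result equals the fraction of (admitted, rejected) pairs in which the admitted example gets the higher score, here 15 of 16.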
