Introduction

This is a summary of the examination and implementation of machine learning applied to student retention at a higher-education, degree-granting university. The analysis classifies student dropouts using decision tree/random forest models, boosting, and logistic regression.

The objective for this project was to predict student dropout within the first two weeks of the semester following the current enrollment, using data extracted from the learning management system in use by the university. The data from the learning management system is updated in real time, allowing us to provide accurate results for the following semester four weeks into the current one.

For our purposes, we were not concerned with the accuracy of the predictive model at the .5 base threshold for classification problems. Rather, we wanted to identify the average risk and pinpoint those who exceeded it. We then rank-ordered the results to determine which students were most at risk compared to their peers.

Data and Feature Extraction

The LMS (learning management system) used by the institution records many measurements of student activity. In previous analyses and report development, we extracted important metrics measuring course and instructor quality, as well as student participation and performance. They include, but are not limited to, login counts, course interactions, discussion board posts, submission comments, and conversation messages.

Using many of these features and existing reports, we used the R package dplyr to transform the data and extract additional features.

Using the features mentioned above, in addition to many more, we could theoretically identify a student's risk of dropout based on their behavior captured by the LMS. Students who are not active on discussion boards, not sending messages, and not logging in are very likely to drop out the following semester. The most important aspect of this project was ensuring we extracted every possible feature from the data.

An approach that was extremely beneficial in later analysis was to determine the fraction of the total interactions, submission comments, discussion posts, and conversation messages in a course that each student was responsible for. For example, a student may account for an average of 25% of the total discussion board posts across their courses, showing they are extremely active in all of them.

After finding these proportions, mean, min, and max functions were applied to all categories, as sketched below. The same approach was applied to determine the proportion of total interactions and posts attributable to the instructors as well.
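
As an illustrative sketch only (the table and column names here, such as activity, discussion_posts, and conversation_messages, are hypothetical stand-ins for the actual LMS schema), the per-course fractions and their per-student summaries might be computed with dplyr roughly like this:

```r
library(dplyr)

# Hypothetical per-student, per-course activity counts pulled from the LMS.
# Column names are illustrative, not the real schema.
student_fractions <- activity %>%
  group_by(course_id) %>%
  mutate(
    frac_posts        = discussion_posts      / sum(discussion_posts),
    frac_interactions = interactions          / sum(interactions),
    frac_comments     = submission_comments   / sum(submission_comments),
    frac_messages     = conversation_messages / sum(conversation_messages)
  ) %>%
  ungroup() %>%
  group_by(student_id) %>%
  summarise(across(
    starts_with("frac_"),
    list(mean = ~mean(.x, na.rm = TRUE),
         min  = ~min(.x, na.rm = TRUE),
         max  = ~max(.x, na.rm = TRUE))
  ))
```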

Another valuable insight came from examining the quality of the courses a student had been enrolled in during their time at the university. This became tricky: initially the problem was approached by creating dummy variables for each individual course (over 1,400 of them). After assessing this method, we decided an approach with dimensionality reduction in mind was best. We tackled this by breaking the courses into levels by department, i.e. Accounting 1000-level courses, Biology 2000-level courses, and so forth. This reduced the dimensionality of the dataset considerably while still allowing us to identify the effect of the number of quality courses taken by a student.

Using the course and course-level subsets, we averaged the above features for all levels in all departments. For example, we obtained the average discussion posts for Accounting 1000-level courses, Biology 2000-level courses, etc. Using bootstrapping with 1,000 iterations, we computed a standard error for these averages, allowing for anomalies in the courses and/or departments. We then assigned binary variables denoting whether a particular course had met or exceeded the average, within the standard error, in that category. For example, if the average for discussion board posts in Accounting 1000-level courses was 32.2 with a standard error of 1, and Accounting 1010 received 31.2 in a semester, it was labeled a 1.
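
A minimal sketch of the bootstrap step for a single department/level group (assuming a hypothetical numeric vector dept_level_posts of per-course discussion post totals):

```r
set.seed(1)

# Bootstrap the mean discussion posts for one department/level group,
# e.g. all Accounting 1000-level courses (vector name is hypothetical).
boot_means <- replicate(1000, mean(sample(dept_level_posts, replace = TRUE)))
avg_posts  <- mean(dept_level_posts)
se_posts   <- sd(boot_means)   # bootstrap standard error of the mean

# Flag a course with a 1 if it met or exceeded the group average within
# one standard error (e.g. 31.2 >= 32.2 - 1 in the example above).
above_avg_posts <- as.integer(dept_level_posts >= (avg_posts - se_posts))
```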

Using these new features, we used dplyr once again to total the binary variables by student, showing how many of the courses taken over their career at the institution were at or above average. Students with low counts in these areas would most likely be recognized by the algorithm as having had low-quality experiences, making them more likely to drop out or transfer.
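
Totaling those flags per student is a short dplyr summary; here student_courses and the flag columns are hypothetical names:

```r
library(dplyr)

# One row per student per course taken, carrying the binary quality flags.
quality_counts <- student_courses %>%
  group_by(student_id) %>%
  summarise(
    courses_above_avg_posts        = sum(above_avg_posts, na.rm = TRUE),
    courses_above_avg_interactions = sum(above_avg_interactions, na.rm = TRUE)
  )
```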

The final, and perhaps most important, method for maintaining model accuracy while still leaving time to intervene with students before dropout was to perform the analysis at early points in the semester. The two time periods examined were four weeks and six weeks into the semester of enrollment. These were chosen to determine whether there was a large difference in predictive power at week 4 versus week 6, while still allowing the appropriate parties enough time to reach out to the student and intervene.
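
One way to build these early-semester subsets (a sketch, assuming a hypothetical event-level table lms_events with an event_date column and a known semester_start date):

```r
library(dplyr)
library(lubridate)

# Keep only activity from the first four (or six) weeks of the term,
# then rebuild the features above from each restricted window.
week4_activity <- lms_events %>% filter(event_date < semester_start + weeks(4))
week6_activity <- lms_events %>% filter(event_date < semester_start + weeks(6))
```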

Model Selection and Accuracy Examination

Preliminarily, three models were chosen for the analysis: boosting (using XGBoost), random forests, and logistic regression.
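
A rough sketch of fitting the three candidates (the training frame train and its dropped_out outcome are hypothetical names, and the hyperparameters shown are illustrative, not tuned values):

```r
library(xgboost)
library(randomForest)

features <- setdiff(names(train), "dropped_out")
x <- as.matrix(train[, features])   # assumes all features are numeric
y <- train$dropped_out              # 1 = dropped out the following semester

# Gradient boosting with XGBoost.
xgb_fit <- xgboost(data = x, label = y, objective = "binary:logistic",
                   nrounds = 200, eta = 0.1, max_depth = 4, verbose = 0)

# Random forest (classification on a factor outcome).
rf_fit <- randomForest(x = train[, features], y = as.factor(y), ntree = 500)

# Logistic regression baseline.
lm_fit <- glm(dropped_out ~ ., data = train, family = binomial())
```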

To determine which model was most accurate, we not only examined raw accuracy rates among the three, but, more importantly, performed ROC analysis and compared the AUC (area under the ROC curve). This shows which models perform better across all thresholds rather than only at the base .5 cutoff used for classification algorithms. This is paramount to our rank-order analysis, as the default cutoff of .5 may exclude some students who are in the top percentile of at-risk individuals. Selecting the model with the greatest AUC removes this possibility and provides valuable insight into predictive power across different thresholds.

Prediction Results

Let’s examine some of the results from our different predictive models. Keep in mind that model performance here isn’t measured by raw accuracy at a single cutoff, but by evaluating the AUC. The model with the greatest AUC is the one we should select in order to have the best accuracy over differing cutoffs.

Examining the accuracy-by-cutoff graph, the XGBoost model appears to reach higher accuracy at a lower cutoff than the random forest. So should we identify an optimal cutoff rate? As mentioned before, the cutoff rate is ancillary and only needed if we really want to impose one; a rank-order analysis is optimal for our purposes here.

Now we can examine the most important aspect, the area-under-curve analysis. Examining the graph below, the logistic model clearly trails the other two in predicting whether a student will drop. Looking at the XGBoost model (blue line) and the random forest model (green line), it’s difficult to tell which does better overall, even between the four- and six-week subsets. The logistic regression in both subsets is overtaken by both the random forest and the boosting models.
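
Plots like these can be produced with ROCR; here is a sketch, assuming hypothetical held-out predicted probabilities xgb_prob, rf_prob, lm_prob and observed 0/1 outcomes actual:

```r
library(ROCR)

pred_xgb <- prediction(xgb_prob, actual)
pred_rf  <- prediction(rf_prob,  actual)
pred_lm  <- prediction(lm_prob,  actual)

# Accuracy across cutoffs (the accuracy-by-cutoff comparison above).
plot(performance(pred_xgb, "acc"), col = "blue")
plot(performance(pred_rf,  "acc"), col = "green", add = TRUE)

# ROC curves: true positive rate vs. false positive rate at every threshold.
plot(performance(pred_xgb, "tpr", "fpr"), col = "blue")
plot(performance(pred_rf,  "tpr", "fpr"), col = "green", add = TRUE)
plot(performance(pred_lm,  "tpr", "fpr"), col = "red",   add = TRUE)
abline(0, 1, lty = 2)  # reference line for random guessing
```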

We can compute the actual area under the curve using the tools included in the ROCR and pROC packages (a short sketch follows the table). Below are the respective AUC values for each model, separated into four-week and six-week subsets:

Area Under Curve, Four Week

  XGB AUC     RF AUC      LM AUC
  0.8022574   0.7968702   0.7399977

Area Under Curve, Six Week

  XGB AUC     RF AUC      LM AUC
  0.8182148   0.816157    0.7528038
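
For reference, the AUC values can be pulled out with either package; a sketch using the same hypothetical prediction vectors as above:

```r
library(ROCR)
library(pROC)

# AUC via ROCR: build a prediction object, then extract the "auc" measure.
performance(prediction(xgb_prob, actual), "auc")@y.values[[1]]

# AUC via pROC: build an ROC object directly from outcomes and scores.
auc(roc(actual, rf_prob))
```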

The difference between the XGBoost and random forest AUCs is minimal in both the four- and six-week subsets, showing that neither boosting nor random forests is significantly better at predicting student dropouts. For purposes of interpretability and implementation, we chose the random forest for this problem. Its one drawback is computational time and power, but for this project that isn’t an issue. To allow ample intervention time while retaining predictive power, the four-week random forest model was chosen.

Final Predictive Example and Variable Importance

Now let’s look a bit deeper into our random forest. The first thing to examine is the mean predicted value. The mean of our predictions is 0.0368474, meaning the average student has approximately a 3.7% predicted chance of dropping out. This 3.7% can be thought of as the baseline risk of large life events occurring, or any other unseen circumstances that lead to a dropout. So we were correct to assume the .5 base threshold for classification would be of little use to us.

Let’s take a look at our most important random forest variables:
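
With the randomForest package, the importance measures can be listed or plotted directly from the fitted forest; a sketch using the hypothetical rf_fit object from the earlier snippet:

```r
library(randomForest)

# Mean decrease in Gini for each feature, sorted most important first.
imp <- importance(rf_fit)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)

# Or as the familiar importance plot.
varImpPlot(rf_fit, n.var = 15)
```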

Wow! Login count seems to be incredibly important when predicting dropout. This makes sense, as a student who isn’t logging in much would be prone to dropping out since they are not turning in assignments or checking messages.

Some of the other important variables are:

There seems to be a trend that raw interaction counts are not as predictive as the fraction of a course’s interactions that a student or teacher is responsible for. A student with a higher average fraction of total interactions across more courses is much less likely to be at risk of dropping out, because they are involved and active in their classes. This tells us that even if a student has high total interactions, those who account for a low fraction of the total may be more at risk than average.

The correct implementation of this project is the most important aspect. Most institutions don’t have the resources to contact everyone who is above average risk for dropout, so providing a sorted list of those most at risk is incredibly useful. Some may prefer a rank order, starting from number one (most at risk) and descending from there.

To visualize the output of the algorithm, we’ll sort the predictions from greatest to smallest. The top students are most at risk for dropout within the first two weeks of the following semester.
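
A sketch of the scoring and rank ordering (again using hypothetical names: rf_fit and features from the earlier snippet, and a newdata frame of current students):

```r
# Predicted dropout probability per student, ranked most to least at risk.
scores <- data.frame(
  user_id    = newdata$user_id,
  prediction = predict(rf_fit, newdata[, features], type = "prob")[, "1"]
)

mean(scores$prediction)                        # average predicted risk
ranked <- scores[order(-scores$prediction), ]  # rank order, most at risk first
head(ranked, 10)                               # the top of the list, as below
```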

           Prediction    User ID
  9933     0.8246667      2.146448e+17
  15402    0.8080556      5.630213e+17
  10414    0.8061111      2.437168e+17
  6292     0.7619444     -4.992332e+17
  9011     0.7458889      1.560826e+17
  1477     0.7074444     -1.950927e+17
  7060     0.6674444     -5.490151e+17
  7696     0.6458333     -7.262097e+16
  12525    0.6196667      3.843185e+17
  5294     0.6182222     -4.358169e+17

As we can see, the predicted probabilities decrease quickly, and few students score above .5. If we imposed the .5 threshold, we would miss many at-risk students who are well above our 3.7% average dropout risk but still below .5.

Further Analysis and Improvements

This project is in its preliminary implementation, meaning we have no results yet showing how the algorithm performs in real time. These results will likely be available at the conclusion of this enrollment term (May 2019). The question of how to intervene is a much more administration-oriented problem; in this respect, we’ve provided the best currently available analysis to support retention. After the end of the current enrollment term, the model and its performance will be re-assessed and tuned as needed.

As far as improvements to the model, additional explanatory data would be very useful. At the time of implementation, data access was limited. Things that would be incredibly informative are as follows:

These might seem like obvious variables to include (and they are), but due to administrative blocks and delays, this data was not made available for this project. We are currently working to obtain it and will update the model at the appropriate time.

Another important factor to consider when prioritizing intervention is class standing. For instance, the economic and financial benefit of retaining a freshman at risk of dropout after their first semester is considerably less than that of a senior about to complete their final semester at the institution. Retaining a student so close to graduation not only provides additional revenue in the current year, but a diploma bearing the institution’s name is worth much more in referrals, alumni donations, and many other things.

Thank you for reading, any and all questions can be directed to [email protected].