Introduction
This paper revisits our previous work on student retention, which used data from online learning management systems at higher education institutions. That paper centered on the design, construction, evaluation, and implementation of a machine learning algorithm tasked with predicting each student's risk of dropping out of their institution prematurely. The previous algorithm used current Learning Management System (LMS) data, ranging from student conversation messages and participation metrics to instructor evaluation metrics such as message response time, grading turnaround time, and interactions with students.
The revisiting and addition of data is a direct result of the first algorithm's accuracy. After running the algorithm and obtaining the predicted scores, we compared the 500 most at-risk students against enrollments for the following semester, which yielded a striking finding: nearly one out of every two students in that top 500 did not enroll for the following semester. This was immensely encouraging, as it demonstrated that this avenue of predictive analytics was both valuable and viable, so naturally we set out to make the algorithm even more accurate and comprehensive.
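The follow-up check described above is straightforward to express in code. Below is a minimal sketch, assuming a pandas DataFrame of predictions with hypothetical columns `risk_score` and `enrolled_next_semester` (neither name comes from the original data dictionary).

```python
# Minimal sketch of the top-500 follow-up check; column names are hypothetical.
import pandas as pd

def top_k_non_enrollment_rate(df: pd.DataFrame, k: int = 500) -> float:
    """Return the share of the k highest-risk students who did not re-enroll."""
    top_k = df.sort_values("risk_score", ascending=False).head(k)
    return 1.0 - top_k["enrolled_next_semester"].mean()

# Example: a value near 0.5 corresponds to the "almost 1 in 2" finding above.
# rate = top_k_non_enrollment_rate(predictions_df, k=500)
```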
Since the previous paper and its revisions, new data was extracted from the institution's database, specifically relating to pre-college behavior and academic history. These fields included, but were not limited to:
- Student Age
- Student Marital Status
- Student ACT/SAT Scores
- Student Enrollment Status
Much more was included in addition to the metrics listed above. These new metrics are vital for identifying at-risk behavior because of their potential correlation with current study habits and academic performance in college. Identifying students who are struggling in college and who also struggled in high school or on the ACT, together with specific demographic characteristics, allows for a more refined risk prediction and a clearer view of which groups need additional support.
In addition, this information supports better trend analysis, allowing the algorithm to identify students whose academic performance has improved since before college. Compared to their peers, these students may exhibit less risk of dropping out because of their improved performance and growing confidence in their ability to succeed in college. Pairing pre-college information with current academic performance data is key to this perspective.
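One way such a trend signal could be derived is sketched below. This is only an illustration under stated assumptions: the column names `hs_gpa` and `college_gpa` are hypothetical, and the actual feature construction used in the algorithm may differ.

```python
# Hedged sketch of a pre-college vs. current-performance trend feature.
import pandas as pd

def add_improvement_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Flag students whose standardized college GPA exceeds their standardized
    high school GPA, i.e. students who appear to be improving."""
    out = df.copy()
    hs_z = (out["hs_gpa"] - out["hs_gpa"].mean()) / out["hs_gpa"].std()
    college_z = (out["college_gpa"] - out["college_gpa"].mean()) / out["college_gpa"].std()
    out["gpa_trend"] = college_z - hs_z
    out["improving"] = (out["gpa_trend"] > 0).astype(int)
    return out
```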
This data was collected for every semester at the institution dating back four years, which provided sufficient history for constructing the retention algorithm. This paper walks through the addition of the features from the new data, the methods used to improve the algorithm, and the rationale behind those methods.
Data Preparation
The data contained in the institution's LMS was extensive; the main information extracted from it covered student-instructor interaction, student participation, and quality of instruction in the online space. Previous research shows that increased online interaction and participation from instructors has substantial benefits for KPIs such as lower course drop rates, higher student perception of course quality, and increased likelihood of retention.
Through this data, we effectively capture a student's current academic characteristics and behavior. This is what makes the algorithm unique and beneficial: the dataset is dynamic up to the minute, capturing extensive patterns and changes over time. The addition of the pre-college data only strengthens the algorithm's ability to identify those patterns and changes.
The pre-college data was vast, as the admission process at higher education institutions is quite extensive. The key to wrangling this data was ensuring that only the relevant features were used and that they were constructed in the dataset so they could be consumed by multiple types of models. The main concern was the XGBoost package, which requires factor variables to be one-hot encoded into a numeric matrix, so dimensionality reduction was key.
Developing the new training set involved converting all factor variables into binary columns, known as one-hot encoding. This produced numerous columns, so it was important to identify which factors were likely to matter and to remove those that were not. This helped with dimensionality reduction, but the dataset was still very large, ending up with 521 variables. Given the algorithms used, this may seem like a problem, but XGBoost and Random Forest both handle one-hot encoded variables well, and leaving factor structures out of the equation speeds up training.
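A minimal sketch of this encoding step is shown below, assuming the raw features live in a pandas DataFrame. The function name and the example drop list are placeholders, not names from the actual pipeline.

```python
# Sketch of the one-hot encoding step; names are hypothetical.
import pandas as pd

def build_training_matrix(raw: pd.DataFrame, drop_cols: list) -> pd.DataFrame:
    """One-hot encode every categorical (factor) column after dropping features
    judged unlikely to matter, leaving a purely numeric matrix for XGBoost."""
    trimmed = raw.drop(columns=drop_cols)
    categorical = trimmed.select_dtypes(include=["object", "category"]).columns
    return pd.get_dummies(trimmed, columns=list(categorical), dtype=int)

# X = build_training_matrix(raw_features, drop_cols=["application_essay_id"])
```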
Algorithm Testing and Ensembling
As in the previous round of algorithm training, three algorithms were tested to determine which best fit the task at hand. Retention is a difficult problem to begin with, so it is reasonable to assume that an ML algorithm would also struggle to predict a dropout with certainty. Preliminary testing confirmed this: only about 10 to 15 students out of 18,000 received a predicted score above 0.50 in the regression testing. For this reason, Area Under the Curve (AUC) was chosen as the metric for selecting the proper algorithm; the algorithm with the largest AUC would best suit our needs.
The three algorithms chosen to test were as follows:
- A basic linear regression model, using matrix algebra to get a quick and simple baseline model
- A Random Forest model, chosen for its robustness and strong out-of-the-box performance on tabular data. Random Forests also work very well with factor variables and one-hot encoded variables.
- A boosting algorithm using the popular XGBoost package, chosen for its speed on large datasets and its proven accuracy in numerous machine learning competitions.
Examining these three algorithms provides broad coverage of different methodologies, helping ensure we get the best results possible. Each algorithm was tasked with a regression-style output, as we care about ranking students by their probability of dropping out rather than predicting with certainty whether they will.
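The sketch below shows how the three candidates could be fit to produce continuous risk scores. It assumes a numeric feature matrix `X_train` and a 0/1 retention label `y_train`; the hyperparameter values are illustrative assumptions, not the settings used in the actual study.

```python
# Sketch of fitting the three candidate models; hyperparameters are assumptions.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

def fit_candidates(X_train, y_train):
    """Fit the baseline linear model, the random forest, and the XGBoost model,
    each producing a continuous risk score rather than a hard class label."""
    models = {
        "lm": LinearRegression(),
        "rf": RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0),
        "xgb": XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=0),
    }
    for model in models.values():
        model.fit(X_train, y_train)
    return models
```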
After running the algorithms, because their performance was so closely matched, an ensemble method was used to exploit the areas where each algorithm outperformed the others. By ensembling all three algorithms, we combined the strengths of each to increase the AUC of the final model.
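A minimal sketch of the ensembling and AUC comparison follows, building on the models from the previous sketch and a held-out set `(X_test, y_test)`. A simple average of the three risk scores is used here as an assumption; the actual ensemble weighting may differ.

```python
# Sketch of ensembling by averaging predictions and comparing AUC values.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_with_ensemble(models: dict, X_test, y_test) -> dict:
    """Score each model on the held-out set, average the predictions into an
    ensemble score, and report the AUC of all four."""
    preds = {name: model.predict(X_test) for name, model in models.items()}
    preds["ens"] = np.mean(list(preds.values()), axis=0)
    return {name: roc_auc_score(y_test, p) for name, p in preds.items()}
```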
Shown below are the evaluation plots for each algorithm.
As is evident, the ensemble algorithm (orange) has the best aggregate performance of all the algorithms. This is a prime example of a situation in which ensembling clearly improves performance.
Below the AUC of each algorithm is shown.
This plot gives a more visual comparison of the ensemble method against the other algorithms. Its curve far surpasses theirs, and it reaches a high level of accuracy at a low cutoff value. The model quickly generalizes to a high accuracy point, which is also visible in the evaluation plot shown previously.
Our assumptions about the AUC are confirmed by examining the actual AUC values below:
| XGB AUC  | RF AUC    | LM AUC    | ENS AUC   |
|----------|-----------|-----------|-----------|
| 0.720697 | 0.6476469 | 0.6734606 | 0.9137814 |
The AUC for the ensemble is substantially higher than that of the other algorithms, again a model case for ensembling. The previous algorithm, trained on LMS data alone, had a maximum AUC of 0.86; ensembling with the new data yields roughly a five-percentage-point increase in AUC. That may not sound like much, but at these performance levels and with this amount of data, five points is a substantial gain. For our purposes this also matters because the higher the AUC, the more reliably the most at-risk students are ranked at the top.
Key Features and Implementation
A few key features made this algorithm so successful, and it is worth exploring the factors that most boosted its accuracy:
- Bagging and Average Interactions by Department
- Each department has its own average for online interactions with students; for instance, Economics & Finance may average 4.32 interactions per student, while Arts & Humanities may average only 2.34. Using this information, we compared each course's interactions against its department average: courses above the average received a 1 and courses below received a 0. This produced a set of binary flags marking the ‘high quality’ courses each student had taken, which were then aggregated by student into a count of ‘high quality’ courses enrolled in (a sketch of this feature appears after this list). The higher the count, the lower the likelihood of dropping out. With more analysis in the future, this may help us identify key bottlenecks in the structure of specific majors, allowing remedies to be put in place and resulting in long-term success for those students.
- High School GPA
- The data provided included each student's high school GPA. This information is invaluable because it gives a clear picture of the student's habits and behavior before they attended the institution. Coupled with ACT/SAT scores, it can also serve as a proxy for potential scholarship groups (this institution's scholarship index is generated from those two metrics). In addition, when paired with current academic behavior, high school GPA helps the algorithm identify students who appear to be improving in their college career relative to their past performance.
- International and Out-of-State
- Another very helpful metric is where each student originated from. In this institution's state, students who reside in-state receive heavily discounted tuition compared to students from out of state. The difference is large enough that out-of-state students tend to show more commitment to remaining at the institution because of their greater financial investment. The same principle applies to international students, and providing this information allows the algorithm to distinguish between students paying in-state and out-of-state tuition.
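The department-average interaction feature from the first bullet could be constructed as sketched below. The column names (`department`, `student_id`, `interactions`) are hypothetical stand-ins for the actual LMS fields.

```python
# Hedged sketch of the department-average 'high quality' course feature.
import pandas as pd

def high_quality_course_counts(enrollments: pd.DataFrame) -> pd.Series:
    """Flag courses whose interaction count exceeds their department average,
    then count the 'high quality' courses each student has enrolled in."""
    df = enrollments.copy()
    dept_avg = df.groupby("department")["interactions"].transform("mean")
    df["high_quality"] = (df["interactions"] > dept_avg).astype(int)
    return df.groupby("student_id")["high_quality"].sum()
```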
Many more features could be discussed as important, but the most encompassing explanation of the rationale behind constructing the dataset is this: the more comprehensive the picture of each student, the better the algorithm. As the algorithm is improved, this will include further data, from attendance at sporting events and student ID card swipes across campus to tutoring center usage. With this integration of information, the algorithm can potentially identify the bottlenecks and characteristics of students who are most likely to drop out.
By using this algorithm and recording the results of the outreach programs, different models can be tested and trained in the future. These may include multi-level models that further segment the prediction groups, potentially allowing targeted outreach to students at risk of dropping out for financial, academic, or social reasons.
Any questions or comments can be directed to [email protected]