Thanks for waiting! Here you'll find an explanation of all the content on the pages of this website. I'll outline what I'm doing now as well as what I plan to do in the future.
Pre-Game Prediction Model 1.0: Oliver
Oliver was built on data from games from 2014/15 to 2018/19. As each season concludes, I will add the data to our training set to improve Oliver for the next season. The data is scraped and gathered from College Hockey News (www.collegehockeynews.com). I owe them a huge amount of thanks, as the data they put on their website would not be available to the public without them purchasing it from the NCAA. Their articles are top-notch and the coverage they provide for the sport is unparalleled. I also based Oliver on the amazing model created by Peter Tanner at www.moneypuck.com. Please give his website a look for NHL content.
Oliver is designed to predict the winning percentage of teams based on several factors. The first factor is a team's season Close Corsi For %. Boiled down, this stat measures how much a team out-shoots its opponents over the course of a season when the game is tied or within one goal. This was the most predictive shot-volume metric to use in Oliver, more so than overall Corsi For %.
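For the formula-minded, the calculation is just a team's close-state shot attempts divided by all close-state shot attempts in its games. Here's a tiny Python sketch (the counts are made up for illustration):

```python
def close_corsi_for_pct(close_cf, close_ca):
    """Close Corsi For %: share of all shot attempts a team generates
    while the score is tied or within one goal.

    close_cf -- the team's shot attempts (Corsi For) in close situations
    close_ca -- opponents' shot attempts (Corsi Against) in close situations
    """
    return 100.0 * close_cf / (close_cf + close_ca)

# A team that took 550 of the 1,000 close-state shot attempts in its games:
print(close_corsi_for_pct(550, 450))  # 55.0
```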
The second factor I use is my calculated adjRating for each team. At its core, this is a goal efficiency metric that I created to replace Expected Goals. Since the NCAA has no standardized shot tracking beyond quantity (no shot locations, for example), there's no way to build an expected goals model the way many have in the NHL. What I wanted my goal efficiency metric to capture is one useful component of expected goals: how good a team is at scoring relative to its shot volume. To do this, I borrowed from Ken Pomeroy, who does college basketball stats at www.kenpom.com. I calculate a raw offensive rating (OR) and defensive rating (DR) for each team for each game: Goals For (or Against) / Corsi For (or Against) * 100. These are then averaged over the season to get a team's season offensive and defensive ratings. Finally, I use ridge regression to adjust the ratings for strength of schedule, producing what you see on my page. To interpret this, a team's adjusted OR is how many goals we would expect it to score per 100 shot attempts, and its adjusted DR is how many goals we would expect it to give up per 100 shot attempts against. The final adjRating is a team's adjOR minus its adjDR.
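To give a concrete (and very simplified) picture of the adjustment step, here's a sketch of one common way to set up that ridge regression. The team names and game lines are made up, and this illustrates the approach rather than my exact code: each team-game's raw OR is regressed on indicator columns for the attacking team's offense and the defending team's defense, and the ridge penalty keeps ratings stable when the schedule data is thin.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical team-games: (offense, defense, goals for, Corsi for)
games = [
    ("A", "B", 3, 60), ("B", "A", 2, 50),
    ("A", "C", 4, 55), ("C", "A", 1, 45),
    ("B", "C", 2, 52), ("C", "B", 2, 48),
]
teams = sorted({g[0] for g in games})
idx = {t: i for i, t in enumerate(teams)}
n = len(teams)

# One row per team-game: +1 in the attacking team's offense column,
# +1 in the defending team's defense column; target is the raw OR.
X = np.zeros((len(games), 2 * n))
y = np.zeros(len(games))
for r, (off, dfn, gf, cf) in enumerate(games):
    X[r, idx[off]] = 1.0
    X[r, n + idx[dfn]] = 1.0
    y[r] = 100.0 * gf / cf  # raw offensive rating for this game

model = Ridge(alpha=1.0).fit(X, y)

# Schedule-adjusted ratings: intercept (league average) plus each
# team's offense/defense coefficient.
adj_or = {t: model.intercept_ + model.coef_[idx[t]] for t in teams}
adj_dr = {t: model.intercept_ + model.coef_[n + idx[t]] for t in teams}
adj_rating = {t: adj_or[t] - adj_dr[t] for t in teams}
```

The nice property of this setup is that a team's rating is judged against the specific opponents it faced, so a gaudy raw OR built against weak defenses gets pulled back toward the league average.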
The final factors I use in Oliver are simple: the two components of PDO, team shooting percentage and team save percentage. While these metrics are poor for evaluating a team's true strength, since they are primarily luck-driven, they are very predictive when calculating win probabilities, lending credence to the old adage that you have to be lucky to win.
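Since PDO sometimes confuses people, here's the arithmetic in a short Python sketch (the shot and goal totals are made up):

```python
def pdo(goals_for, shots_for, goals_against, shots_against):
    """PDO = team shooting % + team save %, scaled to roughly 100.
    Values well above 100 usually signal good luck; well below, bad luck."""
    shooting_pct = goals_for / shots_for
    save_pct = 1.0 - goals_against / shots_against
    return 100.0 * (shooting_pct + save_pct)

# A team scoring on 10% of its shots while stopping 92% of shots against:
print(pdo(10, 100, 8, 100))  # ≈ 102.0
```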
With all these factors, I ran a multivariate regression with a team's win percentage as the target variable to calculate my coefficients. The specific coefficients will be published here over the summer as I complete the rewrite of my code. I then apply the coefficients to the current season's factors for each team to calculate a predicted winning percentage. This "Expected Winning Percentage" has proven more predictive of future wins than a team's actual current winning percentage. I rank teams by their "Expected Winning Percentage," which can also be used to see which teams are over-performing or under-performing their underlying play.
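As a simplified sketch of that regression step (the training rows below are made up, since the real coefficients aren't published yet), the idea looks roughly like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical team-seasons. Columns: Close Corsi For %, adjRating,
# shooting %, save %. Target: that season's actual win percentage.
X = np.array([
    [55.0,  1.8, 10.5, 92.0],
    [52.0,  0.6,  9.8, 91.0],
    [51.0,  0.2,  9.5, 91.5],
    [49.0, -0.4,  9.2, 90.8],
    [48.0, -0.9,  9.0, 90.5],
    [46.0, -1.5,  8.7, 90.0],
    [45.0, -2.1,  8.4, 89.5],
])
y = np.array([0.68, 0.57, 0.52, 0.48, 0.45, 0.40, 0.35])

model = LinearRegression().fit(X, y)

# Plug in a team's current-season factors to get its
# "Expected Winning Percentage".
expected_win_pct = float(model.predict([[53.0, 1.0, 10.0, 91.2]])[0])
```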
Due to my novice coding abilities, I have had to manually track the results of my model, which I started in the middle of December. As of this writing, the model has gone 261-128 (67%) with 62 ties. The win probabilities I calculate are each team's percent chance to win, so the model assumes a tie can't occur (which, of course, it can). To calculate the probability of a tie, I would have to rework the whole model, especially since a tie requires a 65-minute game (regulation plus overtime) rather than the normal 60. Below is a look at the expected win percentage of each team vs. its actual win percentage as of 2/25/20:
On my website you can find my predictions for the next day's worth of games. I use each team's expected winning percentage to calculate the probability of each team winning the game. The graphic is fairly simple to decipher: the team with the higher probability is highlighted in green, and the team with the lower probability is highlighted in red. I want to store my data in a database so that probabilities can be updated and shown in real time, and so you could look at more games than just the next day's, like what you'll find at MoneyPuck. Unfortunately, I am nowhere near competent enough to code something like that, so it will have to be a future project.
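For the curious, a classic way to turn two teams' winning percentages into a single-game probability is the log5 formula from baseball analytics. I'm not claiming this is my exact calculation, but it's a good illustration of this kind of head-to-head conversion:

```python
def log5(p_a, p_b):
    """Log5: probability that team A beats team B, given each team's
    winning percentage against average opposition."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2.0 * p_a * p_b)

# A 60% team against a 50% (average) team should win 60% of the time:
print(log5(0.60, 0.50))  # ≈ 0.60
```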
These charts were made in Tableau from the data I scraped from CHN. Because no time-on-ice data is available at the team level, all team charts use data from the whole game and all game states. All player charts, on the other hand, use only even-strength data, as this gives a better indication of a player's overall ability. I am always looking for ideas for more charts, so feel free to suggest them! Also, if you have any questions about interpreting the charts, please send me those inquiries.
Future Work and Ideas
The first thing I have planned is to rewrite my code base this offseason so I can update the website more efficiently. I cobbled together the current version, so there are many things I can improve to cut down the time spent on the front end. I also want to add women's rankings. Currently my model is only used for the men, which shows on my rankings and predictions pages; the only place women's teams and players show up is in the charts. When I set out to do this, I wanted to provide equal coverage to both the men and the women, something traditional media outlets do not. Unfortunately, with NEWHA's transition from DII to DI, certain teams that are important for scraping and calculating the full rankings don't show up on CHN. When I reached out about this, I was told that while the conference is DI, the 4 missing teams were still considered DII, which led to their exclusion from certain stats pages. CHN has asked the NCAA for clarification, and hopefully we can get this fixed soon. As soon as those teams are added, I can implement my full model for the women's game, which will be fantastic!
I also want to improve my model by adding a home-ice advantage component. First, I want to measure the effect of home ice and see whether it's large enough to be worth adding. I also want to adjust my process to weight recent results more heavily. A hockey season is long, and teams change the way they play over its course; I think capturing that will make the model more predictive. Finally, I want to get better at coding so I can switch to database-style storage rather than a bunch of CSV files lying around. That will make it easier to look at the model's past results and to run queries for fun, such as finding the biggest upsets of the season or the most one-sided games.
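As a sketch of what that recency weighting could look like (the half-life here is an arbitrary choice, not something I've tuned), an exponentially weighted average of game-level ratings would do the job:

```python
import math

def recency_weighted_mean(values, half_life=10.0):
    """Average game-by-game ratings in chronological order, halving a
    game's weight for every `half_life` games between it and the most
    recent game."""
    n = len(values)
    weights = [math.pow(0.5, (n - 1 - i) / half_life) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# The later game counts more, so the result leans past the plain mean of 5:
print(recency_weighted_mean([0.0, 10.0], half_life=1.0))  # ≈ 6.67
```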
Thanks for reading! Again, if you have any questions or suggestions, feel free to go to the Contact page to get in touch! Also, take a look at my #CBJHAC slides below for an abridged version of this post!