Exploring and Cleaning our Data

Loading the Data

PFFScoutingData <- read_csv("/Users/nickkondo/OneDrive - Chapman University/R/NFL/Datasets/PFFScoutingData.csv")
players <- read_csv("/Users/nickkondo/OneDrive - Chapman University/R/NFL/Datasets/players.csv")
games <- read_csv("/Users/nickkondo/OneDrive - Chapman University/R/NFL/Datasets/games.csv")
plays <- read_csv("/Users/nickkondo/OneDrive - Chapman University/R/NFL/Datasets/plays.csv")
tracking2018 <- read_csv("/Users/nickkondo/OneDrive - Chapman University/R/NFL/Datasets/tracking2018.csv")
tracking2019 <- read_csv("/Users/nickkondo/OneDrive - Chapman University/R/NFL/Datasets/tracking2019.csv")
tracking2020 <- read_csv("/Users/nickkondo/OneDrive - Chapman University/R/NFL/Datasets/tracking2020.csv")

Checking NA Values

Let’s take a look into our data and see how many NA values there are in our data set across variables.

sort(colSums(is.na(tracking2018)))
##          time             x             y             s             a 
##             0             0             0             0             0 
##           dis         event   displayName          team       frameId 
##             0             0             0             0             0 
##        gameId        playId playDirection             o           dir 
##             0             0             0        555537        555537 
##         nflId  jerseyNumber      position 
##        555537        555537        555537

Finding the Display Name for with NA Values

We have 555,537 NA values under a few variables including nflId, jersey number, and position. nflID is a unique ID for every player in the NFL. This means that we have a handful of observations where there is no player. What is accounting for these NA values?

table(tracking2018[is.na(tracking2018$nflId), "displayName"])
## 
## football 
##   555537

Filtering Our Data Set

There are 555,537 observations where the football is being tracked rather than a player. To clean up our data set so we only have observations of the players being tracked, let’s remove the observations with ‘football’ as the display name. We will also use filter so that we are returning only the frames of the ‘kickoff’ and ‘kick_received’. Now let’s see how many NA values we have.

2018

kick18 <- tracking2018 %>%
    # Filtering so the football is removed
filter(displayName != "football")

# Filtering so we only see the frames kick received
kick18 <- kick18 %>%
    filter(event == "kick_received")

2019

kick19 <- tracking2019 %>%
    # Filtering so the football is removed
filter(displayName != "football")

# Filtering so we only see the frames of kick received
kick19 <- kick19 %>%
    filter(event == "kick_received")

2020

kick20 <- tracking2020 %>%
    # Filtering so the football is removed
filter(displayName != "football")

# Filtering so we only see the frames of kick received
kick20 <- kick20 %>%
    filter(event == "kick_received")
kick <- rbind(kick18, kick19, kick20)

sort(colSums(is.na(kick)))
##          time             x             y             s             a 
##             0             0             0             0             0 
##           dis             o           dir         event         nflId 
##             0             0             0             0             0 
##   displayName  jerseyNumber      position          team       frameId 
##             0             0             0             0             0 
##        gameId        playId playDirection 
##             0             0             0

Joining Tables

Joining the Tracking, Plays, and PFF Scouting Tables

The plays table and PFFScouting table are joined and the filter function is used to return kickoff plays only. Onside, squib kicks, and punts are removed and stored in an object kickoffs.

# Combining our plays data with our tracking(kick) data
kick_plays <- inner_join(x = kick, y = plays, by = c("gameId", "playId"))

# Combining our tracking and plays data with our PFF Scouting Data
kick_plays_PFF <- inner_join(x = kick_plays, y = PFFScoutingData, by = c("gameId",
    "playId"))

# Filtering out only the kickoff play type
kickoffs <- kick_plays_PFF %>%
    filter(specialTeamsPlayType == "Kickoff") %>%
    # We are going to keep deep, flat, and pooch kicks
filter(kickType == "D" | kickType == "F" | kickType == "P")

# Changing the character variables to factors
kickoffs <- as.data.frame(unclass(kickoffs), stringsAsFactors = TRUE)

head(kickoffs)

Summary Statistics

Number of Unique Plays and Games

How many unique plays and games do we have in our data set now?

length(unique(kickoffs$playId))
## [1] 2115
length(unique(kickoffs$gameId))
## [1] 724

X and Y Dimensions

What are the dimensions of our X and Y coordinates?

min(kickoffs$x)
## [1] 0.76
max(kickoffs$x)
## [1] 119.52
min(kickoffs$y)
## [1] -2
max(kickoffs$y)
## [1] 61.99

Feature Engineering

New Features

Here we are creating new features and explanation behind why these features were created:

yard_result: The yard line the play results in, or the yard line the returner carried the ball to before being tackled. The yard result is going to be the value that we are aiming to predict.

YardResult_Over25: A binary indicator if the yard resulted in the 25 yard line or more. The 25 yard line is where the ball is placed if the play result in a touchback.

We want to make this more actionable and into a simple yes or no decision. On kickoffs, the returner has the decision to catch the ball and return the kick, or the decision to let the ball land in the end zone and result in a touchback.

A touchback results in the ball being placed at the 25 yard line. If the returner expects to run the ball beyond the 25 yard line, he should make the decision to return it. If the returner does not think he can run the ball past the 25 yard line, he should let the ball land in the end zone to receive a touchback (ball placed at 25 yard line). Some things that influence what yard the returner gets to are how far the kick is (or where the returner catches the ball), the hang time of the kick, how fast the returner is, how far the defenders are, and much more. To start, let’s create a new feature, a binary variable ’did the returner reach beyond the 25 yard line or not.

When a returner sets up to catch a ball, the returner has the decision to take the ball out of the end zone and gain yards, or if the ball lands in the end zone it’s ruled as a touchback and the ball is placed at the 25 yard line. For this reason we will consider returns further than the 25 yard line as a success and returns less than the 25 yard line as a failure.

kickto_yardline - The yard line the ball was kicked to, or the yard line the player caught(or dropped) the ball.

Y_position - The position the field a player is. We categorize this into 4 locations: left = left of the hash right = right of the hash center = in between the hashes out of bounds = not in the boundaries of the field

kickto_5yardBin - The 5-yard bin that the ball was kicked to. The yard_result variable is grouped into bin of 5 yards.

The exact yard line a returner catches the ball may not be a great predictor of yard line result because if we separate the yard line by each yard, there are too few observations at each level. A new feature that we will create and test is 5-yard bins. This is more actionable because when a player is looking at the sky catching a ball, he may not know what exact yard line he’s standing on, but he will have a good idea of what 5 yard bin he’s standing in.

kickoffs <- kickoffs %>% 
         # Creating a variable for what yard line the play resulted at 
  mutate(yard_result = (100 - yardlineNumber) - playResult,
         
         # Creating a binary variable for if the ball reached the 25 yard line or not
         YardResult_over25 = ifelse(yard_result > 25 | yard_result == 25,'1','0'),
         
         # Creating a variable for which yard line the ball was kicked to
         kickedto_yardline = 100 - (yardlineNumber + kickLength),
         
         # Creating a variable for Y position on the field
         Y_position = ifelse(y < 23.36667 & y > 0, 'left', 
                      ifelse(y > 29.96667 & y < 53.3333, 'right',
                      ifelse(y > 23.36667 & y < 53.3333, 'center', 
                      ifelse(y < 0 | y > 53.3333, "out of bounds", 'NA')))),
         
         # Creating a variable for 5-yard bins
         kickedto_5yardBin =
           ifelse(kickedto_yardline >= -10 & -5 > kickedto_yardline, "back-half endzone",
           ifelse(kickedto_yardline >= -5 & 0 > kickedto_yardline, "front-half endzone", 
           ifelse(kickedto_yardline >= 0 & 5 > kickedto_yardline, "0-5",
           ifelse(kickedto_yardline >= 5 & 10 > kickedto_yardline, "5-10",
           ifelse(kickedto_yardline >= 10 & 15 > kickedto_yardline, "10-15", 
           ifelse(kickedto_yardline >= 15 & 20 > kickedto_yardline, "15-20",
           ifelse(kickedto_yardline >= 20 & 25 > kickedto_yardline, "20-25",
           ifelse(kickedto_yardline >= 25 & 30 > kickedto_yardline, "25-30",
           ifelse(kickedto_yardline >= 30 & 35 > kickedto_yardline, "30-35",
           ifelse(kickedto_yardline >= 35 & 40 > kickedto_yardline, "35-40",
           ifelse(kickedto_yardline >= 40 & 45 > kickedto_yardline, "40-45",
           ifelse(kickedto_yardline >= 45 & 50 > kickedto_yardline, "45-50",
           ifelse(kickedto_yardline >= 50 & 55 > kickedto_yardline, "50-55",
           ifelse(kickedto_yardline >= 55 & 60 > kickedto_yardline, "55-60",
           ifelse(kickedto_yardline >= 60 & 65 > kickedto_yardline, "60-65","NA")
           )))))))))))))))

# Changing variables to factors
kickoffs <- as.data.frame(unclass(kickoffs),                     
                                  stringsAsFactors = TRUE)

head(kickoffs)

Releveling Factor Variables

Re leveling the kickedto_5yardBin so they are in proper order.


# Releveling the 5-Yard bins
kickoffs <- kickoffs %>%
    mutate(kickedto_5yardBin = fct_relevel(kickedto_5yardBin, "back-half endzone",
        "front-half endzone", "0-5", "5-10", "10-15", "15-20", "20-25",
        "25-30", "30-35", "35-40", "40-45", "45-50", "50-55", "55-60"))

levels(kickoffs$kickedto_5yardBin)
##  [1] "back-half endzone"  "front-half endzone" "0-5"               
##  [4] "5-10"               "10-15"              "15-20"             
##  [7] "20-25"              "25-30"              "30-35"             
## [10] "35-40"              "40-45"

Creating a Table for Returns Only

We only want to use the plays where the returner made the decision to return it. Downed, Fair Catch, Out of Bounds, and Touchback don’t have any outcome to evaluate. Muffed means the player did not catch it cleanly but the player still returned the ball so we will leave that variable in.

table(kickoffs$specialTeamsResult)
## 
##        Muffed Out of Bounds        Return     Touchback 
##          1056            44         57574         14234

returns <- kickoffs %>%
    filter(specialTeamsResult == "Return" | specialTeamsResult == "Muffed")

table(returns$specialTeamsResult)
## 
##        Muffed Out of Bounds        Return     Touchback 
##          1056             0         57574             0

Summary Statistics

For every single kickoff, the ball is usually kicked to the -1.189 yard line. Which is approximately one yard into the end zone.

For every single kickoff return (excludes touchbacks), the average yard result was 25.84, slightly above where the ball would be placed if the play resulted in a touchback.


summary(kickoffs$kickedto_yardline)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10.0000  -3.0000   0.0000   0.8582   4.0000  40.0000

summary(returns$yard_result)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   19.00   24.00   24.89   29.00  100.00

Here is the variable 5 yard bin kicked to compared to the yard result. We can see that where the ball is kicked is correlated in some way with the yard result. This aligns with our hypothesis that the further the ball is kicked, the smaller the yard line the player will likely end up. One thing to notice is when the ball is kicked in the back of the end zone, the highest yard line a returner ran to was the 50 yard line. (100 = score)


ggplot(returns, aes(x = kickedto_5yardBin, y = yard_result)) + geom_boxplot() +
    coord_flip() + xlab("5-Yard Bin Kicked To") + ylab("Yard Result") +
    theme(axis.text = element_text(size = 14), axis.title = element_text(size = 14))

Creating Defenders (Kickoff Team) and Defender Variables

The data set is expanded so that for every player or displayName, there are 11 observations, where 11 defenders and their location, distance, speed, acceleration, and more variables are included. We slice our data to return the 10 closest defenders because these are the players running down the field while the kicker usually lags behind.

# Creating a new variable Opposing team because we eventually want to
# find the distance from the opposing players, not team players

# If a player is on the away team, lets return the home team, else,
# let's return away
returns$opp_team <- ifelse(returns$team == "away", "home", "away")

# Sanity check
table(returns$opp_team == returns$team)
## 
## FALSE 
## 58630

# Creating a data set for the moment of the ball being caught
received <- returns %>%
    filter(event == "kick_received")

# Joining returns2 onto itself where team = opp_team and creating 11
# rows for each player with a defender added

receivedJoin <- left_join(received, received[, c("gameId", "playId", "nflId",
    "displayName", "opp_team", "x", "y", "s", "a", "dis")], by = c(gameId = "gameId",
    playId = "playId", team = "opp_team"))

# Creating a variable for distance from returner or 'ball carrier'
receivedJoin$distance <- (((receivedJoin$x.x - receivedJoin$x.y)^2) + (receivedJoin$y.x -
    receivedJoin$y.y)^2)^0.5

# Using filter to return the only the returner
receivedJoin <- receivedJoin %>%
    filter(receivedJoin$nflId.x == receivedJoin$returnerId)


# We are grabbing the 10 closest defenders because the kicker should
# be removed from the play when evaluating the kickoff team

# Average Yardline Average
received10 <- receivedJoin %>%
    group_by(displayName.x, gameId, playId) %>%
    slice_min(distance, n = 10) %>%
    # Creating average distance, speed, and acceleration variables
mutate(KoTeamAvgX = mean(x.x), KoTeamAvgDist = mean(distance), KoTeamAvgSpeed = mean(s.y),
    KoTeamAvgA = mean(a.y))

head(received10)

Selecting Relevant Variables

Let’s roll this data back up and selecting only the variables that we want.

clean <- received10 %>%
    select(time, x.x, y.x, event, displayName.x, nflId.x, returnerId, position,
        gameId, playId, frameId, playDescription, kickReturnYardage, yard_result,
        YardResult_over25, kickedto_yardline, kickedto_5yardBin, Y_position,
        KoTeamAvgX, KoTeamAvgDist, KoTeamAvgSpeed, KoTeamAvgA, specialTeamsPlayType,
        specialTeamsResult, kickType, hangTime, kickDirectionIntended,
        kickDirectionActual, returnDirectionIntended, returnDirectionActual,
        kickoffReturnFormation)

df <- unique(clean)

Visualizations

Histograms

Frequency of 5-Yard Bins the Ball is Kicked To

Here we are creating plot to show the frequency of where balls were kicked to. We can see that the most balls were kicked the the back half of the end zone, and the frequency drops as we move out of the end zone. The reason for this is because the kicker’s objective is to kick the ball as far down field as possible.

ggplot(kickoffs, aes(x = kickedto_5yardBin)) + geom_histogram(stat = "count",
    fill = "coral") + coord_flip() + xlab("5-Yard Bin Kicked To") + ylab("frequency") +
    theme(axis.text = element_text(size = 16), axis.title = element_text(size = 16)) +
    ggtitle("5-Yard Bin Kicked To (All Kicks)")

Frequency of 5-Yard Bins the Ball is Returned From Here is the frequency of balls returned. Take a look at the back-half endzone, despite the most balls kicked here, players rarely return this ball. This is because the players would rather take a touchback (ball placed at 25 yard line) rather than returning the ball from the back half of the endzone.

ggplot(returns, aes(x = kickedto_5yardBin)) + geom_histogram(stat = "count",
    fill = "coral") + coord_flip() + xlab("5-Yard Bin Kicked To") + ylab("frequency") +
    theme(axis.text = element_text(size = 16), axis.title = element_text(size = 16)) +
    ggtitle("5-Yard Bins (Returns Only)")

Scatter Plots

Let’s take a look at some scatter plots of our variables.

ggplot(data = df, aes(x = kickedto_yardline, y = kickReturnYardage, color = kickedto_yardline,
    alpha = 1/density)) + geom_point(shape = 16, size = 5, show.legend = FALSE,
    alpha = 0.15) + theme_excel(base_size = ) + labs(x = "Yard Line Kicked To",
    y = "Kick Return Yardage") + scale_color_gradient(low = "#0091ff",
    high = "#f0650e") + scale_alpha(range = c(0.05, 0.25)) + theme(axis.text = element_text(size = 14),
    axis.title = element_text(size = 16, face = "bold"))


ggplot(data = df, aes(x = KoTeamAvgDist, y = kickReturnYardage, color = KoTeamAvgDist)) +
    geom_point(shape = 16, size = 5, show.legend = FALSE) + theme_excel(base_size = 16) +
    labs(x = "Avg Distance of Defenders", y = "Kick Return Yardage") +
    theme(axis.text = element_text(size = 14), axis.title = element_text(size = 16,
        face = "bold"))


ggplot(data = df, aes(x = KoTeamAvgSpeed, y = kickReturnYardage, color = KoTeamAvgSpeed)) +
    geom_point(shape = 16, size = 5, show.legend = FALSE) + theme_excel(base_size = 16) +
    labs(x = "Avg Speed of Defenders", y = "Kick Return Yardage") + scale_color_gradient(low = "#32aeff",
    high = "#f2aeff") + theme(axis.text = element_text(size = 14), axis.title = element_text(size = 16,
    face = "bold"))


ggplot(data = df, aes(x = KoTeamAvgA, y = kickReturnYardage, color = KoTeamAvgA)) +
    geom_point(shape = 16, size = 5, show.legend = FALSE) + theme_excel(base_size = 16) +
    labs(x = "Avg Speed^2 of Defenders", y = "Kick Return Yardage") + scale_color_gradient(low = "#32aeff",
    high = "#f2aeff") + theme(axis.text = element_text(size = 14), axis.title = element_text(size = 16,
    face = "bold"))

Removing Outliers

After seeing this I think I’m going to take out plays that were returned more than 50 yards because they are outliers and may be skewing our data.

# Removing outliers

df_NoOutliers <- df %>%
    filter(yard_result < 50)

Building a Linear Model

mod1 <- lm(kickReturnYardage ~ kickedto_yardline + KoTeamAvgDist + KoTeamAvgA +
    Y_position + kickType + specialTeamsResult, data = df)

tab_model(mod1)
  kick Return Yardage
Predictors Estimates CI p
(Intercept) -3.44 -10.53 – 3.65 0.341
kickedto yardline -0.27 -0.40 – -0.14 <0.001
KoTeamAvgDist 0.50 0.36 – 0.63 <0.001
KoTeamAvgA 0.98 -0.30 – 2.27 0.133
Y position [left] -1.66 -2.78 – -0.54 0.004
Y position [right] -1.33 -2.46 – -0.21 0.020
kickType [F] -0.33 -2.36 – 1.71 0.752
kickType [P] -2.80 -4.61 – -0.99 0.002
specialTeamsResult
[Return]
9.01 5.93 – 12.09 <0.001
Observations 2648
R2 / R2 adjusted 0.156 / 0.154

Generating Predictions

# Creating a split with 75% of the data in the training set
df_split <- initial_split(df, prop = 0.75)
df_train <- training(df_split)
df_test <- testing(df_split)

mod2 <- lm(kickReturnYardage ~ kickedto_yardline + KoTeamAvgDist + Y_position +
    kickType + specialTeamsResult, data = df)

# Generating in-sample (training) predictions
preds_test <- predict(mod2, newdata = df_test)

# Generating out-of-sample (testing) predictions
preds_train <- predict(mod2, newdata = df_train)

# Mean Squared Error
get_rmse <- function(true, predictions) {
    sqrt(mean((true - predictions)^2))
}

get_rmse(df_train$yard_result, preds_train)
## [1] 12.01806
get_rmse(df_test$yard_result, preds_test)
## [1] 12.84061

# Test Error > Training Error means model is overfit Test Error <
# Training Error means model is underfit We want test error to ~
# training

Logistic Regression

logit <- glm(YardResult_over25 ~ kickedto_yardline + hangTime + specialTeamsResult +
    KoTeamAvgDist + KoTeamAvgSpeed, family = binomial, data = df)

summary(logit)
## 
## Call:
## glm(formula = YardResult_over25 ~ kickedto_yardline + hangTime + 
##     specialTeamsResult + KoTeamAvgDist + KoTeamAvgSpeed, family = binomial, 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2691  -1.0424  -0.8238   1.2121   2.6227  
## 
## Coefficients:
##                          Estimate Std. Error z value
## (Intercept)              -3.51416    1.34253  -2.618
## kickedto_yardline         0.17254    0.01607  10.739
## hangTime                 -0.03257    0.19401  -0.168
## specialTeamsResultReturn  1.66074    0.45694   3.635
## KoTeamAvgDist             0.10409    0.01512   6.885
## KoTeamAvgSpeed           -0.28369    0.10260  -2.765
##                                      Pr(>|z|)    
## (Intercept)                          0.008856 ** 
## kickedto_yardline        < 0.0000000000000002 ***
## hangTime                             0.866673    
## specialTeamsResultReturn             0.000278 ***
## KoTeamAvgDist                0.00000000000577 ***
## KoTeamAvgSpeed                       0.005691 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3489.4  on 2539  degrees of freedom
## Residual deviance: 3299.2  on 2534  degrees of freedom
##   (115 observations deleted due to missingness)
## AIC: 3311.2
## 
## Number of Fisher Scoring iterations: 4
exp(logit$coefficients)
##              (Intercept)        kickedto_yardline 
##               0.02977291               1.18831836 
##                 hangTime specialTeamsResultReturn 
##               0.96795265               5.26320293 
##            KoTeamAvgDist           KoTeamAvgSpeed 
##               1.10969902               0.75300042

Generating Logistic Predictions Generating predicted probabilities for the test and training sets

preds_trainLogit <- predict(logit, newdata = df_train)

preds_testLogit <- predict(logit, newdata = df_test)

head(preds_testLogit)
##          1          2          3          4          5          6 
##  1.9214100 -1.1234775 -0.5036481 -1.1479345 -0.7216348         NA
head(preds_trainLogit)
##           1           2           3           4           5 
## -0.48579288  0.08489535 -0.36867629 -0.43023333 -0.40863336 
##           6 
## -0.06525577

Generating Results Data Frames

results_train <- data.frame(truth = df_train %>%
    select(YardResult_over25) %>%
    mutate(YardResult_over25 = as.numeric(YardResult_over25)), Class1 = preds_trainLogit,
    type = rep("train", length(preds_trainLogit)))

results_test <- data.frame(truth = df_test %>%
    select(YardResult_over25) %>%
    mutate(YardResult_over25 = as.numeric(YardResult_over25)), Class1 = preds_testLogit,
    type = rep("test", length(preds_testLogit)))

results <- bind_rows(results_train, results_test)

slice(results)

Generating ROC Plots

One each for the test and training sets. Be sure to label the cutoff probabilities along the ROC lines using the cutoffs at

p_train <- ggplot(results_train, aes(m = Class1, d = truth.YardResult_over25)) +
    geom_roc(labelsize = 3.5, cutoffs.at = c(0.99, 0.9, 0.7, 0.5, 0.3,
        0.1, 0)) + theme_minimal(base_size = 16)

p_test <- ggplot(results_test, aes(m = Class1, d = truth.YardResult_over25)) +
    geom_roc(labelsize = 3.5, cutoffs.at = c(0.99, 0.9, 0.7, 0.5, 0.3,
        0.1, 0)) + theme_minimal(base_size = 16)

print(p_train)

print(p_test)

Calculating AUC

Calculating the AUC for the test and training sets using the functions calc_auc

calc_auc(p_train)
calc_auc(p_test)

Elastic Net Model

enet_mod <- cva.glmnet(kickReturnYardage ~ kickedto_yardline + KoTeamAvgDist +
    KoTeamAvgA + Y_position + kickType + specialTeamsResult, data = df,
    alpha = seq(0, 1, by = 0.05))
plot(enet_mod)

enet_mod <- cva.glmnet(kickReturnYardage ~ kickedto_yardline + KoTeamAvgDist +
    KoTeamAvgA + Y_position + kickType + specialTeamsResult, data = df_train,
    alpha = seq(0, 1, by = 0.05))

plot(enet_mod)

# now enet_mod holds a list with all of the sub models, each with
# alpha = whatever sequence the model was estimated with
minlossplot(enet_mod, cv.type = "min")

# Use this function to find the best alpha
get_alpha <- function(fit) {
    alpha <- fit$alpha
    error <- sapply(fit$modlist, function(mod) {
        min(mod$cvm)
    })
    alpha[which.min(error)]
}
# Get all parameters.
get_model_params <- function(fit) {
    alpha <- fit$alpha
    lambdaMin <- sapply(fit$modlist, `[[`, "lambda.min")
    lambdaSE <- sapply(fit$modlist, `[[`, "lambda.1se")
    error <- sapply(fit$modlist, function(mod) {
        min(mod$cvm)
    })
    best <- which.min(error)
    data.frame(alpha = alpha[best], lambdaMin = lambdaMin[best], lambdaSE = lambdaSE[best],
        eror = error[best])
}
# extract the best alpha value and model parameters
best_alpha <- get_alpha(enet_mod)
print(best_alpha)
## [1] 1
get_model_params(enet_mod)
# extract the best model object
best_mod <- enet_mod$modlist[[which(enet_mod$alpha == best_alpha)]]
print(best_mod)
## 
## Call:  glmnet::cv.glmnet(x = x, y = y, weights = ..1, offset = ..2,      nfolds = nfolds, foldid = foldid, alpha = a) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min 0.0746    44   102.6 13.18       7
## 1se 3.0816     4   115.0 12.60       1

summary(enet_mod)
##                 Length Class  Mode     
## alpha           21     -none- numeric  
## nfolds           1     -none- numeric  
## modlist         21     -none- list     
## call             4     -none- call     
## terms            2     -none- call     
## xlev             6     -none- list     
## sparse           1     -none- logical  
## use.model.frame  1     -none- logical  
## na.action        1     -none- character
# Also heres the code for the plot I put in the slides
g <- ggplot(returns, aes(x = kickedto_5yardBin, y = yard_result, color = kickedto_5yardBin)) +
    labs(x = "5-Yard Bin Kicked To", y = "Yard Result")

# Since Boxplots are boring, I made a violin plot instead
g + geom_violin(aes(fill = kickedto_5yardBin), size = 1, alpha = 0.5) +
    geom_boxplot(outlier.alpha = 0, coef = 0, color = "gray40", width = 0.2) +
    scale_fill_brewer(palette = "Dark2", guide = "none") + coord_flip()

Random Forest

NFL_rf_fit2 <- randomForest(YardResult_over25 ~ kickedto_yardline + hangTime +
    kickoffReturnFormation + Y_position + KoTeamAvgDist + KoTeamAvgSpeed,
    data = df, type = classification, mtry = 4, na.action = na.roughfix,
    ntree = 1500, localImp = TRUE)


print(NFL_rf_fit2)
## 
## Call:
##  randomForest(formula = YardResult_over25 ~ kickedto_yardline +      hangTime + kickoffReturnFormation + Y_position + KoTeamAvgDist +      KoTeamAvgSpeed, data = df, type = classification, mtry = 4,      ntree = 1500, localImp = TRUE, na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 1500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 40.98%
## Confusion matrix:
##      0   1 class.error
## 0 1000 451   0.3108201
## 1  637 567   0.5290698
plot(NFL_rf_fit2)

Generating Test Predictions

rf_mods <- list()
oob_err <- NULL
test_err <- NULL
for (mtry in 1:9) {
    NFL_rf_fit2 <- randomForest(YardResult_over25 ~ kickedto_yardline +
        hangTime + kickoffReturnFormation + Y_position + KoTeamAvgDist +
        KoTeamAvgSpeed, data = df, mtry = mtry, na.action = na.roughfix,
        ntree = 500)
    oob_err[mtry] <- NFL_rf_fit2$err.rate[500]

    cat(mtry, " ")
}
## 1  2  3  4  5  6  7  8  9
## 1 2 3 4 5 6 7 8 9

results_DF <- data.frame(mtry = 1:9, oob_err)
ggplot(results_DF, aes(x = mtry, y = oob_err)) + geom_point() + theme_minimal() +
    xlim(1, 9)

varImpPlot(NFL_rf_fit2)

plot_min_depth_distribution(NFL_rf_fit2)

plot_multi_way_importance(NFL_rf_fit2)
## [1] "Warning: your forest does not contain information on local importance so 'accuracy_decrease' measure cannot be extracted. To add it regrow the forest with the option localImp = TRUE and run this function again."


Next Next Steps:

Blockers and Shapley values

Nueral Networks using s, a, dir, x, and y? - Last years expected rushing yards model used just those variables for all 22 players on the field