A professional development soccer team’s goal is to find the best players around the world at the lowest cost. Our goal is to identify undervalued players: players whose performance is high relative to their market price. We will do this by building a model to see which aspects of performance are correlated with market price. Is it goals? Assists? Both? Is there a way for us to create an efficiency metric?
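Season statistics are taken from FBref.com for seasons ending in 2018 through 2020. Market values for the matching seasons (start years 2017 through 2019) are retrieved from Transfermarkt with get_player_market_values.
The data cover the Big 5 leagues (England, Spain, France, Italy, and Germany). The table types included are standard, shooting, passing, passing types, goal creation, defense, possession, playing time, and miscellaneous. The two goalkeeper tables (keepers and keepers_adv) are also downloaded but are not used for now.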
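library(worldfootballR)  # fb_big5_advanced_season_stats(), get_player_market_values()
library(dplyr)           # joins, mutate(), filter(), select(), distinct()
library(tidyr)           # separate()
standard <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
stat_type = "standard", team_or_player = "player")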
shooting <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
stat_type = "shooting", team_or_player = "player")
passing <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
stat_type = "passing", team_or_player = "player")
passingtypes <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
stat_type = "passing_types", team_or_player = "player")
gca <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020), stat_type = "gca",
team_or_player = "player")
defense <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
stat_type = "defense", team_or_player = "player")
possession <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
stat_type = "possession", team_or_player = "player")
misc <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020), stat_type = "misc",
team_or_player = "player")
playingtime <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
stat_type = "playing_time", team_or_player = "player")
keepers <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
stat_type = "keepers", team_or_player = "player")
keepers_adv <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
stat_type = "keepers_adv", team_or_player = "player")
Retrieving the market values
market_values17 <- get_player_market_values(country_name = c("England",
"Spain", "France", "Italy", "Germany"), start_year = 2017)
market_values18 <- get_player_market_values(country_name = c("England",
"Spain", "France", "Italy", "Germany"), start_year = 2018)
market_values19 <- get_player_market_values(country_name = c("England",
"Spain", "France", "Italy", "Germany"), start_year = 2019)
marketvalues <- bind_rows(market_values17, market_values18, market_values19)
Creating a unique identifier
Here we are creating a unique identifier in each table so that, when we merge the tables, rows are matched on a single unique key. We combine player name, season end year, and league competition to form PlayerYearComp_id. Because the market values table records the season start year, we add 1 to it so that it lines up with Season_End_Year in the statistics tables. The tables are then combined into one large data set with all of our variables. For now we are excluding goalkeepers.
marketvalues <- marketvalues %>%
mutate(Season_End_Year = season_start_year + 1, PlayerYearComp_id = paste(player_name,
Season_End_Year, comp_name))
standard <- standard %>%
mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))
shooting <- shooting %>%
mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))
passing <- passing %>%
mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))
passingtypes <- passingtypes %>%
mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))
gca <- gca %>%
mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))
defense <- defense %>%
mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))
possession <- possession %>%
mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))
playingtime <- playingtime %>%
mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))
misc <- misc %>%
mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))
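Because the key is built from free-text names, a quick check for keys that fail to match can be useful before joining. The following is a minimal sketch on invented toy data, not the project's actual tables; the player names and the spelling differences ("La Liga" vs "LaLiga") are hypothetical and only illustrate how a mismatch would surface.

```r
library(dplyr)

# Toy stand-in for one statistics table (invented values).
stats_toy <- tibble(
  Player = c("Player A", "Player B", "Player C"),
  Season_End_Year = c(2018, 2018, 2019),
  Comp = c("Ligue 1", "La Liga", "Serie A")
) %>%
  mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))

# Toy stand-in for the market values table, keyed the same way as in the text.
market_toy <- tibble(
  player_name = c("Player A", "Player B", "Player Cee"),
  season_start_year = c(2017, 2017, 2018),
  comp_name = c("Ligue 1", "LaLiga", "Serie A")
) %>%
  mutate(
    Season_End_Year = season_start_year + 1,
    PlayerYearComp_id = paste(player_name, Season_End_Year, comp_name)
  )

# Rows of the statistics table whose key has no partner in the market table.
# These rows would be silently dropped by an inner join.
unmatched <- anti_join(stats_toy, market_toy, by = "PlayerYearComp_id")
print(unmatched$PlayerYearComp_id)
```

Any key listed here points to a spelling or naming difference between the two sources that would need to be reconciled before the join.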
Joining Data Sets
Now that we have a unique identifier, let’s join our statistics tables with our market values table. We keep each joined table separate at first and then combine them step by step. Columns that appear in both tables are handled with a suffix: the copy from the first (left) table keeps its name (suffix ""), and the copy from the second (right) table has .REMOVEDUPLICATE appended. We then remove the variables we do not need.
# Offense
## Standard, Shooting, and GCA
standardMarket <- inner_join(x = standard, y = marketvalues, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
shootingMarket <- inner_join(x = shooting, y = marketvalues, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
gcaMarket <- inner_join(x = gca, y = marketvalues, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
StdShootMkt <- left_join(x = standardMarket, y = shootingMarket, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
StdShootGCAMkt <- inner_join(x = StdShootMkt, y = gcaMarket, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
StdShootGCAMkt <- StdShootGCAMkt %>%
distinct(PlayerYearComp_id, .keep_all = TRUE)
## Passing and Market Values
passingMarket <- inner_join(x = passing, y = marketvalues, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
# Joining Passing Types and Market Values
passingtypesMarket <- inner_join(x = passingtypes, y = marketvalues, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
# Joining Passing, Passing Types, and Market
PassMkt <- inner_join(x = passingMarket, y = passingtypesMarket, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
# Joining Standard, Shooting, GCA, Passing, Market
StdShootGCAPassMkt <- inner_join(x = StdShootGCAMkt, y = PassMkt, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
# Joining Playing time and Market
playingtimeMarket <- inner_join(x = playingtime, y = marketvalues, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
# Joining Standard, Shooting, GCA, Passing, Playing Time, and Market
# Storing this as offense
offense_stats <- inner_join(x = StdShootGCAPassMkt, y = playingtimeMarket,
by = "PlayerYearComp_id", suffix = c("", ".REMOVEDUPLICATE"))
# Removing Duplicate rows
offense_stats <- offense_stats %>%
distinct(PlayerYearComp_id, .keep_all = TRUE)
# Removing unnecessary DFs
rm(StdShootGCAPassMkt, StdShootGCAMkt, playingtimeMarket, passingtypesMarket,
passingMarket, PassMkt, standardMarket, shootingMarket, gcaMarket,
StdShootMkt)
# defense
# Joining the defense and misc tables (not the passing table) with market values
defenseMarket <- inner_join(x = defense, y = marketvalues, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
miscMarket <- inner_join(x = misc, y = marketvalues, by = "PlayerYearComp_id",
suffix = c("", ".REMOVEDUPLICATE"))
Removing Repeated Variables
Here we drop the duplicate variables (columns) that were flagged with the REMOVEDUPLICATE suffix during the joins.
offense_stats <- offense_stats %>%
select(-contains("REMOVEDUPLICATE"))
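To make the suffix mechanism concrete, here is a small sketch on invented toy tables (the column values are made up). It shows that a column present in both inputs keeps its name on the left side, gets .REMOVEDUPLICATE on the right side, and disappears after the select() step.

```r
library(dplyr)

# Two toy tables that share the key "id" and an overlapping column "Squad".
left_toy  <- tibble(id = c("a", "b"), Squad = c("X", "Y"), Gls = c(3, 1))
right_toy <- tibble(id = c("a", "b"), Squad = c("X FC", "Y FC"), value = c(10, 20))

joined <- inner_join(left_toy, right_toy, by = "id",
                     suffix = c("", ".REMOVEDUPLICATE"))
print(names(joined))

# Drop every column carrying the marker suffix.
cleaned <- joined %>%
  select(-contains("REMOVEDUPLICATE"))
print(names(cleaned))
```

Note that this keeps the left table's version of each overlapping column, so the order of the tables in the join decides which copy survives.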
Here we are separating the position variable into primary_position and secondary_position. If a secondary position isn’t listed, NA will be returned.
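offense_stats <- offense_stats %>%
separate(Pos, c("primary_position", "secondary_position"), sep = ",", remove = FALSE,
fill = "right")  # fill = "right" gives NA, without a warning, when only one position is listed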
Creating a separate data set for each position
Now let’s create separate data sets for each position (excluding goal keepers)
# Filtering for only Forwards
forwards <- offense_stats %>%
filter(primary_position == "FW")
# Converting all character variables to factor variables
forwards <- as.data.frame(unclass(forwards), stringsAsFactors = TRUE)
midfielders <- offense_stats %>%
filter(primary_position == "MF")
midfielders <- as.data.frame(unclass(midfielders), stringsAsFactors = TRUE)
Do we want to do any more filtering here? Do we want to filter on minutes played? Let’s focus on the forwards table and take a look at our data. The maximum minutes played is 3,420 and the minimum is 1. The median is about 985 minutes and the 1st quartile is about 266 minutes. Should players with very few minutes be filtered out? If so, what is the threshold for minutes played in an entire season to be included in this data set?
After talking it over with the group, we decided to set the minimum at 20 minutes per match played. A season total of roughly 300-350 minutes was also noted as a possible cutoff.
summary(forwards$Min_Playing)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 266.2 984.5 1142.1 1846.5 3420.0
glimpse(offense_stats)
## Rows: 5,466
## Columns: 156
## $ Season_End_Year <int> 2018, 2018, 2018, 2018…
## $ Squad <chr> "Amiens", "Amiens", "A…
## $ Comp <chr> "Ligue 1", "Ligue 1", …
## $ Player <chr> "Khaled Adénon", "Dani…
## $ Nation <chr> "BEN", "BRA", "FRA", "…
## $ Pos <chr> "DF", "DF,FW", "MF", "…
## $ primary_position <chr> "DF", "DF", "MF", "DF"…
## $ secondary_position <chr> NA, "FW", NA, "MF", NA…
## $ Age <chr> "32", "28", "33", "34"…
## $ Born <dbl> 1985, 1989, 1984, 1982…
## $ MP_Playing <dbl> 34, 21, 1, 13, 1, 17, …
## $ Starts_Playing <dbl> 32, 20, 0, 3, 1, 4, 1,…
## $ Min_Playing <dbl> 2921, 1727, 7, 397, 90…
## $ Mins_Per_90_Playing <dbl> 32.5, 19.2, 0.1, 4.4, …
## $ Gls <dbl> 0, 1, 0, 1, 0, 0, 0, 0…
## $ Ast <dbl> 1, 0, 0, 0, 0, 1, 0, 0…
## $ G_minus_PK <dbl> 0, 1, 0, 1, 0, 0, 0, 0…
## $ PK <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ PKatt <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ CrdY <dbl> 9, 2, 0, 2, 0, 0, 0, 3…
## $ CrdR <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Gls_Per <dbl> 0.00, 0.05, 0.00, 0.23…
## $ Ast_Per <dbl> 0.03, 0.00, 0.00, 0.00…
## $ `G+A_Per` <dbl> 0.03, 0.05, 0.00, 0.23…
## $ G_minus_PK_Per <dbl> 0.00, 0.05, 0.00, 0.23…
## $ `G+A_minus_PK_Per` <dbl> 0.03, 0.05, 0.00, 0.23…
## $ xG_Expected <dbl> 0.5, 0.4, 0.0, 0.9, 0.…
## $ npxG_Expected <dbl> 0.5, 0.4, 0.0, 0.9, 0.…
## $ xA_Expected <dbl> 0.3, 1.9, 0.0, 0.4, 0.…
## $ `npxG+xA_Expected` <dbl> 0.8, 2.4, 0.0, 1.4, 0.…
## $ xG_Per <dbl> 0.02, 0.02, 0.00, 0.21…
## $ xA_Per <dbl> 0.01, 0.10, 0.00, 0.10…
## $ `xG+xA_Per` <dbl> 0.03, 0.12, 0.00, 0.31…
## $ npxG_Per <dbl> 0.02, 0.02, 0.00, 0.21…
## $ `npxG+xA_Per` <dbl> 0.03, 0.12, 0.00, 0.31…
## $ Url <chr> "https://fbref.com/en/…
## $ PlayerYearComp_id <chr> "Khaled Adénon 2018 Li…
## $ comp_name <chr> "Ligue 1", "Ligue 1", …
## $ region <chr> "Europe", "Europe", "E…
## $ country <chr> "France", "France", "F…
## $ season_start_year <int> 2017, 2017, 2017, 2017…
## $ squad <chr> "Amiens SC", "Amiens S…
## $ player_num <chr> "3", "-", "20", "24", …
## $ player_name <chr> "Khaled Adénon", "Dani…
## $ player_position <chr> "Centre-Back", "Left-B…
## $ player_dob <date> 1985-07-29, 1989-06-0…
## $ player_age <dbl> 31, 28, 33, 34, 33, 29…
## $ player_nationality <chr> "Benin", "Brazil", "Fr…
## $ current_club <chr> "Doxa Katokopias", "Wi…
## $ player_height_mtrs <chr> "1.8", "1.85", "1.83",…
## $ player_foot <chr> "right", "left", "left…
## $ date_joined <chr> "2015-07-01", "2017-08…
## $ joined_from <chr> "Vendée Luçon Football…
## $ contract_expiry <chr> NA, NA, NA, NA, NA, NA…
## $ player_market_value_euro <dbl> 500000, 1500000, 20000…
## $ player_url <chr> "https://www.transferm…
## $ Mins_Per_90 <dbl> 32.5, 19.2, 0.1, 4.4, …
## $ Gls_Standard <dbl> 0, 1, 0, 1, 0, 0, 0, 0…
## $ Sh_Standard <dbl> 5, 8, 0, 8, 0, 11, 1, …
## $ SoT_Standard <dbl> 1, 1, 0, 1, 0, 1, 0, 1…
## $ SoT_percent_Standard <dbl> 20.0, 12.5, NA, 12.5, …
## $ Sh_per_90_Standard <dbl> 0.15, 0.42, 0.00, 1.81…
## $ SoT_per_90_Standard <dbl> 0.03, 0.05, 0.00, 0.23…
## $ G_per_Sh_Standard <dbl> 0.00, 0.13, NA, 0.13, …
## $ G_per_SoT_Standard <dbl> 0.00, 1.00, NA, 1.00, …
## $ Dist_Standard <dbl> 6.3, 17.7, NA, 14.6, N…
## $ FK_Standard <dbl> 0, 0, 0, 0, 0, 2, 0, 0…
## $ PK_Standard <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ PKatt_Standard <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ npxG_per_Sh_Expected <dbl> 0.10, 0.06, NA, 0.11, …
## $ G_minus_xG_Expected <dbl> -0.5, 0.6, 0.0, 0.1, 0…
## $ `np:G_minus_xG_Expected` <dbl> -0.5, 0.6, 0.0, 0.1, 0…
## $ SCA_SCA <dbl> 7, 28, 0, 8, 0, 28, 2,…
## $ SCA90_SCA <dbl> 0.22, 1.46, 0.00, 1.82…
## $ PassLive_SCA <dbl> 4, 23, 0, 8, 0, 13, 2,…
## $ PassDead_SCA <dbl> 1, 3, 0, 0, 0, 10, 0, …
## $ Drib_SCA <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Sh_SCA <dbl> 0, 0, 0, 0, 0, 1, 0, 0…
## $ Fld_SCA <dbl> 1, 2, 0, 0, 0, 3, 0, 1…
## $ Def_SCA <dbl> 1, 0, 0, 0, 0, 1, 0, 0…
## $ GCA_GCA <dbl> 1, 4, 0, 0, 0, 2, 0, 2…
## $ GCA90_GCA <dbl> 0.03, 0.21, 0.00, 0.00…
## $ PassLive_GCA <dbl> 1, 4, 0, 0, 0, 0, 0, 2…
## $ PassDead_GCA <dbl> 0, 0, 0, 0, 0, 1, 0, 0…
## $ Drib_GCA <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Sh_GCA <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Fld_GCA <dbl> 0, 0, 0, 0, 0, 1, 0, 0…
## $ Def_GCA <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Cmp_Total <dbl> 958, 535, 2, 216, 7, 1…
## $ Att_Total <dbl> 1168, 840, 5, 275, 20,…
## $ Cmp_percent_Total <dbl> 82.0, 63.7, 40.0, 78.5…
## $ TotDist_Total <dbl> 21066, 10542, 24, 4038…
## $ PrgDist_Total <dbl> 7904, 6072, 0, 1548, 1…
## $ Cmp_Short <dbl> 254, 207, 1, 93, 0, 81…
## $ Att_Short <dbl> 295, 245, 3, 108, 0, 1…
## $ Cmp_percent_Short <dbl> 86.1, 84.5, 33.3, 86.1…
## $ Cmp_Medium <dbl> 503, 228, 1, 95, 4, 73…
## $ Att_Medium <dbl> 565, 337, 1, 112, 4, 1…
## $ Cmp_percent_Medium <dbl> 89.0, 67.7, 100.0, 84.…
## $ Cmp_Long <dbl> 189, 89, 0, 26, 3, 34,…
## $ Att_Long <dbl> 278, 225, 0, 46, 16, 7…
## $ Cmp_percent_Long <dbl> 68.0, 39.6, NA, 56.5, …
## $ xA <dbl> 0.3, 1.9, 0.0, 0.4, 0.…
## $ A_minus_xA <dbl> 0.7, -1.9, 0.0, -0.4, …
## $ KP <dbl> 3, 14, 0, 5, 0, 18, 1,…
## $ Final_Third <dbl> 39, 56, 0, 24, 1, 27, …
## $ PPA <dbl> 1, 18, 0, 4, 0, 6, 0, …
## $ CrsPA <dbl> 0, 9, 0, 0, 0, 2, 0, 9…
## $ Prog <dbl> 44, 88, 0, 26, 0, 32, …
## $ Att <dbl> 1168, 840, 5, 275, 20,…
## $ Live_Pass <dbl> 1124, 696, 5, 265, 10,…
## $ Dead_Pass <dbl> 44, 144, 0, 10, 10, 35…
## $ FK_Pass <dbl> 43, 15, 0, 10, 2, 9, 1…
## $ TB_Pass <dbl> 1, 0, 0, 0, 0, 2, 0, 0…
## $ Press_Pass <dbl> 141, 123, 1, 25, 0, 65…
## $ Sw_Pass <dbl> 28, 17, 0, 9, 0, 16, 2…
## $ Crs_Pass <dbl> 1, 58, 0, 1, 0, 12, 2,…
## $ CK_Pass <dbl> 0, 0, 0, 0, 0, 21, 0, …
## $ In_Corner <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Out_Corner <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Str_Corner <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ground_Height <dbl> 817, 394, 3, 193, 3, 1…
## $ Low_Height <dbl> 94, 68, 1, 20, 1, 28, …
## $ High_Height <dbl> 257, 378, 1, 62, 16, 1…
## $ Left_Body <dbl> 220, 545, 4, 40, 0, 19…
## $ Right_Body <dbl> 792, 50, 0, 211, 16, 2…
## $ Head_Body <dbl> 137, 90, 1, 22, 0, 19,…
## $ TI_Body <dbl> 1, 129, 0, 0, 0, 4, 0,…
## $ Other_Body <dbl> 3, 3, 0, 0, 2, 4, 0, 3…
## $ Cmp_Outcomes <dbl> 958, 535, 2, 216, 7, 1…
## $ Off_Outcomes <dbl> 3, 4, 0, 2, 0, 1, 0, 3…
## $ Out_Outcomes <dbl> 35, 20, 0, 9, 1, 9, 1,…
## $ Int_Outcomes <dbl> 14, 11, 1, 4, 0, 8, 0,…
## $ Blocks_Outcomes <dbl> 16, 22, 1, 8, 0, 12, 1…
## $ MP_Playing.Time <dbl> 34, 21, 1, 13, 1, 17, …
## $ Min_Playing.Time <dbl> 2921, 1727, 7, 397, 90…
## $ Mn_per_MP_Playing.Time <dbl> 86, 82, 7, 31, 90, 32,…
## $ Min_percent_Playing.Time <dbl> 85.4, 50.5, 0.2, 11.6,…
## $ Mins_Per_90_Playing.Time <dbl> 32.5, 19.2, 0.1, 4.4, …
## $ Starts_Starts <dbl> 32, 20, 0, 3, 1, 4, 1,…
## $ Mn_per_Start_Starts <dbl> NA, NA, NA, NA, NA, NA…
## $ Compl_Starts <dbl> 32, 18, 0, 2, 1, 0, 0,…
## $ Subs_Subs <dbl> 2, 1, 1, 10, 0, 13, 2,…
## $ Mn_per_Sub_Subs <dbl> NA, NA, NA, NA, NA, NA…
## $ unSub_Subs <dbl> 2, 4, 6, 0, 36, 15, 2,…
## $ PPM_Team.Success <dbl> 1.26, 1.24, 1.00, 0.69…
## $ onG_Team.Success <dbl> 32, 23, 0, 3, 0, 4, 0,…
## $ onGA_Team.Success <dbl> 37, 21, 0, 8, 1, 10, 3…
## $ plus_per__minus__Team.Success <dbl> -5, 2, 0, -5, -1, -6, …
## $ plus_per__minus_90_Team.Success <dbl> -0.15, 0.10, 0.00, -1.…
## $ On_minus_Off_Team.Success <dbl> -0.15, 0.48, 0.13, -1.…
## $ onxG_Team.Success..xG. <dbl> 29.0, 20.1, 0.1, 3.3, …
## $ onxGA_Team.Success..xG <dbl> 50.2, 25.8, 0.1, 10.5,…
## $ xGplus_per__minus__Team.Success..xG <dbl> -21.2, -5.7, 0.0, -7.2…
## $ xGplus_per__minus_90_Team.Success..xG <dbl> -0.65, -0.30, 0.12, -1…
## $ On_minus_Off_Team.Success..xG <dbl> -0.85, 0.48, 0.65, -1.…
# Keeping only the players who average more than 20 minutes per match played
offense_clean <- offense_stats %>%
filter(Mn_per_MP_Playing.Time > 20)
# We go from 5,466 players to 4,993
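The two candidate rules discussed above (minutes per match played versus total season minutes) do not keep the same players. The sketch below uses four invented players to show the difference; the column names are hypothetical, and it assumes minutes per match is simply total minutes divided by matches played.

```r
# Toy data (invented): total minutes and matches played for four players.
toy <- data.frame(
  player  = c("A", "B", "C", "D"),
  matches = c(30, 2, 20, 10),
  minutes = c(2700, 180, 300, 150),
  stringsAsFactors = FALSE
)

# Assumed definition: average minutes per match played.
toy$min_per_match <- toy$minutes / toy$matches

# Rule used in the filter above: more than 20 minutes per match played.
keep_per_match <- toy$min_per_match > 20

# Alternative rule: at least 350 total minutes in the season.
keep_total <- toy$minutes >= 350

print(toy$player[keep_per_match])
print(toy$player[keep_total])
```

Player B plays full matches but appears only twice, so the per-match rule keeps that player while the season-total rule drops them. Combining both conditions is one option if very small samples are a concern.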
Creating a new clean data set to store only the variables we care about and are interested in testing
offense_clean2 <- offense_clean %>%
select(PlayerYearComp_id, Player, Squad, Comp, Season_End_Year, primary_position,
player_position, Age, Min_Playing, G_minus_PK, "G+A_minus_PK_Per",
Ast, xG_Expected, npxG_Expected, xA_Expected, "npxG+xA_Expected",
xG_Per, xA, xA_Per, "xG+xA_Per", npxG_Per, "npxG+xA_Per", player_height_mtrs,
joined_from, player_market_value_euro, Gls, Sh_per_90_Standard,
G_per_Sh_Standard, Dist_Standard, npxG_per_Sh_Expected)
We want to know which of G_minus_PK, ‘G+A_minus_PK_Per’, xG_Expected, npxG_Expected, xA_Expected, ‘npxG+xA_Expected’, xG_Per, xA_Per, ‘xG+xA_Per’, npxG_Per, and ‘npxG+xA_Per’ is most strongly associated with market value. Because all of these variables are correlated with one another, we run a separate linear regression of market value on each variable and compare the R-squared values.
CODE START
Variables with a + sign in their names need to be renamed or wrapped in backticks inside a formula. From these quick regressions, G_minus_PK has the highest R-squared among the single statistics (.275); only the combined npxG+xA_Expected is higher (.294).
library(sjPlot)  # provides tab_model()
# Non-penalty goals, expected assists, and non-penalty xG per shot together
mod2 <- lm(player_market_value_euro ~ G_minus_PK + xA + npxG_per_Sh_Expected, data = offense_clean2)
tab_model(mod2)  # .321
# Non-penalty xG per shot increased our R-squared to .33, but the variable was not significant
mod1 <- lm(player_market_value_euro ~ G_minus_PK, data = offense_clean2)
tab_model(mod1)  # .275
mod1 <- lm(player_market_value_euro ~ `G+A_minus_PK_Per`, data = offense_clean2)
tab_model(mod1)  # .202
mod1 <- lm(player_market_value_euro ~ Gls, data = offense_clean2)
tab_model(mod1)  # .265
# Goals minus PK is a better indicator than goals. We can throw goals out
mod1 <- lm(player_market_value_euro ~ npxG_Expected, data = offense_clean2)
tab_model(mod1)  # .244
# Non-penalty xG is actually a better indicator than xG by itself
mod1 <- lm(player_market_value_euro ~ xG_Expected, data = offense_clean2)
tab_model(mod1)  # .233
# Goals minus PK is better than xG
mod1 <- lm(player_market_value_euro ~ `npxG+xA_Expected`, data = offense_clean2)
tab_model(mod1)  # .294
mod1 <- lm(player_market_value_euro ~ xA, data = offense_clean2)
tab_model(mod1)  # .224
mod1 <- lm(player_market_value_euro ~ Ast, data = offense_clean2)
tab_model(mod1)  # .214
mod1 <- lm(player_market_value_euro ~ `xG+xA_Per`, data = offense_clean2)
tab_model(mod1)  # .168
mod1 <- lm(player_market_value_euro ~ `npxG+xA_Per`, data = offense_clean2)
tab_model(mod1)  # .165
mod1 <- lm(player_market_value_euro ~ xG_Per, data = offense_clean2)
tab_model(mod1)  # .128
mod1 <- lm(player_market_value_euro ~ xA_Per, data = offense_clean2)
tab_model(mod1)  # .127
# A model on plain xG was also listed here, but offense_clean2 has no xG column;
# xG_Expected (fitted above) is the xG variable that is available
mod1 <- lm(player_market_value_euro ~ G_per_Sh_Standard, data = offense_clean2)
tab_model(mod1)  # .029
mod1 <- lm(player_market_value_euro ~ npxG_per_Sh_Expected, data = offense_clean2)
tab_model(mod1)  # .024
mod2 <- lm(player_market_value_euro ~ G_minus_PK + xA, data = offense_clean2)
tab_model(mod2)  # .32
CODE END
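The one-predictor-at-a-time comparison above can also be written as a loop, which avoids retyping each model and makes the ranking explicit. This is only an illustrative sketch: the toy data frame and its values are invented, and the helper name r2_for is hypothetical. Each predictor is wrapped in backticks so that names containing a + sign are treated as a single variable.

```r
# Toy data (invented values); check.names = FALSE keeps the "+" in the column name.
toy <- data.frame(
  player_market_value_euro = c(1, 2, 3, 4, 5, 6),
  G_minus_PK = c(1, 2, 3, 4, 5, 7),
  `xG+xA_Per` = c(2, 1, 4, 3, 6, 5),
  check.names = FALSE
)

predictors <- c("G_minus_PK", "xG+xA_Per")

# Hypothetical helper: regress market value on one predictor, return R-squared.
r2_for <- function(var, data) {
  f <- reformulate(sprintf("`%s`", var), response = "player_market_value_euro")
  summary(lm(f, data = data))$r.squared
}

# Named vector of R-squared values, best predictor first.
r2 <- sort(sapply(predictors, r2_for, data = toy), decreasing = TRUE)
print(round(r2, 3))
```

On the real data, predictors would hold the column names of offense_clean2 being compared, and data would be offense_clean2.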
Deliverable
The deliverable will be an HTML file (website link) where we will publish our findings. We can include as much text, code, output, and charts as we want. Here is a draft of how I plan on introducing our topic.
First, we find out who is being paid the most and why. What do teams pay for? Is it performance? Age? Both? Why are player market values the way they are?
We are going to create a model for predicting the market value of a player based on previous market prices and performance statistics. From this, we can identify players who are performing extremely well but are under-paid and therefore ‘undervalued’. Our goal is to put the best 11 players on the field for the lowest amount of money. We can do this by identifying the most undervalued player at each position.
Predicted market price - actual market price = Value
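The value formula above can be computed directly from a fitted model. The following is a minimal sketch on invented numbers, using a single predictor for brevity; the column names predicted and value_gap are hypothetical, and the project's final model would use its own set of predictors.

```r
# Toy data (invented): goals and market value (in millions) for five players.
toy <- data.frame(
  player = c("A", "B", "C", "D", "E"),
  G_minus_PK = c(2, 5, 8, 10, 15),
  player_market_value_euro = c(1, 4, 5, 12, 11),
  stringsAsFactors = FALSE
)

fit <- lm(player_market_value_euro ~ G_minus_PK, data = toy)

# Value = predicted market price - actual market price.
toy$predicted <- predict(fit, newdata = toy)
toy$value_gap <- toy$predicted - toy$player_market_value_euro

# Largest positive gap first: the model expects a higher price than the market asks.
ranked <- toy[order(toy$value_gap, decreasing = TRUE), ]
print(ranked[, c("player", "value_gap")])
```

A positive gap means the player looks undervalued under the model, and a negative gap means the market price exceeds what the performance statistics would predict. Ranking within each position would follow the same pattern on position-specific subsets.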
Next Steps:
Interpret the regressions
Make one final model with the highest performance
Work on visuals
Create HTML file to outline data?
OLD NOTES
Standard Table: Should we create separate tables for each league, with squad as one of our variables, or just include all leagues in one table?
Do we want to use regular stats or non penalty stats?
In my model, I use non-penalty scores because penalty kicks are dependent on the other team. How do we feel about this?
We can either use goals and assists as separate variables, or we can use goals+assists as one predictor variable. To answer this question, I will create two models and see which one is a better predictor of +/-.
Should we use variables on a per 90 minute basis?
I decided to go with per 90 minute variables. To normalize our players, let’s use all variables on a per 90 minute basis. It will be extremely important to trim players that don’t meet the minutes threshold. To qualify as a leader on FBref, a player needs to play 30 minutes per squad game. We are going to use a threshold of 20 minutes per match played.
Do we want to use xG and xA or G and A? Which is a better predictor of +/-, and which is a better predictor of value on the transfer market? Breakout players are the ones performing above expected.
New Feature G - xG = goals over expected
There are other variables such as xGplus_per_minus_90: expected goals scored minus expected goals allowed, per 90 minutes
We also have xG expected
Squad -Nation -Player
Age -Born
Mins_Per_90_Playing -Min_Playing