Introduction

A professional development soccer team’s goal is to find the best players around the world at the lowest cost. Our goal is to identify undervalued players: players who perform at a very high level but carry a low market price. We will do this by building a model to see which aspects of performance are correlated with market price. Is it goals? Assists? Both? Is there a way for us to create an efficiency metric?

Loading the Datasets

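Season statistics are taken from FBref.com for the seasons ending in 2018 through 2020. Market values (Transfermarkt data) are retrieved with get_player_market_values() for the seasons starting in 2017 through 2019, which are the same three seasons.

The statistics cover the Big 5 leagues. The table types included are standard, shooting, passing, passing types, goal creation, defense, possession, playing time, miscellaneous, and the two goalkeeper tables.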


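# Packages used throughout this analysis
library(worldfootballR)  # fb_big5_advanced_season_stats(), get_player_market_values()
library(dplyr)           # mutate(), joins, filter(), select(), glimpse()
library(tidyr)           # separate()
library(sjPlot)          # tab_model()

standard <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
    stat_type = "standard", team_or_player = "player")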

shooting <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
    stat_type = "shooting", team_or_player = "player")

passing <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
    stat_type = "passing", team_or_player = "player")

passingtypes <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
    stat_type = "passing_types", team_or_player = "player")

gca <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020), stat_type = "gca",
    team_or_player = "player")

defense <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
    stat_type = "defense", team_or_player = "player")
possession <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
    stat_type = "possession", team_or_player = "player")
misc <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020), stat_type = "misc",
    team_or_player = "player")
playingtime <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
    stat_type = "playing_time", team_or_player = "player")
keepers <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
    stat_type = "keepers", team_or_player = "player")
keepers_adv <- fb_big5_advanced_season_stats(season_end_year = c(2018:2020),
    stat_type = "keepers_adv", team_or_player = "player")

Retrieving the market values

market_values17 <- get_player_market_values(country_name = c("England",
    "Spain", "France", "Italy", "Germany"), start_year = 2017)

market_values18 <- get_player_market_values(country_name = c("England",
    "Spain", "France", "Italy", "Germany"), start_year = 2018)

market_values19 <- get_player_market_values(country_name = c("England",
    "Spain", "France", "Italy", "Germany"), start_year = 2019)

marketvalues <- bind_rows(market_values17, market_values18, market_values19)

Joining the data sets

Creating a unique identifier

Here we add a unique identifier to each table so that the tables can be merged on a single key. We combine player name, season end year, and league competition to form PlayerYearComp_id. The tables are then combined into one large data set containing all of our variables. For now we are excluding goalkeepers.


marketvalues <- marketvalues %>%
    mutate(Season_End_Year = season_start_year + 1, PlayerYearComp_id = paste(player_name,
        Season_End_Year, comp_name))

standard <- standard %>%
    mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))

shooting <- shooting %>%
    mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))

passing <- passing %>%
    mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))

passingtypes <- passingtypes %>%
    mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))

gca <- gca %>%
    mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))

defense <- defense %>%
    mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))

possession <- possession %>%
    mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))

playingtime <- playingtime %>%
    mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))

misc <- misc %>%
    mutate(PlayerYearComp_id = paste(Player, Season_End_Year, Comp))
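The repeated mutate() calls above all build the same key, so they could be collapsed into one small helper. The following is a minimal sketch on made-up toy data, not the code used for this analysis; add_player_key() and the toy values are hypothetical names introduced only for illustration, and the sketch uses base R.

```r
# Hypothetical helper: builds the same "Player Year Comp" key as the mutate() calls above.
add_player_key <- function(df, player = "Player", year = "Season_End_Year", comp = "Comp") {
  df$PlayerYearComp_id <- paste(df[[player]], df[[year]], df[[comp]])
  df
}

# Toy data, made up for illustration only
toy_stats <- data.frame(
  Player = c("Player A", "Player B"),
  Season_End_Year = c(2018L, 2019L),
  Comp = c("Ligue 1", "La Liga"),
  stringsAsFactors = FALSE
)

toy_stats <- add_player_key(toy_stats)
toy_stats$PlayerYearComp_id

# The key is only as unique as its parts, so it is worth checking for repeats
any(duplicated(toy_stats$PlayerYearComp_id))
```

For the market values table, the same helper would presumably be called with player = "player_name" and comp = "comp_name" once Season_End_Year has been added. If a key can repeat (for example, a player listed for two squads in the same league season), the joins that follow will multiply rows, which is one reason a distinct() step is needed later.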

Joining Data Sets

Now that we have a unique identifier, let’s join our statistics tables with our market values table. We join each statistics table to the market values separately and then join the results together. Because several columns appear in more than one table, we pass a suffix to each join: the first copy of a duplicated column keeps its name (suffix ""), and the second copy gets .REMOVEDUPLICATE appended. The suffixed columns are removed afterwards.


# Offense

## Standard, Shooting, and GCA
standardMarket <- inner_join(x = standard, y = marketvalues, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

shootingMarket <- inner_join(x = shooting, y = marketvalues, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

gcaMarket <- inner_join(x = gca, y = marketvalues, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

StdShootMkt <- left_join(x = standardMarket, y = shootingMarket, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

StdShootGCAMkt <- inner_join(x = StdShootMkt, y = gcaMarket, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

StdShootGCAMkt <- StdShootGCAMkt %>%
    distinct(PlayerYearComp_id, .keep_all = TRUE)

## Passing and Market Values
passingMarket <- inner_join(x = passing, y = marketvalues, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

# Joining Passing Types and Market Values
passingtypesMarket <- inner_join(x = passingtypes, y = marketvalues, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

# Joining Passing, Passing Types, and Market
PassMkt <- inner_join(x = passingMarket, y = passingtypesMarket, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

# Joining Standard, Shooting, GCA, Passing, Market
StdShootGCAPassMkt <- inner_join(x = StdShootGCAMkt, y = PassMkt, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

# Joining Playing time and Market
playingtimeMarket <- inner_join(x = playingtime, y = marketvalues, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

# Joining Standard, Shooting, GCA, Passing, Playing Time, and Market
# Storing this as offense
offense_stats <- inner_join(x = StdShootGCAPassMkt, y = playingtimeMarket,
    by = "PlayerYearComp_id", suffix = c("", ".REMOVEDUPLICATE"))

# Removing Duplicate rows
offense_stats <- offense_stats %>%
    distinct(PlayerYearComp_id, .keep_all = TRUE)

# Removing unnecessary DFs
rm(StdShootGCAPassMkt, StdShootGCAMkt, playingtimeMarket, passingtypesMarket,
    passingMarket, PassMkt, standardMarket, shootingMarket, gcaMarket,
    StdShootMkt)

# defense
# (joining the defense and misc tables, not the passing table, with market values)
defenseMarket <- inner_join(x = defense, y = marketvalues, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

miscMarket <- inner_join(x = misc, y = marketvalues, by = "PlayerYearComp_id",
    suffix = c("", ".REMOVEDUPLICATE"))

Removing Repeated Variables

Here we remove the duplicate variables (columns) that were tagged with REMOVEDUPLICATE during the joins.

offense_stats <- offense_stats %>%
    select(-contains("REMOVEDUPLICATE"))
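To make the suffix and select(-contains(...)) steps concrete, here is a small sketch on made-up tables. The tibbles stats and values and their contents are invented for illustration; only the dplyr calls mirror the ones used above.

```r
library(dplyr)

# Toy tables (made up); both carry a non-key column called "Squad"
stats <- tibble(
  PlayerYearComp_id = c("A 2018 Ligue 1", "B 2018 Ligue 1"),
  Squad = c("Amiens", "Amiens"),
  Gls = c(0, 1)
)
values <- tibble(
  PlayerYearComp_id = c("A 2018 Ligue 1", "B 2018 Ligue 1"),
  Squad = c("Amiens SC", "Amiens SC"),
  player_market_value_euro = c(500000, 1500000)
)

# The x copy of a duplicated column keeps its name; the y copy gets the suffix
joined <- inner_join(stats, values, by = "PlayerYearComp_id",
                     suffix = c("", ".REMOVEDUPLICATE"))
names(joined)

# Dropping every suffixed column leaves one copy of each variable
cleaned <- joined %>%
  select(-contains("REMOVEDUPLICATE"))
names(cleaned)
```

Note that this keeps whichever copy came from the left-hand table, so the order of the join arguments decides which version of a duplicated column survives.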

Here we are separating the position variable into primary_position and secondary_position. If a secondary position isn’t listed, NA will be returned.


offense_stats <- offense_stats %>%
    separate(Pos, c("primary_position", "secondary_position"), ",", remove = FALSE)
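The position split can be checked on a few made-up values. This is only a sketch: toy_pos is invented, and fill = "right" is an optional argument not used in the call above; it pads a missing secondary position with NA explicitly, which should avoid the warning separate() otherwise gives for rows with a single position.

```r
library(tidyr)

# Toy positions (made up) in the same "primary,secondary" format as Pos
toy_pos <- data.frame(Pos = c("DF", "DF,FW", "MF"), stringsAsFactors = FALSE)

split_pos <- separate(toy_pos, Pos, c("primary_position", "secondary_position"),
                      sep = ",", remove = FALSE, fill = "right")
split_pos
```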

Creating a separate data set for each position

Now let’s create separate data sets for each position (excluding goal keepers)


# Filtering for only Forwards
forwards <- offense_stats %>%
    filter(primary_position == "FW")

# Converting all character variables to factor variables
forwards <- as.data.frame(unclass(forwards), stringsAsFactors = TRUE)

midfielders <- offense_stats %>%
    filter(primary_position == "MF")
midfielders <- as.data.frame(unclass(midfielders), stringsAsFactors = TRUE)

Do we want to do any more filtering here? Do we want to filter on minutes played, for example? Focusing on the forwards table, the maximum is 3,420 minutes played and the minimum is 1 minute, with a median of about 985 minutes and a 1st quartile of about 266 minutes. Should players with very few minutes be filtered out? If so, what is the threshold for minutes played in an entire season to be included in this data set?

After talking it over with the group, we decided to set the minimum playing time at an average of 20 minutes per match played.

300-350

summary(forwards$Min_Playing)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   266.2   984.5  1142.1  1846.5  3420.0

glimpse(offense_stats)
## Rows: 5,466
## Columns: 156
## $ Season_End_Year                       <int> 2018, 2018, 2018, 2018…
## $ Squad                                 <chr> "Amiens", "Amiens", "A…
## $ Comp                                  <chr> "Ligue 1", "Ligue 1", …
## $ Player                                <chr> "Khaled Adénon", "Dani…
## $ Nation                                <chr> "BEN", "BRA", "FRA", "…
## $ Pos                                   <chr> "DF", "DF,FW", "MF", "…
## $ primary_position                      <chr> "DF", "DF", "MF", "DF"…
## $ secondary_position                    <chr> NA, "FW", NA, "MF", NA…
## $ Age                                   <chr> "32", "28", "33", "34"…
## $ Born                                  <dbl> 1985, 1989, 1984, 1982…
## $ MP_Playing                            <dbl> 34, 21, 1, 13, 1, 17, …
## $ Starts_Playing                        <dbl> 32, 20, 0, 3, 1, 4, 1,…
## $ Min_Playing                           <dbl> 2921, 1727, 7, 397, 90…
## $ Mins_Per_90_Playing                   <dbl> 32.5, 19.2, 0.1, 4.4, …
## $ Gls                                   <dbl> 0, 1, 0, 1, 0, 0, 0, 0…
## $ Ast                                   <dbl> 1, 0, 0, 0, 0, 1, 0, 0…
## $ G_minus_PK                            <dbl> 0, 1, 0, 1, 0, 0, 0, 0…
## $ PK                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ PKatt                                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ CrdY                                  <dbl> 9, 2, 0, 2, 0, 0, 0, 3…
## $ CrdR                                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Gls_Per                               <dbl> 0.00, 0.05, 0.00, 0.23…
## $ Ast_Per                               <dbl> 0.03, 0.00, 0.00, 0.00…
## $ `G+A_Per`                             <dbl> 0.03, 0.05, 0.00, 0.23…
## $ G_minus_PK_Per                        <dbl> 0.00, 0.05, 0.00, 0.23…
## $ `G+A_minus_PK_Per`                    <dbl> 0.03, 0.05, 0.00, 0.23…
## $ xG_Expected                           <dbl> 0.5, 0.4, 0.0, 0.9, 0.…
## $ npxG_Expected                         <dbl> 0.5, 0.4, 0.0, 0.9, 0.…
## $ xA_Expected                           <dbl> 0.3, 1.9, 0.0, 0.4, 0.…
## $ `npxG+xA_Expected`                    <dbl> 0.8, 2.4, 0.0, 1.4, 0.…
## $ xG_Per                                <dbl> 0.02, 0.02, 0.00, 0.21…
## $ xA_Per                                <dbl> 0.01, 0.10, 0.00, 0.10…
## $ `xG+xA_Per`                           <dbl> 0.03, 0.12, 0.00, 0.31…
## $ npxG_Per                              <dbl> 0.02, 0.02, 0.00, 0.21…
## $ `npxG+xA_Per`                         <dbl> 0.03, 0.12, 0.00, 0.31…
## $ Url                                   <chr> "https://fbref.com/en/…
## $ PlayerYearComp_id                     <chr> "Khaled Adénon 2018 Li…
## $ comp_name                             <chr> "Ligue 1", "Ligue 1", …
## $ region                                <chr> "Europe", "Europe", "E…
## $ country                               <chr> "France", "France", "F…
## $ season_start_year                     <int> 2017, 2017, 2017, 2017…
## $ squad                                 <chr> "Amiens SC", "Amiens S…
## $ player_num                            <chr> "3", "-", "20", "24", …
## $ player_name                           <chr> "Khaled Adénon", "Dani…
## $ player_position                       <chr> "Centre-Back", "Left-B…
## $ player_dob                            <date> 1985-07-29, 1989-06-0…
## $ player_age                            <dbl> 31, 28, 33, 34, 33, 29…
## $ player_nationality                    <chr> "Benin", "Brazil", "Fr…
## $ current_club                          <chr> "Doxa Katokopias", "Wi…
## $ player_height_mtrs                    <chr> "1.8", "1.85", "1.83",…
## $ player_foot                           <chr> "right", "left", "left…
## $ date_joined                           <chr> "2015-07-01", "2017-08…
## $ joined_from                           <chr> "Vendée Luçon Football…
## $ contract_expiry                       <chr> NA, NA, NA, NA, NA, NA…
## $ player_market_value_euro              <dbl> 500000, 1500000, 20000…
## $ player_url                            <chr> "https://www.transferm…
## $ Mins_Per_90                           <dbl> 32.5, 19.2, 0.1, 4.4, …
## $ Gls_Standard                          <dbl> 0, 1, 0, 1, 0, 0, 0, 0…
## $ Sh_Standard                           <dbl> 5, 8, 0, 8, 0, 11, 1, …
## $ SoT_Standard                          <dbl> 1, 1, 0, 1, 0, 1, 0, 1…
## $ SoT_percent_Standard                  <dbl> 20.0, 12.5, NA, 12.5, …
## $ Sh_per_90_Standard                    <dbl> 0.15, 0.42, 0.00, 1.81…
## $ SoT_per_90_Standard                   <dbl> 0.03, 0.05, 0.00, 0.23…
## $ G_per_Sh_Standard                     <dbl> 0.00, 0.13, NA, 0.13, …
## $ G_per_SoT_Standard                    <dbl> 0.00, 1.00, NA, 1.00, …
## $ Dist_Standard                         <dbl> 6.3, 17.7, NA, 14.6, N…
## $ FK_Standard                           <dbl> 0, 0, 0, 0, 0, 2, 0, 0…
## $ PK_Standard                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ PKatt_Standard                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ npxG_per_Sh_Expected                  <dbl> 0.10, 0.06, NA, 0.11, …
## $ G_minus_xG_Expected                   <dbl> -0.5, 0.6, 0.0, 0.1, 0…
## $ `np:G_minus_xG_Expected`              <dbl> -0.5, 0.6, 0.0, 0.1, 0…
## $ SCA_SCA                               <dbl> 7, 28, 0, 8, 0, 28, 2,…
## $ SCA90_SCA                             <dbl> 0.22, 1.46, 0.00, 1.82…
## $ PassLive_SCA                          <dbl> 4, 23, 0, 8, 0, 13, 2,…
## $ PassDead_SCA                          <dbl> 1, 3, 0, 0, 0, 10, 0, …
## $ Drib_SCA                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Sh_SCA                                <dbl> 0, 0, 0, 0, 0, 1, 0, 0…
## $ Fld_SCA                               <dbl> 1, 2, 0, 0, 0, 3, 0, 1…
## $ Def_SCA                               <dbl> 1, 0, 0, 0, 0, 1, 0, 0…
## $ GCA_GCA                               <dbl> 1, 4, 0, 0, 0, 2, 0, 2…
## $ GCA90_GCA                             <dbl> 0.03, 0.21, 0.00, 0.00…
## $ PassLive_GCA                          <dbl> 1, 4, 0, 0, 0, 0, 0, 2…
## $ PassDead_GCA                          <dbl> 0, 0, 0, 0, 0, 1, 0, 0…
## $ Drib_GCA                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Sh_GCA                                <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Fld_GCA                               <dbl> 0, 0, 0, 0, 0, 1, 0, 0…
## $ Def_GCA                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Cmp_Total                             <dbl> 958, 535, 2, 216, 7, 1…
## $ Att_Total                             <dbl> 1168, 840, 5, 275, 20,…
## $ Cmp_percent_Total                     <dbl> 82.0, 63.7, 40.0, 78.5…
## $ TotDist_Total                         <dbl> 21066, 10542, 24, 4038…
## $ PrgDist_Total                         <dbl> 7904, 6072, 0, 1548, 1…
## $ Cmp_Short                             <dbl> 254, 207, 1, 93, 0, 81…
## $ Att_Short                             <dbl> 295, 245, 3, 108, 0, 1…
## $ Cmp_percent_Short                     <dbl> 86.1, 84.5, 33.3, 86.1…
## $ Cmp_Medium                            <dbl> 503, 228, 1, 95, 4, 73…
## $ Att_Medium                            <dbl> 565, 337, 1, 112, 4, 1…
## $ Cmp_percent_Medium                    <dbl> 89.0, 67.7, 100.0, 84.…
## $ Cmp_Long                              <dbl> 189, 89, 0, 26, 3, 34,…
## $ Att_Long                              <dbl> 278, 225, 0, 46, 16, 7…
## $ Cmp_percent_Long                      <dbl> 68.0, 39.6, NA, 56.5, …
## $ xA                                    <dbl> 0.3, 1.9, 0.0, 0.4, 0.…
## $ A_minus_xA                            <dbl> 0.7, -1.9, 0.0, -0.4, …
## $ KP                                    <dbl> 3, 14, 0, 5, 0, 18, 1,…
## $ Final_Third                           <dbl> 39, 56, 0, 24, 1, 27, …
## $ PPA                                   <dbl> 1, 18, 0, 4, 0, 6, 0, …
## $ CrsPA                                 <dbl> 0, 9, 0, 0, 0, 2, 0, 9…
## $ Prog                                  <dbl> 44, 88, 0, 26, 0, 32, …
## $ Att                                   <dbl> 1168, 840, 5, 275, 20,…
## $ Live_Pass                             <dbl> 1124, 696, 5, 265, 10,…
## $ Dead_Pass                             <dbl> 44, 144, 0, 10, 10, 35…
## $ FK_Pass                               <dbl> 43, 15, 0, 10, 2, 9, 1…
## $ TB_Pass                               <dbl> 1, 0, 0, 0, 0, 2, 0, 0…
## $ Press_Pass                            <dbl> 141, 123, 1, 25, 0, 65…
## $ Sw_Pass                               <dbl> 28, 17, 0, 9, 0, 16, 2…
## $ Crs_Pass                              <dbl> 1, 58, 0, 1, 0, 12, 2,…
## $ CK_Pass                               <dbl> 0, 0, 0, 0, 0, 21, 0, …
## $ In_Corner                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Out_Corner                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Str_Corner                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ground_Height                         <dbl> 817, 394, 3, 193, 3, 1…
## $ Low_Height                            <dbl> 94, 68, 1, 20, 1, 28, …
## $ High_Height                           <dbl> 257, 378, 1, 62, 16, 1…
## $ Left_Body                             <dbl> 220, 545, 4, 40, 0, 19…
## $ Right_Body                            <dbl> 792, 50, 0, 211, 16, 2…
## $ Head_Body                             <dbl> 137, 90, 1, 22, 0, 19,…
## $ TI_Body                               <dbl> 1, 129, 0, 0, 0, 4, 0,…
## $ Other_Body                            <dbl> 3, 3, 0, 0, 2, 4, 0, 3…
## $ Cmp_Outcomes                          <dbl> 958, 535, 2, 216, 7, 1…
## $ Off_Outcomes                          <dbl> 3, 4, 0, 2, 0, 1, 0, 3…
## $ Out_Outcomes                          <dbl> 35, 20, 0, 9, 1, 9, 1,…
## $ Int_Outcomes                          <dbl> 14, 11, 1, 4, 0, 8, 0,…
## $ Blocks_Outcomes                       <dbl> 16, 22, 1, 8, 0, 12, 1…
## $ MP_Playing.Time                       <dbl> 34, 21, 1, 13, 1, 17, …
## $ Min_Playing.Time                      <dbl> 2921, 1727, 7, 397, 90…
## $ Mn_per_MP_Playing.Time                <dbl> 86, 82, 7, 31, 90, 32,…
## $ Min_percent_Playing.Time              <dbl> 85.4, 50.5, 0.2, 11.6,…
## $ Mins_Per_90_Playing.Time              <dbl> 32.5, 19.2, 0.1, 4.4, …
## $ Starts_Starts                         <dbl> 32, 20, 0, 3, 1, 4, 1,…
## $ Mn_per_Start_Starts                   <dbl> NA, NA, NA, NA, NA, NA…
## $ Compl_Starts                          <dbl> 32, 18, 0, 2, 1, 0, 0,…
## $ Subs_Subs                             <dbl> 2, 1, 1, 10, 0, 13, 2,…
## $ Mn_per_Sub_Subs                       <dbl> NA, NA, NA, NA, NA, NA…
## $ unSub_Subs                            <dbl> 2, 4, 6, 0, 36, 15, 2,…
## $ PPM_Team.Success                      <dbl> 1.26, 1.24, 1.00, 0.69…
## $ onG_Team.Success                      <dbl> 32, 23, 0, 3, 0, 4, 0,…
## $ onGA_Team.Success                     <dbl> 37, 21, 0, 8, 1, 10, 3…
## $ plus_per__minus__Team.Success         <dbl> -5, 2, 0, -5, -1, -6, …
## $ plus_per__minus_90_Team.Success       <dbl> -0.15, 0.10, 0.00, -1.…
## $ On_minus_Off_Team.Success             <dbl> -0.15, 0.48, 0.13, -1.…
## $ onxG_Team.Success..xG.                <dbl> 29.0, 20.1, 0.1, 3.3, …
## $ onxGA_Team.Success..xG                <dbl> 50.2, 25.8, 0.1, 10.5,…
## $ xGplus_per__minus__Team.Success..xG   <dbl> -21.2, -5.7, 0.0, -7.2…
## $ xGplus_per__minus_90_Team.Success..xG <dbl> -0.65, -0.30, 0.12, -1…
## $ On_minus_Off_Team.Success..xG         <dbl> -0.85, 0.48, 0.65, -1.…

# Removing the players who average 20 minutes or fewer per match played
offense_clean <- offense_stats %>%
    filter(Mn_per_MP_Playing.Time > 20)

# We go from 5,466 players to 4,993
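A per-match threshold and a season-total threshold do not remove the same players. The sketch below uses three made-up players to show the difference; the column names MP, Min, and Mn_per_MP are shortened stand-ins for the FBref playing time columns, and the 350-minute season total is only an example value.

```r
# Toy data (made up): matches played and total minutes for three players
toy <- data.frame(
  Player = c("P1", "P2", "P3"),
  MP = c(30, 2, 34),
  Min = c(450, 180, 2921),
  stringsAsFactors = FALSE
)
toy$Mn_per_MP <- toy$Min / toy$MP

# Rule used above: more than 20 minutes per match played
per_match_rule <- toy[toy$Mn_per_MP > 20, ]

# Alternative rule: at least 350 minutes over the whole season
season_total_rule <- toy[toy$Min >= 350, ]

per_match_rule$Player
season_total_rule$Player
```

Here the per-match rule keeps P2, who played two full matches and nothing else, and drops P1, a regular substitute with 450 minutes. The season-total rule does the opposite. Combining both conditions would be one way to guard against very small samples.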

Creating a new clean data set to store only the variables we care about and are interested in testing

offense_clean2 <- offense_clean %>%
    select(PlayerYearComp_id, Player, Squad, Comp, Season_End_Year, primary_position,
        player_position, Age, Min_Playing, G_minus_PK, "G+A_minus_PK_Per",
        Ast, xG_Expected, npxG_Expected, xA_Expected, "npxG+xA_Expected",
        xG_Per, xA, xA_Per, "xG+xA_Per", npxG_Per, "npxG+xA_Per", player_height_mtrs,
        joined_from, player_market_value_euro, Gls, Sh_per_90_Standard,
        G_per_Sh_Standard, Dist_Standard, npxG_per_Sh_Expected)

We want to know which of G_minus_PK, ‘G+A_minus_PK_Per’, xG_Expected, npxG_Expected, xA_Expected, ‘npxG+xA_Expected’, xG_Per, xA_Per, ‘xG+xA_Per’, npxG_Per, and ‘npxG+xA_Per’ is most strongly associated with market value. Because these variables are all correlated with one another, we run a separate simple linear regression of market value on each variable and compare the results.

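CODE START

Variables whose names contain a + sign need to be wrapped in backticks inside a formula (or renamed); otherwise R reads the + as adding a second term. From these quick regressions, G_minus_PK has the highest R squared of the individual counting statistics (.275); only the combined npxG+xA_Expected variable is higher (.294).

Goals minus PK + xA + npxG per Shot

mod2 <- lm(player_market_value_euro ~ G_minus_PK + xA + npxG_per_Sh_Expected,
    data = offense_clean2)
tab_model(mod2)
# R squared: .321
# Non Penalty xG per Shot increased our R Squared to .33 but the variable was not significant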

Goals minus PK

mod1 <- lm(player_market_value_euro ~ G_minus_PK, data = offense_clean2)
tab_model(mod1)
# R squared: .275

G+A minus PK per 90 minutes

mod1 <- lm(player_market_value_euro ~ `G+A_minus_PK_Per`, data = offense_clean2)
tab_model(mod1)
# R squared: .202

Goals

mod1 <- lm(player_market_value_euro ~ Gls, data = offense_clean2)
tab_model(mod1)
# R squared: .265
# Goals minus PK is a better indicator than goals. We can throw goals out

Non Penalty xG

mod1 <- lm(player_market_value_euro ~ npxG_Expected, data = offense_clean2)
tab_model(mod1)
# R squared: .244
# Non penalty xG is actually a better indicator than xG by itself

xG

mod1 <- lm(player_market_value_euro ~ xG_Expected, data = offense_clean2)
tab_model(mod1)
# R squared: .233
# Goals minus PK is a better indicator than xG

Non Penalty xG + xA

mod1 <- lm(player_market_value_euro ~ `npxG+xA_Expected`, data = offense_clean2)
tab_model(mod1)
# R squared: .294

npxG_per_Sh_Expected

mod1 <- lm(player_market_value_euro ~ npxG_per_Sh_Expected, data = offense_clean2)
tab_model(mod1)

xA

mod1 <- lm(player_market_value_euro ~ xA, data = offense_clean2)
tab_model(mod1)
# R squared: .224

Assists

mod1 <- lm(player_market_value_euro ~ Ast, data = offense_clean2)
tab_model(mod1)
# R squared: .214

xG + xA per 90 minutes

mod1 <- lm(player_market_value_euro ~ `xG+xA_Per`, data = offense_clean2)
tab_model(mod1)
# R squared: .168

Non-Penalty xG + xA per 90 minutes

mod1 <- lm(player_market_value_euro ~ `npxG+xA_Per`, data = offense_clean2)
tab_model(mod1)
# R squared: .165

xG per 90 minutes

mod1 <- lm(player_market_value_euro ~ xG_Per, data = offense_clean2)
tab_model(mod1)
# R squared: .128

xA per 90 minutes

mod1 <- lm(player_market_value_euro ~ xA_Per, data = offense_clean2)
tab_model(mod1)
# R squared: .127

# Not run: offense_clean2 has no column named xG; total xG is xG_Expected (fitted above)
# mod1 <- lm(player_market_value_euro ~ xG, data = offense_clean2)
# tab_model(mod1)

Total is more highly correlated than per 90

Non Penalty xG per 90 minutes

mod1 <- lm(player_market_value_euro ~ npxG_Per, data = offense_clean2)
tab_model(mod1)
# R squared: .122

Goals per Shot

mod1 <- lm(player_market_value_euro ~ G_per_Sh_Standard, data = offense_clean2)
tab_model(mod1)
# R squared: .029

Non Penalty xG per Shot

mod1 <- lm(player_market_value_euro ~ npxG_per_Sh_Expected, data = offense_clean2)
tab_model(mod1)
# R squared: .024

Goals minus PK + xA

mod2 <- lm(player_market_value_euro ~ G_minus_PK + xA, data = offense_clean2)
tab_model(mod2)
# R squared: .32

CODE END
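The one-predictor regressions above could also be run in a loop that collects R squared for each variable. The following is a sketch on simulated data, not a rerun of the analysis: the toy data frame, its coefficients, and the short predictors vector are all made up, and base R's summary() stands in for tab_model(). The backticks added around each name are what let a column such as npxG+xA_Expected be used in a formula.

```r
# Simulated toy data (not the FBref / market value data)
set.seed(1)
n <- 200
toy <- data.frame(G_minus_PK = rpois(n, 3), xA = round(runif(n, 0, 8), 1))
toy[["npxG+xA_Expected"]] <- toy$G_minus_PK + toy$xA + rnorm(n)
toy$player_market_value_euro <-
  1e6 * (1 + 2 * toy$G_minus_PK + toy$xA) + rnorm(n, sd = 3e6)

predictors <- c("G_minus_PK", "xA", "npxG+xA_Expected")

# Fit one simple regression per predictor and keep its R squared
r_squared <- sapply(predictors, function(v) {
  f <- as.formula(paste0("player_market_value_euro ~ `", v, "`"))
  summary(lm(f, data = toy))$r.squared
})

# Predictors ranked from highest to lowest R squared
sort(round(r_squared, 3), decreasing = TRUE)
```

On the real data, predictors would be the list of candidate columns in offense_clean2. Because the candidates are strongly correlated with each other, a high R squared for one of them does not mean the others add independent information.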

Deliverable

The deliverable will be an HTML file (website link) where we can publish our findings. We can include as much text, code, output, and as many charts as we want. Here is a draft of how I plan on introducing our topic.

First, we want to find out who is being paid the most and why. What do teams pay for? Is it performance? Age? Both? Why are player market values the way they are?

We are going to create a model for predicting the market value of a player based on previous market prices and performance statistics. From this, we can identify players who are performing extremely well but are priced below what the model predicts, and are therefore ‘undervalued’. Our goal is to put the best 11 players on the field for the lowest amount of money. We can do this by identifying the most undervalued player at each position.

Predicted market price - actual market price = Value
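As a rough illustration of this formula, the sketch below fits a model on simulated data and computes predicted minus actual price for each player. Everything here is assumed for illustration: the toy data, the two predictors, and the column names predicted_value and value are not from the analysis.

```r
# Simulated toy data (not the real player data)
set.seed(42)
n <- 50
toy <- data.frame(G_minus_PK = rpois(n, 4), xA = runif(n, 0, 6))
toy$player_market_value_euro <-
  1e6 * (2 + 1.5 * toy$G_minus_PK + toy$xA) + rnorm(n, sd = 2e6)

fit <- lm(player_market_value_euro ~ G_minus_PK + xA, data = toy)

# Value = predicted market price - actual market price
toy$predicted_value <- predict(fit, newdata = toy)
toy$value <- toy$predicted_value - toy$player_market_value_euro

# Players with the largest positive value are the undervalued candidates
head(toy[order(-toy$value), ], 5)
```

Computed this way, value is the negative of the model residual. Since the model is fitted and scored on the same players, a held-out season would give a fairer picture of who is really undervalued.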

Next Steps:

Interpret the regressions
Make one final model with the highest performance
Work on visuals
Create HTML file to outline data?

OLD NOTES

Standard Table

Should we create separate tables for each league, with squad as one of our variables, or just include all leagues in one table?

Do we want to use regular stats or non penalty stats?
In my model, I use non-penalty scores because penalty kicks are dependent on the other team. How do we feel about this?

We can either use goals and assists as separate variables, or we can use goals+assists as one predictor variable. To answer this question, I will create two models and see which one is a better predictor of +/-.

Should we use variables on a per 90 minute basis?
I decided to go with per 90 minute variables. To normalize our players, let’s use all variables on a per 90 minute basis. It will be extremely important to trim players that don’t meet the minutes threshold. To qualify as a leader on FBref, a player needs to play 30 minutes per squad game. We are going to use a threshold of 20 minutes per match played.

Do we want to use xG and xA, or G and A? Which is a better predictor of +/-, and which is a better predictor of value on the transfer market? The breakout players are the ones going above expected.

New Feature: G - xG = goals over expected

There are other variables such as xGplus_per_minus_90: expected goals scored minus expected goals allowed.

We also have xG expected

Squad -Nation -Player

Age -Born

Mins_Per_90_Playing -Min_Playing

What makes a player valuable? How does a player contribute to a win?

Nicholas Kondo