In this part of the project, I’d like to see what the possible predictors of a game’s sales are. This is important information since it’ll give developers pointers on what they could do to sell to a larger audience, and it’ll provide insight into what makes a game sell well. As in the prior part of this project, I will also be analysing each important sales region individually.
To supplement my data, I’ve included a web scrape from the site vgchartz. This site provides information about console sales and will act as an additional predictor. I chose to include console sales since the popularity of the platform a game is on likely contributes to the game’s success.
First, I’ll load the libraries I’ll be using, in addition to everything from the previous part of this project.
suppressMessages(library("tidyverse"))
suppressMessages(library("knitr"))
suppressMessages(library("caret"))
suppressMessages(library("rvest"))
suppressMessages(library("stringr"))
purl("Discovery_And_DataPrep.Rmd", output = "part1.r")
## processing file: Discovery_And_DataPrep.Rmd
## output file: part1.r
## [1] "part1.r"
source("part1.r")
I’ll also need to load in my supplementary data.
#Yoink the URL
URL_ConsoleSales <- "http://www.vgchartz.com/analysis/platform_totals/Hardware/Global/"
html <- read_html(URL_ConsoleSales)
#Pinning down the rows that I'll be scraping from
rows <- html %>% html_nodes(xpath = '//*[@id="myTable"]/tbody/tr')
#Grabbing all observations I need
Platform <- rows %>%
html_nodes(xpath = '//*[@id="myTable"]/tbody/tr/td[2]') %>%
html_text()
NA_Sales <- rows %>%
html_nodes(xpath = '//*[@id="myTable"]/tbody/tr/td[3]') %>%
html_text()
EU_Sales <- rows %>%
html_nodes(xpath = '//*[@id="myTable"]/tbody/tr/td[4]') %>%
html_text()
JP_Sales <- rows %>%
html_nodes(xpath = '//*[@id="myTable"]/tbody/tr/td[5]') %>%
html_text()
Other_Sales <- rows %>%
html_nodes(xpath = '//*[@id="myTable"]/tbody/tr/td[6]') %>%
html_text()
Global_Sales <- rows %>%
html_nodes(xpath = '//*[@id="myTable"]/tbody/tr/td[7]') %>%
html_text()
#Create a tibble of my newly acquired data
ConsoleSalesData <- tibble(Platform = Platform,
                           'NA Console Sales' = NA_Sales,
                           'EU Console Sales' = EU_Sales,
                           'JP Console Sales' = JP_Sales,
                           'Other Console Sales' = Other_Sales,
                           'Global Console Sales' = Global_Sales)
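As an aside, rvest can often parse an entire HTML table in one call, which would make the scrape above more concise. A minimal sketch, assuming the table at id="myTable" parses cleanly and its columns line up with the fields scraped individually above:
#Parse the whole platform table at once instead of column by column
sales_table <- html %>%
  html_node(xpath = '//*[@id="myTable"]') %>%
  html_table(fill = TRUE)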
I’ll need to reorganize my supplementary data so that it’s tidy and clean.
#Here, I convert each column into its appropriate type
ConsoleSalesData[2] <- ConsoleSalesData$`NA Console Sales` %>% as.numeric()
ConsoleSalesData[3] <- ConsoleSalesData$`EU Console Sales` %>% as.numeric()
ConsoleSalesData[4] <- ConsoleSalesData$`JP Console Sales` %>% as.numeric()
ConsoleSalesData[5] <- ConsoleSalesData$`Other Console Sales` %>% as.numeric()
ConsoleSalesData[6] <- ConsoleSalesData$`Global Console Sales` %>% as.numeric()
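#An aside: the five conversions above could be collapsed into one step.
#A sketch, assuming dplyr >= 1.0 for across(); it's a no-op if run after the
#manual conversions above.
ConsoleSalesData <- ConsoleSalesData %>% mutate(across(2:6, as.numeric))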
#This absolute monster of a code chunk is just me converting the names of each console into something that matches the naming convention used in part 1 of this project (a more compact alternative is sketched after the table below).
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='3DO Interactive Multiplayer (3DO)'] <- '3DO'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Atari 2600 (2600)'] <- '2600'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Atari 7800 (7800)'] <- '7800'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Dreamcast (DC)'] <- 'DC'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Game Boy (GB)'] <- 'GB'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Game Boy Advance (GBA)'] <- 'GBA'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='GameCube (GC)'] <- 'GC'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='GameGear (GG)'] <- 'GG'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='iOS (iOS)'] <- 'iOS'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Microsoft Windows (PC)'] <- 'PC'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Neo Geo (NG)'] <- 'NG'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Nintendo 3DS (3DS)'] <- '3DS'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Nintendo 64 (N64)'] <- 'N64'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Nintendo DS (DS)'] <- 'DS'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Nintendo Entertainment System (NES)'] <- 'NES'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Nintendo Switch (NS)'] <- 'NS'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='PC-FX (PCFX)'] <- 'PCFX'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='PlayStation (PS)'] <- 'PS'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='PlayStation 2 (PS2)'] <- 'PS2'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='PlayStation 3 (PS3)'] <- 'PS3'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='PlayStation 4 (PS4)'] <- 'PS4'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='PlayStation Portable (PSP)'] <- 'PSP'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='PlayStation Vita (PSV)'] <- 'PSV'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Sega CD (SCD)'] <- 'SCD'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Sega Genesis (GEN)'] <- 'GEN'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Sega Saturn (SAT)'] <- 'SAT'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Super Nintendo Entertainment System (SNES)'] <- 'SNES'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='TurboGrafx-16 (TG16)'] <- 'TG16'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Wii (Wii)'] <- 'Wii'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Wii U (WiiU)'] <- 'WiiU'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='WonderSwan (WS)'] <- 'WS'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Xbox (XB)'] <- 'XB'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Xbox 360 (X360)'] <- 'X360'
ConsoleSalesData$Platform[ConsoleSalesData$Platform=='Xbox One (XOne)'] <- 'XOne'
#I convert the Platform column into a factor
ConsoleSalesData$Platform <- as.factor(ConsoleSalesData$Platform)
#I removed these rows from my data since the values they provided were NULL.
ConsoleSalesData <- ConsoleSalesData[-c(28:34),]
#With all that done, I can show off the final table
ConsoleSalesData
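As a side note, every scraped console name embeds its abbreviation in parentheses, so the monster rename chunk above could in principle be replaced by one regular-expression extraction. A minimal sketch, assuming every row follows the “Full Name (ABBR)” pattern; it would have to run before the factor conversion:
#Pull the parenthesized abbreviation out of each console name in one pass
ConsoleSalesData$Platform <- str_extract(ConsoleSalesData$Platform,
                                         "(?<=\\()[^)]+(?=\\))")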
Now that I have my supplementary data in a tidy form, I’ll append it to my preexisting tables from the previous part. This way I can analyse the data by significant region as well as globally. In addition, I make some changes so that I can easily build my models.
#Separating console sales into NA, EU, JP, and Global just like in part 1
ConsoleDataNA <- tibble(Platform = ConsoleSalesData$Platform,
                        `NA Console Sales` = ConsoleSalesData$`NA Console Sales`)
ConsoleDataEU <- tibble(Platform = ConsoleSalesData$Platform,
                        `EU Console Sales` = ConsoleSalesData$`EU Console Sales`)
ConsoleDataJP <- tibble(Platform = ConsoleSalesData$Platform,
                        `JP Console Sales` = ConsoleSalesData$`JP Console Sales`)
ConsoleDataGlobal <- tibble(Platform = ConsoleSalesData$Platform,
                            `Global Console Sales` = ConsoleSalesData$`Global Console Sales`)
#Adding console sales to each of the tables I had made back in part 1
UserBasedDataNA <- left_join(UserBasedDataNA, ConsoleDataNA, by = "Platform")
## Warning: Column `Platform` joining factors with different levels, coercing to
## character vector
UserBasedDataEU <- left_join(UserBasedDataEU, ConsoleDataEU, by = "Platform")
## Warning: Column `Platform` joining factors with different levels, coercing to
## character vector
UserBasedDataJP <- left_join(UserBasedDataJP, ConsoleDataJP, by = "Platform")
## Warning: Column `Platform` joining factors with different levels, coercing to
## character vector
UserBasedDataGlobal <- left_join(UserBasedDataGlobal, ConsoleDataGlobal, by = "Platform")
## Warning: Column `Platform` joining factors with different levels, coercing to
## character vector
CriticBasedDataNA <- left_join(CriticBasedDataNA, ConsoleDataNA, by = "Platform")
## Warning: Column `Platform` joining factors with different levels, coercing to
## character vector
CriticBasedDataEU <- left_join(CriticBasedDataEU, ConsoleDataEU, by = "Platform")
## Warning: Column `Platform` joining factors with different levels, coercing to
## character vector
CriticBasedDataJP <- left_join(CriticBasedDataJP, ConsoleDataJP, by = "Platform")
## Warning: Column `Platform` joining factors with different levels, coercing to
## character vector
CriticBasedDataGlobal <- left_join(CriticBasedDataGlobal, ConsoleDataGlobal, by = "Platform")
## Warning: Column `Platform` joining factors with different levels, coercing to
## character vector
#Replacing all spaces with an underscore, since it'll make things more convenient during the modeling phase (a helper that factors out the repetition is sketched below)
names(UserBasedDataNA) <- gsub(x = names(UserBasedDataNA), pattern = " ", replacement = "_")
names(UserBasedDataEU) <- gsub(x = names(UserBasedDataEU), pattern = " ", replacement = "_")
names(UserBasedDataJP) <- gsub(x = names(UserBasedDataJP), pattern = " ", replacement = "_")
names(UserBasedDataGlobal) <- gsub(x = names(UserBasedDataGlobal), pattern = " ", replacement = "_")
names(CriticBasedDataNA) <- gsub(x = names(CriticBasedDataNA), pattern = " ", replacement = "_")
names(CriticBasedDataEU) <- gsub(x = names(CriticBasedDataEU), pattern = " ", replacement = "_")
names(CriticBasedDataJP) <- gsub(x = names(CriticBasedDataJP), pattern = " ", replacement = "_")
names(CriticBasedDataGlobal) <- gsub(x = names(CriticBasedDataGlobal), pattern = " ", replacement = "_")
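#An aside: the eight gsub() calls above follow one pattern, so a small helper
#could do the same job (a sketch; the other seven tables work the same way):
underscore_names <- function(df) {
  names(df) <- gsub(" ", "_", names(df))
  df
}
#e.g. UserBasedDataNA <- underscore_names(UserBasedDataNA)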
#Convert the scores to doubles so that I can properly analyse them in the model
UserBasedDataNA$User_Score <- as.double(UserBasedDataNA$User_Score)
UserBasedDataEU$User_Score <- as.double(UserBasedDataEU$User_Score)
UserBasedDataJP$User_Score <- as.double(UserBasedDataJP$User_Score)
UserBasedDataGlobal$User_Score <- as.double(UserBasedDataGlobal$User_Score)
CriticBasedDataNA$Critic_Score <- as.double(CriticBasedDataNA$Critic_Score)
CriticBasedDataEU$Critic_Score <- as.double(CriticBasedDataEU$Critic_Score)
CriticBasedDataJP$Critic_Score <- as.double(CriticBasedDataJP$Critic_Score)
CriticBasedDataGlobal$Critic_Score <- as.double(CriticBasedDataGlobal$Critic_Score)
#I'll also be creating tables that contain both User and Critic scores. These tables will be used for my visual models.
cols <- c(1,9:10)
DataNA <- left_join(UserBasedDataNA, CriticBasedDataNA[,cols], by = "Title")
DataEU <- left_join(UserBasedDataEU, CriticBasedDataEU[,cols], by = "Title")
DataJP <- left_join(UserBasedDataJP, CriticBasedDataJP[,cols], by = "Title")
DataGlobal <- left_join(UserBasedDataGlobal, CriticBasedDataGlobal[,cols], by = "Title")
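Before moving on to the models, it’s worth checking how many games actually picked up a console sales figure from the join, since the scrape lacked numbers for some platforms. A quick sketch for the North America table (the same check applies to the other regions):
#Count the games that failed to get a console sales figure in the join
sum(is.na(DataNA$NA_Console_Sales))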
Now I’ll begin to visualize models of game sales based on the review scores and the console sales within each region. I’ve created a model for each significant region, in addition to a model dedicated to global sales.
I’ve also built additional models that look at how ESRB ratings affect a game’s sales. These models are not visualized, both for the sake of brevity and because I believe the printed summaries are a good enough indicator; this is more for my own curiosity.
#Randomly pick 75% of the data for the training split.
index <- DataNA$NA_Sales %>% createDataPartition(p = 0.75, list = FALSE)
train <- DataNA[index, ]
test <- DataNA[-index, ]
train_model <- lm(NA_Sales ~ NA_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
##
## Call:
## lm(formula = NA_Sales ~ NA_Console_Sales + User_Score + Critic_Score,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.077 -0.357 -0.148 0.110 40.751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6412741 0.0516434 -12.417 < 2e-16 ***
## NA_Console_Sales 0.0086863 0.0005492 15.816 < 2e-16 ***
## User_Score -0.0049355 0.0006966 -7.085 1.48e-12 ***
## Critic_Score 0.0207456 0.0007286 28.473 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8686 on 10425 degrees of freedom
## (1226 observations deleted due to missingness)
## Multiple R-squared: 0.09176, Adjusted R-squared: 0.0915
## F-statistic: 351.1 on 3 and 10425 DF, p-value: < 2.2e-16
predictions <- train_model %>% predict(test)
ggplot(data = test, aes(x = predictions, y = NA_Sales)) +
  geom_point(alpha = 1/5) +
  scale_x_continuous(name = "Predicted Sales (Console Sales and Review Scores)") +
  scale_y_continuous(name = "North America Sales") +
  geom_smooth(method = "lm")
## Warning: Removed 409 rows containing non-finite values (stat_smooth).
## Warning: Removed 409 rows containing missing values (geom_point).
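To put a number on the hold-out fit beyond the scatterplot, caret’s RMSE() and R2() helpers could be applied to the same predictions. A minimal sketch; na.rm = TRUE drops the rows whose predictors were missing, and the same two calls would work for the EU, JP, and Global models below:
#Hold-out root mean squared error and correlation-based R-squared
RMSE(predictions, test$NA_Sales, na.rm = TRUE)
R2(predictions, test$NA_Sales, na.rm = TRUE)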
Model with ESRB ratings included.
UserModelNA <- lm(NA_Sales ~ Rating, data = UserBasedDataNA)
summary(UserModelNA)
##
## Call:
## lm(formula = NA_Sales ~ Rating, data = UserBasedDataNA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.500 -0.320 -0.210 0.009 40.888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0370 0.1138 0.325 0.745111
## RatingAO 1.2230 0.9590 1.275 0.202238
## RatingE 0.4347 0.1157 3.758 0.000172 ***
## RatingE10+ 0.2726 0.1179 2.311 0.020864 *
## RatingK-A 0.2230 0.9590 0.233 0.816125
## RatingM 0.4635 0.1165 3.978 7.02e-05 ***
## RatingRP -0.0370 0.6829 -0.054 0.956790
## RatingT 0.2545 0.1154 2.204 0.027536 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9522 on 7009 degrees of freedom
## Multiple R-squared: 0.01111, Adjusted R-squared: 0.01012
## F-statistic: 11.24 on 7 and 7009 DF, p-value: 3.056e-14
#Randomly pick 75% of the data for the training split.
index <- DataEU$EU_Sales %>% createDataPartition(p = 0.75, list = FALSE)
train <- DataEU[index, ]
test <- DataEU[-index, ]
train_model <- lm(EU_Sales ~ EU_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
##
## Call:
## lm(formula = EU_Sales ~ EU_Console_Sales + User_Score + Critic_Score,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7445 -0.2540 -0.1119 0.0647 10.5523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2864339 0.0339779 -8.430 <2e-16 ***
## EU_Console_Sales 0.0044268 0.0003497 12.661 <2e-16 ***
## User_Score -0.0045881 0.0004753 -9.654 <2e-16 ***
## Critic_Score 0.0136966 0.0004922 27.827 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5907 on 10408 degrees of freedom
## (1242 observations deleted due to missingness)
## Multiple R-squared: 0.07985, Adjusted R-squared: 0.07958
## F-statistic: 301.1 on 3 and 10408 DF, p-value: < 2.2e-16
predictions <- train_model %>% predict(test)
ggplot(data = test, aes(x = predictions, y = EU_Sales)) +
  geom_point(alpha = 1/5) +
  scale_x_continuous(name = "Predicted Sales (Console Sales and Review Scores)") +
  scale_y_continuous(name = "Europe Sales") +
  geom_smooth(method = "lm")
## Warning: Removed 393 rows containing non-finite values (stat_smooth).
## Warning: Removed 393 rows containing missing values (geom_point).
Model with ESRB ratings included.
UserModelEU <- lm(EU_Sales ~ Rating, data = UserBasedDataEU)
summary(UserModelEU)
##
## Call:
## lm(formula = EU_Sales ~ Rating, data = UserBasedDataEU)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3229 -0.2312 -0.1461 -0.0129 28.6788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.07900 0.08080 0.978 0.32827
## RatingAO 0.53100 0.68087 0.780 0.43548
## RatingE 0.20224 0.08213 2.463 0.01382 *
## RatingE10+ 0.09083 0.08374 1.085 0.27810
## RatingK-A 0.08100 0.68087 0.119 0.90531
## RatingM 0.24385 0.08272 2.948 0.00321 **
## RatingRP -0.04400 0.48482 -0.091 0.92769
## RatingT 0.08707 0.08196 1.062 0.28814
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6761 on 7009 degrees of freedom
## Multiple R-squared: 0.01026, Adjusted R-squared: 0.009272
## F-statistic: 10.38 on 7 and 7009 DF, p-value: 5.031e-13
#Randomly pick 75% of the data for the training split.
index <- DataJP$JP_Sales %>% createDataPartition(p = 0.75, list = FALSE)
train <- DataJP[index, ]
test <- DataJP[-index, ]
train_model <- lm(JP_Sales ~ JP_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
##
## Call:
## lm(formula = JP_Sales ~ JP_Console_Sales + User_Score + Critic_Score,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1366 -0.0559 -0.0291 0.0030 6.3627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1241650 0.0110225 -11.265 < 2e-16 ***
## JP_Console_Sales 0.0026405 0.0002018 13.088 < 2e-16 ***
## User_Score 0.0006122 0.0001635 3.744 0.000182 ***
## Critic_Score 0.0016836 0.0001699 9.911 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2025 on 10430 degrees of freedom
## (1219 observations deleted due to missingness)
## Multiple R-squared: 0.03305, Adjusted R-squared: 0.03277
## F-statistic: 118.8 on 3 and 10430 DF, p-value: < 2.2e-16
predictions <- train_model %>% predict(test)
ggplot(data = test, aes(x = predictions, y = JP_Sales)) +
  geom_point(alpha = 1/5) +
  scale_x_continuous(name = "Predicted Sales (Console Sales and Review Scores)") +
  scale_y_continuous(name = "Japan Sales") +
  geom_smooth(method = "lm")
## Warning: Removed 416 rows containing non-finite values (stat_smooth).
## Warning: Removed 416 rows containing missing values (geom_point).
Model with ESRB ratings included.
UserModelJP <- lm(JP_Sales ~ Rating, data = UserBasedDataJP)
summary(UserModelJP)
##
## Call:
## lm(formula = JP_Sales ~ Rating, data = UserBasedDataJP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0901 -0.0601 -0.0501 -0.0401 6.4099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.034714 0.033842 1.026 0.305
## RatingAO -0.034714 0.285156 -0.122 0.903
## RatingE 0.055338 0.034397 1.609 0.108
## RatingE10+ 0.006311 0.035072 0.180 0.857
## RatingK-A 1.425286 0.285156 4.998 5.93e-07 ***
## RatingM 0.008425 0.034644 0.243 0.808
## RatingRP -0.034714 0.203051 -0.171 0.864
## RatingT 0.025352 0.034328 0.739 0.460
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2831 on 7009 degrees of freedom
## Multiple R-squared: 0.00816, Adjusted R-squared: 0.007169
## F-statistic: 8.237 on 7 and 7009 DF, p-value: 4.874e-10
#Randomly pick 75% of the data for the training split.
index <- DataGlobal$Global_Sales %>% createDataPartition(p = 0.75, list = FALSE)
train <- DataGlobal[index, ]
test <- DataGlobal[-index, ]
train_model <- lm(Global_Sales ~ Global_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
##
## Call:
## lm(formula = Global_Sales ~ Global_Console_Sales + User_Score +
## Critic_Score, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.210 -0.702 -0.304 0.202 81.460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0371190 0.1014138 -10.23 < 2e-16 ***
## Global_Console_Sales 0.0049820 0.0003888 12.81 < 2e-16 ***
## User_Score -0.0107979 0.0014189 -7.61 2.97e-14 ***
## Critic_Score 0.0408941 0.0014753 27.72 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.755 on 10419 degrees of freedom
## (1231 observations deleted due to missingness)
## Multiple R-squared: 0.08124, Adjusted R-squared: 0.08097
## F-statistic: 307.1 on 3 and 10419 DF, p-value: < 2.2e-16
predictions <- train_model %>% predict(test)
ggplot(data = test, aes(x = predictions, y = Global_Sales)) +
  geom_point(alpha = 1/5) +
  scale_x_continuous(name = "Predicted Sales (Console Sales and Review Scores)") +
  scale_y_continuous(name = "Global Sales") +
  geom_smooth(method = "lm")
## Warning: Removed 404 rows containing non-finite values (stat_smooth).
## Warning: Removed 404 rows containing missing values (geom_point).
Model with ESRB ratings included.
UserModelGlobal <- lm(Global_Sales ~ Rating, data = UserBasedDataGlobal)
summary(UserModelGlobal)
##
## Call:
## lm(formula = Global_Sales ~ Rating, data = UserBasedDataGlobal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.975 -0.635 -0.415 0.003 81.595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1667 0.2308 0.722 0.470134
## RatingAO 1.7833 1.9448 0.917 0.359211
## RatingE 0.7680 0.2346 3.274 0.001067 **
## RatingE10+ 0.4185 0.2392 1.749 0.080264 .
## RatingK-A 1.7533 1.9448 0.902 0.367351
## RatingM 0.8182 0.2363 3.463 0.000538 ***
## RatingRP -0.1317 1.3849 -0.095 0.924230
## RatingT 0.4103 0.2341 1.753 0.079702 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.931 on 7009 degrees of freedom
## Multiple R-squared: 0.01047, Adjusted R-squared: 0.009479
## F-statistic: 10.59 on 7 and 7009 DF, p-value: 2.542e-13
As the results of my models show, a game’s user and critic review scores, along with its platform’s sales numbers, are each highly significant predictors of its sales. That said, the models’ low R-squared values show that even together these predictors explain only a modest share of the variation in a game’s sales.
This is likely because there are far more poorly selling games than average-to-top-selling ones. Limiting my selection to only the highest-grossing games would likely increase the correlation; a rough sketch of that idea follows below. Another limitation of these models is that a number of observations from Part 1 never received a console sales figure from the supplementary data: the site I scraped didn’t provide sales numbers for some of the consoles observed, so those rows were dropped from the models.
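As a rough sketch of that idea, the North America model could be refit on only the better-selling half of the data; the median cutoff is an arbitrary choice for illustration:
#Refit the NA model on games above the median NA sales figure
TopNA <- DataNA %>% filter(NA_Sales > quantile(NA_Sales, 0.5, na.rm = TRUE))
top_model <- lm(NA_Sales ~ NA_Console_Sales + User_Score + Critic_Score, data = TopNA)
summary(top_model)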
In addition to those findings, my supplementary models of ESRB ratings show that M-rated and E-rated games have the strongest significant association with game sales. More specifically, it appears that everywhere other than Japan has M-rated games as the higher-selling products. In contrast, Japan’s market seems geared more towards a younger audience.