Introduction

Finally, in this part of the project I’ll document the revisions I’ve made to my previous work, alongside the possible applications of this data.

First, as always, I’ll have to load in my libraries and the work from the previous parts.

suppressMessages(library("tidyverse"))
suppressMessages(library("knitr"))
suppressMessages(library("caret"))
suppressMessages(library("rvest"))
suppressMessages(library("stringr"))
#purl("Discovery_And_DataPrep.Rmd", output = "part1.r")
source("part1.r")
#purl("Model_Planning_and_Building.Rmd", output = "part2.r")
source("part2.r")
## 
## 
## processing file: Discovery_And_DataPrep.Rmd
## 
## output file: part1.r
## Warning: Column `Platform` joining factors with different levels, coercing to
## character vector

Revisions

After completing the previous two parts, I received a few critiques on how I could improve my work. Here I’ll document all the changes I thought were worth making. (The original versions of Part 1 and Part 2 can still be found in this GitHub repo; the revised versions can be found on the GitHub page for this project.)

Part 1

I adjusted my code at lines 191-218 so that it no longer creates redundant tibbles, a fix prompted by a suggestion from a peer. They also mentioned that I could have combined all region sales into one tibble instead of splitting them into multiple tibbles, and they suggested that I supplement any missing review scores with web scrapes from another site. I chose not to make either change, for the following reasons. I declined the first suggestion because the separate tibbles let me easily apply certain display functions to any region of my choosing, and they also came in handy in the next part, where I built models for each region. The second suggestion was a fine idea, but after searching through some of the entries with missing review scores, I found that many came from very old games with no review score documented on the web. While I could likely track down critic reviews for those older games in old game magazines, there would still be no user review score. In the end, I chose not to follow this suggestion.

One other change I made on my own was to better explain my intentions for this project back in the introduction section.

Part 2

The peer who critiqued my project mentioned that I could have explained my models better after displaying them. To address this, I reasoned that the cross-validation I perform later in this part would help explain those previously built models; that said, I’ve also included additional explanations for both my ESRB models and the initial models. They also mentioned that I should have talked about what could be done with the information shown in that part, but I believe my introduction already covers most of the things this data could be used for.

Cross-validation

In order to better illustrate what my models are telling us, I’ll be using cross-validation on each region’s Game Sales v. Console Sales and Review Scores model.

As a guide to reading the numbers I’ll extract from this cross-validation:

- R2: Represents the strength of the relationship between the predictions and the observed values; the closer to 1, the stronger the fit.
- Root Mean Squared Error (RMSE): Represents the average difference between the observed outcome values and the values predicted by the model. The lower the RMSE, the better the model.
- Mean Absolute Error (MAE): Represents how large an error we can expect on average. Smaller values mean higher accuracy.
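For reference, here is a minimal sketch of how these three metrics can be computed by hand, assuming caret’s defaults (its R2() reports the squared correlation between predictions and observations). The results below come from caret’s own R2(), RMSE(), and MAE() functions; these re-implementations are only illustrative.

# Illustrative re-implementations of the metrics reported below.
# pred = the model's predicted sales, obs = the observed sales in the test set.
r2_manual   <- function(pred, obs) cor(pred, obs)^2            # squared correlation
rmse_manual <- function(pred, obs) sqrt(mean((pred - obs)^2))  # root mean squared error
mae_manual  <- function(pred, obs) mean(abs(pred - obs))       # mean absolute error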

North America

NA Sales v. NA Console Sales and Review Scores

# Cross-validation can't be run with missing data in the tibble,
# so I remove any rows still missing values (e.g. console sales).
CV_DataNA <- DataNA[rowSums(is.na(DataNA)) == 0, ]
# Hold out 25% of the rows for testing; train on the remaining 75%.
index <- CV_DataNA$NA_Sales %>% createDataPartition(p = 0.75, list = FALSE)

train <- CV_DataNA[index, ]
test <- CV_DataNA[-index, ]

train_model <- lm(NA_Sales ~ NA_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
## 
## Call:
## lm(formula = NA_Sales ~ NA_Console_Sales + User_Score + Critic_Score, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1223 -0.3615 -0.1517  0.1181 14.9514 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.6836018  0.0480217 -14.235  < 2e-16 ***
## NA_Console_Sales  0.0086484  0.0005098  16.966  < 2e-16 ***
## User_Score       -0.0045880  0.0006450  -7.113 1.21e-12 ***
## Critic_Score      0.0211340  0.0006759  31.266  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8097 on 10425 degrees of freedom
## Multiple R-squared:  0.1084, Adjusted R-squared:  0.1081 
## F-statistic: 422.5 on 3 and 10425 DF,  p-value: < 2.2e-16
predictions <- train_model %>% predict(test) 

CV <- data.frame( R2 = R2(predictions, test$NA_Sales),
            RMSE = RMSE(predictions, test$NA_Sales),
            MAE = MAE(predictions, test$NA_Sales))
CV

Europe

EU Sales v. EU Console Sales and Review Scores

# Cross-validation can't be run with missing data in the tibble,
# so I remove any rows still missing values (e.g. console sales).
CV_DataEU <- DataEU[rowSums(is.na(DataEU)) == 0, ]
# Hold out 25% of the rows for testing; train on the remaining 75%.
index <- CV_DataEU$EU_Sales %>% createDataPartition(p = 0.75, list = FALSE)

train <- CV_DataEU[index, ]
test <- CV_DataEU[-index, ]

train_model <- lm(EU_Sales ~ EU_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
## 
## Call:
## lm(formula = EU_Sales ~ EU_Console_Sales + User_Score + Critic_Score, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7552 -0.2575 -0.1143  0.0661 28.6347 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.2932275  0.0374520  -7.829 5.38e-15 ***
## EU_Console_Sales  0.0046269  0.0003890  11.895  < 2e-16 ***
## User_Score       -0.0046488  0.0005333  -8.716  < 2e-16 ***
## Critic_Score      0.0138178  0.0005542  24.933  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6593 on 10425 degrees of freedom
## Multiple R-squared:  0.06651,    Adjusted R-squared:  0.06624 
## F-statistic: 247.6 on 3 and 10425 DF,  p-value: < 2.2e-16
predictions <- train_model %>% predict(test) 

CV <- data.frame( R2 = R2(predictions, test$EU_Sales),
            RMSE = RMSE(predictions, test$EU_Sales),
            MAE = MAE(predictions, test$EU_Sales))
CV

Japan

JP Sales v. JP Console Sales and Review Scores

# Cross-validation can't be run with missing data in the tibble,
# so I remove any rows still missing values (e.g. console sales).
CV_DataJP <- DataJP[rowSums(is.na(DataJP)) == 0, ]
# Hold out 25% of the rows for testing; train on the remaining 75%.
index <- CV_DataJP$JP_Sales %>% createDataPartition(p = 0.75, list = FALSE)

train <- CV_DataJP[index, ]
test <- CV_DataJP[-index, ]

train_model <- lm(JP_Sales ~ JP_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
## 
## Call:
## lm(formula = JP_Sales ~ JP_Console_Sales + User_Score + Critic_Score, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1327 -0.0543 -0.0287  0.0024  5.2006 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.1216852  0.0104203  -11.68  < 2e-16 ***
## JP_Console_Sales  0.0025109  0.0001898   13.23  < 2e-16 ***
## User_Score        0.0005941  0.0001535    3.87  0.00011 ***
## Critic_Score      0.0016728  0.0001593   10.50  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1901 on 10423 degrees of freedom
## Multiple R-squared:  0.03495,    Adjusted R-squared:  0.03468 
## F-statistic: 125.8 on 3 and 10423 DF,  p-value: < 2.2e-16
predictions <- train_model %>% predict(test) 

CV <- data.frame( R2 = R2(predictions, test$JP_Sales),
            RMSE = RMSE(predictions, test$JP_Sales),
            MAE = MAE(predictions, test$JP_Sales))
CV

Global

Global Sales v. Global Console Sales and Review Scores

# Cross-validation can't be run with missing data in the tibble,
# so I remove any rows still missing values (e.g. console sales).
CV_DataGlobal <- DataGlobal[rowSums(is.na(DataGlobal)) == 0, ]
# Hold out 25% of the rows for testing; train on the remaining 75%.
index <- CV_DataGlobal$Global_Sales %>% createDataPartition(p = 0.75, list = FALSE)

train <- CV_DataGlobal[index, ]
test <- CV_DataGlobal[-index, ]

train_model <- lm(Global_Sales ~ Global_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
## 
## Call:
## lm(formula = Global_Sales ~ Global_Console_Sales + User_Score + 
##     Critic_Score, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.093 -0.693 -0.300  0.206 34.226 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.128210   0.092162  -12.24  < 2e-16 ***
## Global_Console_Sales  0.004941   0.000355   13.92  < 2e-16 ***
## User_Score           -0.009230   0.001291   -7.15 9.28e-13 ***
## Critic_Score          0.040553   0.001339   30.28  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.604 on 10425 degrees of freedom
## Multiple R-squared:  0.09791,    Adjusted R-squared:  0.09765 
## F-statistic: 377.2 on 3 and 10425 DF,  p-value: < 2.2e-16
predictions <- train_model %>% predict(test) 

CV <- data.frame( R2 = R2(predictions, test$Global_Sales),
            RMSE = RMSE(predictions, test$Global_Sales),
            MAE = MAE(predictions, test$Global_Sales))
CV
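Since the four blocks above repeat the same steps, the workflow could also be wrapped in a small helper. The sketch below is only illustrative: the function name cross_validate_region and its arguments are my own shorthand, not part of the project code above.

# Sketch of a reusable helper for the per-region cross-validation above.
# sales_col / console_col name the region-specific columns, e.g.
# cross_validate_region(DataNA, "NA_Sales", "NA_Console_Sales")
cross_validate_region <- function(data, sales_col, console_col, p = 0.75) {
  cv_data <- data[rowSums(is.na(data)) == 0, ]   # drop rows with missing values
  index <- createDataPartition(cv_data[[sales_col]], p = p, list = FALSE)
  train <- cv_data[index, ]
  test <- cv_data[-index, ]
  form <- reformulate(c(console_col, "User_Score", "Critic_Score"),
                      response = sales_col)
  model <- lm(form, data = train)
  preds <- predict(model, test)
  data.frame(R2 = R2(preds, test[[sales_col]]),
             RMSE = RMSE(preds, test[[sales_col]]),
             MAE = MAE(preds, test[[sales_col]]))
}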

Results

Through cross-validation, I can see that while console sales and review scores each individually hold a high correlation with game sales, the correlation seems to drop when they are factored together. The strongest prediction came from North America’s model. In addition, based on the RMSE and MAE, each of the models appears decent and highly accurate. As discussed in the previous part, this drop is likely caused by the fact that poorly selling games far outnumber top-selling games.

However, with regards to what I’ve learned throughout this project, it appears that every region except Japan has a thriving market for Rated M games; Japan has generally sold more games aimed at younger audiences. In addition, console sales and both critic and user review scores contribute greatly to a game’s sales.

Operationalization

As I said in my original introduction to this project, I aimed to work with this data so I could see what helps a game sell. Based on the work I’ve done, I can generally see which types of games do well in which region. This data can help developers target a certain market to maximize the potential for a game to sell well, a point that applies to both indie and big-studio devs.

Unintended Consequences

With regards to the ethical issues this could bring, I have always believed game development to be a strong form of self-expression, and I feel this highly data-driven approach could easily take away from that aspect of the craft. While a game is, at the end of the day, a product, I believe it’s equally important to maintain games as a medium of developer self-expression.

Everything below is based on suggestions my peer gave in their final feedback. Another issue that could arise is that this data provides a stronger incentive for a region’s developers to aim at a certain audience to maximize their sales. A more specific example: if North American developers focused too heavily on the Rated M market, there could be a drop in content aimed at kids. The opposite could be said of the Japanese market, where games would be aimed further towards children and away from adults.

Going Forward

While my results could easily help a big game development team, I believe these kinds of observations would be better suited to indie developers. I think analysing what could boost the sales of indie games would be more valuable information than what I’ve covered here. As a result, observing indie games as opposed to AAA games might be something I aim for in the future.