Finally, in this part of the project I’ll document the revisions I’ve made to my previous work, alongside the possible applications of this data.
First, as always, I’ll load my libraries and source the work from the earlier parts.
suppressMessages(library("tidyverse"))
suppressMessages(library("knitr"))
suppressMessages(library("caret"))
suppressMessages(library("rvest"))
suppressMessages(library("stringr"))
#purl() (from knitr) extracts the R code from each .Rmd into a plain .r
#script; it only needs to run once, so the calls are commented out here.
#purl("Discovery_And_DataPrep.Rmd", output = "part1.r")
source("part1.r")
#purl("Model_Planning_and_Building.Rmd", output = "part2.r")
source("part2.r")
##
##
## processing file: Discovery_And_DataPrep.Rmd
##
## output file: part1.r
## Warning: Column `Platform` joining factors with different levels, coercing to
## character vector
(This warning repeats eight times, once for each platform join.)
After the completion of the previous two parts, I received a few critiques on how I could improve my work. Here I’ll document all the changes I thought were worth making. (The original versions of Part 1 and Part 2 can still be found in this GitHub repo; the revised versions can be found on the project’s GitHub page.)
I adjusted my code at lines 191-218 so that I no longer create redundant tibbles. This was fixed thanks to a suggestion from a peer. They also mentioned that I could have combined all regional sales into one tibble instead of splitting them across multiple tibbles, and they suggested that I supplement any missing review scores with web scrapes from another site. I chose to make neither change, for the following reasons.
I passed on the first suggestion because separate tibbles let me easily apply certain display functions to any region of my choosing, and they came in handy in the next part, where I built models for each region. The second suggestion was a fine idea, but after searching through some of the entries with missing review scores, I found that many of them were very old games with no review scores documented on the web. While I could likely track down critic reviews for these older games in old gaming magazines, that would still leave the user review scores missing. In the end, I chose not to follow this suggestion.
One other change I made on my own was to better explain my intentions for this project back in the introduction section.
The peer who critiqued my project mentioned that I could have explained my models better after displaying them. To address this, I expected the cross-validation I do later in this part to help explain those previously built models; that said, I’ve also included additional explanations for both my ESRB models and the initial models. They also mentioned that I should have discussed what could be done with the information shown in that part, but I believe my introduction already covers most of the things you could do with this data.
In order to better illustrate what my models are telling us, I’ll validate each region’s Game Sales v. Console Sales and Review Scores model using a holdout form of cross-validation: a random 75/25 train/test split.
As a guide to help read the results I’ll extract from this cross-validation:
- R2: the proportion of variance in game sales that the model explains (computed here as the squared correlation between predicted and observed values); the closer to 1, the stronger the fit.
- Root Mean Squared Error (RMSE): the average difference between the observed outcome values and the values predicted by the model. The lower the RMSE, the better the model.
- Mean Absolute Error (MAE): how big an error we can expect on average; smaller values mean higher accuracy.
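To make these metrics concrete, here is a minimal sketch of the formulas behind caret’s R2(), RMSE(), and MAE() helpers, assuming pred and obs are numeric vectors of predicted and observed sales (caret’s R2() defaults to the squared-correlation form shown here):
#Hand-rolled versions of the three metrics reported below.
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))
mae <- function(pred, obs) mean(abs(obs - pred))
r2 <- function(pred, obs) cor(pred, obs)^2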
NA Sales v. NA Console Sales and Review Scores
#Cross-validation can't handle missing data within the tibble, so I first
#remove any rows with missing console sales figures.
#(Note: no random seed is set, so the 75/25 split, and therefore the exact
#numbers below, will vary slightly from run to run.)
CV_DataNA <- DataNA[complete.cases(DataNA), ]
index <- CV_DataNA$NA_Sales %>% createDataPartition(p = 0.75, list = FALSE)
train <- CV_DataNA[index, ]
test <- CV_DataNA[-index, ]
train_model <- lm(NA_Sales ~ NA_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
##
## Call:
## lm(formula = NA_Sales ~ NA_Console_Sales + User_Score + Critic_Score,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1223 -0.3615 -0.1517 0.1181 14.9514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6836018 0.0480217 -14.235 < 2e-16 ***
## NA_Console_Sales 0.0086484 0.0005098 16.966 < 2e-16 ***
## User_Score -0.0045880 0.0006450 -7.113 1.21e-12 ***
## Critic_Score 0.0211340 0.0006759 31.266 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8097 on 10425 degrees of freedom
## Multiple R-squared: 0.1084, Adjusted R-squared: 0.1081
## F-statistic: 422.5 on 3 and 10425 DF, p-value: < 2.2e-16
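To make the fitted model easier to read, the coefficients above translate into a prediction equation (rounded from the output; the exact values shift slightly with each random split):
predicted NA_Sales = -0.684 + 0.0086 * NA_Console_Sales - 0.0046 * User_Score + 0.0211 * Critic_Score
For example, assuming sales are in millions of units and both scores are on a 0-100 scale, a game on a console with 50 million units sold, a user score of 70, and a critic score of 80 would be predicted to sell roughly -0.684 + 0.432 - 0.321 + 1.691 ≈ 1.12 million copies.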
predictions <- train_model %>% predict(test)
CV <- data.frame( R2 = R2(predictions, test$NA_Sales),
RMSE = RMSE(predictions, test$NA_Sales),
MAE = MAE(predictions, test$NA_Sales))
CV
EU Sales v. EU Console Sales and Review Scores
#As before, cross-validation can't handle missing data, so I remove any
#rows with missing console sales figures.
CV_DataEU <- DataEU[complete.cases(DataEU), ]
index <- CV_DataEU$EU_Sales %>% createDataPartition(p = 0.75, list = FALSE)
train <- CV_DataEU[index, ]
test <- CV_DataEU[-index, ]
train_model <- lm(EU_Sales ~ EU_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
##
## Call:
## lm(formula = EU_Sales ~ EU_Console_Sales + User_Score + Critic_Score,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7552 -0.2575 -0.1143 0.0661 28.6347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2932275 0.0374520 -7.829 5.38e-15 ***
## EU_Console_Sales 0.0046269 0.0003890 11.895 < 2e-16 ***
## User_Score -0.0046488 0.0005333 -8.716 < 2e-16 ***
## Critic_Score 0.0138178 0.0005542 24.933 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6593 on 10425 degrees of freedom
## Multiple R-squared: 0.06651, Adjusted R-squared: 0.06624
## F-statistic: 247.6 on 3 and 10425 DF, p-value: < 2.2e-16
predictions <- train_model %>% predict(test)
CV <- data.frame( R2 = R2(predictions, test$EU_Sales),
RMSE = RMSE(predictions, test$EU_Sales),
MAE = MAE(predictions, test$EU_Sales))
CV
JP Sales v. JP Console Sales and Review Scores
#As before, cross-validation can't handle missing data, so I remove any
#rows with missing console sales figures.
CV_DataJP <- DataJP[complete.cases(DataJP), ]
index <- CV_DataJP$JP_Sales %>% createDataPartition(p = 0.75, list = FALSE)
train <- CV_DataJP[index, ]
test <- CV_DataJP[-index, ]
train_model <- lm(JP_Sales ~ JP_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
##
## Call:
## lm(formula = JP_Sales ~ JP_Console_Sales + User_Score + Critic_Score,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1327 -0.0543 -0.0287 0.0024 5.2006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1216852 0.0104203 -11.68 < 2e-16 ***
## JP_Console_Sales 0.0025109 0.0001898 13.23 < 2e-16 ***
## User_Score 0.0005941 0.0001535 3.87 0.00011 ***
## Critic_Score 0.0016728 0.0001593 10.50 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1901 on 10423 degrees of freedom
## Multiple R-squared: 0.03495, Adjusted R-squared: 0.03468
## F-statistic: 125.8 on 3 and 10423 DF, p-value: < 2.2e-16
predictions <- train_model %>% predict(test)
CV <- data.frame( R2 = R2(predictions, test$JP_Sales),
RMSE = RMSE(predictions, test$JP_Sales),
MAE = MAE(predictions, test$JP_Sales))
CV
Global Sales v. Global Console Sales and Review Scores
#As before, cross-validation can't handle missing data, so I remove any
#rows with missing console sales figures.
CV_DataGlobal <- DataGlobal[complete.cases(DataGlobal), ]
index <- CV_DataGlobal$Global_Sales %>% createDataPartition(p = 0.75, list = FALSE)
train <- CV_DataGlobal[index, ]
test <- CV_DataGlobal[-index, ]
train_model <- lm(Global_Sales ~ Global_Console_Sales + User_Score + Critic_Score, data = train)
summary(train_model)
##
## Call:
## lm(formula = Global_Sales ~ Global_Console_Sales + User_Score +
## Critic_Score, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.093 -0.693 -0.300 0.206 34.226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.128210 0.092162 -12.24 < 2e-16 ***
## Global_Console_Sales 0.004941 0.000355 13.92 < 2e-16 ***
## User_Score -0.009230 0.001291 -7.15 9.28e-13 ***
## Critic_Score 0.040553 0.001339 30.28 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.604 on 10425 degrees of freedom
## Multiple R-squared: 0.09791, Adjusted R-squared: 0.09765
## F-statistic: 377.2 on 3 and 10425 DF, p-value: < 2.2e-16
predictions <- train_model %>% predict(test)
CV <- data.frame( R2 = R2(predictions, test$Global_Sales),
RMSE = RMSE(predictions, test$Global_Sales),
MAE = MAE(predictions, test$Global_Sales))
CV
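Since a single random split produces slightly different numbers on every run, a k-fold cross-validation would give more stable estimates of these same metrics. Here is a minimal sketch of what that could look like with caret’s train() function, reusing the CV_DataGlobal tibble from above (the fold assignments are random, so results will still vary slightly between runs):
#Fit the same linear model with 10-fold cross-validation instead of a
#single 75/25 holdout split.
ctrl <- trainControl(method = "cv", number = 10)
cv_model <- train(Global_Sales ~ Global_Console_Sales + User_Score + Critic_Score,
                  data = CV_DataGlobal, method = "lm", trControl = ctrl)
cv_model$results #reports RMSE, Rsquared, and MAE averaged across the folds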
Through this cross-validation, I can see that, while console sales and review scores each individually hold a high correlation with game sales, the correlation drops when they are factored together. The strongest prediction was found in North America’s model. In addition, based on the RMSE and MAE, each model’s typical error is small, so the predictions are reasonably accurate. As discussed in the previous part, this drop is likely caused by the fact that there are far more poorly selling games than top-selling games.
As for what I’ve learned throughout this project: it appears that nearly everywhere except Japan has a thriving market for M-rated games, while Japan has generally sold more games aimed at younger audiences. In addition, console sales and the review scores of both critics and users contribute greatly to a game’s sales.
As I said in my first introduction to this project, I set out to work with this data so I could see what helps a game sell. Based on the work I’ve done, I can broadly see what types of games do well in which regions. This data can help developers push towards a certain market to maximize the chance that a game sells well, a point that applies to both indie and big-studio devs.
With regards to the ethical issues this could raise: I’ve always believed game development to be a strong form of self-expression, and I feel this highly data-driven approach to game development could easily take away from that. While, at the end of the day, a game is a product, I believe it’s equally important to preserve games as a medium for developer self-expression.
Everything below is based on suggestions my peer gave during the final review. Another issue that could arise is giving a region’s developers a stronger incentive to aim at one audience to maximize sales. A specific example would be North American developers focusing so heavily on the M-rated market that content aimed at kids drops off; the opposite could be said of the Japanese market, with games aimed further towards children but not adults.
While my results could easily help a big game-development team, I believe these kinds of observations would be better suited to indie developers. Analysing what boosts the sales of indie games would, I think, yield more valuable information than what I’ve covered here. As a result, observing indie games rather than AAA games might be something I aim for in the future.