Final Project

Purpose

The purpose of this individual/group final project is to put to work the tools and knowledge that you gain throughout this course. This provides you with multiple benefits.

1. It will provide you with more experience using data cleaning tools on real-life data sets.
2. It helps you become a self-directed learner. As a data scientist, a large part of your job is to self-direct your learning and interests to find unique and creative ways to uncover insights in data.
3. It starts to build your data science portfolio. Establishing a data science portfolio is a great way to show potential employers your ability to work with data.

The course is structured in a way that allows you to work on your project as you progress through the weeks. Thus, you should not have to cram during the last two weeks of the term to complete your project. Rather, I plan to have you work on the project and use some of the in-class time to do peer evaluation of your code.

Project Goal

The principal goal of this project is to import a real-life data set, clean and tidy the data, and perform basic exploratory data analysis, all while using R Markdown to produce an HTML or PDF report that is fully reproducible.

Project Data

You will need to select one data set from the four that I have supplied below. All four data sets contain key attributes that will demonstrate the data science capabilities that you have learned throughout this course. You may even need to learn new skills not taught in class to accomplish your mission. These include working with:

• multiple data types (numeric, character, date, etc.)
• non-normalized characteristics (may contain punctuation, upper- and lowercase letters, etc.)
• data sets that need to be merged
• unclean data (missing values, values that do not align to the data dictionary)
• variables that need to be created (e.g., the data may contain income and expense variables, but you want to analyze savings, so you need to create a savings variable from the income and expense variables; see the short sketch after this list)
• data that needs to be filtered out
• and much more!
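
To illustrate the derived-variable bullet above, here is a minimal sketch in R. The data frame and the `income`/`expenses` column names are hypothetical, for illustration only, and do not come from the supplied data sets.

```r
# Hypothetical example: derive a savings variable from income and expense columns.
library(dplyr)

finances <- data.frame(
  income   = c(4200, 3900, 5100),
  expenses = c(3600, 4100, 4400)
)

finances <- finances %>%
  mutate(savings      = income - expenses,   # new derived variable
         savings_rate = savings / income)    # share of income saved
```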

Available data sets include:

You can choose from one of the following four data sets posted on Canvas. Each data set has its own challenges and strengths.

• Dog Data

• Lodge Data

• NFL Data

• Global Music Data

Note: Your homework group members may or may not all select the same data set. If members in your peer group select the same data set, your work should reflect an individual/pair effort.

Project Report

You will write an R Markdown HTML or PDF report that provides the sections in the grading rubric below. You will need to import, assess, clean, and tidy the data, and then come up with your own research questions that you would like to answer from the data by performing exploratory data analysis (if you would like to build a predictive model to answer your hypothesis, that is fine, but it is not required). Some thoughts to help you:

• Make a storyboard. Your project should be a logical, cohesive story, not simply a bunch of graphs created for the sake of making them. The story may change as you dive deeper into the data and find insights, but a storyboard gives you direction and purpose for developing insights. Clear writing means a clear mind, and a storyboard is vital to producing a good story.
• Speaking of insights, keep in mind that your project should follow the chain of data -> insights -> actions. As a future data analyst (or data scientist), you work to create insights that lead to actions, not to waste 40 hours on an awe-inspiring visualization that is ignored directly after a presentation and never used again.
• Simple descriptive statistics can (and usually do) yield more of an immediate impact than a complicated model.
• Do subgroups matter in your data?
• Why are data missing?
• Are trends over time important?

Although each data set's data dictionary contains some additional questions worth pursuing, try to be creative in your analysis and investigate the data in a way that your classmates most likely will not. Creativity is an essential ingredient for a good data scientist!

| Section | Standard | Possible Points |
|---------|----------|-----------------|
| Introduction | 1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this? | 5 |
| | 1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed). | |
| | 1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem. | |
| | 1.4 Explain how your analysis will help the consumer of your analysis. | |
| Packages Required | 2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis. | 5 |
| | 2.2 Messages and warnings resulting from loading the packages are suppressed. | |
| | 2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages; don't assume that I know why you loaded each package). | |
| Data Preparation | 3.1 Original source where the data was obtained is cited and, if possible, hyperlinked. | 10 |
| | 3.2 Source data is thoroughly explained (i.e., what was the original purpose of the data, when was it collected, how many variables did the original have, and any peculiarities of the source data such as how missing values are recorded or how data was imputed). | |
| | 3.3 Data importing and cleaning steps are explained in the text (tell me why you are doing the data cleaning activities that you perform) and follow a logical process. | |
| | 3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible. | |
| | 3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code. | |
| Exploratory Data Analysis | 4.1 Uncover new information in the data that is not self-evident (i.e., do not just plot the data as it is; rather, slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information). | 10 |
| | 4.2 Provide findings in the form of plots and tables. Show me you can display findings in different ways. | |
| | 4.3 Graph(s) are carefully tuned for the desired purpose. One graph illustrates one primary point and is appropriately formatted (plot and axis titles, legend if necessary, appropriate scales, appropriate geoms used, etc.). | |
| | 4.4 Table(s) are carefully constructed to make it easy to perform important comparisons. Careful styling highlights important features. Size of table is appropriate. | |
| | 4.5 Insights obtained from the analysis are thoroughly, yet succinctly, explained. It is easy to see and understand the interesting findings that you uncovered. | |
| Summary | 6.1 Summarize the problem statement you addressed. | 5 |
| | 6.2 Summarize how you addressed this problem statement (the data used and the methodology employed). | |
| | 6.3 Summarize the interesting insights that your analysis provided. | |
| | 6.4 Summarize the implications to the consumer of your analysis. | |
| | 6.5 Discuss the limitations of your analysis and how you, or someone else, could improve or build on it. | |
| Formatting & Other Requirements | 7.1 Proper coding style is followed and code is well commented (see the section regarding style). | 15 |
| | 7.2 Coding is systematic: a complicated problem is broken down into sub-problems that are individually much simpler. Code is efficient, correct, and minimal. Code uses appropriate data structures (list, data frame, vector/matrix/array). Code checks for common errors. | |
| | 7.3 Achievement, mastery, cleverness, creativity: tools and techniques from the course are applied very competently and, perhaps, somewhat creatively. Perhaps the student has gone beyond what was expected and required, e.g., extraordinary effort, additional tools not addressed by this course, or an unusually sophisticated application of tools from the course. | |
| | 7.4 The .Rmd fully executes without any errors and the HTML produced matches the HTML report submitted by the student. | |

Total possible points: 50

Due no later than: Thursday, March 11, 2022, 5:59PM PT

I expect your report to tell a story with the data. I do not want you to just report some statistics that you find but, rather, to provide a coherent narrative of your findings. Here is an example of the type of report that I am looking for:

• AirBnB user pathways

You need to submit the HTML or PDF file, the .Rmd file that produced the HTML or PDF report, your data, and any other files your .Rmd file leverages (images, .bib file, etc.). Your submitted files should be named with the year, course number, last name, first & middle initial, and then "finalproject." For example, my file name would be: 2022_DSCI353_paparasa_finalproject.Rmd. I expect to be able to fully reproduce your report by knitting your .Rmd file.

http://uc-r.github.io/basics#style

http://rpubs.com/angiechen/234334

Any additional details regarding the final project will be provided in class.


---
title: "MidTermProject"
author: "GlobalMusicData"
date: "5/27/2022"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(knitr.duplicate.label = "allow")
```

## **Introduction**

Music is a language of expression; in today's world, many musicians make music to send specific messages to politicians and public figures.

The purpose of this project is to perform data analysis and visualization on the GlobalMusicData data set. The data set provides detailed information about artists along with their tracks, genres, and playlists, and contains data on track names, albums, playlists, genres, and much more for different artists going back to 1993. The data come from the Spotify API and were downloaded from Canvas; they are rich enough to let us figure out how popular different kinds of music are. The goal is to provide analytical statistics that help music lovers understand just what these data reveal.

We want to see how music becomes publicly popular and which types of music, such as pop, rap, and country, gain the most popularity around the world.

We used R to perform data analysis and visualization, explore and identify trends in the artists' tracks, and uncover insights through the following steps:

* Load Required Packages
* Clean Up and Prepare Data for Analysis
* Perform Exploratory Data Analysis
* Data Visualization

![](GP.png){width=20%}

## **More Data**

For more information about the Spotify Web API, click here: [API](https://developer.spotify.com/documentation/web-api/reference/#/)

## **Packages Required**

The following packages are loaded up front; the purpose of each is noted in the comment next to its `library()` call.

```{r, echo=TRUE, warning=FALSE, message=FALSE}

library(readr) #used to read csv file
library(plotly) #used to make interactive, publication-quality graphs.
library(tidyr) # used to tidy up data
library(GGally) #extension of ggplot2 with functions
library(prettydoc) # document themes for R Markdown
library(DT) # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library(lubridate) # used for date/time functions
library(magrittr) # used for piping
library(ggplot2) # used for data visualization
library(dplyr) # used for data manipulation

```

## **Data Preparation**

The following is the code used to evaluate the variables in the source data. We noted that there are a total of 32,833 observations and 33 variables in the data set; the key variables are described below.
```{r, echo=TRUE, warning=TRUE, message=FALSE}
# Importing the data
data <- read.csv("Global Music Data.csv")
```

```{r, echo=FALSE, warning=TRUE, message=FALSE}
# Find total number of observations
# nrow(data)

# Variable names and their descriptions
values_table1 <- rbind(
  c("track_id", "track_name", "track_artist", "track_popularity", "track_album_id",
    "track_album_name", "track_album_release_date", "playlist_name", "playlist_id",
    "playlist_genre", "playlist_subgenre", "danceability", "energy", "key", "loudness",
    "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence",
    "tempo", "duration_ms"),
  c("Unique ID for track",
    "Name of the track",
    "The artist for every track",
    "Song popularity (0-100), where higher is better",
    "Album unique ID",
    "Song album name",
    "Date when album released",
    "Name of playlist",
    "Playlist ID",
    "Playlist genre",
    "Playlist subgenre",
    "Describes how suitable a track is for dancing based on a combination of musical elements",
    "Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity",
    "The estimated overall key of the track",
    "The overall loudness of a track in decibels (dB)",
    "Modality (major or minor) of a track; major is represented by 1 and minor by 0",
    "Detects the presence of spoken words in a track",
    "A confidence measure from 0.0 to 1.0 of whether the track is acoustic",
    "Predicts whether a track contains no vocals",
    "Detects the presence of an audience in the recording",
    "A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track",
    "The overall estimated tempo of a track in beats per minute (BPM)",
    "Duration of song in milliseconds"))

fig_table1 <- plot_ly(
  type = 'table',
  columnorder = c(1, 2),
  columnwidth = c(12, 12),
  header = list(
    values = c('VARIABLES', 'DESCRIPTION'),
    line = list(color = '#506784'),
    fill = list(color = '#119DFF'),
    align = c('left', 'center'),
    font = list(color = 'white', size = 12),
    height = 40
  ),
  cells = list(
    values = values_table1,
    line = list(color = '#506784'),
    fill = list(color = c('#25FEFD', 'white')),
    align = c('left', 'left'),
    font = list(color = c('#506784'), size = 12),
    height = 30
  ))
fig_table1
```

## **Data Cleaning**

```{r}
# Computing summary statistics for the variables
datatable(summary(data))

# Identifying the data types of each variable (str() prints its output directly,
# so it is not wrapped in datatable())
str(data)
```
```{r}
# Identifying missing data

# Number of missing values in this data frame
sum(is.na(data))
```
```{r}
# Count the number of missing values per column
colSums(is.na(data))
```
```{r}
# Return the column names without missing values
names(which(colSums(is.na(data)) == 0))
```

```{r}
# Show the first 10 rows of the cleaned data set
datatable(head(data, 10), options = list(scrollX = TRUE, pageLength = 5))
```

```{r}
# Show the last 10 rows of the cleaned data set
datatable(tail(data, 10), options = list(scrollX = TRUE, pageLength = 5))
```
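
Rather than printing the full `summary()` output again, a more condensed view of a few variables of interest can be produced. This is a minimal sketch that assumes the `data` object imported above and the column names listed in the data dictionary.

```{r}
# A minimal sketch: one-row-per-variable summary of selected audio features
# (assumes the `data` object imported above).
audio_vars <- c("track_popularity", "danceability", "energy",
                "loudness", "tempo", "duration_ms")

var_summary <- data.frame(
  variable  = audio_vars,
  n_missing = sapply(data[audio_vars], function(x) sum(is.na(x))),
  min       = sapply(data[audio_vars], min,  na.rm = TRUE),
  mean      = sapply(data[audio_vars], mean, na.rm = TRUE),
  max       = sapply(data[audio_vars], max,  na.rm = TRUE),
  row.names = NULL
)

datatable(var_summary, options = list(pageLength = 6)) %>%
  formatRound(columns = c("min", "mean", "max"), digits = 2)
```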

We used the following code to tidy up our data:
```{r, echo=FALSE}
# Convert the album release date from string to date format
# data$track_album_release_date <- ymd_hms(data$track_album_release_date)

# Check for duplicate rows
# data[duplicated(data$track_id), ]

# Check for duplicate columns
# data[!duplicated(lapply(data, summary))]

# Count how many times each track_id occurs
# n_occur <- data.frame(table(data$track_id))

# Gives you a data frame with the track_ids and the number of times they occurred
# n_occur[n_occur$Freq > 1, ]

# Tells you which rows have track_ids that occurred more than once
# data[data$track_id %in% n_occur$Var1[n_occur$Freq > 1], ]

# Identifying missing data

# Number of missing values in this data frame
# sum(is.na(data))

# Count the number of missing values per column
# colSums(is.na(data))

# Identify the position of the columns with at least one missing value
# which(colSums(is.na(data)) > 0)

# Return the column names with missing values
# names(which(colSums(is.na(data)) > 0))
```
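
The tidying steps above are left commented out, so the chunk below is a small active sketch of what they might look like. It assumes the release dates are date-only strings (hence `ymd()` with truncated components rather than the `ymd_hms()` shown in the comments) and that duplicate tracks should be dropped by `track_id`.

```{r}
# A hedged sketch of the tidying steps (assumes the `data` object imported above):
# parse the album release date and keep only one row per track_id.
data_tidy <- data %>%
  mutate(track_album_release_date =
           ymd(track_album_release_date, truncated = 2)) %>%  # also handles "2019" or "2019-03"
  distinct(track_id, .keep_all = TRUE)

nrow(data) - nrow(data_tidy)  # how many duplicate track rows were dropped
```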

## **Proposed Exploratory Data Analysis and Data Visualization**

To begin exploring the data, we first look at how a few of the audio features relate to one another:

```{r}
pairs(~ danceability + energy + key + loudness, data = data,
      main = "Scatterplot Matrix for GlobalMusicData")
```

```{r}
ggplot(data, aes(x = playlist_genre, y = track_popularity)) +
  # customize bars
  geom_bar(color = "black",
           fill = "pink",
           width = 0.5,
           stat = "identity") +
  # add value labels
  geom_text(aes(label = track_popularity),
            vjust = -0.25) +
  # customize x and y axes and title
  ggtitle("Graph showing popularity by playlist genre") +
  xlab("Playlist genre") +
  ylab("Popularity of the track") +
  # change fonts
  theme(plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        axis.title.x = element_text(color = "black", size = 11, face = "bold"),
        axis.title.y = element_text(color = "black", size = 11, face = "bold"))
```
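
Because `stat = "identity"` adds up the popularity of every individual track within a genre, the bar heights above are sums rather than typical values. A hedged alternative sketch, using the same `data` object, summarizes mean popularity per genre before plotting:

```{r}
# Alternative sketch: average track popularity per playlist genre,
# computed before plotting so each bar shows a mean rather than a sum.
data %>%
  group_by(playlist_genre) %>%
  summarise(mean_popularity = mean(track_popularity, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(playlist_genre, -mean_popularity), y = mean_popularity)) +
  geom_col(fill = "pink", color = "black", width = 0.5) +
  geom_text(aes(label = round(mean_popularity, 1)), vjust = -0.25) +
  labs(title = "Average track popularity by playlist genre",
       x = "Playlist genre", y = "Mean track popularity")
```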

```{r}
# Bar chart of the number of tracks per playlist genre
ggplot(data, aes(x = playlist_genre)) + geom_bar()
```

```{r}
# Box plots
bp <- ggplot(data, aes(x = duration_ms, y = playlist_genre, fill = playlist_genre)) +
  geom_boxplot() +
  labs(title = "Plot of duration against playlist genre",
       x = "Duration (ms)",
       y = "Playlist genre")
bp + theme_classic()
```
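
The same plot is a little easier to read with duration converted from milliseconds to minutes; this is a sketch using the same `data` object:

```{r}
# Hedged variant: duration expressed in minutes so the x-axis is easier to read.
data %>%
  mutate(duration_min = duration_ms / 60000) %>%
  ggplot(aes(x = duration_min, y = playlist_genre, fill = playlist_genre)) +
  geom_boxplot() +
  labs(title = "Track duration by playlist genre",
       x = "Duration (minutes)", y = "Playlist genre") +
  theme_classic()
```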



