Whether „Driving Home for Christmas“, „Winter Wonderland“, „Let it snow!“ or „Last Christmas“ – every year christmas songs are taking over the charts again. While average Joe is joyfully putting on the next christmas song, the data scientist starts his journey of discovery through the snowy music history.

The data set comes from 55000+ Song Lyrics, which contains over 55,000+ songs. It is a data frame with 55,000+ rows and four columns:

artist
song
link
text

<br />
library(igraph)<br />
library(ggraph)<br />
library(ggplot2)<br />
library(wordcloud2)<br />
library(dplyr)<br />
library(widyr)<br />
library(tidytext)<br />
library(tm)<br />
library(stringr)<br />
library(topicmodels)<br />
library(reshape2)<br />
library(quanteda)<br />
library(Rtsne)<br />
library(DT)<br />
library(knitr)<br />
library(animation)<br />
library(ldatuning)</p>
<p>set.seed(201712)</p>
<p># Preperation ————————————————————-<br />
songs &lt;- read.csv(“songdata.csv”)<br />
songs$song &lt;- songs$song %&gt;% as.character()<br />
songs$artist &lt;- songs$artist %&gt;% as.character()<br />
songs$link &lt;- songs$link %&gt;% as.character()<br />
songs$text &lt;- songs$text %&gt;% as.character()

library(igraph)

library(ggraph)

library(ggplot2)

library(wordcloud2)

library(dplyr)

library(widyr)

library(tidytext)

library(tm)

library(stringr)

library(topicmodels)

library(reshape2)

library(quanteda)

library(Rtsne)

library(DT)

library(knitr)

library(animation)

library(ldatuning)

set.seed(201712)

# Preperation ————————————————————-

songs <– read.csv(“songdata.csv”)

songs$song <– songs$song %>% as.character()

songs$artist <– songs$artist %>% as.character()

songs$link <– songs$link %>% as.character()

songs$text <– songs$text %>% as.character()

Our goal is to perform a comprehensive analysis of the song texts to identify the Christmas songs. In order to do so, first we add an additional column to the data frame to give each song a label of either Christmas or Not Christmas, where every song which contains the words Christmas, Xmas or X-mas will be labeled as Christmas and otherwise as Not Christmas.

<br />
# Initialization of the Labels<br />
label &lt;- character(dim(songs)[1])</p>
<p>for(i in 1:dim(songs)[1]){<br />
  if(str_detect(songs$song[i], “Christmas”) |<br />
     str_detect(songs$song[i], “X-mas”) |<br />
     str_detect(songs$song[i], “Xmas”)){<br />
    label[i] &lt;- “Christmas”<br />
  } else{<br />
    label[i] &lt;- “Not Christmas”<br />
  }<br />
}</p>
<p>songs &lt;- songs %&gt;%<br />
  mutate(Label = label)</p>
<p>

# Initialization of the Labels

label <– character(dim(songs)[1])

for(i in 1:dim(songs)[1]){

if(str_detect(songs$song[i], “Christmas”) |

str_detect(songs$song[i], “X-mas”) |

str_detect(songs$song[i], “Xmas”)){

label[i] <– “Christmas”

} else{

label[i] <– “Not Christmas”

}

songs <– songs %>%

mutate(Label = label)

This is just the initialization of the labels, later we will apply Naive Bayes to a training set to identify the other Christmas songs. First of all, we will start by exploring the data set by means of some intuitive descriptive approaches.

<br />
D3Vis &lt;- function(edgeList, directed){<br />
  colnames(edgeList) &lt;- c(“SourceName”, “TargetName”, “Weight”)</p>
<p>  # Min-Max &amp; Inverse scaling, because the weights should represent distance/similarity<br />
  edgeList$Weight &lt;- 1 – edgeList$Weight<br />
  weight.min &lt;- edgeList$Weight %&gt;% min<br />
  weight.max &lt;- edgeList$Weight %&gt;% max</p>
<p>  edgeList$Weight &lt;- (edgeList$Weight – weight.min)/(weight.max – weight.min)</p>
<p>  # Create a graph. Use simplyfy to ensure that there are no duplicated edges or self loops<br />
  gD &lt;- igraph::simplify(igraph::graph.data.frame(edgeList, directed=directed))</p>
<p>  # Create a node list object (actually a data frame object) that will contain information about nodes<br />
  nodeList &lt;- data.frame(ID = c(0:(igraph::vcount(gD) – 1)), # because networkD3 library requires IDs to start at 0<br />
                         nName = igraph::V(gD)$name)</p>
<p>  # Map node names from the edge list to node IDs<br />
  getNodeID &lt;- function(x){<br />
    which(x == igraph::V(gD)$name) – 1 # to ensure that IDs start at 0<br />
  }<br />
  # And add them to the edge list<br />
  edgeList &lt;- plyr::ddply(edgeList, .variables = c(“SourceName”, “TargetName”, “Weight”),<br />
                          function (x) data.frame(SourceID = getNodeID(x$SourceName),<br />
                                                  TargetID = getNodeID(x$TargetName)))</p>
<p>  #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++#<br />
  # Calculate some node properties and node similarities that will be used to illustrate<br />
  # different plotting abilities and add them to the edge and node lists</p>
<p>  # Calculate degree for all nodes<br />
  nodeList &lt;- cbind(nodeList, nodeDegree=igraph::degree(gD, v = igraph::V(gD), mode = “all”))</p>
<p>  # Calculate betweenness for all nodes<br />
  betAll &lt;- igraph::betweenness(gD, v = igraph::V(gD), directed = directed) / (((igraph::vcount(gD) – 1) * (igraph::vcount(gD)-2)) / 2)<br />
  betAll.norm &lt;- (betAll – min(betAll))/(max(betAll) – min(betAll))<br />
  nodeList &lt;- cbind(nodeList, nodeBetweenness=100*betAll.norm) # We are scaling the value by multiplying it by 100 for visualization purposes only (to create larger nodes)<br />
  rm(betAll, betAll.norm)</p>
<p>  #Calculate Dice similarities between all pairs of nodes<br />
  dsAll &lt;- igraph::similarity.dice(gD, vids = igraph::V(gD), mode = “all”)</p>
<p>  F1 &lt;- function(x) {data.frame(diceSim = dsAll[x$SourceID +1, x$TargetID + 1])}<br />
  edgeList &lt;- plyr::ddply(edgeList, .variables=c(“SourceName”, “TargetName”, “Weight”, “SourceID”, “TargetID”),<br />
                          function(x) data.frame(F1(x)))</p>
<p>  rm(dsAll, F1, getNodeID, gD)</p>
<p>  #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++#<br />
  # We will also create a set of colors for each edge, based on their dice similarity values<br />
  # We’ll interpolate edge colors based on the using the “colorRampPalette” function, that<br />
  # returns a function corresponding to a collor palete of “bias” number of elements (in our case, that<br />
  # will be a total number of edges, i.e., number of rows in the edgeList data frame)<br />
  F2 &lt;- colorRampPalette(c(“#FFFF00”, “#FF0000”), bias = nrow(edgeList), space = “rgb”, interpolate = “linear”)<br />
  colCodes &lt;- F2(length(unique(edgeList$diceSim)))<br />
  edges_col &lt;- sapply(edgeList$diceSim, function(x) colCodes[which(sort(unique(edgeList$diceSim)) == x)])</p>
<p>  rm(colCodes, F2)</p>
<p>  #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++#<br />
  # revise transformation of the weights<br />
  edgeList$Weight &lt;- -(edgeList$Weight*(weight.max – weight.min) + weight.min – 1)</p>
<p>  # Let’s create a network<br />
  D3_network_LM &lt;- networkD3::forceNetwork(Links = edgeList, # data frame that contains info about edges<br />
                                           Nodes = nodeList, # data frame that contains info about nodes<br />
                                           Source = “SourceID”, # ID of source node<br />
                                           Target = “TargetID”, # ID of target node<br />
                                           Value = “Weight”, # value from the edge list (data frame) that will be used to value/weight relationship amongst nodes<br />
                                           NodeID = “nName”, # value from the node list (data frame) that contains node description we want to use (e.g., node name)<br />
                                           Nodesize = “nodeBetweenness”,  # value from the node list (data frame) that contains value we want to use for a node size<br />
                                           Group = “nodeDegree”,  # value from the node list (data frame) that contains value we want to use for node color<br />
                                           #                                         height = 500, # Size of the plot (vertical)<br />
                                           #                                         width = 1000,  # Size of the plot (horizontal)<br />
                                           fontSize = 20, # Font size<br />
                                           linkDistance = networkD3::JS(“function(d) { return 10*d.value; }”), # Function to determine distance between any two nodes, uses variables already defined in forceNetwork function (not variables from a data frame)<br />
                                           #                                         linkWidth = networkD3::JS(“function(d) { return d.value/5; }”),# Function to determine link/edge thickness, uses variables already defined in forceNetwork function (not variables from a data frame)<br />
                                           opacity = 0.85, # opacity<br />
                                           arrows = directed,<br />
                                           zoom = TRUE, # ability to zoom when click on the node<br />
                                           #                                         opacityNoHover = 0.1, # opacity of labels when static<br />
                                           legend = F,<br />
                                           linkColour = edges_col) # edge colors</p>
<p>  # Plot network<br />
  D3_network_LM<br />
}<br />
</code><a id=”user-content-exploration-the-initial-christmas-songs” class=”anchor” href=”https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#exploration-the-initial-christmas-songs” aria-hidden=”true”></a><a id=”user-content-cleaning–tokenization” class=”anchor” href=”https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#cleaning–tokenization” aria-hidden=”true”></a>

D3Vis <– function(edgeList, directed){

colnames(edgeList) <– c(“SourceName”, “TargetName”, “Weight”)

# Min-Max & Inverse scaling, because the weights should represent distance/similarity

edgeList$Weight <– 1 – edgeList$Weight

weight.min <– edgeList$Weight %>% min

weight.max <– edgeList$Weight %>% max

edgeList$Weight <– (edgeList$Weight – weight.min)/(weight.max – weight.min)

# Create a graph. Use simplyfy to ensure that there are no duplicated edges or self loops

gD <– igraph::simplify(igraph::graph.data.frame(edgeList, directed=directed))

# Create a node list object (actually a data frame object) that will contain information about nodes

nodeList <– data.frame(ID = c(0:(igraph::vcount(gD) – 1)), # because networkD3 library requires IDs to start at 0

nName = igraph::V(gD)$name)

# Map node names from the edge list to node IDs

getNodeID <– function(x){

which(x == igraph::V(gD)$name) – 1 # to ensure that IDs start at 0

}

# And add them to the edge list

edgeList <– plyr::ddply(edgeList, .variables = c(“SourceName”, “TargetName”, “Weight”),

function (x) data.frame(SourceID = getNodeID(x$SourceName),

TargetID = getNodeID(x$TargetName)))

#+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++#

# Calculate some node properties and node similarities that will be used to illustrate

# different plotting abilities and add them to the edge and node lists

# Calculate degree for all nodes

nodeList <– cbind(nodeList, nodeDegree=igraph::degree(gD, v = igraph::V(gD), mode = “all”))

# Calculate betweenness for all nodes

betAll <– igraph::betweenness(gD, v = igraph::V(gD), directed = directed) / (((igraph::vcount(gD) – 1) * (igraph::vcount(gD)–2)) / 2)

betAll.norm <– (betAll – min(betAll))/(max(betAll) – min(betAll))

nodeList <– cbind(nodeList, nodeBetweenness=100*betAll.norm) # We are scaling the value by multiplying it by 100 for visualization purposes only (to create larger nodes)

rm(betAll, betAll.norm)

#Calculate Dice similarities between all pairs of nodes

dsAll <– igraph::similarity.dice(gD, vids = igraph::V(gD), mode = “all”)

F1 <– function(x) {data.frame(diceSim = dsAll[x$SourceID +1, x$TargetID + 1])}

edgeList <– plyr::ddply(edgeList, .variables=c(“SourceName”, “TargetName”, “Weight”, “SourceID”, “TargetID”),

function(x) data.frame(F1(x)))

rm(dsAll, F1, getNodeID, gD)

#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++#

# We will also create a set of colors for each edge, based on their dice similarity values

# We’ll interpolate edge colors based on the using the “colorRampPalette” function, that

# returns a function corresponding to a collor palete of “bias” number of elements (in our case, that

# will be a total number of edges, i.e., number of rows in the edgeList data frame)

F2 <– colorRampPalette(c(“#FFFF00”, “#FF0000”), bias = nrow(edgeList), space = “rgb”, interpolate = “linear”)

colCodes <– F2(length(unique(edgeList$diceSim)))

edges_col <– sapply(edgeList$diceSim, function(x) colCodes[which(sort(unique(edgeList$diceSim)) == x)])

rm(colCodes, F2)

#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++#

# revise transformation of the weights

edgeList$Weight <– –(edgeList$Weight*(weight.max – weight.min) + weight.min – 1)

# Let’s create a network

D3_network_LM <– networkD3::forceNetwork(Links = edgeList, # data frame that contains info about edges

Nodes = nodeList, # data frame that contains info about nodes

Source = “SourceID”, # ID of source node

Target = “TargetID”, # ID of target node

Value = “Weight”, # value from the edge list (data frame) that will be used to value/weight relationship amongst nodes

NodeID = “nName”, # value from the node list (data frame) that contains node description we want to use (e.g., node name)

Nodesize = “nodeBetweenness”, # value from the node list (data frame) that contains value we want to use for a node size

Group = “nodeDegree”, # value from the node list (data frame) that contains value we want to use for node color

# height = 500, # Size of the plot (vertical)

# width = 1000, # Size of the plot (horizontal)

fontSize = 20, # Font size

linkDistance = networkD3::JS(“function(d) { return 10*d.value; }”), # Function to determine distance between any two nodes, uses variables already defined in forceNetwork function (not variables from a data frame)

# linkWidth = networkD3::JS(“function(d) { return d.value/5; }”),# Function to determine link/edge thickness, uses variables already defined in forceNetwork function (not variables from a data frame)

opacity = 0.85, # opacity

arrows = directed,

zoom = TRUE, # ability to zoom when click on the node

# opacityNoHover = 0.1, # opacity of labels when static

legend = F,

linkColour = edges_col) # edge colors

# Plot network

D3_network_LM

}

</code><a id=“user-content-exploration-the-initial-christmas-songs” class=“anchor” href=“https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#exploration-the-initial-christmas-songs” aria–hidden=“true”></a><a id=“user-content-cleaning–tokenization” class=“anchor” href=“https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#cleaning–tokenization” aria–hidden=“true”></a>

We should start with the data cleaning and tokenization. Afterwards, the Christmas songs will be selected and saved as a variable.

<br />
songs.unnest &lt;- songs %&gt;%<br />
  unnest_tokens(word, text) %&gt;%<br />
  anti_join(tibble(word = stop_words$word)) %&gt;%<br />
  filter(!str_detect(word, “\\d+”))</p>
<p>xmas.unnest &lt;- songs.unnest %&gt;%<br />
  filter(Label == “Christmas”)<br />
</code><a id=”user-content-correlation-analysis” class=”anchor” href=”https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#correlation-analysis” aria-hidden=”true”></a>

songs.unnest <– songs %>%

unnest_tokens(word, text) %>%

anti_join(tibble(word = stop_words$word)) %>%

filter(!str_detect(word, “\\d+”))

xmas.unnest <– songs.unnest %>%

filter(Label == “Christmas”)

</code><a id=“user-content-correlation-analysis” class=“anchor” href=“https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#correlation-analysis” aria–hidden=“true”></a>

Now we can start analyzing the initial Christmas songs by means of correlations from different perspectives. In the following, we will visualize the correlations with the networkD3 html widget where nodes with the same total number of connections will be given the same color and the color of the edge implies the number of common neighbors shared by two nodes. Moreover, the size of a node indicates the centrality of it, which is defined by the betweenness, i.e. the number of shortest paths going through it. Where the distance between two nodes is the minimum maximum transformation of 1 minus the correlation, which makes sense because intuitively the higher the correlation, the nearer two nodes should be. Moreover, the shorter the distance, the wider the edge.

Note that the correlations are always based on lyrics.

The correlation between words which appeared more than 100 times and are correlated with at least one other word with a correlation greater than 0.55.

<br />
correlation.words &lt;- xmas.unnest %&gt;%<br />
  group_by(word) %&gt;%<br />
  filter(n() &gt; 100) %&gt;%<br />
  ungroup() %&gt;%<br />
  pairwise_cor(word, song, sort = T)</p>
<p># Network visualization<br />
correlation.words %&gt;%<br />
  filter(correlation &gt; 0.55) %&gt;%<br />
  D3Vis(directed = F)<br />
</code><a id=”user-content-correlation-between-songs” class=”anchor” href=”https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#correlation-between-songs” aria-hidden=”true”></a>

correlation.words <– xmas.unnest %>%

group_by(word) %>%

filter(n() > 100) %>%

ungroup() %>%

pairwise_cor(word, song, sort = T)

# Network visualization

correlation.words %>%

filter(correlation > 0.55) %>%

D3Vis(directed = F)

</code><a id=“user-content-correlation-between-songs” class=“anchor” href=“https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#correlation-between-songs” aria–hidden=“true”></a>

The correlation between songs which are correlated with at least 3 other songs with a correlation greater than 0.75. With this, we may detect similiar or just slightly modified songs.

<br />
correlation.songs &lt;- xmas.unnest %&gt;%<br />
  pairwise_cor(song, word, sort = T)</p>
<p># Network visualization<br />
correlation.songs %&gt;%<br />
  filter(correlation &gt; 0.75) %&gt;%<br />
  group_by(item1) %&gt;%<br />
  filter(n() &gt;= 3) %&gt;%<br />
  ungroup() %&gt;%<br />
  D3Vis(directed = F)<br />
</code><a id=”user-content-correlation-between-certain-words” class=”anchor” href=”https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#correlation-between-certain-words” aria-hidden=”true”></a>

correlation.songs <– xmas.unnest %>%

pairwise_cor(song, word, sort = T)

# Network visualization

correlation.songs %>%

filter(correlation > 0.75) %>%

group_by(item1) %>%

filter(n() >= 3) %>%

ungroup() %>%

D3Vis(directed = F)

</code><a id=“user-content-correlation-between-certain-words” class=“anchor” href=“https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#correlation-between-certain-words” aria–hidden=“true”></a>

The correlation between certain words

<br />
correlation.words %&gt;%<br />
  filter(item1 == “christus” |<br />
           item1 == “jesus”  |<br />
           item1 == “snow”   |<br />
           item1 == “reindeer” |<br />
           item1 == “home”   |<br />
           item1 == “holy”   |<br />
           item1 == “love”   |<br />
           item1 == “tree”   |<br />
           item1 == “white”  |<br />
           item1 == “christmas”,<br />
         correlation &gt; 0.4) %&gt;%<br />
  D3Vis(directed = F)<br />
</code><a id=”user-content-correlation-between-artists” class=”anchor” href=”https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#correlation-between-artists” aria-hidden=”true”></a>

correlation.words %>%

filter(item1 == “christus” |

item1 == “jesus” |

item1 == “snow” |

item1 == “reindeer” |

item1 == “home” |

item1 == “holy” |

item1 == “love” |

item1 == “tree” |

item1 == “white” |

item1 == “christmas”,

correlation > 0.4) %>%

D3Vis(directed = F)

</code><a id=“user-content-correlation-between-artists” class=“anchor” href=“https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#correlation-between-artists” aria–hidden=“true”></a>

The correlation between artists

<br />
correlation.artists &lt;- xmas.unnest %&gt;%<br />
  pairwise_cor(artist, word, sort = T)</p>
<p># Network Visualization<br />
correlation.artists %&gt;%<br />
  filter(correlation &gt; 0.8) %&gt;%<br />
  group_by(item1) %&gt;%<br />
  filter(n() &gt;= 3) %&gt;%<br />
  ungroup() %&gt;%<br />
  D3Vis(directed = F)<br />
</code><a id=”user-content-word-cloud” class=”anchor” href=”https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#word-cloud” aria-hidden=”true”></a>

correlation.artists <– xmas.unnest %>%

pairwise_cor(artist, word, sort = T)

# Network Visualization

correlation.artists %>%

filter(correlation > 0.8) %>%

group_by(item1) %>%

filter(n() >= 3) %>%

ungroup() %>%

D3Vis(directed = F)

</code><a id=“user-content-word-cloud” class=“anchor” href=“https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#word-cloud” aria–hidden=“true”></a>

Word cloud of the initial Christmas songs

<br />
xmas.cloud &lt;- xmas.unnest %&gt;%<br />
  count(word) %&gt;%<br />
  as.data.frame()</p>
<p>xmas.cloud %&gt;%<br />
  wordcloud2(minSize = 3, shape = ‘star’)<br />
</code><a id=”user-content-naive-bayes” class=”anchor” href=”https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#naive-bayes” aria-hidden=”true”></a>

xmas.cloud <– xmas.unnest %>%

count(word) %>%

as.data.frame()

xmas.cloud %>%

wordcloud2(minSize = 3, shape = ‘star’)

</code><a id=“user-content-naive-bayes” class=“anchor” href=“https://github.com/eodaGmbH/christmas_songs/blob/master/notebook.Rmd#naive-bayes” aria–hidden=“true”></a>

Naive Bayes is a popular supervised machine learning algorithm to handle classification problems with a huge amount of features. It is „naive“ in the sense that, conditioned on a class, the features are assumed to be independently distributed. In our case, we would like to know, given a bunch of features, i.e. the tf-idf of words in a document, whether a song should be classified as Christmas song or not by Naive Bayes.

The harder part of constructing the maximum likelihood estimator is the choice of the prior distribution, i.e. the probability distribution of the classes. Where it is usually assumed to be uniformly distributed or estimated by the class frequencies. In our case the multinomial distribution for the likelihood and the uniform distribution for the prior are used, which means we have no prejudice regarding the categorization of the songs without given further information.

<br />
# Document Feature Matrix<br />
songs.dfm.tfidf &lt;- corpus(songs, text_field = “text”,<br />
                          docid_field = “song”) %&gt;%<br />
  dfm(tolower = T,<br />
      stem = TRUE,<br />
      remove_punct = TRUE,<br />
      remove = stopwords(“english”)) %&gt;%<br />
  dfm_trim(min_count = 5, min_docfreq = 3) %&gt;%<br />
  dfm_weight(type = “tfidf”)</p>
<p># Determine the Indizes for the training set<br />
christmas.index &lt;- which(label == “Christmas”)<br />
not_christmas.index &lt;- which(label == “Not Christmas”)</p>
<p>christmas.train.index &lt;- christmas.index<br />
not_christmas.train.index &lt;- sample(not_christmas.index, length(christmas.index))</p>
<p>train.index &lt;- c(christmas.train.index, not_christmas.train.index)</p>
<p>label.train &lt;- label[train.index]</p>
<p>trainning.set &lt;- songs.dfm.tfidf[train.index, ]</p>
<p># Train the Model<br />
classifier_NB &lt;- textmodel_NB(trainning.set, label.train)</p>
<p># Prediction<br />
predictions &lt;- classifier_NB %&gt;%<br />
  predict(newdata = songs.dfm.tfidf)</p>
<p># Confusion Matrix<br />
confusion &lt;- table(predictions$nb.predicted, label)</p>
<p>confusion<br />

# Document Feature Matrix

songs.dfm.tfidf <– corpus(songs, text_field = “text”,

docid_field = “song”) %>%

dfm(tolower = T,

stem = TRUE,

remove_punct = TRUE,

remove = stopwords(“english”)) %>%

dfm_trim(min_count = 5, min_docfreq = 3) %>%

dfm_weight(type = “tfidf”)

# Determine the Indizes for the training set

christmas.index <– which(label == “Christmas”)

not_christmas.index <– which(label == “Not Christmas”)

christmas.train.index <– christmas.index

not_christmas.train.index <– sample(not_christmas.index, length(christmas.index))

train.index <– c(christmas.train.index, not_christmas.train.index)

label.train <– label[train.index]

trainning.set <– songs.dfm.tfidf[train.index, ]

# Train the Model

classifier_NB <– textmodel_NB(trainning.set, label.train)

# Prediction

predictions <– classifier_NB %>%

predict(newdata = songs.dfm.tfidf)

# Confusion Matrix

confusion <– table(predictions$nb.predicted, label)

confusion

So we have identified 2965 hidden Christmas songs and there are 2 songs out of the initial 500 Christmas songs that are rejected by Naive Bayes as Christmas songs.

Explore the hidden Christmas songs

<br />
#Determine the Indizes for the hidden (not) Christmas Songs.<br />
hidden.index &lt;- (predictions$nb.predicted == “Christmas”) &amp; (songs$Label == “Not Christmas”)<br />
hidden_not.index &lt;- (predictions$nb.predicted == “Not Christmas”) &amp; (songs$Label == “Christmas”)</p>
<p># Change the labels<br />
label[hidden.index] &lt;- “Hidden Christmas”<br />
label[hidden_not.index] &lt;- “Hidden Not Christmas”</p>
<p>songs$Label &lt;- label<br />
songs.dfm.tfidf@docvars$Label &lt;- label</p>
<p># Wordcloud for the hidden Christmas Songs<br />
hidden.xmas &lt;- songs[hidden.index, ]</p>
<p>hidden.unnest &lt;- hidden.xmas %&gt;%<br />
  unnest_tokens(word, text) %&gt;%<br />
  anti_join(tibble(word = stop_words$word)) %&gt;%<br />
  filter(!str_detect(word, “\\d+”))</p>
<p>hidden.unnest %&gt;%<br />
  count(word) %&gt;%<br />
  filter(n &gt;= 5) %&gt;%<br />
  as.data.frame() %&gt;%<br />
  wordcloud2(shape = “star”, minSize = 5)<br />

#Determine the Indizes for the hidden (not) Christmas Songs.

hidden.index <– (predictions$nb.predicted == “Christmas”) & (songs$Label == “Not Christmas”)

hidden_not.index <– (predictions$nb.predicted == “Not Christmas”) & (songs$Label == “Christmas”)

# Change the labels

label[hidden.index] <– “Hidden Christmas”

label[hidden_not.index] <– “Hidden Not Christmas”

songs$Label <– label

songs.dfm.tfidf@docvars$Label <– label

# Wordcloud for the hidden Christmas Songs

hidden.xmas <– songs[hidden.index, ]

hidden.unnest <– hidden.xmas %>%

unnest_tokens(word, text) %>%

anti_join(tibble(word = stop_words$word)) %>%

filter(!str_detect(word, “\\d+”))

hidden.unnest %>%

count(word) %>%

filter(n >= 5) %>%

as.data.frame() %>%

wordcloud2(shape = “star”, minSize = 5)

</p>
<p># Correlation<br />
hidden.correlation.words &lt;- hidden.unnest %&gt;%<br />
  group_by(word) %&gt;%<br />
  filter(n() &gt; 15) %&gt;%<br />
  ungroup() %&gt;%<br />
  pairwise_cor(word, song, sort = T)</p>
<p># Network visualization<br />
hidden.correlation.words %&gt;%<br />
  filter(correlation &gt; 0.65) %&gt;%<br />
  group_by(item1) %&gt;%<br />
  filter(n() &gt;= 20) %&gt;%<br />
  ungroup() %&gt;%<br />
  D3Vis(directed = F)<br />

# Correlation

hidden.correlation.words <– hidden.unnest %>%

group_by(word) %>%

filter(n() > 15) %>%

ungroup() %>%

pairwise_cor(word, song, sort = T)

# Network visualization

hidden.correlation.words %>%

filter(correlation > 0.65) %>%

group_by(item1) %>%

filter(n() >= 20) %>%

ungroup() %>%

D3Vis(directed = F)

We have therefore successfully identified a bunch of religous christmas songs, whose titles usually do not contain the word „Christmas“ or „X-mas“.

Latent Dirichtlet Allocation & t-Statistics Stochastic Neighbor Embedding

Data Preparation

Only the top 300 features for the Christmas songs including the hidden ones will be used to calculate the Rtsne & LDA, else the memory space will not be sufficient.

<br />
xmas.dfm.tfidf &lt;- songs.dfm.tfidf %&gt;%<br />
  dfm_subset(Label == “Christmas” | Label == “Hidden Christmas”)</p>
<p>songs.dfm.tfidf_300 &lt;- songs.dfm.tfidf %&gt;%<br />
  dfm_select(pattern = xmas.dfm.tfidf %&gt;%<br />
               topfeatures(300) %&gt;%<br />
               names(), selection = “keep”)</p>
<p>xmas.dfm.tfidf_300 &lt;- xmas.dfm.tfidf %&gt;%<br />
  dfm_select(pattern = xmas.dfm.tfidf %&gt;%<br />
               topfeatures(300) %&gt;%<br />
               names(), selection = “keep”)<br />

xmas.dfm.tfidf <– songs.dfm.tfidf %>%

dfm_subset(Label == “Christmas” | Label == “Hidden Christmas”)

songs.dfm.tfidf_300 <– songs.dfm.tfidf %>%

dfm_select(pattern = xmas.dfm.tfidf %>%

topfeatures(300) %>%

names(), selection = “keep”)

xmas.dfm.tfidf_300 <– xmas.dfm.tfidf %>%

dfm_select(pattern = xmas.dfm.tfidf %>%

topfeatures(300) %>%

names(), selection = “keep”)

LDA

LDA stands for Latent Dirichtlet Allocation, which was introduced in Blei, Ng, Jordan (2003). It is a generative probabilistic model of a corpus, where the documents are represented as random mixtures over latent topics and for a single document there are usually only a few topics that are assigned unneglectable probabilities. Moreover, each topic is characterized by a distribution over words, where usually only a small set of words will be assigned significant probabilities for a certain topic. Either the variational expectation maximization algorithm or Gibbs sampling is used for the statistical inference of the parameters.

LDA requires a fixed number of topics, i.e. it assumes that the number of topics should already be known before applying the algorithm. However, there are possibilities to determine the optimal number of topics by different performance metrics, see Nikita, by using the package ldatuning.

<br />
OptimalNumber &lt;- FindTopicsNumber(LDA_xmas &lt;- xmas.dfm.tfidf_300 %&gt;%<br />
                                    convert(to = “topicmodels”),<br />
                                  topics = seq(2, 8, by = 1),<br />
                                  mc.cores = 2,<br />
                                  metrics = c(“CaoJuan2009”, “Arun2010”, “Deveaud2014”),<br />
                                  method = “VEM”,<br />
                                  verbose = T)<br />

OptimalNumber <– FindTopicsNumber(LDA_xmas <– xmas.dfm.tfidf_300 %>%

convert(to = “topicmodels”),

topics = seq(2, 8, by = 1),

mc.cores = 2,

metrics = c(“CaoJuan2009”, “Arun2010”, “Deveaud2014”),

method = “VEM”,

verbose = T)

<br />
FindTopicsNumber_plot(OptimalNumber)<br />

1 2	FindTopicsNumber_plot(OptimalNumber)

Therefore, we will choose 8 as the optimal number of topics.

<br />
LDA_xmas &lt;- xmas.dfm.tfidf_300 %&gt;%<br />
  convert(to = “topicmodels”) %&gt;%<br />
  LDA(k = 8)<br />

LDA_xmas <– xmas.dfm.tfidf_300 %>%

convert(to = “topicmodels”) %>%

LDA(k = 8)

We may use the package tidytext to inspect the topic probability distribution of each document, i.e. for each document the sum of the probabilities that it belongs to a topic from 1 to 8 is equal to 1.

<br />
LDA_xmas %&gt;%<br />
  tidy(matrix = “gamma”) %&gt;%<br />
  datatable(rownames = F)<br />

LDA_xmas %>%

tidy(matrix = “gamma”) %>%

datatable(rownames = F)

Analogously, we can also obtain the probability distribution of words for each topic, i.e. for each topic the sum of probabilities that it generates different words is equal to 1.

<br />
LDA_xmas %&gt;%<br />
  tidy(matrix = “beta”) %&gt;%<br />
  datatable(rownames = F)<br />

LDA_xmas %>%

tidy(matrix = “beta”) %>%

datatable(rownames = F)

The top terms for each topic are:

<br />
# LDA for the Christmas songs<br />
terms(LDA_xmas, 10)<br />

# LDA for the Christmas songs

terms(LDA_xmas, 10)

t-SNE

Developed by van der Maaten and Hinton (2008), t-SNE stands for t-Statistics Stochastic Neighborhood Embedding, which is a dimensionality reduction technique that is formulated to capture the local clustering structure of the original data points. It is non-linear and non-deterministic.

The following computation will take about 30 minutes.

<br />
# t-Statistics Stochastic Neighbor Embedding ——————————–<br />
index.unique.songs &lt;- !songs.dfm.tfidf_300 %&gt;%<br />
  as.matrix() %&gt;%<br />
  duplicated()</p>
<p>songs.unique &lt;- songs.dfm.tfidf_300[index.unique.songs, ] %&gt;% as.matrix()</p>
<p>tsne.all &lt;- Rtsne(songs.unique)</p>
<p>songs_2d &lt;- tsne.all$Y %&gt;%<br />
  as.data.frame() %&gt;%<br />
  mutate(Label = label[index.unique.songs])<br />

# t-Statistics Stochastic Neighbor Embedding ——————————–

index.unique.songs <– !songs.dfm.tfidf_300 %>%

as.matrix() %>%

duplicated()

songs.unique <– songs.dfm.tfidf_300[index.unique.songs, ] %>% as.matrix()

tsne.all <– Rtsne(songs.unique)

songs_2d <– tsne.all$Y %>%

as.data.frame() %>%

mutate(Label = label[index.unique.songs])

<br />
songs_2d %&gt;%<br />
  ggplot(aes(x = V1, y = V2, color = Label)) +<br />
  geom_point(size = 0.25) +<br />
  scale_color_manual(values = c(“Not Christmas” = “#a6a6a6”,<br />
                                “Christmas” = “#88ab33”,<br />
                                “Hidden Christmas” = “#F98948”,<br />
                                “Hidden Not Christmas” = “#437F97”)) +<br />
  guides(color = guide_legend(override.aes = list(size = 5))) +<br />
  theme(panel.grid.major = element_blank(),<br />
        panel.grid.minor = element_blank(),<br />
        panel.background = element_rect(fill = “black”))<br />

songs_2d %>%

ggplot(aes(x = V1, y = V2, color = Label)) +

geom_point(size = 0.25) +

scale_color_manual(values = c(“Not Christmas” = “#a6a6a6”,

“Christmas” = “#88ab33”,

“Hidden Christmas” = “#F98948”,

“Hidden Not Christmas” = “#437F97”)) +

guides(color = guide_legend(override.aes = list(size = 5))) +

theme(panel.grid.major = element_blank(),

panel.grid.minor = element_blank(),

panel.background = element_rect(fill = “black”))

What if we repeat the procedure for more than one iteration?

So far we have only run the Naive Bayes for one iteration. However, we may repeat this procedure for more than one iteration, i.e. train a Naive Bayes classifier and relabel all the false positives as Hidden Christmas/Christmas and all the false negatives as Hidden Not Christmas/Not Christmas over and over again.

First of all, we prepare the data again to avoid bugs.

<br />
songs &lt;- read.csv(“songdata.csv”)<br />
songs$song &lt;- songs$song %&gt;% as.character()<br />
songs$artist &lt;- songs$artist %&gt;% as.character()<br />
songs$link &lt;- songs$link %&gt;% as.character()<br />
songs$text &lt;- songs$text %&gt;% as.character()</p>
<p># Initialization of the Labels<br />
label &lt;- character(dim(songs)[1])</p>
<p>for(i in 1:dim(songs)[1]){<br />
  if(str_detect(songs$song[i], “Christmas”) |<br />
     str_detect(songs$song[i], “X-mas”) |<br />
     str_detect(songs$song[i], “Xmas”)){<br />
    label[i] &lt;- “Christmas”<br />
  } else{<br />
    label[i] &lt;- “Not Christmas”<br />
  }<br />
}</p>
<p>songs &lt;- songs %&gt;%<br />
  mutate(Label = label)</p>
<p>songs.dfm.tfidf &lt;- corpus(songs, text_field = “text”,<br />
                          docid_field = “song”) %&gt;%<br />
  dfm(tolower = T,<br />
      stem = TRUE,<br />
      remove_punct = TRUE,<br />
      remove = stopwords(“english”)) %&gt;%<br />
  dfm_trim(min_count = 5, min_docfreq = 3) %&gt;%<br />
  dfm_weight(type = “tfidf”)</p>
<p>results &lt;- data.frame(precision = numeric(10),<br />
                      recall = numeric(10),<br />
                      f1_score = numeric(10))<br />

songs <– read.csv(“songdata.csv”)

songs$song <– songs$song %>% as.character()

songs$artist <– songs$artist %>% as.character()

songs$link <– songs$link %>% as.character()

songs$text <– songs$text %>% as.character()

# Initialization of the Labels

label <– character(dim(songs)[1])

for(i in 1:dim(songs)[1]){

if(str_detect(songs$song[i], “Christmas”) |

str_detect(songs$song[i], “X-mas”) |

str_detect(songs$song[i], “Xmas”)){

label[i] <– “Christmas”

} else{

label[i] <– “Not Christmas”

}

songs <– songs %>%

mutate(Label = label)

songs.dfm.tfidf <– corpus(songs, text_field = “text”,

docid_field = “song”) %>%

dfm(tolower = T,

stem = TRUE,

remove_punct = TRUE,

remove = stopwords(“english”)) %>%

dfm_trim(min_count = 5, min_docfreq = 3) %>%

dfm_weight(type = “tfidf”)

results <– data.frame(precision = numeric(10),

recall = numeric(10),

f1_score = numeric(10))

Run 10 iterations.

<br />
for(i in 1:10){<br />
  # Determine the Indizes<br />
  christmas.index &lt;- which(label == “Christmas”)<br />
  not_christmas.index &lt;- which(label == “Not Christmas”)</p>
<p>  if(length(christmas.index) &lt; length(not_christmas.index)){<br />
    christmas.train.index &lt;- christmas.index<br />
    not_christmas.train.index &lt;- sample(not_christmas.index, length(christmas.index))<br />
  } else{<br />
    not_christmas.train.index &lt;- not_christmas.index<br />
    christmas.train.index &lt;- sample(christmas.index, length(not_christmas.index))<br />
  }</p>
<p>  train.index &lt;- c(christmas.train.index, not_christmas.train.index)</p>
<p>  label.train &lt;- label[train.index]</p>
<p>  trainning.set &lt;- songs.dfm.tfidf[train.index, ]</p>
<p>  # Train the Model<br />
  classifier_NB &lt;- textmodel_NB(trainning.set, label.train)</p>
<p>  # Prediction<br />
  predictions &lt;- classifier_NB %&gt;%<br />
    predict(newdata = songs.dfm.tfidf)</p>
<p>  # Confusion Matrix<br />
  confusion &lt;- table(predictions$nb.predicted, label)<br />
  precision &lt;- confusion[1, 1]/sum(confusion[1, ])<br />
  recall &lt;- confusion[1, 1]/sum(confusion[, 1])<br />
  f1_score &lt;- 2*precision*recall/(precision + recall)</p>
<p>  # The hidden (not) Christmas Songs ———————————————-<br />
  hidden.index &lt;- (predictions$nb.predicted == “Christmas”) &amp; (songs$Label == “Not Christmas”)<br />
  hidden_not.index &lt;- (predictions$nb.predicted == “Not Christmas”) &amp; (songs$Label == “Christmas”)</p>
<p>  hidden.xmas &lt;- songs[hidden.index, ]<br />
  hidden.not_xmas &lt;- songs[hidden_not.index, ]</p>
<p>  label[hidden.index] &lt;- “Hidden Christmas”<br />
  label[hidden_not.index] &lt;- “Hidden Not Christmas”</p>
<p>  songs_2d &lt;- tsne.all$Y %&gt;%<br />
    as.data.frame() %&gt;%<br />
    mutate(Label = label[index.unique.songs])</p>
<p>  random.forest &lt;- songs_2d %&gt;%<br />
    ggplot(aes(x = V1, y = V2, color = Label)) +<br />
    geom_point(size = 0.25) +<br />
    scale_color_manual(values = c(“Not Christmas” = “#a6a6a6”,<br />
                                  “Christmas” = “#88ab33”,<br />
                                  “Hidden Christmas” = “#F98948”,<br />
                                  “Hidden Not Christmas” = “#437F97”)) +<br />
    guides(color = guide_legend(override.aes = list(size = 5))) +<br />
    theme(panel.grid.major = element_blank(),<br />
          panel.grid.minor = element_blank(),<br />
          panel.background = element_rect(fill = “black”)) +<br />
    ggtitle(paste(“Iteration:”, i))</p>
<p>  # Change the labels</p>
<p>  label[hidden.index] &lt;- “Christmas”<br />
  label[hidden_not.index] &lt;- “Not Christmas”</p>
<p>  songs$Label &lt;- label</p>
<p>  songs.dfm.tfidf@docvars$Label &lt;- label</p>
<p>  results[i, ] &lt;- c(precision, recall, f1_score)<br />
  plot(random.forest)<br />
}<br />

for(i in 1:10){

# Determine the Indizes

christmas.index <– which(label == “Christmas”)

not_christmas.index <– which(label == “Not Christmas”)

if(length(christmas.index) < length(not_christmas.index)){

christmas.train.index <– christmas.index

not_christmas.train.index <– sample(not_christmas.index, length(christmas.index))

} else{

not_christmas.train.index <– not_christmas.index

christmas.train.index <– sample(christmas.index, length(not_christmas.index))

}

train.index <– c(christmas.train.index, not_christmas.train.index)

label.train <– label[train.index]

trainning.set <– songs.dfm.tfidf[train.index, ]

# Train the Model

classifier_NB <– textmodel_NB(trainning.set, label.train)

# Prediction

predictions <– classifier_NB %>%

predict(newdata = songs.dfm.tfidf)

# Confusion Matrix

confusion <– table(predictions$nb.predicted, label)

precision <– confusion[1, 1]/sum(confusion[1, ])

recall <– confusion[1, 1]/sum(confusion[, 1])

f1_score <– 2*precision*recall/(precision + recall)

# The hidden (not) Christmas Songs ———————————————-

hidden.index <– (predictions$nb.predicted == “Christmas”) & (songs$Label == “Not Christmas”)

hidden_not.index <– (predictions$nb.predicted == “Not Christmas”) & (songs$Label == “Christmas”)

hidden.xmas <– songs[hidden.index, ]

hidden.not_xmas <– songs[hidden_not.index, ]

label[hidden.index] <– “Hidden Christmas”

label[hidden_not.index] <– “Hidden Not Christmas”

songs_2d <– tsne.all$Y %>%

as.data.frame() %>%

mutate(Label = label[index.unique.songs])

random.forest <– songs_2d %>%

ggplot(aes(x = V1, y = V2, color = Label)) +

geom_point(size = 0.25) +

scale_color_manual(values = c(“Not Christmas” = “#a6a6a6”,

“Christmas” = “#88ab33”,

“Hidden Christmas” = “#F98948”,

“Hidden Not Christmas” = “#437F97”)) +

guides(color = guide_legend(override.aes = list(size = 5))) +

theme(panel.grid.major = element_blank(),

panel.grid.minor = element_blank(),

panel.background = element_rect(fill = “black”)) +

ggtitle(paste(“Iteration:”, i))

# Change the labels

label[hidden.index] <– “Christmas”

label[hidden_not.index] <– “Not Christmas”

songs$Label <– label

songs.dfm.tfidf@docvars$Label <– label

results[i, ] <– c(precision, recall, f1_score)

plot(random.forest)

}

<br />
results %&gt;%<br />
  mutate(index = 1:10) %&gt;%<br />
  melt(id = “index”) %&gt;%<br />
  ggplot(aes(x = index, y = value, color = variable)) +<br />
  geom_line()<br />

results %>%

mutate(index = 1:10) %>%

melt(id = “index”) %>%

ggplot(aes(x = index, y = value, color = variable)) +

geom_line()

Then the precision as well as the f1 score grow monotonically at first and then converge to a value around 0.95, which means there are not many „Hidden Christmas Songs“ and „Hidden Not Christmas Songs“ left to be detected. However, in this procedure we always believe that the Naive Bayes classifier is 100% accurate, which is hardly possible. Thus, in each iteration there are some songs falsely classified by Naive Bayes as „Christmas“, which will be used in the next iteration in the training set to train the Naive Bayes classifier. With this accumulating error we might have the apprehension that the results are actually worse with more iterations.

<br />
confusion<br />

confusion

At the end we have roughly half of the songs classified as „Christmas“ and the other half as „Not Christmas“, which seems very implausible. It raises the question whether or not there is an optimal number of iterations, however, we simply can not manually control whether all the 57,650 songs are correctly classified or not. This remains an open question to be answered.

For further information and visualizations please visit us on github and download the notebook.