Clustering cities using daily patterns of human activities on the website
1) Basic knowledge of Clusters
2) Basic knowledge of Google Analytics
Holy visibility, Batman! Deriving insights with advanced analysis tools like R is a seriously undervalued aspect of web analytics. Too often, we fall into the trap of thinking that Google Analytics reports are sufficient to give us all the insights we need. We need to push past the boundaries of Google Analytics and try more advanced analysis with data analysis tools like R.
Our team has started taking steps towards advanced analysis using R. It is impressive how R can surface advanced insights that are still lacking in web analytics.
Clustering is one technique we can use for this kind of analysis. It belongs to the class of unsupervised learning methods. Hierarchical clustering is one of the most popular clustering methods across a variety of domains. In marketing, we can use hierarchical clustering to create market, customer, or product segments.
The idea of a clustering algorithm is to group similar data points together.
Let’s have a look at an example in R using the Google Analytics city report. The dataset is the city report of an e-commerce website.
The data comes from the following report in Google Analytics:
Geo ==> Location ==> City Report
# Read in the dataset (data from 19 Sep 2017 to 18 Oct 2017)
GaData = read.csv("data.csv")
# Look at summary
summary(GaData)
As we can see, this gives a summary of the various metrics across cities. For example, in the summary output above, the variable Revenue has a minimum value of 0 and a maximum value of 458,487. In other words, at least one city generated no revenue, one city generated 458,487, and the other cities fall somewhere in between.
# Compute distances
distances = dist(GaData, method = "euclidean")
# Hierarchical clustering
clusterCities = hclust(distances, method = "ward.D")
# Plot the dendrogram
plot(clusterCities, labels = GaData$City)
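One thing worth keeping in mind with the distance step: Euclidean distance is dominated by whichever metric has the largest numbers, so a column like Revenue can easily drown out a column like BounceRate. A common way to handle this is to drop the non-numeric City column and standardise the metrics before calling dist(). Here is a minimal sketch of that idea. It assumes a small made-up data frame called toy in place of data.csv; the city names and values are hypothetical, and this is a variation on the steps above rather than what was run for this post.

```r
# Toy stand-in for the city report. All values are made up for illustration.
toy <- data.frame(
  City       = c("A", "B", "C", "D", "E", "F"),
  Sessions   = c(120, 150, 5000, 5200, 900, 950),
  Revenue    = c(0, 300, 450000, 458487, 20000, 21000),
  BounceRate = c(0.62, 0.58, 0.35, 0.33, 0.47, 0.45),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns so the City label does not enter the distance
numericCols <- toy[, sapply(toy, is.numeric)]

# Standardise each metric to mean 0 and standard deviation 1
scaled <- scale(numericCols)

# Same two steps as in the post, but on the standardised metrics
distances <- dist(scaled, method = "euclidean")
clusterToy <- hclust(distances, method = "ward.D")
print(clusterToy)
```

Whether standardising is the right call depends on the question: if revenue really should carry more weight than the other metrics, the raw values may be what you want.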
A key decision when clustering is how many clusters to use. The result of hierarchical clustering can be represented by a dendrogram. Cutting the dendrogram at a certain level gives one set of clusters; cutting at another level gives a different set. So how would you pick where to cut the dendrogram?
In the plot, the height of the vertical lines represents the distance between points or clusters, and the data points are listed along the bottom. We want to use the dendrogram to decide on the number of clusters. A common rule of thumb is to cut where the vertical lines are longest, since tall gaps separate groups that are far apart.
It appears, from the above dendrogram, that the data can be represented by 4 clusters.
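Reading the cut point off the plot is a judgment call, so it can help to look at the merge heights directly. The sketch below applies one simple heuristic: find the largest jump between consecutive merge heights and cut there. It again uses a made-up toy data frame (hypothetical cities and values), and the names heights, gaps and suggestedK are just illustrative. Treat the result as a starting point, not as the "right" number of clusters.

```r
# Toy stand-in for the city report. All values are made up for illustration.
toy <- data.frame(
  City       = c("A", "B", "C", "D", "E", "F"),
  Sessions   = c(120, 150, 5000, 5200, 900, 950),
  Revenue    = c(0, 300, 450000, 458487, 20000, 21000),
  BounceRate = c(0.62, 0.58, 0.35, 0.33, 0.47, 0.45),
  stringsAsFactors = FALSE
)

# Cluster on the standardised numeric columns (column 1 is the City label)
tree <- hclust(dist(scale(toy[, -1])), method = "ward.D")

# hclust stores one height per merge: n - 1 values for n cities
heights <- tree$height

# Size of the jump between each merge and the next one
gaps <- diff(heights)

# After i merges there are n - i clusters, so cutting inside the
# largest gap (between merge i and merge i + 1) leaves n - i clusters
i <- which.max(gaps)
suggestedK <- nrow(toy) - i

groups <- cutree(tree, k = suggestedK)
print(suggestedK)
print(table(groups))
```

If you prefer to stay visual, rect.hclust(tree, k = ...) draws boxes around the clusters on an existing dendrogram plot, which makes it easy to compare a few values of k by eye.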
# Assign points to clusters
clusterGroups = cutree(clusterCities, k = 4)
tapply(GaData$Users, clusterGroups, mean)
tapply(GaData$NewUsers, clusterGroups, mean)
tapply(GaData$Sessions, clusterGroups, mean)
tapply(GaData$BounceRate, clusterGroups, mean)
tapply(GaData$PagesPerSession, clusterGroups, mean)
tapply(GaData$Transactions, clusterGroups, mean)
tapply(GaData$Revenue, clusterGroups, mean)
tapply(GaData$E_commerce_Conversion_Rate, clusterGroups, mean)
We cut our dendrogram into 4 groups, which divides our data into 4 clusters. After assigning the cluster groups, we take the mean of each metric per cluster and check the major differences between clusters.
From the results above, we can see a relatively well-defined, distinct set of clusters when it comes to answering which cities generate the most and the least revenue. It is only natural to think about the next steps from this sort of output. One could start devising strategies to understand why certain cities perform the way they do and what to do about it.
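The per-metric tapply() calls can also be collapsed into a single table with aggregate(), which makes the clusters easier to compare side by side. A minimal sketch, assuming a made-up toy data frame with only three metrics and k = 3 (both are hypothetical choices for illustration, not the values from the post):

```r
# Toy stand-in for the city report. All values are made up for illustration.
toy <- data.frame(
  City       = c("A", "B", "C", "D", "E", "F"),
  Sessions   = c(120, 150, 5000, 5200, 900, 950),
  Revenue    = c(0, 300, 450000, 458487, 20000, 21000),
  BounceRate = c(0.62, 0.58, 0.35, 0.33, 0.47, 0.45),
  stringsAsFactors = FALSE
)

# Cluster on the standardised numeric columns and cut into 3 groups
tree <- hclust(dist(scale(toy[, -1])), method = "ward.D")
groups <- cutree(tree, k = 3)

# One row per cluster, one column per metric, each cell a cluster mean
clusterMeans <- aggregate(toy[, -1], by = list(Cluster = groups), FUN = mean)
print(clusterMeans)
```

Each column of clusterMeans holds the same numbers that the matching tapply(..., clusterGroups, mean) call would return, just laid out in one data frame.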
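To act on the clusters, you will want to know which cities actually sit in each one. A minimal sketch of that lookup, again on a made-up toy data frame with hypothetical cities and values and an assumed k = 3; the names citiesByCluster and topCluster are just illustrative:

```r
# Toy stand-in for the city report. All values are made up for illustration.
toy <- data.frame(
  City       = c("A", "B", "C", "D", "E", "F"),
  Sessions   = c(120, 150, 5000, 5200, 900, 950),
  Revenue    = c(0, 300, 450000, 458487, 20000, 21000),
  BounceRate = c(0.62, 0.58, 0.35, 0.33, 0.47, 0.45),
  stringsAsFactors = FALSE
)

# Cluster on the standardised numeric columns and cut into 3 groups
tree <- hclust(dist(scale(toy[, -1])), method = "ward.D")
groups <- cutree(tree, k = 3)

# List of city names per cluster, keyed by cluster number
citiesByCluster <- split(toy$City, groups)
print(citiesByCluster)

# Find the cluster with the highest mean revenue and list its cities
meanRevenue <- tapply(toy$Revenue, groups, mean)
topCluster <- names(which.max(meanRevenue))
print(citiesByCluster[[topCluster]])
```

The same split() call on GaData$City and clusterGroups would give the city lists for the real report.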
The complete code and sample data for this post are available on GitHub.