Data Science and Machine Learning – Google Analytics

Data Science and Machine Learning with e-commerce

If you’ve worked with digital marketing tools like Google Analytics (GA), you may have noticed that the data GA exports is unstructured, or at least not ready to be processed directly in Machine Learning and Data Science tools.

In this first of two articles, we intend to apply Data Science and Machine Learning techniques to Google Analytics data from an e-commerce website.

The goal is to find insights related to the revenue and sales of the e-commerce website.

So, let’s get started!

Step 1.- Origin of Data

We’re going to work with a CSV file extracted directly from Google Analytics 360.

To get good data granularity, we’re going to download daily user activity, from the launch of the e-commerce website up to the present date.

We’re going to use the following variables:

  • Date
  • Sessions
  • Users
  • New Users
  • Bounce Rate
  • Pages / Sessions
  • Avg. Session Duration
  • Conversion Rate
  • Transactions
  • Avg. Order
  • Revenue – Dependent variable or Objective Variable

So, our raw CSV file is going to look like this (sorry, the CSV file is in Spanish):

At this point, I did a little data pre-processing directly in the CSV file: I removed the % sign from the bounce rate and conversion rate columns, so that they are stored as plain numbers.

Those simple changes are going to be really helpful when we use this CSV file as input for our Data Science and Machine Learning tools.

We’re ready to move on to the first step of our Data Science process: data visualization and exploratory analysis.
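The same clean-up can also be scripted instead of edited by hand. Here is a minimal sketch using only Python's standard library. The column names (`Bounce Rate`, `Conversion Rate`), the dates and the values are assumptions made up for illustration, and the decimal-comma handling is a guess at how a Spanish-locale export might look.

```python
import csv
import io

def percent_to_float(text):
    """Turn a string such as '45,30%' or ' 1.09 % ' into a float (45.3, 1.09)."""
    cleaned = text.strip().rstrip("%").strip()
    # Assumption: a Spanish-locale export may use a decimal comma.
    cleaned = cleaned.replace(",", ".")
    return float(cleaned)

def clean_rows(csv_file, percent_columns):
    """Read CSV rows as dicts and convert the given percentage columns to floats."""
    rows = []
    for row in csv.DictReader(csv_file):
        for column in percent_columns:
            row[column] = percent_to_float(row[column])
        rows.append(row)
    return rows

# Hypothetical two-day sample standing in for the real Google Analytics export.
sample = io.StringIO(
    "Date,Sessions,Bounce Rate,Conversion Rate\n"
    '2019-01-01,1200,"45,30%","1,09%"\n'
    "2019-01-02,950,38.2%,0.85%\n"
)
cleaned = clean_rows(sample, ["Bounce Rate", "Conversion Rate"])
print(cleaned[0]["Bounce Rate"], cleaned[1]["Conversion Rate"])
```

Scripting the step keeps the original export untouched, which is handy if the file has to be downloaded again later.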

Step 2.- Data visualization & Exploratory data analysis

Data visualization is fundamental because it helps us interpret our data better and understand the significance of the variables.

With data visualization we can detect patterns, trends and correlations that might go undetected if we analyzed our data directly in a CSV reader (generally Microsoft Excel).

In our investigation we’re going to work with two tools that we recommend:

GlueViz, or simply Glue, is a Python library to explore relationships within and between related data sets.

Weka is a suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.

Here are two videos about these two amazing tools:

Glue (Linked – View Visualization in Python)

Getting Started with Weka – Machine Learning Recipes #10

Let’s move on!

In this first cross-analysis of variables, I decided to plot the Sessions variable against our O.V. (objective variable = Revenue), trying to identify patterns in the data points.

Why use Sessions and Revenue in this first cross-analysis?

In the e-commerce world, digital experts often say that Sessions are related to revenue, so they’ve created the rule: the higher the Sessions, the higher the revenue.

Is that true?

Here are some data visualizations done in Glue:

As a reminder, each data point is one day of activity on the e-commerce website, since we exported the data at daily granularity.

Analyzing the 4 charts above, 2 of them caught my attention:

1.- The Sessions vs Avg. Order chart shows a particular behavior: most Avg. Order values occur where Sessions are lower. If we zoom in on the chart, we can see that, in this particular case, higher Sessions do not necessarily generate higher Avg. Orders.

2.- The second interesting chart is Sessions vs Conversion Rate. In this representation of each data point, we can see that the distribution is right-skewed: most occurrences fall where Sessions are lower.

Also, in the region of lower Sessions we can see high conversion rate values (circled):

If we draw a horizontal line on the y-axis (at a conversion rate of 1.09%), we can compare the number of data points above it for low and high Sessions:

27 occurrences in the purple block (high Sessions) vs 30 occurrences in the blue block (low Sessions)

In this particular case we have more occurrences of a conversion rate higher than 1.09% in the low-Sessions region, which tells us that higher Sessions do not necessarily generate higher conversion rates.

For those of you who want to know more about conversion rate, here is a definition from Wikipedia:
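The count above was read off the chart, but the same comparison can be sketched in a few lines of Python. The data pairs and the `session_split` value that separates "low" from "high" Sessions are hypothetical; the article does not state where the two blocks are divided, and only the 1.09% threshold comes from the text.

```python
def count_high_conversion(days, rate_threshold, session_split):
    """Count days whose conversion rate exceeds rate_threshold,
    separately for low-session and high-session days."""
    counts = {"low_sessions": 0, "high_sessions": 0}
    for sessions, conversion_rate in days:
        if conversion_rate <= rate_threshold:
            continue  # Only days above the horizontal line are counted.
        key = "high_sessions" if sessions >= session_split else "low_sessions"
        counts[key] += 1
    return counts

# Hypothetical (sessions, conversion rate in %) pairs, not the article's data.
days = [(400, 1.5), (450, 0.9), (500, 1.2), (1500, 1.3), (1800, 0.7), (2000, 1.0)]
result = count_high_conversion(days, rate_threshold=1.09, session_split=1000)
print(result)
```

With the real export, `days` would be built from the Sessions and Conversion Rate columns, and the split point would be chosen to match the blocks drawn on the chart.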

Now it’s time to go further with the exploratory data analysis:

After many correlation exercises in GlueViz (assumed correlations, based on data visualization only), we present the most interesting results:

First interesting correlation:

The conversion rate variable (x-axis) shows very similar behavior when cross-analyzed with revenue and with transactions (y-axis). It is expected that the higher the conversion rate, the higher the revenue and transactions, since digital marketing experts defined this metric as:

Conversion rate = (Objectives or transactions / Total visits) * 100.

In the case of our e-commerce site, a fulfilled objective is a transaction. However, it is interesting to note that when we cross-analyzed Sessions with revenue and with conversion rate (in the first phase), neither of them showed any correlation.
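As a quick worked example of the formula, here is a small Python sketch. It assumes that "total visits" corresponds to the Sessions column of the export; the function name and the sample numbers are made up for illustration.

```python
def conversion_rate(transactions, sessions):
    """Conversion rate in percent: (transactions / sessions) * 100."""
    if sessions == 0:
        # Assumption: treat a day without visits as a 0% conversion rate.
        return 0.0
    return 100 * transactions / sessions

# Hypothetical day: 12 transactions out of 1,000 sessions.
print(conversion_rate(12, 1000))
```

Because Sessions is the denominator, a day with many Sessions needs proportionally more transactions to keep the same rate, which fits the observation that high Sessions alone do not imply a high conversion rate.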

Second interesting correlation:

At this point, have you noticed how interesting the conversion rate variable is, given all its correlations with other variables?

Another interesting finding was, once again, the conversion rate variable, this time analyzed against pages per session and average session duration.

Reading the graphs logically, we would say that people who visit more pages within the e-commerce site do so because they are interested in one or more products, which makes them stay longer on the site, increasing the conversion rate and generating more revenue.

From a digital marketing point of view these graphs are good signals, since they suggest the content on the site is very relevant to the audience: “the more pages users visit and the longer they stay on the site, the more transactions and revenue they generate.”
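A visual impression like this one can be checked numerically. One common option, not necessarily the technique used in this investigation, is Pearson's correlation coefficient, which ranges from -1 (perfect negative linear relationship) to 1 (perfect positive one). The sketch below uses only the standard library, and the sample values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical daily values, not the article's data.
pages_per_session = [2.1, 2.8, 3.5, 4.0, 4.6]
conversion_rates = [0.6, 0.9, 1.1, 1.4, 1.5]
r = pearson(pages_per_session, conversion_rates)
print(round(r, 3))
```

A value close to 1 would support what the chart suggests, while a value near 0 would mean the visual pattern is weaker than it looks.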

Third interesting correlation:

A very important metric for e-commerce sites is the bounce rate variable, since this metric tells us if our site is relevant to our visitors or not.

The higher the bounce rate, the less our e-commerce site is engaging its visitors, which generates fewer transactions and less revenue.

In the graphs above, we can see that the behavior of income and transactions is almost identical when these variables are analyzed against the bounce rate.

In this particular case, we see that the highest transactions and revenues are generated when the bounce rate is less than 30%.
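One simple way to put a number on this observation is to split the days at the 30% bounce rate mentioned above and compare the average revenue of the two groups. This is only a sketch: the function name and the sample values are hypothetical, and only the 30% threshold comes from the chart reading.

```python
def mean_revenue_by_bounce(days, bounce_threshold=30.0):
    """Return (mean revenue below the threshold, mean revenue at or above it).

    A group with no days yields None instead of a mean.
    """
    below = [revenue for bounce, revenue in days if bounce < bounce_threshold]
    at_or_above = [revenue for bounce, revenue in days if bounce >= bounce_threshold]

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(below), mean(at_or_above)

# Hypothetical (bounce rate in %, revenue) pairs, not the article's data.
days = [(25.0, 900.0), (28.0, 1100.0), (45.0, 400.0), (60.0, 200.0)]
low_bounce_mean, high_bounce_mean = mean_revenue_by_bounce(days)
print(low_bounce_mean, high_bounce_mean)
```

The same split works for transactions by passing (bounce rate, transactions) pairs instead.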

Well, so far we’ve completed the first step of a Data Science process: data visualization and exploratory data analysis.

With these techniques we’ve detected some important correlations between our variables:

Income and transactions with Conversion Rate

Pages/Sessions and Avg. session duration with Bounce rate

At this point we assume that all these variables are correlated and are going to be useful when we design our Machine Learning model. Nevertheless, in Part 2 of this investigation we’re going to run statistical techniques to go further: to confirm or deny whether those variables are really correlated, and to find out which of them will help us predict the revenue and/or sales of the e-commerce website.

See you there!