Machine Learning data preparation

Artificial Intelligence theorist Eliezer Yudkowsky once said “By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.”

True, AI is a complex and nuanced world, but there is an increasing array of free tools out there that open up the previously out-of-reach world of Machine Learning to (almost) anyone. Of course, our Innovation team in EQTR\X have been all over this.

In the first of a series of posts on our research and experimentation with AI toolsets, our creative technologist, Lindsey Carr, takes you through the critical steps of taking a mountain of data and getting it ready for AI toolsets to interpret and learn from.

In case you’re interested, we did our data visualisation in a Jupyter notebook (running Python) and tested several openly available ML models.

The right data pool

Every business (and individual) nowadays creates more data than they can reasonably comprehend. With so many connected devices and activities in our lives, each of us has a rich and colourful data pool that a Machine Learning toolset can tap into and exploit. And here at Equator, where better to find rich, plentiful and complex data than in one of our many paid search (PPC) campaigns?

In any PPC campaign, AdWords and Analytics produce mountains of data. Google does a decent enough job of bringing intelligence to this, but ML toolsets can really make decision making and planning that little bit faster and smarter.

The right goal

In any machine learning task, the first thing to ask yourself is – what are we looking to predict? There are several potential targets in PPC – but in terms of AdWords the Conversion metric is the end goal. However, conversion alone is not a good indicator of ROI, since what you pay to get a conversion can in some cases make it a losing venture.
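To make that concrete, here is a minimal sketch of folding cost into the picture rather than looking at conversions alone. The column names are assumptions about a typical AdWords export, not our actual schema:

```python
import pandas as pd

# Hypothetical AdWords export -- the column names here are assumptions, not our real data
df = pd.read_csv("adwords_export.csv")

# Cost per conversion: a high figure can turn a 'converting' keyword into a losing venture
df["cost_per_conversion"] = df["Cost"] / df["Conversions"].replace(0, float("nan"))

# A simple binary target if all you care about is whether a row converted at all
df["converted"] = (df["Conversions"] > 0).astype(int)
```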

We’re only going to talk about Machine Learning data preparation in this post because it’s the first major task and often where most of the actual work lies.

The right approach to data preparation

When considering the data you’re using, it’s important to understand that there may be outside factors you’re not aware of that affect the shape your data takes. The more you can discover about outside events the better. For example, if you sample all your PPC data in the midst of a strong TV campaign, your model may not be able to generalise well in future campaigns.

One way to dampen the effect of unique events is to try to get a sample from as long a time series as possible.

Take the graph below as an example – this shows a year’s worth of conversions. The question is: is this shape a genuine trend that repeats year after year, or are there one-off outside factors affecting it? As it turns out, this is a yearly trend across the industry we’re analysing. In the case of this data pool, there’s a known incentive for people to buy between January and March.

A year’s worth of conversions
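For what it’s worth, here is a minimal sketch of how a plot like the one above can be produced in the Jupyter notebook, assuming a pandas DataFrame with ‘Date’ and ‘Conversions’ columns (the names are illustrative, not our actual export):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed 'Date' and 'Conversions' columns from an AdWords/Analytics export
df = pd.read_csv("adwords_export.csv", parse_dates=["Date"])

# Resample to weekly totals to smooth out day-to-day noise before eyeballing the trend
weekly = df.set_index("Date")["Conversions"].resample("W").sum()

weekly.plot(title="A year's worth of conversions")
plt.ylabel("Conversions")
plt.show()
```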

Cleaning & shaping

Once you have your data and understand it as well as you can, you then need to clean & possibly reshape it, checking for the following (amongst other things – a sketch of these checks follows the list):

  • Strong correlations between features
  • Outliers
  • Missing values
  • Categorical vs numerical values
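A hedged sketch of those checks in pandas – the column names (such as ‘Device’) are assumptions for illustration:

```python
import pandas as pd

# df is the raw campaign export; column names such as 'Device' are illustrative assumptions
df = pd.read_csv("adwords_export.csv")

# Strong correlations between numeric features
print(df.corr(numeric_only=True))

# Missing values per column
print(df.isna().sum())

# Crude outlier check: values more than 3 standard deviations from the column mean
numeric = df.select_dtypes("number")
print(((numeric - numeric.mean()).abs() > 3 * numeric.std()).sum())

# Categorical vs numerical: one-hot encode text columns so models can use them
df = pd.get_dummies(df, columns=["Device"])
```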

Through analysis (in the form of graphs and tables) you generally get a very strong picture of your data. Specifically, in this case I could see before I even ran a model that certain trends linked Average Position to Conversions and Cost.

Conversion rate %

Specifically, Avg. Position 2–3 had a better conversion rate than position 1, and it cost significantly more to get conversions at position 1. Position 1 didn’t even significantly increase how quickly conversions were attained.
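The comparison behind that observation can be reproduced with a simple group-by. This is a sketch only, assuming ‘Avg. position’, ‘Conversions’, ‘Clicks’ and ‘Cost’ columns in the cleaned export:

```python
import pandas as pd

# df is the cleaned export; the bucket edges here are illustrative
df["position_bucket"] = pd.cut(df["Avg. position"], bins=[0, 1, 2, 3, 10])

by_position = df.groupby("position_bucket", observed=True)[["Conversions", "Clicks", "Cost"]].sum()
by_position["conversion_rate_%"] = 100 * by_position["Conversions"] / by_position["Clicks"]
by_position["cost_per_conversion"] = by_position["Cost"] / by_position["Conversions"]
print(by_position)
```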

Once your data is relatively clean and the data types are in a suitable format you can run it through a classifier to see which features have the greatest influence on Conversions.

Running data through classifier

This is helpful for ML because throwing more and more data at a problem is not always beneficial. There are times when certain features are not only irrelevant but harm the ability of the model to generalise well and predict future scenarios. It’s worth seeing as well which features have no influence on Conversions.
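One quick way to do this is to look at the feature importances of a tree-based classifier. Below is a sketch using scikit-learn’s random forest; the feature list and the binary target are assumptions, not our exact setup:

```python
from sklearn.ensemble import RandomForestClassifier

# Assumed feature columns and a binary 'did it convert?' target
features = ["Cost", "Clicks", "Impressions", "Avg. position"]
X = df[features]
y = (df["Conversions"] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X, y)

# Rank features by how much signal they carry for conversions
for name, importance in sorted(zip(features, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {importance:.3f}")
```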

PPC is also a classic study in imbalanced data – this is where one outcome far outweighs all others – e.g. records with zero conversions far outnumber records with one or more conversions.

There are 18,000+ records in this data set where there was no conversion, compared to 2,000 records where conversions were recorded

The same issue is present in problems like fraudulent transactions and disease detection, where only a small number of cases will be positive. Do you see where the model might have a problem? Imagine you didn’t build a model but instead just wrote a line of code that always predicted ‘No Fraud’, ‘No Disease’ and, in the case of PPC, ‘No Conversions’. It would have a high accuracy score overall, but completely fail to predict the things we need to predict. This is a problem because many Machine Learning algorithms will, by default, optimise for overall accuracy. So, we need to address this imbalance – which can be done through undersampling, oversampling, or by weighting the classes.
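As a sketch of two of those options, scikit-learn lets you weight the classes, and random undersampling of the majority class can be done directly in pandas (the row counts in the comments simply mirror the roughly 18,000 vs 2,000 split described above):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Option 1: tell the classifier to weight the rare (converting) class more heavily
weighted_clf = RandomForestClassifier(class_weight="balanced", random_state=42)

# Option 2: undersample the majority class so the two classes are the same size
converted = df[df["Conversions"] > 0]        # the ~2,000 converting rows
no_conversion = df[df["Conversions"] == 0]   # the ~18,000 non-converting rows
balanced = pd.concat([
    converted,
    no_conversion.sample(n=len(converted), random_state=42),
]).sample(frac=1, random_state=42)           # shuffle the combined sample
```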

Model picking

After your data preparation you’ll want to decide which machine learning algorithm is best suited to predicting outcomes. This is the subject of a whole other blog post; suffice it to say that choosing your ML approach depends on your data, how clean it is, how much of it you have, what you’re trying to predict, how much computing power you have, and more besides.

With that said, it’s fairly straightforward to use a lot of ML algorithms out of the box and just see how they score before moving on to parameter tuning. The XGBoost algorithm is worth a mention here, as it has become a popular and highly versatile tool that can work through most regression and classification problems. Notably, it has also been used consistently to win Kaggle competitions.
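To give a flavour of that out-of-the-box step, here is a hedged sketch comparing a few classifiers with cross-validation. It assumes the X and y feature matrix and target from the earlier sketches, and xgboost is a separate install; AUC is used rather than accuracy because of the imbalance discussed above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier  # requires the separate xgboost package

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}

# X and y are the feature matrix and binary conversion target prepared earlier
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC {scores.mean():.3f}")
```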

I’ll go through the basics of ML algorithms in the next blog post as well as an overview of cross-validation, parameter tuning and scoring your model for accuracy.