There comes a time in any foray into a new technological area when you realize you may have embarked on a Sisyphean task. Looking at the many options available for tackling the project, you research your options, read the documentation, and start to work, only to find that actually defining the problem can be more work than finding the solution.
Reader, this is where I found myself two weeks into this adventure in machine learning. I familiarized myself with the data, the tools, and the known approaches to problems with this kind of data, and I tried several approaches to solving what on the surface seemed like a simple machine learning problem: based on past performance, can we predict whether any given Ars headline will be a winner in an A/B test?
Things have not gone particularly well. In fact, as I finished this piece, my most recent attempt showed that our algorithm was about as accurate as a coin flip.
But at least that's a start. And in the process of getting there, I learned a great deal about the data cleaning and preprocessing that go into any machine learning project.
Preparing the battlefield
Our data source is a log of the results from 5,500-plus headline A/B tests over the past five years; that's roughly as long as Ars has been running a headline test for each story. Since we have labels for all of this data (that is, we know whether each headline won or lost its A/B test), this appears to be a supervised learning problem. All I really needed to do to prepare the data was to make sure it was properly formatted for the model I chose to use to create our algorithm.
I’m not a data scientist, so I won’t be building my own model anytime this decade. Fortunately, AWS provides a number of pre-built models well-suited to this sort of task and specifically designed to work within the confines of the Amazon cloud. There are also third-party models, such as those from Hugging Face, that can be used with SageMaker. Each model seems to need data fed to it in its own particular way.
The choice of model in this case comes down largely to the approach we take to the problem. Initially, I saw two possible approaches to training an algorithm to estimate the probability of success for any given headline:
- Binary classification: We simply determine the probability that a headline falls into the “win” or “lose” column based on previous winners and losers. We can then compare the probabilities for two candidate headlines and pick the stronger one.
- Multiple-category classification: We try to rate headlines by their click-through rate across several categories (ranking them 1 to 5 stars, for example). Then we can compare the scores of the headline candidates.
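The binary-classification comparison can be sketched in a few lines: score each candidate headline with a trained model's win probability and pick the stronger one. The scoring function below is a made-up stand-in, not a real model.

```python
# Sketch of the binary-classification approach: compare win probabilities
# for two candidate headlines and keep the stronger one.
def pick_winner(candidates, predict_win_probability):
    # predict_win_probability stands in for a trained model's scoring call
    return max(candidates, key=predict_win_probability)

# Toy scorer (pure illustration): pretend shorter headlines win more often
toy_model = lambda headline: 1.0 / len(headline.split())

print(pick_winner(["A very long headline about CPU caches", "CPU news"], toy_model))
# → CPU news
```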
The second approach is harder, and there is one overriding concern with either approach that makes the second even less viable: 5,500 tests, with 11,000 headlines total, is not a lot of data to work with in the grand AI/ML scheme of things.
So I opted for binary classification for my first attempt, because it seemed the most likely to succeed. It also meant that the only data point I needed for each headline (besides the headline itself) was whether it won or lost its A/B test. I took my source data and converted it into a comma-separated-value file with two columns: the headline in one, and “yes” or “no” in the other. I also used a script to remove all the HTML markup from the headlines (mostly a few HTML tags for italics). With the data cut down almost all the way to the essentials, I moved it into SageMaker so I could use Python tools for the rest of the preparation.
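That cleanup step is simple enough to sketch. The rows and file name below are invented for illustration; the regex just strips simple tags like the italics markup mentioned above.

```python
import csv
import re

# Hypothetical raw A/B test log rows: (headline_html, won_its_test)
raw_rows = [
    ("The <i>real</i> story behind the breach", "yes"),
    ("Why the breach happened", "no"),
]

def strip_tags(text):
    # Remove simple HTML tags (mostly <i>/<em> in our headlines)
    return re.sub(r"<[^>]+>", "", text)

# Write the two-column CSV: headline in one column, "yes"/"no" in the other
with open("headlines.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["headline", "won"])
    for html_title, label in raw_rows:
        writer.writerow([strip_tags(html_title), label])
```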
Next, I needed to choose the model type and prepare the data. Again, much of the data preparation depends on the model type the data will be fed into; different types of natural language processing models (and problems) require different levels of data preparation.
After that comes tokenization. AWS technology evangelist Julien Simon explains it aptly: processing the data means first replacing words with tokens, individual tokens. A token is a machine-readable number that stands in for a string of characters. “So ‘ransomware’ would be word one,” he said, “‘crooks’ would be word two, ‘setup’ would be word three… so a sentence becomes a sequence of tokens, and you can feed that to the deep learning model and let it learn which ones are good and which ones are bad.”
Depending on the specific problem, you may want to jettison some of the data. For example, if we were trying to do something like sentiment analysis (that is, determining whether a given Ars headline is positive or negative in tone) or grouping headlines by subject, I would want to trim the data down to the most meaningful content by removing “stop words”: common words that are important for grammar but don’t convey much meaning on their own (words like “the,” “a,” and “of”).
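For those sentiment- or topic-style tasks, stop-word removal looks something like the sketch below. A tiny hand-rolled stop list stands in for the full lists that toolkits like nltk ship with.

```python
# Stop-word removal sketch (we ultimately keep stop words for this project).
# This small hard-coded list stands in for a full stop-word corpus.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}

def remove_stop_words(headline):
    # Keep only the words that carry meaning on their own
    return [w for w in headline.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The rise and fall of a ransomware empire"))
# → ['rise', 'fall', 'ransomware', 'empire']
```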
In this case, however, the stop words are potentially important parts of the data; after all, we’re looking for the parts of headlines that attract attention. So I chose to keep all the words. For my first attempt at training, I decided to use BlazingText, a text-processing model AWS provides for classification problems similar to ours. BlazingText requires the label data (the data that calls out a text element’s classification) to be prefixed with “__label__”. And instead of a comma-delimited file, the label and the associated text go on a single line in a plain text file.
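For illustration, a couple of lines in that format might look like this (the headlines here are invented):

```
__label__yes meet the latest windows subsystem for linux
__label__no a deep dive into cpu caches
```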
Another aspect of data preprocessing for supervised ML is splitting the data into two sets: one for training the algorithm and one for validating its results. The training set is usually the larger one; validation data is generally drawn from around 10 to 20 percent of the total.
There has been a great deal of research into what is actually the right amount of validation data; some of that research suggests that the sweet spot relates more to the number of parameters in the model used to create the algorithm than to the overall size of the data. In this case, given that there was relatively little data to run through the model, I made my validation data 10 percent of the total.
In some cases, you might want to hold back another small pool of data to test the algorithm after it’s validated. But our plan here is to eventually test with live Ars headlines, so I skipped that step.
To handle the data preparation, I used a Jupyter notebook, an interactive web interface to a Python instance, to convert my two-column CSV into a data structure and process it. Python has some decent data manipulation and data-science-specific toolkits that make these tasks fairly straightforward, and I used a few in particular here:
- pandas, a popular data analysis and manipulation module that does wonders for slicing and dicing CSV files and other common data formats.
- scikit-learn, a data science module that takes a lot of the heavy lifting out of machine learning data preprocessing.
- nltk, the Natural Language Toolkit, and in particular the Punkt sentence tokenizer for processing the text of our headlines.
- csv, a module for reading and writing CSV files.
Here is a snippet of the code in the notebook I used to create my training and validation sets from our CSV data:
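A sketch of those steps follows, using toy stand-in data and assumed column names ("headline" and "label"); the real notebook works from the ~11,000-headline CSV rather than generated rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in data; the real dataset has roughly 11,000 labeled headlines
dataset = pd.DataFrame({
    "headline": [f"Sample headline number {i}" for i in range(20)],
    "label": ["yes" if i % 2 == 0 else "no" for i in range(20)],
})
print(dataset.head())  # peek at the column headers and the first few rows

# Batch-prepend the "__label__" prefix BlazingText requires
dataset["label"] = "__label__" + dataset["label"]

# Lowercase the headlines (standing in for the tokenizing lambda)
dataset["headline"] = dataset["headline"].apply(lambda t: t.lower())

# Split into 90 percent training data and 10 percent validation data
train, validation = train_test_split(dataset, test_size=0.1, random_state=1)

# Write one "__label__<label> <headline>" line per example
for frame, path in ((train, "headlines.train"), (validation, "headlines.validation")):
    with open(path, "w") as f:
        for _, row in frame.iterrows():
            f.write(f"{row['label']} {row['headline']}\n")
```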
I started by using pandas to import the data structure from the CSV created from the initially cleaned and formatted data, calling the result “dataset”. Running the dataset.head() command gave me a look at the headers for each column imported from the CSV, along with a peek at some of the data.
The pandas module let me batch-prepend “__label__” to all the values in the label column, as BlazingText requires, and I used a lambda function to tokenize the headlines and force all the words to lowercase. Finally, I used the sklearn module to split the data into the two files I would feed to BlazingText.