3 Lessons Learned From an Imperfect Product Matching Machine Learning Dataset

Machine learning theory says that accurate data in the learning set is crucial for getting a good machine learning model.
Some developers put it bluntly: garbage in, garbage out.
While working on the latest version of Product Matching AI’s model, we learned this the hard way.
This story is about how we discovered a problem in our machine learning dataset, and what we did to tackle it.

But let’s start with a little bit of an intro.

Where do we get our learning sets from?

You would be surprised how many good-quality data sources are freely available on the Web.
When building our current AI model, we combined several such sources, which resulted in the model we have today.

However, we recently acquired a significant new client – a multinational player with marketplaces in several countries.
While discussing project details, we learned that the client already had historically matched data for hundreds of thousands of products, across several key competitors.
We asked the client whether we could use this data as the learning set for the next version of our Product Matching AI model – and they agreed, for which we are very grateful.

So we had a new data source – and with great thrill we started digesting it into our machine learning model.
The learning set contained a balanced number of good matches (confirmed to be correct) and bad ones (solid matching candidates that proved to be wrong). You may ask why we would need wrong matches for the machine learning process at all. Put simply, the algorithm needs both good examples and a similar number of bad ones in order to learn to distinguish right from wrong.
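To make this concrete, here is a minimal sketch of how such a learning set of labeled pairs might look, with made-up URLs and column names (not the client’s actual data), and a quick check that the labels are roughly balanced:

```python
import pandas as pd

# Hypothetical structure of the learning set: one row per candidate pair,
# with a binary label (1 = confirmed match, 0 = confirmed non-match).
pairs = pd.DataFrame(
    [
        {"product_url_a": "https://shop-a.example/wifi-extender-ax1800",
         "product_url_b": "https://shop-b.example/wifi-extender-ax1800",
         "label": 1},
        {"product_url_a": "https://shop-a.example/fridge-710l-black",
         "product_url_b": "https://shop-b.example/fridge-710l-silver",
         "label": 0},
    ]
)

# A roughly balanced label distribution is what the paragraph above describes:
# the model needs both kinds of examples to learn the decision boundary.
print(pairs["label"].value_counts(normalize=True))
```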

First results were surprisingly poor

To our surprise, the model built on this data achieved only 77% precision (one of the key metrics we use to judge model performance).
We plotted the results (good matches as the green line, wrong ones as the red line) on a chart, against the matching prediction score calculated by our model.
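For illustration, here is a rough sketch of how such a chart and the precision figure can be produced. The scores and labels below are synthetic stand-ins, not our actual data, and the 0.5 decision threshold is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score

# Synthetic stand-ins: prediction scores in [0, 1] and the dataset's labels
# (1 = match, 0 = no match). Replace with real model output in practice.
rng = np.random.default_rng(0)
scores = np.clip(np.r_[rng.normal(0.85, 0.1, 500), rng.normal(0.2, 0.1, 500)], 0, 1)
labels = np.r_[np.ones(500, dtype=int), np.zeros(500, dtype=int)]

# Precision at an assumed decision threshold of 0.5.
threshold = 0.5
print("precision:", precision_score(labels, (scores >= threshold).astype(int)))

# Score distributions for good matches (green) and wrong matches (red) --
# the same kind of chart discussed below.
plt.hist(scores[labels == 1], bins=40, histtype="step", color="green", label="good matches")
plt.hist(scores[labels == 0], bins=40, histtype="step", color="red", label="wrong matches")
plt.xlabel("matching prediction score")
plt.ylabel("count")
plt.legend()
plt.show()
```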

Let’s take a look at the chart in more detail:

  1. If we look at the green line, we notice that most good matches have a matching prediction score above 0.8 – this looks promising.
  2. Similarly, most wrong matches (the red line) have a prediction score below 0.25. This is also good – the gap between 0.8 and 0.25 is wide enough for the model to tell whether a match is good or wrong.
  3. However, if you look a bit closer, you will see an unexpected spike of wrong matches at prediction scores above 0.8. This didn’t look good, and we suspected it might explain the poor results we got.

The investigation

What could be the cause of that red spike on the right? We had no choice but to go back to basics – looking at individual wrong matches that had scored an unexpectedly high matching prediction score.
The task may sound simple enough, but only if you have never done it yourself.
You have to take each matching pair (and we’re talking about tens of thousands of suspicious records), open the first URL, check the product characteristics, and then repeat the same with the second URL. Finally, you compare the two and decide: is this a good match or not?
Our client (to whom we are still very thankful for providing the data in the first place) insisted that the supplied matches had been curated by their own QA team and could be trusted 100%.
However, our experience told us that 100% is hard to achieve, so we performed our own check.
To save cost and time, we decided not to run this check on the whole learning set, but only on two subsets (a sketch of this filter follows the list):

  • Wrong matches that received an unexpectedly high matching prediction score (above 0.7)
  • Good matches that received an unexpectedly low matching prediction score (below 0.5)
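Assuming the scored pairs sit in a table with hypothetical ‘label’ and ‘score’ columns (the file name below is made up), selecting those two subsets is a short filter – a sketch:

```python
import pandas as pd

# Hypothetical columns: 'label' (1 = match per the client's data, 0 = no match)
# and 'score' (our model's matching prediction score).
pairs = pd.read_csv("client_matches_scored.csv")  # assumed file name

# Wrong matches that scored unexpectedly high, and good matches that
# scored unexpectedly low -- the two subsets sent back for manual QA.
suspicious_negatives = pairs[(pairs["label"] == 0) & (pairs["score"] > 0.7)]
suspicious_positives = pairs[(pairs["label"] == 1) & (pairs["score"] < 0.5)]

to_review = pd.concat([suspicious_negatives, suspicious_positives])
to_review.to_csv("pairs_to_review.csv", index=False)
```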

It took approximately one week before we got the results back. Should I mention how impatient we were while waiting?

96.5% is worse than 100%, by far

When the results came in, we were jubilant: 3.5% of the examples investigated proved to be wrongly labeled!
There were both false positives (bad matches wrongly labeled as good) and false negatives (good matches wrongly labeled as bad).
3.5% may not sound like a lot, but in the world of machine learning it is. A neural network will learn from skewed examples and reach wrong, biased conclusions – it is as simple as that.
You can also imagine how surprised our client was when we presented this conclusion – at first they refused to believe it, until we showed them concrete examples from those 3.5% of wrong matches. I guess their QA team is having a bit of a hard time at the moment.
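If you want to see the effect of label noise for yourself, here is a toy experiment – synthetic data and a simple logistic regression, not our real pipeline – that measures test precision when a few percent of the training labels are flipped; the exact drop depends on the model and the data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Toy setup: synthetic binary data standing in for match / no-match pairs.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Flip roughly 3.5% of the training labels to mimic mislabeled matches.
rng = np.random.default_rng(0)
flip = rng.random(len(y_train)) < 0.035
noisy = y_train.copy()
noisy[flip] = 1 - noisy[flip]

# Train once on clean labels and once on noisy labels, compare test precision.
for name, labels in [("clean labels", y_train), ("~3.5% flipped labels", noisy)]:
    model = LogisticRegression(max_iter=1000).fit(X_train, labels)
    print(name, "-> precision:", round(precision_score(y_test, model.predict(X_test)), 3))
```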

Back to Product Matching AI: after we corrected those 3.5% of records and fed them back into the machine learning model, precision rose to 87%. That number is still not perfect, but we now know what the likely cause is, and we are working on improving it as we speak.

False negative example

Let’s take a look at an example of two WiFi extenders.
Is the product shown on the left identical to the one shown on the right?

  • The product titles are similar, but the product on the left includes the full part name, while the one on the right does not.
  • The product images are seemingly the same, though taken from different angles.
  • The prices differ (145.01 vs 175) – a difference of roughly 20%, which is still a plausible spread between retailers for the same product.
  • After inspecting the product specs on the second URL, we see that they do mention the model number – which is a key piece of evidence for us.

The client’s QA department said these two products were not a match. Our QA said they were wrong, based on the evidence listed above.
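Some of the checks in the list above can be scripted to pre-screen pairs before a human looks at them. Below is a rough sketch with hypothetical fields and a made-up model number; the prices are the ones quoted above, and the 25% price tolerance is an assumption, not a rule we prescribe:

```python
# A rough sketch of the kind of checks from the list above; it does not
# replace a human reviewer, it only flags obvious signals.
def compare_products(a: dict, b: dict) -> dict:
    price_a, price_b = a["price"], b["price"]
    price_diff = abs(price_a - price_b) / min(price_a, price_b)

    model_a = (a.get("model_number") or "").strip().lower()
    model_b = (b.get("model_number") or "").strip().lower()

    return {
        "price_diff_pct": round(price_diff * 100, 1),
        "price_plausible": price_diff < 0.25,          # assumed tolerance
        "model_number_match": bool(model_a) and model_a == model_b,
    }

# The WiFi-extender pair from this example (prices as quoted above;
# the model number "RE700X" is purely hypothetical).
print(compare_products(
    {"price": 145.01, "model_number": "RE700X"},
    {"price": 175.00, "model_number": "RE700X"},
))
```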

False positive example

In this example, we will discuss a Hitachi fridge.
Again, we ask ourselves the same question: is the product shown on the left identical to the one on the right?

  • The product titles are very, very similar. However, notice the difference in the model number.
  • In both cases it is a 710-liter fridge, so the volume is the same.
  • The color shown in the images is clearly different (black vs silver).
  • There is also a significant difference in price: 2570 vs 2399.

The client’s QA department said these two products were a match. We are positive that they are not – the evidence above is clear.

Lessons learned

Lesson 1

Product matching accuracy is paramount, both for your pricing strategy and for our machine learning process.
The client’s matches had an accuracy of 96.5%, which sounded pretty good. However, it caused significant problems both for the client (shall I mention that those 3.5% of products were not selling well?) and for us – because our machine learning mechanism requires much higher, near-100% data accuracy.
Please do not settle for 96.5% – aim much, much higher!

Lesson 2

Please take QA very seriously. Technology is good, but it has its limits. Human QA still cannot be replaced.

Lesson 3

False positives and false negatives are both unwanted. However, false positives have a larger impact – so if you do not have the time or resources to QA all results, focus on QA-ing the positive matches.

Those were the 3 lessons we learned 🙂 And how is your learning process going? Do let us know – we’d love to hear from you.
