Evaluating Recommender Systems: Choosing the best one for your business

Together with the endless expansion of e-commerce and online media in recent years, more and more Software-as-a-Service (SaaS) Recommender Systems (RSs) are becoming available. Unlike 5 years ago, when using RSs was a privilege of large companies that built their own RS in-house and spent a ginormous budget on a team of data scientists, today’s popularity of SaaS solutions makes recommendation affordable even for small- and medium-sized companies. A question that CTOs of such companies face when looking for the right SaaS RS is: Which solution is the best? Assuming you still don’t have a RS, or you are not satisfied with your current RS, which solution should you choose?

In this article, I will cover two approaches:

  • Offline evaluation in the academic world (plus the Netflix Prize), searching for low prediction errors (RMSE/MAE) and high Recall/Catalog coverage. TL;DR: just know these measures exist and you probably don’t wanna use them. But I still give a brief summary of them in case you are interested.
  • Online evaluation in the business world, searching for high Customer Lifetime Value (CLV), going through A/B-testing, CTR, CR, ROI, and QA. You should read this section if you are seriously considering recommendations to boost your business.

The Offline World = How Academics Do It?

RSs have been investigated for decades in academic research. There are a lot of research papers introducing different algorithms, and to make the algorithms comparable, they use academic measures. We call these the offline measures. You don’t put anything into production; you just play with the algorithms in your sandbox and fine-tune them according to these measures. I personally researched these measures a lot, but from today’s point of view, they are rather prehistoric. Yet even in the middle ages of 2006, in the famous Netflix Prize, a purely academic measure called RMSE (root mean squared error) was used.

Just to briefly explain how it works: it supposes your users explicitly rate your products with, say, a number of stars (1 = strong dislike, 5 = strong like), and you have a bunch of such ratings (records saying that user A rated item X with Y stars) from the past. A technique called split validation is used: you take only a subset of these ratings, say 80% (called the train set), build the RS on them, and then ask the RS to predict the ratings on the 20% you’ve hidden (the test set). It may happen that a test user rated some item with 4 stars, but your model predicts 3.5, hence it has an error of 0.5 on that rating, and that’s exactly where RMSE comes from. You then square the errors, average them over the whole test set, take the square root, and get a final result of 0.71623. BINGO! That’s how good (or, more precisely, bad) your RS is. Or you may average the absolute errors instead and get the MAE (mean absolute error), which does not penalize huge errors (true 4 stars, predicted 1 star) that much, so you might only get 0.6134.
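To make the two formulas concrete, here is a minimal Python sketch computing RMSE and MAE over a toy test set; the rating values are made up for illustration.

```python
import math

# (true rating, predicted rating) pairs from a hypothetical test set
test_ratings = [(4, 3.5), (5, 4.2), (2, 3.0), (1, 1.4)]

# RMSE: square each error before averaging, so big misses are penalized more
rmse = math.sqrt(sum((t - p) ** 2 for t, p in test_ratings) / len(test_ratings))

# MAE: plain average of absolute errors, more forgiving to outliers
mae = sum(abs(t - p) for t, p in test_ratings) / len(test_ratings)

print(f"RMSE = {rmse:.4f}, MAE = {mae:.4f}")
```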

One tiny drawback here is that such data almost doesn’t exist in the real world, or at least there is too little of it.

Users are too lazy and they won’t rate anything. They just open a web page, and if they like what they see, they might buy it or consume it; if it sucks, they leave as fast as they came. So you only have so-called implicit ratings in your web-server log or a database of purchases, and you can’t measure the number-of-stars error on them, simply because there are no stars. You only have +1 = user viewed a detail page or purchased a product, and, typically, nothing else. Sometimes these are called unary ratings, which you know from Facebook’s “Like” button: the rating is either positive or unknown (the user might simply not know the content exists).
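For illustration, this is roughly what such implicit (unary) feedback looks like once extracted from a purchase log; the user and item identifiers below are invented.

```python
# Hypothetical purchase log rows: (user_id, item_id, timestamp)
purchase_log = [
    ("alice", "item_42", "2023-01-10"),
    ("alice", "item_17", "2023-02-03"),
    ("bob",   "item_42", "2023-02-11"),
]

# Unary ratings: we only know "+1 = the user interacted with the item";
# a missing pair means "unknown", not "dislike".
implicit_feedback = {(user, item): 1 for user, item, _ in purchase_log}

print(implicit_feedback)
```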

You can still use split validation on such data, even for your own offline comparison of SaaS recommenders. Say you take, for example, your purchase database, submit the history of 80% of users to the RS, and then, for each test user, submit only a few purchases and ask the RS to predict the rest. You may have hidden 4 purchased items and ask the RS for 10 items. You may get 0%, 25%, 50%, 75%, or 100% accuracy for that user, depending on how many of the hidden 4 appeared in the recommended 10. This accuracy is called the Recall. You may average it over your whole test set and TADAAA! Your result is 31.4159%; that’s how good your RS is.
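A minimal sketch of that recall computation, assuming you already have, per test user, the set of hidden purchases and the list returned by the recommender (the item IDs are invented):

```python
def recall_at_n(hidden_items, recommended_items):
    """Fraction of the hidden items that appear among the recommendations."""
    hits = len(set(hidden_items) & set(recommended_items))
    return hits / len(hidden_items)

# One test user: 4 hidden purchases, the RS returned 10 items
hidden = ["A", "B", "C", "D"]
recommended = ["B", "X", "C", "Y", "Z", "Q", "W", "E", "R", "T"]

print(recall_at_n(hidden, recommended))  # 0.5 -> 50% for this user

# Average the per-user recalls over the whole test set to get the final number.
```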

Now honestly, even though Recall is much more sane than RMSE, it still sucks. Say a test user watched 20 episodes of the same TV series, and you measure recall on her. You hide episodes #18–20 and ask the RS to predict them from #1–17. It is quite an easy task, as the episodes are strongly connected, so you get a recall of 100%. Now, did your user discover something new? Do you want to recommend her such content at all? And what brings the highest business value to you anyway? Say, in an online store, do you wish to recommend alternatives, or accessories? You should feel you’re getting onto very thin ice with recall.

And one more secret I will tell you: in some cases (not always, it depends on your business!), it’s a fair strategy to recommend only the globally most popular items (a.k.a. bestsellers) to achieve a reasonable recall. So here comes Catalog coverage. Do you wish your users to keep discovering new and new content to stay loyal? Then you might want to recommend as many different items as possible. In the simplest case, to compute Catalog coverage, just take your test users, ask for recommendations for each one of them, and put all the recommended items together. You obtain a large set of different items. Divide the size of this set by the total number of items in your entire catalog, and you get… 42.125%! That’s the portion of items your RS is able to ever recommend.
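The simplest variant of Catalog coverage described above fits in a few lines; the catalog size and the recommendation lists here are toy values.

```python
catalog_size = 1000  # total number of items in your catalog (toy value)

# Recommendation lists returned for each test user (toy data)
recommendations_per_user = [
    ["item_1", "item_2", "item_3"],
    ["item_2", "item_4", "item_5"],
    ["item_1", "item_6", "item_7"],
]

# Union of everything the RS ever recommended during the test
distinct_recommended = set()
for rec_list in recommendations_per_user:
    distinct_recommended.update(rec_list)

coverage = len(distinct_recommended) / catalog_size
print(f"Catalog coverage: {coverage:.2%}")
```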

Now consider a bestseller model. It might have quite good recall, but almost zero coverage (5 constant items?). And take a random recommender: it has almost zero recall and 100% coverage. You might feel you’d like some compromise.

The above image comes from my (now very outdated) original research. You can see about 1000 different RS models drawn in the Recall-Coverage plane. Geeky, ain’t it? :) You might feel dizzy when choosing the best one, but I hope you feel that choosing one from the upper right (the “Pareto-optimal front”) could be a good choice.

To make your offline estimate even more robust, you can use cross-validation (Xval) instead of split validation. Just divide your users into 10 folds and loop: always take 9 folds to build the model, and use the remaining fold to do the validation. Average the results over these 10 runs.
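A sketch of that 10-fold loop, assuming hypothetical `build_model` and `evaluate` functions that wrap whatever offline measure (recall, coverage, …) you picked:

```python
def cross_validate(users, build_model, evaluate, k=10):
    """Split users into k folds; train on k-1 folds, validate on the remaining one."""
    folds = [users[i::k] for i in range(k)]  # simple round-robin split
    scores = []
    for i in range(k):
        validation_fold = folds[i]
        train_users = [u for j, fold in enumerate(folds) if j != i for u in fold]
        model = build_model(train_users)                 # hypothetical: fit the RS
        scores.append(evaluate(model, validation_fold))  # hypothetical: e.g. recall
    return sum(scores) / k                               # average over the k runs
```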

Now you might say: What about my business? Measuring recall and coverage might be fine, but how are they related to my KPIs?

And you are right. To put the SaaS RS on the X-axis and $$$ on the Y-axis, we have to leave the offline world and go into production!


The Online World: Follow the example of smart CTOs

The above section was about measuring the quality of the RS before it goes into production; now it’s time to talk about business KPIs.

While in offline evaluation we typically use split validation, in online evaluation, A/B-testing (or multivariate testing) is today’s most prominent approach. You may integrate a few different RSs, divide your users into groups, and let the RSs fight. This is a bit costly, because it consumes your development resources, so you can use the estimated difficulty of integration and the cost of future customizations/adjustments as one of your measures, which might a-priori reduce the pool of candidates.

Now let’s say you have the integration ready and are able to divide your online users into A/B-test groups. You may either use your own hashing of their UID cookies, or use some tool for that (for example, VWO, Optimizely, or even Google Analytics, though the last option is a little bit painful). To do the experiment, you should pick one good place on your website/application where to test the recommendations, because you sure don’t want to do a full integration of all the candidate RSs early in the pilot stage, right? If you have small traffic, keep in mind that the selected place must be visible enough to collect significant results. In the opposite case, if you have huge traffic, you may choose a conservative strategy and, for example, release only 20% of your traffic to the testing, keeping yourself and the remaining 80% of users safe in case one of the candidate RSs turns out to be completely broken and recommends odd stuff.
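If you go the do-it-yourself route, a deterministic hash of the UID cookie is enough to split traffic. Here is a minimal sketch; the cookie value, the group names, and the 20% rollout threshold are just example choices.

```python
import hashlib

def ab_group(uid_cookie: str, traffic_share: float = 0.2) -> str:
    """Deterministically assign a user to an A/B-test group based on their UID cookie."""
    # Stable hash -> number between 0 and 1; the same user always lands in the same bucket
    digest = hashlib.sha256(uid_cookie.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF

    if bucket >= traffic_share:
        return "control"  # the safe 80% kept on your current solution
    # Split the released 20% evenly between two candidate recommenders
    return "candidate_A" if bucket < traffic_share / 2 else "candidate_B"

print(ab_group("uid-cookie-123456"))
```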

Suppose the whole thing is up and running. What to measure? The easiest measures are the Click-Through Rate (CTR) and the Conversion Rate (CR) of the recommendations. Displayed a set of N recommendations 20 times, out of which a user clicked on at least one of the recommended items 3 times? Then your CTR is 15%. Indeed, clicking is nice, but it probably led the user to a detail page and you might want to know what happened next. Did the user really find the content interesting? Did she watch the whole video, listen to the whole song, read the whole article, answer the job offer, put the product into the cart and actually order it? This is the conversion rate = the number of recommendations that made both you and your user happy.
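A toy sketch of how CTR and CR fall out of raw counters; the numbers mirror the example above, and what counts as a "conversion" is up to your product (some teams also divide conversions by clicks rather than impressions).

```python
impressions = 20   # times a recommendation box was displayed
clicks = 3         # times the user clicked at least one recommended item
conversions = 1    # times the click ended in a purchase / full view / etc.

ctr = clicks / impressions        # 0.15 -> 15%
cr = conversions / impressions    # conversion rate of the recommendations

print(f"CTR = {ctr:.1%}, CR = {cr:.1%}")
```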

Example: Recombee KPI console

CTR and CR may give you a good estimate of the recommender’s performance, but you should stay careful and keep thinking about your product. You may be running a news portal, putting the breaking news on the homepage. This might not bring you the highest possible CTR, but it maintains the quality and the feeling you and your users have about your service. Now you may put a RS there and it might start showing different content, such as yellow journalism articles or funny articles about “very fast dogs running at incredible hihg speeds”. This may increase your immediate CTR five-fold, but it will damage your image and you may lose your users in the long term.

Here comes the empiric evaluation of the RSs. Just start a new session with empty cookies, simulate the behavior of a user, and check whether the recommendations are sane. If you have a QA team, put them on the job! Empiric evaluation is both complicated and easy at once. It’s complicated because it does not produce any numbers you could present to the product board. But it’s also easy because, thanks to your human intuition, you will simply recognize which recommendations are good and which are bad. If you choose an oddly-behaving recommender, you’re setting yourself up for a lot of future trouble, even if the CTR/CR are high at the moment.

But of course, besides quality, you should care about the Return on Investment (ROI). Simply put, you might have determined that A/B-testing fold #1 led to an increase of X% in conversion rate over the baseline fold #0 (your current solution), that your margin was $Y for the average successfully recommended item, and that it required Z recommendation requests to achieve that. Do the math and project the expenses/incomes in case you put the given RS on 100% of your traffic, integrating it also into other sections of your website/app.
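The back-of-the-envelope math might look like the sketch below; every input (uplift, margin, request volume, price per request) is a placeholder you would replace with your own numbers and your provider’s actual pricing.

```python
# Placeholder inputs from the A/B test and your pricing
baseline_conversions = 1000      # conversions in fold #0 (current solution)
uplift = 0.08                    # X% = 8% more conversions in the RS fold
margin_per_conversion = 12.0     # $Y average margin of a recommended item
requests_used = 500_000          # Z recommendation requests during the test
price_per_1k_requests = 1.0      # what the SaaS RS charges (example value)

extra_revenue = baseline_conversions * uplift * margin_per_conversion
recommender_cost = requests_used / 1000 * price_per_1k_requests

roi = (extra_revenue - recommender_cost) / recommender_cost
print(f"Extra revenue: ${extra_revenue:.0f}, cost: ${recommender_cost:.0f}, ROI: {roi:.1%}")
```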

One warning about the ROI calculation: it is very fuzzy and depends on a large number of unknowns. Will the CR be the same in other places on my website/app? (Simple answer = no, it won’t; different places have different CTR/CR.) How will the CR change if I put the recommendations in a more or less attractive position? (Simple answer = a lot.) How will the CR evolve in time? Will the users learn to use and trust the recommendations, or will the CR decline?

This leads to the ultimate yet most difficult measure: the Customer Lifetime Value (CLV). You are looking for a win-win situation. You want your users to like your service, to feel comfortable, happy, and willing to return. Hand in hand with that, you want the RS to improve the UX and help the users find interesting content/products that they like. How to reach a high CLV using a RS?

Well, there is no simple advice here. You should search for nice recommendations with high empirical quality and a reasonably positive ROI. In my experience, the niceness of recommendations typically corresponds to business value and will prevent you from being flooded with complaints from your QA team/CEO. And if you observe that the business case is positive, you’ve found what you were looking for :)


Conclusion

I’ve tried to cover the most important aspects of evaluating RSs. You might have seen it is not an easy task and there is a lot to consider, but I hope I at least gave you some clues to find your way around the area. You may test RSs offline even before going into production, or do production A/B testing with CTR/CR and an ROI estimate. Always include some QA, because CTR/CR/ROI alone may be misleading and do not guarantee compatibility with the vision of your product.

Much has been omitted just to keep the text finitely long. Besides CTR/CR/ROI and the quality of the recommendations, you should have a quick look at the overall capabilities of the RS considered. You might want to include recommendations in your emailing campaigns in the future. Will it work? Does it have the capability to rotate recommendations so that a given user won’t receive the very same set of recommendations in each email? Can it serve all your business requirements; can you affect the recommendations, boost some type of content, filter it based on various criteria? These topics are not covered here, but you may want to consider them as well.


The author is a co-founder of Recombee, a sophisticated SaaS Recommendation Engine.