Daniel Lemire's blog

, 11 min read

Implementing a Rating-Based Item-to-Item Recommender System in PHP/SQL

10 thoughts on “Implementing a Rating-Based Item-to-Item Recommender System in PHP/SQL”

  1. Seb says:

    Tu devrais l’annoncer sur le nouveau groupe http://groups.yahoo.com/group/RecommenderSystems/ (qui compte déjà 45 membres)

  2. Seb says:

    Peut-être aussi fournir le code en text file…

  3. Thanks. Great idea.

  4. Chris Harris says:

    Hi Daniel,

    I have implemented your algorithms and created a dev table using the MovieLens 100k database. Is there any way to return similar items to item 1 rather than just items that are rated higher than it?

    If I use say:

    SELECT dev.itemid1 , dev.sum, dev.count, dev.sum / dev.count AS average, movies.movietitle
    FROM dev, movies
    WHERE dev.count > 5 AND dev.itemid2 = 98 AND dev.itemid1 = movies.movieid
    ORDER BY average DESC;

    (item 98 is One Flew Over the Cuckoo’s Nest, query takes 0.0388s) I get 3 Wallace and Grommit movies in the top 5. I presume this is because people who liked Wallace and Grommit did not like One Flew Over the Cuckoo’s Nest.

    Also, you mention in your paper that you can initialize your dev table with 0 default values for all entries and then use the predict_all function. I use the following query for that:

    SELECT dev.itemID1, movies.movietitle, sum(dev.sum + dev.count*rating.ratingvalue) / sum(dev.count) as avgrat
    FROM dev, rating, movies
    WHERE rating.userid = 1
    AND dev.count > 30
    AND dev.itemID1 < > rating.itemid
    AND dev.itemID2 = rating.itemid
    AND dev.itemID1 = movies.movieid
    GROUP BY dev.itemID1, movies.movietitle
    ORDER BY avgrat DESC
    LIMIT 100;

    (that query takes about 5 seconds with a MyISAM table, almost 200 seconds with an InnoDB table! (there are 1,967,832 rows in the dev table)). But I can’t see any reason to pre-populate with 0 values.

    I initially tried this with PostgreSQL but it was way too slow (even with triggers and functions.) I am currently writing stored procedures for MySQL 5 and hopefully they will make things even faster. What do you use on your site?

    Thanks for all the work you’ve put into this!

    Chris

  5. May I ask what you are working on Chris?

    I have implemented your algorithms and created a dev table using the MovieLens 100k database. Is there any way to return similar items to item 1 rather than just items that are rated higher than it?

    Actually, what I suggest people do is to return items that people liked even more than the current item. You could cook up a similarity measure, but most often, isn’t your goal to help people find items they will like rather than items similar to what they’ve found already? Of course, you could argue that if they’ve found a musical and they like musicals, we should suggest other musicals, but slope one is only focused on *rating-based* collaborative filtering.

    Obviously, you could throw in a correlation measure between two items in the database. Myself, I believe you should rather send people see items they are likely to like at least as much as the current item.


    If I use say:

    SELECT dev.itemid1 , dev.sum, dev.count, dev.sum / dev.count AS average, movies.movietitle

    FROM dev, movies

    WHERE dev.count > 5 AND dev.itemid2 = 98 AND dev.itemid1 = movies.movieid

    ORDER BY average DESC;

    (item 98 is One Flew Over the Cuckoo’s Nest, query takes 0.0388s) I get 3 Wallace and Grommit movies in the top 5. I presume this is because people who liked Wallace and Grommit did not like One Flew Over the Cuckoo’s Nest.

    First of all, in collaborative filtering, when someone rates an item, they, most often, like it. Rating something is a positive reaction. If you hate “Flew Over the Cuckoo”, you won’t bother to rate it, most of the time. So, what your result means is that people who rated “Flew Over the Cuckoo’s Nest” preferred “Wallace and Grommit”. In other words, people who cared enough about “Flew Over the Cuckoo’s Nest” to rate it, actually preferred “Wallace and Grommit”.

    If you think “oh, but I don’t want movies of different genres to be recommended this way”, then the best bet is to throw in some content-based techniques. For example, when the user is browsing a drama, only recommend drama. You can achieve this result by having one dev table per genre.

    Also, you mention in your paper that you can initialize your dev table with 0 default values for all entries and then use the predict_all function. I use the following query for that:

    SELECT dev.itemID1, movies.movietitle, sum(dev.sum + dev.count*rating.ratingvalue) / sum(dev.count) as avgrat

    FROM dev, rating, movies

    WHERE rating.userid = 1

    AND dev.count > 30

    AND dev.itemID1 < > rating.itemid

    AND dev.itemID2 = rating.itemid

    AND dev.itemID1 = movies.movieid

    GROUP BY dev.itemID1, movies.movietitle

    ORDER BY avgrat DESC

    LIMIT 100;

    (that query takes about 5 seconds with a MyISAM table, almost 200 seconds with an InnoDB table! (there are 1,967,832 rows in the dev table)). But I can’t see any reason to pre-populate with 0 values.

    Now that you ask, I cannot see why on Earth one would need to populate the table with zeroes. This is very puzzling to me as I always assumed you needed to have a dense matrix for this particular trick to work, but I don’t see why it is needed now. I will add a note in the tech. report to reflect this realization.

    I initially tried this with PostgreSQL but it was way too slow (even with triggers and functions.) I am currently writing stored procedures for MySQL 5 and hopefully they will make things even faster. What do you use on your site?

    inDiscover.net uses MySQL 4.x I believe.

  6. Looking forward to hear from you Chris!

  7. Chris Harris says:

    Thanks for the reply, Daniel. I am currently working on a top-secret project 😉 Details will be revealed soon…

    I think the problem of W&G being recommended when you look at One Flew Over… could also be due to there being such a few number of moview in the ML database. If there were say 100,000 movies, members’ ratings would be better “grouped” into similar items because then people would only rate the movies they liked, as you said, by searching for them rather than being presented a list to rate. But 100,000^2 makes a pretty big dev table!

    I have a few more suggestions for your paper that I will post in a few days. Still trying to get the damn stored procedures working in MySQL! Their cursor implementation seems very strange to me.

  8. Chris Harris says:

    This is just something I though about over the weekend regarding similar items. If there was a “genre” column it would be even better (“and r.genre = r2.genre”):

    select count(r.itemid), r.itemid, m.movietitle
    from rating r, rating r2, movies m
    where r.userid = r2.userid
    and r2.itemid = 168
    and r.itemid 168
    and r2.ratingvalue = 5
    and r.ratingvalue = 5
    and r.itemid = m.movieid
    group by r.itemid
    order by count(r.itemid) desc

  9. Vincent Gagnon says:

    Hi Daniel,
    I just found your article as I am trying to build a user-based recommender system myself. As the article is quite old now, do you think there are some new and/or better solutions to this problem with the technology we have today? Or do you think this solution is still relevant? After all, php and mysql are still widely used. I would be interested to hear your thoughts on this. Thank you

    1. PHP and mysql are not obsolete, certainly.