24th January 2005, 11 min read

Implementing a Rating-Based Item-to-Item Recommender System in PHP/SQL

10 thoughts on “Implementing a Rating-Based Item-to-Item Recommender System in PHP/SQL”

Seb says:

January 24, 2005 at 9:09 pm

Tu devrais l’annoncer sur le nouveau groupe http://groups.yahoo.com/group/RecommenderSystems/ (qui compte déjÃ 45 membres)
Seb says:

January 24, 2005 at 9:09 pm

Peut-Ãªtre aussi fournir le code en text file…
Daniel Lemire says:

January 24, 2005 at 9:14 pm

Thanks. Great idea.
Chris Harris says:

November 4, 2005 at 4:51 am

Hi Daniel,

I have implemented your algorithms and created a dev table using the MovieLens 100k database. Is there any way to return similar items to item 1 rather than just items that are rated higher than it?

If I use say:

SELECT dev.itemid1 , dev.sum, dev.count, dev.sum / dev.count AS average, movies.movietitle
FROM dev, movies
WHERE dev.count > 5 AND dev.itemid2 = 98 AND dev.itemid1 = movies.movieid
ORDER BY average DESC;

(item 98 is One Flew Over the Cuckoo’s Nest, query takes 0.0388s) I get 3 Wallace and Grommit movies in the top 5. I presume this is because people who liked Wallace and Grommit did not like One Flew Over the Cuckoo’s Nest.

Also, you mention in your paper that you can initialize your dev table with 0 default values for all entries and then use the predict_all function. I use the following query for that:

SELECT dev.itemID1, movies.movietitle, sum(dev.sum + dev.count*rating.ratingvalue) / sum(dev.count) as avgrat
FROM dev, rating, movies
WHERE rating.userid = 1
AND dev.count > 30
AND dev.itemID1 < > rating.itemid
AND dev.itemID2 = rating.itemid
AND dev.itemID1 = movies.movieid
GROUP BY dev.itemID1, movies.movietitle
ORDER BY avgrat DESC
LIMIT 100;

(that query takes about 5 seconds with a MyISAM table, almost 200 seconds with an InnoDB table! (there are 1,967,832 rows in the dev table)). But I can’t see any reason to pre-populate with 0 values.

I initially tried this with PostgreSQL but it was way too slow (even with triggers and functions.) I am currently writing stored procedures for MySQL 5 and hopefully they will make things even faster. What do you use on your site?

Thanks for all the work you’ve put into this!

Chris
Daniel Lemire says:

November 4, 2005 at 2:59 pm

May I ask what you are working on Chris?

I have implemented your algorithms and created a dev table using the MovieLens 100k database. Is there any way to return similar items to item 1 rather than just items that are rated higher than it?

Actually, what I suggest people do is to return items that people liked even more than the current item. You could cook up a similarity measure, but most often, isn’t your goal to help people find items they will like rather than items similar to what they’ve found already? Of course, you could argue that if they’ve found a musical and they like musicals, we should suggest other musicals, but slope one is only focused on *rating-based* collaborative filtering.

Obviously, you could throw in a correlation measure between two items in the database. Myself, I believe you should rather send people see items they are likely to like at least as much as the current item.

If I use say:

SELECT dev.itemid1 , dev.sum, dev.count, dev.sum / dev.count AS average, movies.movietitle

FROM dev, movies

WHERE dev.count > 5 AND dev.itemid2 = 98 AND dev.itemid1 = movies.movieid

ORDER BY average DESC;

(item 98 is One Flew Over the Cuckoo’s Nest, query takes 0.0388s) I get 3 Wallace and Grommit movies in the top 5. I presume this is because people who liked Wallace and Grommit did not like One Flew Over the Cuckoo’s Nest.

First of all, in collaborative filtering, when someone rates an item, they, most often, like it. Rating something is a positive reaction. If you hate “Flew Over the Cuckoo”, you won’t bother to rate it, most of the time. So, what your result means is that people who rated “Flew Over the Cuckoo’s Nest” preferred “Wallace and Grommit”. In other words, people who cared enough about “Flew Over the Cuckoo’s Nest” to rate it, actually preferred “Wallace and Grommit”.

If you think “oh, but I don’t want movies of different genres to be recommended this way”, then the best bet is to throw in some content-based techniques. For example, when the user is browsing a drama, only recommend drama. You can achieve this result by having one dev table per genre.

Also, you mention in your paper that you can initialize your dev table with 0 default values for all entries and then use the predict_all function. I use the following query for that:

SELECT dev.itemID1, movies.movietitle, sum(dev.sum + dev.count*rating.ratingvalue) / sum(dev.count) as avgrat

FROM dev, rating, movies

WHERE rating.userid = 1

AND dev.count > 30

AND dev.itemID1 < > rating.itemid

AND dev.itemID2 = rating.itemid

AND dev.itemID1 = movies.movieid

GROUP BY dev.itemID1, movies.movietitle

ORDER BY avgrat DESC

LIMIT 100;

(that query takes about 5 seconds with a MyISAM table, almost 200 seconds with an InnoDB table! (there are 1,967,832 rows in the dev table)). But I can’t see any reason to pre-populate with 0 values.

Now that you ask, I cannot see why on Earth one would need to populate the table with zeroes. This is very puzzling to me as I always assumed you needed to have a dense matrix for this particular trick to work, but I don’t see why it is needed now. I will add a note in the tech. report to reflect this realization.

I initially tried this with PostgreSQL but it was way too slow (even with triggers and functions.) I am currently writing stored procedures for MySQL 5 and hopefully they will make things even faster. What do you use on your site?

inDiscover.net uses MySQL 4.x I believe.
Daniel Lemire says:

November 4, 2005 at 11:08 pm

Looking forward to hear from you Chris!
Chris Harris says:

November 4, 2005 at 8:27 pm

Thanks for the reply, Daniel. I am currently working on a top-secret project 😉 Details will be revealed soon…

I think the problem of W&G being recommended when you look at One Flew Over… could also be due to there being such a few number of moview in the ML database. If there were say 100,000 movies, members’ ratings would be better “grouped” into similar items because then people would only rate the movies they liked, as you said, by searching for them rather than being presented a list to rate. But 100,000^2 makes a pretty big dev table!

I have a few more suggestions for your paper that I will post in a few days. Still trying to get the damn stored procedures working in MySQL! Their cursor implementation seems very strange to me.
Chris Harris says:

November 9, 2005 at 6:44 am

This is just something I though about over the weekend regarding similar items. If there was a “genre” column it would be even better (“and r.genre = r2.genre”):

select count(r.itemid), r.itemid, m.movietitle
from rating r, rating r2, movies m
where r.userid = r2.userid
and r2.itemid = 168
and r.itemid 168
and r2.ratingvalue = 5
and r.ratingvalue = 5
and r.itemid = m.movieid
group by r.itemid
order by count(r.itemid) desc
Vincent Gagnon says:

September 10, 2019 at 10:17 am

Hi Daniel,
I just found your article as I am trying to build a user-based recommender system myself. As the article is quite old now, do you think there are some new and/or better solutions to this problem with the technology we have today? Or do you think this solution is still relevant? After all, php and mysql are still widely used. I would be interested to hear your thoughts on this. Thank you
1. Daniel Lemire says:
  
  September 10, 2019 at 11:57 am
  
  PHP and mysql are not obsolete, certainly.