The NYT article actually does raise a few issues that you mention – such as the importance of diversity (through Maes’ complaint about narrow-mindedness).
Also, I’ve elaborated on the problems of RMSE on our blog – it was interesting to see Koren’s comment to your Dec 07 post about RMSE giving a misleading measure of progress.
Regarding your criticisms of machine learning, there is research in that field that considers diversity constraints and non-static data sets.
Section 5.1 of this paper by Smola and Le explicitly considers diversity constraints for ranking problems, of which collaborative filtering is a special case.
There is also quite a lot of research in ML on what you call “non-static” data sets. However, the ML community refers to this as “online” learning. Stochastic gradient descent is a well-known and practical example of this type of algorithm.
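To make that concrete, here is a minimal sketch of online learning for rating prediction: a toy matrix-factorization model updated one rating at a time with stochastic gradient descent. The dimensions, learning rate, and ratings are made up, and this is not the algorithm from any particular paper.

```python
import numpy as np

# Minimal sketch of online learning for rating prediction: each incoming
# (user, item, rating) triple triggers one stochastic-gradient update,
# so the model adapts as data arrives. Hyperparameters are illustrative.

n_users, n_items, k = 1000, 500, 10
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_users, k))   # user factors
Q = 0.1 * rng.standard_normal((n_items, k))   # item factors
lr, reg = 0.01, 0.02                          # learning rate, regularization

def online_update(u, i, r):
    """One SGD step on a single observed rating."""
    pu, qi = P[u].copy(), Q[i].copy()
    err = r - pu @ qi                         # prediction error
    P[u] += lr * (err * qi - reg * pu)
    Q[i] += lr * (err * pu - reg * qi)

# Simulated stream of ratings; in practice these would arrive over time.
for u, i, r in [(3, 7, 4.0), (3, 9, 2.0), (42, 7, 5.0)]:
    online_update(u, i, r)
```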
There is even research that addresses both of your shortcomings at once. For example, Crammer and Singer have a paper that provides an online algorithm for ranking and applies it to the EachMovie data set. Searching for “online learning” and “ranking” reveals more along these lines.
There is also quite a lot of research in ML on what you call “non-static” data sets.
Here is the problem again. When working with a static data set, as Crammer and Singer do with the EachMovie data set, they ignore the fact that, in practice, the ratings are influenced by the collaborative filtering algorithm! If you change the algorithm, you will collect different ratings. That’s because your users browse the movies, say, based on what the recommender suggests (for example, Amazon says 30% of their sales are due to the recommenders they use)… so they will rate different items, and rate them differently, if you change the algorithm. In turn, this will influence the algorithm, which will then change how it influences the users.
This is like the polls right before the election. The polls are supposed to measure how people vote, but in fact, they influence people… there is a feedback loop.
That has *absolutely* nothing to do with whether you do batch or online processing. Online versus batch is a performance issue; I’m talking about how the *algorithm* changes the *data* (and vice versa).
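To make the feedback loop concrete, here is a toy simulation: a popularity recommender decides which items users see, users mostly rate what they are shown, and those ratings are exactly the data the recommender runs on next. All numbers (catalog size, 70% share) are made up for illustration.

```python
import random

# Toy simulation of the feedback loop: the recommender decides what users
# see, users mostly rate what they see, and those ratings become the data
# set for the next round of recommendations.

random.seed(1)
n_items = 20
counts = [1] * n_items                   # rating counts, i.e., the "data set"

def recommend(k=3):
    # a popularity recommender: suggest the currently most-rated items
    return sorted(range(n_items), key=lambda i: -counts[i])[:k]

for step in range(1000):
    shown = recommend()
    if random.random() < 0.7:            # most ratings come via recommendations
        item = random.choice(shown)
    else:                                # the rest come from independent browsing
        item = random.randrange(n_items)
    counts[item] += 1

print(sorted(counts, reverse=True)[:5])  # a few items end up dominating the data
```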
Section 5.1 of this paper by Smola and Le explicitly considers diversity constraints for ranking problems, of which collaborative filtering is a special case.
Thanks. This recent non-peer-reviewed paper looks good indeed. But it is hardly representative of the algorithmic research done in collaborative filtering. The diversity issue has not been entirely ignored in collaborative filtering: I have a survey report somewhere of what was done, and several people, from way back, talked about diversity in recommender systems. However, the diversity work is tiny and vastly ignored.
Why? Because it is a lot easier to measure accuracy. So, all the work (99%) focuses on this one issue above all else.
Aleks: Yes, people know about the diversity issue. In fact, every user knows about this problem. It has just been vastly ignored for the last 10 years.
I love the post you link to!
I see what you mean by non-static data now and take your point. However, I disagree that online processing has “absolutely nothing to do” with non-static data, since online methods are able to track non-stationary targets. That is, if the results of the algorithm are changing the distributions underlying the data, then as more data is taken into account, the algorithm will adapt to them.
I also agree that there is not much work on diversity measures but I thought you were too quick to discount ML research with a sweeping statement so felt compelled to offer a counter-example.
I see what you mean by non-static data now and take your point. However, I disagree that online processing has “absolutely nothing to do” with non-static data, since online methods are able to track non-stationary targets. That is, if the results of the algorithm are changing the distributions underlying the data, then as more data is taken into account, the algorithm will adapt to them.
An online algorithm will have a tighter feedback loop, but ultimately, you are limited by how quickly your users can react and input data. Hence, the difference between a batch algorithm run every day, and an online algorithm that adapts on the fly, might not be so large.
Of course, I would favor the online algorithm given a chance… 😉 But Google seems to do well with batch indexing algorithms. I understand that they run PageRank in batch mode… and they seem to do fine.
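As a rough illustration of why the gap may be small, here is a toy calculation of how long a new rating waits before the model reflects it, under a nightly batch retrain versus a fully online update. The arrival times and the 24-hour retrain schedule are assumptions for the sketch.

```python
import random

# Toy comparison of feedback latency: how long (in hours) a new rating
# waits before the model reflects it, under nightly batch retraining
# versus fully online updates.

random.seed(2)
arrivals = sorted(random.uniform(0, 24 * 7) for _ in range(1000))  # one week of ratings

# Online: each rating is folded in immediately -> zero staleness.
online_staleness = [0.0 for _ in arrivals]

# Batch: ratings wait until the next nightly retrain (every 24 h).
batch_staleness = [24 * (int(t // 24) + 1) - t for t in arrivals]

print(f"online avg staleness: {sum(online_staleness) / len(arrivals):.1f} h")
print(f"batch  avg staleness: {sum(batch_staleness) / len(arrivals):.1f} h")
# The nightly batch job averages roughly 12 hours of delay; whether that
# matters depends on how quickly users react to recommendations, as argued above.
```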
I also agree that there is not much work on diversity measures but I thought you were too quick to discount ML research with a sweeping statement so felt compelled to offer a counter-example.
I am equally critical of my own work and of the work in any domain. It is by seeking flaws that we make progress. And I have received my share of criticism from the ML community and from the TCS community as well. (Note that I have published papers in ML and TCS journals/conferences. I refuse to live in closed gardens.)
Please go read my papers and criticize them! Publicly! If people can’t take criticism, they should stay home, in the labs, and never publish.
However, my impression is that the ML community suffers from the same flaw as any tightly integrated community: it becomes strongly biased. See my post Encouraging diversity in science for a related discussion.
I believe research should not occur within groups, but within networks. Communities should be open, not closed. Single-minded people (“accuracy above all else”) should be left behind. Science requires us to be open-minded, to have a dialogue not only with people who “think alike” but also with people who think differently, so that we can do “richer” science.
I definitely agree with your last point. I’m also wary of very tightly knit groups. On that note, you’ll be happy to hear that I’m presenting a paper at a conference on Australian literary culture next month. Is that diverse enough for you? 🙂
I just came across:
http://www.informatik.uni-freiburg.de/~cziegler/BX/
It mentions diversity, but uses taxonomy (!) to compute it.
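For the curious, here is a minimal sketch of what taxonomy-based diversity can look like, roughly in the spirit of that work: item similarity is taken from the shared prefix of the items’ taxonomy paths, and list diversity is the average pairwise dissimilarity. The tiny book taxonomy and labels are made up, and this is not the exact measure from the linked paper.

```python
from itertools import combinations

# Sketch of taxonomy-based intra-list diversity: similarity between two
# items is the length of the common prefix of their taxonomy paths
# (normalized), and diversity is the average pairwise dissimilarity.

taxonomy = {                           # item -> path from root to leaf category
    "book_a": ["Fiction", "Mystery", "Noir"],
    "book_b": ["Fiction", "Mystery", "Cozy"],
    "book_c": ["Fiction", "SciFi"],
    "book_d": ["Nonfiction", "History"],
}

def similarity(x, y):
    px, py = taxonomy[x], taxonomy[y]
    common = 0
    for a, b in zip(px, py):
        if a != b:
            break
        common += 1
    return common / max(len(px), len(py))   # 1.0 means same leaf category

def intra_list_diversity(items):
    pairs = list(combinations(items, 2))
    return sum(1 - similarity(x, y) for x, y in pairs) / len(pairs)

print(intra_list_diversity(["book_a", "book_b", "book_c"]))  # fairly similar list
print(intra_list_diversity(["book_a", "book_c", "book_d"]))  # more diverse list
```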