Daniel Lemire's blog

Online teaching is the future?

2 min read

Recently, Bill Gates gave us the main reason for the ongoing revolution in university teaching: Fortunately for all of you, you’re in a generation where all of these courses are going to be online and basically free. I’m taking solid state physics from MIT, though MIT doesn’t know it. You…

Publish or Perish: the Tool

1 min read

Through Sebastien Paquet, I found a software application called Publish or Perish. It queries Google Scholar and computes statistics for you automagically. It works well. Linux and Windows versions are available. The Windows version runs under MacOS if you have Wine.

Should we fear Google?

1 min read

Google is getting into the health records business. What happens when a single company has full access to your emails, your videos, your family pictures and your health records? Abuses are possible, but I predict that not much will happen. The American NSA is recording and mining a large fraction of…

When a terabyte is small

1 min read

With Kamel and Owen, I am working on a paper involving database indexes. We had over a terabyte of space, and yet, in the middle of the production of the paper, we ran out of space. Only a year ago, I thought that one terabyte was large. So, I ask our technician about getting a new drive. He comes…

Recommending Journal Articles in a Scientific Digital Library

1 min read

André Vellino will give a talk on recommender systems in our offices (100 Sherbrooke West, room 2720) at 12:30pm this Thursday (February 21st 2008). Recommender systems for scientific digital libraries that have been the subject of experiments in recent years have used corpora that are primarily…

External-Memory Shuffles?

2 min read

We need to shuffle the lines in very large variable-length-record flat files. We can load the files in MySQL and do “select * from mytable order by rand().” However, loading the data in a DBMS and dumping it out is cumbersome. So, we do an in-memory shuffle block by block. It comes close to a…
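A minimal Python sketch of a block-by-block in-memory shuffle, under stated assumptions: the function name `blockwise_shuffle`, the `block_lines` parameter, and the file paths are all hypothetical, and the records are assumed to be newline-terminated text lines. It only permutes lines within each block, so it is not a uniform shuffle of the whole file.

```python
import random
from itertools import islice


def blockwise_shuffle(src_path, dst_path, block_lines=100_000, seed=None):
    """Shuffle a text file one block at a time.

    Only one block of `block_lines` lines is held in memory at once.
    Lines never leave their block, so the result is only an
    approximation of a full shuffle.
    """
    rng = random.Random(seed)
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        while True:
            # Read the next block of (variable-length) lines.
            block = list(islice(src, block_lines))
            if not block:
                break
            # Make sure a final line without a newline cannot be
            # glued to another record after shuffling.
            block = [ln if ln.endswith("\n") else ln + "\n" for ln in block]
            rng.shuffle(block)
            dst.writelines(block)
```

A call such as `blockwise_shuffle("big.txt", "shuffled.txt", block_lines=1_000_000)` would process the file in a single sequential pass; larger blocks bring the result closer to a true shuffle at the cost of memory.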

What is a reusable research result?

1 min read

Peter argued that reusability and originality are the primary qualities of a research result. I can tell something is not original if it looks similar to previous work. When reviewing a paper, it might be difficult to determine if the research result is reusable. Nevertheless, here are some…

Yahoo! Research jobs in Montreal

1 min read

Fernando Diaz — an Information Retrieval Researcher from Yahoo! labs in Montreal — sent me this job offer. I had no idea Yahoo! had researchers in Montreal! I feel better about my home town! Note: Do not get in touch with me regarding this position. I am just reposting it. Machine Learning /…

No shortage of Information Technology Workers

1 min read

At my school, the dean of the Science Faculty claims that we should see a surge of enrollment in Computer Science given the current shortage of Information Technology workers. I have an idea of who is feeding him this information, but I believe it is nonsense. First, I do not believe there is a…

How many users are needed for an efficient collaborative filtering system?

1 min read

You can build an effective recommender system with as few as two people. As you have more users, you tend to have more training data. Hence, you may have more accurate recommendations. More accurate recommendations may not be important to your users. The exact count of your users may not…

Random Write Performance in Solid-State Drives

1 min read

I have written that solid-state drives (SSDs), as found in recent laptops such as the MacBook Air, nearly bridge the gap between internal and external memory. Indeed, we went from 3 orders of magnitude to 1 order of magnitude of difference between disk and RAM! There is a catch…

Chaining CAPTCHAs for fun and profit?

1 min read

A CAPTCHA is a type of challenge-response test used in computing to determine whether a user is human. Yahoo! is having major difficulties with its CAPTCHAs. Russian hackers are able to pass their Turing tests with 35% accuracy. Some human beings say that their accuracy is 80% on these same…
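As a rough illustration of why chaining might help, here is a small Python sketch under a strong assumption that is not from the post: each CAPTCHA in a chain is solved independently, so the chance of passing all of them is the single-test rate raised to the chain length. The function name `chained_pass_rate` is hypothetical; the 35% and 80% figures are the ones quoted above.

```python
def chained_pass_rate(single_pass_rate, chain_length):
    """Probability of passing every CAPTCHA in a chain,
    assuming each attempt is independent (an assumption)."""
    return single_pass_rate ** chain_length


for k in (1, 2, 3, 4):
    bot = chained_pass_rate(0.35, k)    # automated solver, per the figure above
    human = chained_pass_rate(0.80, k)  # self-reported human accuracy
    # The human-to-bot ratio grows with k, but humans fail more often too.
    print(f"chain of {k}: bot {bot:.4f}, human {human:.4f}, ratio {human / bot:.1f}")
```

Under this assumption the gap between humans and bots widens with each added test, at the price of a lower pass rate for legitimate users.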

Closed-source software is the source of innovation?

1 min read

Geoff cites an article by Jaron Lanier arguing that closed-source software is the source of innovation, that open source software is only polishing copies. The gist of the argument is there: Why are so many of the more sophisticated examples of code in the online world—like the page-rank…

A first draft of HTML 5… toward a new HTML?

1 min read

The W3C just published today a first draft of HTML 5. HTML 5 replaces HTML 4 and XHTML 1. They are getting rid of the “acronym” element because it was rarely used. The elements “canvas”, “video”, and “audio” are added: HTML becomes fully multimedia. However, MathML and SVG remain…

The network is the bottleneck?

1 min read

There is a really nice article on StorageMojo about Cloud Computing. Cloud Computing is more or less the idea that you can offload your storage and processing tasks to a very large set of computers, typically maintained by some large company (such as Amazon). The novelty is that you abstract out…

Tracking call for papers… with a wiki?

1 min read

WikiCFP is a tool to track calls for papers collaboratively using a wiki. The calls for papers are entered in categories: you can follow only the Machine Learning, Natural Language Processing, or databases calls for papers. You can subscribe to RSS feeds for each category. What a good idea!

On the sum of power laws

1 min read

Many real-life data sets have power laws or Zipfian distributions. An integer-valued random variable X follows a power law with parameter a if P(X = k) is proportional to k^(-a). Panos asked what the sum of two power laws was. He cites Wilke et al. who claim that the sum of two power laws X and Y…
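To experiment with the question numerically, here is a minimal Python sketch. It makes assumptions that are not in the post: the support is truncated to 1..n so the distribution can be normalized for any a, X and Y are independent, and the helper names `truncated_power_law_pmf` and `pmf_of_sum` are hypothetical.

```python
def truncated_power_law_pmf(a, n):
    """Return [P(X = 1), ..., P(X = n)] with P(X = k) proportional to k^(-a)."""
    weights = [k ** -a for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def pmf_of_sum(p, q):
    """Distribution of X + Y for independent X and Y.

    p[i] is P(X = i + 1) and q[j] is P(Y = j + 1);
    the returned r[s] is P(X + Y = s + 2).
    """
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r


p = truncated_power_law_pmf(1.5, 200)
q = truncated_power_law_pmf(2.5, 200)
r = pmf_of_sum(p, q)
# Inspect how quickly the tail of X + Y decays compared with X and Y.
for s in (10, 50, 100):
    print(s, r[s - 2], p[s - 1], q[s - 1])
```

Plotting `r` on a log-log scale next to `p` and `q` is one way to compare the tail of the sum with the tails of the two inputs.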

What is an effective social network?

1 min read

Many democratic systems require vote diversity. You do not get elected prime minister of Canada by rallying the largest number of voters. You also need to have your votes spread out over several regions. Similarly, Scott Karp argues that completely open social networks fail. He takes two examples:…