Daniel Lemire's blog

We never invent anything new, yet progress is made!

2 min read

Practical innovation explains how per-capita wealth increased eightfold during the last century. Yet, we are constantly reminded that we never invent anything new: Most movies are remakes or variations on older movies. Most research papers are variations on a theme. Most products and services are…

My low-tech research tools

1 min read

I carry a pocketbook and a pen everywhere. At night, my pocketbook is by my bed. All creative workers should carry notebooks. Organizing and collecting ideas are different tasks. My pocketbook is strictly for collection. Every few days, I start a new page: a list of reminders on one side, and…

My (short) activity report for 2008

2 min read

I heard on the radio today that the Christmas break should be used to review the past year and decide where you want to go. Good idea! What did I do? I published the Lemur Bitmap Index C++ Library. I published lbimproved, a C++ library for Fast Nearest-Neighbor Retrieval under the Dynamic Time…

Where do presidents and prime ministers go to school?

3 min read

In his most recent essay, After the credentials, Paul Graham tells us that in South Korea, “college entrance exams determine 70 to 80 percent of a person’s future.” Fortunately, the Americans know better: “Where you go to college still matters, but not like it used to.” Paul writes…

Parsing CSV files is CPU bound: a C++ test case (Update 1)

2 min read

(See update 2.) In a recent blog post, I said that parsing simple CSV files could be CPU bound. By parsing, I mean reading the data on disk and copying it into an array. I also strip the field values of spurious white space. You can find my C++ code on my server. A reader criticized my…
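To make the notion of "parsing" used here concrete, the following is a minimal Python sketch, not the C++ code from the post. It assumes the simplest possible CSV (no quoted fields, no embedded commas), and the function name `parse_csv_lines` is hypothetical. It splits each line on commas, strips spurious white space from each field, and copies the result into an array.

```python
import io


def parse_csv_lines(lines):
    """Parse simple CSV lines into a list of rows (lists of strings).

    Assumption: fields never contain quotes or embedded commas.
    """
    rows = []
    for line in lines:
        # Drop the trailing newline, then find the commas.
        fields = line.rstrip("\n").split(",")
        # Strip spurious white space and store copies of the fields.
        rows.append([field.strip() for field in fields])
    return rows


# Usage: an in-memory file stands in for a file on disk.
sample = io.StringIO("a , b,c\n1,  2 ,3\n")
rows = parse_csv_lines(sample)
```

Timing this function against a loop that only reads the lines is one way to separate the cost of parsing from the cost of I/O.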

Parsing CSV files is CPU bound: a C++ test case (Update 2)

1 min read

I am continuing my fun saga to determine whether parsing CSV files is CPU bound or I/O bound. Recall that I posted some C++ code and reported that it took 96 seconds of process time to parse a given 2GB CSV file and just 27 seconds to read the lines without parsing. Preston L. Bannister correctly…

Fast argmax in Python

1 min read

In my post Computing argmax fast in Python, I reported that Python has no builtin function to compute argmax, the position of a maximal value. I provided one such function and asked people to improve my solution. Here are the results: argmax function running time array.index(max(array)) 0.1…
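As an illustration of what an argmax function looks like, here is a minimal sketch of two variants. The first wraps the `array.index(max(array))` expression listed in the results above. The second is a single-pass alternative added here for comparison, and it is not necessarily one of the solutions benchmarked in the post. Both function names are hypothetical.

```python
def argmax_index(values):
    """Two passes: find the maximum, then locate its first position."""
    return values.index(max(values))


def argmax_single_pass(values):
    """One pass: pick the index whose value is largest.

    max() returns the first maximal item it encounters, so ties
    resolve to the earliest position, matching argmax_index.
    """
    return max(range(len(values)), key=values.__getitem__)


# Usage: with a tie, both variants return the first maximal position.
data = [3, 9, 2, 9]
first = argmax_index(data)
second = argmax_single_pass(data)
```

Both raise `ValueError` on an empty list, since `max` has nothing to compare.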

Parsing CSV files is CPU bound: a C++ test case

1 min read

(These results were updated.) In Parsing text files is CPU bound, I claimed that I had a C++ test case proving that parsing CSV files could be CPU bound. By CPU bound, I mean that the overhead of taking each line, finding out where the commas are, and storing the copies of the fields into an array,…

The Synthese Recommender System

1 min read

Andre Vellino has just opened his Synthese Recommender System: a recommender for journal articles. Andre works for one of the largest scientific libraries in the world (CISTI). You can read all about his project on his blog.

Why is the free market letting us down?

2 min read

I often lean to the right politically. The idea that the free market will work is compelling. Free markets may be good at generating some form of wealth, but as we saw on the stock market, this wealth may turn out to be artificial. We have another example of the rule: pure theory is wasteful. But…

The next wave in IT: employee monitoring

1 min read

Up until now, it has been difficult for bosses to monitor employees remotely. A friend of ours worked from an office in downtown Montreal. She decided that working from home would be more efficient. Though her boss is conservative, he agreed. She must be particularly happy this morning…

Parsing text files is CPU bound

2 min read

Computer Science researchers often stress the importance of compression to get better performance. I believe this is a good illustration of an academic bias. Indeed, file size is easy to measure. It is oblivious to Computer and CPU architectures. We even have a beautiful theory that tells you how…

Native XML databases: have they taken the world over yet?

2 min read

Some years ago, the database research community jumped into XML. Finally, something new to work on! For about 5 years now, I have seen predictions that the XML databases would take the world over. Every organization would soon have its XML database. People would run web sites out of XML databases.…

Are you really running out of time?

1 min read

A common feeling among creative workers is the lack of time. Yet, most people will run out of energy before they run out of time. A single task that takes you 5 minutes (asking a Business Development Officer for Intellectual Property rights) can drain you for a week. Another task, like…

Social Networking for Scientists: Mendeley

1 min read

Among scientist-bloggers, the new buzzword is Mendeley: a social networking platform for scientists (Ricardo Vidal, Sylvie Noël, Misha Lemeshko, Michael Kuhn, …). The site is barely getting started and is still in early beta; there are bugs and limitations. However, the London-based has…

Innovative ideas are indistinguishable from crackpot ones

2 min read

It is impossible to distinguish bogus work from high-quality work objectively and systematically. You can sort work based on external attributes such as quality of the presentation, length, logical correctness, prestige of the authors, and methodology, but not on the significance of the work.…

Diversity in recommender systems: sketch of a bibliography

1 min read

I have been arguing on this blog that while everyone knows diversity is a desirable property of recommender systems, there has been little work on the topic. To make my claim precise, I decided to list the papers addressing both recommender systems and diversity. I mean this list to be…

Recommender systems: where are we headed?

1 min read

Daniel Tunkelang comments on the recent progress in collaborative filtering: (…) the machine learning community, much like the information retrieval community, generally prefers black box approaches, (…) If the goal is to optimize one-shot recommendations, they are probably right. But I…