Daniel Lemire's blog

For your in-memory databases, do you really need an index?

1 min read

For large data sets on disk, indexes are often essential. However, if your data fits in RAM, indexes are often unnecessary. They may even be harmful. Consider a table made of 10,000,000 rows and 10 columns. Using normalization, you can replace each value by a 32-bit integer for a total of 381 MB.…
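
The size estimate in the excerpt above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, and it assumes that "MB" in the excerpt means 2^20 bytes.

```python
# Back-of-the-envelope check of the table size quoted in the excerpt.
# Assumption: "MB" is read as 2**20 bytes (a mebibyte).
ROWS = 10_000_000
COLUMNS = 10
BYTES_PER_VALUE = 4  # each normalized value is stored as one 32-bit integer

total_bytes = ROWS * COLUMNS * BYTES_PER_VALUE
total_mib = total_bytes / 2**20

print(f"{total_bytes:,} bytes is about {int(total_mib)} MiB")
```

At this size the whole table fits comfortably in the RAM of an ordinary machine, which is the premise of the post.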

The rise of scientific journalism

3 min read

Dissidents from Wikileaks have founded a competing organization called OpenLeaks. This new organization would differ from Wikileaks in two important ways: (1) it would be less centered on one character (such as Wikileaks’ Assange) and (2) it would not publish the original documents, relying…

A taxonomy for the suppression of dissent

4 min read

Unless you live under a rock, you have heard about Wikileaks. Along with several newspapers, Wikileaks has been releasing confidential diplomatic documents for several days. Noam Chomsky has said that these documents reveal a profound hatred for democracy. It is unclear to me whether Wikileaks…

Who will need database administrators in 2020?

2 min read

In response to my Why do we need database joins? post, many readers stressed the importance of strict database schemas to preserve data integrity. In short, we want database administrators (DBA) to input constraints at design time so that the integrity of the database is ensured no matter how lousy…

Three of my all-time most popular blog posts

1 min read

Emotions killing your intellectual productivity: We all have to deal with setbacks. And even when things go our way, we can still remain frustrated. I offer pointers on how to remain productive despite your emotional state. Turn your weaknesses into strengths: We all have weaknesses. Maybe you are…

Over-normalization is bad for you

3 min read

I took a real beating with my previous post where I argued against excessive normalization on the grounds that it increases complexity and inflexibility, and thus makes the application design more difficult. Whenever people get angry enough to post comments on a post of mine, I conclude that I am…

Why do we need database joins?

3 min read

In a recent post, I argued that the current NoSQL trend could be called NoJoin. My argument boils down to the fact that SQL entices you to normalize your data, which creates complicated schemas. Meanwhile, NoSQL database systems use simple schemas and are therefore easier to scale out. Curt Monash…

Remarkable scientists without a Wikipedia page

1 min read

I was surprised today to learn that Michael Ley’s Wikipedia page had been deleted (because it failed to indicate the significance of the subject). I have yet to meet anyone in Computer Science or Information Technology who does not know about the DBLP Computer Science Bibliography. Michael has…

Why you may not like your job, even though everyone envies you

5 min read

In a provocative post, Matt Welsh, a successful tenured professor at Harvard, announced that he was leaving his academic job for an industry position. It created a serious malaise: his department chair (Michael Mitzenmacher) wrote a counterpoint answering the improbable question: “why I’m staying at Harvard?” To my…

You probably misunderstand XML

3 min read

When I took my current position, I was invited to teach a course on unstructured data. It is a sensible topic for a course: some say that between 80% and 90% of all enterprise data is unstructured. But I objected to the title for marketing reasons. How many students would take a course on…

Public funding for science?

1 min read

Terence Kealey has been arguing against public funding of science. Is it efficient to fund science with government dollars? He argues that when science is mostly funded by large government agencies, other funding sources are effectively crowded out. He has two good historical examples. Firstly,…

How do search engines handle special characters? Should you care?

1 min read

Matt Cutts is Google’s search engine optimization expert. He runs a great YouTube channel called Google Webmaster Central. He was recently asked how Google handles special characters such as ligatures, soft hyphens, interpuncts and hyphenation points. His answer? He doesn’t know. Being a…

Who is going to need a database engine in 2020?

2 min read

Given the Big Data phenomenon, you might think that everyone is becoming a database engineer. Unfortunately, writing a database engine is hard: Concurrency is difficult. Whenever a data structure is modified by different processes or threads, it can end up in an inconsistent state. Database…

The future is already here: it’s just not very evenly distributed

2 min read

It is not 9am yet. Nevertheless, I got a lot done: I attended the thesis proposal of my student Eduardo via Skype. I was literally in my basement with a fresh cup of coffee, attending a presentation hundreds of kilometers away. Besides myself, there were professors from two different cities…

Can you trust fixed-bit computer arithmetic?

3 min read

Suppose that you have 10 pictures, and all lined up, they take 100 pixels. Is it safe to say that each picture has a width of x pixels if 10 x = 100? We all know that a x = b has a unique solution x as long as a is non-zero. If you work with integers, then you can say that there is at most one…
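
To illustrate why the question in the title matters, here is a small sketch that solves a x = b by brute force under wraparound arithmetic. It assumes 8-bit unsigned integers (arithmetic modulo 256), which the excerpt does not specify, and the helper name `solutions_fixed_width` is hypothetical.

```python
def solutions_fixed_width(a, b, bits=8):
    """Return every x in [0, 2**bits) with a * x == b under wraparound.

    Assumption: unsigned fixed-width arithmetic, i.e. arithmetic modulo 2**bits.
    """
    modulus = 1 << bits
    return [x for x in range(modulus) if (a * x) % modulus == b % modulus]

# With an even multiplier, 10 * x == 100 can have more than one solution.
print(solutions_fixed_width(10, 100))

# With an odd multiplier, the solution is unique.
print(solutions_fixed_width(3, 100))
```

Under these assumptions, an even multiplier shares a factor with the modulus, so more than one x can satisfy the equation, whereas an odd multiplier is invertible and gives exactly one.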

Can Science be wrong? You bet!

3 min read

A common answer to my post on the reliability of science was that fraud was marginal and that, ultimately, science is self-correcting. That is true on one condition: that the science in question is bona fide science. Otherwise, I disagree that institutional science is self-correcting. It is…

Is MapReduce obsolete?

1 min read

Last week, the Register announced that Google moved “away from MapReduce.” Given that several companies adopted MapReduce (hence copying Google), is Google moving a step ahead of its copycats? Moreover, Tony Bain is asking today whether Stonebraker was right in stating that MapReduce was “a…

How reliable is science?

3 min read

It is not difficult to find instances of fraud in science: Ranjit Chandra faked medical research results. He pocketed the money meant for running the experiments. Woo-suk Hwang faked human cloning, among other terrible things. Jan Hendrik Schön faked a transistor at the molecular level. How did…

Manifesto for Half-Arsed Academic Research

1 min read

Research results are more important than the number of publications or citations. This is fine. Yet, we don’t have time to read your papers. So, just keep publishing a lot of papers each year. And get your influential friends to cite you. That’s how we’ll know whether you are good. Science…