Daniel Lemire's blog

, 2 min read

Changing your perspective: horizontal, vertical and hybrid data models

Data has natural layouts:

  • text is written from the first to the last word,
  • database tables are written one row at a time,
  • Google presents results one document at a time,
  • the early recommender systems compared users to other users,
  • discussions are organized in newsgroups and posting boards by topic,
  • research papers are organized in journals and conferences,
  • objects have attributes (a ball is red), and from these attributes we determine similarities between objects.

Using a database terminology, these are horizontal layouts.

We can rotate these models to create vertical layouts:

  • Instead of writing text sequentially, we can store the locations of each word in an inverted index.
  • Instead of writing database tables row-by-row (e.g., Oracle, MySQL), we can write databases column by column (e.g., C-Store/Vertica, LucidDB, Sybase IQ, and my Lemur Bitmap Index Library).
  • Instead of presenting results sets one document at a time, we present tag clouds and use faceted search to support exploration. Thus, instead of listing documents, we focus on attributes (date, topic, author).
  • Recommender systems are often more scalable when they compare items instead of users: the most famous example is Greg Linden‘s Amazon recommender (if you liked this book, you may like…). For example, the Slope One algorithms outperform many user-to-user algorithms.
  • The social web started out with topic-oriented newsgroups and posting boards, but it is not dominated by user-centric blogs and social sites (such as Facebook or Twitter). Since then, we have realized that user-oriented blogs can be preferable.
  • While research papers are published in conferences and journals, I argue that we should turn this around and organize research papers by author through author-specific feeds.
  • Some AI researchers are suggesting that relations might be primary whereas attributes would be secondary.

Many of the best solutions are hybrids. For example, text search sometimes require full-text indexes such as suffix arrays and Oracle recently announced a row/column hybrid.

Take away message__:__ If you are stuck, try to rotate your data model. If neither the vertical nor the horizontal model is a good fit, create an hybrid.