Changing your perspective: horizontal, vertical and hybrid data models
Data has natural layouts:
- text is written from the first to the last word,
- database tables are written one row at a time,
- Google presents results one document at a time,
- the early recommender systems compared users to other users,
- discussions are organized in newsgroups and posting boards by topic,
- research papers are organized in journals and conferences,
- objects have attributes (a ball is red), and from these attributes we determine similarities between objects.
In database terminology, these are horizontal layouts.
We can rotate these models to create vertical layouts:
- Instead of writing text sequentially, we can store the locations of each word in an inverted index.
- Instead of writing database tables row-by-row (e.g., Oracle, MySQL), we can write databases column by column (e.g., C-Store/Vertica, LucidDB, Sybase IQ, and my Lemur Bitmap Index Library).
- Instead of presenting result sets one document at a time, we present tag clouds and use faceted search to support exploration. Thus, instead of listing documents, we focus on attributes (date, topic, author).
- Recommender systems are often more scalable when they compare items instead of users: the most famous example is Greg Linden's Amazon recommender (if you liked this book, you may like…). Similarly, the item-based Slope One algorithms outperform many user-to-user algorithms.
- The social web started out with topic-oriented newsgroups and posting boards, but it is now dominated by user-centric blogs and social sites (such as Facebook or Twitter). We have since realized that user-oriented blogs can be preferable.
- While research papers are published in conferences and journals, I argue that we should turn this around and organize research papers by author through author-specific feeds.
- Some AI researchers are suggesting that relations might be primary whereas attributes would be secondary.
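To make the first two rotations concrete, here is a minimal Python sketch. The function names (`build_inverted_index`, `rows_to_columns`) and the sample data are made up for illustration, and the column rotation assumes every row has the same fields. Real engines add compression, sorting and on-disk structures that are ignored here.

```python
from collections import defaultdict


def build_inverted_index(text):
    """Map each word to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        index[word].append(position)
    return dict(index)


def rows_to_columns(rows):
    """Rotate a list of row dictionaries into a dictionary of columns.

    Assumption: every row has the same keys.
    """
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


# Horizontal: the text as written, word after word.
index = build_inverted_index("the ball is red and the car is red")

# Horizontal: one dictionary per row.
rows = [
    {"object": "ball", "color": "red"},
    {"object": "car", "color": "red"},
]
# Vertical: one list per attribute.
columns = rows_to_columns(rows)
```

With the vertical layouts, "where does this word occur?" or "what are all the colors?" becomes a single lookup instead of a scan over the whole text or table.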
Many of the best solutions are hybrids. For example, text search sometimes requires full-text indexes such as suffix arrays, and Oracle recently announced a row/column hybrid.
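One possible way to build a row/column hybrid is to cut the table into groups of consecutive rows and store each group column by column. The sketch below illustrates that general idea only; it is not a description of Oracle's design, and the names (`to_row_groups`, `read_row`, `scan_column`) are hypothetical. It assumes that every row has the same fields.

```python
def to_row_groups(rows, group_size):
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        names = chunk[0].keys()  # assumption: all rows share these keys
        groups.append({name: [row[name] for row in chunk] for name in names})
    return groups


def read_row(groups, group_size, i):
    """Rebuild row i by touching a single group (row-store style access)."""
    group = groups[i // group_size]
    offset = i % group_size
    return {name: values[offset] for name, values in group.items()}


def scan_column(groups, name):
    """Read one attribute across all groups (column-store style access)."""
    return [value for group in groups for value in group[name]]


rows = [{"id": i, "color": c} for i, c in enumerate(["red", "blue", "red", "green", "blue"])]
groups = to_row_groups(rows, group_size=2)
```

Fetching a full row touches only one group, while scanning an attribute reads only that attribute's lists, so both access patterns stay reasonably cheap.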
Take-away message: If you are stuck, try rotating your data model. If neither the vertical nor the horizontal model is a good fit, create a hybrid.