Daniel Lemire's blog

, 1 min read

Stop generating metadata and access the full content!

Many researchers advocate the use of metadata to help find or recommend content automatically. Metadata is certainly useful when aggregating content for human beings: I first read the titles of research papers before reading them. However, machines do better when they access at least some of content (Lin, 2009). Moreover, metadata is of little value in ranking answers (Hawking and Zobel, 2007). I think that researchers cling to metadata because that is how we have indexed books for so long. When I was a kid, full text searches in a library was unthinkable. Yet, there is no escape: everything is miscellaneous. Folksonomies and ontologies will not save the day. When working with machines, let go of metadata and embrace the full content.

I am particularly puzzled by a common research approach. Take an object. Extract metadata. Then compare objects between themselves using the metadata, or use the metadata for retrieval. I understand that this may constitute a useful form of dimensionality reduction. Yet, researchers frequently omit to check whether it is necessary to extract metadata at all.

Reference

  • David Hawking and Justin Zobel, Does topic metadata help with Web search? Journal of the American Society for Information Science 58 (5), 2007.
  • Jimmy Lin, Is searching full text more effective than searching abstracts? BMC Bioinformatics 2009, 10:46, 2009.

Credit: Thanks to Andre Vellino for motivating this post.