Daniel Lemire's blog


Stop generating metadata and access the full content!

12 thoughts on “Stop generating metadata and access the full content!”

  1. Introducing intermediate steps, such as metadata extraction, should be done with care. Specifically, you should validate that it is useful!

    You’d think it would be a trivial point… yet…

  2. Hear, hear!

    (Somewhat tangentially, this is why I think Wolfram Alpha won’t scale to google-usefulness.)

  3. I agree that ontologies (as in “library classification systems”) can’t save us, but if you approve of “Everything Is Miscellaneous”, you should also approve of folksonomies.

    If people can decide their own categories or define their own labels, you at least are freed from someone else’s rigid and perhaps outdated knowledge classification scheme.

    In some circumstances, it might even work better than ML methods for term extraction from full-text. Consider the case where terminology changes over time (e.g. “AI” -> “Agents”). Folksonomic tagging might allow you to identify an ’80s paper on (what used to be known as) “knowledge representation” as relevant to a “Mobile Agent” application, whereas term-extraction alone might not give you that information.
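    A minimal sketch of that idea, assuming a toy Python inverted index that folds user tags into the same posting lists as the document terms (the papers, tags, and query below are invented for illustration):

    ```python
    # Merge folksonomy tags into the inverted index so that a query using
    # modern terminology can reach an older paper. Toy data throughout.
    from collections import defaultdict

    docs = {
        "paper_1984": {
            "text": "a frame based approach to knowledge representation",
            "tags": ["mobile agents", "knowledge representation"],
        },
        "paper_2005": {
            "text": "mobile agents for distributed information retrieval",
            "tags": ["agents"],
        },
    }

    index = defaultdict(set)
    for doc_id, doc in docs.items():
        terms = doc["text"].split()
        for tag in doc["tags"]:
            terms.extend(tag.split())  # tags and text share one index
        for term in terms:
            index[term].add(doc_id)

    # "agents" now finds the 1984 paper purely because of its tags;
    # its full text never mentions the word.
    print(index["agents"])  # {'paper_1984', 'paper_2005'}
    ```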

  4. @Daniel, you wrote:

    “Where metadata, whether human-supplied or automatically generated, is useful is in building richer interfaces for interacting with data. For example, I’m curious what an exploratory search interface would look like without any metadata (again, ignoring the source of that metadata).”

    I specifically wrote: “Metadata is certainly useful when aggregating content for human beings”.

    So, it seems we are in agreement, aren’t we?

    “In any case, I think you’re creating a false dichotomy. No one is saying you should throw away the full content. The question is whether augmenting it with a summary is useful. We may be working with machines, but the ultimate consumers are still human beings.”

    I had very specific papers in mind while writing this post. I assure you, there are people who extract metadata, throw away the document and then work exclusively from the metadata.

  5. Back in 1997, when I took my first grad-level Information Retrieval course, I remember Bruce Croft saying that there had been experiments done through the ’80s and ’90s on the relative value of full-text indexing vs. abstract-only vs. metadata/keyword-only indexing.

    Full-text won every time.

    Can’t remember any specific citations, though. I suppose I could ask Bruce.
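    I don’t have the actual experimental setups at hand, but the shape of such a comparison is easy to sketch: index the same collection three ways and ask which representation can still answer the query. A toy illustration, assuming scikit-learn (the documents and query are invented):

    ```python
    # Index the same collection as full text, abstract only, and keywords
    # only, then score one query against each index.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    collection = [
        {
            "full_text": "we study inverted index compression and show that "
                         "docid reordering improves both size and query speed",
            "abstract": "a study of inverted index compression",
            "keywords": "information retrieval",
        },
        {
            "full_text": "a survey of relational query optimization techniques",
            "abstract": "relational query optimization survey",
            "keywords": "databases",
        },
    ]
    query = ["docid reordering for index compression"]

    for field in ("full_text", "abstract", "keywords"):
        vectorizer = TfidfVectorizer()
        doc_matrix = vectorizer.fit_transform(d[field] for d in collection)
        scores = cosine_similarity(vectorizer.transform(query), doc_matrix)[0]
        print(field, [round(s, 2) for s in scores])
    # Only the full-text index can match "docid reordering": that phrase
    # never survives into the abstract or the keywords.
    ```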

  6. I also remember a paper at WSDM 2008 from some Yahoo folks that looked at folksonomic tagging of web pages. At the risk of mis-paraphrasing from memory: I think they found that once you got rid of all the tags for terms that were already found in the document itself (i.e., something a full-text search could already match on), and then got rid of all the tags that were non-descriptive opinions, such as “cool” and “I like this”, there wasn’t much information left.

    So even with folksonomies, it isn’t clear to me how much utility they really provide over full-text indexing.
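    A back-of-the-envelope version of that filtering, with an invented page, invented tags, and a made-up opinion stoplist (the actual paper’s methodology was surely more careful):

    ```python
    # Drop tags whose terms already occur in the page text, then drop
    # non-descriptive "opinion" tags, and see what survives.
    OPINION_TAGS = {"cool", "interesting", "i like this", "toread", "todo"}

    def residual_tags(page_text: str, tags: list[str]) -> list[str]:
        words_in_page = set(page_text.lower().split())
        survivors = []
        for tag in tags:
            if tag.lower() in OPINION_TAGS:
                continue  # non-descriptive opinion
            if all(w in words_in_page for w in tag.lower().split()):
                continue  # a full-text search could already match this
            survivors.append(tag)
        return survivors

    page = "python tutorial covering generators decorators and context managers"
    tags = ["python", "cool", "toread", "generators", "metaprogramming"]
    print(residual_tags(page, tags))  # ['metaprogramming']
    ```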

  7. If all you’re interested in is implementing ranked retrieval, and you have no access to human input, then, as you note, metadata extraction only serves as a form of feature reduction.

    Where metadata, whether human-supplied or automatically generated, is useful is in building richer interfaces for interacting with data. For example, I’m curious what an exploratory search interface would look like without any metadata (again, ignoring the source of that metadata).

    In any case, I think you’re creating a false dichotomy. No one is saying you should throw away the full content. The question is whether augmenting it with a summary is useful. We may be working with machines, but the ultimate consumers are still human beings.
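    To make the interface point concrete: faceted navigation, the workhorse of exploratory search, is essentially grouping results by metadata fields. A minimal sketch, with invented documents and field names:

    ```python
    # Count facet values over a result set; without metadata fields such as
    # year and venue there is nothing to pivot or drill down on.
    from collections import Counter

    results = [
        {"title": "Index compression", "year": 2007, "venue": "SIGIR"},
        {"title": "Docid reordering", "year": 2007, "venue": "CIKM"},
        {"title": "Posting lists", "year": 2005, "venue": "SIGIR"},
    ]

    for facet in ("year", "venue"):
        counts = Counter(doc[facet] for doc in results)
        print(facet, dict(counts))
    # year {2007: 2, 2005: 1}
    # venue {'SIGIR': 2, 'CIKM': 1}
    ```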

  8. Good points, Daniel. I think I agree, if you’re talking about interfaces alone. And I don’t completely disagree, if what you are saying is that you can use summaries and metadata as augmentation (e.g. enhanced term frequency) on the underlying full text.

    But too often, I see people throwing away the full content in favor of metadata only, especially in the multimedia retrieval work that I’ve done, such as music. Lots of companies and researchers do recommendation and retrieval based on user tags only, and don’t take the time or effort to signal-process and analyze the music itself, so as to use that full content as a critical piece of the similarity match.
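    For what it’s worth, here is a minimal sketch of the content-based side, assuming the librosa library and two placeholder audio files; a mean-MFCC fingerprint is about the crudest content feature one could use:

    ```python
    # Crude content-based track similarity: compare mean MFCC vectors
    # computed from the audio signal itself, not from user tags.
    import librosa
    import numpy as np

    def mfcc_profile(path: str) -> np.ndarray:
        y, sr = librosa.load(path)  # decode the audio file
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return mfcc.mean(axis=1)  # rough timbral fingerprint

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Placeholder file names: substitute two real tracks.
    print(cosine(mfcc_profile("track_a.wav"), mfcc_profile("track_b.wav")))
    ```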

  9. I wasn’t sure how broadly you meant “aggregating content”, but if that entails any processing that might improve the ability of people to interact with the content, then we’re in agreement on that point.

    As for the papers that completely throw out the content, that seems silly unless it’s for efficiency reasons. Feature reduction can certainly help eliminate noise, but I can’t think of a reason other than efficiency to employ an irrevocably lossy compression technique.

  10. Francois Rivest says:

    I think that full-text statistical indexing has proved itself better than tagging.

    A second element that a good literature search engine should have is the use of cross-references to better relate papers and to more easily find the most central ones.

    Sincerely, I think our current literature-search schemes/indexes are obsolete compared to what we could have.
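    A minimal sketch of that cross-reference idea: a PageRank-style power iteration over a toy citation graph to surface the central papers (the graph is invented, and this simplified version just drops the rank mass of papers that cite nothing):

    ```python
    # Rank papers by citation centrality with a simple power iteration.
    DAMPING = 0.85

    def pagerank(links: dict[str, list[str]], iters: int = 50) -> dict[str, float]:
        nodes = list(links)
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iters):
            new_rank = {n: (1.0 - DAMPING) / len(nodes) for n in nodes}
            for src, targets in links.items():
                if not targets:
                    continue  # dangling paper: cites nothing
                share = DAMPING * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            rank = new_rank
        return rank

    # Paper "c" is cited by everyone, so it should come out on top.
    citations = {"a": ["c"], "b": ["c", "a"], "c": [], "d": ["c", "b"]}
    for paper, score in sorted(pagerank(citations).items(), key=lambda kv: -kv[1]):
        print(paper, round(score, 3))
    ```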

  11. I completely agree that meta-data needs to be let go in many cases. It is possible that humans keep using it because we feel more comfortable with a strict definition and delimitation of the domain in which we are working. If we let go of meta-data, we are no longer in an area with clear boundaries and definitions; things become fuzzy, like in real life.

  12. Ronan Tournier says:

    I fully agree with the fact that content is more important than meta-data.

    I would also add a few comments: meta-data should be a combination of human- and computer-generated information. The computer-generated part should come from the real content and (if possible) from the context (i.e., the full website that hosts a page, and so on).

    Moreover, one should keep in mind that meta-data changes over time and varies from one user to another (…). Thus, in the end, in my view, using meta-data requires heavier processes than using the real content. However, this issue is evaluated neither in research nor in industry.

    Why? The answer is simple: meta-data comes with the documents, so why bother generating something else? Let’s just suppose that the available meta-data is 1) correct, 2) stable over time, and 3) consistent across users… and just use it!