Daniel Lemire's blog

When bad ideas will not die: from classical AI to Linked Data

21 thoughts on “When bad ideas will not die: from classical AI to Linked Data”

  1. Ozzy says:

    So what are you suggesting? Not to fund anyone? Kill the web? Maybe stop using Google until they come up with the proper solution, since this one is awful…

  2. Rzluf says:

    It’s a very interesting article. However, I have doubts about the uselessness of the Semantic Web. For example, Google Translate says that it uses WordNet and derivatives:
    http://translate.google.com/about/intl/en_ALL/

  3. Asraful says:

    As a young learner, I am a little bit confused after reading this.

    Yes, the important point is that there is no effective result or output out there using the Semantic Web.

    I was studying a bunch of papers on ontology, ontology matching, and semantic similarity techniques, but they seem too tough to fit to real-world problems.

  4. @Rzluf

    Google Translation relies on many techniques, but it is not a reasoning engine running over an RDF store. It is not part of the Semantic Web.

    WordNet is a brilliant and useful project, but it existed before the Web was conceived, it does not use RDF, and it is not even an ontology. The relationship with the Semantic Web is tenuous at best.

    Please do not make the mistake of counting anything that has to do with semantics (e.g., dictionaries) as belonging to the Semantic Web. The Semantic Web is nothing but the rebranding of classical AI: reasoning engines over predicates.
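
    To make “reasoning engines over predicates” concrete, here is a minimal illustrative sketch in Python, assuming the rdflib package; the tiny Dog/Animal ontology is invented for the example and is not from the post:

        from rdflib import Graph, Namespace
        from rdflib.namespace import RDF, RDFS

        EX = Namespace("http://example.org/")

        g = Graph()
        g.add((EX.Dog, RDFS.subClassOf, EX.Animal))  # predicate: every Dog is an Animal
        g.add((EX.Rex, RDF.type, EX.Dog))            # fact: Rex is a Dog

        # One classic entailment rule, applied by brute-force forward chaining:
        # if (?x rdf:type ?c) and (?c rdfs:subClassOf ?d) then (?x rdf:type ?d).
        changed = True
        while changed:
            changed = False
            for x, _, c in list(g.triples((None, RDF.type, None))):
                for _, _, d in list(g.triples((c, RDFS.subClassOf, None))):
                    if (x, RDF.type, d) not in g:
                        g.add((x, RDF.type, d))
                        changed = True

        print((EX.Rex, RDF.type, EX.Animal) in g)  # True: inferred, never asserted

    Curated predicates plus rules that derive new facts: that is the machinery being discussed here.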

  5. @Ozzy

    I have to disagree with you about Google being awful. I think that millions of people can see every day how well it works.

    Meanwhile, what have we gotten for the billions invested in classical AI? Expert systems nobody uses?

  6. “awful” or “evil” ?

    S.

  7. Ozzy says:

    BTW, wasn’t Google spawned from funded research in those fields (e.g., expert systems)?

  8. Jay says:

    I’m not sure what to make of the rhetoric here. No one, least of all the Linked Data community, is arguing that pure logic will win the day. But there’s a widespread recognition that structured knowledge is a stepping stone to many of the goals in AI. You point to Google, but Google is a main supporter of schema.org, and has recently touted its Knowledge Graph and Knowledge Vault projects as key ingredients in search. When you ask your phone “Who won the game tonight?” the inferences necessary are not so different from the ones that your much-vilified classical AI tackled.

  9. Edi Bice says:

    Standard TFIDF has reached its limits. Two documents, one mentioning Al Gore and another mentioning George Bush, are way more similar than TFIDF would have you believe. The knowledge that both are politicians brings the documents closer; knowing that they ran against each other brings them closer still, and so on. I use this in my work and I believe so does Google. It does not have to be a pure solution using reasoners (predicate calculus) for knowledge bases to be useful.
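
    As a rough illustration of this point (a sketch, not the commenter’s code), assuming scikit-learn is installed; the two example sentences and the flat “politician” boost are invented:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        doc_gore = "Al Gore campaigned on climate policy during the election."
        doc_bush = "George Bush campaigned on tax policy during the election."

        # Plain TF-IDF: the two names share no tokens, so the fact that both
        # are politicians contributes nothing to the similarity score.
        tfidf = TfidfVectorizer().fit_transform([doc_gore, doc_bush])
        base = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

        # A crude knowledge-based adjustment: if a background knowledge base
        # says both entities have the same type, nudge the score upward.
        knowledge = {"al gore": "politician", "george bush": "politician"}
        boost = 0.2 if knowledge["al gore"] == knowledge["george bush"] else 0.0
        print(base, base + boost)  # second number reflects the shared knowledge

    The weighting here is a toy; the point, as in the comment, is that the knowledge base supplements rather than replaces the statistics.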

  10. Andrew Dalke says:

    My own thinking on the differences between classical AI and what we now call artificial intelligence is strongly influenced by Alex Martelli’s “rant” in comp.lang.python in 2003. Quoting from https://mail.python.org/pipermail/python-list/2003-October/222308.html .

    “In the ’80s, when at IBM Research we developed the first large-vocabulary real-time dictation taking systems, I remember continuous attacks coming from the Artificial Intelligentsia due to the fact that we were using NO “AI” techniques — rather, stuff named after Bayes, Markov and Viterbi, all dead white mathematicians (it sure didn’t help that our languages were PL/I, Rexx, Fortran, and the like — no, particularly, that our system _worked_, the most unforgivable of sins:-). I recall T-shirts boldly emblazoned with “P(A|B) = P(B|A) P(A) / P(B)” worn at computational linguistics conferences as a deliberately inflammatory gesture, too:-).”

  11. Carles Farré says:

    Once I heard a top Google guy saying they are an AI company. Google is chasing the holy grail of AI, so perhaps it is too early to de-hype it.

  12. Google is not a classical AI shop. The way Googlers use the term « AI » is more likely to mean « Machine Learning ». It is entirely different from classical AI.

    Google is a very large company so some classical AI can be found. But there are also Googlers who believe that aliens visit the Earth regularly.

  13. Anne says:

    This is admittedly a colored view, but I’ve always liked the down-to-earth problems of computer vision and machine learning.

    I don’t think that we have to worry about people working on classical AI problems. Who knows what they might come up with?

    However, I admit I think most of the progress will come from people who work on perception and actuation (and especially the interaction between the two) on robots. I think it’s very unlikely we will figure out how to build a cognitive machine all at once; it will be a gradual process in which we understand, better and better, how to operate in an uncertain world.

    But we will get there. And as for decision theory, from POMDPs and Global Workspace Theory to old-fashioned expert systems, it will all bring us further!

  14. Anonymous says:

    As per my Tweet: Do you consider the World Wide Web useful? I ask because Linked Data (actually, Linked Open Data) and the World Wide Web are inextricably linked.

    The World Wide Web was a Linked Data effort, from inception. Likewise, a Semantic Web, but the basic narrative has been completely mangled, and some of that does correlate with some of the concerns expressed in this post.

    Links:

    1. http://bit.ly/world-wide-web-25-years-later
    2. http://bit.ly/evidence-that-the-world-wide-web-was-based-on-linked-data-from-inception
    3. http://bit.ly/fragment-identifiers-as-global-identifier-operators-for-the-web — Linked Data in a single slide.

  15. But Google doesn’t answer everything. It doesn’t even answer very simple questions such as “What is the highest mountain on Earth outside of the Karakoram-Himalaya range?” It is a simple question with an undisputed, straightforward answer. You can possibly get an answer if you rephrase the question exactly as the answer is written, but then why would you ask the question? You can also try to ask who was the first to summit that mountain, and you won’t even get close.

    So, how about more complex questions, such as “Where should I go on holiday?” (given these preferences), or “Where should I build my home?”, or even “Where can I go skiing?” We do build our houses, we do go on holiday, and we do go skiing, but the information we obtain has to be structured and evaluated by ourselves. So, how about “How do I find a spouse?”, or “How do I avoid diseases?”, or “What can I do to not contribute to global warming?” No, Google isn’t solving real-world problems. Google is solving certain problems faster, but the problems it solves are constrained to those perceived as simple enough for programmers to solve faster. It is not largely motivated by real human problems; it is motivated by what programmers think is simple enough. And we thank them for solving those problems faster, of course, but it isn’t all that impressive. And BTW, where’s my flying car?

    Moreover, you didn’t get the history all that right either. The Semantic Web doesn’t stem from AI. The Semantic Web is the convergence of three communities: the Web community, the digital libraries community, and indeed the AI community. You are right that in academia the AI community is dominant. There are many reasons for this, but a large part of the problem is that you are largely judged on academic contribution rather than impact, and a contribution usually looks more impressive if it involves AI.

    Linked Data has extremely little to do with AI. There is very little, if any, reasoning involved in Linked Data, even in the academic community. And that’s why timbl articulated this direction; go read his design issue at least. Linked Data is more like a graph of global identifiers that can be traversed or queried. No reasoning. No relabeling of AI. If you want to make the assertion that Linked Data is AI rebranded, please back that assertion up by showing that the papers accepted at, for example, the LDOW and COLD workshops are dominated by reasoning!
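
    To make “a graph of global identifiers that can be traversed or queried, with no reasoning” concrete, here is a minimal illustrative sketch in Python (not the commenter’s code), assuming the rdflib package and assuming the DBpedia URI below still dereferences to RDF:

        from rdflib import Graph, URIRef

        person = URIRef("http://dbpedia.org/resource/Tim_Berners-Lee")

        g = Graph()
        # Dereference the global identifier; the server returns RDF triples.
        g.parse("http://dbpedia.org/resource/Tim_Berners-Lee")

        # Traversal: follow outgoing edges from the identifier.
        for s, p, o in list(g.triples((person, None, None)))[:10]:
            print(p, o)

        # Querying: plain graph pattern matching with SPARQL, no inference involved.
        query = """
        SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Tim_Berners-Lee> ?p ?o } LIMIT 10
        """
        for row in g.query(query):
            print(row.p, row.o)

    Nothing here infers new triples; it only fetches and pattern-matches what is already asserted.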

    I think the AI angle has had an unfortunate effect on the Semantic Web. I think researchers should be allowed to pursue that direction, but when I was a member of the SPARQL working group I argued vigorously that entailment regimes should not be part of the standard at that point. I think history has proven me right: there are too many problems and too few implementations. So, please don’t say that the Semantic Web is a relabeling of AI; it is simply not true.

    Moreover, there are many lost opportunities. NoSQL was really our thing: we had schemalessness, graph databases, and even a language to query them. But we didn’t have fast and reliable code, because research prototypes never got that far.

    There are a number of us who just want to sit in a corner and hack code. Not to find answers faster to trivial problems like Google does, but to solve complex, real-world problems, problems that people actually have but can’t solve at all now. There’s little funding for that in the world, so this is progressing slowly. But we can make progress, and you could possibly use a little bit of reasoning as sugar on top. The suggestion that the Semantic Web and Linked Data are a rebranding of AI is simply false and completely lacks historical basis.

  16. Andrew Dalke says:

    Regarding AI at Google, Joel Spolsky quoted a ‘very senior Microsoft developer who moved to Google’ as saying “Google uses Bayesian filtering the way Microsoft uses the if statement.” That’s an expression of a machine learning approach, not classical AI. I’ll note that Martelli, in my earlier quote, also works at Google.
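
    For readers unfamiliar with the phrase, “Bayesian filtering” here means statistical classification in the spirit of a spam filter, not rule-based reasoning. A minimal illustrative sketch (not from the comment), assuming scikit-learn; the toy messages and labels are invented:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB

        texts = ["win money now", "cheap pills win", "meeting at noon", "lunch at noon"]
        labels = ["spam", "spam", "ham", "ham"]

        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(texts)

        # The classifier estimates P(label | words) from word counts via Bayes' rule;
        # there is no hand-built ontology or rule base anywhere.
        classifier = MultinomialNB().fit(X, labels)
        print(classifier.predict(vectorizer.transform(["win cheap pills"])))  # ['spam']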

    If Google isn’t solving real world problems, then my problems – some of which were solved using Google – must not be real world problems. I wonder which non-real world I live in.

    Switching directions, I can make the same objections regarding linked data. Some of my problems include: how do I get people to buy my software and fund my research, how do I get sub-second response time for chemical substructure search, and what tango steps should I teach in the next lesson? It seems that linked data only solves problems that programmers think are simple enough.

    I looked at the LDOW 2014 papers. Nearly all seem to be about solving problems within linked data. [Integration] 1) RML is a mapping language for when different people use different definitions, which is apparently a frequent occurrence in real data sets. 2) The Tabular Data talk uses machine learning techniques to extract triples from tabular data. 3) AIDA-light uses machine learning techniques for entity extraction. 4) Web-Scale Querying, which comments that ‘The disappointingly low availability of public sparql endpoints is the Semantic Web community’s very own “Inconvenient Truth”’, proposes that complex server-side SPARQL is the problem and that the client should be doing the hard work.

    [Exploration] 5) DBpedia Viewer is just that, though in this context I observe that the talk doesn’t mention anything about the problem being solved, other than viewing DBpedia. 6) The Linked Data Query Wizard comments that “The problem of easy-to-use interfaces for accessing Linked Data is still largely unsolved”. It conjectures that most users want a free-text search box like Google and others, tries to implement it, and runs into difficulties because ‘performant full-text search is not addressed at all’ in SPARQL 1.1. I’m happy that this one does user testing! 7) Programmable Analytics; this one confuses me. I think it only says that you can use R to work with RDF data. 8) Inverse Link Traversal proposes a new way to get information from a data store.

    [Linked Data Applications] 9) daQ, because data itself has different levels of quality, so there needs to be a way to describe the quality. 10) WebVTT, because video needs to be part of linked data. 11) Social Web Meets Sensor Web: this was cool! The Reptile Road Mortality project extracts data from a Facebook group on the topic, using pictures, entity extraction from the comments (e.g., the species of lizard or a geographic name), and people information to build the graph, and from that builds faceted viewers for taxon, geography, etc. 12) Linked Data Visualization Model is one of a long series of visualization tools for linked data (really! 10% of the paper and 50% of the references deal with comparisons to other tools). The authors conjecture that their viewer is good enough that public officials will release data knowing it will be easy for others to generate good visualizations. They have no evidence that that’s true.

    Yes, I read every paper from LDOW 2014. Kjernsmo is right: none of them involves reasoning on the data, though a couple use machine learning to create the linked data.

    Then again, nearly all of them are inward-looking – they address problems intrinsic to the approach (organization, uptake, new data types, and visualization) – rather than solve “real world” problems, like “which reptile species are more likely to be road kill?” Three of them pointed out that linked data is a) hard to query, b) doesn’t have the search features people want, and c) has many service problems. Those are not good signs that linked data is effective at solving the complex, real-world problems that Kjernsmo wants to work on.

    Many papers referenced DBpedia, but only a handful mentioned other data sources. Figure 1 of the daQ paper says there are 9 available databases, and of the two shown, one is OpenCyc. Looking around, I see OpenCyc is often in the top 10 listed Linked Data data sets.

    Cyc of course is a classic AI project, predicated on the idea that an ontology and facts combined with a reasoning engine can produce an AI. Linked Data in Cyc, as a standalone data set, “has extremely little to do with AI”… except that it exists for the reasoning engine to have something to work on.

    It’s very easy to conclude that the LDOW talks don’t include reasoning because that topic is back in the AI winter, and Linked Data conferences want to avoid the inconvenient history which motivated the original effort.

    (A spot-check with COLD’s papers leaves my conclusions unchanged.)

    Regarding the Semantic Web and connection to AI, I quote from Berners-Lee’s 2001 SciAm article: “For the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning. Artificial-intelligence researchers have studied such systems since long before the Web was developed. Knowledge representation, as this technology is often called, is currently in a state comparable to that of hypertext before the advent of the Web: it is clearly a good idea, and some very nice demonstrations exist, but it has not yet changed the world. It contains the seeds of important applications, but to realize its full potential it must be linked into a single global system.”

    There’s no rebranding. Berners-Lee included inference-based reasoning at the start of the Semantic Web (“a Web in which machine reasoning will be ubiquitous and devastatingly powerful” and “[RDF] does not address the heuristics of any particular reasoning engine, which is an open field made all the more open and fruitful by the Semantic Web”), and based it firmly on methods AI researchers developed for knowledge representation. The connection is completely justified by the historical record.

    That effort failed. The top levels of the Semantic Web Layer Cake have collapsed, leaving only the lower layer, Linked Data, which reminds me of Cyc. While Linked Data can do things (unlike the reasoning systems), very little seems to have come out of it other than papers about Linked Data.

    Looking again at the SciAm paper, Berners-Lee writes: “semantics were encoded into the Web page when the clinic’s office manager (who never took Comp Sci 101) massaged it into shape using off-the-shelf software for writing Semantic Web pages along with resources listed on the Physical Therapy Association’s site.”

    Neither the Semantic Web nor Linked Data have yet achieved this dream.

  17. Trylks says:

    The AI winter was triggered mostly by the abandonment of connectionism. Now Google (a reference company doing very real, applied, and profitable stuff) is becoming strong in deep learning, along with other companies. I’m not sure what the main point here is, let alone the arguments.

    However, there are a few points that I consider relevant:

    – Needs more dogfooding. A bottom-up approach is good for this; top-down may be frustrating.

    – There doesn’t seem to be that much interest (aka funding and business models) in AGI. Big leaps require big investments and are very risky. It is unclear why we would need AGI with current (global) unemployment levels. Maybe economists could use some additional funding to solve some “real problems”.

    – The ivory tower: I don’t think this problem is specific to the Semantic Web community. It’s something to keep in mind, though.

    – Google and other companies use microformats and RDFa annotations.

    – Since CSS took over presentation, HTML has been becoming more and more semantic.

    – I don’t understand the distinction between “classical AI” and “machine learning AI”, and how the concept of “seed AI” fits in it.

    I like the analogy from Hofstadter: «AI has become too much like the man who tries to get to the moon by climbing a tree: “One can report steady progress, all the way to the top of the tree.”» (I’m not sure exactly what he said; source: http://www.theatlantic.com/magazine/archive/2013/11/the-man-who-would-teach-machines-to-think/309529/).

    But none of this matters; in the end it is money that drives the world. It seems to be in data science now, but please let me know if you know (or think) that it’s going to be somewhere else in the future.

  18. Trylks says:

    I would say that what you refer to as “machine learning” is functionalism, and finding the oldest dog without knowing what a dog is or what old means is a perfect example of the Chinese room.

    This is not very satisfying from an AGI point of view, because without knowing what a dog is and what old means, it’s very likely that the results retrieved will be dogs appropriate for old people, a much more plausible and statistically likely search (maybe). Actually, this is the kind of problem that Ben Goertzel usually mentions, with queries like “how long does a *dead* pig live?” (yeah, trolling Google).

    You mention that Google will end up using a mix of technologies and some of them may be “bad ideas” that actually may not be that bad in the end.

    I love how connectionism comes and goes. _IMHO_ it depends on how hardware and software progress. When hardware goes ahead by a big distance then it’s an interesting option. When it’s not so far then there are less obscure and more efficient ways to use it. Deep learning is connectionism, and the abandonment of connectionism was one of the triggers of the AI winter.

    It happens constantly: “bad ideas” reappear, and then they are as good as new! I like this talk:

    http://worrydream.com/dbx/

    I haven’t seen much progress in AGI lately, or ever. I would appreciate some pointers, but maybe that’s a different topic, a very interesting one.

    Anyway, there are many philosophers in the Semantic Web community, so it will never die; philosophers are professionals at keeping ancient far-fetched ideas alive. So you can expect many problems to be found, and very few of them to be solved. Honestly, I wanted to do a PhD in AGI, but I couldn’t find where or how, and the Semantic Web was very handy. It’s been crap so far, but an instructive kind of crap.

    One of the interesting things is the formalisation of definitions, the semantics. I still don’t find a clear distinction between “classical AI” and “machine learning AI”, and I can’t figure out where fuzzy logic fits in that classification. After all, elephants don’t play chess, but they don’t know anything about statistics either…

    What do we need these definitions for? Well… we don’t need them, but wait and see how _everybody_ is a data scientist in 2015. Linked data is about data, machine learning is about learning on data, accounting is about data, and we are all doing science after all, and philosophy, it’s a *Ph*D. The metrics are about papers and citations; better to be the first than the last. A fictional problem will probably have more citations than a real solution (especially when it’s source code and not a paper). Citations mean money. If you want examples of the endurance of bad ideas, look at economics.

  19. Kjetil Kjernsmo says:

    @Andrew Dalke: I think you are pointing out real problems. In fact, some of the things you point out are matters of deep soul searching within the community already. I have myself been involved for 15 years, and I’m rather frustrated by the lack of progress we’ve had.

    Actually, a free-text index in SPARQL 1.1 was one of the things that I fought very hard for, but I lost due to OWL.

    Unfortunately, we are not paid to do deep soul searching. We are paid to produce papers that are cited by others, which again are cited by others. It is an unfortunate fact that scientists are not paid to do science; they are paid to enhance their bibliometric quality and their ability to sound compelling in grant applications.

    Thus, the soul searching happens in the lunch breaks or in private emails. Only occasionally, it breaks out into certain papers, one of which you found in LDOW.

    There are two main points to my previous comment. One is the accusation that Linked Data is AI rebranded; it isn’t, as you confirmed. The other is that the state of the art is sad. And indeed, Google is in many ways the state of the art, but it is still sad. So, good for you that it solved some of your problems. But it isn’t solving mine. And I think the whole industry is to blame for the complete lack of progress we’ve made over the last decade. And, oh, BTW, chemical substructure search is something you may well find if you dig deeper into the Semantic Web literature.

    I think Linked Data has a lot of potential for solving the issues that are holding us back. Indeed, we haven’t done it yet, and indeed, there’s a lack of progress in this community too, partly because most researchers can’t find funding to do what needs to be done on the more “trivial” level.

    Robin Berjon (who has a better clue about RDF than he cares to admit) has a very interesting criticism: http://berjon.com/linked-data/ This criticism, as opposed to the invalid criticism that LD == AI, is valid, and points out clear directions for us to work on. It has to become much more sensible for programmers to work with graphs rather than forests. We have to solve that problem, but as of now, I don’t think anybody in our community is working on it.

    I value criticism like that. It is very useful. Criticism that says LD == AI is not useful; it is wrong, and it doesn’t contribute anything. I like criticism that points out things we do not already know.

  20. @Trylks

    I don’t understand the distinction between “classical AI” and “machine learning AI”

    Classical AI is about collecting facts (predicates) and then aggregating them. The idea is that if you have enough facts, you can reason about the world.

    There are many fundamental problems with it, but to sum it up: the world is messy. It is hard to keep your facts straight…

    Machine learning takes the opposite, purely statistical point of view. It says that you do not need to know what a dog is… or what age is… to find the oldest dog.
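
    A minimal sketch of this contrast in Python (illustrative only; the toy data is invented and neither snippet is meant as anyone’s actual system):

        # Classical AI style: curated facts plus a rule that reasons over them.
        facts = {
            ("Rex", "is_a"): "dog",
            ("Rex", "age"): 14,
            ("Whiskers", "is_a"): "cat",
            ("Whiskers", "age"): 17,
        }

        def oldest_of_kind(kind):
            # The "reasoning" only works if every fact was collected and kept consistent.
            ages = {name: facts[(name, "age")]
                    for (name, relation), value in facts.items()
                    if relation == "is_a" and value == kind}
            return max(ages, key=ages.get)

        print(oldest_of_kind("dog"))  # Rex

        # Machine-learning style: no definition of "dog" or "age" anywhere; just score
        # documents by (here, trivially) how strongly they associate with the query.
        documents = ["Rex, a 14-year-old dog, may be the oldest dog in town.",
                     "Whiskers the cat naps all day."]
        query = "oldest dog"

        def score(document):
            return sum(document.lower().count(word) for word in query.split())

        print(max(documents, key=score))  # the sentence about the oldest dog

    Real systems replace the toy score with statistical models learned from data, but the division of labor is the same: curated facts and rules on one side, statistics over raw data on the other.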

    A company like Google will end up using a mix of different techniques to achieve its goals, but collecting and curating “facts” that can be reasoned over is only going to be a tiny part of their work.

    Now Google -a reference company doing very real, applied and profitable stuff- is becoming strong in deep learning, along with other companies.

    Deep learning is a form of machine learning. It is unrelated to classical AI.

    There doesn’t seem to be that much interest (aka funding and business models) in AGI.

    The people I know who do work relevant to AGI are not classical AI people.

    Google and other companies use microformats and RDFa annotations (…) CSS, HTML is becoming more and more semantic

    People have been doing semantics for hundreds of years… well before the term artificial intelligence was coined. There are entire industries around meta-data, and they have nothing to do with classical AI per se.

    Microformats and microdata are about semantics, but they are not the Semantic Web architecture or classical AI.

  21. Andrew Dalke says:

    @Kjernsmo: The suggestion that scientists are not paid to do deep soul searching is a dangerous one. If you are a scientist, then can I conclude that you are not paid enough to do the deep soul searching needed to recognize that “linked data” is part of the same AI that led to the AI winter?

    You go on to basically deny that I, a self-funded scientist who is not dependent on bibliometric quality, even exist.

    My conclusion is that Linked Data exists only because of the AI heritage of the Semantic Web, and that it’s identical to the knowledge representation of classical AI. It’s the only part of the Semantic Web Layer Cake that can do anything practical, but that’s faint praise given the difficulty of pointing to real-world solutions that couldn’t as easily have been solved with non-linked-data approaches.

    I see now that I used the term “rebranding” incorrectly for this context. I don’t consider Linked Data as anything other than AI, so it can’t be a rebranding. I believe that Lemire uses the term for those who say that Linked Data is its own independent concept. He is correct. My apologies for the confusion.

    I also reject the idea that “Linked Data” was meant as the umbrella term for metadata and links between records. You can see this in the Dublin Core archives when they talk about the changes they made to fit the Linked Data approach. E.g., “Conventionally, LIS professionals have worked primarily by starting with tools pre-configured for standardized data formats such as MARC (for library catalogs) or Simple Dublin Core, the pre-Linked-Data format for item-level description required since 2001 …”.

    If “Linked Data” is now the umbrella term, it’s because it’s been rebranded from its original term. I’ll guess that it started around the time of Berners-Lee’s “Giant Global Graph” essay in 2007, which makes no mention of any sort of reasoning engines.

    Regarding substructure search, my interest is sub-second performance for interactive searches. Several have attempted to do this in SPARQL. Quoting from doi:10.1186/1758-2946-3-20 from 2011: “Unfortunately, since the SPARQL query engines currently available have not been explicitly optimized [for] chemical searching needs, they often lack many of the mechanisms developed over the past decades to accelerate the solution of this problem, and resemble the brute force approach to graph matching more closely.”

    As well as “… the increase of search pattern size or complexity had an even more profound effect on query completion, from exceeding the completion time limit to SPARQL query engine-triggered query termination due to the query computational load exceeding any reasonable expectations, of 11 years, for example … many commercially and scientifically important chemical databases enumerate several orders more molecular entities, casting a shadow over the applicability of this approach to large-scale applications, until a more detailed study of search performance demonstrates otherwise. ”

    That’s hardly a firm endorsement of a linked data approach.

    Also, if you read the paper you’ll see they didn’t get the message that RDF isn’t about inference engines, and so they use phrases like “In order to enable reasoning and inference over this chemical information” and “Thus, an ideal representation would be able to refer to every chemical entity and its part unambiguously and to capture information in a controlled, reproducible, and machine-understandable way to enable machine reasoning and to facilitate data integration.”

    There are certainly ways to extend SPARQL for chemical searches, similar to what GeoSPARQL does, or what various vendors have done for usable free-text search. Chem2Bio2RDF did that in doi:10.1186/1471-2105-11-255. As far as I can tell, it’s a dead project: the syntax of the SPARQL extensions was never described, and the web site’s links go to “This wiki’s subscription has expired.”

    All evidence suggests though that linked data doesn’t solve my problems while other approaches do.