Daniel Lemire's blog


How close are AI systems to human-level intelligence? The Allen AI challenge.

With respect to artificial intelligence, some people are squarely in the “optimist” camp, believing that we are “nearly there” when it comes to producing human-level intelligence. Microsoft co-founder Paul Allen has been somewhat more prudent:

While we have learned a great deal about how to build individual AI systems that do seemingly intelligent things, our systems have always remained brittle—their performance boundaries are rigidly set by their internal assumptions and defining algorithms, they cannot generalize, and they frequently give nonsensical answers outside of their specific focus areas.

So Allen does not believe that we will see human-level artificial intelligence in this century. But he nevertheless generously created a foundation aiming to develop such human-level intelligence, the Allen Institute for Artificial Intelligence (AI2). The Institute is led by Oren Etzioni, who obviously shares some of Allen’s “pessimistic” views. Etzioni has made it clear that he feels that the recent breakthroughs of Google’s DeepMind (i.e., beating the best human beings at Go) should not be exaggerated. As an example, Etzioni pointed to the fact that their research-paper search engine (Semantic Scholar) can differentiate between the significant citations and the less significant ones. DeepMind’s engine works by looking at many, many examples and learning from them because they are clearly and objectively labeled (we know who wins and who loses a given game of Go). But there is no win/lose label on the content of research papers. In other words, human beings become intelligent in an unsupervised manner, often working from few examples and few objective labels.

To try to assess how far off we are from human-level intelligence, the Allen Institute launched a game in which people had to design an artificial intelligence capable of passing 8th-grade science tests. They gave generous prizes to the three best teams. The questions touch on various scientific domains:

  • How many chromosomes does the human body cell contain?
  • How could city administrators encourage energy conservation?
  • What do earthquakes tell scientists about the history of the planet?
  • Describe a relationship between the distance from Earth and a characteristic of a star.

So how far are we from human-level intelligence? The Institute published the results in a short paper.

Interestingly, all three top scores were very close (within 1%). The first prize went to Chaim Linhart who scored 59%. My congratulations to him!

How good is 59%? That’s the glass half-full, glass half-empty problem. Possibly, the researchers from the Allen Institute do not think it qualifies as human-level intelligence. I do not think that they set a threshold ahead of time. They do not tell us how many human beings would fail to get even 59%. But it seems that they have now set the threshold at 80%. Is this because that is what human-level intelligence represents?

All three winners expressed that it was clear that applying a deeper, semantic level of reasoning with scientific knowledge to the questions and answers would be the key to achieving scores of 80% and beyond, and to demonstrating what might be considered true artificial intelligence.

It is also unclear whether 59% represents the best an AI could do right now. We only know that the participants in the game organized by the Institute could not do better at this point. What score can the researchers from the Allen Institute get on their own game? I could not find this information.

What is interesting, however, is that, for the most part, the teams threw lots of data into a search engine and used information retrieval techniques combined with basic machine learning algorithms to solve the problem. If you are keeping track, this is reminiscent of how DeepMind managed to beat the best human player at Go: use good indexes over lots of data coupled with unsurprising machine learning algorithms. Researchers from the Allen Institute appear to think that this illustrates our current limitations:

In the end, each of the winning models found the most benefit in information retrieval based methods. This is indicative of the state of AI technology in this area of research; we can’t ace an 8th grade science exam because we do not currently have AI systems capable of going beyond the surface text to a deeper understanding of the meaning underlying each question, and then successfully using reasoning to find the appropriate answer.

(The researchers from the Allen Institute invite us to go play with their own artificial intelligence, called Aristo. So they do have a system capable of taking 8th-grade science tests. Where are the scores?)
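To make the retrieval-based recipe concrete, here is a minimal, hypothetical sketch: score each multiple-choice answer by how well the question combined with that answer matches a passage in a corpus of science text, and pick the best match. The tiny corpus and the question below are made up for illustration; the winning systems were, of course, far more elaborate.

```python
# Hypothetical sketch of an information-retrieval approach to multiple-choice
# science questions: pick the answer whose text (together with the question)
# best matches some passage in a science corpus. All data here is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "A typical human body cell contains 46 chromosomes.",
    "Earthquakes occur along faults where tectonic plates move past each other.",
    "Turning off unused lights helps a city conserve energy.",
]

vectorizer = TfidfVectorizer(stop_words="english")
corpus_matrix = vectorizer.fit_transform(corpus)

def answer_question(question, choices):
    """Return the choice whose combined text best matches a corpus passage."""
    best_choice, best_score = None, -1.0
    for choice in choices:
        query = vectorizer.transform([question + " " + choice])
        score = cosine_similarity(query, corpus_matrix).max()
        if score > best_score:
            best_choice, best_score = choice, score
    return best_choice

print(answer_question(
    "How many chromosomes does the human body cell contain?",
    ["23", "46", "64", "92"],
))  # prints "46" with this toy corpus
```

Something along these lines, scaled up to a much larger corpus and a real search index, is essentially what the quote above calls “information retrieval based methods”: matching surface text rather than reasoning about meaning.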

So, how close are we to human-level artificial intelligence? My problem with this question is that it assumes we have an objective metric. When you try to land human beings on the Moon, there is an objective way to assess your results. By their own admission, the Allen Institute researchers tell us that computers can probably already pass Alan Turing’s test, but they (rightfully) dismiss the Turing test as flawed. Reasonably enough, they propose passing 8th-grade science tests as a new metric. It does not seem far-fetched to me at all that people could, soon, build software that can ace 8th-grade science tests. Certainly, there is no need to wait until the end of this century. But if I built an artificial intelligence that could ace these tests, would they then say that I have cracked human-level artificial intelligence? I suspect that they would not.

And then there is a little embarrassing fact: we can already achieve super-human intelligence. Go back to 1975, but bring the Google search engine with you. Put it in a box with flashy lights. Most people would agree that the search engine is the equivalent of a very advanced artificial intelligence. There would be no doubt.

Moreover, unlike human intelligence, Google’s intelligence is beyond our biology. There are billions of human brains… it makes no practical sense to limit computers to what brains can do when it is obviously more profitable to build machines that can do what brains cannot do. We do not ask for cars that walk like we do or for planes that fly like birds… why would we want computers that think like we do?

Given our limited knowledge, the whole question of assessing how close we are to human-level intelligence looks dangerously close to a philosophical question… and I mean this in a pejorative sense. I think that many “optimists” looking at the 59% score would say that we are very close to human-level intelligence. Others would say that the winners only got 59% by using a massive database. But we should have learned one thing: science, not philosophy, is the engine of progress and prosperity. Until we can make the question precise, asking whether we can achieve human-level intelligence with software is an endlessly debatable question, akin to asking how many angels fit in a spoon.

Still, I think we should celebrate the work done by the Allen Institute. Not because we care necessarily about mimicking human-level intelligence, but because software that can pass science tests is likely to serve as an inspiration for software that can read our biology textbooks, look at experimental data, and maybe help us find cures for cancer or Alzheimer’s. The great thing about an objective competition, like passing 8th-grade science tests, is that it cuts through the fog. There is no need for marketing material and press releases. You get the questions and your software answers them. It does well or it does not.

And what about the future? It looks bright:

In 2016, AI2 plans to launch a new, $1 million challenge, inviting the wider world to take the next big steps in AI (…)