[info] reddit: interview with vapnik:

Alejandro Dubrovsky <alito at organicrobot.com> on Fri Nov 14 15:51:35 CET 2008

(
http://www.learningtheory.org/index.php?option=com_content&view=article&id=9%3Aqlearning-has-just-startedq-an-interview-with-prof-vladimir-vapnik&catid=12%3Ainterviews&Itemid=8&showall=1
)

"Learning Has Just Started" - an interview with Prof. Vladimir Vapnik
PDF Print E-mail 
Written by Ran Gilad-Bachrach 
Sunday, 02 March 2008 00:00

As a part of the renovation of the learningTheory.org web site, we are
launching a series of interviews with leading researchers in learning
theory and related fields. We are proud that Prof. Vladimir Vapnik
accepted our invitation to be the first to be interviewed.

 Prof. Vapnik has been working on learning theory related problems for
more than four decades. Together with Alexey Chervonenkis he studied the
problem of uniform convergence of empirical means and developed the VC
theory. He also developed the large margin principles and the Support
Vector Machines algorithm.

R-GB: Thank you for accepting our invitation to be the first one to be
interviewed for learningtheory.org. Can you tell us what your current
research directions are?

V-V: My current research interest is to develop advanced models of
empirical inference. I think that the problem of machine learning is not
just a technical problem. It is a general problem of philosophy of
empirical inference. One of the ways for inference is induction. The
main philosophy of inference developed in the past strongly connected
the empirical inference to the inductive learning process.

 I believe that induction is a rather restrictive model of learning and
I am trying to develop more advanced models. First, I am trying to
develop non-inductive methods of inference, such as transductive
inference, selective inference, and many other options. Second, I am
trying to introduce non-classical ways of inference. Here is an example
of such an inference. In the classical scheme, given a set of admissible
indicator functions {f(x)} and given a set of training data, pairs
(xi,yi) X´{±1}, one tries to find the best classification function in
this set. In the new setting, called master-class learning, we have also
given a set of admissible functions {f(x)} and our goal is also to find
the best classification function in this set. However, for the training
data we are given additional information: we are given triplets
(xi,x*i,yi), where vectors x*i belong to space X* (generally speaking,
different from the space $X$). These vectors are carriers of "hidden
information" about vectors x (they will not be available during
testing). These vectors can be a special type of holistic description of
the data.

 For example, when you have a technical description x of the object and
have some impression x* about this object you have two forms of
description: a formal description and a holistic description or Gestalt
description. Using both descriptions during training can help to find a
better decision function. This technique remains master-class learning,
like musicians training in master classes. The teacher does not show
exactly how to play. He talks to students and gives some images
transmitting some hidden information - and this helps. So, the challenge
is to create an algorithm which using additional information, will
generalize better than classical algorithms.

 I believe that learning has just started, because whatever we did
before, it was some sort of a classical setting known to classical
statistics as well. Now we come to the moment where we are trying to
develop a new philosophy which goes beyond classical models.

R-GB: You gave the example of master-classes where you see this
additional information, can you give another example?

V-V: Consider for example a figure skating coach. The coach cannot skate
as well as a good young skater; nevertheless, he can explain how to ski.
The explanation is not in technical details but something like what you
should more focus on or giving you some images you should think of. You
can look at it as if it is just blah-blah-blah, but it is not.
 This type of description contains hidden information that affects your
choice of a good rule. We checked this opportunity in the digit
recognition task. We developed metaphoric descriptions of all digits of
the training set, and used these descriptions to improve performance,
and it works. This is what real learning is about: it uses technical
description x and uses hidden information provided by the teacher in a
completely different language to create a good technical decision rule.

R-GB: Do you think this setting can be formalized in the same sense that
uniform convergence was formalized?

V-V: It is easy to formalize it, and you can use it with well-known
algorithms like support vector machines. In support vector machines one
uses many independent slack parameters in the optimization process.
Hidden information leads to a restricted set of admissible slack
functions which have a smaller capacity than all possible slack
functions used in classical SVMs. Many of these ideas were discussed in
the after-word in the second edition of my 1982 book "Estimation of
Dependencies Based on Empirical Data".

 I believe that something drastic has happened in computer science and
machine learning. Until recently, philosophy was based on the very
simple idea that the world is simple. In machine learning, for the first
time, we have examples where the world is not simple. For example, when
we solve the "forest" problem (which is a low-dimensional problem) and
use data of size 15,000 we get 85%-87% accuracy. However, when we use
500,000 training examples we achieve 98% of correct answers. This means
that a good decision rule is not a simple one, it cannot be described by
a very few parameters. This is actually a crucial point in approach to
empirical inference.

 This point was very well described by Einstein who said "when the
solution is simple, God is answering". That is, if a law is simple we
can find it. He also said "when the number of factors coming into play
is too large, scientific methods in most cases fail". In machine
learning we dealing with a large number of factors. So the question is
what is the real world? Is it simple or complex? Machine learning shows
that there are examples of complex worlds. We should approach complex
worlds from a completely different position than simple worlds. For
example, in a complex world one should give up explain-ability (the main
goal in classical science) to gain a better predict-ability.

R-GB: Do you claim that the assumption of mathematics and other sciences
that there are very few and simple rules that govern the world is wrong?

V-V: I believe that it is wrong. As I mentioned before, the
(low-dimensional) problem "forest" has a perfect solution, but it is not
simple and you cannot obtain this solution using 15,000 examples.

R-GB: Maybe it is because learning from examples is too limited?

V-V: It is limited, but it is not too limited. I want to stress another
point: you can get a very good decision rule, but it is a very
complicated function. It can be like a fractal. I believe that in many
cases we have these kinds of decision rules. But nonetheless, we can
make empirical inferences. In many cases, to make empirical inference,
we do not need to have a general decision rule; we can do it using
different techniques. That is why empirical inference is an interesting
problem. It is not a function estimation problem that has been known in
statistics since the time of Gauss. Now a very exciting time has come
when people try to find new ways to attack complex worlds. These ways
are based on a new philosophy of inference.

 In classical philosophy there are two principles to explain the
generalization phenomenon. One is Occam's razor and the other is
Popper's falsifiability. It turns out that by using machine learning
arguments one can show that both of them are not very good and that one
can generalize violating these principles. There are other
justifications for inferences.

R-GB: What are the main challenges that machine learning should address?

V-V: The main challenge is to attack complex worlds.

R-GB: What do you think are the main accomplishments of machine
learning?

V-V: First of all, machine learning has had a tremendous influence both
on modern intelligent technology and modern methods of inferences. Ten
years ago, when statisticians did not buy our arguments, they did not do
very well in solving high-dimensional problems. They introduced some
heuristics, but this did not work well. Now they have adopted the ideas
of statistical learning theory and this is an important achievement.

 Machine learning theory started in early 1960's with the Perceptron of
Rosenblatt and the Novikoff theorem about Perceptron algorithm. The
development of these works led to the construction of a learning theory.
In 1963, Alexey Chervonenkis and I introduced an algorithm for pattern
recognition based on optimal hyper-plane. We proved the consistency of
this algorithm using uniform convergence arguments and got a bound for
its accuracy. Generalization of these results led to the VC theory.

 From a conceptual point of view the most important part of VC theory is
the necessary and sufficient conditions for learn-ability not just
sufficient conditions (bounds). These conditions are based on capacity
concepts. There are three capacity measures, one is entropy, the second
is growth function and the last is VC dimension. VC dimension is the
most crude description of capacity. The best measure is the VC-entropy
which is different than the classical entropy. The necessary and
sufficient condition for a given probability measure states that the
ratio of the entropy to the number of examples must go to zero. What
happens if it goes to value a which is not zero? Then one can prove that
there exists in the space X a subspace X0 with probability measure a,
such that subset of training vectors that belong to this subspace can be
separated in all possible ways. This means that you cannot generalize.
This also means that if you have to choose a good function from an
admissible set of functions you can not avoid VC type of reasoning.

R-GB: What do you think about the bounds on uniform convergence? Are
they as good as we can expect them to be?

V-V: They are O.K. However the main problem is not the bound. There are
conceptual questions and technical questions. From a conceptual point of
view, you cannot avoid uniform convergence arguments; it is a necessity.
One can try to improve the bounds, but it is a technical problem. My
concern is that machine learning is not only about technical things, it
is also about philosophy: What is the complex world science about? The
improvement of the bound is an extremely interesting problem from
mathematical point of view. But even if you'll get a better bound it
will not be able help to attack the main problem: what to do in complex
worlds?

R-GB: Today we use these bounds and design algorithms to minimize these
bounds. But the bounds are loose. Are we doing the right thing?

V-V: I don't think you can get a bound which is not loose, because
technically it is very difficult. Not everything can be described in a
simple mathematical expressions. Now, people are working on transductive
inference. What is the difference between transductive inference and
inductive inference? In transductive inference, you have a set of
equivalent decision rules: rules which classify the training and test
data in the same way are equivalent. So instead of an infinite number of
hypotheses, you have a finite number of equivalence classes. For a set
of equivalence classes the problem becomes simpler; it is combinatorial.
You can measure size of equivalence classes using different measures and
this leads to different algorithms. You can restructure risk
minimization on equivalence classes. It describes the situation better.
But in this case, you will not have a general decision rule, you just
have answers for your test points. You give up generality but you gain
in accuracy. I am happy that people have started to develop transductive
inference which was introduced back in the 1970s.

R-GB: Do you have some pointers to interesting work that you ran across
recently?

V-V: There are a lot of very good works. Recently, the book
"Semi-Supervised Learning" was published. Along with semi-supervised
learning it contains chapters about transductive learning. I think that
semi-supervised learning is still an attempt to do induction. But I
believe that in order to get accuracy in inference we should give up the
inductive approach. It should be something else like transductive
inference or selective inference. For example, consider the following
problem. You are given pairs, (xi,yi), i=1,...,l for training and
simultaneously you are given m testing examples x1,…,xm. The problem is,
given these two sets, can you find k examples from the test set which
most probably belong to the first class? This is a decision making
problem. It is easier than the inductive pattern recognition because you
don't need to classify everything. Classification is difficult for the
border elements which are close to the decision boundary. It is also not
a ranking problem, because you are not interested in the ranking of
chosen vectors. It is a simpler problem, and therefore it can be solved
more accurately.

R-GB: Your SVM work gained a lot of popularity, what do you think is the
reason that made it so popular?

V-V: First of all, SVM was developed over 30 years. The first
publication we did jointly with Alexey Chervonenkis in 1963 and it was
about optimal separating hyper-planes. It is actually linear SVM. In
1992, jointly with Bernhard Boser and Isabelle Guyon, we introduced the
kernel trick and in 1995, jointly with Corinna Cortes, we introduced
slack variables and it became SVM. Why is it so popular? I believe there
are several reasons. First of all, it is effective. It gives very stable
and good results. From a theoretical point of view, it is very clear,
very simple, and allows many different generalizations, called kernel
machines. It also introduced pattern recognition to optimization
scientists and this includes new researchers in the field. There exist
several very good libraries for SVM algorithms. All together these make
SVM popular.

R-GB: It seems now days that almost all machine learning is about
margins and maximizing margins. Is it the right thing to look at?

V-V: In my research, I am trying to invent something better than margin,
because margin is not the best characteristic of capacity. Through
margin one can bound the VC dimension. But VC dimension is a loose
capacity concept. Now I am trying to introduce something instead of the
margin. I am checking different ideas such as using universums and
making inference by choosing the equivalence class which has the maximal
number of contradiction on universum. Margin is good to prove a general
point, but to find advanced techniques maybe we should think about what
could be better than margin - to move closer to entropy and not to VC
dimension.

R-GB: Switching to a different topic: you have been active in machine
learning for more than 4 decades and keep being very innovative. How do
you do that?

V-V: The problem of machine learning is very wide. It includes technical
aspects as well as philosophical aspects. It is also a unification of
humanities and technicalities. In different periods of my life I tried
to understand different aspects of this problem. Through 1960's-1970's,
I was interested in the mathematical aspects of this problem. In 1980's,
I understood the relation of the pattern recognition problem to problems
of classical philosophy of science. In 1990's, developing SVM I was
happy to introduce a theoretical line in creating algorithms instead of
heuristics. Now I see in this problem a central point for changing a
general paradigm that has existed for many hundreds of years which
separates inductive (or scientific) inference developed for dealing with
a simple world from direct (non-inductive or non-scientific) inference
which is a tool for dealing with a complex world. It is a very rich
problem that has many aspects and one can find appropriate aspects of
this problem in different periods of ones life.

R-GB: When you started in the 60's, did you consider your work as
studying learning?

V-V: Yes, I considered my research to be research in learning theory
from the very beginning. The uniform convergence theory was constructed
to justify learning algorithms developed in the 1960s and, in
particular, the optimal separating hyperplane. For me PAC theory that
started in mid 1980's was a backward step to prior development presented
in our joint with Alexey Chervonenkis book "Theory of pattern
recognition" (1974) and in my book "Estimation of Dependencies Based on
Empirical Data" (1979 Russian version and 1982 English translation).

R-GB: Can you share with us some recommendations for papers or books in
machine learning or related areas that you find interesting?

V-V: There are many excellent articles and books on machine learning.
Now is the time when many different ideas are implemented into
algorithms for solving large-scale technical, biological, and linguistic
problems. Now is a time when a lot of new facts and observations have
appeared. It is very interesting to try to follow them in order to
understand how they can fit in a general model.

 Recently, the first book in the philosophy of science appeared,
Reliable Reasoning - Induction and Statistical Learning Theory" by G.
Harman and S. Kulkarni, which is trying to explain a philosophical
component of what was done in machine learning, what is the VC
dimension, what was wrong in Popper's non-falsifiability concept in
contrast to VC non-falsifiability concept and so on. From my point of
view this book is still conservative: it is primarely about induction
but also mentions the transductive inference. Nevertheless, people in
philosophy are trying to take into account our science. I believe that
we should also try to advance general philosophical problems of
intelligence.
Last Updated ( Wednesday, 09 April 2008 04:37 ) 


More information about the info mailing list