We evaluate this approach in a large scale experimental study in which the SNoW learning architecture is used to learn representations for the 100 objects in the Columbia Object Image Database (COIL-100). Experimental results exhibit very good generalization and robustness properties of the SNoW-based method elative to other approaches. SNoW's recognition rate degrades more gracefully when the training data contains fewer views and it shows similar behavior also in some preliminary experiments with partially occluded objects.
This work develops a distribution free learning theory account to an archetypical visual recognition problem: object recognition. The problem is viewed as that of learning a representation of an object that, given a new image, is used to identify the target object in it. The learning account is developed within the PAC (Probably Approximately Correct) model of learnability. This framework allows us to (1) quantify success relative to the experience with previously observed objects, without making assumptions on the distribution, and (2) study the theoretical limits of what can be learned from images in terms of the expressivity of the intermediate representation used by the learning process. That is, learnability guarantees that objects sampled from the same distribution as the one that governs the experience of the learner are very likely to be recognized correctly. In addition, the framework gives guideline to developing practical algorithmic solutions to the problem.
Earlier works have discussed the possibility of identifying the theoretical limits of what can be learned from images and found that learning in terms of the raw representation of the images is computationally intractable. Attempts to explain this focused the dependence of learnability on the representation of the object but failed to provide a satisfactory explanation for it, or a practical solution.
The approach developed here builds on suggestions made in and relies
heavily on the development of a feature efficient learning approach. This
is a learning process capable of learning quickly and efficiently in domains
in which the number of potential features is a very large but any concept
of interest actually depends on a fairly small number of them. At the heart
of the learning approach are two assumptions that we abstract as follows:
Under these assumptions we prove that there exists an efficient algorithm that, given a collection of images labeled as positive or negative examples of the target object, can learn a good representation of the object. That is, it can learn a representation that, with high probability, would make correct predictions on future images that contain (or do not contain) the object. Furthermore, we show that under these conditions, the learned representations are robust under realistic noise conditions. A significant non-assumption of our approach is that it has no prior knowledge on the distribution of images nor it is trying to estimate it. The learned representation is guaranteed to perform well when tested on images sampled from the distribution that governed the data observed in its learning experience.
The framework developed here is very general. The explanations alluded to above can represent a variety of computational processes and information sources that operate on the image. They can depend on local properties of the image, the relative positions of primitives in the image, and even external information sources or context variables. Thus, the theoretical support given here applies also to an intermediate learning stage in a hierarchical process. In order to generate the explanations efficiently, this work assumes that they are syntactically simple in terms of the raw image. However, the explanation might as well be syntactically simple in terms of previously learned or inferred predicates, giving rise to hierarchical representation.
We use the Columbia Object Image Library (COIL-100) database in
all the experiments below (COIL is available at http://www.cs.columbia.edu/CAVE).
The COIL-100 dataset consists of color images of 100 objects where the
images of the objects that were taken at pose intervals of 5 degrees, i.e.,
72 poses per object. The images were also normalized such that the larger
of the two object dimensions (height and width) fits the image size off
128 x 128 pixels. Figure 1 shows the images of the 100 objects taken
in frontal view, i.e., zero pose angle. The 32 highlighted objects in 1
are considered more difficult to recognize by other researchers; we use
all 100 objects including these in our experiments. Each color image is
converted to a gray-scale image of 32 x 32 pixels for 128 x 128 pixels.
Figure 1 shows the images of the 100 objects taken in frontal view, i.e.,
zero pose angle. The 32 highlighted objects in Figure 1 are considered
more difficult to recognize by other researchers; we use all 100 objects
including these in our experiments. Each color image is converted to a
gray-scale image of 32 x 32 pixels for our experiments.

Figure 1. Columbia Object Image Library (COIL-100) consists of 100 objects
of varying poses (5 degrees apart). The objects are shown in row order
where the highlighted ones are those considered more difficult to recognize
by other researchers.
In principle, one may want to avoid using the nearest neighbor method
since it requires a lot of memory for storing templates and its recognition
time complexity is high. The goal here was simply to show that this simple
method is comparable to the complex SVM approaches for the case of dense
sampling. Therefore, the abovementioned recognition problem is not appropriat
for comparison among different methods.

Figure 2. Mismatched objects using the nearest neighbor method. (x:a,y:b)
means that object x with view angle a is recognized as object
y
with view angle b. It shows some of the 54 errors (out of 3,600
test samples) made by the nearest neighbor classifier when there are 36
views per object in the training set. (See paper for datils).
Table 1. Recognition rates of nearest neighbor classifier
| 30 objects randomly selected
from COIL dataset |
32 objects shown seletected by Pointil and Verri | The whole 100 objects in COIL
dataset |
|
| Errors/Tests | 14/1080 | 46/1152 | 54/3600 |
| Recognition Rate | 98.70% | 96.00% | 98.50% |
It is interesting to see that the pairs of the objects on which the nearest neighbor method misclassified have similar geometric configurations and similar poses. A close inspection shows that most of the recognition errors are made between the three packs of chewing gums, bottles and cars. Other dense sampling cases are easier for this method. Consequently, the set of selected objects in an experiment has direct effects on the recognition rate. This needs to be taken into account when evaluating results that use only a subset of the 100 objects (typically 20 to 30) from the COIL dataset for experiments. Table 1 shows the recognition rates of nearest neighbor classifiers in several experiments in which 36 poses of each object are used for templates and the remaining 36 poses are used for tests.
Given this baseline experiment we have decided to perform our experimental
comparisons in cases in which the number of views of objects available
in training is limited. Some of our preliminary results were presented
in companion papers.
Applying SNoW to 3D object recognition requires specifying the architecture
used and the representation chosen for the input images. As described above,
to perform object recognition we associate a target unit with each target
object. This target learns a definition of the object in terms of the input
features extracted from the image. We could either define a single SNoW
unit which contains target subnetworks for all the 100 different target
objects, or we may define different units, each with several
(e.g., two) competing target objects. Statistically, this approach is advantageous
although, clearly, it requires a lot more computation. The architecture
selected affects both the training time, where learning a definition for
object
a
makes use of negative examples of other objects that are
part of the same unit but, more importantly, it makes a difference in testing;
rather that two competing objects for a decision, there may be a hundred.
The chances for a spurious mistake caused by an incidental view point are
clearly much higher. On the other hand, it has significant advantages in
terms of space complexity and the appeal of the evaluation mode.
SVMs are two-class classifiers which, for an c-class pattern
recognition problem, need to train c*(c-1)/2 binary classifiers. Since
we compare the performance of the proposed SNoW-based method with SVMs,
in order to maintain a fair comparison we have to perform it in the {\em
one-against-one} scheme. That is, we use SNoW units of size two. To classify
a test instance, tournament-like pair-wise competition between all the
machines is performed and the winner determines
the label of the test instance. The recognition rates of the
SVM and SNoW based methods shown in Table 2 were performed using the one-against-one
scheme. That is, we trained 4,950 classifiers for each method and evaluated
99 (=50+25+12+6+3+2+1) classifiers on each test instance.
Table 2. Experimental results of three classifiers using the 100 objects in the COIL-100 dataset
| # of View/Test | ||||
| 36/3600 tests | 18/5400 tests | 8/6400 tests | 4/6800 tests | |
| SNoW | 95.81% | 92.31% | 85/13% | 81.46% |
| Linear SVM | 96.03% | 91.30% | 84.80% | 78.50% |
| Nearest Neighbor | 98.50% | 87.54% | 79.52% | 74.63% |
The experimental results show that the SNoW-based method performs as
well as the SVM-based method when many views of the objects are present
during training and outperforms SVM-based method when the numbers of views
is limited. Although it is not surprising to see that the recognition rate
decreases as the number of views available during training decreases, it
is worth noticing that both SNoW and SVM are capable of recognizing 3D
objects in the COIL-100 dataset with satisfactory performance if enough
views (e.g., >18) are provided. Also they seems to be fairly robust even
if only a limited number of views (e.g., 8 and
4) are used for training; the performance of both methods degrades
gracefully.
To provide some more insight into these methods, we note that in the SVM-based methods, only 27.78% (20 out of 72) of the input vectors serves as support vectors. For SNoW, out of 262,144 potential features in the pixel-based representation, only 13,805 were active in the dense case (i.e., 36 views). This shows the advantage gained from using the sparse rchitecture. However, only a small number of those may be relevant to the representation of each target, as a more careful look as the SNoW output hypothesis reveals.
An additional potential advantage of the SNoW architecture is that it does not learn discriminators, but rather can learn a representation for each object, which can then be used for prediction in the one-against-all scheme or to build hierarchical representations. However, as is shown in Table~\ref{chap6-table4}, this implies a significant degradation is the performance. Finding a way to make better predictions in the one-against-all scheme is one of the important issues for future investigation, to better exploit the advantages of this approach.
Table 3. Recognition rates of SNoW using two learning paradigms
| # of View | ||||
| SNoW | 36 | 18 | 8 | 4 |
| One-against-one | 95.81% | 92.31% | 85.13% | 81.46% |
| Linear SVM | 90.52% | 85.50% | 81.85% | 76.00% |
Table 3 shows the performance of the SNoW-based method when conjunctions
of edges are used to represent objects. As before, we vary the number
of views of an object n during training and use the rest of the
views 72-n of an object for testing. The results indicate
that conjunctions of edges provide useful information for object recognition
and that SNoW is able to learn very good object representations using these
features. The experimental results also
exhibit the relative advantage of this representation increases when
the number of views per object is limited.
Table 4. Experimental results of four classifiers using the 100 objects in the COIL-100 dataset
| # of View/Test | ||||
| 36/3600 tests | 18/5400 tests | 8/6400 tests | 4/6800 tests | |
| SNoW w/ conjunction of edges | 96.25% | 94.13% | 89.23% | 88.28% |
| SNoW w/ intensity values | 95.81% | 92.31% | 85.13% | 81.46% |
| Linear Support Vector Machine | 96.03% | 91.30% | 84.80% | 78.50% |
| Nearest Neighbor | 98.50% | 87.54% | 79.52% | 74.63% |
We evaluated the approach and demonstrated its feasibility in a large
scale experiment in the context of learning for object recognition. Our
experiments allowed us also to perform a fair comparison between two successful
and related learning methods and study them in the context of object
recognition. We have illustrated our approach in a large scale experimental
study in which we use the SNoW learning architecture to learn representations
for the 100 objects in COIL-100. Although it is
clear that object recognition in isolation is not the ultimate goal,
this study shows the potential of this computational approach as a basis
for studying and supporting more realistic visual inferences.
We note that for a fair comparison among different methods, we have
used pixel-based presentation in the experiments. It is clear, however,
that the edge-based representation used is even more effective and robust
and should be the starting point for future research. There is no question
that the RGFs used in this
work are not general enough to support more challenging recognition
problems; the intention was merely to exhibit the general approach. We
believe that pursing the direction of using complex intermediate representations
will benefit future work on recognition and, in particular, robust recognition
under various types of noise.