Learning to Recognize 3D Objects



    1. Overivew

    A learning account for the problem of object recognition is developed within the PAC (Probably Approximately Correct) model of learnability. The key assumption underlying this work is that objects can be recognized (or, discriminated) using simple representations in terms of ``syntactically" simple relations over the raw image. Although the potential number of these simple relations could be huge, only a few of them are actually present in each observed image and a fairly small number of those observed is relevant to discriminating an object. We show that these properties can be exploited to yield an efficient learning approach in terms of sample and computational complexity, within the PAC model. No assumptions are needed on the distribution of the observed objects and the learning performance is quantified relative to its past experience. Most importantly, the success of learning an object representation is naturally tied to the ability to represent it as a function of some intermediate representations extracted from the image.

    We evaluate this approach in a large scale experimental study in which the SNoW learning architecture is used to learn representations for the 100 objects in the Columbia Object Image Database (COIL-100). Experimental results exhibit very good generalization and robustness properties of the SNoW-based method  elative to other approaches. SNoW's recognition rate degrades more gracefully when the training data contains fewer views and it shows similar behavior also in some preliminary experiments with partially occluded objects.

    2. Apporach

    Statistical learning theory has had an influence on many applications ranging from classification and object recognition, grouping and segmentation, illumination modeling, scene reconstruction and others. The rising role of learning methods, made possible by significant improvements in computing power and storage, is largely motivated by the realization that explicit modeling of complex phenomena in a messy world cannot be done without a significant role of learning. Learning is required for model and knowledge acquisition as well  as to support generalization and avoid brittleness. However, many statistical and probabilistic learning models require making explicit assumptions, e.g., on the distribution that governs the occurrences of instances in the world. For visual inference problems such as recognition, categorization and detection, making these assumptions seems unrealistic.

    This work develops a distribution free learning theory account to an archetypical visual recognition problem: object recognition. The problem is viewed as that of learning a representation of an object that, given a new image, is used to identify the target object in it. The learning account is developed within the PAC (Probably Approximately Correct) model of learnability. This framework allows us to (1) quantify success relative to the experience with previously observed objects, without making assumptions on the distribution, and (2) study the theoretical limits of what can be learned from images in terms of the expressivity of the intermediate representation used by the learning process. That is, learnability guarantees that objects sampled from the same distribution as the one that governs the experience of the learner are very likely to be recognized correctly. In addition, the framework gives guideline to developing practical algorithmic solutions to the problem.

    Earlier works have discussed the possibility of identifying the theoretical limits of what can be learned from images and found that learning in terms of the raw representation of the images is computationally intractable. Attempts to explain this focused the dependence of learnability on the representation of the object but failed to provide a satisfactory explanation for it, or a practical solution.

    The approach developed here builds on suggestions made in and relies heavily on the development of a feature efficient learning approach. This is a learning process capable of learning quickly and efficiently in domains in which the number of potential features is a very large but any concept of interest actually depends on a fairly small number of them. At the heart of the learning approach are two assumptions that we abstract as follows:
     
     

            Representational:

                There exists a (possibly infinite) collection M of ``explanations'' such that an object o can be represented as a simple function of elements in M.

            Procedural:

                There exists a process that, given an image in which the target object o occurs, generates efficiently ``explanations'' in M that are present in the image and  such that, with high probability, at least one of them is in the representation of o.
     

    Under these assumptions we prove that there exists an efficient algorithm that, given a collection of images labeled as positive or negative examples of the target object, can learn a good representation of the object. That is, it can learn a representation that, with high probability, would make correct predictions on future images that contain (or do not contain) the object. Furthermore, we show that under these conditions, the learned representations are robust under realistic noise conditions.  A significant non-assumption of our approach is that it has no prior knowledge on the distribution of images nor it is trying to estimate it. The learned representation is guaranteed to perform well when tested on images sampled from the distribution that governed the data observed in its learning experience.

    The framework developed here is very general.  The explanations alluded to above can represent a variety of computational processes and information sources that  operate on the image. They can depend on local properties of the image, the relative positions of primitives in the image, and even external information sources or context variables. Thus, the theoretical support given here applies also to an intermediate learning stage in a hierarchical process. In order to generate the explanations efficiently, this work assumes that they are syntactically simple in terms of the raw image. However, the explanation might as well be syntactically simple in terms of previously learned or inferred predicates, giving rise to hierarchical representation.

    3. Results


    We use the Columbia Object Image Library (COIL-100) database in all the experiments below (COIL is available at http://www.cs.columbia.edu/CAVE). The COIL-100 dataset consists of color images of 100 objects where the images of the objects that were taken at pose intervals of 5 degrees, i.e., 72 poses per object. The images were also normalized such that the larger of the two object dimensions (height and width) fits the image size off 128 x 128 pixels.  Figure 1 shows the images of the 100 objects taken in frontal view, i.e., zero pose angle. The 32 highlighted objects in 1 are considered more difficult to recognize by other researchers; we use all 100 objects including these in our experiments. Each color image is converted to a gray-scale image of 32 x 32 pixels for 128 x 128 pixels.  Figure 1 shows the images of the 100 objects taken in frontal view, i.e., zero pose angle. The 32 highlighted objects in Figure 1 are considered more difficult to recognize by other researchers; we use all 100 objects including these in our experiments. Each color image is converted to a gray-scale image of 32 x 32 pixels for our experiments.

    Figure 1. Columbia Object Image Library (COIL-100) consists of 100 objects of varying poses (5 degrees apart). The objects are shown in row order where the highlighted ones are those considered more difficult to recognize by other researchers.
     
     

    3.1 Ground Truth of the COIL-100 Dataset

    At first glance, it seems difficult to recognize the objects in the COIL dataset because it consists of a large number of objects with varying pose, texture, shape and  size. Since each object has 72 images of different poses (5 degrees apart), many view-based recognition methods use 36 (10 degrees apart) of them for training and the remaining images for testing. However, it turns out that under these dense sampling conditions the recognition problem is not difficult (even when only grey-level images are used). Namely, in this case, instances that belong to the same object are very close to each other in the image space (where each data point represents an image of an object in a certain pose). We verified this by experimenting with a simple nearest neighbor classifier (using the Euclidean distance), resulting in an average recognition rate of 98.50% (54 errors out of 3,600 tests).  Figure 2 shows some of the objects misclassified by nearest neighbor method.

    In principle, one may want to avoid using the nearest neighbor method since it requires a lot of memory for storing templates and its recognition time complexity is high. The goal here was simply to show that this simple method is comparable to the complex SVM approaches for the case of dense sampling. Therefore, the abovementioned recognition problem is not appropriat for comparison among different methods.
     


    Figure 2. Mismatched objects using the nearest neighbor method. (x:a,y:b) means that object x with view angle a is recognized as object y with view angle b. It shows some of the 54 errors (out of 3,600 test samples) made by the nearest neighbor classifier when there are 36 views per object in the training set. (See paper for datils).
     

    Table 1. Recognition rates of nearest neighbor classifier


     
      30 objects randomly selected
    from COIL dataset
    32 objects shown seletected by Pointil and Verri The whole 100 objects in COIL
    dataset
    Errors/Tests 14/1080 46/1152 54/3600
    Recognition Rate 98.70% 96.00% 98.50%

    It is interesting to see that the pairs of the objects on which the nearest neighbor method misclassified have similar geometric configurations and similar poses.  A close inspection shows that most of the recognition errors are made between the three packs of chewing gums, bottles and cars. Other dense sampling cases are easier for this method. Consequently, the set of selected objects in an experiment has direct effects on the recognition rate.  This needs to be taken into account when evaluating results that use only a subset of the 100 objects (typically 20 to 30) from the COIL dataset for experiments. Table 1 shows the recognition rates of nearest neighbor classifiers in several experiments in which 36 poses of each object are used for templates and the remaining 36 poses are used for tests.

    Given this baseline experiment we have decided to perform our experimental comparisons in cases in which the number of views of objects available in training is limited. Some of our preliminary results were presented in companion papers.
     

    3.2 Experiment Setups


    Applying SNoW to 3D object recognition requires specifying the architecture used and the representation chosen for the input images. As described above, to perform object recognition we associate a target unit with each target object. This target learns a definition of the object in terms of the input features extracted from the image. We could either define a single SNoW unit which contains target subnetworks for all the 100 different target objects, or we may define different units, each with  several  (e.g., two) competing target objects. Statistically, this approach is advantageous although, clearly, it requires a  lot more computation. The architecture selected affects both the training time, where learning a definition for object a makes use of negative examples of other objects that are part of the same unit but, more importantly, it makes a difference in testing; rather that two competing objects for a decision, there may be a hundred. The chances for a spurious mistake caused by an incidental view point are clearly much higher. On the other hand, it has significant advantages in terms of space complexity and the appeal of the evaluation mode.

    SVMs are two-class classifiers which, for an c-class pattern recognition problem, need to train c*(c-1)/2 binary classifiers. Since we compare the performance of the proposed SNoW-based method with SVMs, in order to maintain a fair comparison we have to perform it in the {\em one-against-one} scheme. That is, we use SNoW units of size two. To classify a test instance, tournament-like pair-wise competition between all the machines is performed and the winner determines
    the label of the test instance.  The recognition rates of the SVM and SNoW based methods shown in Table 2 were performed using the one-against-one scheme. That is, we trained 4,950 classifiers for each method and evaluated 99 (=50+25+12+6+3+2+1) classifiers on each test instance.
     
     

    3.3 Results Using Pixel-Based Representation

    Table 2. shows the recognition rates of the SNoW-based method, the SVM-based method (using linear dot product for the kernel function), and the nearest neighbor classifier using the COIL-100 dataset. The important parameter here is that we vary the number of views of an object (v) during training and use the rest of the views (72-v) of an object for testing.

    Table 2. Experimental results of three classifiers using the 100 objects in the COIL-100 dataset


     
    # of View/Test
    36/3600 tests 18/5400 tests 8/6400 tests 4/6800 tests
    SNoW 95.81% 92.31% 85/13% 81.46%
    Linear SVM 96.03% 91.30% 84.80% 78.50%
    Nearest Neighbor 98.50% 87.54% 79.52% 74.63%

    The experimental results show that the SNoW-based method performs as well as the SVM-based method when many views of the objects are present during training and outperforms SVM-based method when the numbers of views is limited. Although it is not surprising to see that the recognition rate decreases as the number of views available during training decreases, it is worth noticing that both SNoW and SVM are capable of recognizing 3D objects in the COIL-100 dataset with satisfactory performance if enough views (e.g., >18) are provided. Also they seems to be fairly robust even if only a limited number of views (e.g., 8 and
    4) are used for training; the performance of both methods degrades gracefully.

    To provide some more insight into these methods, we note that in the SVM-based methods, only 27.78% (20 out of 72) of the input vectors serves as support vectors. For SNoW, out of 262,144 potential features in the pixel-based representation, only 13,805 were active in the dense case (i.e., 36 views). This shows the advantage gained from using the sparse  rchitecture. However, only a small number of those may be relevant to the representation of each target, as a more careful look as the SNoW output hypothesis reveals.

    An additional potential advantage of the SNoW architecture is that it does not learn discriminators, but rather can learn a representation for each object, which can then be used for prediction in the one-against-all scheme or to build hierarchical representations. However, as is shown in Table~\ref{chap6-table4}, this implies a significant degradation is the performance.  Finding a way to make better predictions in the one-against-all scheme is one of the important issues for future investigation, to better exploit the advantages of this approach.

    Table 3. Recognition rates of SNoW using two learning paradigms


    # of View
    SNoW 36 18 8 4
    One-against-one 95.81% 92.31% 85.13% 81.46%
    Linear SVM 90.52% 85.50% 81.85% 76.00%

    3.4  Results Using Edge-Based Representation

    For each 32 x 32 edge map, we extract horizontal and vertical edges (of length at least 3 pixels) and then encode as our features conjunctions of two of these edges. The number of potential features of this sort is 2,096,128. However, only an average of 1,822 of these is active for objects in the COIL-100 dataset.  To reduce the computational cost the feature vectors were further pruned and only the 512 most frequently occurring features were retained in each image.

    Table 3 shows the performance of the SNoW-based method when conjunctions of edges are used to represent objects.  As before, we vary the number of views of an object n during training and use the rest of the views 72-n of an object for testing.  The results indicate that conjunctions of edges provide useful information for object recognition and that SNoW is able to learn very good object representations using these features.  The experimental results also
    exhibit the relative advantage of this representation increases when the number of views per object is limited.


    Table 4. Experimental results of four classifiers using the 100 objects in the COIL-100 dataset


     
    # of View/Test
    36/3600 tests 18/5400 tests 8/6400 tests 4/6800 tests
    SNoW w/ conjunction of edges 96.25% 94.13% 89.23% 88.28%
    SNoW w/ intensity values 95.81% 92.31% 85.13% 81.46%
    Linear Support Vector Machine 96.03% 91.30% 84.80% 78.50%
    Nearest Neighbor 98.50% 87.54% 79.52% 74.63%

    4. Conclusion

    The main contribution of this work we view in proposing a learning framework for visual learning and exhibiting its feasibility. In this approach learnability can be rigorously studied without making assumptions on the distribution of the observed objects but, via the PAC model, the learned hypothesis' performance naturally depends on its prior experience. An important feature of the approach is that learning is not studied directly in terms of the raw data but rather with respect to intermediate representations extracted from it and can thus be quantified in terms of the ability to generate expressive intermediate representations. In particular, it makes explicit the requirements from these representations to allow learnability. We believe that research in vision should concentrate on the study of these intermediate representations.

    We evaluated the approach and demonstrated its feasibility in a large scale experiment in the context of learning for object recognition. Our experiments allowed us also to perform a fair comparison between two successful and related  learning methods and study them in the context of object recognition. We have illustrated our approach in a large scale experimental study in which we use the SNoW learning architecture to learn representations for the 100 objects in COIL-100.  Although it is
    clear that object recognition in isolation is not the ultimate goal, this study shows the potential of this computational approach as a basis for studying and supporting more realistic visual inferences.

    We note that for a fair comparison among different methods, we have used pixel-based presentation in the experiments. It is clear, however, that the edge-based representation used is even more effective and robust and should be the starting point for future research. There is no question that the RGFs used in this
    work are not general enough to support more challenging recognition problems; the intention was merely to exhibit the general approach. We believe that pursing the direction of using complex intermediate representations will benefit future work on recognition and, in particular, robust recognition under various types of noise.
     

    Publications