Using TensorFlow and a Support Vector Machine to Create an Image Classification Engine

In this post, we document how we used Google’s TensorFlow to build an image recognition engine. We used Inception to extract features from the images and then trained an SVM classifier to recognise the objects. Our aim is to build a system that helps a user with a zip puller find the matching puller in a database. This piece will also cover how the Inception network sees the input images and assess how well the extracted features can be classified.

Our puller project with TensorFlow

Recently, Oursky got a mini zip puller recognition project. One of our teams had to build a system for users to match an image of a puller with the most similar puller in a database. The sample size for the trial is small (12 pullers), which has implications discussed below as we share our experience trying out Google’s TensorFlow.

pullers

Images showing 12 different pullers

Our first test was to compare the HoG (Histogram of Oriented Gradients) feature computed on the input image against those computed on the puller images rendered from their CAD models. This solution works, but the matching performance is poor if the input image background has a strong texture.

We also tested an alternative solution to address the problem of textured backgrounds. We built a relatively shallow CNN with two convolutional layers and two fully connected layers[1] for classifying the puller image. However, since our data set is too small (around 200 puller images for each type) and lacks variety, the classification performance is poor; it is basically no different from random guessing.

Training a CNN from scratch with a small data set is indeed a bad idea. The common approach for using a CNN to do classification on a small data set is not to train your own network, but to use a pre-trained network to extract features from the input image and train a classifier on those features. This technique is called transfer learning. TensorFlow has a tutorial on how to do transfer learning on the Inception model; Kernix also has a nice blog post on transfer learning, and our work is largely based on that.

Brief overview on classification

In a classification task, we first need to gather a set of training examples. Each training example is a pair of input features and labels. We would like to use these training examples to train a classifier, and hope that the trained classifier can tell us a correct label when we feed it an unseen input feature.

There are lots of learning algorithms for classification, e.g. support vector machines, random forests, neural networks, etc. How well a learning algorithm performs is highly related to the input feature. An input feature is a representation that captures the essence of the object under classification.

For example, in image recognition, the raw pixel values could be an input feature. However, with raw pixel values as the input feature, the feature dimension is usually too big or too generic for a classifier to work well. In that case, we can either use a more complex classifier such as a deep neural network, or use some domain knowledge to design a better input feature.[2]

For our puller classification task, we will use SVM for classification, and use a pre-trained deep CNN from TensorFlow called Inception to extract a 2048-d feature from each input image.

Bottleneck features of a deep CNN

The common structure of a CNN for image classification has two main parts: 1) a long chain of convolutional layers, and 2) a few (or even just one) fully connected layers. The long chain of convolutional layers does the feature learning. The learned feature is fed into the fully connected layers for classification.

The feature that feeds into the last classification layer is also called the bottleneck feature. The following image shows the structure of the TensorFlow Inception network we are going to use. We have indicated the part of the network from which we take the output as our input feature.

inceptionv3

TensorFlow Inception model with the bottleneck feature indicated

How Inception sees a puller

Training a CNN means learning a bunch of image filters (kernels).

For example, suppose the input of a convolutional layer is an image with 3 channels and the kernel size for this layer is 3×3. There will be an independent set of three 3×3 kernels for each output channel. Each kernel in a set convolves with the corresponding channel of the input, producing three convolved images. The sum of those convolved images forms one channel of the output.

The illustration below is a convolution step.

conv-layer

Illustration of convolution
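The convolve-and-sum step described above can be sketched in a few lines of NumPy. This is a toy example with made-up sizes (a 5×5 3-channel image and a single set of three 3×3 kernels), not the actual Inception kernels:

```python
import numpy as np

def conv_output_channel(image, kernels):
    """Compute one output channel: convolve each input channel with its
    own kernel (cross-correlation, as in most deep learning frameworks)
    and sum the results. No padding, stride 1."""
    h, w, channels = image.shape
    kh, kw, kc = kernels.shape
    assert kc == channels, "one kernel per input channel"
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]
            # multiply-and-sum over height, width and channel at once
            out[i, j] = np.sum(patch * kernels)
    return out

rng = np.random.RandomState(0)
image = rng.rand(5, 5, 3)      # a 3-channel "image"
kernels = rng.rand(3, 3, 3)    # one set of three 3x3 kernels
channel = conv_output_channel(image, kernels)
print(channel.shape)  # (3, 3): a 5x5 input shrinks to 3x3 without padding
```

A real layer has one such kernel set per output channel, so the full output is a stack of these single-channel results.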

As the output of each convolutional layer[3] is a multi-channel image, we can also view it as multiple grayscale images. By plotting those grayscale images out, we can understand how the Inception network sees an image. The following images are extracted at different stages of the convolutional layer chain; the extraction points are marked as A, B, C and D in the Inception model figure above.

This is an input image.

All the 32 149×149 images at stage A:

Output image at stage A

Inception Output image at Stage A

All the 32 147×147 images at stage B:

Output image at stage B

Inception Output image at Stage B

All the 288 35×35 images at stage C:

Output image at stage C

Inception Output image at stage C

All the 768 17×17 images at stage D:

Output image at stage D

Inception Output image at stage D

Here we can see that the images become more and more abstract going down the convolutional layer chain. We can also spot that some of the images highlight the puller, while others highlight the background.

Why is the bottleneck feature good?

The bottleneck feature of the Inception network is a 2048-d vector. The following figure shows the bottleneck feature of the previous input image as a bar chart.

bottleneck feature in bar chart form

Bottleneck feature in bar chart form

For the bottleneck feature to be a good feature for classification, we would like the features representing the same type of puller to be close (think of the feature as a point in 2048-d space) to each other, while features representing different types of puller should be far apart. In other words, we would like to see features in a data set clustering themselves according to their types.

It is hard to see this kind of clustering happening in a 2048-d feature data set. However, we can apply dimensionality reduction[4] to the bottleneck features and transform them into 2-d features, which are easy to visualize. The following image is the scatter plot of the transformed features of our puller data set[5]. Different puller types are shown in different colors.

Scatter plot of transformed feature of the puller dataset

Scatter plot of transformed feature of the puller dataset

As we can see, points of the same color are mostly clustered together. There is a good chance that we can use the bottleneck feature to train a classifier with high accuracy.
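The visualization step can be reproduced along the following lines with scikit-learn’s t-SNE implementation (see footnote 4). The data below is a random placeholder; in our case the input would be the 2048-d bottleneck features and the corresponding puller-type labels:

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data: 60 fake "bottleneck features" with 12 fake labels.
rng = np.random.RandomState(0)
features = rng.rand(60, 2048)
labels = rng.randint(0, 12, size=60)

# Reduce the 2048-d features to 2-d points for plotting.
embedded = TSNE(n_components=2, perplexity=15,
                random_state=0).fit_transform(features)
print(embedded.shape)  # (60, 2)

# Plotting sketch (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap='tab20')
# plt.show()
```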

Code for extracting the Inception bottleneck feature

The Inception v3 model can be downloaded here.
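Our extraction script is shared separately, but the general pattern looks roughly like the sketch below, using TensorFlow’s 1.x-style graph loading. The tensor names pool_3:0 (the bottleneck) and DecodeJpeg/contents:0 (the JPEG input) come from the downloadable Inception model; treat the exact model file name as an assumption:

```python
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

def extract_bottleneck(model_pb_path, image_path):
    """Run a JPEG through the pre-trained Inception v3 graph and
    return its 2048-d bottleneck ('pool_3:0') feature."""
    graph_def = tf1.GraphDef()
    with tf.io.gfile.GFile(model_pb_path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    with tf1.Graph().as_default() as graph:
        tf1.import_graph_def(graph_def, name='')
        bottleneck = graph.get_tensor_by_name('pool_3:0')
        with tf1.Session(graph=graph) as sess, \
                open(image_path, 'rb') as img:
            feature = sess.run(
                bottleneck,
                feed_dict={'DecodeJpeg/contents:0': img.read()})
    return feature.reshape(2048)

# Usage, assuming the downloaded model archive has been extracted:
# feature = extract_bottleneck('classify_image_graph_def.pb', 'puller.jpg')
```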

Training an SVM classifier

A support vector machine (SVM) is a linear binary classifier.

The goal of the SVM is to find a hyperplane that separates the training data correctly into two half-spaces while maximising the margin between the two classes.

Although the SVM is a linear classifier, which can only deal with linearly separable data sets, we can apply the kernel trick to make it work for the non-linearly separable case.

A commonly used kernel besides linear is the RBF kernel.

The hyper-parameters for an SVM include the type of kernel and the regularization parameter *C*. When using the RBF kernel, there is an additional parameter *γ* that sets the width of the radial basis function.

Usually the bottleneck feature from a deep CNN is linearly separable. However, we considered the RBF kernel as well.

We used a simple grid search to select the hyper-parameters. In other words, we tried out every hyper-parameter combination within the ranges we specified, and evaluated each trained classifier’s performance using cross-validation.

The rule of thumb for trying out the *C* and *γ* parameters is to try values at different orders of magnitude.

We used 10-fold cross-validation.

An SVM is a binary classifier. However, we can use the one-vs-all or one-vs-one approach to turn it into a multi-class classifier.

This may seem like a lot of work for training an SVM classifier, but it is just a few function calls when using a machine learning package like scikit-learn.

Code for training the SVM classifier
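The full listing is shared separately; a minimal scikit-learn sketch of the procedure described above, a grid search over *C* and *γ* with 10-fold cross-validation, looks like this (the data is a synthetic placeholder standing in for the real bottleneck features and puller labels):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic placeholder: 4 well-separated classes of 50-d "features".
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 50) + 3 * i for i in range(4)])
y = np.repeat(np.arange(4), 30)

# Try C and gamma at different orders of magnitude,
# for both the linear and the RBF kernel.
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100],
     'gamma': [1e-4, 1e-3, 1e-2, 1e-1]},
]
search = GridSearchCV(SVC(), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note that scikit-learn’s SVC already handles the multi-class case internally using the one-vs-one approach, so no extra wrapping is needed.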

SVM training result

The following is the training result we got: a perfect score! Though this might be due to overfitting…

We’ve used it to build a mobile app and a web front-end for the puller classifier for field testing.

Puller Matcher screenshot

Puller Matcher screenshot

Since the classifier can work with unseen samples, it seems that the over-fitting issue is not so serious.

Conclusion

A pre-trained deep CNN, the Inception network in particular, can be used as a feature extractor for general image classification tasks.

The bottleneck feature of the Inception network should be a good feature for classification. We extracted the bottleneck features from our data set and applied dimensionality reduction for visualization. The result shows a nice clustering of the samples according to their classes.

The SVM classifier trained on the bottleneck feature achieved a perfect result, and the classifier seems to work on unseen samples.

Footnotes

1: The model is based on the TensorFlow tutorial on CIFAR-10 classification, with some tweaks to deal with the larger image size.

2: Sometimes it is the other way round: the dimension of the input feature is too small, and we need to apply some transformation to the input feature to expand its dimension. The process of picking a good feature to learn from is called feature engineering, and it is a difficult task. One of the reasons deep learning is so popular is that we can feed raw and generic input to the network, and it automatically learns good features during training. However, the trade-off is a huge training data set and a long training time.

3: Note that one convolutional layer does not perform just one convolution operation; it can also contain multiple convolution operations, pooling operations, or other operations.

4: The algorithm for dimensionality reduction we use is t-SNE.

5: We didn’t use the full data set for the classification; instead, we removed images with low variety from the data set, resulting in a data set of around 400 images.

If you find this post interesting, subscribe to our newsletter to get notified about our future posts!

7 Comments

  1. Nice. I was trying to solve a similar problem previously.
    I wonder whether you have tried non-CNN approaches like SIFT + SVM?

    • We do have another non-CNN implementation for this puller classification problem. Instead of the standard SVM classifier on a Bag-of-Words feature built with SIFT, we tried an approach that involves no training at all.

      For that approach, we have a database of HoG features computed from the CAD models of the pullers. During a query, we compute the HoG feature of the input image and match it against the entries in the database using SAD (sum of absolute differences) as the similarity metric.

      Given that the input image has a clean background and the puller is in the centre of the image, this approach gives a very good result, especially when you want to find a puller with a similar shape.
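      A minimal sketch of that query step, with random vectors standing in for the actual HoG features (the sizes here are made up):

```python
import numpy as np

# Hypothetical stand-ins: one HoG vector per CAD-rendered puller,
# plus one HoG vector computed from the query image.
rng = np.random.RandomState(0)
database = rng.rand(12, 128)                 # 12 pullers, 128-d HoG each
query = database[7] + 0.01 * rng.rand(128)   # query close to puller #7

# SAD (sum of absolute differences) against every database entry;
# the best match is the entry with the smallest SAD.
sad = np.sum(np.abs(database - query), axis=1)
best_match = int(np.argmin(sad))
print(best_match)  # 7
```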

  2. Hi, this is a very good post on using the transfer learning. Can you please let me know how you visualised the feature maps at different layers. Thanks.

    • TL;DR: you can take a look at our code for plotting the features at different layers.
      https://gist.github.com/jasonkit/95e051a3bc1bded2d5a5c568d4d6d4e5

      In short, we need to find out the name of the tensor that stores the feature we want to visualize, as well as its size, so that we know how to plot it.

      Later on, we just need to feed in a query image, retrieve the value stored in the tensor of interest, and plot it as a grayscale image.

      To figure out the name and size of the tensor we are interested in, you can use gen-tensorboard-log.py to generate the TensorBoard log and view it in TensorBoard.

      After running gen-tensorboard-log.py, run tensorboard --logdir with the log directory and browse to 127.0.0.1:6006. Choose the GRAPHS tab at the top of the screen and you will see the structure of the Inception graph.

      You can drag to move around, scroll to zoom, and double-click to look inside the graph structure. After you select a node, you can see its name, together with its input/output size, in the box at the top right corner.

      Hope this can help!

  3. A very informative and useful post.

    It seems the “pool_3.0” layer is primarily for classification uses.

    If instead of classification, you want a layer which best represents an input image as a vector, what is this layer?

    • It will still be the “pool_3.0” layer, if by “best represents an input image” you mean “best capturing the content of the input image”.

      You can think of the part of the network right before the fully-connected layer as a “feature extractor”. The extracted feature (that 2048-d vector) is an abstracted representation of the (content of the) input image. This kind of representation is somewhat similar to the Bag-of-Words feature built using SIFT/SURF/ORB features.

      Of course, if the information you want to represent in the feature is not the content, but something like color distribution, line orientation, etc, you probably just need some simple image filters, not a CNN.

      • @jason
        Using keras the last 5 layers are:
        batchnormalization_94 Tensor(“cond_93/Merge:0”, shape=(?, 8, 8, 192), dtype=float32)
        mixed10 Tensor(“concat_14:0”, shape=(?, 8, 8, 2048), dtype=float32)
        avg_pool Tensor(“AvgPool_10:0”, shape=(?, 1, 1, 2048), dtype=float32)
        flatten Tensor(“Reshape_846:0”, shape=(?, ?), dtype=float32)
        predictions Tensor(“Softmax:0”, shape=(?, 1000), dtype=float32)

        The predictions layer is the classifier that most of us use. The avg_pool 2048-dim vector is the abstracted representation. What is the relationship between the mixed10 and avg_pool layers. I’m curious to understand how these last 5 layers are related and their role. Thanks.


© 2017 Oursky Code Blog
