My Experiments In Computer Vision

Throughout 2016-2017, while I was studying machine learning rather intensely, I spent a considerable amount of time reviewing the computer vision (CV) domain, which I found absolutely fascinating. The practical applications of CV include:

  • image classification:
    • security: intruder detection; facial recognition; …
    • anomaly detection
    • object tracking: head, eyes; gesture recognition; people, animals, cars; …
    • autonomous navigation: drones, robots, cars …
    • medical applications: radiology images (pathology; diagnoses; …)
  • image processing:
    • captioning / dense captioning; tagging; labeling; counting; semantic content extraction
    • filtering; feature extraction
    • OCR (optical character recognition)
    • transformation: color; texture; seasons; 2D → 3D; …
    • enhancement: reconstruction; super-resolution
    • compression

[Many of these applications extend to the video domain as well.]

While that period of study / research was a protracted “aside” for me, it provided a solid introduction to:

  • neural network architectures (notably CNN: convolutional neural nets)
  • platforms (Theano; Caffe; TensorFlow; …)
  • implementations (Python), and
  • applications – some of which are itemized above!

Inception_v3 architecture

Stanford University’s superb cs231n: Convolutional Neural Networks for Visual Recognition lectures (at the time taught by the outstanding then-Ph.D. student Andrej Karpathy) and course materials led me directly to explore the image and video classification and processing subdomain in more detail. Another entry point into this area is TensorFlow’s excellent “Wide & Deep Learning” tutorial, which is also summarized in the accompanying Google Research Blog article. [More here.]


That project, like many others, employs pretrained models, which can also be leveraged to classify images that are outside the training set – a rudimentary form of transfer learning. This is really great, because even with high-end equipment (powerful GPUs, …) some of the models could take weeks to train – per experiment!

Additionally, refining the model (parametrization …) would require retraining it. So, in practical terms, many of those high-performing classifiers would otherwise be unavailable to those lacking GPUs.

However, as summarized here, it is really easy to leverage pretrained models for image classification (and other tasks). For example, in the ImageNet classification task images are generally classified among 1000 image categories (breeds of dogs; types of flowers; etc.): the final fully-connected layer produces one score per class (1000 in all), the softmax layer converts those scores into probabilities, and typically only the top few predicted classes (say, the top ten) are reported.
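To make that last step concrete, here is a minimal sketch of a softmax followed by a top-k selection. It uses four made-up scores and labels rather than the real 1000 ImageNet classes; the names `softmax`, `top_k`, and the example values are my own illustrations, not part of any particular library.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores from a final fully-connected layer (4 classes, not 1000).
scores = [2.0, 0.5, -1.0, 3.0]
labels = ["tabby cat", "beagle", "daisy", "tiger cat"]

probs = softmax(scores)
for label, p in top_k(probs, labels, k=2):
    print(f"{label}: {p:.3f}")
```

With a real network the only differences are the length of the score list and where the labels come from.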

In the Inception model (shown above; don’t worry about the number of features in the fully-connected layer), images go in from the left and predictions come out on the right; the very last layer will be of size 1000 and give a probability for each of the classes. However, the layers that come before are transformations over the raw image learned by the network because they were the most useful to solve the image classification task; some layers, for example, are edge detectors.


The idea is to run images through the network but, instead of taking the output of the last layer, which is specialised to the ImageNet task, take the output of the second-to-last layer, which gives us a conceptual numerical representation of the images. We can then use that representation as features for a new classifier that we train on our own task.

You can think of the Inception model as a way to get from an image to a feature vector over which a new classifier can efficiently operate. We are leveraging hundreds of hours of GPU compute time that went into training the Inception model, but applying it to a completely new task.
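The shape of that idea can be sketched in a few lines of plain Python. To keep the example self-contained, I assume a stand-in `frozen_features` function (a couple of hand-written edge detectors) in place of a real pretrained network truncated at its second-to-last layer, and a nearest-centroid classifier as the "new classifier". Both are illustrative choices of mine, not what Inception or the tutorial actually use. The point is only that the feature extractor is never retrained; just the small classifier on top learns the new task.

```python
def frozen_features(image):
    """Stand-in for a pretrained network cut off before its last layer.

    `image` is a list of rows of pixel intensities. The returned vector
    plays the role of the second-to-last layer's output.
    """
    rows, cols = len(image), len(image[0])
    n = rows * cols
    mean = sum(sum(row) for row in image) / n
    # Differences between horizontal neighbours respond to vertical edges.
    dx = sum(abs(image[r][c + 1] - image[r][c])
             for r in range(rows) for c in range(cols - 1))
    # Differences between vertical neighbours respond to horizontal edges.
    dy = sum(abs(image[r + 1][c] - image[r][c])
             for r in range(rows - 1) for c in range(cols))
    return [mean, dx / n, dy / n]


def train_centroids(examples):
    """Train the new classifier: one mean feature vector per label."""
    sums, counts = {}, {}
    for image, label in examples:
        features = frozen_features(image)
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(image, centroids):
    """Assign the label whose centroid is closest in feature space."""
    features = frozen_features(image)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# A tiny made-up task: tell vertical stripes from horizontal stripes.
vertical = [[0, 1, 0, 1]] * 4
horizontal = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
centroids = train_centroids([(vertical, "vertical stripes"),
                             (horizontal, "horizontal stripes")])

unseen = [[1, 0, 1, 0]] * 4  # not in the training examples
print(classify(unseen, centroids))
```

In a real setting `frozen_features` would be replaced by a forward pass through the pretrained model, and the centroid classifier by whatever small model suits the new task.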

But I digress …

As I was working through that and other material, I implemented most of what I was reading (image classification; image captioning; object and facial recognition) on my own computer. For the computer vision aspects, I used a Logitech C270 USB webcam that I had purchased some years previously, but never used.

Following are brief summaries of a few of my projects from that period.

Personal Facial Identification

I find our ability to develop new models (or leverage pre-trained models) to learn new things to be very exciting! It’s truly fascinating to me that we can train software to learn!

Calvin & Hobbes - Computers that think.gif

An exemplar of this is how easy it is to train a neural network to recognize individuals in images or video streams. It is a refinement of facial recognition – itself a facile task – extended with little added effort to recognizing and identifying individual persons!

Among my CV projects, I regard this as my flagship work – not so much for the difficulty of the machine learning aspects, but rather for the programming needed to get the labeled bounding boxes working – surprisingly tricky!

Chad Smith Will Ferrell

Around mid-2016 I read the very informative Machine Learning is Fun! Part 4: Modern Face Recognition with Deep Learning blog post, which included this highly impressive Chad Smith / Will Ferrell YouTube video.

Blog author Adam Geitgey shared the generic parts of that code – facial recognition with bounding boxes:

If you want to try this step out yourself using Python and dlib, here’s the code [local copy] for finding face landmarks and here’s the code [local copy] for transforming the image using those landmarks.

[See also Face Recognition in Videos with OpenCV and OpenFace Demo 3: Training a Classifier.]

However, that code just provides the basic functionality that was already publicly available (e.g., on the OpenCV and OpenFace websites). The author kept private the most interesting bits of the code: personalized facial recognition; and labeled bounding boxes, including the classification probabilities!

Consequently, I wrote and provide that code, here in my GitHub cv_facial_identification repo!

Accompanying that (fully-commented) code are some basic installation notes. (I am an Arch Linux x86_64 user; most of that type of coding is done with plain-text scripts in a basic terminal, in Python virtual environments, as described here.)

Here is a video (mine) of the result!

[Real-time: CPU, not GPU!]

Here, also, is a low-resolution (lossy animated GIF) screen-capture, basically showing the same thing. The parts I’ve written de novo are the labeled bounding boxes, with the classifier probabilities included. Surprisingly, when I looked for such code on the web I couldn’t find it!

animated GIF
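To give a flavor of why the labels are fiddly, here is a minimal sketch of two small pieces: formatting a name plus probability, and deciding where the text should sit relative to the face box so that it never falls off the top of the frame. This is not the code from my repo; the function names, the default text height, and the padding are assumptions made up for this illustration.

```python
def format_label(name, probability):
    """Build the text drawn next to a face box, e.g. 'person_a (93%)'."""
    return f"{name} ({probability:.0%})"


def label_anchor(box, frame_height, text_height=12, pad=4):
    """Choose the (x, y) text baseline for a box given as (left, top, right, bottom).

    The label normally sits just above the box. If the box is too close to
    the top of the frame, the label moves just inside the box instead.
    """
    left, top, right, bottom = box
    y = top - pad
    if y - text_height < 0:
        # Not enough room above the box: draw inside its top edge.
        y = top + text_height + pad
    # Keep the anchor inside the frame in both directions.
    y = min(y, frame_height - 1)
    x = max(left, 0)
    return (x, y)


# Hypothetical detections: (name, probability, box) in a 640x480 frame.
detections = [
    ("person_a", 0.934, (50, 100, 150, 200)),
    ("person_b", 0.612, (300, 5, 400, 110)),  # face near the top edge
]
for name, prob, box in detections:
    print(format_label(name, prob), label_anchor(box, frame_height=480))
```

With OpenCV, the box and the text would then typically be drawn with `cv2.rectangle` and `cv2.putText`, passing the anchor as the text origin.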


  • CPU not GPU (GPU should run ≥10x faster!)
  • The MP4 videos play in Chrome | download in Opera | may or may not play in Firefox

Image Classification: ResNet Neural Network

The following example was actually more involved, on my part, with regard to getting it implemented. Following is my recreation of a web browser-based, webcam-based image classifier that employed a residual network (ResNet-50).

It is described on this page (mine); my code is on GitHub: keras_js_canvas_resnet-50

Image 1: web browser webcam resnet50-a.png

Image 2: web browser + separate webcam and Linux terminal windows webcam resnet50-b.png

To implement that work, I needed to do some front-end HTML/JS/CSS coding (HTML canvas …), run the resnet-50 neural net in the background (Python), feed the output of that neural net to the browser, and figure out a way to plot those data, with the live webcam feed, in a labelled bar plot, in the browser! Whew!

Pretty neat, eh?
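For the "feed the output of that neural net to the browser" step, one simple option is to have the Python side expose the latest predictions as JSON over HTTP, and let the page poll that endpoint and redraw its bar plot. The sketch below uses only the Python standard library and is not necessarily how my repo does it; `LATEST_PREDICTIONS`, `predictions_to_json`, and `PredictionHandler` are hypothetical names, and the prediction values are placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler

# Placeholder for whatever the classifier most recently produced:
# a list of (label, probability) pairs. In a real setup the neural-net
# loop would keep updating this.
LATEST_PREDICTIONS = [("coffee mug", 0.71), ("cup", 0.18), ("pitcher", 0.04)]


def predictions_to_json(predictions, top_n=5):
    """Serialize the top_n predictions, highest probability first."""
    ranked = sorted(predictions, key=lambda pair: pair[1], reverse=True)[:top_n]
    return json.dumps(
        [{"label": label, "probability": round(prob, 4)} for label, prob in ranked]
    )


class PredictionHandler(BaseHTTPRequestHandler):
    """Answer every GET request with the current predictions as JSON."""

    def do_GET(self):
        body = predictions_to_json(LATEST_PREDICTIONS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Allow a page served from another origin to fetch the data.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


print(predictions_to_json(LATEST_PREDICTIONS, top_n=2))
```

To actually serve it, something like `HTTPServer(("127.0.0.1", 8000), PredictionHandler).serve_forever()` (with `HTTPServer` imported from `http.server`; the port is arbitrary) would do, with the page's JavaScript fetching from that address on a timer.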

Here are my notes from that time (being formally trained in basic research – molecular genetics … – I maintain research notes on my work: in this domain, in HTML format):

[2016-Dec-29] For the past ~2 weeks I worked on recreating (myself!) the CaffeJS, Keras-JS, and particularly the poorly-implemented (by those authors) “Machine Learning for Artists” keras-js ResNet live webcam demo – all summarized above. My code (below) essentially recreates the latter project, via my own approach/solutions!

Here is an example of the Machine Learning for Artists (ML4A) project, which, as in Firefox v.50.1.0, also fails to load / work in Opera and Chrome!! It takes forever to load (even locally), and note the complete absence of any classification(s)!! [The ML4A project, of course, is a fork of Keras.js.]

Although slow to load, the Machine Learning for Artists authors included a keras-js ResNet live webcam demo, here!

Unfortunately, it is not (yet) supported in Firefox: my webcam works, but no network/classification. There is some [default] classification in Opera, but no webcam, and the webpage does not load at all in Chrome! Unsurprisingly, their GitHub repo is a mess!

It’s very cool though, and is based – like CaffeJS – on Andrej Karpathy’s ConvNetJS code!

So, there ya have it!

Here are some demos (mine), best viewed in Chrome:

[Related but different project: Torch7 / OpenCV / ImageNet classification / webcam]


  • CPU not GPU (GPU should run ≥10x faster!)
  • The MP4 videos play in Chrome | download in Opera | may or may not play in Firefox