TL;DR This is the first in a series of posts on Deep Learning in Computer Vision. Starting from scratch, our goal is to reach the top 1% in a Kaggle competition. Our weapon of choice will be PyTorch, and we are choosing a competition that can be comfortably attempted on common hardware.

What is Computer Vision?

As humans, we use our eyes and our brains to see and visually sense the world around us. Computer Vision is a scientific discipline that aims to give a similar, if not better, capability to a machine or computer.


Computer vision is concerned with the automatic extraction, analysis and understanding of useful information from a single image or a sequence of images. It involves the development of a theoretical and algorithmic basis to achieve automatic visual understanding.


The British Machine Vision Association and Society for Pattern Recognition

To understand Computer Vision in more depth, you can always read about it on Wikipedia. It’s a vast topic, so we are going to focus on the following four common tasks, which either drive most applications in Computer Vision or serve as a foundation for them.


Image Classification

In Image Classification, we assign each image, taken as a whole, to one of a set of categories or classes of interest.

For example, given a set of images containing cats or dogs (but not both), the image classification task could be to classify the images into “Cat” or “Dog” categories.
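To make the Cat-versus-Dog example a little more concrete, here is a minimal sketch of the last step of a classifier: turning raw per-class scores into probabilities and picking the most likely class. The scores are made up for illustration, and the `softmax` and `classify` helpers are my own names, not part of any library; a real model would produce the scores from the image.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, class_names):
    """Return the most likely class name and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Hypothetical raw scores a model might output for a single image.
label, confidence = classify([2.0, 0.5], ["Cat", "Dog"])
print(label, round(confidence, 3))
```

The whole image gets exactly one label here, which is what separates classification from the tasks that follow.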


Localization

When we perform Localization in addition to Image Classification, we construct a bounding box identifying the locality in which our object of interest appears in the image. Localization finds the location of a single object inside the image.

This is useful, for example, when we want to automatically crop a profile picture so that it retains only the face. Localization would provide us with a bounding box containing the face, and we can simply add a generous margin to it and crop the image to that region.
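The margin-and-crop step can be sketched in a few lines of plain Python. This assumes the box comes as `(left, top, right, bottom)` pixel coordinates and that a 25% margin counts as generous; both are my assumptions for illustration, as is the helper name `crop_box_with_margin`.

```python
def crop_box_with_margin(box, image_size, margin=0.25):
    """Expand a bounding box by a fractional margin and clamp it to the image.

    box        -- (left, top, right, bottom) in pixels
    image_size -- (width, height) in pixels
    margin     -- fraction of the box width/height added on each side
    """
    left, top, right, bottom = box
    width, height = image_size
    pad_x = int((right - left) * margin)
    pad_y = int((bottom - top) * margin)
    # Clamp so the expanded box never leaves the image.
    return (
        max(0, left - pad_x),
        max(0, top - pad_y),
        min(width, right + pad_x),
        min(height, bottom + pad_y),
    )

face_box = (120, 40, 220, 160)  # hypothetical localization output
crop = crop_box_with_margin(face_box, image_size=(300, 200))
print(crop)
```

The resulting tuple can then be handed to whatever image library you use for the actual crop.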

Object Detection

Sometimes, an image may contain more than one object of interest. In Object Detection, we try to find and classify a variable number of objects in an image. For example, instead of a cat or a dog, an image could contain two cats and one dog. With Object Detection, we try to find each of them and locate them inside the image, typically with a bounding box.
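One way to picture the output of a detector is as a list whose length changes from image to image, with one entry per object found. The sketch below is only an illustration of that shape: the `Detection` record, the confidence scores, and the 0.5 threshold are all assumptions of mine rather than the format of any particular library.

```python
from collections import Counter
from typing import NamedTuple

class Detection(NamedTuple):
    label: str
    score: float  # model confidence in [0, 1]
    box: tuple    # (left, top, right, bottom) in pixels

def keep_confident(detections, threshold=0.5):
    """Drop detections whose confidence falls below the threshold."""
    return [d for d in detections if d.score >= threshold]

# Hypothetical detector output for an image with two cats and one dog.
detections = [
    Detection("cat", 0.92, (10, 20, 110, 140)),
    Detection("cat", 0.88, (150, 30, 250, 150)),
    Detection("dog", 0.95, (270, 10, 420, 200)),
    Detection("dog", 0.12, (0, 0, 30, 30)),  # low-confidence false alarm
]

kept = keep_confident(detections)
print(Counter(d.label for d in kept))
```

Unlike classification, nothing here fixes the number of answers in advance; a different image could just as well yield zero detections or ten.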

Instance Segmentation

Image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics. For example, the pixels under the same label could belong to the same class, object, or instance. In the image above, you can see how Instance Segmentation results in tight boundaries around the objects of interest (the two cats and the dog).
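A per-pixel labelling is easy to picture as a small grid of integers. In the sketch below, 0 stands for background and every other integer is an instance id; the tiny hand-written mask and the `instance_boxes` helper are assumptions made up for illustration. It shows how a mask carries strictly more information than a bounding box, since the box can be recovered from the mask but not the other way round.

```python
def instance_boxes(mask):
    """Map each instance id in a 2D label mask to its tight bounding box.

    mask -- list of rows; 0 means background, other integers are instance ids.
    Returns {instance_id: (left, top, right, bottom)} with inclusive pixel coordinates.
    """
    boxes = {}
    for y, row in enumerate(mask):
        for x, instance_id in enumerate(row):
            if instance_id == 0:
                continue  # skip background pixels
            if instance_id not in boxes:
                boxes[instance_id] = (x, y, x, y)
            else:
                left, top, right, bottom = boxes[instance_id]
                boxes[instance_id] = (min(left, x), min(top, y), max(right, x), max(bottom, y))
    return boxes

# Hypothetical 4x6 mask with two instances.
mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 0, 0, 2, 2, 2],
    [0, 0, 0, 0, 2, 0],
]
boxes = instance_boxes(mask)
print(boxes)
```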

Computer Vision (and Deep Learning) at work

Terms like Image Classification and Instance Segmentation can sound a little stuffy, but they enable us (and our computers) to do amazing things. Sure, classifying images into Cats and Dogs is not that exciting, but here are a few more things you could do:

  • Build a self-driving (toy) car,
  • Read the lips from a video to automatically transcribe it,
  • Help with animal and forest conservation,
  • Help doctors better identify diseases from lung, chest or brain scans,
  • Write a program to play first-person shooter games for you, and
  • Transform your holiday snaps to make them look like they were painted by Van Gogh.

I don’t know about you, but that sounds both fun and useful to me.

What will we be doing?

This is the first in a series of blog posts where I am going to solve a few problems in Deep Learning using the PyTorch library. My focus will be on bringing out the intuition behind various techniques and algorithms, creating a smooth pipeline (Deep Learning is as much about engineering as it is about theory), and taking a crack at something moderately challenging.

More concretely, by the end of this series, we’ll try to get into the top 1% of a Kaggle competition in Computer Vision. Tentatively, this is the competition we are aiming for: it provides enough challenge (if you want to get into the top 1%) and has a dataset that is mostly manageable on an average laptop. We’ll pick a different dataset if this one proves too easy.

Next, I’ll be talking about how to set up a working environment before we start with PyTorch.