Video classification with Keras and Deep Learning

In this tutorial, you’ll learn how to perform video classification using Keras, Python, and Deep Learning.

Specifically, you will learn:

  • The difference between video classification and standard image classification
  • How to train a Convolutional Neural Network using Keras for image classification
  • How to take that CNN and then use it for video classification
  • How to use rolling prediction averaging to reduce “flickering” in results

This tutorial will serve as an introduction to the concept of working with deep learning in a temporal nature, paving the way for when we discuss Long Short-term Memory networks (LSTMs) and eventually human activity recognition.

To learn how to perform video classification with Keras and Deep Learning, just keep reading!

Video Classification with Keras and Deep Learning

Videos can be understood as a series of individual images; therefore, many deep learning practitioners would be quick to treat video classification as performing image classification a total of N times, where N is the total number of frames in a video.

There’s a problem with that approach, though.

Video classification is more than just simple image classification. With video we can typically make the assumption that subsequent frames in a video are correlated with respect to their semantic contents.

If we are able to take advantage of the temporal nature of videos, we can improve our actual video classification results.

Neural network architectures such as Long Short-term Memory networks (LSTMs) and Recurrent Neural Networks (RNNs) are suited for time series data (two topics that we’ll be covering in later tutorials), but in some cases they may be overkill. They are also resource-hungry and time-consuming when it comes to training over thousands of video files, as you can imagine.

Instead, for some applications, all you may need is rolling averaging over predictions.

In the remainder of this tutorial, you’ll learn how to train a CNN for image classification (specifically sports classification) and then turn it into a more accurate video classifier by employing rolling averaging.

How is video classification different than image classification?

When performing image classification, we:

  1. Input an image to our CNN
  2. Obtain the predictions from the CNN
  3. Choose the label with the largest corresponding probability

Since a video is just a series of frames, a naive video classification method would be to:

  1. Loop over all frames in the video file
  2. For each frame, pass the frame through the CNN
  3. Classify each frame individually and independently of each other
  4. Choose the label with the largest corresponding probability
  5. Label the frame and write the output frame to disk

There’s a problem with this approach, though: if you’ve ever tried to apply simple image classification to video classification, you likely encountered a sort of “prediction flickering” as seen in the video at the top of this section. Notice how in this visualization we see our CNN shifting between two predictions: “football” and the correct label, “weight_lifting”.

The video is clearly of weightlifting and we would like our entire video to be labeled as such, but how can we prevent the CNN from “flickering” between these two labels?

A simple, yet elegant solution is to utilize a rolling prediction average.

Our algorithm now becomes:

  1. Loop over all frames in the video file
  2. For each frame, pass the frame through the CNN
  3. Obtain the predictions from the CNN
  4. Maintain a list of the last K predictions
  5. Compute the average of the last K predictions and choose the label with the largest corresponding probability
  6. Label the frame and write the output frame to disk
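
To make the difference between the naive and the rolling-average algorithm concrete, here is a minimal sketch that runs both over a handful of made-up probability vectors. The class names, the synthetic predictions, and the `classify_frames` helper are assumptions for illustration only; no real CNN or video is involved.

```python
from collections import deque

import numpy as np

# Hypothetical class names, chosen to match the labels mentioned in the text.
CLASSES = ["football", "tennis", "weight_lifting"]


def classify_frames(frame_probs, size):
    """Label each frame using the mean of the last `size` probability vectors."""
    queue = deque(maxlen=size)  # the oldest prediction is dropped automatically
    labels = []
    for probs in frame_probs:
        queue.append(probs)
        averaged = np.mean(np.array(queue), axis=0)
        labels.append(CLASSES[int(np.argmax(averaged))])
    return labels


# Synthetic per-frame predictions: mostly "weight_lifting", with two frames
# where the (imaginary) CNN briefly prefers "football".
frame_probs = np.array([
    [0.10, 0.05, 0.85],
    [0.15, 0.05, 0.80],
    [0.55, 0.05, 0.40],  # flicker
    [0.10, 0.10, 0.80],
    [0.60, 0.05, 0.35],  # flicker
    [0.05, 0.05, 0.90],
])

naive = classify_frames(frame_probs, size=1)     # no averaging
smoothed = classify_frames(frame_probs, size=4)  # average of the last 4 frames

print(naive)
print(smoothed)
```

With a queue size of 1 the two “flicker” frames come out as “football”; with a queue size of 4 every frame in this toy sequence is labeled “weight_lifting”. A larger K smooths more but also reacts more slowly when the activity in the video really does change.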

The results of this algorithm can be seen in the video at the very top of this post. Notice how the prediction flickering is gone and the entire video clip is correctly labeled!

In the remainder of this tutorial, you’ll learn how to implement this algorithm for video classification with Keras.

The Sports Classification Dataset

Figure 1: A sports dataset curated by GitHub user “anubhavmaity” using Google Image Search. We’ll use this image dataset for video classification with Keras. (image source)

The dataset we’ll be using here today is for sport/activity classification. The dataset was curated by Anubhav Maity by downloading images from Google Images (you could also use Bing) for the following categories:

  1. Swimming
  2. Badminton
  3. Wrestling
  4. Olympic Shooting
  5. Cricket
  6. Football
  7. Tennis
  8. Hockey
  9. Ice Hockey
  10. Kabaddi
  11. WWE
  12. Gymnasium
  13. Weight lifting
  14. Volleyball
  15. Table tennis
  16. Baseball
  17. Formula 1
  18. Moto GP
  19. Chess
  20. Boxing
  21. Fencing
  22. Basketball

To save time and computational resources, and to demonstrate the actual video classification algorithm (the actual point of this tutorial), we’ll be training on a subset of the sports type dataset:

  • Football (i.e., soccer): 799 images
  • Tennis: 718 images
  • Weightlifting: 577 images

Let’s go ahead and download our dataset!

Downloading the Sports Classification Dataset

Go ahead and download the source code for today’s blog post from the “Downloads” link.

Extract the .zip and navigate into the project folder from your terminal:

From there, clone Anubhav Maity’s repo:

The data we’ll be using today is now in the following path:

Project Structure

Now that we have our project folder and Anubhav Maity’s repo sitting inside, let’s review our project structure:

Our training image data is in the Sports-Type-Classifier/data/ directory, organized by class. There is additional clutter included with the GitHub repo that we won’t be using. I’ve omitted it from the project structure output above since we only care about the data. Furthermore, our training script will only train with football, tennis, and weightlifting data (however, a simple list item change could allow you to train with other classes as well).

I’ve extracted three example_clips/ for us from YouTube to test our model upon. Credits for the three clips are at the bottom of the “Keras video classification results” section.

Our classifier files are in the model/ directory. Included are activity.model (the trained Keras model) and lb.pickle (our label binarizer).

An empty output/ folder is the location where we’ll store video classification results.

We’ll be covering two Python scripts in today’s tutorial:

  • Training script: A Keras script that grabs the dataset class images that we care about, loads the ResNet50 CNN, and applies transfer learning/fine-tuning of ImageNet weights to train our model. The training script generates/outputs three files:

    • model/activity.model : A fine-tuned classifier based on ResNet50 for recognizing sports.

    • model/lb.pickle : A serialized label binarizer containing our unique class labels.

    • plot.png : The accuracy/loss training history plot.

  • Video classification script: Loads an input video from example_clips/ and proceeds to classify the video, ideally using today’s rolling average method.

Implementing our Keras training script

Let’s go ahead and implement our training script used to train a Keras CNN to recognize each of the sports activities.

Open up the training script file and insert the following code:

On Lines 2-24, we import necessary packages for training our classifier:

  • matplotlib : For plotting. Line 3 sets the backend so we can output our training plot to a .png image file.

  • keras : For deep learning. Specifically, we’ll use the ResNet50 CNN. We’ll also work with the ImageDataGenerator, which you can read about in last week’s tutorial.

  • sklearn : From scikit-learn we’ll use their implementation of a LabelBinarizer for one-hot encoding our class labels. The train_test_split function will segment our dataset into training and testing splits. We’ll also print a classification_report in a standard format.

  • paths : Contains convenience functions for listing all image files in a given path. From there we’ll be able to load our images into memory.

  • numpy : Python’s de facto numerical processing library.

  • argparse : For parsing command line arguments.

  • pickle : For serializing our label binarizer to disk.

  • cv2 : OpenCV.

  • os : The operating system module will be used to ensure we grab the correct file/path separator, which is OS-dependent.

Let’s go ahead and parse our command line arguments now:

Our script accepts five command line arguments, the first three of which are required:

  • --dataset : The path to the input dataset.

  • --model : The path to our output Keras model file.

  • --label-bin : The path to our output label binarizer pickle file.

  • --epochs : How many epochs to train our network for. By default, we’ll train for 25 epochs, but as I’ll show later in the tutorial, 50 epochs can lead to better results.

  • --plot : The path to our output plot image file. By default it will be named plot.png and be placed in the same directory as this training script.
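
As a rough sketch of how these five arguments could be declared with argparse: the help strings and the `build_parser` helper are assumptions, and the example parses an explicit list instead of the real command line so that it runs anywhere.

```python
import argparse


def build_parser():
    """Declare the five training-script arguments described above (sketch)."""
    parser = argparse.ArgumentParser(description="Training script arguments (sketch)")
    parser.add_argument("--dataset", required=True, help="path to the input dataset")
    parser.add_argument("--model", required=True, help="path to the output Keras model file")
    parser.add_argument("--label-bin", required=True,
                        help="path to the output label binarizer pickle file")
    parser.add_argument("--epochs", type=int, default=25, help="number of epochs to train for")
    parser.add_argument("--plot", type=str, default="plot.png",
                        help="path to the output accuracy/loss plot")
    return parser


# Parse an explicit list rather than sys.argv so the sketch is self-contained.
args = vars(build_parser().parse_args([
    "--dataset", "Sports-Type-Classifier/data",
    "--model", "model/activity.model",
    "--label-bin", "model/lb.pickle",
]))
print(args)
```

Note that argparse stores `--label-bin` under the key `label_bin`, since hyphens in option names become underscores.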

With our command line arguments parsed and in-hand, let’s proceed to initialize our LABELS and load our data:

Line 42 contains the set of class LABELS that our dataset will consist of. All labels not present in this set will be excluded from our dataset. To save on training time, our dataset will only consist of weight lifting, tennis, and football/soccer. Feel free to work with other classes by making changes to the LABELS set.

All dataset imagePaths are gathered via Line 47 and the value contained in args["dataset"] (which comes from our command line arguments).

Lines 48 and 49 initialize our data and labels lists.

From there, we’ll begin looping over all imagePaths on Line 52.

In the loop, first we extract the class label from the imagePath (Line 54). Lines 58 and 59 then ignore any label not in the LABELS set.

Lines 63-65 load and preprocess an image. Preprocessing includes swapping color channels for OpenCV to Keras compatibility and resizing to 224×224px. Read more about resizing images for CNNs here. To learn more about the importance of preprocessing, be sure to refer to Deep Learning for Computer Vision with Python.

The image and label are then added to the data and labels lists, respectively, on Lines 68 and 69.
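
The label-extraction and filtering part of that loop can be sketched without loading any images. The paths below are hypothetical and assume a `<dataset>/<class>/<file>` layout, and the exact class directory names are an assumption as well.

```python
import os

# Class names to keep; the exact directory names are an assumption.
LABELS = {"weight_lifting", "tennis", "football"}

# Hypothetical image paths laid out as <dataset>/<class>/<file>.
image_paths = [
    os.path.join("data", "football", "00000001.jpg"),
    os.path.join("data", "chess", "00000002.jpg"),
    os.path.join("data", "tennis", "00000003.jpg"),
]

kept = []
for image_path in image_paths:
    # The class label is the name of the image's parent directory.
    label = image_path.split(os.path.sep)[-2]
    if label not in LABELS:
        continue  # skip classes we are not training on
    kept.append((image_path, label))

print(kept)
```

Splitting on `os.path.sep` is what makes the `os` import necessary: the separator differs between operating systems.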

Continuing on, we’ll one-hot encode our labels and partition our data:

Lines 72 and 73 convert our data and labels lists into NumPy arrays.

One-hot encoding of labels takes place on Lines 76 and 77. One-hot encoding is a way of marking an active class label via binary array elements. For example, “football” may be array([1, 0, 0]) whereas “weightlifting” may be array([0, 0, 1]). Notice how only one class is “hot” at any given time.
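
For intuition, one-hot encoding can be reproduced in a few lines of plain NumPy. This is only a sketch of the idea, not scikit-learn's LabelBinarizer itself, and the label strings are made up for the example.

```python
import numpy as np

labels = ["football", "tennis", "weight_lifting", "football"]

# Sort the unique class names so that each class gets a stable column index.
classes = sorted(set(labels))

# Compare every label against every class name; exactly one match per row.
one_hot = (np.array(labels)[:, None] == np.array(classes)[None, :]).astype(int)

print(classes)
print(one_hot)
```

Each row contains a single 1 in the column of its class, so with alphabetically sorted classes “football” maps to [1, 0, 0] and “weight_lifting” to [0, 0, 1].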

Lines 81 and 82 then segment our data into training and testing splits, using 75% of the data for training and the remaining 25% for testing.
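
A 75/25 split boils down to shuffling the sample indices and cutting them at the three-quarter mark. The sketch below does that by hand with NumPy; the `split_75_25` helper, the seed, and the stand-in arrays are assumptions, and scikit-learn's train_test_split offers the same idea with more options.

```python
import numpy as np


def split_75_25(data, labels, seed=42):
    """Shuffle the samples and return (trainX, testX, trainY, testY)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    cut = int(0.75 * len(data))  # first 75% of the shuffled indices for training
    train_idx, test_idx = order[:cut], order[cut:]
    return data[train_idx], data[test_idx], labels[train_idx], labels[test_idx]


# Stand-in arrays: 8 tiny "images" and their integer class ids.
data = np.arange(8 * 4).reshape(8, 4)
labels = np.arange(8) % 3

trainX, testX, trainY, testY = split_75_25(data, labels)
print(trainX.shape, testX.shape)
```

This simple version does not balance classes between the two splits, which can matter when some classes have far fewer images than others.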

Let’s initialize our data augmentation object:

Lines 85-96 initialize two data augmentation objects: one for training and one for validation. Data augmentation is nearly always recommended in deep learning for computer vision to increase model generalization.

The trainAug object performs random rotations, zooms, shifts, shears, and flips on our data. You can read more about the ImageDataGenerator and fit_generator here. As we reinforced last week, keep in mind that with Keras, images will be generated on-the-fly (it is not an additive operation).

No augmentation will be conducted for validation data (valAug), but we will perform mean subtraction.

The mean pixel value is set on Line 101. From there, Lines 102 and 103 set the mean attribute for trainAug and valAug so that mean subtraction will be conducted as images are generated during training/evaluation.
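
Mean subtraction itself is a single broadcast operation. In the sketch below the per-channel values are the commonly cited ImageNet RGB means, which is an assumption since the text does not state which values the script uses, and the frame is a stand-in array rather than a real image.

```python
import numpy as np

# Assumed per-channel (R, G, B) ImageNet means; the tutorial text does not
# state the values used by its script.
mean = np.array([123.68, 116.779, 103.939], dtype="float32")

# Stand-in for one RGB frame resized to 224x224.
frame = np.full((224, 224, 3), 128, dtype="uint8")

# Convert to float first so the subtraction cannot wrap around in uint8.
centered = frame.astype("float32") - mean  # broadcasts over height and width

print(centered[0, 0])
```

Whatever values are chosen, the same mean has to be subtracted at training time and at prediction time, otherwise the network sees differently scaled inputs.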

Now we’re going to perform what I like to call “network surgery” as part of fine-tuning:

Lines 107 and 108 load ResNet50 pre-trained with ImageNet weights while chopping the head of the network off.

From there, Lines 112-121 assemble a new headModel and suture it onto the baseModel.

We’ll now freeze the baseModel so that it will not be trained via backpropagation (Lines 125 and 126).

Let’s go ahead and compile + train our model:

Lines 131-133 compile our model with the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 1e-4 and learning rate decay. We use "categorical_crossentropy" loss for training with multiple classes. If you are working with only two classes, be sure to use "binary_crossentropy" loss.
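
The text does not say which decay schedule is used, so the following is only a sketch of one common choice, time-based decay, where the rate is divided by a factor that grows with the number of update steps. As far as I’m aware this is the form that the `decay` argument of the older Keras SGD optimizer applied; the decay value of initial rate divided by epochs is purely an assumption for illustration.

```python
def decayed_learning_rate(initial_lr, decay, iteration):
    """Time-based decay: the rate shrinks as more update steps are taken."""
    return initial_lr / (1.0 + decay * iteration)


initial_lr = 1e-4            # the initial learning rate named in the text
epochs = 25                  # the default number of epochs
decay = initial_lr / epochs  # assumed decay value, for illustration only

for iteration in (0, 1000, 100000):
    print(iteration, decayed_learning_rate(initial_lr, decay, iteration))
```

With a decay this small the learning rate falls very gradually, which suits fine-tuning, where large updates could wreck the freshly attached head.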

A call to the fit_generator function on our model (Lines 139-144) trains our network with data augmentation and mean subtraction.

Keep in mind that our baseModel is frozen and we’re only training the head. This is known as “fine-tuning”. For a quick overview of fine-tuning, be sure to read my previous article. And for a more in-depth dive into fine-tuning, pick up a copy of the Practitioner Bundle of Deep Learning for Computer Vision with Python.

We’ll begin to wrap up by evaluating our network and plotting the training history:

After we evaluate our network on the testing set and print a classification_report (Lines 148-150), we go ahead and plot our accuracy/loss curves with matplotlib (Lines 153-163). The plot is saved to disk via Line 164.

To wrap up, we’ll serialize our model and label binarizer (lb) to disk:

Line 168 saves our fine-tuned Keras model.

Finally, Line 171 serializes and stores our label binarizer in Python’s pickle format.

Training results

Before we can (1) classify frames in a video with our CNN and then (2) utilize our CNN for video classification, we first need to train the model.

Be sure you have used the “Downloads” section of this tutorial to download the source code for this post (as well as downloaded the sports type dataset).

From there, open up a terminal and execute the following command:

Figure 2: Sports video classification with Keras accuracy/loss training history plot.

As you can see, we’re obtaining ~92-93% accuracy after fine-tuning ResNet50 on the sports dataset.

Checking our model directory, we can see that the fine-tuned model along with the label binarizer have been serialized to disk:

We’ll then take these files and use them to implement rolling prediction averaging in the next section.

Video classification with Keras and rolling prediction averaging

We are now ready to implement video classification with Keras via rolling prediction averaging!

To create this script we’ll take advantage of the temporal nature of videos, specifically the assumption that subsequent frames in a video will have similar semantic contents.

By performing rolling prediction averaging we’ll be able to “smooth out” the predictions and avoid “prediction flickering”.

Let’s get started. Open up the video classification script file and insert the following code:

Lines 2-7 load necessary packages and modules. In particular, we’ll be using deque from Python’s collections module to assist with our rolling average algorithm.

Then, Lines 10-21 parse five command line arguments, four of which are required:

  • --model : The path to the input model generated from our previous training step.

  • --label-bin : The path to the serialized pickle-format label binarizer generated by the previous script.

  • --input : A path to an input video for video classification.

  • --output : The path to our output video which will be saved to disk.

  • --size : The max size of the queue for rolling averaging (128 by default). For some of our example results later on, we’ll set the size to 1 so that no averaging is performed.

Armed with our imports and command line args, we’re now ready to perform initializations:

Lines 25 and 26 load our model and label binarizer.

Line 30 then sets our mean subtraction value.

We’ll use a deque to implement our rolling prediction averaging. Our deque, Q, is initialized with a maxlen equal to the args["size"] value (Line 31).
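
The reason a deque with a maxlen works well here is that it evicts the oldest entry by itself once it is full. A tiny demonstration, using placeholder strings instead of real predictions and a queue of 3 rather than the default 128:

```python
from collections import deque

Q = deque(maxlen=3)  # a tiny queue; the tutorial's default size is 128

for prediction in ["p1", "p2", "p3", "p4", "p5"]:
    Q.append(prediction)  # once full, appending evicts the oldest entry

print(list(Q))
```

Only the three most recent items remain, so no manual bookkeeping is needed to keep “the last K predictions”.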

Let’s initialize our cv2.VideoCapture object and begin looping over video frames:

Line 35 grabs a pointer to our input video file stream. We use the VideoCapture class from OpenCV to read frames from our video stream.

Our video writer and dimensions are then initialized to None via Lines 36 and 37.

Line 40 begins our video classification while loop.

First, we grab a frame (Lines 42-47). If the frame was not grabbed, then we’ve reached the end of the video, at which point we’ll break from the loop.

Lines 50-51 then set our frame dimensions if required.

Let’s preprocess our frame:

A copy of our frame is made for output purposes (Line 56).

We then preprocess the frame using the same steps as our training script, including:

  • Swapping color channels (Line 57).
  • Resizing to 224×224px (Line 58).
  • Mean subtraction (Line 59).

Frame classification inference and rolling prediction averaging come next:

Line 63 makes predictions on the current frame. The prediction results are added to the Q via Line 64.

From there, Lines 68-70 perform prediction averaging over the Q history, resulting in a class label for the rolling average. Broken down, these lines find the label with the largest corresponding probability across the average predictions.

Now that we have our resulting label, let’s annotate our output frame and write it to disk:

Lines 73-75 draw the prediction on the output frame.

Lines 78-82 initialize the video writer if necessary. The output frame is written to the file (Line 85). Read more about writing to video files with OpenCV here.

The output is also displayed on the screen until the “q” key is pressed (or until the end of the video file is reached as aforementioned) via Lines 88-93.

Finally, we’ll perform cleanup (Lines 97 and 98).

Keras video classification results

Now that we’ve implemented our video classifier with Keras, let’s put it to work.

Make sure you’ve used the “Downloads” section of this tutorial to download the source code.

From there, let’s apply video classification to a “tennis” clip, but let’s set the --size of the queue to 1, trivially turning video classification into standard image classification:

As you can see, there’s quite a bit of label flickering: our CNN thinks certain frames are “tennis” (correct) while others are “football” (incorrect).

Let’s now use the default queue --size of 128, thus utilizing our prediction averaging algorithm to smooth the results:

Notice how we’ve correctly labeled this video as “tennis”!

Let’s try a different example, this one of “weightlifting”. Again, we’ll start off by using a queue --size of 1:

We once again encounter prediction flickering.

However, if we use a queue --size of 128, our prediction averaging will obtain the desired result:

Let’s try one final example:

Here you can see the input video is correctly classified as “football” (i.e., soccer).

Notice that there is no frame flickering; our rolling prediction averaging smooths out the predictions.

While simple, this algorithm can enable you to perform video classification with Keras!

In future tutorials, we’ll cover more advanced methods of activity and video classification, including LSTMs and RNNs.

Video Credits:


In this tutorial, you learned how to perform video classification with Keras and deep learning.

A naïve algorithm for video classification would be to treat each individual frame of a video as independent from the others. This type of implementation will cause “label flickering”, where the CNN returns different labels for subsequent frames even though the frames should have the same label!

More advanced neural networks, including LSTMs and the more general RNNs, can help combat this problem and lead to much higher accuracy. However, LSTMs and RNNs can be dramatic overkill depending on what you are doing. In some situations, simple rolling prediction averaging will give you the results you need.

Using rolling prediction averaging, you maintain a list of the last K predictions from the CNN. We then take these last K predictions, average them, select the label with the largest probability, and choose this label to classify the current frame. The assumption here is that subsequent frames in a video will have similar semantic contents.

If that assumption holds, then we can take advantage of the temporal nature of videos, assuming that the previous frames are similar to the current frame.

The averaging, therefore, enables us to smooth out the predictions and make for a better video classifier.

In a future tutorial, we’ll discuss the more advanced LSTMs and RNNs as well. But in the meantime, take a look at this guide to deep learning action recognition.

To download the source code to this post, and to be notified when future tutorials are published here on PyImageSearch, simply enter your email address in the form below!