About the AI behind DrawClash

This page explains how DrawClash recognises your drawing in real time – from the individual stroke to a probability for each of the 345 words. We describe the model actually in use, the maths behind it, the training, and place it all in the context of the research.

Overview: from gesture to prediction

When you draw, DrawClash does not receive a finished image file but a sequence of strokes, each a list of screen coordinates. It processes these strokes in three steps:

  1. Rasterise: the strokes are drawn into a small grayscale image.
  2. Recognise: a convolutional neural network (CNN) assigns the image probabilities for 345 words.
  3. Display: the most likely words appear live as bars next to your drawing.

The key point: recognition runs continuously. With every new stroke the image is re-evaluated, which is why the guesses change while you draw.

Step 1 – Strokes become an image (rasterisation)

Before the network can compute anything, the vector strokes are brought into a fixed format. DrawClash rasterises them into a square grayscale image of 96 × 96 pixels. The drawing is scaled and centred so that, with a small margin (about 8 %), it fills the area – regardless of whether you drew big or small, high or low. The lines are drawn with a fixed width, and the pixel values are normalised to the range [0, 1].

This normalisation matters: it makes recognition robust against position and size and ensures the network always sees the same kind of input – a tensor of shape (1, 96, 96) (one channel, because grayscale).
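The rasterisation step can be sketched in a few lines of plain Python. This is a minimal illustration, not DrawClash's actual code: the function name rasterise, the 1-pixel line width, the binary pixel values and the reading of "about 8 %" as a margin on each side are all assumptions made for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]

SIZE = 96      # image side length stated in the text
MARGIN = 0.08  # "about 8 %", assumed here to apply on each side


def rasterise(strokes: List[Stroke], size: int = SIZE,
              margin: float = MARGIN) -> List[List[float]]:
    """Scale, centre and draw strokes into a size x size grid of values in [0, 1]."""
    image = [[0.0] * size for _ in range(size)]
    points = [p for stroke in strokes for p in stroke]
    if not points:
        return image

    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)

    # One uniform scale so the longer side fills the area inside the margin.
    extent = max(max_x - min_x, max_y - min_y) or 1.0
    scale = (size - 1) * (1.0 - 2.0 * margin) / extent
    # Offsets that centre the bounding box in the image.
    off_x = (size - 1 - (max_x - min_x) * scale) / 2.0
    off_y = (size - 1 - (max_y - min_y) * scale) / 2.0

    def to_pixel(p: Point) -> Point:
        return ((p[0] - min_x) * scale + off_x, (p[1] - min_y) * scale + off_y)

    for stroke in strokes:
        pixels = [to_pixel(p) for p in stroke]
        if len(pixels) == 1:
            pixels = pixels * 2  # draw a single point as a dot
        for (x0, y0), (x1, y1) in zip(pixels, pixels[1:]):
            # Sample the segment densely enough to leave no gaps (1 px wide line).
            steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
            for i in range(steps + 1):
                t = i / steps
                image[round(y0 + t * (y1 - y0))][round(x0 + t * (x1 - x0))] = 1.0
    return image


# The same shape drawn small in one corner or large elsewhere maps to one image.
small = rasterise([[(0, 0), (10, 0), (10, 10)]])
large = rasterise([[(4, 8), (24, 8), (24, 28)]])
tensor = [small]  # add the channel axis: shape (1, 96, 96)
```

Because scale and offsets are derived from the bounding box of the drawing, the absolute position and size on screen drop out before the network sees anything.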

Step 2 – The model: a convolutional neural network (CNN)

DrawClash uses a compact, ResNet-style CNN with around 2 million parameters. “Convolutional” means that the network slides small filters (3 × 3 pixels) across the image and learns for itself which patterns matter: first simple edges, then corners and curves, and finally whole shape parts such as wheels, roofs or ears.

The architecture in detail:

A residual block computes not just y = f(x) but y = f(x) + x – so the input signal is also “passed through”. These shortcuts (skip connections) make it possible to train deep networks stably without the learning signal vanishing. The idea comes from the ResNet work by He et al. (2016).
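The shortcut itself is a one-line idea. The sketch below shows only the y = f(x) + x wiring on a flat list of numbers; in the real model f would be a stack of convolutions, normalisation and activations, and the name residual is a hypothetical helper chosen for this example.

```python
from typing import Callable, List

Vector = List[float]


def residual(f: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Return f(x) + x element by element (the skip connection)."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("f must preserve the shape so the shortcut can be added")
    return [a + b for a, b in zip(fx, x)]


# If the learned part f contributes nothing, the block passes its input through.
unchanged = residual(lambda v: [0.0] * len(v), [1.0, -2.0])

# Otherwise f only has to learn a correction on top of the input.
corrected = residual(lambda v: [2.0 * a for a in v], [1.0, 2.0])
```

This is why stacking many such blocks is comparatively safe: each block starts out close to the identity and only has to learn what to add.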

Step 3 – The maths behind it

Three building blocks are enough to understand the basic principle:

Convolution

A filter is a small matrix of weights. It is slid across the image and, at each position, computes a weighted sum of the pixels beneath it. Different filters respond to different patterns (e.g. a vertical edge). Because the same filter is applied everywhere, the network needs few parameters and recognises a pattern regardless of where it appears in the image.
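As an illustration, here is a small convolution in plain Python with a hand-written vertical-edge filter. The filter values, the function name conv2d and the absence of padding are assumptions for the example; in the trained network the filter weights are learned, not written by hand.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image and return the weighted sums (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 3 x 3 filter that responds to a dark-to-bright step from left to right.
VERTICAL_EDGE = [[-1.0, 0.0, 1.0] for _ in range(3)]

# The same edge at two different horizontal positions.
edge_left = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(4)]
edge_right = [[0.0, 0.0, 0.0, 1.0, 1.0] for _ in range(4)]

response_left = conv2d(edge_left, VERTICAL_EDGE)    # rows of [3.0, 3.0, 0.0]
response_right = conv2d(edge_right, VERTICAL_EDGE)  # rows of [0.0, 3.0, 3.0]
```

The nine weights are reused at every position, and when the edge moves one pixel to the right the response moves with it. That is the property the paragraph above describes.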

Activation (SiLU) and normalisation

After each convolution, a non-linear function lets the network learn non-linear relationships too. DrawClash uses SiLU (also “Swish”), defined as f(x) = x · σ(x) with the sigmoid function σ. Batch normalisation keeps the intermediate values in a stable range and speeds up training (Ioffe & Szegedy, 2015).
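SiLU is simple enough to write out directly. The sketch below follows the definition f(x) = x · σ(x) given above; the branch inside sigmoid is only there to avoid overflow for large negative inputs and is an implementation choice for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written so that math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def silu(x: float) -> float:
    """SiLU / Swish: the input scaled by its own sigmoid."""
    return x * sigmoid(x)


# Large positive inputs pass almost unchanged, large negative ones go to zero,
# and in between the curve is smooth instead of having a hard kink at zero.
samples = [silu(x) for x in (-6.0, -1.0, 0.0, 1.0, 6.0)]
```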

Softmax: logits become probabilities

The 345 raw values at the output are converted with the softmax function into probabilities that add up to 100 %:

$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

The highest value $p_i$ is the top guess you see in the game as a percentage.
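In code, softmax is usually computed after subtracting the largest logit, which leaves the result unchanged but keeps the exponentials from overflowing. The sketch below uses three made-up logits instead of 345; the numbers are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # shifting by the maximum does not change the result
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # hypothetical raw outputs for three words
probs = softmax(logits)
top_index = max(range(len(probs)), key=probs.__getitem__)
top_percent = 100.0 * probs[top_index]  # the value shown as a percentage
```

Softmax preserves the ranking of the logits, so the word with the largest raw value is always the top guess.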

How the network learned (training)

The model was trained on sketches from Google’s public “Quick, Draw!” dataset – a collection of over 50 million line drawings across 345 categories that people around the world drew in the browser game of the same name. Each drawing is rasterised as described above and shown to the network together with its correct label.

During training, a loss function (cross-entropy) compares the prediction with the correct word. Through backpropagation, the millions of weights are gradually adjusted so the error gets smaller. After many passes over the data, the network reliably recognises the typical features of each word – without anyone ever explicitly telling it “what a cat looks like”.
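The training loop can be illustrated on a deliberately tiny stand-in: a single linear layer with three classes instead of the full CNN with 345. The model, the learning rate and the function names are assumptions for this sketch. What carries over to the real network is the principle: compute the cross-entropy loss, then move each weight a small step against its gradient.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], label: int) -> float:
    """Loss is small when the correct class gets a high probability."""
    return -math.log(probs[label])


def sgd_step(weights: List[List[float]], x: List[float], label: int,
             lr: float = 0.5) -> float:
    """One gradient step on a linear classifier; returns the loss before the update."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    loss = cross_entropy(probs, label)
    for k, row in enumerate(weights):
        # Gradient of the loss with respect to logit k: p_k minus the one-hot target.
        grad_logit = probs[k] - (1.0 if k == label else 0.0)
        for j in range(len(row)):
            row[j] -= lr * grad_logit * x[j]
    return loss


weights = [[0.0, 0.0] for _ in range(3)]  # 3 classes, 2 input features
example, label = [1.0, 0.5], 2            # one made-up training example
losses = [sgd_step(weights, example, label) for _ in range(5)]
```

With all weights at zero the model starts out guessing uniformly, and the loss shrinks with each step as the weights for the correct class grow.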

Why a CNN on images – and not an RNN on strokes?

For Quick, Draw! sketches there are two common approaches, and both are well studied in the research:

DrawClash deliberately takes the image-based CNN route: it is small enough for response times in the millisecond range, needs no assumptions about stroke order, and for a real-time game delivers the best mix of accuracy and speed.

Real time: why the guess keeps changing

Instead of waiting until you’re finished, DrawClash re-runs the model on your drawing each time you add a stroke. Because the network is small and only has to process a 96 × 96 image, a pass takes just a few milliseconds. At first the AI is uncertain and suggests several words; with each additional detail the distribution sharpens. The latency shown in the game (e.g. “inf 6 ms”) is exactly this compute time of the model.
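The live loop can be sketched as follows. The predict callable stands in for rasterisation plus the CNN and is purely hypothetical, as are the function names and the fake probabilities in the usage part; the sketch only shows the pattern of re-evaluating the growing drawing and timing each pass.

```python
import time
from typing import Callable, Iterator, List, Sequence, Tuple

Guess = Tuple[str, float]


def top_k(probs: Sequence[float], words: Sequence[str], k: int = 3) -> List[Guess]:
    """Return the k most likely (word, probability) pairs, best first."""
    ranked = sorted(zip(words, probs), key=lambda wp: wp[1], reverse=True)
    return ranked[:k]


def guess_after_each_stroke(
    strokes: Sequence[object],
    predict: Callable[[Sequence[object]], Sequence[float]],
    words: Sequence[str],
    k: int = 3,
) -> Iterator[Tuple[List[Guess], float]]:
    """Yield (top-k guesses, inference time in ms) for each growing prefix of strokes."""
    for n in range(1, len(strokes) + 1):
        start = time.perf_counter()
        probs = predict(strokes[:n])  # re-evaluate the whole drawing so far
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        yield top_k(probs, words, k), elapsed_ms


# Usage with a fake model that becomes more confident once a second stroke exists.
WORDS = ["cat", "tiger", "house"]


def fake_predict(prefix: Sequence[object]) -> List[float]:
    return [0.4, 0.35, 0.25] if len(prefix) < 2 else [0.8, 0.15, 0.05]


history = list(guess_after_each_stroke(["s1", "s2", "s3"], fake_predict, WORDS, k=2))
```

Each entry in history corresponds to one refresh of the bars next to the drawing, together with the time that one pass took.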

Limits – why the AI sometimes gets it wrong

The model only knows the 345 trained words and judges shape alone. Very abstract, atypical or overloaded sketches can confuse it, and similar words (say “cat” and “tiger”) are easy to mix up. Drawings that emphasise the typical features of a word are recognised faster; you’ll find concrete tips in the how-to guide. The complete list of recognisable words is under All words.

Privacy: what happens to your drawings

Your sketches are processed only for the gameplay – that is, for evaluation by the model and for display to other players in the same room. Details are in the privacy policy.

Sources & further reading

If you want to go deeper, here are the scientific foundations of the building blocks used:

Note: DrawClash is an independent project and is not affiliated with Google. “Quick, Draw!” and its dataset are works by Google Creative Lab and are named here only as the data source and for context.
