About the AI – How DrawClash recognises sketches (CNN, tech & sources)

Overview: from gesture to prediction

When you draw, DrawClash does not receive a finished image file. It receives a sequence of strokes, each a list of screen coordinates, and processes them in three steps:

Rasterise: the strokes are drawn into a small grayscale image.
Recognise: a convolutional neural network (CNN) assigns the image a probability for each of 345 words.
Display: the most likely words appear live as bars next to your drawing.

The key point: recognition runs continuously. With every new stroke the image is re-evaluated, which is why the guesses change while you draw.
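The display step can be sketched as follows. This is a minimal illustration, not the game's actual rendering code: the function names `top_k` and `render_bars`, the bar width and the example probabilities are all assumptions made for this sketch.

```python
from typing import Dict, List, Tuple


def top_k(probs: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most likely words, most likely first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


def render_bars(ranked: List[Tuple[str, float]], width: int = 20) -> List[str]:
    """Turn (word, probability) pairs into simple text bars."""
    lines = []
    for word, p in ranked:
        bar = "#" * round(p * width)  # bar length is proportional to the probability
        lines.append(f"{word:<8} {bar} {p:.0%}")
    return lines


# Made-up probabilities for illustration only.
example = {"cat": 0.6, "tiger": 0.3, "dog": 0.1}
for line in render_bars(top_k(example, k=2)):
    print(line)
```

Calling this again after every new stroke, with fresh probabilities, gives the effect of bars that move while you draw.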

Step 1 – Strokes become an image (rasterisation)

Before the network can compute anything, the vector strokes are brought into a fixed format. DrawClash rasterises them into a square grayscale image of 96 × 96 pixels. The drawing is scaled and centred so that, with a small margin (about 8 %), it fills the area – regardless of whether you drew big or small, high or low. The lines are drawn with a fixed width, and the pixel values are normalised to the range [0, 1].

This normalisation matters: it makes recognition robust against position and size and ensures the network always sees the same kind of input – a tensor of shape (1, 96, 96) (one channel, because grayscale).
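The scaling and centring described above can be sketched in plain Python. This is an illustration under assumptions, not DrawClash's rasteriser: the function names are hypothetical, lines are drawn one pixel wide because the actual line width is not stated, and the margin is applied on every side.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]

SIZE = 96      # output image is SIZE x SIZE pixels (from the text)
MARGIN = 0.08  # about 8 % margin, assumed here to apply on each side


def fit_to_canvas(strokes: List[Stroke], size: int = SIZE,
                  margin: float = MARGIN) -> List[Stroke]:
    """Scale and centre the strokes so the drawing fills the canvas minus the margin."""
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    if not xs:
        return []
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    extent = max(max_x - min_x, max_y - min_y) or 1.0  # a single dot has no extent
    scale = (size - 1) * (1.0 - 2.0 * margin) / extent
    # Offsets that centre the bounding box on the canvas.
    off_x = ((size - 1) - (max_x - min_x) * scale) / 2.0
    off_y = ((size - 1) - (max_y - min_y) * scale) / 2.0
    return [[((x - min_x) * scale + off_x, (y - min_y) * scale + off_y)
             for x, y in s] for s in strokes]


def rasterise(strokes: List[Stroke], size: int = SIZE) -> List[List[float]]:
    """Draw the fitted strokes into a size x size grid with values in [0, 1]."""
    image = [[0.0] * size for _ in range(size)]
    for stroke in fit_to_canvas(strokes, size):
        if len(stroke) == 1:
            x, y = stroke[0]
            image[round(y)][round(x)] = 1.0
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
            for i in range(steps + 1):
                t = i / steps
                px = round(x0 + (x1 - x0) * t)
                py = round(y0 + (y1 - y0) * t)
                image[py][px] = 1.0  # one-pixel line; the real width is not stated
    return image


# A small diagonal drawn near the corner still fills the canvas after fitting.
image = rasterise([[(10, 10), (20, 20)]])
tensor = [image]  # one channel: shape (1, 96, 96)
```

Because the bounding box is rescaled before drawing, a tiny sketch and a huge one cover the same pixel range, which is the robustness against position and size mentioned above.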

Step 2 – The model: a convolutional neural network (CNN)

DrawClash uses a compact, ResNet-style CNN (Convolutional Neural Network) with around 2 million parameters. “Convolutional” means: the network slides small filters (3 × 3 pixels) across the image and learns for itself which patterns matter – first simple edges, then corners and curves, finally whole shape parts like wheels, roofs or ears.

The architecture in detail:

Stem: an initial convolution 1 → 32 channels (3 × 3, stride 2), followed by batch normalisation and a SiLU activation.
Stages: four stages of two residual blocks each, with a growing number of channels: 32 → 64 → 128 → 192. At the start of each stage the image resolution halves.
Head: global average pooling condenses each feature map into a single number, followed by dropout (20 %) and a linear layer 192 → 345 that produces a raw value (a “logit”) for each word.
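To make the numbers above concrete, the following sketch traces the image resolution through the network and counts the parameters of the final linear layer. It follows the description literally (every stage halves the resolution) and assumes a padding of 1 for the stride-2 convolutions, which the text does not state.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output width of a convolution; padding=1 is an assumption of this sketch."""
    return (size + 2 * padding - kernel) // stride + 1


size = conv_out(96)  # stem: 3 x 3 convolution with stride 2
trace = [("stem", 32, size)]
for i, channels in enumerate([32, 64, 128, 192], start=1):
    size = conv_out(size)  # resolution halves at the start of each stage
    trace.append((f"stage {i}", channels, size))

for name, channels, side in trace:
    print(f"{name}: {channels} channels, {side} x {side}")

# Global average pooling reduces each of the 192 maps to one number,
# so the linear layer has 192 inputs and 345 outputs, plus one bias per output.
head_params = 192 * 345 + 345
print("parameters in the linear head:", head_params)
```

Under these assumptions the feature maps shrink from 48 × 48 after the stem to 3 × 3 after the last stage, and the head accounts for only a small share of the roughly 2 million parameters.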

A residual block computes not just y = f(x) but y = f(x) + x – so the input signal is also “passed through”. These shortcuts (skip connections) make it possible to train deep networks stably without the learning signal vanishing. The idea comes from the ResNet work by He et al. (2016).
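The formula y = f(x) + x can be written as a tiny wrapper. This is a sketch on plain lists of numbers, with hypothetical names; in the real network, f stands for the convolutions, normalisation and activation inside a block.

```python
from typing import Callable, List

Vector = List[float]


def residual(f: Callable[[Vector], Vector]) -> Callable[[Vector], Vector]:
    """Wrap f so that the resulting block returns f(x) + x."""
    def block(x: Vector) -> Vector:
        return [fx + xi for fx, xi in zip(f(x), x)]
    return block


def zero_f(x: Vector) -> Vector:
    """A layer that has learned nothing yet and outputs only zeros."""
    return [0.0 for _ in x]


block = residual(zero_f)
print(block([1.0, -2.0]))  # the input passes through unchanged
```

Even when f contributes nothing, the block behaves like the identity, so stacking many blocks does not erase the signal.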

Step 3 – The maths behind it

Three building blocks are enough to understand the basic principle:

Convolution

A filter is a small matrix of weights. It is slid across the image and, at each position, computes a weighted sum of the pixels beneath it. Different filters respond to different patterns (e.g. a vertical edge). Because the same filter is applied everywhere, the network needs few parameters and recognises a pattern regardless of where it appears in the image.
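A minimal sketch of this sliding weighted sum, without padding and with stride 1. The filter here is written by hand to respond to a vertical edge; in the trained network the weights are learned. As in most deep learning libraries, the filter is applied without flipping (strictly speaking a cross-correlation).

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image and compute a weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 5 x 5 image: dark on the left, bright on the right.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# Hand-written vertical-edge filter (illustrative, not a learned one).
vertical_edge = [[-1.0, 0.0, 1.0] for _ in range(3)]

response = convolve2d(image, vertical_edge)
for row in response:
    print(row)
```

The response is large where the window straddles the dark-to-bright edge and zero where the window sees only uniform brightness, in every row alike.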

Activation (SiLU) and normalisation

After each convolution, a non-linear activation function is applied. Without it, stacked convolutions would collapse into a single linear operation; with it, the network can learn non-linear relationships too. DrawClash uses SiLU (also “Swish”), defined as f(x) = x · σ(x) with the sigmoid function σ. Batch normalisation keeps the intermediate values in a stable range and speeds up training (Ioffe & Szegedy, 2015).
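Both operations fit in a few lines. The sketch below implements SiLU exactly as defined above and a simplified batch normalisation over a list of values; real batch normalisation additionally has a learnable scale and shift and works per channel, which is left out here.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def silu(x: float) -> float:
    """SiLU / Swish: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


def batch_norm(values: List[float], eps: float = 1e-5) -> List[float]:
    """Shift to zero mean and scale to unit variance (no learnable parameters)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


for x in (-2.0, 0.0, 2.0):
    print(x, round(silu(x), 4))
print([round(v, 3) for v in batch_norm([10.0, 12.0, 14.0])])
```

SiLU lets positive values through almost unchanged and damps negative ones smoothly instead of cutting them off.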

Softmax: logits become probabilities

The 345 raw values at the output are converted with the softmax function into probabilities that add up to 100 %:

p_i = e^{z_i} / Σ_j e^{z_j}

Here z_i is the logit for word i. The word with the highest probability p_i is the top guess, which the game shows as a percentage.
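A short sketch of softmax in plain Python. Subtracting the largest logit before exponentiating is a common trick to avoid overflow and does not change the result. The three labels and logits are made up for illustration; the real model outputs 345 values.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw values into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["cat", "tiger", "dog"]  # hypothetical three-word vocabulary
logits = [2.0, 1.0, 0.1]          # made-up raw outputs
probs = softmax(logits)

top = max(range(len(probs)), key=probs.__getitem__)
print(f"top guess: {labels[top]} ({probs[top]:.0%})")
```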

How the network learned (training)

The model was trained on sketches from Google’s public “Quick, Draw!” dataset – a collection of over 50 million line drawings across 345 categories that people around the world drew in the browser game of the same name. Each drawing is rasterised as described above and shown to the network together with its correct label.
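As documented in the dataset repository, the simplified Quick, Draw! files are newline-delimited JSON: each line is one drawing with, among other fields, a `word` and a `drawing`, where every stroke is stored as one list of x values and one list of y values. The sketch below converts such a record into strokes as lists of points. The sample line is invented for illustration, and `parse_record` is a hypothetical helper, not part of any library.

```python
import json
from typing import List, Tuple

Stroke = List[Tuple[int, int]]

# Invented sample in the shape of one line of a simplified .ndjson file.
sample_line = (
    '{"word": "cat", "recognized": true, '
    '"drawing": [[[10, 40, 80], [20, 60, 20]], [[30, 30], [5, 15]]]}'
)


def parse_record(line: str) -> Tuple[str, List[Stroke]]:
    """Return the label and the strokes as lists of (x, y) points."""
    record = json.loads(line)
    strokes = [list(zip(xs, ys)) for xs, ys in record["drawing"]]
    return record["word"], strokes


word, strokes = parse_record(sample_line)
print(word, len(strokes), "strokes")
```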

During training, a loss function (cross-entropy) compares the prediction with the correct word. Through backpropagation, the millions of weights are gradually adjusted so the error gets smaller. After many passes over the data, the network reliably recognises the typical features of each word – without anyone ever explicitly telling it “what a cat looks like”.
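The following sketch shows cross-entropy and a few gradient steps on a three-word toy example. It is a deliberate simplification: the logits are updated directly, whereas real training uses backpropagation to pass the same gradient back to the network's weights. The learning rate and the starting values are assumptions chosen for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """Loss is the negative log-probability assigned to the correct word."""
    return -math.log(softmax(logits)[target])


def gradient(logits: List[float], target: int) -> List[float]:
    """Gradient of the loss with respect to the logits: p minus the one-hot target."""
    grad = softmax(logits)
    grad[target] -= 1.0
    return grad


logits = [0.0, 0.0, 0.0]  # the untrained model is undecided
target = 1                # index of the correct word
learning_rate = 0.5       # hypothetical value

losses = []
for _ in range(5):
    losses.append(cross_entropy(logits, target))
    grad = gradient(logits, target)
    logits = [z - learning_rate * g for z, g in zip(logits, grad)]

print([round(loss, 3) for loss in losses])
```

The loss shrinks with every step because the logit of the correct word is pushed up while the others are pushed down.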

Why a CNN on images – and not an RNN on strokes?

For Quick, Draw! sketches there are two common approaches, and both are well studied in the research literature:

Image-based (our approach): the strokes are rasterised into an image and classified with a CNN. Advantages: very robust, fast, and it leverages the mature field of image recognition.
Sequence-based: the strokes are treated as a temporal sequence of points and processed with a recurrent network (RNN, e.g. an LSTM). The best-known example is Sketch-RNN by Ha & Eck (2017), which can even generate sketches itself. Such models additionally use the order and direction of the strokes.

DrawClash deliberately takes the image-based CNN route: it is small enough for response times in the millisecond range, needs no assumptions about stroke order, and for a real-time game delivers the best mix of accuracy and speed.

Real time: why the guess keeps changing

Instead of waiting until you’re finished, DrawClash sends your drawing to the model again after every new stroke. Because the network is small and only has to process a 96 × 96 image, a pass takes just a few milliseconds. At first the AI is uncertain and suggests several words; with each additional detail the distribution sharpens. The latency shown in the game (e.g. “inf 6 ms”, where “inf” stands for inference) is exactly this compute time of the model.
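The loop behind this behaviour can be sketched as follows. `live_guesses` and `dummy_classify` are hypothetical names; the dummy classifier merely stands in for rasterisation plus the CNN so that the example runs on its own.

```python
import time
from typing import Callable, List, Tuple

Stroke = List[Tuple[float, float]]


def live_guesses(strokes: List[Stroke],
                 classify: Callable[[List[Stroke]], str]) -> List[Tuple[str, float]]:
    """Re-run the classifier on the drawing so far after every new stroke."""
    results = []
    drawing: List[Stroke] = []
    for stroke in strokes:
        drawing.append(stroke)
        start = time.perf_counter()
        guess = classify(drawing)
        elapsed_ms = (time.perf_counter() - start) * 1000.0  # inference time only
        results.append((guess, elapsed_ms))
    return results


def dummy_classify(drawing: List[Stroke]) -> str:
    """Stand-in for the model: the guess changes once a second stroke exists."""
    return "circle" if len(drawing) < 2 else "bicycle"


strokes = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 0.0), (3.0, 1.0)], [(1.0, 1.0), (2.0, 0.0)]]
for guess, ms in live_guesses(strokes, dummy_classify):
    print(f"{guess} (inf {ms:.3f} ms)")
```

Only the model call is timed, which matches the idea that the displayed latency is the model's compute time rather than network or rendering time.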

Limits – why the AI sometimes gets it wrong

The model only knows the 345 trained words and judges shape alone. Very abstract, atypical or overloaded sketches can confuse it, and similar words (say “cat” and “tiger”) are easy to mix up. Sketches that emphasise the typical features of a word are recognised faster; you’ll find concrete tips in the how-to guide. The complete list of recognisable words is under “All words”.

Privacy: what happens to your drawings

Your sketches are processed only for the gameplay – that is, for evaluation by the model and for display to other players in the same room. Details are in the privacy policy.

Sources & further reading

If you want to go deeper, here are the scientific foundations of the building blocks used:

Ha, D. & Eck, D. (2017). A Neural Representation of Sketch Drawings (Sketch-RNN). arXiv:1704.03477
Google Creative Lab. The Quick, Draw! Dataset. github.com/googlecreativelab/quickdraw-dataset
He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep Residual Learning for Image Recognition (ResNet). arXiv:1512.03385
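Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167
Elfwing, S., Uchibe, E. & Doya, K. (2017). Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning (SiLU). arXiv:1702.03118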
Ramachandran, P., Zoph, B. & Le, Q. (2017). Searching for Activation Functions (Swish). arXiv:1710.05941

Note: DrawClash is an independent project and is not affiliated with Google. “Quick, Draw!” and its dataset are works by Google Creative Lab and are named here only as the data source and for context.
