Soccer Tracking - Part 2 - Homography

In part 1 we looked at object detection and tracking. I mentioned at the end that another challenge we’d have to tackle is mapping the players we are tracking to bird’s-eye-view xy coordinates on the pitch. This is a common problem in computer vision, solved with a homography, also called a perspective transformation. A few common use cases for homographies are removing perspective distortion, stitching multiple images into a single image, and creating bird’s-eye views from an image.

Homography perspective correction

Homography panorama stitching

Examples of perspective correction and panorama stitching from https://docs.opencv.org/4.x/d9/dab/tutorial_homography.html

As shown in the OpenCV article linked above, the key idea behind homography is the homography matrix, a 3x3 matrix that defines the transformation between two different perspectives. A common way to compute the homography matrix is by finding or defining coordinates for corresponding points in the two perspectives; at least four such pairs are needed. The four black points on the building in the first example and the points on the mountain in the stitching example show this. The same approach can be used to map our soccer players onto a bird’s-eye view of the pitch: define the coordinate system of the field, find points in the video that correspond to specific points on the bird’s-eye view of the field, then compute the homography matrix and use it to project all of the players onto the bird’s-eye view. That process is shown in the image below from the paper Evaluating Soccer Player: from Live Camera to Deep Reinforcement Learning.

Soccer homography estimation
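
To make the keypoint approach concrete, here is a minimal sketch of computing a homography from exactly four point pairs in plain Python. It fixes the bottom-right entry of the matrix to 1 and solves the resulting 8x8 linear system. The function names and the pixel coordinates are made up for illustration, and the 105 x 68 metre pitch is an assumption. In practice you would more likely call OpenCV's cv2.findHomography, which also handles more than four points and outliers.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x


def homography_from_4_points(src, dst):
    """Return a 3x3 matrix H (with H[2][2] fixed to 1) mapping src points to dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # From u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear_system(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


# Hypothetical pixel coordinates of the four pitch corners in a broadcast frame.
image_pts = [(300, 200), (980, 200), (1240, 620), (40, 620)]
# The same corners in pitch coordinates (metres), assuming a 105 x 68 pitch.
pitch_pts = [(0, 0), (105, 0), (105, 68), (0, 68)]
H = homography_from_4_points(image_pts, pitch_pts)
```

Fixing the last entry to 1 works as long as the image origin does not map to a point at infinity, which holds for this example.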

Another approach, proposed in the paper Sports Camera Calibration via Synthetic Data, uses a database of known camera orientations to estimate the homography matrix. This is the approach I went with, mainly because the dataset of camera orientations is publicly available and I didn’t want to label another dataset to find keypoints, although I plan to implement the other approach in the future. The main drawback is that the database is relatively small, with only 91,000 orientations. Since soccer broadcasts are constantly panning and zooming, the closest orientation in the database can be somewhat far from the true orientation, which causes inaccurate estimates. The paper acknowledges this drawback but also notes that, since the main camera is always situated near midfield at a similar height, the estimate is typically close.

Soccer homography using a database

The full process for this approach is shown above. First, the image of the soccer game is run through a model that removes the background and extracts the field markings. Then the closest known camera orientation is looked up based on the edge image of the field markings.

Edge images

The original broadcast, the broadcast with the background removed, and the edge image

Using the edge image from the broadcast, we can look up the closest known camera orientation. The video below shows the edge image from the broadcast and the edge image from the closest camera orientation. As mentioned before, the matched orientation can change suddenly, since the dataset contains a relatively small number of known orientations and the lookup is recomputed on every frame. This causes issues when using this method to compute the players’ coordinates: the homography can change drastically from one frame to the next, so the estimated coordinates jump around.

Homography lookup
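
The lookup step can be pictured as a nearest-neighbour search. The sketch below assumes each edge image has already been reduced to a fixed-length feature vector (how those features are produced is outside this sketch), and that the database is a list of (feature, camera) pairs. The names, the three-dimensional features and the camera labels are all hypothetical.

```python
import math


def nearest_orientation(query_feature, database):
    """Return (index, distance) of the database entry closest to query_feature."""
    best_index, best_distance = -1, math.inf
    for index, (feature, _camera) in enumerate(database):
        # Euclidean distance between the two feature vectors.
        distance = math.sqrt(
            sum((q - f) ** 2 for q, f in zip(query_feature, feature))
        )
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


# Toy database: each entry pairs a feature vector with a made-up camera label.
database = [
    ([0.0, 1.0, 0.2], "pan -10, tilt -8"),
    ([0.5, 0.5, 0.5], "pan 0, tilt -10"),
    ([1.0, 0.0, 0.8], "pan 12, tilt -9"),
]
index, distance = nearest_orientation([0.1, 0.9, 0.2], database)
closest_camera = database[index][1]
```

Because the search is repeated independently for every frame, two nearly identical frames can land on different database entries, which is where the sudden jumps come from.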

Using the bounding box coordinates for each player that the object tracker finds, we can compute the bird’s-eye-view coordinates by multiplying the coordinates, written in homogeneous form (x, y, 1), by the homography matrix and dividing the result by its third component.
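
As a rough sketch of that projection step in plain Python: the code below assumes boxes are given as (x1, y1, x2, y2) in pixels and uses the bottom-centre of the box as the point where the player touches the ground. Both are my assumptions for illustration, and the matrix values are made up. With OpenCV the equivalent call would be cv2.perspectiveTransform.

```python
def project_point(h, x, y):
    """Apply the 3x3 homography h to the pixel (x, y) and return pitch coordinates."""
    # Multiply h by the homogeneous point (x, y, 1).
    px = h[0][0] * x + h[0][1] * y + h[0][2]
    py = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    # Divide by the third component to get back to ordinary 2D coordinates.
    return px / w, py / w


def player_pitch_position(h, box):
    """Project the bottom-centre of a bounding box (assumed ground contact point)."""
    x1, y1, x2, y2 = box
    foot_x = (x1 + x2) / 2
    foot_y = y2
    return project_point(h, foot_x, foot_y)


# Made-up homography and bounding box, for illustration only.
example_h = [
    [0.1, 0.0, -10.0],
    [0.0, 0.2, -20.0],
    [0.0, 0.001, 1.0],
]
example_box = (400, 300, 440, 400)
pitch_x, pitch_y = player_pitch_position(example_h, example_box)
```

Using the bottom of the box rather than its centre matters because the homography is only valid for points on the pitch plane, and the centre of a bounding box sits roughly at the player's waist.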

Birds eye view 1 Birds eye view 2

Example images showing the broadcast and the bird’s-eye view

One other small thing I added was a way to determine each player’s team. This is done by running K-means with two clusters on the image crops inside the bounding boxes returned by the object detector. The idea is that the two different jersey colors will form two distinct clusters. Both this and the homography estimation can be improved on, but this is a nice start.
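
Here is a minimal sketch of the team clustering idea in plain Python. It assumes each crop is summarised by its mean RGB color before clustering, which is one possible choice of feature and not necessarily the one used here. The function names and sample colors are made up, and the centroid initialisation (first point plus the point farthest from it) is chosen only to keep the example deterministic. A library implementation such as scikit-learn's KMeans would be the usual choice in practice.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two colors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mean_color(pixels):
    """Average a list of (r, g, b) pixels into a single color."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def two_means(points, iterations=20):
    """Cluster colors into two groups with a basic k-means loop (k = 2)."""
    # Deterministic start: the first point and the point farthest from it.
    first = points[0]
    centroids = [first, max(points, key=lambda p: squared_distance(p, first))]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            0 if squared_distance(p, centroids[0]) <= squared_distance(p, centroids[1]) else 1
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            new_centroids.append(mean_color(members) if members else centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical mean jersey colors for five detected players (three red, two blue).
player_colors = [(200, 30, 40), (190, 40, 35), (30, 40, 210), (25, 50, 200), (210, 25, 30)]
team_labels, team_centroids = two_means(player_colors)
```

Averaging the whole crop also mixes in grass and skin pixels, so one possible improvement is to cluster on only the central part of each box.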