Coordinate Transforms in iOS using Swift, Part 1: The L Triangle

Gary Bartos
27 min read · Jan 31, 2021

In the zeroth post in this series I described problems working with multiple coordinate systems in iOS frameworks such as UIKit, Core Graphics, AVFoundation, Vision, and Core ML. Briefly put, it’s a burden to keep track of multiple coordinate systems.

In this post I’ll describe the “L Triangle” technique of defining coordinate frames and calculating transforms between frames. If you’re developing an iOS app for optical character recognition (OCR), 2D barcode reading, computer vision, machine learning, or other applications with multiple image coordinate systems, you may find this technique useful.

If you’re already familiar with matrix math then you’ll see that the L Triangle technique relies on constraints in the geometry of iOS device frames. We use simple types to generate point correspondences, then use these point correspondences to find affine transforms.

If matrix math is not familiar to you, then the L Triangle technique may make the math easier to learn and remember. Or you can use the code and ignore the math.

It’s All Math, But…

The L Triangle method saves you from having to think about transforms. There’s no need to specify translations, rotations, or scaling in your code. A single function takes care of the math for you. We feed that function the definitions of coordinate frames.

Initialization of a coordinate frame makes it easy to understand the frame at a glance.

let frameOCR = ImageFrame(origin: .BottomLeft, xAlignment: .Horizontal, size: normalizedSize)

let framePreview = ImageFrame(origin: .TopLeft, xAlignment: .Horizontal, size: view.previewLayer.frame.size)

let frameImage = ImageFrame(origin: .TopLeft, xAlignment: .Horizontal, size: imageSize)

The sections below explain why initialization requires a size parameter, and also why an axis can be defined as Horizontal rather than Left or Right.

For transforms between any two frames we have a single function with a signature similar to the following:

static func transform(from: Frame, to: Frame) -> Transform {
    //...
}

We apply the returned transform to convert points from one frame to another frame.

The function transform(from:to:) can be used in place of CGAffineTransform, CATransform3D, and the various “VN___” functions that convert to and from normalized vision coordinates.

The L Triangle: Summary

The L Triangle technique is book-keeping rather than new math. The math is standard stuff, but we hide it in the function transform(from:to:).

Here’s the L Triangle technique in brief:

  1. Consider the rectangle defined by the screen of your iPhone or iPad.
  2. Define the standard orientation as the orientation in which you typically hold the device. We’ll assume the standard orientation is portrait mode, with the longer dimension of the device up and down and the circular home button at the bottom.
  3. With your finger, trace an L shape along the edges of the screen rectangle. Seriously, trace the rectangle with your finger! Start at the top left corner and slide down to the bottom left corner. Then slide from bottom left corner to bottom right to complete the L.
  4. For each coordinate frame, and with your device in the standard orientation, determine the coordinates of the top left, bottom left, and bottom right points of the L shape. The coordinates will be defined in terms of the horizontal and vertical dimensions of the screen in the units of the coordinate frame.
  5. Consider the three points of the L as a triangle.
  6. Find the affine transform from the L triangle in one coordinate system to the L triangle in another coordinate system.
Tracing the L on an iOS device.

The enums, structs, functions, and extensions for the L Triangle technique are provided in the sample code below. If you were to use the L Triangle types in your own code, your remaining task would be to define coordinate frames. Finding the transform between any two frames requires only that you call the function transform(from:to:). Define two coordinate frames, and the transform between those two frames is like a free gift with purchase.

At the risk of putting Descartes before the horse, I’ll first describe how constraints on geometry allow us to ignore rotations, translations, and scaling. Then we’ll walk through the process of finding the coordinates of the three points in the L Triangle.

Constraints on Coordinate Frames and Transforms

Online and/or in Swift textbooks you may have found code that uses CGAffineTransform to generate transforms between two coordinate frames. CGAffineTransform has functions to perform arbitrary translation, rotation, and scaling. (Mirroring is a combination of negative scaling and translation.)

CGAffineTransform allows for arbitrary transforms: any translation, any rotation, any scaling. But for transforms between iOS coordinate frames, the rotations, translations, and scales are each constrained to a small set of values.

The coordinate frames for UIView, images, graphics, and vision results are aligned with the edges of the screen. Rotations for transforms between coordinate frames are CGFloat values of 0.0, 90.0, 180.0, and 270.0 degrees. Rotations can be positive or negative, but there are only four magnitudes for rotation.

Translations (linear shifts in X and Y) for transforms between iOS coordinate frames depend on the width and height of the screen expressed in the coordinate frame’s units.

The width of an image in one frame may map to the height of an image in another frame, but for each frame we have just two magnitudes for translation: width and height.

Scaling also depends on image width and image height. On an iPhone 7, for example, an image captured in portrait mode has a resolution of 1080 x 1920 pixels. The image may be scaled for display in a UIView with a size of 375 x 667 points.

The Vision framework provides results in normalized coordinates. Normalized coordinates fall within the range (0.0, 0.0) to (1.0, 1.0) regardless of the original image width and height. The size of a frame with normalized coordinates is 1.0 x 1.0. The size of a photo from an iPhone 7 is 1080 x 1920. Given these two sizes, we can find the scale between normalized coordinates and photo coordinates.
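As a quick worked example of that scale, here is a minimal sketch using plain tuples. The variable names are made up for illustration, and the sketch handles scale only: it assumes the two frames share an origin corner and axis directions, which in general they do not.

```swift
// Sizes of two image frames, each in its own units.
let normalizedSize = (width: 1.0, height: 1.0)      // Vision results
let photoSize = (width: 1080.0, height: 1920.0)     // iPhone 7 photo, portrait

// Per-axis scale from normalized coordinates to photo pixels.
let scaleX = photoSize.width / normalizedSize.width
let scaleY = photoSize.height / normalizedSize.height

// Scale a normalized point into photo pixels.
// This ignores any difference in origin corner or axis direction.
let normalizedPoint = (x: 0.5, y: 0.25)
let photoPoint = (x: normalizedPoint.x * scaleX, y: normalizedPoint.y * scaleY)

print(photoPoint.x, photoPoint.y)   // 540.0 480.0
```

Handling the origin and axis directions as well is what the rest of the L Triangle technique is for.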

Image Frame vs. Coordinate Frame

2D coordinate frames have (x,y) coordinates that go off to positive infinity and negative infinity. On an iOS device the screen acts as a window onto each coordinate frame, showing us just the coordinates from (0,0) to (width,height). We can find the X and Y bounds of the screen in any coordinate frame. These bounds are the width and height used for translation and scaling.

I’ll use the term image frame to mean a coordinate frame that has a corner of the screen as an origin, and that has a width and height filling the screen. Whatever appears on the iOS screen is an image scaled to the screen resolution.

Oftentimes we’ll simply use the word “frame.”

Eight Image Frames

An image frame in iOS has an origin at one of the four corners of the screen.

Within the bounds of the screen, (x,y) coordinates are positive. This means that the X-axis points from one corner to another corner, and the Y-axis also points from one corner to another corner.

There are four corners, and at each corner the X-axis can run either horizontally or vertically along an edge, with the Y-axis taking the other edge. That gives eight combinations of corner origin, X direction, and Y direction.

All eight combinations of corner, +X direction, and +Y direction for iOS.

For a frame it’s sufficient to specify a corner as origin and whether X is horizontal or vertical. If the origin is the bottom right corner, and if X is aligned vertically, then X points up and Y points to the left.
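To see why a corner plus an X alignment is enough, here is a small sketch that enumerates all eight frames. The Corner and Alignment enums here are pared-down stand-ins written for this example (with lowercase case names), and axes(origin:xAlignment:) is a hypothetical helper, not the code from the sample playground.

```swift
// Pared-down stand-ins for illustration only.
enum Corner: String, CaseIterable { case topLeft, topRight, bottomLeft, bottomRight }
enum Alignment: String, CaseIterable { case horizontal, vertical }

/// Directions of +X and +Y, given that on-screen coordinates must be positive.
func axes(origin: Corner, xAlignment: Alignment) -> (x: String, y: String) {
    // From any corner, exactly one horizontal direction and one vertical
    // direction point into the screen.
    let horizontal = (origin == .topLeft || origin == .bottomLeft) ? "right" : "left"
    let vertical = (origin == .topLeft || origin == .topRight) ? "down" : "up"
    return xAlignment == .horizontal ? (horizontal, vertical) : (vertical, horizontal)
}

// Four corners times two alignments gives eight frames.
var frames: [String] = []
for corner in Corner.allCases {
    for alignment in Alignment.allCases {
        let a = axes(origin: corner, xAlignment: alignment)
        frames.append("\(corner.rawValue): +X \(a.x), +Y \(a.y)")
    }
}
frames.forEach { print($0) }
```

For the bottom right corner with X aligned vertically, the helper reports +X up and +Y left, matching the example in the text.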

The additional piece of data we need to define an image frame is the size.

Sizes in Image Frames

The size of an image frame is defined as the width and height spanning the screen. The width means the horizontal extent of the image frame with the device held in the standard orientation, and the height is the vertical extent.

From my own development, here are four image frames and their sizes:

  • Photo: 1080 x 1920 pixels (iPhone 7)
  • UIView: 375 x 667 points
  • QR Codes: 1.0 x 1.0 normalized coordinates (metadata coordinates)
  • OCR: 1.0 x 1.0 normalized coordinates
Four image frames in iOS encountered during app development: photo, a UI layer, QR Code coordinates, and OCR coordinates

Finding the Three Points of the L in an Image Frame

For each frame, we find the coordinates of the top left, bottom left, and bottom right corners. We identify the three points in the same order for each frame because we need to match point 1 in the coordinates of the first frame to point 1 in the coordinates of the second frame. Then we match point 2 and point 3.

The L Triangle sample code calculates the coordinates of these three points automatically for each ImageFrame. You initialize an ImageFrame struct by providing the origin corner, the horizontal or vertical alignment of the X-axis, and the size of the image frame. The ImageFrame code determines the coordinates of the three points from the corner, alignment, and size you provide.

The graphic below shows the Photo, UI, QR Codes, and OCR frames with the three points of the L Triangle labeled 1, 2, 3 in red.

The three points of the L Triangle marked in each of four image frames: photo, UI, QR Code, and OCR

For the Photo frame with the origin at top left, +X pointing to the right, and +Y pointing down, and with a size of 1080 x 1920 pixels, the three points of the L Triangle are:

  1. (0,0) the origin
  2. (0, 1920) on the Y-axis
  3. (1080, 1920) which is (width, height) in the Photo frame

The UI frame, like the Photo frame, has the origin at top left, +X pointing to the right, and +Y pointing down, but the size is 375 x 667. The three points for the UI frame are:

  1. (0,0) the origin
  2. (0, 667) on the Y-axis
  3. (375, 667) which is (width, height) in the UI frame

QR Codes in this example have an origin at top right when the device is in the standard orientation, with +X pointing down and +Y pointing to the left. QR Code positions are reported in normalized coordinates, meaning the width and height are both 1.0:

  1. (0.0, 1.0)
  2. (1.0, 1.0)
  3. (1.0, 0.0)

OCR results are also reported in normalized coordinates, but with the origin at lower left.

  1. (0.0, 1.0)
  2. (0.0, 0.0)
  3. (1.0, 0.0)

In your app development, coordinate systems for OCR and QR Codes may be different from what I’ve shown here. YMMV (Your Matrices May Vary). QR Codes can be read using at least two different iOS frameworks. Each framework has its own coordinate system to report QR code locations.
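If you want to check L points by hand, here is a compact sketch that computes them from a corner, an X alignment, and a size. Frame, point(u:v:), and lPoints() are names invented for this sketch; it takes a different route from the triangleL() function in the sample code, and the enums are simplified stand-ins.

```swift
enum Corner { case topLeft, topRight, bottomLeft, bottomRight }
enum Alignment { case horizontal, vertical }

struct Frame {
    var origin: Corner
    var xAlignment: Alignment
    var width: Double    // horizontal extent in the standard (portrait) orientation
    var height: Double   // vertical extent in the standard (portrait) orientation

    /// Coordinates in this frame of a screen position given as fractions:
    /// u runs 0...1 from left to right, v runs 0...1 from top to bottom.
    func point(u: Double, v: Double) -> (x: Double, y: Double) {
        let originOnLeft = origin == .topLeft || origin == .bottomLeft
        let originOnTop = origin == .topLeft || origin == .topRight
        let dh = (originOnLeft ? u : 1 - u) * width    // horizontal distance from origin
        let dv = (originOnTop ? v : 1 - v) * height    // vertical distance from origin
        return xAlignment == .horizontal ? (dh, dv) : (dv, dh)
    }

    /// The L points in order: top left, bottom left, bottom right.
    func lPoints() -> [(x: Double, y: Double)] {
        [point(u: 0, v: 0), point(u: 0, v: 1), point(u: 1, v: 1)]
    }
}

let photo = Frame(origin: .topLeft, xAlignment: .horizontal, width: 1080, height: 1920)
let ui = Frame(origin: .topLeft, xAlignment: .horizontal, width: 375, height: 667)
let qr = Frame(origin: .topRight, xAlignment: .vertical, width: 1, height: 1)
let ocr = Frame(origin: .bottomLeft, xAlignment: .horizontal, width: 1, height: 1)

for (name, frame) in [("Photo", photo), ("UI", ui), ("QR", qr), ("OCR", ocr)] {
    print(name, frame.lPoints().map { "(\($0.x), \($0.y))" })
}
```

The QR frame is modeled here with X aligned vertically, which is what the QR point list implies. Your own QR frame may well differ.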

Finding the Origin and X-direction

Whether we use the L Triangle or some other method, we need to find the origin and the directions of the axes for each frame. Here are three methods to determine a frame’s definition:

  1. Documentation. Look up the frame definition in documentation.
  2. Experimentation. For vision results, identify the (x,y) locations of a few objects in a captured photo.
  3. Trial and error. Keep changing the frame definition in your code and run your app until you get the expected results. Trial and error is often accompanied by loud commentary and impolite words. Confirm the frame definition that seems to work by re-reading documentation and/or performing some experiments.

For the Vision framework you can take a picture of a printed paper target and write the (x,y) coordinates of the detected objects to the console. Then you can determine the directions of +X and +Y, and which corner the origin must be.

In the image below the red letters A, B, and C represent actual text printed on a white sheet and read by an OCR algorithm.

The letters A, B, C printed on a piece of paper. OCR results are reported in normalized coordinates from (0.0, 0.0) to (1.0, 1.0).

Let’s say the coordinates printed to our console are as follows:

A: (0.22, 0.81)

B: (0.23, 0.34)

C: (0.79, 0.33)

A and B differ mostly in Y; the small difference in X arises because the alignment of paper to image isn’t quite perfect. From B to A the Y value increases from 0.34 to 0.81, so we know Y points up.

From B to C the X values increase from 0.23 to 0.79. The +X and +Y axes point away from the origin, which we can confirm is at the lower left corner.

An OCR image frame determined by comparing the coordinates of the letters A, B, and C printed on a sheet of paper and read using an OCR algorithm.
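The same reasoning can be written as a few lines of code. This is only a sketch of the experiment above: it assumes you know where each letter sits on the page (A top left, B bottom left, C bottom right), and the variable names are mine.

```swift
// Positions reported for the three printed letters.
// On the paper, A is top left, B is bottom left, and C is bottom right.
let a = (x: 0.22, y: 0.81)
let b = (x: 0.23, y: 0.34)
let c = (x: 0.79, y: 0.33)

// B -> A runs up the page; B -> C runs to the right across the page.
let up = (x: a.x - b.x, y: a.y - b.y)
let right = (x: c.x - b.x, y: c.y - b.y)

// X is horizontal if moving right on the page changes x more than it changes y.
let xIsHorizontal = abs(right.x) > abs(right.y)

// Pick the component of each step that belongs to the horizontal / vertical axis.
let horizontalStep = xIsHorizontal ? right.x : right.y
let verticalStep = xIsHorizontal ? up.y : up.x

// Values that grow toward the right put the origin on the left;
// values that grow toward the top put the origin at the bottom.
let originSide = horizontalStep > 0 ? "left" : "right"
let originEdge = verticalStep > 0 ? "bottom" : "top"

print("X is \(xIsHorizontal ? "horizontal" : "vertical"); origin at \(originEdge) \(originSide)")
// prints "X is horizontal; origin at bottom left"
```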

Even when a frame is documented, the documentation could be misleading. Save yourself frustration by confirming the frame’s definition yourself.

Why Three Points?

If we have the coordinates in two different frames for three non-colinear points, then we can find an affine transform between the two frames. We’re reproducing the core functionality but not the full API of CGAffineTransform.

There are cases for which we could calculate a transform between 2D frames using just two points. Mathematically, though, we can’t always find a transform using just two points.

Let’s assume that for one frame we choose two points that both lie on the x-axis, such as (0,0) and (10,0). Then y = 0 for both points. It’s not possible to determine how the Y-axis points given just these two points.

Two points on the X-axis. The direction of Y cannot be determined from those two points alone.

A mathematical formula doesn’t pick the points itself; it merely operates on the points we feed it. We want the math to work — or to fail with an explicit error — regardless of the input. To determine a transform between 2D coordinate frames we need three non-colinear points.
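One way to test for colinear points is to compute the signed area of the triangle they form, which equals the determinant of the 3x3 matrix with columns (x, y, 1). The helper below is a plain-Swift sketch with a name I made up; the sample code performs the equivalent check with a SIMD matrix determinant.

```swift
/// Twice the signed area of the triangle p1, p2, p3.
/// Equal to the determinant of the 3x3 matrix whose columns are (x, y, 1).
func signedArea2(_ p1: (x: Double, y: Double),
                 _ p2: (x: Double, y: Double),
                 _ p3: (x: Double, y: Double)) -> Double {
    (p2.x - p1.x) * (p3.y - p1.y) - (p3.x - p1.x) * (p2.y - p1.y)
}

// Three points on the X-axis enclose zero area: no affine transform can be found.
let flat = signedArea2((0, 0), (10, 0), (5, 0))

// The L triangle of a 1080 x 1920 photo frame: top left, bottom left, bottom right.
let photoL = signedArea2((0, 0), (0, 1920), (1080, 1920))

print(flat == 0)     // true: colinear
print(photoL != 0)   // true: a usable triangle
```

The L triangle always spans the full width and height of the frame, so it can only be colinear when the frame has zero width or zero height.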

The Transform Math

For each frame, the transform(from:to:) function requires three non-colinear points: a triangle.

Here’s the transform(from:to:) function:

static func transform(from: ImageFrame, to: ImageFrame) -> float3x3? {
    // A frame with zero width or zero height has no usable L triangle.
    if from.size.x == 0 || from.size.y == 0 || to.size.x == 0 || to.size.y == 0 {
        return nil
    }

    // Identical frames map to the identity matrix. The == operator already compares size.
    if from == to {
        return float3x3(1)
    }

    return Transform.affine(from: from.triangleL(), to: to.triangleL())
}

For image frames with zero width or zero height you might prefer to throw an error rather than return nil.
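A throwing version of the size check might look something like the sketch below. FrameSizeError and validateSize(width:height:) are hypothetical names for this example; they are not part of the sample code, which returns nil instead.

```swift
/// Hypothetical error type for this sketch.
enum FrameSizeError: Error {
    case zeroSize(width: Float, height: Float)
}

/// Stand-in for the size check at the top of transform(from:to:).
/// Throws instead of returning nil when either dimension is zero.
func validateSize(width: Float, height: Float) throws {
    if width == 0 || height == 0 {
        throw FrameSizeError.zeroSize(width: width, height: height)
    }
}

var message = "ok"
do {
    try validateSize(width: 1080, height: 0)
} catch FrameSizeError.zeroSize(let width, let height) {
    message = "Frame has zero extent: \(width) x \(height)"
} catch {
    message = "Unexpected error: \(error)"
}
print(message)   // prints "Frame has zero extent: 1080.0 x 0.0"
```

The trade-off is the usual one: a thrown error can say which frame was bad and why, while an optional keeps the call site shorter.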

SIMD and the Affine Transform

The transform(from:to:) function returns an optional float3x3. float3x3 is a type from the SIMD framework. Matrix math can be expressed simply and efficiently using SIMD types. Points are represented by simd_float2, a 2D vector. Sizes are also represented by simd_float2.

The math and code for the affine transform are presented in one of my earlier posts, “Finding an Affine Transform the Traditional Way with three 2D point correspondences in Swift.” We create a 3x3 matrix for the three points in each frame’s L triangle. An affine transform maps a triangle in one frame to a triangle in another frame.

With a 3x3 matrix such as float3x3 or CGAffineTransform we can transform an (x,y) point in one frame to the corresponding (x,y) coordinates in another frame.
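As a worked example of the triangle-to-triangle math, here is a minimal sketch that maps the Photo frame's L triangle onto the OCR frame's L triangle. The names A, B, and M follow the M = B * Inv(A) comment in the sample code; the sketch skips the colinearity checks that the full Transform.affine function performs.

```swift
import simd

// L triangles (top left, bottom left, bottom right) as homogeneous columns [x, y, 1].
// Photo frame: 1080 x 1920, origin top left, +X right, +Y down.
let A = simd_float3x3(simd_float3(0, 0, 1),
                      simd_float3(0, 1920, 1),
                      simd_float3(1080, 1920, 1))

// OCR frame: normalized 1.0 x 1.0, origin bottom left, +X right, +Y up.
let B = simd_float3x3(simd_float3(0, 1, 1),
                      simd_float3(0, 0, 1),
                      simd_float3(1, 0, 1))

// M * A = B, so M = B * inverse(A).
let M = B * A.inverse

// The center of the photo should land near the center of the normalized frame.
let center = M * simd_float3(540, 960, 1)
print(center.x, center.y)   // approximately 0.5 0.5
```

Because the values are Float, expect results that are close to, rather than exactly equal to, the ideal coordinates.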

Unlike CGAffineTransform, a float3x3 can also be used for a perspective transform. Finding perspective transforms requires quadrilaterals instead of triangles. For more about perspective transforms, see the post “Perspective transform from quadrilateral to quadrilateral in Swift using SIMD for matrix operations.”

We’ll still have reason to use CGPoint, CGAffineTransform, and other Core Graphics types in our code. The L Triangle code includes extensions to perform conversions between Core Graphics types and SIMD types.

Applying the simd_float3x3 Transform to a point

Once we have our transform from one coordinate frame to another expressed as a simd_float3x3, we still need the means to apply the transform to a point. There are a few options.

The sample code includes an extension for CGPoint that adds CGPoint.applying(float3x3). Simple.

If you’re familiar with CGAffineTransform, then you can convert between float3x3 and CGAffineTransform using the extension in the sample code below. The function testTransform(from:fromName:to:toName:) prints the float3x3 and the equivalent CGAffineTransform.

Finally, you could write your own code to convert a CGPoint from one frame to another using intermediate SIMD expressions:

  1. Convert a CGPoint to a simd_float2, which is a 2D vector in the SIMD framework we can use as an [x, y] point.
  2. Convert the simd_float2 to a simd_float3 expressing homogeneous coordinates: [x, y, 1]
  3. Apply the transform by finding the product (float3x3 * simd_float3). This yields a new simd_float3. The new simd_float3 is the point in the “to” image frame expressed in homogeneous coordinates (xZ, yZ, Z).
  4. Convert the simd_float3 back to a simd_float2 (x,y).
  5. Convert the simd_float2 back to a CGPoint.
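The five steps above can be sketched as a single function. convert(_:using:) is a hypothetical name for this example, not the CGPoint.applying(float3x3) extension from the sample code, and the flipY matrix is just a simple transform chosen to exercise it.

```swift
import CoreGraphics
import simd

/// Applies a 3x3 transform to a CGPoint using the five steps listed above.
func convert(_ point: CGPoint, using transform: simd_float3x3) -> CGPoint {
    let p2 = simd_float2(Float(point.x), Float(point.y))   // 1. CGPoint -> [x, y]
    let p3 = simd_float3(p2.x, p2.y, 1)                     // 2. [x, y] -> [x, y, 1]
    let q3 = transform * p3                                 // 3. homogeneous result (xZ, yZ, Z)
    let q2 = simd_float2(q3.x, q3.y) / q3.z                 // 4. divide by Z to recover (x, y)
    return CGPoint(x: CGFloat(q2.x), y: CGFloat(q2.y))      // 5. [x, y] -> CGPoint
}

// Example transform, given as columns: flip Y within a normalized 1.0 x 1.0 frame,
// i.e. (x, y) -> (x, 1 - y). This converts between top-left and bottom-left origins.
let flipY = simd_float3x3(simd_float3(1, 0, 0),
                          simd_float3(0, -1, 0),
                          simd_float3(0, 1, 1))

let converted = convert(CGPoint(x: 0.25, y: 0.25), using: flipY)
print(converted.x, converted.y)   // 0.25 0.75
```

For an affine transform Z stays at 1, so the division in step 4 changes nothing; it matters only once perspective transforms are in the mix.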

The function testTransform(from:fromName:to:toName:) demonstrates the use of both CGPoint.applying(CGAffineTransform) and CGPoint.applying(float3x3).

Advantages of the L Triangle

Before we get to the sample code, I’ll recap the advantages of the L Triangle:

  • We can think in terms of coordinate frames rather than in terms of transforms.
  • Once you define coordinate frames, you can find the transform between any two coordinate frames by calling the transform(from:to:) function.
  • Adding another coordinate frame requires only that you create an instance of the ImageFrame struct. That instance is ready to transform to any other frame.
  • Constraints allow us to simplify initialization of the ImageFrame struct.
  • The function to find the affine transform between frames implements simple matrix math.
  • If you don’t want to know more about the matrix math, you can simply use the transform(from:to:) function.
  • By working with the 3x3 matrix of simd_float3x3, we allow for intermingling of affine transforms and perspective transforms. Perspective transforms allow us to map images taken from two different viewpoints.
  • The L Triangle technique could easily be reproduced using other frameworks, and in other languages, without having to know much matrix math.

Sample Code in Swift

The following code is ready to copy and paste into an Xcode 12 playground. Run the entire playground.

In the playground code the ImageFrame struct is provided first, followed by supporting types. The last line is a call to a test function that prints messages to the console.

// L Triangle sample code by Gary Bartos. Copyright 2021.
// If you need the code, use it! But bring no harm to people, bears, bats, or bees.
//
// The ImageFrame struct is listed first, then Triangle, Transform, and supporting types
// Test functions are provided at the very bottom.
// Run the entire playground and see the console for test output.

import CoreGraphics
import Foundation
import simd


/// A bounded coordinate frame for an image of finite extent. Nominally, X and Y coordinates are positive and are most meaningful within image bounds.
/// The coordinate frame is defined by its origin as the corner of the image rectangle,
/// the directions of the X and Y axes, and by the image size.
/// Vision framework and AVFoundation may use normalized coordinates from 0.0 to 1.0 for X and Y.
/// The rectangle has horizontal and vertical edges aligned with the screen of the iDevice.
/// The function triangleL( ) returns an "L" shape of points--top left,
/// bottom left, bottom right--that can be used to find the affine transform
/// from one coordinate frame to another.
/// +x and +y appear always to be defined so that coordinates are positive within the image.
struct ImageFrame: CustomStringConvertible {
// MARK: - Properties
/// The corner of the image when an iPhone is held in portrait mode, with the home button at bottom.
private (set) var origin: Corner

/// Direction of the x-Axis when an iPhone is held in portrait mode.
/// TODO define change in direction for iPhone or iPad in different device orientations.
private (set) var xAlignment: Alignment

/// Direction of the y-Axis when an iPhone is held in portrait mode.
/// Complement of the x-Axis direction.
var yAlignment: Alignment {
xAlignment.complement
}

/// Directions of the X and Y axes.
/// The directions are determined by the Corner origin and whether X is aligned horizontally or vertically.
var axes: (x: Direction, y: Direction) {
switch origin {
case .BottomLeft:
return xAlignment == .Horizontal ? (.Right, .Up) : (.Up, .Right)
case .BottomRight:
return xAlignment == .Horizontal ? (.Left, .Up) : (.Up, .Left)
case .TopLeft:
return xAlignment == .Horizontal ? (.Right, .Down) : (.Down, .Right)
case .TopRight:
return xAlignment == .Horizontal ? (.Left, .Down) : (.Down, .Left)
}
}

/// The width and height of the image. Normalized size will be (1.0, 1.0).
private (set) var size: simd_float2

var description: String {
"\(origin.rawValue) origin: +X \(axes.x.rawValue), +Y \(axes.y.rawValue) size: \(size.x) x \(size.y)"
}

/// Size 1.0 x 1.0 for normalized coordinate frames
static let normalizedSize = simd_float2(1.0, 1.0)

// MARK: - Triangle L
/// A triangle of points forming an "L" when we think of the iDevice (iPhone or iPad) held in portrait mode.
/// ( top left, bottom left, bottom right ). The home button is at bottom.
///
/// pt 1 .... [top right]
/// | .
/// pt 2 --- pt 3
///
/// (home button)
///
/// The left vertical leg of the triangle is size.y vertically.
/// The bottom horizontal leg of the triangle is size.x horizontally.
/// The unit vectors (1,0) for X and (0,1) for Y are scaled by size.x or size.y to build each point.
/// The axes point across the device such that (x,y) coordinates in ANY frame are positive
/// within the bounds of the device / image.
/// An affine transform can be calculated from a triangle "L" in one coordinate system
/// to a triangle "L" in another coordinate system.
func triangleL() -> Triangle {
let hz = size.x // Horizontal dimension. Here size.x is the first component of simd_float2, which we treat as width.
let vt = size.y // Vertical dimension. Here size.y is the second component of simd_float2, which we treat as height.

let x_hz = hz * simd_float2(1, 0) // x-axis with horizontal dimension
let x_vt = vt * simd_float2(1, 0) // x-axis with vertical dimension

let y_hz = hz * simd_float2(0, 1) // y-axis with horizontal dimension
let y_vt = vt * simd_float2(0, 1) // y-axis with vertical dimension

let zero = simd_float2(0, 0) // the origin of the current frame as coordinates rather than as a Corner

// Define the triangle in point order: top left -> bottom left -> bottom right
switch origin {
case .BottomLeft:
if xAlignment == .Horizontal {
return Triangle(y_vt, zero, x_hz) //+y up -> origin -> +x right
}
else {
return Triangle(x_vt, zero, y_hz) //+x up -> origin -> +y right
}
case .BottomRight:
if xAlignment == .Horizontal {
return Triangle(x_hz + y_vt, x_hz, zero) //+x left and +y up -> +x left -> origin
}
else {
return Triangle(x_vt + y_hz, y_hz, zero) //+x up and +y left -> +y left -> origin
}
case .TopLeft:
if xAlignment == .Horizontal {
return Triangle(zero, y_vt, x_hz + y_vt) //origin -> +y down -> +x right and +y down
}
else {
return Triangle(zero, x_vt, x_vt + y_hz) //origin -> +x down -> +x down and +y right
}
case .TopRight:
if xAlignment == .Horizontal {
return Triangle(x_hz, x_hz + y_vt, y_vt) //+x left -> +x left and +y down -> +y down
}
else {
return Triangle(y_hz, x_vt + y_hz, x_vt) //+y left -> +x down and +y left -> +x down
}
}
}

// MARK: - Initialization
/// Initializes an ImageFrame in terms of its Corner origin,
/// whether the x-Axis is .Horizontal or .Vertical, and the size of the region
/// bounded by the image when the device is held in the portrait orientation.
init(origin: Corner, xAlignment: Alignment, size: simd_float2) {
self.origin = origin
self.xAlignment = xAlignment
self.size = size
}

/// Initializes an ImageFrame given the top left, bottom left, and bottom right points of
/// a triangle congruent to the "L triangle" and aligned with it (at least roughly).
/// The three point arguments may not lie on image edges, and instead may correspond to an L shape
/// somewhere in the image.
/// Size is determined using the three points, and may not correspond to the full image size, in which case
/// the purpose may simply be to determine the Corner and x-axis direction.
init(topLeft: simd_float2, bottomLeft: simd_float2, bottomRight: simd_float2) {
let hz = bottomRight - bottomLeft
let vt = topLeft - bottomLeft
let size = simd_float2(simd_length(hz), simd_length(vt))

self.init(topLeft: topLeft, bottomLeft: bottomLeft, bottomRight: bottomRight, size: size)
}

/// Initializes an ImageFrame given the top left, bottom left, and bottom right points of
/// a triangle congruent to the "L triangle" and aligned with it (at least roughly).
/// The three point arguments may not lie on image edges, and instead may correspond to an L shape
/// somewhere in the image.
init(topLeft: simd_float2, bottomLeft: simd_float2, bottomRight: simd_float2, size: simd_float2) {
// horizontal and vertical vectors from bottom left
// ^
// |
// -->

let lr = simd_normalize(bottomRight - bottomLeft) //unit vector pointing left to right
let bt = simd_normalize(topLeft - bottomLeft) //unit vector pointing bottom to top

// x-direction is horizontal if x value from left to right is greater than x value from bottom to top (for normalized lengths)
let xdir: Alignment = abs(lr.x) > abs(bt.x) ? .Horizontal : .Vertical

// find the corner from which a vector to the center has (+x, +y) components
let center = (topLeft + bottomRight)/2

if (center - topLeft).x > 0 && (center - topLeft).y > 0 {
self.init(origin: .TopLeft, xAlignment: xdir, size: size)
}
else if (center - bottomLeft).x > 0 && (center - bottomLeft).y > 0 {
self.init(origin: .BottomLeft, xAlignment: xdir, size: size)
}
else if (center - bottomRight).x > 0 && (center - bottomRight).y > 0 {
self.init(origin: .BottomRight, xAlignment: xdir, size: size)
}
else {
self.init(origin: .TopRight, xAlignment: xdir, size: size)
}
}

// MARK: - Operators
static func == (lhs: ImageFrame, rhs: ImageFrame) -> Bool {
return lhs.origin == rhs.origin
&& lhs.xAlignment == rhs.xAlignment
&& lhs.size == rhs.size
}

// MARK: - Transforms
/// Affine transform (3x3) from the current frame to another frame.
func transform(to: ImageFrame) -> float3x3? {
return ImageFrame.transform(from: self, to: to)
}

/// Affine transform between frames. We are transforming from the triangle L in one frame to the
/// triangle L in another frame.
static func transform(from: ImageFrame, to: ImageFrame) -> float3x3? {
if from.size.x == 0 || from.size.y == 0 || to.size.x == 0 || to.size.y == 0 {
return nil
}

if from == to { // the == operator already compares size
return float3x3(1)
}

return Transform.affine(from: from.triangleL(), to: to.triangleL())
}
}

/// Three points nominally defining a triangle, but possibly colinear.
struct Triangle: CustomStringConvertible {
var point1: simd_float2
var point2: simd_float2
var point3: simd_float2

/// Dependent on NumberFormatter extension. Mildly convenient.
var description: String {
let f = NumberFormatter()
return f.string(self, descriptionDigits)
}

/// Digits used in description (e.g. if digits = 1, point1 (2,3) will be displayed as "(2.0, 3.0)"
var descriptionDigits = 1

init(_ point1: simd_float2, _ point2: simd_float2, _ point3: simd_float2) {
self.point1 = point1
self.point2 = point2
self.point3 = point3
}

init(_ vector1: simd_float3, _ vector2: simd_float3, _ vector3: simd_float3) {
point1 = vector1.toVector2()
point2 = vector2.toVector2()
point3 = vector3.toVector2()
}

/// Returns a triangle transformed by the 3x3 matrix.
/// newTriangle = m * self
func applying(_ t: float3x3) -> Triangle {
let m = toMatrix()
let p = t * m
return try! Triangle.fromMatrix(p)
}

/// Three points are colinear if their determinant is zero. We assume close to colinear might as well be colinear.
/// ```
/// | x1 x2 x3 |
/// det | y1 y2 y3 | = 0 --> abs( det(M) ) < tolerance
/// | 1 1 1 |
/// ```
/// NOTE: for transforms to and from normalized coordinates with a range (0.0, 1.0) for X and Y,
/// the determinant of a matrix or its inverse can be quite small. For example, the transform
/// from UI coordinates to normalized vision coordinates has a determinant of about 5e-07.
func colinear(tolerance: Float = 1e-20) -> Bool {
let m = toMatrix()
return abs(m.determinant) < tolerance
}

/// Returns a 3x3 matrix with triangle vertices in columns.
/// | p1.x p2.x p3.x |
/// | p1.y p2.y p3.y |
/// | 1 1 1 |
func toMatrix() -> float3x3 {
float3x3(point1.toVector3(), point2.toVector3(), point3.toVector3())
}

/// Returns a Triangle from a 3x3 matrix that presents homogeneous coordinates (xZ, yZ, Z) in columns.
/// Throws a GeometryError.matrixInvalid exception if an element of the final row is zero
/// | p1.x p2.x p3.x |
/// | p1.y p2.y p3.y | : error thrown if p1.z, p2.z, and/or p3.z is zero
/// | p1.z p2.z p3.z |
static func fromMatrix(_ m: float3x3) throws -> Triangle {
let c1 = m.columns.0
let c2 = m.columns.1
let c3 = m.columns.2

if c1.z.isZero || c2.z.isZero || c3.z.isZero {
let f = NumberFormatter()
let s = "At least one element is zero in the last row of a triangle vertex matrix: "
+ "|\(f.string(c1.z, 6)) \(f.string(c2.z, 6)) \(f.string(c3.z, 6))|"
throw GeometryError.matrixInvalid(description: s)
}

let p1 = c1.toVector2()
let p2 = c2.toVector2()
let p3 = c3.toVector2()
return Triangle(p1, p2, p3)
}

/// Generates a random triangle with points in the range (-magnitude, -magnitude) to (+magnitude, +magnitude).
/// Handy for testing affine transform functions.
static func randomTriangle(_ magnitude: Float = 10) -> Triangle {
// The closure uses its own parameter rather than capturing the outer magnitude.
let randomPoint = { (mag: Float) -> simd_float2 in
simd_float2(Float.random(in: -mag...mag), Float.random(in: -mag...mag))
}
return Triangle(randomPoint(magnitude), randomPoint(magnitude), randomPoint(magnitude))
}
}

struct Transform {
/// 3x3 transform to map the 'from' point to the 'to' point
/// | 1 0 (to - from).x |
/// | 0 1 (to - from).y |
/// | 0 0 1 |
static func translation(from: simd_float2, to: simd_float2) -> float3x3 {
let delta = to - from
var f = float3x3(1)
f[2,0] = delta.x
f[2,1] = delta.y
return f
}

/// Finds the affine transform (translation, rotation, scale, ...) from one triangle to another.
/// A triangle is understood to be a set of three non-colinear points.
/// See https://rethunk.medium.com/finding-an-affine-transform-the-traditional-way-with-three-2d-point-correspondences-in-swift-7c602682bfbc
static func affine(from: Triangle, to: Triangle) -> float3x3? {
// nice description of the meaning of determinant being zero:
// https://math.stackexchange.com/questions/355644/what-does-it-mean-to-have-a-determinant-equal-to-zero
// and from that page, a link to a GREAT video about determinants:
// https://www.youtube.com/watch?v=Ip3X9LOh2dk

let fc = from.colinear()
let tc = to.colinear()

// Check the (near-)colinearity condition of both triangles. Name the colinearity explicitly.
if fc || tc {
//TODO throw an error, but returning nil is sufficient for now
let sf = "'From' triangle is \(fc ? "COLINEAR" : "okay")."
let st = "'To' triangle is \(tc ? "COLINEAR" : "okay")."
print("Can not calculate affine transform. \(sf) \(st)")
return nil
}

// following example from https://stackoverflow.com/questions/18844000/transfer-coordinates-from-one-triangle-to-another-triangle
// M * A = B
// M = B * Inv(A)
let A = from.toMatrix()
let invA = A.inverse

if invA.determinant.isNaN {
print("Cannot calculate affine transform. The inverse of the 'From' triangle matrix is not a number; the matrix is likely singular.")
return nil
}

let B = to.toMatrix()
let M = B * invA

return M
}
}

/// The alignment of a coordinate axis: horizontal or vertical.
/// Given that coordinates within the device / image frame are positive,
/// we need only know the Corner and the Alignment of the X-axis to determine
/// which way the X-axis points (left, right, up, or down). Given the Corner and
/// Alignment of the X-axis, we also know the direction of the Y-axis.
enum Alignment: String {
case Horizontal = "Horizontal"
case Vertical = "Vertical"

/// The other direction: if self is Horizontal, then the complement is Vertical.
/// If self is Vertical, then Horizontal.
var complement: Alignment {
switch self {
case .Horizontal:
return .Vertical
case .Vertical:
return .Horizontal
}
}
}

/// The corner of a rectangle with edges aligned horizontally and vertically.
enum Corner: String {
    case BottomLeft = "Bottom Left"
    case BottomRight = "Bottom Right"
    case TopLeft = "Top Left"
    case TopRight = "Top Right"
}

/// One of four directions: Down, Left, Right, Up.
enum Direction: String {
    case Down = "Down"
    case Left = "Left"
    case Right = "Right"
    case Up = "Up"

    var complement: Direction {
        switch self {
        case .Down:
            return .Up
        case .Left:
            return .Right
        case .Right:
            return .Left
        case .Up:
            return .Down
        }
    }
}

/// Minimalist Error subtype for coordinate frame functions that may throw errors.
enum FrameError: Error {
    case axesNotOrthogonal
    case cornerNotDetermined
}

/// An error for various computations using points, matrices, and so on.
enum GeometryError: Error {
    /// Points are unexpectedly or undesirably colinear.
    case colinearPoints(description: String)

    /// Attempted division by zero in some calculation
    case divideByZero(description: String)

    /// The determinant of a matrix is zero, and shouldn't be.
    case matrixDeterminantIsZero(description: String)

    /// Matrix elements do not conform to expectations.
    /// For example, if a 3x3 matrix contains the points of a triangle in
    /// homogeneous coordinates, no element in the final row may be zero.
    /// | x1 x2 x3 |
    /// | y1 y2 y3 |  --> invalid because the final row for point 2 has a zero value
    /// |  1  0  3 |
    case matrixInvalid(description: String)

    /// Roll your own.
    case otherError(error: Error)
}

// Convenience functions for pretty-ish printing. Typically used for string interpolation in calls to print().
extension NumberFormatter {
    func string(_ m: simd_float2, _ digits: Int) -> String {
        "[\(string(m.x, digits)), \(string(m.y, digits))]"
    }

    func string(_ m: simd_float3, _ digits: Int) -> String {
        "[\(string(m.x, digits)), \(string(m.y, digits)), \(string(m.z, digits))]"
    }

    func string(_ m: float3x3, _ digits: Int) -> String {
        //SIMD: column, row (like x,y)

        "\(string(m[0][0], digits)) \(string(m[1][0], digits)) \(string(m[2][0], digits))"
            + "\n\(string(m[0][1], digits)) \(string(m[1][1], digits)) \(string(m[2][1], digits))"
            + "\n\(string(m[0][2], digits)) \(string(m[1][2], digits)) \(string(m[2][2], digits))"
    }

    // Triangle is a CustomStringConvertible, but here you can specify the number of digits after the decimal.
    func string(_ t: Triangle, _ digits: Int) -> String {
        "\(string(t.point1, digits)), \(string(t.point2, digits)), \(string(t.point3, digits))"
    }

    func string(_ value: Float, _ digits: Int, failText: String = "[?]") -> String {
        minimumFractionDigits = max(0, digits)
        maximumFractionDigits = minimumFractionDigits

        guard let s = string(from: NSNumber(value: value)) else {
            return failText
        }

        return s
    }

    func string(_ value: CGFloat, _ digits: Int, failText: String = "[?]") -> String {
        minimumFractionDigits = max(0, digits)
        maximumFractionDigits = minimumFractionDigits

        guard let s = string(from: NSNumber(value: Double(value))) else {
            return failText
        }

        return s
    }

    func string(_ point: CGPoint, _ digits: Int = 1, failText: String = "[?]") -> String {
        let sx = string(point.x, digits, failText: failText)
        let sy = string(point.y, digits, failText: failText)
        return "(\(sx), \(sy))"
    }
}

// Conversions between 2D points and 1x3 homogeneous coordinates.
extension simd_float2 {
    /// Components are ±inf (or NaN for 0 / 0) if v.z == 0
    static func fromVector3(_ v: simd_float3) -> simd_float2 {
        simd_float2(v.x / v.z, v.y / v.z)
    }

    /// Returns (x, y, 1)
    func toVector3() -> simd_float3 {
        simd_float3(self.x, self.y, 1)
    }
}

// Conversions between 1x3 homogeneous coordinates and 2D points.
extension simd_float3 {
    /// Returns (x, y, 1)
    static func fromVector2(_ v: simd_float2) -> simd_float3 {
        simd_float3(v.x, v.y, 1)
    }

    /// Components are ±inf (or NaN for 0 / 0) if self.z == 0
    func toVector2() -> simd_float2 {
        simd_float2(self.x / self.z, self.y / self.z)
    }
}

extension CGAffineTransform {
    /// Generates a 3x3 matrix from the CGAffineTransform
    /// The 3x3 matrix is transposed relative to the CGAffineTransform:
    /// CGAffineTransform:
    /// | a  b  0 |
    /// | c  d  0 |
    /// | tx ty 1 |
    ///
    /// 3x3 matrix
    /// | a c tx |
    /// | b d ty |
    /// | 0 0 1  |
    func toMatrix3x3() -> float3x3 {
        return float3x3(
            SIMD3<Float>(Float(self.a), Float(self.b), Float(0)),
            SIMD3<Float>(Float(self.c), Float(self.d), Float(0)),
            SIMD3<Float>(Float(self.tx), Float(self.ty), Float(1)))
    }

    /// Generates a CGAffineTransform from a 3x3 matrix
    /// The 3x3 matrix is transposed relative to the CGAffineTransform:
    /// CGAffineTransform:
    /// | a  b  0 |
    /// | c  d  0 |
    /// | tx ty 1 |
    ///
    /// 3x3 matrix
    /// | a c tx |
    /// | b d ty |
    /// | 0 0 1  |
    static func fromMatrix3x3(_ m: float3x3) -> CGAffineTransform {
        if !m[0][2].isZero {
            print("Non-affine matrix element [0][2] is non-zero")
        }

        if !m[1][2].isZero {
            print("Non-affine matrix element [1][2] is non-zero")
        }

        return CGAffineTransform(
            a: CGFloat(m[0][0]),
            b: CGFloat(m[0][1]),
            c: CGFloat(m[1][0]),
            d: CGFloat(m[1][1]),
            tx: CGFloat(m[2][0]),
            ty: CGFloat(m[2][1]))
    }
}

// Conversions to/from CGPoint for use with CGImage and SIMD matrix operations.
extension CGPoint {
    /// Applies a 3x3 matrix to the CGPoint.
    /// Converts from CGPoint to 1x3 homogeneous coordinate,
    /// applies the transform, then converts back to CGPoint.
    /// The 3x3 matrix, such as that generated by perspectiveTransform(),
    /// will be transposed relative to CGAffineTransform, which
    /// has translation components tx and ty in the bottom row.
    func applying(_ matrix: float3x3) -> CGPoint {
        let v = self.vector3
        let t = matrix * v
        return CGPoint.fromVector3(t)
    }

    /// A 1x2 vector of the point: (x, y)
    var vector2: simd_float2 {
        simd_float2(Float(self.x), Float(self.y))
    }

    /// A 1x3 vector of the point (x, y, 1)
    var vector3: simd_float3 {
        simd_float3(Float(self.x), Float(self.y), Float(1))
    }

    /// Returns a point (v.x, v.y)
    static func fromVector2(_ v: simd_float2) -> CGPoint {
        CGPoint(x: CGFloat(v.x), y: CGFloat(v.y))
    }

    /// Returns a point (x, y) = (v.x / v.z, v.y / v.z)
    /// Components are ±∞ (or NaN for 0 / 0) if v.z == 0
    static func fromVector3(_ v: simd_float3) -> CGPoint {
        CGPoint(x: CGFloat(v.x / v.z), y: CGFloat(v.y / v.z))
    }

    /// Returns a (CGPoint) -> CGPoint function for a 3x3 transform
    static func converter(_ transform: float3x3) -> (CGPoint) -> CGPoint {
        let function = { (cg: CGPoint) -> CGPoint in
            let p3 = cg.vector3
            let q3 = transform * p3
            return CGPoint.fromVector3(q3)
        }
        return function
    }
}

/* TEST CODE */
/// Find the transform between two frames and print info to the console.
func testTransform(from: ImageFrame, fromName: String, to: ImageFrame, toName: String) {
    print()
    print("**********************************")
    print("from '\(fromName)' [\(from)]")
    print("to '\(toName)' [\(to)]")

    guard let t = ImageFrame.transform(from: from, to: to) else {
        print()
        print("Error: could not find transform between image frames.")
        return
    }

    let n = NumberFormatter()
    print()

    var s = "\(fromName) --> \(toName)"

    if from == to {
        s += " [should be identity matrix -- all 1.0s along diagonal]"
    }

    print("\(s)")
    print("\(n.string(t, 2))")
    print()

    let cg = CGAffineTransform.fromMatrix3x3(t)
    print(cg)

    //convert a point 1/2 the width and 1/3 the height of the "from" frame
    let p = CGPoint(x: CGFloat(from.size.x) / 2.0, y: CGFloat(from.size.y) / 3.0)
    let qt = p.applying(t)
    let qcg = p.applying(cg)

    print()
    print("from \(n.string(p, 2)) -> \(n.string(qt, 2)) using CGPoint.applying(float3x3)")
    print("from \(n.string(p, 2)) -> \(n.string(qcg, 2)) using CGPoint.applying(CGAffineTransform)")
}

func testFrames() {
    let normalizedSize = ImageFrame.normalizedSize
    let uiSize = simd_float2(375, 667)
    let imageSize = simd_float2(1080, 1920)

    let frameOCR = ImageFrame(origin: .BottomLeft, xAlignment: .Horizontal, size: normalizedSize)
    let frameUI = ImageFrame(origin: .TopLeft, xAlignment: .Horizontal, size: uiSize)
    let frameImage = ImageFrame(origin: .TopLeft, xAlignment: .Horizontal, size: imageSize)
    let frameQR = ImageFrame(origin: .TopRight, xAlignment: .Vertical, size: normalizedSize)

    var frames: [(frame: ImageFrame, name: String)] = []
    frames.append((frame: frameOCR, name: "OCR"))
    frames.append((frame: frameUI, name: "UI"))
    frames.append((frame: frameImage, name: "Image"))
    frames.append((frame: frameQR, name: "QR Code"))

    //transform from each frame to every other frame (including itself)
    for f in frames {
        for g in frames {
            testTransform(from: f.frame, fromName: f.name, to: g.frame, toName: g.name)
        }
    }
}

testFrames()

Console Output

Here are the first few transforms printed to the console:

**********************************
from 'OCR' [Bottom Left origin: +X Right, +Y Up size: 1.0 x 1.0]
to 'OCR' [Bottom Left origin: +X Right, +Y Up size: 1.0 x 1.0]

OCR --> OCR [should be identity matrix -- all 1.0s along diagonal]
1.00 0.00 0.00
0.00 1.00 0.00
0.00 0.00 1.00

CGAffineTransform(a: 1.0, b: 0.0, c: 0.0, d: 1.0, tx: 0.0, ty: 0.0)

from (0.50, 0.33) -> (0.50, 0.33) using CGPoint.applying(float3x3)
from (0.50, 0.33) -> (0.50, 0.33) using CGPoint.applying(CGAffineTransform)

**********************************
from 'OCR' [Bottom Left origin: +X Right, +Y Up size: 1.0 x 1.0]
to 'UI' [Top Left origin: +X Right, +Y Down size: 375.0 x 667.0]

OCR --> UI
375.00 0.00 0.00
0.00 -667.00 667.00
0.00 0.00 1.00

CGAffineTransform(a: 375.0, b: 0.0, c: 0.0, d: -667.0, tx: 0.0, ty: 667.0)

from (0.50, 0.33) -> (187.50, 444.67) using CGPoint.applying(float3x3)
from (0.50, 0.33) -> (187.50, 444.67) using CGPoint.applying(CGAffineTransform)

**********************************
from 'OCR' [Bottom Left origin: +X Right, +Y Up size: 1.0 x 1.0]
to 'Image' [Top Left origin: +X Right, +Y Down size: 1080.0 x 1920.0]

OCR --> Image
1080.00 0.00 0.00
0.00 -1920.00 1920.00
0.00 0.00 1.00

CGAffineTransform(a: 1080.0, b: 0.0, c: 0.0, d: -1920.0, tx: 0.0, ty: 1920.0)

from (0.50, 0.33) -> (540.00, 1280.00) using CGPoint.applying(float3x3)
from (0.50, 0.33) -> (540.00, 1280.00) using CGPoint.applying(CGAffineTransform)
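
As a sanity check on the OCR --> UI output above, the sketch below rebuilds that transform by hand, once as a CGAffineTransform and once as a column-major simd matrix, and applies both to the same point. It is a standalone illustration rather than part of the listing: the numbers are copied from the console output, the test point (1/2 the width, 1/4 the height) is chosen arbitrarily, and the names `cg`, `m`, `viaCG`, and `viaMatrix` are made up for this example.

```swift
import CoreGraphics
import simd

// Values copied from the "OCR --> UI" console output above.
let cg = CGAffineTransform(a: 375, b: 0, c: 0, d: -667, tx: 0, ty: 667)

// Same layout as toMatrix3x3(): simd matrices are column-major, so the
// columns are (a, b, 0), (c, d, 0), and (tx, ty, 1).
let m = float3x3(columns: (
    simd_float3(Float(cg.a), Float(cg.b), 0),
    simd_float3(Float(cg.c), Float(cg.d), 0),
    simd_float3(Float(cg.tx), Float(cg.ty), 1)))

// A point at 1/2 the width and 1/4 the height of the normalized OCR frame.
let p = CGPoint(x: 0.5, y: 0.25)

// Core Graphics: x' = a*x + c*y + tx, y' = b*x + d*y + ty
let viaCG = p.applying(cg)

// simd: multiply the matrix by the homogeneous column vector (x, y, 1),
// then divide by the third component to get back to 2D.
let h = m * simd_float3(Float(p.x), Float(p.y), 1)
let viaMatrix = CGPoint(x: CGFloat(h.x / h.z), y: CGFloat(h.y / h.z))

print(viaCG, viaMatrix)
```

Both results should come out as (187.5, 500.25): x is scaled by the UI width, and y is flipped and then shifted down by the UI height, which is what moving the origin from bottom left to top left requires.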

What about Image Orientation and Device Orientation?

We’re skipping discussion of orientations. Once you find a base coordinate frame, you can transform to other orientations. A future post in this series may address orientations.

Automating the L Triangle Process

Earlier in this post I described how to determine the OCR frame definition using a paper target on which the letters A, B, and C are printed. To improve the robustness of recognition, we can print words such as “Ant,” “Bee,” and “Cat” instead of individual letters. In a later post I’ll describe further how to automate the process of finding coordinate frames.

Maybe it’s all Greek or Cyrillic to You: Alternatives to L

My mother tongue is (American) English, so I picked the L shape as the simplest mnemonic. Our choice of three points is arbitrary, so we might as well pick something easy to remember.

If you know the Greek or Cyrillic alphabets, maybe you want to think in terms of gamma or Г (ge/ghe). Why not? You could trace Г from top right to top left to bottom left.

If you read Braille, then you might choose either the Braille symbol ⠓ for “h” (dots 1, 2, 5) or the Braille symbol ⠥ for “u” (dots 1, 3, 6).
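
To illustrate that the choice of three points really is arbitrary, here is a minimal sketch that computes the normalized OCR to UI transform twice: once from the corners of an L and once from the corners of a Г. It assumes the L is traced top left to bottom left to bottom right, uses the 375 x 667 UI size from the test code, and skips the colinearity checks. The helper names `triangleMatrix` and `affineTransform` are hypothetical stand-ins, not the types from the listing.

```swift
import simd

/// Packs three 2D points into a 3x3 matrix, one homogeneous point (x, y, 1) per column.
func triangleMatrix(_ p1: simd_float2, _ p2: simd_float2, _ p3: simd_float2) -> float3x3 {
    float3x3(columns: (
        simd_float3(p1.x, p1.y, 1),
        simd_float3(p2.x, p2.y, 1),
        simd_float3(p3.x, p3.y, 1)))
}

/// M * A = B, so M = B * inverse(A): the same relation used by affine(from:to:).
/// No colinearity check here; the points below are known to be non-colinear.
func affineTransform(from a: float3x3, to b: float3x3) -> float3x3 {
    b * a.inverse
}

// "L": top left -> bottom left -> bottom right.
// OCR frame: origin bottom left, +Y up, size 1 x 1.
// UI frame: origin top left, +Y down, size 375 x 667.
let lOCR = triangleMatrix([0, 1], [0, 0], [1, 0])
let lUI = triangleMatrix([0, 0], [0, 667], [375, 667])

// "Г": top right -> top left -> bottom left, in the same two frames.
let gOCR = triangleMatrix([1, 1], [0, 1], [0, 0])
let gUI = triangleMatrix([375, 0], [0, 0], [0, 667])

let fromL = affineTransform(from: lOCR, to: lUI)
let fromG = affineTransform(from: gOCR, to: gUI)

// Largest element-wise difference between the two transforms.
var maxDifference: Float = 0
for column in 0..<3 {
    for row in 0..<3 {
        maxDifference = max(maxDifference, abs(fromL[column][row] - fromG[column][row]))
    }
}
print("max difference:", maxDifference)
```

The difference should be zero up to floating-point rounding: any three non-colinear corners pin down the same affine transform, so the letter you pick is only a memory aid.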

Gary Bartos

Founder of Echobatix, developing assistive technology for the blind. echobatix@gmail.com