A fundamental technique in AI is the application of neural networks, which are proven by the “universal approximation theorem” to be “universal function approximators”. They can reproduce the behavior of any continuous function within a specified domain, provided they have the appropriate structure and size. In practice, neural networks can also reproduce functions with discontinuities and capture trends in non-deterministic processes by smoothing over issues and learning expected behaviors.
As software engineers, we construct functions by arranging syntax to express our logical intent — e.g. `if n <= 1 { n } else { fn(n-1) + fn(n-2) }`. This naturally leads to functions that are easy for humans to understand and interpret, because (1) they are written in a way that is similar to how we would explain the function to others, (2) we have a full range of high-level language constructs and library calls available to us instead of only a limited set of mathematical operations, and finally (3) related logic is generally refactored so as to be colocated, SOLID, etc.
Unfortunately, the functions modeled by neural networks are not inherently human-interpretable due to the way they are constructed. A neural network is essentially an incomprehensibly large mathematical expression composed of stacked layers. Each layer forwards its outputs to another layer, and is made up of mathematical expressions representing artificial neurons, each computing the weighted sum of its inputs plus a bias (an affine transformation, though sometimes referred to as a “linear map”) followed by a non-linear activation function (e.g. `activation(sum(weights * inputs) + bias)`). There is nothing other than the overall mathematical expression formulated by the neural network, the numeric weights and biases known to produce accurate results alongside it, and the scaffolding used to “train” the expression to perform as expected. This does not lead to natural interpretability, and it represents a significant issue for businesses that rely on being able to provide authentic explanations to clients when outputs or decisions are called into question.
The reason modern neural networks are constructed as giant mathematical expressions isn’t merely computational efficiency; it is substantially because training them requires a process called “backpropagation” (also known as reverse-mode automatic differentiation).
Backpropagation is, in essence, just the chain rule on (multivariate) functions, and the chain rule is just a way to calculate the derivative (read “rate of change”) of a composite function by multiplying the derivatives of the functions within the composition.
If you haven’t used calculus much since school and have forgotten this, an intuitive understanding of the chain rule is that it allows us to understand the relationships between different rates of change that can be observed in nature.
“If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels $2 * 4 = 8$ times as fast as the man.”
— George F. Simmons
As an example, suppose we have a balloon being inflated such that its volume and radius are changing over time. We can use the chain rule to find out how fast the volume is changing by multiplying the rate at which the radius changes over time by the rate at which the volume changes per unit radius.
There are two ways this can be expressed. The first highlights the derivative of a composition of functions (Lagrange’s notation), while the second is the more common form and uses the symbol $∂$ to denote a partial derivative (Leibniz’s notation).
$\begin{equation*} \begin{aligned} \textcolor{red}{V(r(t))'} &= \textcolor{green}{r'(t)} \cdot \textcolor{blue}{V'(r)} \\\\ \textcolor{red}{\frac{∂V}{∂t}} &= \textcolor{green}{\frac{∂r}{∂t}} \cdot \textcolor{blue}{\frac{∂V}{∂r}} \end{aligned} \end{equation*}$The equations above represent the chain rule applied to the balloon example, where $r$ is the radius, $V$ is the volume, and $t$ is time. The left-hand side represents “the rate of change of the volume with respect to time”, while the right-hand side represents the product of “the rate of change of the radius with respect to time” and “the rate of change of the volume with respect to the radius”.
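As a quick numeric sanity check of the chain rule, we can pick an illustrative balloon whose radius grows as $r(t) = 2t$ (so $\frac{∂r}{∂t} = 2$) with volume $V(r) = \frac{4}{3}\pi r^3$ (so $\frac{∂V}{∂r} = 4\pi r^2$), and compare the chain-rule product against a finite-difference estimate of $\frac{∂V}{∂t}$ (the specific functions here are my own assumption, chosen only for the demonstration):

```javascript
// Hypothetical balloon: radius grows linearly with time.
const r = (t) => 2 * t; // ∂r/∂t = 2
const V = (radius) => (4 / 3) * Math.PI * radius ** 3; // ∂V/∂r = 4πr²

// Chain rule: ∂V/∂t = (∂r/∂t) * (∂V/∂r)
function dVdtChainRule(t) {
  const drdt = 2;
  const dVdr = 4 * Math.PI * r(t) ** 2;
  return drdt * dVdr;
}

// Finite-difference approximation of ∂V/∂t for comparison.
function dVdtNumeric(t, h = 1e-6) {
  return (V(r(t + h)) - V(r(t - h))) / (2 * h);
}

console.log(dVdtChainRule(1)); // ≈ 100.53 (i.e. 32π)
console.log(dVdtNumeric(1));   // ≈ 100.53 (matches the chain rule)
```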
If you still feel unclear on the chain rule, check out the article “You Already Know Calculus: Derivatives” (2011), which skillfully uses everyday examples to explain calculus rules, including the chain rule.
Returning to neural networks, a common pattern is to implement a network’s mathematical expressions so that, instead of only computing outputs, they also build a computational graph as each operation/function is applied. This computational graph can later be traversed outwards-in to compute the derivative of each value with respect to the final output value.
To vastly over-simplify, for scalar values it might look like this:
```javascript
// This is a very cut-down example to help give you the gist of the
// high-level architecture of a neural network and what we mean when
// we say "computational graph".
function neuron() {
  // ...`weights` and `bias` available for each neuron...
  return {
    weights,
    bias,
    forwardPass(inputs) {
      // ...`activation`, `add`, `sum`, and `multiply` functions
      // would be available and would allow us to create mathematical
      // expressions that would create a "computational graph".
      //
      // Note: In languages with operator overloading (e.g. Python)
      // you can avoid creating functions for every mathematical
      // operation through the use of operator overloading.
      // This leads to easier to read mathematical expressions,
      // however, the con is that the creation of the computational
      // graph might not be clear to whoever reads the code.
      return activation(add(sum(multiply(weights, inputs)), bias));
    },
  };
}

function layer() {
  // ...`neurons` available for each layer...
  return {
    neurons,
    forwardPass(inputs) {
      return neurons.map((neuron) => neuron.forwardPass(inputs));
    },
  };
}

function model() {
  // ...`layers` would be available...
  return {
    *parameters() {
      for (let layer of layers) {
        for (let neuron of layer.neurons) {
          yield* neuron.weights;
          yield neuron.bias;
        }
      }
    },
    forwardPass(inputs) {
      let outputs = inputs;
      for (let layer of layers) {
        outputs = layer.forwardPass(outputs);
      }
      return outputs;
    },
  };
}

// The computational graph looks like a "parse tree" or "directed acyclic
// graph" (DAG), in that it is a hierarchical data structure representing
// the computations of the function.
//
// e.g.
//
// [
//   {
//     operation: 'relu-activation',
//     output: 5,
//     gradient: 0.0,
//     inputs: [
//       {
//         operation: 'add',
//         output: 5,
//         gradient: 0.0,
//         inputs: [
//           {
//             operation: 'add',
//             output: 3,
//             gradient: 0.0,
//             inputs: [
//               { operation: 'source', output: 1, gradient: 0.0, inputs: [] },
//               { operation: 'source', output: 2, gradient: 0.0, inputs: [] }
//             ]
//           },
//           { operation: 'source', output: 2, gradient: 0.0, inputs: [] }
//         ]
//       }
//     ]
//   },
//   ...
// ]
//
const computationalGraph = model().forwardPass(inputs);
```
Being able to decompose large functions into compositions of many smaller functions is helpful when implementing a neural network: coupled with the chain rule and the ability to calculate the local derivative of each output with respect to its inputs, it allows us to decompose the relative “impact” of each input parameter on the final output. This is incredibly useful, as it means we can determine the impact of each weight and bias on the overall model outputs.
$\textcolor{red}{\frac{∂L}{∂input\_value}} \mathrel{+}= \textcolor{green}{\frac{∂current\_value}{∂input\_value}} \cdot \textcolor{blue}{\frac{∂L}{∂current\_value}}$The equation above represents the chain rule applied to a node within a neural network’s computational graph and shows how we can compute the partial derivative of “the loss function with respect to an input weight or bias” by multiplying the local derivative of the “current weight or bias with respect to its input weight or bias” by the partial derivative of “the loss function with respect to the current weight or bias”. (Note: we’ll discuss loss functions later on — for now substitute the final output wherever you see the loss function $L$ mentioned.)
The $input\_value$ and $current\_value$ will alternate between being weights and biases, transitory computed values, hardcoded values that are part of computations, and, at the edges of the computational graph, its input values and predicted/actual output values. However, from the perspective of training our network, we ultimately care about the updates made to “the loss function with respect to a weight or bias” (the gradient).
In the example above $\textcolor{blue}{\frac{∂L}{∂current\_value}}$ (the gradient) would have been computed as $\textcolor{red}{\frac{∂L}{∂input\_value}}$ by a prior iteration of the backpropagation algorithm and therefore can be substituted with the gradient of the current value.
On the other hand, $\textcolor{green}{\frac{∂current\_value}{∂input\_value}}$ is the local derivative and must be computed based on the type of operation/function and its input values.
A function is differentiable if it has a derivative at every point in its domain; differentiability at a point implies continuity there (but not the other way around).
Basic mathematical operators are trivially differentiable. For example:
Addition: When $current\_value$ was produced by $weighted\_sum + bias$ where $weighted\_sum = sum(weights \times inputs)$, calculating the derivative of $\textcolor{green}{\frac{∂current\_value}{∂input\_value}}$ for each input while holding the other constant:
$\begin{equation*} \begin{aligned} \textcolor{green}{\frac{∂(weighted\_sum + bias)}{∂weighted\_sum}} &= \textcolor{green}{\frac{∂weighted\_sum} {∂weighted\_sum} + \frac{∂bias}{∂weighted\_sum}} \\ &= \textcolor{green}{1 + 0} \\ &= \textcolor{green}{1} \\\\ \textcolor{green}{\frac{∂(weighted\_sum + bias)}{∂bias}} &= \textcolor{green}{\frac{∂weighted\_sum}{∂bias} + \frac{∂bias}{∂bias}} \\ &= \textcolor{green}{0 + 1} \\ &= \textcolor{green}{1} \end{aligned} \end{equation*}$Multiplication: When $current\_value$ was produced by $weight \times input$, calculating the derivative of $\textcolor{green}{\frac{∂current\_value}{∂input\_value}}$ for each input while holding the other constant:
$\begin{equation*} \begin{aligned} \textcolor{green}{\frac{∂(weight \times input)}{∂weight}} &= \textcolor{green}{input \times \frac{∂weight}{∂weight} + weight \times \frac{∂input}{∂weight}} \\ &= \textcolor{green}{input \times 1 + weight \times 0} \\ &= \textcolor{green}{input} \\\\ \textcolor{green}{\frac{∂(weight \times input)}{∂input}} &= \textcolor{green}{input \times \frac{∂weight}{∂input} + weight \times \frac{∂input}{∂input}} \\ &= \textcolor{green}{input \times 0 + weight \times 1} \\ &= \textcolor{green}{weight} \end{aligned} \end{equation*}$Discontinuities in a function make it non-differentiable at those points, but even a continuous function can fail to be differentiable somewhere. For example, the non-linear activation function $ReLU(x) = max(0, x)$ is continuous at $x = 0$ but has a sharp corner there, and therefore is not differentiable at that point; it is differentiable everywhere else ($ReLU'(x) = 0$ for $x < 0$ and $ReLU'(x) = 1$ for $x > 0$). In practice, $x = 0$ is very rare and we can safely set the subderivative to 0 at that point.
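As a small illustration (the function names are mine, not from any library), ReLU and its derivative can be written directly, using the convention that the subderivative at $x = 0$ is taken to be 0:

```javascript
// ReLU activation: max(0, x).
const relu = (x) => Math.max(0, x);

// Derivative of ReLU, with the subderivative at x = 0 set to 0.
const reluDerivative = (x) => (x > 0 ? 1 : 0);

console.log(relu(-2), relu(3));                // 0 3
console.log(reluDerivative(-2), reluDerivative(0), reluDerivative(3)); // 0 0 1
```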
We accumulate (e.g. $\mathrel{+}=$) the result of multiplying these two partial derivatives into $\textcolor{red}{\frac{∂L}{∂input\_value}}$ which means that multiple output values of the network could contribute to the gradient of a single input weight or bias. Only after all functions/operations that an input weight or bias is involved in have been processed will the $\textcolor{red}{\frac{∂L}{∂input\_value}}$ have been computed and be ready for use as a $\textcolor{blue}{\frac{∂L}{∂current\_value}}$ in a future iteration of the backpropagation algorithm. A topological sort may be used to ensure that this is the case.
As long as there is a way to compute or approximate the local derivative of every function/operation, we can use this to help compute the derivative of the loss function with respect to every input weight and bias in the neural network.
A tiny implementation of backpropagation showing how gradients can be computed is given below:
```javascript
function updateInputGradientsForAdd(value) {
  const inputA = value.inputs[0];
  const inputB = value.inputs[1];

  // local derivative:
  // ∂current_value/∂input_a = 1
  //
  // gradient accumulation update rule:
  // (∂L/∂input_a) += (∂current_value/∂input_a) * (∂L/∂current_value)
  inputA.gradient += 1.0 * value.gradient;

  // local derivative:
  // ∂current_value/∂input_b = 1
  //
  // gradient accumulation update rule:
  // (∂L/∂input_b) += (∂current_value/∂input_b) * (∂L/∂current_value)
  inputB.gradient += 1.0 * value.gradient;
}

function updateInputGradientsForMultiply(value) {
  const inputA = value.inputs[0];
  const inputB = value.inputs[1];

  // local derivative:
  // ∂current_value/∂input_a = input_b
  //
  // gradient accumulation update rule:
  // (∂L/∂input_a) += (∂current_value/∂input_a) * (∂L/∂current_value)
  inputA.gradient += inputB.output * value.gradient;

  // local derivative:
  // ∂current_value/∂input_b = input_a
  //
  // gradient accumulation update rule:
  // (∂L/∂input_b) += (∂current_value/∂input_b) * (∂L/∂current_value)
  inputB.gradient += inputA.output * value.gradient;
}

function updateInputGradientsForReluActivation(value) {
  const input = value.inputs[0];

  // local derivative:
  // ∂current_value/∂input = 1.0, if input > 0
  //                       = 0.0, otherwise
  //
  // gradient accumulation update rule:
  // ∂L/∂input += (∂current_value/∂input) * (∂L/∂current_value)
  input.gradient += (value.output > 0.0 ? 1.0 : 0.0) * value.gradient;
}

function sortTopologically(
  value,
  visited = new Set(),
  topologicallySortedValues = []
) {
  if (!visited.has(value)) {
    visited.add(value);
    for (const input of value.inputs) {
      sortTopologically(input, visited, topologicallySortedValues);
    }
    topologicallySortedValues.push(value);
  }
  return topologicallySortedValues;
}

function backpropagation(rootValue) {
  // Perform a topological sort of all of the `inputs` values in the graph
  // and then reverse this so that the output values are before their
  // respective input values.
  const topologicallySortedValues = sortTopologically(rootValue).reverse();

  // The derivative of a value with respect to itself is always 1.0 so we
  // set the gradient of the output value to this to begin with before
  // beginning backwards propagation.
  topologicallySortedValues[0].gradient = 1.0;

  // Given the reversed topologically ordered values, we will be starting
  // at the output value and applying the chain rule on each iteration to
  // update the gradients of the current value's inputs.
  for (const value of topologicallySortedValues) {
    switch (value.operation) {
      case "multiply": {
        updateInputGradientsForMultiply(value);
        break;
      }
      case "add": {
        updateInputGradientsForAdd(value);
        break;
      }
      case "relu-activation": {
        updateInputGradientsForReluActivation(value);
        break;
      }
      case "source": {
        // Source values have no inputs, so there is nothing to update.
        break;
      }
      default:
        throw new Error(`Unrecognized operation: ${value.operation}`);
    }
  }
}

backpropagation(computationalGraph);
```
For further discussion of computational graphs and the efficiency benefits of computing derivatives on them, I can’t recommend “Calculus on Computational Graphs: Backpropagation” (2015) highly enough. It’s a very easy-to-understand guide to computing derivatives that is detailed as well as economical with your time.
It’s not enough to merely have a function that can be used to “predict” values by repeatedly computing weighted sums of inputs, adding biases and passing their results through activation functions. Even if we had a way to compute derivatives of these outputs with respect to their weights and biases, it would still tell us nothing about how to improve the performance of the network. What we need is a way to measure how well the network is performing and a method of using this information to update weights and biases.
That is where the “loss” function comes in. The loss function (sometimes known as a cost function or error function) is a function that compares the predicted value produced by the model with the actual value that we want the model to produce. It provides both a performance metric and an optimization objective, with the goal of minimizing the loss function during training to improve the network’s performance.
$\begin{array}{c} \mathcal{L}(\text{predicted}, \text{actual}) = \frac{1}{n} \sum_{i=1}^{n} (\text{predicted}_i - \text{actual}_i)^2 \\ \\ \text{Mean squared error (MSE) loss function} \end{array}$The lower the loss, the closer the model’s predictions are to the desired outputs and the better it performs; the higher the loss, the worse the model performs.
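Stripped of any computational-graph machinery, the MSE formula above can be sketched over plain numbers (the function name is mine, chosen for the illustration):

```javascript
// Mean squared error over plain numbers (no computational graph).
function meanSquaredError(predicted, actual) {
  const n = actual.length;
  let total = 0;
  for (let i = 0; i < n; i++) {
    total += (predicted[i] - actual[i]) ** 2;
  }
  return total / n;
}

console.log(meanSquaredError([1, 2, 3], [1, 2, 3])); // 0 (perfect predictions)
console.log(meanSquaredError([1, 2, 3], [1, 2, 5])); // ≈ 1.333
```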
Once your neural network’s huge mathematical expression is producing a loss value as its output, backpropagation can be used to compute the derivative of the loss function with respect to each weight or bias in the network — known as its gradient. It’s important to note that the gradient of a weight or bias is not the same as the weight or bias itself. The gradient is the name given to the derivative (“rate of change”) of the loss function with respect to the weight or bias and represents the impact of a small change in the weight or bias on the loss function.
This gradient can then be used in a process called “gradient descent” to update the weight or bias in a way that reduces the total loss of the network — e.g. if the gradient of a weight is positive, then the weight should be decreased, while if the gradient of a weight is negative, then the weight should be increased; similarly, if the gradient is large, then the weight should be updated by a large amount, while if the gradient is small, then the weight should be updated by a small amount.
The process described above is repeated for each “epoch” (iteration) of the training loop, and the magnitude of these updates to the weights and biases is also controlled by a “learning rate”. Both the learning rate and the number of epochs are hyperparameters that can be tuned to improve the performance of the network, alongside other aspects such as the number of layers, the number of neurons in each layer, and the activation function used in each layer, amongst other things.
A very basic training loop might look a bit like this:
```javascript
function zip(as, bs) {
  return as.map((a, i) => [a, bs[i]]);
}

function mse(predictions, actuals) {
  // ...`multiply`, `divide`, `sum`, `power` and `subtract` functions
  // would be available and would allow us to create mathematical expressions
  // that produce a "computational graph".
  return multiply(
    divide(1, actuals.length),
    sum(
      zip(predictions, actuals).map(([predicted, actual]) =>
        power(subtract(predicted, actual), 2)
      )
    )
  );
}

function loss(xData, yData, model) {
  const yPredictions = xData.map((x) => model.forwardPass(x));

  // Passing `yPredictions` into `mse` extends the computation graph of
  // the `model` so that it also contains the computation of "loss".
  // This is possible because neural networks are composable and allows
  // us to start at the loss output and backpropagate all the way back
  // through the model's weights and biases to the inputs.
  //
  // Note that the connection between the result of the `loss` function
  // and the `model` is one-way; the `model` is not connected to the
  // `loss` function and `loss` does not participate in the computation
  // when `forwardPass` is called on the `model`.
  return mse(yPredictions, yData);
}

const epochs = 1000;
const learningRate = 0.01;

for (let epoch = 0; epoch < epochs; epoch++) {
  // ...`xTrainingData`, `yTrainingData` and `model` would be available.
  const totalLoss = loss(xTrainingData, yTrainingData, model);

  // Zero out all gradients before backpropagation to avoid
  // accumulating gradients from previous iterations, which
  // would result in erratic parameter updates.
  for (const parameter of model.parameters()) {
    parameter.gradient = 0;
  }

  backpropagation(totalLoss);

  // As we wish to minimize the loss, we move the parameters in
  // the opposite direction of the gradient. If the gradient is
  // positive, then the parameter is adjusted in the negative
  // direction, and if the gradient is negative, the parameter
  // is adjusted in the positive direction.
  for (const parameter of model.parameters()) {
    parameter.data -= learningRate * parameter.gradient;
  }

  if (epoch % 10 === 0) {
    console.log(`Epoch: ${epoch} / Loss: ${totalLoss.data}`);
  }
}
```
In our pursuit of computers that can autonomously figure out how to achieve desired outcomes, we discovered that neural networks are “universal function approximators” capable of approximating arbitrary functions.
To achieve this, we stack layers (parameterized affine transformations followed by non-linear activation functions) and train our model by repeatedly adjusting its parameters until it behaves like the function we’d like to approximate.
As layers are stacked, the overall function becomes increasingly non-linear, allowing the model to represent more complex functions. However, this comes with a bit of a devil’s bargain, as the mathematics behind neural networks constrain our ability to represent functions in ways that are naturally interpretable:
All operations/functions used by a neural network to produce its output must be differentiable and composable.
Any logic or understanding learnt will be generically represented by the network’s weights and biases (its parameters) and, because these participate in the calculation of derivatives, they must be real-valued numbers.
Mathematical expressions and parameter initializations must be carefully designed to avoid issues such as “symmetry” (neurons that produce the same outputs), “dead neurons” (neurons that always output zero), “exploding gradients” (gradients that grow exponentially in magnitude), and “vanishing gradients” (gradients that shrink exponentially in magnitude), as well as other numerical stability issues.
The price of magic was interpretability — whether we can get this back is a question for another day.
If you want to learn further about neural networks, I highly recommend following along with “The spelled-out intro to neural networks and backpropagation: building micrograd” and implementing your own version of micrograd in a language of your choosing (see my heavily-commented Rust version here).
The belief that algorithmic tech interviews do not test for real-world programming skills is a memetic hazard. There might be truth to it but it serves as displacement from what is often the real underlying reason for resistance: the fear of negative judgement.
Without passing judgement on whether these tests are a good way to assess skill level, it’s crucial to realise that dismissing them has drawbacks. These problems often contain different ways of thinking and require novel techniques or hard-won algorithms, and their deliberate practice can be a valuable way to improve your problem-solving skills. While uncomfortable, acknowledging this offers us an opportunity to grow.
Mastering a few fundamental problems can make you more effective at a wide range of challenges, but how do you go about truly understanding these ‘eigenproblems’?
It’s commonly said that if you ever need an algorithm you’ll “just Google it”; however, this belief is quite suspect, as you’re unlikely to reach for Google without awareness of the techniques that might be applicable. Having a foundational understanding can help you to recognise opportunities for application in the first place.
The creator of the “Blind 75” problem set, an engineer at Meta, distilled the most useful questions for learning the core concepts and techniques of each category of problem, drawing on their experience of doing 400–500 LeetCode questions during their last job hunt.
75 questions is still a very large time commitment, and the reality is that it’s even more time-consuming than it might look, as fluency generally requires spaced repetition of the problems: doing a variety of topics each week and revisiting them in subsequent weeks.
We want the ability to identify and apply the right techniques when given an unknown problem, and this is about grasping the intent, design, and decision-making involved. This ability can certainly be picked up incidentally by doing lots of problems and eventually beginning to pattern-match them, but I’d argue that this isn’t the most efficient way to learn, as a lot of the knowledge doesn’t appear to be naturally tacit.
Current methods of teaching often focus on mere implementation, overlooking the decision-making skills needed to identify the right techniques for a problem. This article aims to fill that gap, by articulating what is often left unsaid. While it won’t replace hands-on experience in the nuances of implementation, it should be helpful in improving your understanding of technique ‘applicability’.
Note
We’re intentionally not discussing the implementation details of these techniques, as this is already well-covered elsewhere. For deep explanations, video tutorials and code examples, we recommend:
This article is a work in progress and I’ll be adding more to it over time. Please email me with any insights or suggestions you have for improvement.
Is the problem related to a linear/sequential data structure (e.g. array, linked list, string, search space)?
Are we being asked to find pairs, triplets or sub-arrays that match a constraint?
To solve this optimally (e.g. $O(n)$), we need to be able to easily compare elements in the sequence without using nested loops. There are two main options:
Ensuring the sequence is sorted and then applying the Two Pointers technique:
We can avoid the need for a nested loop if the sequence is sorted, as we can operate upon the sequence using two explicit pointers, one at the start and one at the end. The pointers clarify which elements are being considered for the condition being evaluated, and on each iteration we choose how to move these pointers based on the result of a comparison.
This is particularly useful when a deterministic order matters or when we are looking for unique pairs/triplets.
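A minimal sketch of the two-pointer technique on a sorted array — here, finding the indices of a pair summing to a target (a hypothetical helper, not taken from any specific problem set):

```javascript
// Find indices of two values in a sorted array that sum to `target`,
// or null if no such pair exists. O(n) time, O(1) space.
function twoSumSorted(sorted, target) {
  let left = 0;
  let right = sorted.length - 1;
  while (left < right) {
    const sum = sorted[left] + sorted[right];
    if (sum === target) return [left, right];
    if (sum < target) left++; // need a larger sum: advance the left pointer
    else right--;             // need a smaller sum: retreat the right pointer
  }
  return null;
}

console.log(twoSumSorted([1, 2, 4, 7, 11], 9));  // [1, 3] (2 + 7)
console.log(twoSumSorted([1, 2, 4, 7, 11], 10)); // null
```

The sortedness is what makes each pointer movement safe: a too-small sum can only be fixed by moving left rightwards, and a too-large sum by moving right leftwards.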
Using a Hash Map:
Are we being asked to find a longest/shortest substring or subsequence that matches a particular constraint or to compute aggregate statistics for a particular length subsequence?
To solve this optimally, we want to be able to re-use the computations from previous windows so that we can update our answer for each window in constant time (e.g. $O(1)$) and the whole sequence can be completed in linear time (e.g. $O(n)$).
We can do this by applying the Sliding Window technique, either with a fixed window size or with a dynamic window size that can expand, shrink or reset based on certain conditions. Please note that these are not implemented in the same way as two-pointer problems, with explicit pointers at the start and end of the window; instead, they generally use a single loop iterator and some additional variables to keep track of the window’s characteristics. This is a subtle but important distinction from the two-pointer technique.
See also: the Rabin-Karp string search algorithm which employs a specialised form of the sliding window technique that utilises rolling hashes to achieve efficient substring matching.
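A fixed-size sliding window sketch — finding the maximum sum of any window of length $k$ by reusing the previous window’s sum rather than recomputing it (an assumed example of my own):

```javascript
// Maximum sum over all contiguous windows of length k. O(n) time.
function maxWindowSum(values, k) {
  if (values.length < k) return null;

  // Sum of the first window.
  let windowSum = 0;
  for (let i = 0; i < k; i++) windowSum += values[i];

  let best = windowSum;
  for (let i = k; i < values.length; i++) {
    // Slide the window: add the entering element, drop the leaving one.
    windowSum += values[i] - values[i - k];
    best = Math.max(best, windowSum);
  }
  return best;
}

console.log(maxWindowSum([2, 1, 5, 1, 3, 2], 3)); // 9 (5 + 1 + 3)
```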
Are we being asked to find a target value or minimum/maximum value in this sequence?
If the sequence is unsorted, there is no way to do better than a linear scan of the sequence (e.g. $O(n)$).
But, if the sequence is sorted we can apply the Binary Search technique.
In fact, we can apply the Binary Search technique if the sequence is partially sorted as long as it is partitionable (e.g. we can discern a sorted and unsorted partition on each iteration). For example, we can apply binary search to rotated sorted arrays or bitonic sequences.
We can also apply the Binary Search technique to compute numerical approximations if a sequence is ordered but virtual (e.g. a search space). For example, to approximate the square root of a number.
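As a sketch of binary search over a virtual, ordered search space, here is a square-root approximation to a given precision (the function name and epsilon are my own choices):

```javascript
// Approximate sqrt(n) for n >= 1 by binary-searching the range [0, n].
function approximateSqrt(n, epsilon = 1e-9) {
  let low = 0;
  let high = n;
  while (high - low > epsilon) {
    const mid = (low + high) / 2;
    if (mid * mid < n) low = mid; // answer lies in the upper half
    else high = mid;              // answer lies in the lower half
  }
  return (low + high) / 2;
}

console.log(approximateSqrt(2)); // ≈ 1.41421356
```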
Are we being asked to detect cycles in a linked list or to find the middle element?
Use the Fast-And-Slow Pointers technique. If the pointers meet, there is a cycle.
If we need to find the middle element, we can rely on the fact that the slow pointer will be at the middle element when the fast pointer reaches the end of the list.
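Both uses can be sketched on a singly linked list (the `{ value, next }` node shape is assumed for illustration):

```javascript
// Detect a cycle in a linked list (Floyd's algorithm).
function hasCycle(head) {
  let slow = head;
  let fast = head;
  while (fast && fast.next) {
    slow = slow.next;      // advances one step
    fast = fast.next.next; // advances two steps
    if (slow === fast) return true; // pointers met: there is a cycle
  }
  return false;
}

// Find the middle node: slow is at the middle when fast reaches the end.
function middleNode(head) {
  let slow = head;
  let fast = head;
  while (fast && fast.next) {
    slow = slow.next;
    fast = fast.next.next;
  }
  return slow;
}

// Build the list 1 -> 2 -> 3 -> 4 -> 5.
const nodes = [1, 2, 3, 4, 5].map((value) => ({ value, next: null }));
nodes.forEach((node, i) => (node.next = nodes[i + 1] ?? null));

console.log(hasCycle(nodes[0]));         // false
console.log(middleNode(nodes[0]).value); // 3

nodes[4].next = nodes[1]; // introduce a cycle: 5 -> 2
console.log(hasCycle(nodes[0]));         // true
```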
Are we being asked to find duplicates in a sequence?
If the sequence is sorted, then duplicate items will be next to each other and we can find them by comparing adjacent elements in the sequence.
If the sequence is unsorted, we can add elements that we see into a Set or Hash Map and then check whether the current element is within this.
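The unsorted case can be sketched with a Set in a few lines (an illustrative helper of my own):

```javascript
// Return true if any value appears more than once. O(n) time and space.
function containsDuplicate(values) {
  const seen = new Set();
  for (const value of values) {
    if (seen.has(value)) return true; // already seen: duplicate found
    seen.add(value);
  }
  return false;
}

console.log(containsDuplicate([3, 1, 4, 1, 5])); // true
console.log(containsDuplicate([3, 1, 4, 2, 5])); // false
```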
Does the problem require frequent and ordered access to recently processed elements?
In situations in which you need access to elements in some dynamic order, you cannot just use a `for`-loop to iterate through the elements in a fixed order; you will likely need to append elements into a dynamically ordered sequence maintained within a Stack or Queue.
If we need to access the most recently processed elements, we can use a Stack. This has $O(1)$ access at the end to which elements are appended.
Note that ‘stack’ is a bit of an overloaded term, as it refers to any last-in, first-out (LIFO) structure. Stacks are often implemented with arrays (or linked lists), but this isn’t necessarily the case. For example, when executing programs we keep a stack of frames within a block of contiguous memory.
Not all problems involving a ‘stack’ are similar. For example, in some you only ever need to access, check or pop the most recently appended element, while in others you need to repeatedly traverse backwards through the stack while some condition holds true. The latter approach often uses a nested `while`-loop and can allow you to gather information or maintain state using previous elements in the stack, or to maintain some invariant in the stack that can be depended upon by future iterations or by a second phase of the algorithm.
I think it can also be helpful to think of how other data structures can have stack-like access patterns. For example, when given a decimal integer we can ‘peek’ at its last digit using `% 10` and then ‘pop’ it using integer division by 10 (e.g. Python’s `// 10`).
If we need to access the least recently processed elements we can use a Queue. This has $O(1)$ access at its start, at the opposite end to where elements are appended.
Are we being asked to merge sorted lists?
Does the problem involve categorising or identifying unique elements based on some aspect of them not immediately known?
Does the problem require you to perform efficient lookups of data?
Do we need to quickly store or lookup strings or provide autocomplete functionality?
Are we being asked to query the smallest/largest/median values? Do you need to be able to read out a sorted sequence of elements?
Use a Min-Heap or Max-Heap data structure to query for the smallest/largest values (in $O(1)$).
You can use a combination of both Min-Heap and Max-Heap if you want to compute the median.
There are time/space costs to maintaining heaps (creating a heap costs $O(n)$, but updates are $O(\log n)$ due to the cost to `heapify`), so this makes more sense when you expect to make multiple queries over the lifetime of your program. The concept of amortised cost is also important when comparing min-heaps to algorithms like quickselect. While min-heaps offer a consistent update time of $O(\log n)$, quickselect’s time complexity may be more efficient for one-off or occasional operations.
Heaps do not maintain sorted order of their elements, and instead only maintain a partial ordering. Because of this, they do not offer $O(1)$ access to the $k^{\text{\tiny th}}$ smallest/largest element or allow you to read out a sorted sequence of elements. If you need to be able to do this, while still allowing for efficient updates, you can use a Binary Search Tree (BST) data structure, possibly augmented with additional information to make it self-balancing (e.g. AVL Tree or Red-Black Tree).
Does the problem require you to recover the order of some elements or substrings within a sequence?
If you are able to infer partial information about how particular elements are connected, for example, if they come before or after each other or if they overlap (or touch), you can use this information to help build a Directed Graph. Vertices can represent elements, while edges can represent the information inferred from the data. Once this is achieved, it’s often possible to recover the order by running something like a Topological Sort.
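Order recovery from pairwise ‘before’ constraints can be sketched with Kahn’s topological sort (the edge-list shape and function name here are illustrative assumptions):

```javascript
// Recover an ordering of `items` consistent with `before` constraints,
// where each [a, b] edge means "a comes before b". Returns null if the
// constraints contain a cycle. (Kahn's algorithm.)
function recoverOrder(items, before) {
  const inDegree = new Map(items.map((item) => [item, 0]));
  const next = new Map(items.map((item) => [item, []]));

  for (const [a, b] of before) {
    next.get(a).push(b);
    inDegree.set(b, inDegree.get(b) + 1);
  }

  // Start from items with no incoming constraints.
  const queue = items.filter((item) => inDegree.get(item) === 0);
  const order = [];

  while (queue.length > 0) {
    const item = queue.shift();
    order.push(item);
    for (const successor of next.get(item)) {
      inDegree.set(successor, inDegree.get(successor) - 1);
      if (inDegree.get(successor) === 0) queue.push(successor);
    }
  }

  return order.length === items.length ? order : null; // null => cycle
}

console.log(recoverOrder(["a", "b", "c"], [["a", "b"], ["b", "c"]])); // ['a', 'b', 'c']
console.log(recoverOrder(["a", "b"], [["a", "b"], ["b", "a"]]));      // null
```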
In general, graphs seem a good fit for problems involving recovering the order of a sequence, for example, “de Bruijn” graphs are used for the re-assembly of de-novo DNA sequences from k-mers in computational biology.
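The order-recovery approach above can be sketched as a topological sort using Kahn's algorithm; the vertex and edge data here are illustrative, standing in for whatever "comes before" facts you inferred from the problem.

```typescript
// Recover an ordering from pairwise "A comes before B" facts via a
// topological sort (Kahn's algorithm).
function topologicalSort(vertices: string[], edges: [string, string][]): string[] {
  const adjacency = new Map(vertices.map((v) => [v, []] as [string, string[]]));
  const inDegree = new Map(vertices.map((v) => [v, 0] as [string, number]));
  for (const [from, to] of edges) {
    adjacency.get(from)!.push(to);
    inDegree.set(to, inDegree.get(to)! + 1);
  }
  // Start from vertices with no incoming edges (nothing comes before them).
  const queue = vertices.filter((v) => inDegree.get(v) === 0);
  const order: string[] = [];
  while (queue.length > 0) {
    const vertex = queue.shift()!;
    order.push(vertex);
    for (const next of adjacency.get(vertex)!) {
      inDegree.set(next, inDegree.get(next)! - 1);
      if (inDegree.get(next) === 0) queue.push(next);
    }
  }
  // If a cycle exists, not every vertex can be ordered.
  if (order.length !== vertices.length) throw new Error("Cycle detected");
  return order;
}
```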
Are you dealing with data points that are interconnected in some way, needing to generate and test paths or sequences, or wishing to traverse elements in a matrix?
Note
- Problems like this can often be solved using graph or tree data structures, but it’s important to note that these are not the only ways of representing data that is interconnected. For example, rotating a tree 45-degrees gives you a grid, and you can use an adjacency matrix to represent a graph, or an array to represent a tree (as is done when implementing heaps). The choice of data structure is not only a trade-off between memory usage and performance, but can also lead to different ways of thinking about a problem.
- There are also similarities between iteration of sequences and traversals in graphs, trees or state-spaces. Traversal is essentially an extended form of iteration, designed to handle dynamic, non-linear, branching data structures. Different traversal strategies, such as BFS (level-order) and DFS (pre-order, in-order, post-order), offer specific advantages tailored to the nature of the problem at hand.
- An important insight is the fluidity and overlapping between techniques like trees, dynamic programming and path-finding. This allows for multiple perspectives on a problem and can occasionally offer more elegant or optimal solutions. For example, in Tristan Hume’s “Designing a Tree Diff Algorithm Using Dynamic Programming and A*”, the problem of diffing begins as a decision tree before being transformed into a 2D grid that is amenable to dynamic programming style techniques and eventually the path-finding algorithm A*.
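The point about representations above can be made concrete: here is a binary tree stored in a flat array (as is done when implementing heaps), where the children of the node at index $i$ live at $2i + 1$ and $2i + 2$. The `tree` data and helper names are illustrative.

```typescript
// A complete binary tree stored level-by-level in a flat array:
// root 1, its children 2 and 3, their children 4, 5, 6, 7.
const tree = [1, 2, 3, 4, 5, 6, 7];

// The children of the node at index i live at indices 2i + 1 and 2i + 2.
function children(i: number): number[] {
  return [2 * i + 1, 2 * i + 2]
    .filter((j) => j < tree.length)
    .map((j) => tree[j]);
}

// The parent of the node at index i lives at index floor((i - 1) / 2).
function parent(i: number): number | null {
  return i === 0 ? null : tree[(i - 1) >> 1];
}
```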
Have you been asked to find a combination or permutation of elements that match a constraint?
When you have a particular target or constraint that needs to be met, it generally suggests that you will need to go deeply into the search space to find and test each path, which points towards a Depth-First Search (DFS) approach. However, this will likely result in an explosion of paths that need to be explored, with very slow performance and poor memory usage. To avoid this, you will need to prune paths early on that cannot possibly lead to a solution and, due to the potential depth of the search space, use a Backtracking approach to (a) invalidate any paths later shown to be impossible, and (b) roll back any modifications to shared state that are made during the search.
Another way of looking at things is that DFS doesn’t need to traverse a materialised tree or graph. It can be used to explore a more implicit solution space such as a “decision tree” in which each choice appends an item to a stack. For example, it would be possible to use something like a regular expression (e.g. `(ab[cd]e?)*`) as a generative structure that creates a DFS-like search space generating strings that match that expression.
Depth-First Search (DFS) has a lot of overlap with Top-Down Memoisation. In fact, top-down memoisation is depth-first search but with a caching optimisation to avoid re-computing subproblems (e.g. “overlapping subproblems”). Top-down memoisation is generally assumed to be implemented using recursion, but, it can also be implemented iteratively using a Stack.
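A minimal sketch of top-down memoisation as cached DFS, using the classic Fibonacci recurrence (threading the cache through as a default parameter is just one way to carry the memo):

```typescript
// Top-down memoisation: a naive recursive Fibonacci recomputes the same
// overlapping subproblems exponentially many times; caching each result
// makes every subproblem cost O(1) after its first visit, for O(n) overall.
function fibonacci(n: number, cache = new Map<number, number>()): number {
  if (n <= 1) return n;
  if (cache.has(n)) return cache.get(n)!;
  const result = fibonacci(n - 1, cache) + fibonacci(n - 2, cache);
  cache.set(n, result);
  return result;
}
```

Without the cache, `fibonacci(50)` would take hours; with it, the call returns immediately.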
The choice over whether to implement Depth-First Search (DFS) using recursion or with iteration and a stack isn’t merely an aesthetic choice but also has practical implications:
Firstly, recursion presents you with limits due to the maximum recursion depth of your chosen programming language.
But more importantly, recursion provides a natural mechanism for backtracking as you can simply return from a recursive call to ‘undo’ the last step. On the other hand, iteration with an explicit stack requires you to manually implement backtracking by either copying data structures into each stack item (so that dropping stack items is effectively backtracking) or by using shared state and carefully rolling back changes to this.
More subjectively, some people prefer recursion for its more intuitive mathematical representation of a problem, while others prefer iteration for its more explicit and easier to debug/trace states.
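Here is a sketch of the iterative, explicit-stack style of DFS described above, using the "copy data into each stack item" strategy: each stack item holds a full path, so discarding an item is backtracking for free, at the cost of extra memory. The graph shape and function name are illustrative.

```typescript
// Iterative DFS over an adjacency-list graph, collecting every simple path
// from `start` to `goal`. Because each stack item carries its own path copy,
// popping an item implicitly backtracks with no shared state to roll back.
function dfsPaths(graph: Map<string, string[]>, start: string, goal: string): string[][] {
  const paths: string[][] = [];
  const stack: string[][] = [[start]];
  while (stack.length > 0) {
    const path = stack.pop()!;
    const node = path[path.length - 1];
    if (node === goal) {
      paths.push(path);
      continue;
    }
    for (const next of graph.get(node) ?? []) {
      // Skip nodes already on this path to avoid cycling.
      if (!path.includes(next)) stack.push([...path, next]);
    }
  }
  return paths;
}
```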
Do you need to fill or quantify contiguous regions in a matrix?
You can do so using a Flood-Fill algorithm, which can be either a Breadth-First Search (BFS) or Depth-First Search (DFS) traversal from a starting point with the constraint that you can only move to adjacent cells that are the same ‘color’ as the starting point.
You can avoid accidentally revisiting cells by placing them into a Set or Hash Map as you visit them, however, in order to save memory you can also mutate the matrix in-place by marking visited cells with a different ‘color’ (and then, if you wish to, reverting these changes once you have finished).
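The in-place variant described above can be sketched as follows; the function name and the 4-directional adjacency are assumptions for the example.

```typescript
// Flood fill via iterative DFS with an explicit stack, mutating the matrix
// in place: visited cells are marked by recolouring them, so no separate
// Set or Hash Map is needed to track visits.
function floodFill(grid: number[][], row: number, col: number, colour: number): number[][] {
  const target = grid[row][col];
  if (target === colour) return grid; // nothing to do, and avoids an infinite loop
  const stack: [number, number][] = [[row, col]];
  while (stack.length > 0) {
    const [y, x] = stack.pop()!;
    if (y < 0 || y >= grid.length || x < 0 || x >= grid[y].length) continue;
    if (grid[y][x] !== target) continue; // wrong colour or already recoloured
    grid[y][x] = colour;
    stack.push([y + 1, x], [y - 1, x], [y, x + 1], [y, x - 1]);
  }
  return grid;
}
```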
Is searching for the shortest path between some nodes or doing a ‘level-order’ traversal your main concern?
Use Breadth-First Search (BFS).
Note that BFS uses more memory than DFS because it needs to keep all of the nodes at the current level in memory, while DFS only needs to store the nodes along a single path in the tree/graph as it traverses. Therefore, if memory usage is a concern, DFS could be the better choice.
BFS traversals use a FIFO queue to traverse nodes and expect the edges to have uniform weights/costs. For graphs with non-uniform weights/costs, we swap the FIFO queue for a priority queue (e.g. a min-heap) to enable cost-aware traversals such as Dijkstra's or A*. Dijkstra's prioritises nodes based on cumulative cost from the start, finding shortest paths to all nodes, while A* also adds an (admissible and consistent) heuristic estimate towards the goal, making it goal-oriented and often more efficient for finding a specific target.
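For the uniform-weight case, here is a BFS shortest-path sketch: the FIFO queue guarantees nodes are dequeued in order of distance from the start, so the first path to reach the goal is a shortest one. (Swapping the queue for a min-heap keyed on cumulative cost would be the starting point for Dijkstra's.) Graph data and names are illustrative.

```typescript
// BFS shortest path on an unweighted adjacency-list graph.
function shortestPath(graph: Map<string, string[]>, start: string, goal: string): string[] | null {
  const queue: string[][] = [[start]];
  const visited = new Set<string>([start]);
  while (queue.length > 0) {
    const path = queue.shift()!; // FIFO: closest frontier first
    const node = path[path.length - 1];
    if (node === goal) return path;
    for (const next of graph.get(node) ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push([...path, next]);
      }
    }
  }
  return null; // goal unreachable from start
}
```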
Are you being asked to determine whether nodes are connected, to count the number of components in a graph, or to detect cycles?
In an undirected graph? Use Union-Find. Note that Union-Find is not suitable for directed graphs because it inherently deals with undirected equivalence relations; it merges sets and loses the sense of direction between nodes.
In a directed or undirected graph? Use Depth-First Search (DFS). Unlike Union-Find, DFS is suitable for directed graphs and can also provide information about the shape or structure of components.
Being able to detect whether two nodes are connected can be a useful general optimisation technique when path-finding (e.g. BFS, DFS, Dijkstra or A*) as it can allow you to “early exit” before beginning an invalid traversal.
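A minimal Union-Find sketch with path compression and union by size (the class shape is illustrative, not a standard library API):

```typescript
// Union-Find (disjoint set): `find` returns a component representative, and
// two nodes are connected iff they share one. Path compression plus union by
// size keeps operations near-constant (inverse-Ackermann) amortised time.
class UnionFind {
  private parent: number[];
  private size: number[];
  constructor(n: number) {
    this.parent = Array.from({ length: n }, (_, i) => i);
    this.size = new Array(n).fill(1);
  }
  find(x: number): number {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]]; // path compression (halving)
      x = this.parent[x];
    }
    return x;
  }
  union(a: number, b: number): void {
    const rootA = this.find(a), rootB = this.find(b);
    if (rootA === rootB) return; // already in the same component
    // Attach the smaller tree under the larger one to keep trees shallow.
    const [small, large] = this.size[rootA] < this.size[rootB] ? [rootA, rootB] : [rootB, rootA];
    this.parent[small] = large;
    this.size[large] += this.size[small];
  }
  connected(a: number, b: number): boolean {
    return this.find(a) === this.find(b);
  }
}
```

Note how `union` merges sets symmetrically, which is exactly why direction is lost and the structure only suits undirected graphs.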
Does a solution to the problem result in a very expensive branching program suggesting that we might have an optimisation problem?
If local optimum decisions appear to lead to a global optimum:
Apply a Greedy Algorithm. Note that there’s no guaranteed way of knowing upfront whether local optimum decisions will lead to a global optimum, and it will require testing and analysis to determine whether this is the case.
If the solution to the problem seems to require future information to decide the current step, a greedy algorithm will not be appropriate and you will need to either use Depth-First Search (DFS) or switch to a Dynamic Programming approach involving Top-Down Memoisation.
A greedy algorithm makes decisions at each step based solely on the current state and local considerations, and cannot require backtracking, reconsideration of previous decisions, or deeper exploration of decision paths (like depth-first search).
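One classic case where local optimum decisions do lead to a global optimum is interval scheduling: repeatedly taking the compatible interval that finishes earliest maximises the number of non-overlapping intervals. A sketch (the function name is mine):

```typescript
// Greedy interval scheduling: sort by end time, then take each interval
// that starts at or after the last accepted interval's end. Each decision
// uses only the current state and is never reconsidered.
function maxNonOverlapping(intervals: [number, number][]): number {
  const sorted = [...intervals].sort((a, b) => a[1] - b[1]); // earliest finish first
  let count = 0;
  let lastEnd = -Infinity;
  for (const [start, end] of sorted) {
    if (start >= lastEnd) {
      count += 1;
      lastEnd = end;
    }
  }
  return count;
}
```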
If it seems to be possible to compute a solution from a combination of previously computed solutions (e.g. “optimal substructure”) then we can solve it using Dynamic Programming. Dynamic programming is particularly beneficial when the same subproblem reappears multiple times in a computation (e.g. “overlapping subproblems”).
Dynamic programming allows you to save time at the expense of space.
There are two high-level ways of implementing dynamic programming and the choice depends on the problem:
We can always apply Top-Down Memoisation. This is effectively a DFS of the state space, generally implemented using recursive memoised function calls. It’s not always the most efficient way to solve the problem due to the overhead of recursion, but it does avoid the need to compute all possible subproblems first.
If the transitions between subproblems in the state space are computed in a fixed manner we can apply Bottom-Up Tabulation. This computes all possible subproblems iteratively first and then uses these to compute the final solution, but it avoids the overhead of recursion by computing the subproblems one-by-one iteratively and is able to store these in an array for lower memory usage.
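A sketch of bottom-up tabulation using the Fibonacci recurrence: subproblems are computed iteratively in a fixed order, and because each step only needs the previous two values, the table collapses into two variables.

```typescript
// Bottom-up tabulation: compute subproblems 2..n in order, no recursion.
// The full DP table would be an array of n + 1 values, but since the
// recurrence only looks back two steps we keep just those two, for O(1) space.
function fibonacciTabulated(n: number): number {
  if (n <= 1) return n;
  let previous = 0, current = 1;
  for (let i = 2; i <= n; i++) {
    [previous, current] = [current, previous + current];
  }
  return current;
}
```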
The following questions present scenarios that test your ability to select the most appropriate technique.
Greedy Algorithm or Dynamic Programming?
Two Pointers or Sliding Window?
Stack or Two Pointers?
Stack or Queue?
Backtracking or DFS?
Union-Find or DFS?
DFS or Dynamic Programming?
One of the problems we have when answering the question “What is an interface?” is that we don’t have an accurate representation of what an interface looks like. There is a tendency for us to become confused and pick just one concrete instance of an interface that we know well, leading to responses like “Oh, you mean a GUI” or for those that have read books like “The Design Of Everyday Things” something more physical like a door handle.
Like many people in the tech industry I’ve become acquainted with a small subset of interfaces known as GUIs and this has largely controlled the frame in which I recognise and understand interfaces.
So, like you, when I think of an interface I normally think of this:
And not this:
Perhaps you’re wondering why I’ve shown you a street full of cars and called it an interface. What is the use in that?
Empathising
To illustrate what I mean, I’m going to talk about Uber for a little bit.
Uber is an app company that aims to connect stranded city-dwellers with a driver on-the-go.
A user opens the app, sets their pickup location, and requests a vehicle.
Then they wait.
The other day on a particularly cold and rainy night I undertook this ritual, putting my phone back into my pocket as I waited for the 5-10 minutes it takes a driver to arrive.
So there I was, staring into the dark, wet street waiting for my car to arrive. As it drew close, I received a notification from Uber in the form of a vibration/text message and reached into my pocket to retrieve my phone.
The interface
Your natural inclination might be to think that I’m not currently interfaced with Uber because I do not have their app on screen. However, this is something I fundamentally disagree with.
interface
ˈɪntəfeɪs/
noun
- a point where two systems, subjects, organizations, etc. meet and interact.
Recall that the problem that Uber is trying to solve is that of connecting people to drivers.
That is the core interface: a connection between a person and a driver.
So I’m standing in a busy street with a message from Uber telling me that somewhere in front of me is a car. And as of yet I don’t know precisely where it is or how to differentiate it from other vehicles, so I need information to recognise it such as its make, colour and number plate.
This information that urges a user to interact in the right way exists under a bit of umbrella terminology known as a “perceived affordance”.
The word “affordance” was created by the psychologist J. J. Gibson to refer to actionable properties between the world and an actor (e.g. person). To Gibson, affordances were relationships. Later on this word was introduced to Design by the famous usability engineer, Don Norman. He also prefixed it with an additional word “perceived”, as he felt that while an object might have many affordances it’s important to distinguish between those that are easily perceived and those that are not, in order that designers might bias perceptions towards particular affordances.
In order to find the information necessary to continue the interaction, I need to unlock my phone and re-open the Uber app. I believe there’s an opportunity here to ease the interaction between user and driver. If we consider the app to be a sub-interface existing in relation to a larger interface between the user and the driver we can begin to notice other easier to perceive sub-interfaces which exist alongside it. A perfect example is the text message that notified me of the driver’s arrival.
The text message can be used to give the information required to quickly and easily identify and search for the driver.
“Hi Seb, your Uber is arriving now!”
Could become
“Hi Seb, your Uber is arriving now! Look out for the silver BMW with the number plate K50 WTB.”
Almost all notifications can be considered a place to inject actionable interfacing information into a user’s head.
This isn’t the only affordance that could be used. Other car companies are incidentally also using their own perceived affordances.
That’s a competitor called Lyft. Lyft handed out “carstaches” to their drivers. This has been hailed as a branding and marketing exercise, which it certainly is, but it is also a perceived affordance that lets you quickly identify your Lyft.
So what does an interface look like?
To me the term interface describes a conduit between a person and a resource that a designer might bias towards particular user experiences with the use of perceived affordances.
If I was to visualise the mental model I have of an interface I would draw a loop that connects your head to a resource.
Technology is rapidly allowing us to interface with the world in new and profound ways. We shouldn’t let old notions of what an interface is dictate to us how we interact with the world. We must always remember that user experience is a function of who and where we are.
Interfaces start in your head.
I’ve named this the “Inverse Number Spiral” problem and here is the problem description:
#$1071^{-1}$: Inverse Number Spiral
Time limit: 1.00 s
Memory limit: 512 MB

A number spiral is an infinite grid whose upper-left square has number 1. Here are the first five layers of the spiral:
Your task is to find out the row $y$ and column $x$ given a number $N$.
Input
The first input line contains an integer $t$: the number of tests.
After this, there are $t$ lines, each containing an integer $N$.
Output
For each test, print the row $y$ and column $x$.
Constraints
$1 ≤ t ≤ 10^{5}$
$1 ≤ y,x ≤ 10^{9}$

Example
Input:
```
3
8
1
15
```

Output:

```
2 3
1 1
4 2
```
My initial thought was that I could solve this problem by simply iterating over the spiral and checking if the current number is equal to the given number $N$. I imagined a robot starting at position $[1, 1]$, moving like a snake—right, down, left and up—while adjusting the number of steps in each direction. However, understanding the sequence of steps proved to be quite challenging. The pattern appeared complex and difficult to comprehend, making it difficult to implement (e.g. right 1, down 1, left 1, down 1, right 2, up 2, right 1, down 3, left 3, down 1, right 4, up 4, and so on).
The time complexity of such an algorithm is $O(n)$. However, the number of tests $t$ is also an input parameter, which makes the overall time complexity $O(t \cdot n)$. Given that there can be up to $10^{5}$ tests and up to $10^{9}$ input numbers, a brute-force approach could potentially require up to $10^{14}$ iterations which may cause the program to exceed the time limit.
Given these challenges, I decided to look to see whether I could find an alternative approach.
As I contemplated the spiral grid, I recognised a familiar pattern at the end of each layer of the spiral:
$1, 4, 9, 16, 25$

It was the sequence of square numbers (e.g. $n^2$). This observation sparked my curiosity and led me to wonder if I could leverage this pattern to my advantage.
I realised that the maximum value in each layer of the spiral was equal to the square of the layer number $L$. For example, the maximum value in layer 3 was $3^2 = 9$, the maximum value in layer 4 was $4^2 = 16$, and so on. This meant that I could use the square root function to determine the layer number $L$ for a given value $N$. Only the square numbers of a layer directly produce the layer number when taking a square root; however, all values in the layer, including its minimum, produce a decimal value greater than the previous layer number, and therefore as long as we round this value up to the nearest integer (e.g. `Math.ceil`) our method produces the correct layer number.
```typescript
function layer(N: number) {
  return Math.ceil(Math.sqrt(N));
}

const L1 = layer(1); // 1
const L2 = layer(2); // 2
const L3 = layer(3); // 2
const L4 = layer(4); // 2
const L5 = layer(5); // 3
const L6 = layer(6); // 3
const L7 = layer(7); // 3
const L8 = layer(8); // 3
const L9 = layer(9); // 3
const L10 = layer(10); // 4
```
Once I had a layer number $L$, I was able to use this to determine the range of values for that layer, as for a given layer $L$ the maximum value is $L^2$ and the minimum value is $(L-1)^2 + 1$ (if you add one to the maximum value of the previous layer you get the minimum of the layer that follows it).
```typescript
function layerRange(L: number) {
  const start = Math.pow(L - 1, 2) + 1;
  const end = Math.pow(L, 2);
  return [start, end] as const;
}

const R1 = layerRange(1); // [1, 1]
const R2 = layerRange(2); // [2, 4]
const R3 = layerRange(3); // [5, 9]
const R4 = layerRange(4); // [10, 16]
```
The way I saw it at this point was that a layer range represented a sort of one-dimensional version of each layer of the spiral. In order to determine the $[y, x]$ coordinates for a given value $N$, I would need to be able to determine the position of $N$ within this layer, and then be able to translate that position back into $[y, x]$ coordinates.
I found that I was able to get the position of $N$ within the one-dimensional layer range quite easily by subtracting the minimum value of the layer from $N$. For example, the position of $N = 7$ in layer 3 was $7 - 5 = 2$. However, in order to convert the one-dimensional position into two-dimensional $[y, x]$ coordinates within the grid, there were two further properties of the spiral that I needed to use to my advantage: (1) the direction in which a layer is traversed alternates with the parity of the layer number $L$ (even layers run down/left, odd layers run up/right), and (2) for half of each layer's range, one of the two coordinates is simply the layer number $L$ itself.
With all of this in mind, I was finally able to determine the $[y, x]$ coordinates for a given value $N$, like so:
```typescript
function layer(N: number) {
  return Math.ceil(Math.sqrt(N));
}

function layerRange(L: number) {
  const start = Math.pow(L - 1, 2) + 1;
  const end = Math.pow(L, 2);
  return [start, end] as const;
}

function direction(L: number, axis: "y" | "x") {
  switch (axis) {
    case "y": {
      // Determine the direction for the "y" axis based on the even/odd nature
      // of the layer (L). If L is even, the direction is down (1); otherwise,
      // it is up (-1).
      return L % 2 === 0 ? 1 : -1;
    }
    case "x": {
      // Determine the direction for the "x" axis based on the even/odd nature
      // of the layer (L). If L is even, the direction is left (-1); otherwise,
      // it is right (1).
      return L % 2 === 0 ? -1 : 1;
    }
    default: {
      throw new Error(`Invalid axis argument supplied: ${axis}`);
    }
  }
}

function coord(N: number, axis: "y" | "x") {
  const L = layer(N);
  const [start, end] = layerRange(L);

  // The `sequenceIndex` is a zero-indexed "position" of `N` within the layer
  // range.
  const sequenceIndex = N - start;

  // The `midIndex` is the zero-indexed mid-point of the layer range.
  const midIndex = (end - start) / 2;

  // Depending on the direction of the spiral and the axis, the coordinate
  // can be either (1) the layer number `L`, (2) the position of `N` computed
  // by starting from the beginning of the layer range, (3) the position of
  // `N` computed by starting from the end of the layer range and counting
  // backwards towards the center.
  //
  // The value of `D` might be somewhat difficult to grasp at first as it
  // abstracts away the direction (clockwise/anti-clockwise) and the axis
  // into a single value that determines whether the coordinate is
  // calculated by counting forwards towards the middle of the layer range
  // or counting backwards from the end towards the middle of the layer
  // range.
  const D = direction(L, axis);

  switch (D) {
    // If the direction is down or right.
    case 1: {
      if (sequenceIndex <= midIndex) {
        // For the first half of the sequence, the coordinate is simply the
        // `sequenceIndex` incremented by one to convert it from zero-indexed
        // to one-indexed.
        return 1 + sequenceIndex;
      }
      // For the second half of the sequence, the coordinate is the layer `L`
      // itself.
      return L;
    }
    // If the direction is up or left.
    case -1: {
      if (sequenceIndex > midIndex) {
        // For the second half of the sequence, the coordinate is calculated
        // by counting back from the maximum value of the outer layer `L`
        // towards the center. We do this by subtracting the difference
        // between the `sequenceIndex` and the `midIndex` from the layer `L`.
        return L - (sequenceIndex - midIndex);
      }
      // For the first half of the sequence, the coordinate is the layer `L`
      // itself.
      return L;
    }
    default: {
      throw new Error(`Invalid direction generated: ${D}`);
    }
  }
}

function y(N: number) {
  return coord(N, "y");
}

function x(N: number) {
  return coord(N, "x");
}

function f(N: number) {
  return [y(N), x(N)] as const;
}

const F1 = f(1); // [1, 1]
const F4 = f(4); // [2, 1]
const F7 = f(7); // [3, 3]
const F9 = f(9); // [1, 3]
const F11 = f(11); // [2, 4]
const F13 = f(13); // [4, 4]
const F14 = f(14); // [4, 3]
const F17 = f(17); // [5, 1]
const F18 = f(18); // [5, 2]
const F24 = f(24); // [2, 5]
const F25 = f(25); // [1, 5]
```
By identifying patterns in the number spiral and leveraging mathematical relationships, we were able to transform a seemingly complex problem into a solvable one. In the process, we developed an efficient algorithm that significantly reduced the time complexity compared to a brute-force approach.
Because we are able to calculate the coordinates for a given number $N$ using only constant-time mathematical operations, the time complexity of the solution is $O(1)$. When we execute `f(N)` for each test case $t$ in the input, the overall time complexity becomes $O(t)$, a substantial improvement over the brute-force approach, which would have a time complexity of $O(t \cdot n)$.
Although further optimisations, such as memoising `f(N)` to minimise duplicate calculations, could still be made, the current solution is both efficient and elegant.
Now that you've seen our approach to solving this problem, we encourage you to try your hand at the original “Number Spiral” problem. Finally, if you're looking for more challenges, the entire CSES Problem Set is an excellent resource to explore, offering a wide range of problems to hone your coding and problem-solving skills.
My view is that the adage “the right tool for the job” suffered from being transmitted by engineers in an almost folkloric manner. It’s a truism (“obviously you want to pick the right tool!”) but in practice was often wrong since shoe-horning an old bit of common sense advice into the domain of software engineering gave licence to the self-interests of software engineers who could avoid thinking about its costs (operational complexity, training/ramp-up costs of employees, etc).
On the other hand, “Choose Boring Technology” treats technology choices as if the only system they interact with is a technical one and human incentives can be disregarded — it pays no attention to why “the right tool” was so-often endorsed, and how this related to the polyglot movement (both were, in a sense, marketing that allowed startups and employees that wanted to chart uncharted waters to find each other). Boring is also a misnomer — what is meant is “well-understood” but the word choice is being used to frame the domain as one in which technologies can either be “shiny” or “boring” — which isn’t the primary frame that I'd use to evaluate technology.
While reducing operational complexity, using technologies that are flexible and avoiding unknown-unknowns are all important and suggest that “Choose Boring Technology” is a sensible approach, we should also pay careful attention to whether our choices result in happy developers. Retention is important to the long-term maintenance of knowledge within a company, and developers that don’t feel like they’re learning anything new or that feel stifled by centralised decision-making over the tools they’re allowed to use can end up leaving due to feeling disrespected and disenchanted. On the other hand, there’s also “retention risk” in attracting developers that seek shiny technologies as once these technologies have lost their shine they can leave just as eagerly as they joined. Additionally, in many cases there are technologies that offer an order of magnitude faster/better ways of doing things, and so the best teams will make judgments over when to let their engineers invest their time seeking improvements.
Unchecked restraint in technology choices can eventually lead towards obsolescence and brain drain. Following “The Golden Path” and creating a team to centralise technical decisions is a good idea when a company reaches a particular scale, but you must also work out how you’ll create affordances for developers outside of that team to go above the predefined baseline of quality (for a similar idea read Lethain's "Providing pierceable abstractions").
There are also differences between individuals and differences due to scale (personal project, team, company-wide, etc):
At the individual level, while I might reach into my toolbox and find that my most well-worn tool is JavaScript and use that, another might reach into their toolbox and pick Ruby. We should use what we know best. But what if you’re just starting out? If I was a beginner I would pick up-and-coming general-purpose technologies that are growing in usage. While I might benefit from prior knowledge in an older ecosystem, I will learn more by being part of a highly-engaged community of people trying to achieve state-of-the-art approaches.
But at the scale of a company, there are further problems to be aware of due to the difficulty of migrating software stacks and a tendency for policies and processes that don't work well to be left unfixed.
What should we recommend to a company that has always written both their application and database logic using SQL Server Stored Procedures? The boring choice might be for them to continue extending and creating new software in this way, but since there is very little beginner interest in learning that stack eventually they will run into hiring problems and brain drain caused by an ageing workforce (institutional knowledge loss). To survive, they will need to rework their system in a way that gives them access to a more fungible workforce, and they will need to do this without interrupting the business.
Another common problem at large companies is for there to be policies and processes around evaluating and recording technical decisions but for these to not achieve anything other than getting developers to copy-and-paste the marketing claims of a new technology into a document without consideration of its context/impact within the organisation. These policies can often be empty ceremonies with no impact on the actual decision-making process that allow developers to rubber-stamp their desire to learn new technologies.
So how would I make technology choices?
This can lead to useless boilerplate proliferating and sometimes to the creators eventually deciding that they have no choice but to disavow their earlier approaches.
Dogma is bad because it leads to poorly-fit solutions.
However, in the absence of dogma, a greater problem can reveal itself: it’s hard to correctly notice the causes of problems.
Dogma might not be the right solution, but it does serve a purpose. Without frameworks to act as guard rails, patterns and best-practices are needed to help less experienced engineers find their feet and avoid wasting time.
The central instruments of problem-solving are:
As engineers we often focus our energy on the final item, however here we will concentrate on ‘noticing things’ and avoiding false causes and the bad solutions that arise from these and which result in wasted developer productivity.
In writing this essay my hope is to inform those that create libraries or tools on how best to increase accessibility for both beginners and experts, and to provide a high-level framework for thinking about some of the problems that we encounter.
Over the last few years, I’ve spent quite a bit of time as a consultant technical lead, often advising teams made up of very junior software engineers. This has given me a lot of exposure to the problems that they face.
Some of the problems that I noticed are already in the process of being fixed (e.g. complex build processes should no longer be the default as better tools attempt to provide good default choices), however I also encountered a number of problems that surprised me. This led me to believe that (1) there are still many ways in which we are too permissive even as we reject past dogma, and (2) there are useful patterns and heuristics that are either unwritten or overlooked.
Here are some of the issues I encountered:
salience
ˈseɪlɪəns/ noun
noun: salience; noun: saliency; plural noun: saliencies
the quality of being particularly noticeable or important; prominence.
“the political salience of religion has a considerable impact”
Great developer experience (DX) is largely about control over salience. Software and its outputs should be understood by all engineers that contribute to them.
Here are a few examples of how problems of saliency can occur:
Silence
I once worked on a codebase which was very heavily tooled, and on which one of the more junior engineers often complained about having difficulties getting their code to work. From time to time I would come over to provide direction and help them fix logical issues, often relying on my ability to quickly understand what was being coded rather than reaching for any particular debugging technique. Since this generally helped, I incorrectly assumed that their complaint was due to the occasional mistakes I spotted within their code.
A few weeks later, I made a linting error in my own code, started the app and got a white screen of death. I checked the Developer Console and to my surprise saw a `404 Not Found` error against the HTTP request for the JavaScript.
It turned out that the build process had been misconfigured and it would exit without outputting code if it found any linting error. To make matters worse it emitted no errors when it did this. Failure was silent.
This behaviour primarily affected junior engineers on the team, since many senior engineers program within the linter’s rules by default and hence rarely see its errors. Ironically, a process that had originally been set up to help engineers write consistent code was hindering their understanding by making their logic fail for irrelevant reasons. And, since those who wrote the build process were less exposed to linting errors, the problem was effectively invisible to them.
This class of problem is likely more common than you’d expect. Build processes are quite often cobbled together at the start of projects by lead/senior engineers, who often hit different edge-cases than beginners.
Lessons learned:
Noise
On the other end of the spectrum is the Tragedy of the Commons that occurs when combining lots of disparate tools into a single process.
A popular software principle is the Unix Philosophy’s Rule of Silence. This states:
Developers should design programs so that they do not print unnecessary output. This rule aims to allow other programs and developers to pick out the information they need from a program’s output without having to parse verbosity.
Often individual tools will follow this principle, or at least provide options to help reduce the default verbosity (e.g. `npm run -s`). However, as engineers begin to combine them the total output will tend to become noisy and difficult to parse, reducing its usability.
Beginners tend to understand and debug problems through tinkering and excavation instead of through contextual readings of the code or situation.
This will sometimes lead them to defactor logic towards units of meaning that are easier for them to granularly understand and observe. This can lead to a loss of salience for more experienced engineers that have learnt to work at a higher level of abstraction due to its greater expressivity and reduced surface area for errors.
Unless we retreat back up the ladder of abstraction after gaining understanding, this can aggravate future problems.
A preference for trees or forests
A suspicion I have is that, in order to gain understanding and debug problems, different engineers require different things to be salient. Senior engineers might prefer the overall approach and context to be expressive and concise so that it can be checked against their previous experiences, while junior engineers might need the individual details of the problem to be most salient so they can build understanding from scratch.
Laddering
Maybe a useful way of looking at the developer experience of a codebase is to try to judge it by the quality of the abstraction ladder that has been embedded within it? How easy is it for engineers with differing preferences towards granularities of abstraction to move up-and-down this ladder? Can they do so non-destructively?
“Language in Thought and Action” by S.I. Hayakawa
I have recently seen valuable work being done improving error messages by expressing them as a granular detail (including a diff of expected versus actual) alongside context and beginner hints. This is analogous to Bret Victor's tweet on stories and stats.
When programming in JavaScript, it's not unusual to run into errors like `TypeError: props.service.manufacturingService is undefined`.
In general, in the absence of static typing, deeply-nested object properties signal that some code is likely to be fragile.
For example:
```javascript
import React from 'react';

const roleNames = {
  CHAIRMAN: 'Chairman',
  CEO: 'Chief Executive Officer',
  MD: 'Managing Director'
};

export function SomeComponentDeepWithinHierarchy(props) {
  return (
    <div className="service-box">
      <h2>{props.service.name}</h2>
      <div className="service-box__info">
        <p>{props.service.description}</p>
        {props.service.manufacturingService.factories[0] ? (
          <ul>
            <li>Primary Factory: {props.service.manufacturingService.factories[0].name}</li>
            <li>Owner Role: {roleNames[props.service.manufacturingService.factories[0].owners[0].roleType]}</li>
          </ul>
        ) : ''}
      </div>
    </div>
  );
}
```
The logic shown above has many opportunities to throw `TypeError`s:

- `props.service` could be null.
- `props.service.manufacturingService` could be null.
- `props.service.manufacturingService.factories[0]` could be null or empty.
- `props.service.manufacturingService.factories[0].owners[0]` could be null or empty.
- `roleNames` could be missing a key-value mapping for `props.service.manufacturingService.factories[0].owners[0].roleType`, resulting in `undefined`.

Every time `service` is passed down the component hierarchy into a component that will read from it, it endows a stealth requirement to either trust that the data is there or to manually check before each property access.
Often back-end engineers that work in languages with static typing produce deeply-nested objects like these without thinking twice. And, if the shape of the object hasn't yet been stabilised on the back-end, uncertainty on the front-end can quickly cause a proliferation of defensive programming throughout the component hierarchy (e.g. if-else checks on `props.service && props.service.manufacturingService && props.service.manufacturingService.factories.length && ...`). Over time these checks become FUD that clouds other team members' understanding of data contracts.
Engineers with less experience working with JavaScript won't realise that they have a problem until it's too late. And, they will sometimes exacerbate it by attempting to resolve the problem while also trying to reduce keystrokes: for example, by choosing to pass through kitchen-sink objects so that function signatures look simpler.
Of course, there are best practices. For example: objects can be flattened, nullability reduced at the source, defaults provided, TypeScript definitions set up, transforms moved to the edges, selectors configured, and, when there is no other choice, careful use made of deep property selector functions.
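As a sketch of the last two ideas (flattening and moving transforms to the edges), a hypothetical `toServiceViewModel` function could convert the nested response into a flat view model once, with defaults, so that components never reach through deep property chains. The shapes and names below are illustrative, not from a real API:

```typescript
// Illustrative shape of the nested response described above.
interface ServiceResponse {
  name?: string;
  description?: string;
  manufacturingService?: {
    factories?: { name?: string; owners?: { roleType?: string }[] }[];
  };
}

// A flat view model with no optional nesting: components that
// receive this never need defensive checks.
interface ServiceViewModel {
  name: string;
  description: string;
  primaryFactoryName: string | null;
  primaryOwnerRoleType: string | null;
}

// The transform happens once, at the edge, with defaults applied.
function toServiceViewModel(service: ServiceResponse): ServiceViewModel {
  const factory = service.manufacturingService?.factories?.[0];
  return {
    name: service.name ?? "Unknown service",
    description: service.description ?? "",
    primaryFactoryName: factory?.name ?? null,
    primaryOwnerRoleType: factory?.owners?.[0]?.roleType ?? null,
  };
}
```

Components then render from `ServiceViewModel` alone, so the defensive checks live in exactly one place.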
The best way I can explain this one is to intentionally write bad code:
```javascript
import cloneDeep from 'lodash/cloneDeep';

const initialState = {
  priceToggle: false,
  userConfig: {},
};

export default function reducer(state = initialState, action) {
  const newState = cloneDeep(state);
  switch (action.type) {
    case "SET_CONFIG_DATA":
      return {
        userConfig: action.payload.userConfig,
      };
    case "FETCH_USER_SUCCESS":
      newState.currentUser = action.payload;
      break;
    case "FETCH_PRODUCTS_SUCCESS":
      newState.products = action.payload;
      break;
    case "MARK_AS_EDITING_PRODUCT": {
      const { productIndex } = action.payload;
      newState.editingProductIndex = productIndex;
      newState.products[productIndex].originalData =
        newState.products[productIndex];
      newState.products[productIndex].editing = true;
      break;
    }
    case "UPDATE_PRODUCT_KEY": {
      const { productIndex, propertyName, propertyValue } = action.payload;
      if (newState.products[productIndex][propertyName] !== propertyValue) {
        newState.products[productIndex].changed = true;
      }
      newState.products[productIndex][propertyName] = propertyValue;
      break;
    }
    case "MARK_AS_NO_LONGER_EDITING_PRODUCT": {
      const { productIndex } = action.payload;
      newState.products[productIndex] =
        newState.products[productIndex].originalData;
      break;
    }
    case "CREATE_PRODUCT_SUCCESS": {
      const { productIndex } = action.payload;
      newState.products[productIndex].editing = false;
      delete newState.products[productIndex].originalData;
      break;
    }
    case "FETCH_SUPERSTORES_SUCCESS":
      newState.superstores = action.payload;
      if (!newState.currentSuperstoreId) {
        newState.currentSuperstoreId = newState.superstores[0].id;
      }
      break;
    case "SELECT_SUPERSTORE":
      newState.currentSuperstoreId = action.payload;
      break;
    case "TOGGLE_PRICE":
      newState.priceToggle = !newState.priceToggle;
      break;
  }
  return newState;
}
```
Here, various problems arise:

- The initial state provides a boolean for `priceToggle` and an empty object for `userConfig`, but almost every other property might be `null`, and in fact `products` can have dynamic properties.
- You have to mentally apply each branch's mutations to `newState` in order to generate the correct object.
- If `SET_CONFIG_DATA` is called after the other actions it will destroy the state they'd set up.

This problem is considerably worse when a function has over 20 branches or is over 1000 lines long. In one project I consulted on, a decision had been made to store all of the state required for each page in a respective `reducer` function, and this, combined with a lack of experience handling data, made it very difficult to reason about the state when there were bugs. As more features were added to pages, the number of branches increased and the number of possible output shapes increased combinatorially.
Nowadays I recommend using static typing in your app. But the underlying theory is to reduce the number of possible shapes that your data can take, as this affords you the ability to reason about your application's state more easily while also requiring less complexity in your logic.
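As a sketch of this idea (using hypothetical page state, not the codebase from the anecdote), a discriminated union constrains the state to a handful of known shapes, and the compiler then forces every branch to be handled:

```typescript
// Mutually-exclusive page states modelled as a discriminated union:
// each branch has exactly one shape, so no property can be
// unexpectedly undefined within a branch.
type ProductsState =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "loaded"; products: { id: number; name: string }[] }
  | { status: "error"; message: string };

function describeState(state: ProductsState): string {
  // The compiler checks this switch is exhaustive over `status`.
  switch (state.status) {
    case "idle":
      return "Nothing loaded yet";
    case "loading":
      return "Loading";
    case "loaded":
      return `${state.products.length} product(s)`;
    case "error":
      return `Failed: ${state.message}`;
  }
}
```

Compared with the reducer above, a reader can enumerate every possible shape of `ProductsState` at a glance.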
Within the earlier `reducer` code example, there are also a few other issues:
Mutating the data that is currently displayed on the screen, and then resetting it if the edit is cancelled, is a bad pattern. It's preferable to mimic transactions by copying the data that is going to be edited into a separate property where it can be mutated, and then only mutating the original data if the operation is successful. This is better since it is less destructive by default, and applies side effects only when it needs to.
Functions of the form `setProductProperty(propertyName: string, propertyValue: any | undefined)` allow you to write any value into an object. This is problematic since it is so general that it is descriptive of almost any mutation. There are a few cases in which it might be the right solution; however, in most cases we should constrain signatures so that they are descriptive of more specific intentions, and name them so that they describe the action that should occur instead of the state that should be set.
Pages that receive data from an API, display it, update it and then send it back to the server should not mutate the cache of the original data that was received from the server. The reason for this is that it's confusing for the client and the server to be out of sync, and buggy if other pages or components rely on this data being correct. Instead it's better to treat the server data as if it were immutable, and to store it separately from the data that the client is preparing for the server. A benefit of this is that it makes it much more explicit whether the data being sent back was provided by the server or whether it has been created or modified client-side.
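A minimal sketch of this separation, with illustrative `Product` and `EditState` shapes: the server's copy is kept read-only while edits accumulate in a separate draft.

```typescript
interface Product {
  id: number;
  price: number;
}

interface EditState {
  serverProduct: Readonly<Product>; // what the server last sent; never mutated
  draftProduct: Product;            // what the client is preparing to send back
}

// Starting an edit copies the server data into a draft.
function beginEdit(serverProduct: Product): EditState {
  return { serverProduct, draftProduct: { ...serverProduct } };
}

// Edits produce new drafts; the server copy is untouched.
function setDraftPrice(state: EditState, price: number): EditState {
  return { ...state, draftProduct: { ...state.draftProduct, price } };
}

// Because the two copies are separate, "has this been modified
// client-side?" is a simple comparison.
function hasUnsavedChanges(state: EditState): boolean {
  return state.draftProduct.price !== state.serverProduct.price;
}
```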
You are choreographing a circus show with various animals. For one act, you are given two kangaroos on a number line ready to jump in the positive direction (i.e., toward positive infinity).
- The first kangaroo starts at location $x_1$ and moves at a rate of $v_1$ meters per jump.
- The second kangaroo starts at location $x_2$ and moves at a rate of $v_2$ meters per jump.
You have to figure out a way to get both kangaroos at the same location at the same time as part of the show. If it is possible, return `YES`, otherwise return `NO`.
NOTE: Although not explicitly stated, it’s implied that the kangaroos always jump at the same time.
With the constraints given of $0 \leq x_1 \leq x_2 \leq 10000$ and $1 \leq v_1 \leq 10000$ and $1 \leq v_2 \leq 10000$, the problem is very simple and can be easily solved with a brute force approach. As we know there will be no more than $10000$ jumps, we can iteratively simulate the jumps of both kangaroos and check if they ever land on the same spot, returning `YES` if they do and `NO` otherwise. This approach gives us a time complexity of $O(n)$, which should be acceptable when $n \leq 10000$.
e.g.
```python
from typing import Literal

def number_line_jumps(x1: int, v1: int, x2: int, v2: int) -> Literal["YES", "NO"]:
    if x1 == x2:
        return "YES"
    p1, p2 = x1, x2
    for _ in range(10000):
        p1 += v1
        p2 += v2
        if p1 == p2:
            return "YES"
    return "NO"
```
However, if we remove the limit of $10000$ from each constraint, we have difficulties. If the kangaroos never land on the same spot, we get an infinite loop! We can still solve this using iteration but must ensure that we only iterate while the kangaroos are getting closer to each other, and that we exit our loop if each iteration has the kangaroos getting further apart. Unfortunately, this has us keeping track of the distance between the kangaroos, complicating our solution quite a bit. For example, in Python, we might write something like the following:
```python
from typing import Literal

def number_line_jumps(x1: int, v1: int, x2: int, v2: int) -> Literal["YES", "NO"]:
    # If the kangaroos start at the same location, then we can
    # immediately return YES.
    if x1 == x2:
        return "YES"

    # If the kangaroo furthest away is moving faster, the other
    # will never catch up.
    if x1 >= x2 and v1 >= v2:
        return "NO"
    if x2 >= x1 and v2 >= v1:
        return "NO"

    # Otherwise, we iterate until:
    # (1) the kangaroos are at the same location,
    # (2) or, exit the loop if they are diverging and the
    #     difference between them is increasing.
    p1, p2 = x1, x2
    prev_diff = float("inf")
    while abs(p1 - p2) < prev_diff:
        prev_diff = abs(p1 - p2)
        p1 += v1
        p2 += v2
        if p1 == p2:
            return "YES"
    return "NO"
```
I think it’s quite natural for a software engineer to reach for a solution like this. When all you have is iteration, everything looks like a loop.
But, there’s a much more elegant solution to this question that naturally follows from use of mathematical notation.
If we recall from the problem statement:
- The first kangaroo starts at location $x_1$ and moves at a rate of $v_1$ meters per jump.
- The second kangaroo starts at location $x_2$ and moves at a rate of $v_2$ meters per jump.
We can formulate the position of each kangaroo as a function of the number of jumps they’ve taken, $p(j)$. For example, the position $p$ of each kangaroo after $j$ jumps is given by:
$\begin{equation*} \begin{aligned} p_1(j) &= v_1j + x_1 \\ p_2(j) &= v_2j + x_2 \end{aligned} \end{equation*}$

These position functions are linear equations. We could plot them on a graph to see whether they intersect and if so where, or we could solve them algebraically by setting $p_1(j) = p_2(j)$ and solving for $j$. Like so:
$\begin{equation*} \begin{aligned} p_1(j) &= p_2(j) \\\\ v_1j + x_1 &= v_2j + x_2 \\\\ v_1j + x_1 - x_1 &= v_2j + x_2 - x_1 \\\\ v_1j &= v_2j + x_2 - x_1 \\\\ v_1j - v_2j &= v_2j - v_2j + x_2 - x_1 \\\\ v_1j - v_2j &= x_2 - x_1 \\\\ (v_1 - v_2)j &= x_2 - x_1 \\\\ j &= \frac{x_2 - x_1}{v_1 - v_2} \end{aligned} \end{equation*}$

The equation produced (i.e. $j = \frac{x_2 - x_1}{v_1 - v_2}$) allows us to determine whether the kangaroos will ever land on the same spot, and, if so, how many jumps it will take.
This equation is almost directly applicable to solving this problem, apart from a few issues that must be handled:

1. When $v_1 - v_2 = 0$ the result is indeterminate (either a `NaN` or a divide-by-zero error depending on your choice of programming language). This implies that when plotted on a graph, each line will run parallel to the other, never intersecting. In this situation, the kangaroos will never land on the same spot, unless they started at the same location.
2. When $j$ is a non-integer, it implies that the kangaroos will be at the same position mid-jump but never land on the same spot.
3. When $j$ is negative (which, given the constraint $x_1 \leq x_2$, happens when $v_1 < v_2$), the lines only intersect "in the past": the kangaroo that starts behind is also slower, so they will never land on the same spot.
After resolving these issues, the finished solution looks like this:

```python
from typing import Literal

def number_line_jumps(x1: int, v1: int, x2: int, v2: int) -> Literal["YES", "NO"]:
    # If the kangaroos start at the same location, then we can
    # immediately return YES.
    if x1 == x2:
        return "YES"

    # If the kangaroos are moving at the same speed, but started
    # at different locations, they will never land on the same
    # spot.
    if v1 == v2:
        return "NO"

    # Given the constraint x1 <= x2, the second kangaroo starts
    # ahead; if it is also faster, the first can never catch up
    # (j would be negative).
    if v1 < v2:
        return "NO"

    # Finally, instead of iterating we can use the equation we
    # derived above to determine whether j is an integer or not
    # by using the modulo operator to check that the remainder
    # of the division is zero.
    #
    # Put another way, the difference between their starting
    # positions must be evenly divisible by the difference in
    # their speeds for them to meet. The reason for this is that
    # the difference (v1 - v2) represents the incremental step size
    # in the difference between the two starting positions. If it
    # doesn't divide evenly, they will never land on the same spot.
    if (x2 - x1) % (v1 - v2) == 0:
        return "YES"
    return "NO"
```
This solution is not only much more elegant than the iterative solution, it’s also more efficient as it has a time complexity of $O(1)$ instead of $O(n)$.
I think it’s very easy to get tunnel vision when programming and to not see mathematical relationships. For those of you that grew up in the UK, this is Key Stage 3 material that is covered prior to GCSE maths, yet it wasn’t my first instinct to reach for it.
The learning I took away from this was that even if you haven’t yet developed the right mindset to immediately discern mathematical solutions, writing things down using mathematical notation can be a very useful tool in your arsenal. It can make these relationships more apparent and help with pattern matching mathematical approaches to solving problems.
But don’t.
Don’t try to push harder when you start to lose concentration. Do what you are most comfortable with: change direction and keep your energy levels high.
Here are my suggestions on the changes which I’ve found have made me a more productive learner:
I’m aiming to make these posts accessible to other software engineers, by giving explanations that are grounded in ways that should be familiar to us, but which don’t assume a lot of prior knowledge. I’m not an expert in this field, so please let me know if you notice any inaccuracies or have any feedback by dropping me an email.
In my first post, the forward pass of a neuron was described as computing the weighted sum of inputs plus a bias, followed by a non-linear activation function. Some simple pseudocode was used to show this:
```js
activation(add(sum(multiply(weights, inputs)), bias));
```
In reality, instead of computing the “weighted sum of inputs” using `sum(multiply(weights, inputs))` (e.g. $\text{activation}\left(\sum_i w_i \cdot x_i + b\right)$), we use a mathematical operation called the dot product (e.g. $\text{activation}\left(W \cdot x + b\right)$).
To implement this we can take two vectors represented by arrays of equal length, zip them together, multiply each pair, and then sum these products to produce a single number.
e.g.
```typescript
// This is a highly inefficient implementation of "dot product"
// and the code only exists to provide an intuitive explanation
// of how it works.
//
// See: https://github.com/Jam3/math-as-code#dot-product
function dotProduct(v1: number[], v2: number[]): number {
  return sum(zip(v1, v2).map(([x, y]) => x * y));
}

function zip<T, U>(a: T[], b: U[]): [T, U][] {
  if (a.length !== b.length) {
    throw new Error("Arrays must have the same length");
  }
  return a.map((x, i) => [x, b[i]]);
}

function sum(arr: number[]): number {
  return arr.reduce((acc, val) => acc + val, 0);
}

const weights = [1, 2, 3];
const inputs = [4, 5, 6];
const weightedSum = dotProduct(weights, inputs); // 32
```
A vector is a representation of a multi-dimensional quantity with a magnitude and direction. It can be thought of as an arrow starting from the origin of the coordinate system $(0, 0, 0, ...)$ and ending at a point in vector space $(x, y, z, ...)$.
For example, in the code above, the `weights` vector `[1, 2, 3]` can be thought of as an arrow starting from the origin of the coordinate system $(0, 0, 0)$ and ending at the point $(1, 2, 3)$, while the `inputs` vector `[4, 5, 6]` can be thought of as an arrow starting from the origin and ending at the point $(4, 5, 6)$.
In machine learning, we often use vectors to represent the individual input rows of a dataset, where each dimension of a vector is a column representing an attribute/feature of an input row (e.g. $(f_1, f_2, f_3, ..., f_n)$). For example, if we were trying to predict the price of a house, we might have a dataset in which each row was a vector representing a house with the following numeric features: number of bedrooms, number of bathrooms, square footage, number of floors, and so on.
We can also use vectors to represent non-numeric data, however, in order to do this, we need to encode this data into a numeric representation. For instance, categorical variables (e.g. $colors = \{\mathit{red}, \mathit{blue}, \mathit{green}\}$ or $countries = \{\mathit{USA}, \mathit{UK}, \mathit{Canada}\}$) can be represented using a “one-hot” encoded vector (e.g. $(0, 0, 1, 0, ..., 0)$) by first assigning each category a unique integer and then setting all elements of the vector that we want to represent this category to $0$ except for the element at the index of the unique integer we assigned to the category (which we’d set to $1$).
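The one-hot scheme described above can be sketched in a few lines (the `oneHot` helper is illustrative, not a library function):

```typescript
// One-hot encode a categorical value: assign each category a unique
// index, then set that index to 1 and every other element to 0.
function oneHot(categories: string[], value: string): number[] {
  const index = categories.indexOf(value);
  if (index === -1) {
    throw new Error(`Unknown category: ${value}`);
  }
  return categories.map((_, i) => (i === index ? 1 : 0));
}

const colors = ["red", "blue", "green"];
const encoded = oneHot(colors, "green"); // [0, 0, 1]
```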
In language models, the “one-hot” encoding approach to representing non-numeric data falls short as it does not capture any information about the semantic relationship between words. For example, the words “dog” and “cat” are more similar to each other than they are to the word “democracy”, but this is not reflected in the one-hot vectors for these words. Word embeddings, learned from a large amount of text data, solve this problem by embedding words in a continuous vector space where similar words are placed closer together. For example, in this space, the word “dog” might be represented as $(0.9, -0.2, 0.4, ...)$ and “cat” might be $(0.8, -0.3, 0.5, ...)$. This information can be learned due to the “distributional hypothesis” which states that words that frequently occur close together tend to have similar meanings.
“You shall know a word by the company it keeps.”
— J.R. Firth
Originally these embeddings were learned using specific algorithms such as Word2Vec and GloVe, however in transformer-based models like GPT, these embeddings can be learned during the training process via backpropagation and through their interactions within the self-attention mechanism.
The capacity to capture the semantic relationships between words is the essence of why “word embeddings” are able to represent words. And, they are in fact, so effective at achieving this that not only do they allow us to see similarities between words, but they also enable us to perform arithmetic operations on these; a classic example being $king - man + woman ≈ queen$.
Semantic arithmetic is more than just a novelty — it’s consequential and carries significant implications. It means that the vector space is a semantic space that captures meaning itself. This semantic space can provide a kind of substrate for models to operate on allowing for (1) small iterative arithmetic adjustments to reflect nuanced changes in meaning, (2) avoidance of discontinuities in meaning and possibility of interpolated or intermediate meanings during computation, and (3) representation of the nameless or even ineffable.
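As a toy illustration of this semantic arithmetic, here are made-up 3-dimensional embeddings (real embeddings are learned and have hundreds of dimensions, so treat these numbers as purely illustrative):

```typescript
// Hand-written toy embeddings, chosen so the arithmetic works out.
const embeddings: Record<string, number[]> = {
  king: [0.9, 0.8, 0.1],
  man: [0.5, 0.1, 0.0],
  woman: [0.5, 0.1, 0.9],
  queen: [0.9, 0.8, 1.0],
  democracy: [0.0, 0.9, 0.5],
};

const add = (a: number[], b: number[]) => a.map((x, i) => x + b[i]);
const sub = (a: number[], b: number[]) => a.map((x, i) => x - b[i]);
const dist = (a: number[], b: number[]) => Math.hypot(...sub(a, b));

// king - man + woman ≈ [0.9, 0.8, 1.0], whose nearest remaining
// neighbour in our toy space is "queen".
const result = add(sub(embeddings.king, embeddings.man), embeddings.woman);
const nearest = Object.keys(embeddings)
  .filter((w) => !["king", "man", "woman"].includes(w))
  .sort((a, b) => dist(result, embeddings[a]) - dist(result, embeddings[b]))[0];
```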
Dimensionality refers to the number of attributes or features that belong to each data point (row/vector) of a dataset.
In practice, the vectors used in machine learning can have an overwhelmingly large number of dimensions. For instance, a word embedding in the smallest GPT-3 model has 768 dimensions. Because of this, in order to make sense of them, we often have to resort to dimensionality reduction techniques like PCA, t-SNE, or UMAP. These techniques map high-dimensional data into a lower-dimensional space, typically 2 or 3 dimensions, while attempting to preserve as much of the data’s original structure and information as possible. This helps us to visualize the data to unveil patterns or relationships within it.
For example, if we ran a dimensionality reduction algorithm on the word embeddings of the words “dog”, “cat” and “democracy” and plotted the results on a graph, we might see “dog” and “cat” clustered close together, with “democracy” placed far away from both.
Dimensionality reduction does lead to some information loss and isn’t perfect but it can be very useful when dealing with high-dimensional data that otherwise would be practically impossible to interpret.
Of course, we don’t want to resort to drawing a graph every time we wish to compare two vectors. Instead, we employ a similarity metric to measure the similarity between two vectors. One such metric is the dot product operation described at the beginning of this post.
As a metric of similarity, the dot product has some useful properties. For example, if two vectors are pointing in the same direction, the dot product between them will be positive suggesting they are similar. If they are pointing in opposite directions, the dot product will be negative suggesting they are dissimilar. And, if they are perpendicular to each other, the dot product will be zero — indicating orthogonality and suggesting that the two vectors are independent from each other or capturing different information within the vector space.
Crucially, the dot product’s measure of similarity is sensitive to both the direction and magnitude of the vectors. For example, if we were to double the overall magnitude of the vector for “cat”, the dot product between it and the vector for “dog” would double even though the two vectors still point in the same direction, and therefore the words are just as similar to each other as they were before. This may or may not be desirable depending on the task at hand; if we want a measure of similarity that is less sensitive to the magnitude of the vectors, we can utilize cosine similarity or a scaling factor.
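A small sketch of this sensitivity, using made-up vectors for “dog” and “cat”: doubling one vector’s magnitude doubles its dot product with the other, but leaves the cosine similarity unchanged.

```typescript
const dot = (a: number[], b: number[]) =>
  a.reduce((acc, x, i) => acc + x * b[i], 0);
const magnitude = (a: number[]) => Math.sqrt(dot(a, a));
const cosineSimilarity = (a: number[], b: number[]) =>
  dot(a, b) / (magnitude(a) * magnitude(b));

// Illustrative vectors, not real embeddings.
const dog = [0.9, -0.2, 0.4];
const cat = [0.8, -0.3, 0.5];
const scaledCat = cat.map((x) => 2 * x);

const d1 = dot(dog, cat);                     // 0.98
const d2 = dot(dog, scaledCat);               // 1.96 (doubled)
const c1 = cosineSimilarity(dog, cat);        // ≈ 0.985
const c2 = cosineSimilarity(dog, scaledCat);  // ≈ 0.985 (unchanged)
```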
Note that, in high-dimensional spaces, the interpretation of the dot product becomes less straightforward as negative dot products could indicate that the vectors are spread apart in the vector space instead of being opposite or dissimilar to each other; this similarly affects both zero and positive dot products.
`matmul` goes brrr

The dot product shows up again and again in machine learning. It’s used in the forward pass of neural network neurons to compute “weighted sums of inputs”, in dot product similarities within attention mechanisms, and even has a hashmap-like usage in which one-hot encoded vectors are used for lookups.
It’s often quipped that modern machine learning boils down to performing matrix multiplications as quickly as possible and then stacking more and more layers of these until the network is able to generalize. There’s more than a grain of truth here. The ability to quickly process an incredible number of dot product operations is key to training models like GPT-3, as these require a massive amount of computation (in the ballpark of $3.14 \times 10^{23}$ floating point operations were required to train GPT-3).
Fortunately, in the last few decades there has been a lot of investment into the efficient execution of floating point calculations, driven primarily by the demand for high-quality visuals in videogames. The architecture of GPUs allows them to perform many similar operations simultaneously, and this hardware and algorithmic investment has ended up delivering a bit of a windfall for AI, as GPUs were able to be repurposed for machine learning. They now form the backbone of modern machine learning computation. (Although there are also other architectures being developed for this purpose which harness systolic arrays and attempt to directly optimize for the repeated multiply-accumulate (MAC) operations that constitute dot products.)
In order to take advantage of the parallelism offered by GPUs, we need to be able to represent our data as matrices. A matrix is a two-dimensional array of numbers, and can be thought of as a collection of vectors, and roughly analogous to a spreadsheet or table of rows. Matrices are stored in contiguous blocks of memory (known as contiguous memory layouts), which helps to speed up data loading and processing due to locality of reference. Matrices also enable further optimizations to be made to their operations using techniques like matrix tiling and by vectorizing calculations using single instruction, multiple data (SIMD) instructions.
In graphics programming, matrices are a way to combine linear transformations (scale, rotation, translation, etc) into a single structure so we can apply these transformations onto vectors.
In contrast, in machine learning, we frequently use matrices for their ability to parallelize dot product operations (with matrix multiplication) or to parallelize other element-wise floating point vector operations like addition and subtraction. For example, rather than computing the forward pass of a single neuron at a time (e.g. `activation(add(dotProduct(weights, inputs), bias))`), we can compute the forward pass of an entire layer of neurons in parallel by representing the weights of the layer as a matrix (e.g. `activation(add(matmul(weightsByNeuron, inputs), biasByNeuron))`), where each row of the `weightsByNeuron` matrix represents a neuron and each column represents a weight for a particular input. This allows us to compute the dot product (“weighted sum”) of each row of this matrix with the `inputs` in parallel.
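A sketch of that layer-wise forward pass (using ReLU as a stand-in activation; `weightsByNeuron` and `biasByNeuron` follow the naming in the prose above):

```typescript
const relu = (x: number) => Math.max(0, x);

// Each row of `weightsByNeuron` holds one neuron's weights, so a
// single matrix-vector multiplication produces every neuron's
// weighted sum at once.
function layerForward(
  weightsByNeuron: number[][],
  inputs: number[],
  biasByNeuron: number[]
): number[] {
  return weightsByNeuron.map((weights, n) =>
    relu(
      weights.reduce((acc, w, i) => acc + w * inputs[i], 0) + biasByNeuron[n]
    )
  );
}

// Two neurons, three inputs:
const outputs = layerForward(
  [
    [1, 2, 3],
    [-1, 0, 1],
  ],
  [4, 5, 6],
  [0.5, -1]
);
// outputs[0] = relu(1*4 + 2*5 + 3*6 + 0.5) = 32.5
// outputs[1] = relu(-4 + 0 + 6 - 1) = 1
```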
Matrix multiplication (e.g. `matmul`) is an operation from linear algebra that has both a procedural and a geometric interpretation:
To better understand these two interpretations, André Staltz has an interactive visualization showing how a matrix multiplication operation is calculated, while 3Blue1Brown has a great video explaining the operation in terms of geometry, by showing how a matrix can represent the transformation of a vector space into a new coordinate system.
It’s defined like so:
$C_{\textcolor{red}{a},\textcolor{blue}{b}} = \sum_{\textcolor{green}{s=1}}^{\textcolor{green}{sDims}} A_{\textcolor{red}{a},\textcolor{green}{s}} \cdot B_{\textcolor{green}{s},\textcolor{blue}{b}}$

Matrix multiplication operations are not allowed unless the number of columns in the first matrix $A$ matches the number of rows in the second matrix $B$. The resulting matrix $C_{\textcolor{red}{a},\textcolor{blue}{b}}$ will then have dimensions determined by the remaining dimensions of the input matrices: the rows of $A$ by the columns of $B$.
An inefficient but simple implementation of `matmul` might look like this:
```typescript
// This is an incredibly inefficient implementation of "matrix multiplication"
// and the code only exists to provide an intuitive explanation of how it
// works.
//
// If you want a faster implementation, you should use `PyTorch` which is
// already very optimized or if you'd like to learn how GPUs work try to
// reimplement the logic below using WebGPU using the techniques
// described here: https://jott.live/markdown/webgpu_safari
function matmul(A: number[][], B: number[][]): number[][] {
  if (A[0].length !== B.length) {
    throw new Error(
      `The number of columns in A must equal the number of rows in B: ` +
        `${A[0].length} !== ${B.length}`
    );
  }

  const aRows = A.length;
  const bCols = B[0].length;
  const sDims = A[0].length;

  const C = Array.from({ length: aRows }).map(() =>
    Array.from<number>({ length: bCols }).fill(0)
  );

  // For s from 0 to sDims, accumulate A[a][s] times B[s][b].
  for (let a = 0; a < aRows; a++) {
    for (let b = 0; b < bCols; b++) {
      for (let s = 0; s < sDims; s++) {
        C[a][b] += A[a][s] * B[s][b];
      }
    }
  }

  return C;
}

const A = [
  [1, 2, 3],
  [4, 5, 6],
];
const B = [
  [7, 8],
  [9, 10],
  [11, 12],
];
const C = matmul(A, B);
// [
//   [58, 64],
//   [139, 154]
// ]
```
In “bridging the gap between neural networks and functions”, we argued that, for backpropagation to be usable, all inputs must be real-valued numeric values and all operations must be differentiable and composable. As we’ve shown, word embeddings are real-valued numeric vectors, and matrix multiplication operations are differentiable and composable due to being made up of many dot product operations that are ultimately composed of many multiply-accumulate (MAC) operations.
$\text{matrix multiplication} \longrightarrow \text{dot product} \longrightarrow \text{multiply-accumulate}$

As a software engineer, the matrix multiplication convention of multiplying the rows of one matrix by the columns of another seems arbitrary and confusing. As far as I can understand, it comes from the geometric interpretation of matrix multiplication as composing linear transformations (e.g. scaling, rotating, skewing, or shearing the space). While in machine learning we seem to primarily use matrix multiplication for its ability to parallelize dot product operations, it’s possible that the kind of semantic arithmetic mentioned earlier when discussing word embeddings makes it reasonable to interpret matrix multiplication within high-dimensional vector spaces geometrically. We know that these models do sometimes appear to perform geometric operations in order to compute outputs; for example, in “Progress measures for grokking via mechanistic interpretability” a simple transformer was trained to perform modular addition (aka “clock” arithmetic), and when the authors reverse-engineered the algorithm learned by the network they found that it was using discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. If your whole world consists of distances in high-dimensional vector spaces, perhaps it makes sense that the way you compute outputs will be composed of angles and frequencies.
What is abundantly clear is that modern machine learning models are highly demanding of computational resources (FLOPS) and that matrix multiplication operations can concisely express the application of parallel dot product operations in equations.
So, to conclude: dot products are found at the heart of many operations in machine learning. They are integral to calculating weighted sums and measuring similarity, and even power the “attention” mechanism in transformers. However, their real power emerges when we scale up the number of these operations using matrix multiplication, producing complex linear transformations within high-dimensional spaces. GPUs are instrumental here, efficiently handling the massive computational demands this requires, and we have videogames to thank for that!
potatogpt. Although its performance may be slow, it contains a very interesting approach to type-checking tensor arithmetic. This approach eliminates the need to run your code to verify whether operations are allowed, or to keep track of tensor sizes in your head.
The implementation is quite complex, employing several advanced TypeScript techniques. To make it more accessible and easier to understand, I’ve attempted to simplify the implementation and explain it with clarifying comments below.
Finally, I show how this approach allows us to easily create type-safe versions of functions like `zip` and `matmul`.
In order for `Tensor`s to have exact dimensions, we need to support only numeric literals (e.g. `16`, `768`, etc.) for sizes known at compile time, and “branded types” for sizes only known at runtime. We must disallow non-literal `number` types and unions of `number`s (e.g. `16 | 768`), because if these get introduced into an application, data produced using them would also lack exact dimensions.
```typescript
// We check whether `T` is a numeric literal by checking that `number`
// does not extend from `T` but that `T` does extend from `number`.
type IsNumericLiteral<T> = number extends T
  ? false
  : T extends number
  ? true
  : false;

// In order to support runtime-determined sizes we use a "branded type"
// to give these dimensions labels that they can be type-checked with
// and a function `Var` to generate values with this type.
export type Var<Label extends string> = number & { label: Label };
export const Var = <Label extends string>(size: number, label: Label) => {
  return size as Var<Label>;
};
type IsVar<T> = T extends Var<string> ? true : false;

// For type-checking of tensors to work they must only ever be
// created using numeric literals (e.g. `5`) or `Var<string>`
// and never from types like `number` or `1 | 2 | 3`.
type IsNumericLiteralOrVar<T extends number | Var<string>> = And<
  // We disallow `T` to be a union of types.
  Not<IsUnion<T>>,
  Or<
    // We allow `T` to be a numeric literal but not a number.
    IsNumericLiteral<T>,
    // We allow `T` to be a `Var`.
    IsVar<T>
  >
>;

// Utilities
type And<A, B> = A extends true ? (B extends true ? true : false) : false;
type Or<A, B> = A extends true ? true : B extends true ? true : false;
type Not<A> = A extends true ? false : true;

// `IsUnion` is based on the principle that a union like `A | B` does not
// extend an intersection like `A & B`. The conditional type uses a
// "tuple trick" technique that avoids distributing the type `T` over
// `UnionToIntersection` by wrapping the type into a one-element tuple.
// This means that if `T` is `'A' | 'B'` the expression is evaluated
// as `['A' | 'B'] extends [UnionToIntersection<'A' | 'B'>]` instead of
// `'A' | 'B' extends UnionToIntersection<'A'> | UnionToIntersection<'B'>`.
type IsUnion<T> = [T] extends [UnionToIntersection<T>] ? false : true;

// `UnionToIntersection` takes a union type and uses a "distributive
// conditional type" to map over each element of the union and create a
// series of function types with each element as their argument. It then
// infers the first argument of each of these functions to create a new
// type that is the intersection of all the types in the original union.
type UnionToIntersection<Union> = (
  Union extends unknown ? (distributedUnion: Union) => void : never
) extends (mergedIntersection: infer Intersection) => void
  ? Intersection
  : never;
```
If you need to, you can read further on the more advanced TypeScript techniques here:
We can then implement a type-safe `Tensor` with a unique constraint: the dimensions must be specified using numeric literals or “branded types”. This approach pushes the limits of TypeScript’s standard type-checking capabilities and requires a non-idiomatic usage of conditional types to represent these errors. Note that we diverged from Ben’s original implementation by enforcing this dimensional constraint at the argument level instead of at the return level with a conditional return type that produces an invalid tensor. The downside of this is that you must use `as const` on the `shape` argument to prevent TypeScript from widening the literal types to `number`.
```typescript
export type Dimension = number | Var<string>;

export type Tensor<Shape extends readonly Dimension[]> = {
  data: Float32Array;
  shape: Shape;
};

export function tensor<const Shape extends readonly Dimension[]>(
  shape: AssertShapeEveryElementIsNumericLiteralOrVar<Shape>,
  init?: number[]
): Tensor<Shape> {
  return {
    data: init
      ? new Float32Array(init)
      : new Float32Array((shape as Shape).reduce((a, b) => a * b, 1)),
    shape: shape as Shape,
  };
}

// `ArrayEveryElementIsNumericLiteralOrVar` is similar to JavaScript's
// `Array#every` in that it checks that a particular condition is true of
// every element in an array and returns `true` if this is the case. In
// TypeScript we have to hardcode our condition (`IsNumericLiteralOrVar`)
// as we do not yet have higher-kinded generic types that can take in
// other generic types and apply these.
//
// In the code below we create a "mapped object type" from an array type
// and then apply the condition to each value in the mapped object type.
// We then use a conditional type to check whether the type outputted
// extends from a type in which the value at every key is `true`.
type ArrayEveryElementIsNumericLiteralOrVar<
  T extends ReadonlyArray<number | Var<string>>
> = T extends ReadonlyArray<unknown>
  ? { [K in keyof T]: IsNumericLiteralOrVar<T[K]> } extends {
      [K in keyof T]: true;
    }
    ? true
    : false
  : false;

type InvalidArgument<T> = readonly [never, T];

type AssertShapeEveryElementIsNumericLiteralOrVar<
  T extends ReadonlyArray<number | Var<string>>
> = true extends ArrayEveryElementIsNumericLiteralOrVar<T>
  ? T
  : ReadonlyArray<
      InvalidArgument<"The `shape` argument must be marked `as const` and only contain number literals or branded types.">
    >;

// Tests
const fourDimensionalTensorWithStaticSizes = tensor([
  10, 100, 1000, 10000,
] as const);

const threeDimensionalTensorWithRuntimeSize = tensor([
  5,
  Var(3, "dim"),
  10,
] as const);

const invalidTensor1 = tensor([10, 100, 1000, 10000]);
// error TS2322: Type 'number' is not assignable to type
// 'InvalidArgument<"The `shape` argument must be marked `as const` and
// only contain number literals or branded types.">'.

const invalidTensor2 = tensor([10 as number, 100, 1000, 10000] as const);
// error TS2322: Type 'number' is not assignable to type 'InvalidArgument<...>'.

const invalidTensor3 = tensor([5, 3 as 3 | 6 | 9, 10] as const);
// error TS2322: Type 'number' is not assignable to type 'InvalidArgument<...>'.
```
If you need to, you can read further on the more advanced TypeScript techniques here:
```typescript
function isDimensionArray(
  maybeDimensionArray: any
): maybeDimensionArray is readonly Dimension[] {
  return (
    Array.isArray(maybeDimensionArray) &&
    maybeDimensionArray.some((d) => typeof d === "number")
  );
}

function is2DArray(maybe2DArray: any): maybe2DArray is number[][] {
  return (
    Array.isArray(maybe2DArray) &&
    maybe2DArray.some((row) => Array.isArray(row))
  );
}

function flat<T>(arr: T[][]): T[] {
  let result: T[] = [];
  for (let i = 0; i < arr.length; i++) {
    result.push.apply(result, arr[i]);
  }
  return result;
}

export type Matrix<Rows extends Dimension, Columns extends Dimension> =
  Tensor<readonly [Rows, Columns]>;

export function matrix<
  const TwoDArray extends ReadonlyArray<ReadonlyArray<number>>
>(init: TwoDArray): Matrix<TwoDArray["length"], TwoDArray[0]["length"]>;
export function matrix<const Shape extends readonly [Dimension, Dimension]>(
  shape: AssertShapeEveryElementIsNumericLiteralOrVar<Shape>,
  init?: number[]
): Matrix<Shape[0], Shape[1]>;
export function matrix<const Shape extends readonly [Dimension, Dimension]>(
  shape: AssertShapeEveryElementIsNumericLiteralOrVar<Shape>,
  init?: number[]
): Matrix<Shape[0], Shape[1]> {
  let resolvedShape: readonly [any, any];
  if (isDimensionArray(shape)) {
    resolvedShape = shape;
  } else if (is2DArray(shape)) {
    resolvedShape = [shape.length, shape[0].length];
    init = flat(shape);
  } else {
    throw new Error("Invalid shape type for matrix.");
  }
  return tensor(resolvedShape, init);
}

// Tests
const matrixWithStaticSizes = matrix([25, 50] as const);

const matrixWithRuntimeSize = matrix([
  10,
  Var(100, "configuredDimensionName"),
] as const);

const matrixWithSizeFromData = matrix([
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9],
]);

const invalidMatrix1 = matrix([25, 50]);
// error TS2769: No overload matches this call.
// Type 'number' is not assignable to type 'InvalidArgument<"The `shape`
// argument must be marked `as const` and only contain number literals
// or branded types.">'.

const invalidMatrix2 = matrix([25 as number, 50] as const);
// error TS2769: No overload matches this call.
// Type 'number' is not assignable to type 'InvalidArgument<...>'.

const invalidMatrix3 = matrix([10, 100 as 100 | 115] as const);
// error TS2769: No overload matches this call.
// Type 'number' is not assignable to type 'InvalidArgument<...>'.
```
```typescript
type AssertSizeIsNumericLiteralOrVar<T extends Dimension> =
  true extends IsNumericLiteralOrVar<T>
    ? T
    : InvalidArgument<"The `size` argument must only contain number literals or branded types.">;

export type RowVector<Size extends Dimension> = Tensor<readonly [1, Size]>;
export type Vector<Size extends Dimension> = RowVector<Size>;

export function vector<const OneDArray extends readonly Dimension[]>(
  init: OneDArray
): Vector<OneDArray["length"]>;
export function vector<const Size extends Dimension>(
  size: AssertSizeIsNumericLiteralOrVar<Size>,
  init?: number[]
): Vector<Size>;
export function vector<const Size extends Dimension>(
  size: AssertSizeIsNumericLiteralOrVar<Size>,
  init?: number[]
): Vector<Size> {
  let shape: readonly [1, any];
  if (typeof size === "number") {
    shape = [1, size];
  } else if (Array.isArray(size)) {
    shape = [1, size.length];
    init = size;
  } else {
    throw new Error("Invalid size type for vector.");
  }
  return tensor(shape, init);
}

// Tests
const vectorWithStaticSize = vector(2);
const vectorWithRuntimeSize = vector(Var(4, "configuredDimensionName"));
const vectorWithSizeFromData = vector([1, 2, 3]);

const invalidVector1 = vector(2 as number);
// error TS2769: No overload matches this call.
// Argument of type 'number' is not assignable to parameter of type
// 'InvalidArgument<"The `size` argument must only contain number
// literals or branded types.">'.

const invalidVector2 = vector(100 as 100 | 115);
// error TS2769: No overload matches this call.
// Argument of type 'number' is not assignable to parameter of type
// 'InvalidArgument<...>'.
```
Once we have a `Vector` and `Matrix` type defined, we can use these to write a type-safe `zip` function that combines two `Vector`s of the same length into a `Matrix` of `[VectorLength, 2]`, like so:
```typescript
/**
 * The `zip` function combines two vectors of the same length into a matrix
 * where each row contains a pair of corresponding elements from the input
 * vectors. The output matrix's data is stored in a `Float32Array` with an
 * interleaved arrangement of elements (row-major storage order) for efficient
 * access.
 *
 * Example:
 * Input vectors: [a1, a2, a3] and [b1, b2, b3]
 * Output matrix:
 * | a1 b1 |
 * | a2 b2 |
 * | a3 b3 |
 *
 * Memory layout in Float32Array: [a1, b1, a2, b2, a3, b3]
 */
function zip<SameVector extends Vector<Dimension>>(
  a: SameVector,
  b: SameVector
): Matrix<SameVector["shape"][1], 2> {
  if (a.shape[1] !== b.shape[1]) {
    throw new Error(
      `zip cannot operate on different length vectors; ${a.shape[1]} !== ${b.shape[1]}`
    );
  }
  const length = a.shape[1];
  const resultData: number[] = [];
  for (let i = 0; i < length; i++) {
    resultData.push(a.data[i], b.data[i]);
  }
  return matrix([length as any, 2] as const, resultData);
}

// Tests
const threeElementVector1 = vector([1, 2, 3]);
const threeElementVector2 = vector([4, 5, 6]);
const fourElementVector1 = vector([7, 8, 9, 10]);

const zipped = zip(threeElementVector1, threeElementVector2);

const zippedError = zip(threeElementVector1, fourElementVector1);
// error TS2345: Argument of type 'Vector<4>' is not assignable to
// parameter of type 'Vector<3>'. Type '4' is not assignable to type '3'.

const threeElementVector3 = vector(Var(3, "three"), [1, 2, 3]);
const threeElementVector4 = vector(Var(3, "three"), [5, 10, 15]);
const fourElementVector2 = vector(Var(4, "four"), [10, 11, 12, 13]);

const zipped2 = zip(threeElementVector3, threeElementVector4);

const zippedError2 = zip(threeElementVector3, fourElementVector2);
// error TS2345: Argument of type 'Vector<Var<"four">>' is not assignable
// to parameter of type 'Vector<Var<"three">>'. Types of property 'label'
// are incompatible: Type '"four"' is not assignable to type '"three"'.
```
Finally, functions like `matmul`, which expect two operands with different but compatible shapes, can be implemented using the same techniques:
```typescript
function matmul<
  RowsA extends Dimension,
  SharedDimension extends Dimension,
  ColumnsB extends Dimension
>(
  a: Matrix<RowsA, SharedDimension>,
  b: IsNumericLiteralOrVar<SharedDimension> extends true
    ? Matrix<SharedDimension, ColumnsB>
    : InvalidArgument<"The rows dimension of the `b` matrix must match the columns dimension of the `a` matrix.">
): Matrix<RowsA, ColumnsB> {
  const aMatrix = a;
  const bMatrix = b as Matrix<SharedDimension, ColumnsB>;
  const [aRows, aCols] = aMatrix.shape;
  const [bRows, bCols] = bMatrix.shape;
  if (aCols !== bRows) {
    throw new Error(
      "The rows dimension of the `b` matrix must match the columns dimension of the `a` matrix."
    );
  }
  const shape = [aRows, bCols] as AssertShapeEveryElementIsNumericLiteralOrVar<
    [RowsA, ColumnsB]
  >;
  const data = Array<number>(aRows * bCols).fill(0);
  for (let rowIndex = 0; rowIndex < aRows; rowIndex++) {
    for (let columnIndex = 0; columnIndex < bCols; columnIndex++) {
      let dotProduct = 0;
      for (
        let sharedDimensionIndex = 0;
        sharedDimensionIndex < aCols;
        sharedDimensionIndex++
      ) {
        const rowCellFromA =
          aMatrix.data[rowIndex * aCols + sharedDimensionIndex];
        const columnCellFromB =
          bMatrix.data[sharedDimensionIndex * bCols + columnIndex];
        dotProduct += rowCellFromA * columnCellFromB;
      }
      data[rowIndex * bCols + columnIndex] = dotProduct;
    }
  }
  return matrix(shape, data);
}

// Tests
const a = matrix([2, 3] as const);
const b = matrix([3, 2] as const);
const c = matrix([7, 7] as const);

const validMatmul = matmul(a, b);

const invalidMatmul = matmul(a, c);
// error TS2345: Argument of type 'Matrix<7, 7>' is not assignable to
// parameter of type 'InvalidArgument<"The rows dimension of the `b`
// matrix must match the columns dimension of the `a` matrix.">'.
```