<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Stanley Wang's Blog</title>
        <link>https://stanleywang.dev</link>
        <description>Writing about tech, algorithms, and more</description>
        <lastBuildDate>Thu, 07 May 2026 11:40:21 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>All rights reserved 2026</copyright>
        <item>
            <title><![CDATA[Exploring a Recurrent Neural Network]]></title>
            <link>https://stanleywang.dev/writing/exploring-a-recurrent-neural-network</link>
            <guid>https://stanleywang.dev/writing/exploring-a-recurrent-neural-network</guid>
            <pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[COMP 551 Notes]]></description>
            <content:encoded><![CDATA[<TOC title="Exploring a Recurrent Neural Network" variant="vim" />

These notes walk through Andrej Karpathy's minimal character-level **Recurrent Neural Network (RNN)** implementation. [^1]

RNNs are special <Note aside="they reuse the same weights across every timestep while carrying a hidden state, so context flows forward through the sequence">'lil guys</Note> because, unlike feedforward nets whose computation graphs are acyclic, they feed the hidden state back into the next step and reuse the same weights at every timestep, so each prediction can use context from previous steps. That makes them especially useful for _sequence tasks_ such as text generation, speech modeling, and frame-by-frame video classification, where earlier inputs should influence later predictions.

A **sequence** is just an ordered list over time: characters, words, audio frames, video frames. A **timestep** is one position in that list. At timestep $t$, the model reads $x_t$, combines it with $h_{t-1}$, and produces $h_t$ plus a prediction.

Because the same update rule is reused at every step, one RNN can map many different sequence lengths using the same parameters. In this note, we use a character-level **many-to-many** mapping: each input character predicts the next one. Example: for the chunk `hello`, the inputs are `hell` and the shifted targets are `ello`, so each timestep contributes one prediction and one loss term.
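
To make that concrete, here is a throwaway sketch (not part of the original gist) of how the shifted inputs and targets line up for that chunk:

```python
# toy illustration of the shifted-target setup (not from the gist itself)
text = "hello"
inputs, targets = list(text[:-1]), list(text[1:])
print(inputs)   # ['h', 'e', 'l', 'l']
print(targets)  # ['e', 'l', 'l', 'o']
# at timestep t, the model reads inputs[t] and is trained to predict targets[t]
```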

## Data I/O

```python showLineNumbers
import numpy as np
# data I/O
data = open('input.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print("data has %d characters, %d unique." % (data_size, vocab_size))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}
```

- Read all characters in the corpus.
- Extract the unique character set, the vocabulary.
- Build lookup maps: character to index and index to character.

```text
Small toy example:

input.txt:
hello

chars = ['h', 'e', 'l', 'o']
data_size = 5
vocab_size = 4
char_to_ix = {'h': 0, 'e': 1, 'l': 2, 'o': 3}
ix_to_char = {0: 'h', 1: 'e', 2: 'l', 3: 'o'}

Larger running example:

hello world is the most commonly used starter program in code today.
it is taught world wide in many many languages which is why is it so famous.
beginners often write it as their very first line of code.
seeing those words print to the screen gives a great sense of accomplishment.
from there, developers move on to more complex topics like loops and functions.
it truly serves as the universal greeting of the programming community.

data_size = 436 characters
vocab_size = 27 unique characters
```

## Initializations

```python showLineNumbers
# hyperparameters
hidden_size = 100  # size of hidden layer of neurons
seq_length = 25  # number of steps to unroll the RNN for
learning_rate = 1e-1

# model parameters
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01  # input to hidden 100 x 27
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden 100 x 100
Why = np.random.randn(vocab_size, hidden_size) * 0.01  # hidden to output 27 x 100
bh = np.zeros((hidden_size, 1))  # hidden bias, 100 x 1
by = np.zeros((vocab_size, 1))  # output bias, 27 x 1
```

`seq_length{:python}` is the number of steps we unroll the RNN before doing
backpropagation. The full dataset can be much larger, so we train on chunks of
25 characters at a time in this implementation.

The weight matrices are initialized to small random values so the network starts with small, non-identical activations instead of a symmetric state.

## Main Loop

```python showLineNumbers
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by)  # memory variables for Adagrad
smooth_loss = -np.log(1.0 / vocab_size) * seq_length  # loss at iteration 0
while True:
    # prepare inputs (we're sweeping from left to right in steps seq_length long)
    if p + seq_length + 1 >= len(data) or n == 0:
        hprev = np.zeros((hidden_size, 1))  # reset RNN memory
        p = 0  # go from start of data
    inputs = [char_to_ix[ch] for ch in data[p : p + seq_length]]
    targets = [char_to_ix[ch] for ch in data[p + 1 : p + seq_length + 1]]

    # sample from the model now and then
    if n % 100 == 0:
        sample_ix = sample(hprev, inputs[0], 200)
        txt = "".join(ix_to_char[ix] for ix in sample_ix)
        print("----\n %s \n----" % (txt,))
```

If the next chunk would run past the end of the data (or if this is the very first iteration), we reset $h_{t-1}$ to zeros and start sweeping from the beginning of the dataset again.

This first block takes the next chunk of 25 characters from the dataset:

- Inputs: the indices for `data[p : p + seq_length]{:python}`
- Targets: the indices for the next characters in `data[p + 1 : p + seq_length + 1]{:python}`

At every timestep, we want the network to predict the character directly following the current one.

Every 100 iterations we call `sample{:python}` to [visualize the current predictive power](#sample-function) of the RNN.

We then call the [loss function](#loss-function):

```python showLineNumbers {2}#c {3}#b
    # forward seq_length characters through the net and fetch gradient
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
    if n % 100 == 0:
        print("iter %d, loss: %f" % (n, smooth_loss))  # print progress
```

Because training runs on short 25-character chunks, the raw loss from one chunk to the next is noisy. We therefore track an [exponential moving average (EMA)](https://towardsdatascience.com/intuitive-explanation-of-exponential-moving-average-2eb9693ea4dc/) so the trend is easier to read.

Conceptually, this line says: keep 99.9% of our historical average, and mix in 0.1% of the new raw loss.
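
To see the effect with some made-up numbers (purely illustrative, not values from a real run), a few noisy chunk losses barely move the average:

```python
smooth_loss = 82.4
for loss in [80.0, 85.0, 60.0, 90.0]:   # hypothetical raw chunk losses
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
    print(round(smooth_loss, 3))
# 82.398, 82.4, 82.378, 82.385
```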

Lastly, we update the parameters with [Adagrad](https://optimization.cbe.cornell.edu/index.php?title=AdaGrad):

```python showLineNumbers
    # perform parameter update with Adagrad
    for param, dparam, mem in zip(
        [Wxh, Whh, Why, bh, by],
        [dWxh, dWhh, dWhy, dbh, dby],
        [mWxh, mWhh, mWhy, mbh, mby],
    ):
        mem += dparam * dparam
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8)  # adagrad update

    p += seq_length  # move data pointer
    n += 1  # iteration counter
```
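
Adagrad keeps a running sum of squared gradients per parameter in `mem{:python}`, so parameters that keep receiving large gradients take progressively smaller steps. A toy scalar version of the same update (illustrative only, not part of the gist):

```python
import numpy as np

param, mem, lr = 0.0, 0.0, 1e-1
for grad in [1.0, 1.0, 1.0, 1.0]:              # pretend the gradient stays constant
    mem += grad * grad
    step = -lr * grad / np.sqrt(mem + 1e-8)    # same formula as the update above
    param += step
    print(round(step, 4))
# -0.1, -0.0707, -0.0577, -0.05: the step shrinks as mem accumulates
```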

## Sample Function

This function acts as a preview of the RNN's current predictive power.

```python showLineNumbers
def sample(h, seed_ix, n):
    """
    sample a sequence of integers from the model
    h is memory state, seed_ix is seed letter for first time step
    """
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for t in range(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        y = np.dot(Why, h) + by
        p = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes
```

- At any point during training, we can generate what the model currently predicts the sequence should look like.
- We let it freestyle 200 characters using its current weights.

Operationally, we run a forward pass on the current state of the learned parameters to get a probability distribution over the next character. We sample one index from that distribution, feed that predicted character back in as the next input, and repeat.

In a standard non-recurrent neural network (e.g. [Convolutional Neural Networks](https://en.wikipedia.org/wiki/Convolutional_neural_network)), if you wanted to process 25 characters, you would need 25 separate hidden layers, each with its own set of weights.

But in an RNN, we use the **exact same three weight matrices** `Wxh{:python}`, `Whh{:python}`, and `Why{:python}` at step 1, step 2, step 15, and step 25. This is **weight sharing**. We are trying to learn the best overall parameters to make inferences at any timestep.

<Img
  src="/images/exploring-a-recurrent-neural-network/forward-pass-prediction.webp"
  alt="Diagram of the RNN hidden-state forward pass showing W_hh h_(t-1) and W_xh x_t being added to produce h_t"
  title="RNN Hidden State Forward Pass"
  caption="Computing the hidden-state forward pass with shared weights across timesteps."
  size={90}
/>
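
A back-of-the-envelope count with the running sizes (`hidden_size = 100{:python}`, `vocab_size = 27{:python}`) makes the savings from weight sharing concrete:

```python
hidden_size, vocab_size, seq_length = 100, 27, 25
per_step = (hidden_size * vocab_size      # Wxh
            + hidden_size * hidden_size   # Whh
            + vocab_size * hidden_size    # Why
            + hidden_size + vocab_size)   # bh, by
print(per_step)               # 15527 parameters, reused at every timestep
print(per_step * seq_length)  # 388175 if each of the 25 steps had its own copy
```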

When we first start training, we can glimpse what the RNN is currently capable of predicting:

```text
----
 w nkuebtakl.mx,vbrpra.wfcdvne
r.ymdsubiugslp,ibtc.pstv
bo khd hwp.oknypaefgius.s ksvhxfhhuy,.d bxrt
rueto y,ic
gfhfheui,pcxvxyx bnou,hcvrkuhtvreur.lvd,dfbcxn
 mcblnhtellgey.avfnyyrltgeu,yot.cocposuhrc
----
iter 0, loss: 82.395918
----
```

It's nothing to write home about, but as one might expect, we see a marked improvement down the line after **heavily minimizing our loss**:

```text
----
 taught world program in code today.
it is taught world wide in many many languages which is why is it so famous.
beginners print to taugr ag s theit very firse of it as their very farstaline ofveens.
----
iter 33000, loss: 1.283691
----
```

## Loss Function

In the loss function we compute the forward pass to get the loss, and the backward pass to compute parameter gradients.

```python showLineNumbers
def lossFun(inputs, targets, hprev):
    """
    inputs,targets are both list of integers.
    hprev is Hx1 array of initial hidden state
    returns the loss, gradients on model parameters, and last hidden state
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0

```

### Forward

```python showLineNumbers
    # forward pass
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))  # encode in 1-of-k representation
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(
            np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t - 1]) + bh
        )  # hidden state
        ys[t] = np.dot(Why, hs[t]) + by  # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # probabilities for next chars
        loss += -np.log(ps[t][targets[t], 0])  # softmax (cross-entropy loss)
```

We receive the 25 input indices and their 25 target indices, and iterate through the inputs from left to right.

- Start at `t = 0{:python}`
- Create `xs[t]{:python}` as a zero vector
- One-hot encode the current character index into that vector

Then we compute the recurrence:

$$
\begin{align*}
h_{t} &= \text{tanh}(W_{hh}h_{t-1} + W_{xh}x_{t} + b_{h}) \\
\\
y_{t} &= W_{hy}h_{t} + b_{y}
\end{align*}
$$

Using the dictionaries defined at the start of the function, we cache hidden states for every timestep in the 25-step window. Backpropagation needs these cached values, while inference during sampling does not. In each iteration, gradients are computed only over the current 25-character window; the next iteration advances to the next window. This is **Truncated Backpropagation Through Time**.

We now have the logits in `ys[t]{:python}`, which are normalized by softmax and turned into a distribution over the next character.

The total chunk loss is just the sum of the per-timestep prediction errors, and that accumulated loss is what drives one parameter update for the whole window.
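
As a sanity check, an untrained model is roughly uniform over the 27 characters, so each timestep costs about $-\log(1/27)\approx 3.30$ and a 25-step chunk costs about $82.4$. That is exactly how `smooth_loss{:python}` is initialized in the main loop, and it matches the loss printed at iteration 0:

```python
import numpy as np

vocab_size, seq_length = 27, 25
per_step = -np.log(1.0 / vocab_size)   # ~3.2958 nats for a uniform guess
print(per_step * seq_length)           # ~82.4, cf. "iter 0, loss: 82.395918"
```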

### Backward

```python showLineNumbers
    # backward pass: compute gradients going backwards
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1  # backprop into y
        dWhy += np.dot(dy, hs[t].T)
        dby += dy
        dh = np.dot(Why.T, dy) + dhnext  # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh  # backprop through tanh nonlinearity
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t - 1].T)
        dhnext = np.dot(Whh.T, dhraw)
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)  # clip to mitigate exploding gradients
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs) - 1]
```

We iterate backward through timesteps, propagating gradients through the unrolled network.

#### Backprop Through Softmax

```python
dy = np.copy(ps[t])
dy[targets[t]] -= 1
```

Recall that in this example we have $K = 27$ total classes, one for each unique character in the dataset.

During the forward pass, at each timestep $t$, we compute logits over all classes $k \in \{1, \dots, K\}$:

$$
\hat p_{k} = \frac{\exp(y_{k})}{\sum_{j=1}^{K}\exp(y_{j})}
\quad \text{(softmax)}
$$

If the target class index is $c = \texttt{targets}[t]$, the per-timestep loss is:

$$
L_t = -\log(\hat p_{c})
\quad \text{(cross-entropy)}
$$

The total sequence loss accumulates across timesteps as $L = \sum_t L_t$, which is exactly what `loss += -np.log(ps[t][targets[t], 0]){:python}` computes.

The generic categorical cross-entropy expression is:

$$
L = -\sum_{k=1}^{K} p_k \log(\hat p_k)
$$

Because the target distribution is one-hot, all terms vanish except the correct class term, giving us:

$$
L = -\log(\hat p_c)
$$

Now, during backprop for `dy{:python}`, we chain gradients from the loss back to the logits at timestep $t$:

$$
\begin{align*}
\dots &\to y_t \to \hat p_t \to L_t \quad \text{(forward pass)} \\
\\
\dots &\gets \frac{\partial L_t}{\partial \hat p_t}
\gets \frac{\partial \hat p_t}{\partial y_t}
\Rightarrow \frac{\partial L_t}{\partial y_t}
\quad \text{(backward pass for } dy\text{)}
\end{align*}
$$

Keeping $t$ fixed and only writing class indices $k$ and $c$:

**Step 01:** Gradient of the loss with respect to the softmax probability of the correct class $c$:

$$
\frac{\partial L_t}{\partial \hat p_c}
= \frac{\partial}{\partial \hat p_c}(-\log \hat p_c)
= -\frac{1}{\hat p_c}
$$

Only the correct-class term contributes directly to the loss because the targets are one-hot.

**Step 02:** Gradient of the softmax output $\hat p_c$ with respect to logit $y_k$:

The softmax Jacobian has two cases:

$$
\frac{\partial \hat p_c}{\partial y_k} =
\begin{cases}
\hat p_c(1 - \hat p_c), & k = c \\
\\
-\hat p_c \hat p_k, & k \neq c
\end{cases}
$$

If $k$ is the target class, increasing $y_k$ increases $\hat p_c$, so the derivative is positive. If $k$ is not the target class, increasing $y_k$ pulls probability mass away from the target class, so the sign flips.

**Step 03:** Combine the pieces with the chain rule:

$$
\frac{\partial L_t}{\partial y_k}
= \frac{\partial L_t}{\partial \hat p_c}
\cdot
\frac{\partial \hat p_c}{\partial y_k}
$$

Substituting the two cases gives:

$$
\frac{\partial L_t}{\partial y_k} =
\begin{cases}
-\frac{1}{\hat p_c} \cdot \hat p_c(1 - \hat p_c) = \hat p_c - 1, & k = c \\
\\
-\frac{1}{\hat p_c} \cdot (-\hat p_c \hat p_k) = \hat p_k, & k \neq c
\end{cases}
$$

So the target class gets $\hat p_c - 1$, and every non-target class gets $\hat p_k$. That is exactly the compact update used in code.

Visually:

$$
\frac{\partial L_t}{\partial y} =
\begin{bmatrix}
\hat p_1 \\
\vdots \\
\hat p_c - 1 \\
\vdots \\
\hat p_K
\end{bmatrix}
$$
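
If you want to double-check this result, a quick finite-difference comparison on toy logits (made-up values, not taken from the model) agrees with the $\hat p_k - \mathbb{1}[k = c]$ formula to numerical precision:

```python
import numpy as np

y = np.array([1.0, 2.0, 0.5, -1.0])    # toy logits, K = 4
c = 1                                   # pretend class 1 is the target

def loss(y):
    p = np.exp(y) / np.sum(np.exp(y))
    return -np.log(p[c])

p = np.exp(y) / np.sum(np.exp(y))
analytic = p.copy()
analytic[c] -= 1                        # dy = p_hat, minus 1 at the target class

eps = 1e-5
numeric = np.array([
    (loss(y + eps * np.eye(4)[k]) - loss(y - eps * np.eye(4)[k])) / (2 * eps)
    for k in range(4)
])
print(np.max(np.abs(analytic - numeric)))   # ~1e-10: the two gradients match
```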

From there, the backward pass continues through `Why{:python}`, the `tanh{:python}` nonlinearity, and finally the hidden-to-hidden and input-to-hidden pathways. The implementation then clips gradients to the range `[-5, 5]{:python}` to reduce the risk of exploding gradients.
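
One step worth spelling out: `dhraw = (1 - hs[t] * hs[t]) * dh{:python}` uses the identity $\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$. Since `hs[t]{:python}` already stores the $\tanh$ of the pre-activation, the derivative can be written entirely in terms of the cached hidden state:

$$
\frac{\partial L}{\partial a_t} = (1 - h_t \odot h_t) \odot \frac{\partial L}{\partial h_t},
\qquad h_t = \tanh(a_t),\quad a_t = W_{xh}x_t + W_{hh}h_{t-1} + b_h
$$

where $\odot$ denotes elementwise multiplication.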

## Closing Notes

I kinda got lazy at the end and didn't wanna keep writing latex, so I decided to call it a night, but hopefully this was useful ;P

<StarDivider />

[^1]: Implementation: https://gist.github.com/karpathy/d4dee566867f8291f086]]></content:encoded>
            <category>Machine Learning</category>
            <category>RNNs</category>
        </item>
        <item>
            <title><![CDATA[Improvisation on the Artist]]></title>
            <link>https://stanleywang.dev/writing/improvisation-on-the-artist</link>
            <guid>https://stanleywang.dev/writing/improvisation-on-the-artist</guid>
            <pubDate>Sun, 15 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[On Jack Whitten and the adhesion of visual art & jazz.]]></description>
            <content:encoded><![CDATA[<TOC title="Improvisation on the Artist" variant="staffline" />

> My environment is unnatural, unsensual, tough and uncompromising. Within this milieu I have decided to create my art. The painting is not the conduit. I am the conduit.

As the hour ticked onward, my family grew more and more tired of me. Apart from a painting of an evil baby that my mom really seemed to enjoy, it was clear they'd had their fair share of being dragged along through the Museum of Modern Art.

So, I made promises to be hasty with my exploration of the sixth floor, where a new exhibition had recently been launched: [Jack Whitten: The Messenger](https://www.moma.org/calendar/exhibitions/5785).

---

I walked up the stairs, hearing the faint whisper of _So What_ by Miles Davis playing at the top. Reaching the landing, I became immediately drawn in by the sight of a cosmic window adorning the wall.

<Img
  src="/images/improvisation-on-the-artist/Homecoming.jpg"
  alt="Homecoming: For Miles Davis"
  caption="Jack Whitten, Homecoming: For Miles Davis, 1992. Image courtesy of Julian Myers and Joanna Szupinska"
/>

As my steps fell closer and closer, the painting's outer constellations began to exceed the boundaries of my vision. My burgeoning heartbeat burned up as I reached into the expanse of stars, ultimately silenced in the face of infinity portrayed behind each individual tile. Soon, the rest of me followed.

---

American artist **Jack Whitten** (1939-2018) created _Homecoming: For Miles Davis_ in commemoration of the jazz musician, with whom he had been close friends. The goal of the painting was to portray Davis's soul in accordance with how he saw it.

Whitten started by laying down sheets of acrylic and splattering individual droplets of white paint over them. After they had dried, he would carve each individual tile and rearrange them to form a "cosmic net." [^1]
After all was said and done, he would be left with what he called an _Acrylic Tesserae_, a mosaic composed of paint.

As I explored the rest of his exhibit, it became clear that Whitten had a deep connection to jazz, not just in his relationships, but in all aspects of his life and his art.

I believe that avenues exist between conceptual awe & understanding of a subject, collapsed in both directions by our experiences with it. It is up to us to color the entrances with curiosity and the ends with comprehension.

_And so, I let my curiosity carry me forward._

## Double Shift Originality

<Img
  src="/images/improvisation-on-the-artist/Jack Whitten 2.jpg"
  alt="Jack Whitten Portrait"
  caption="Portrait of Jack Whitten, 1973. © Jack Whitten Estate. Courtesy the Estate and Hauser & Wirth"
/>

For all of his life, Jack Whitten had felt a connection to jazz music.

He started as a sax player in junior high, years before he ever started painting. But after moving to New York, he knew his musicianship was not strong enough to participate in the experimental jazz scene.

<Quote mark>
The music was a way of me defining myself. I couldn't do it with the horn, so I figured I could deal with it in paint. [^2]
</Quote>

It became clear to me, as I explored how Whitten talked about music, that he saw jazz as a conduit which bound both forms of artistry—the visual and the audible. To him, his paintings were another way to create "the music"; they were an abstraction on how he wanted art to define him.

## Philosophy of Jazz

To Jack Whitten, jazz represented an "expansion of freedom" [^3] due to its embrace of improvisation; its spontaneous nature imposes no limits on feeling or expression.

Whitten saw improvisation as a necessary element of art. Without it, the "spirit" of art and jazz alike would, in essence, be stale and unmoving. But he also believed that spontaneity alone would not allow for art to connect with us. It needed help in the form of the "conceptual":

> The acceptance of spontaneity/improvisation does not **reject** the value of conceptual thought. Conceptualism is a tool in the service of spontaneity/improvisation. The multidimensional sheets of sound in John Coltrane's music could not reach cognition without the conceptual. [^4]

Whitten parallels "conceptualism" to composition: the structural elements that underlie all forms of art and music. There would be nothing to improvise over or incite spontaneity if not for that compositional skeleton. Hence, art cannot be truly felt without both elements.

Imagine, for example, what a musician might need when improvising. The forming parts of a song like the key, the time, the motifs, and the melodies serve as the structure intended for the musician to push and pull away from. Without these, the improvisation would have nothing to contradict and nothing to magnify.

Whitten wished for his own work to represent these ideas, desiring his "color to be improvisational." [^5]

Knowing this, it's not surprising that Whitten's philosophy on jazz can be witnessed in every aspect of his craft. His paintings were not separated from the ways in which he thought about improvisation and conceptualism. They were, if nothing else, an embodiment of those principles.

To him, he _was_ creating jazz.

### Russian Speedway

Enter _Russian Speedway_, an oil painting that Whitten created in 1971.

<Img
  src="/images/improvisation-on-the-artist/Russian Speedway.jpg"
  alt="Russian Speedway"
  caption="Jack Whitten, Russian Speedway, 1971. Image courtesy of Hauser & Wirth"
/>

He started by laying down the compositional foundation of the piece, pooling layer after layer of paint on top of one another. Then, using a T-Shaped tool he fashioned called the 'Developer,' he swiftly spread the topmost layer of paint in a _single improvised gesture_ carried across the canvas.

<Img
  src="/images/improvisation-on-the-artist/the developer.webp"
  alt="Jack Whitten's Developer Tool"
  caption="Jack Whitten's 'Developer,' 1970. Image courtesy of The New York Times"
/>

Whitten transposed this technique onto other paintings like _Chinese Sincerity_ (1974). The process of layering paint would always remain the same between these instances, but the moment in which it was ruptured was always executed improvisationally.

According to Whitten, it's in these moments that he has been programmed to act—to "purely conceptualize it"—like how a jazz musician can play without thinking after years of technical training. [^6] It's in these moments that his paintings are made. [^7]

## Homecoming

I view Jack Whitten's expression of Miles Davis through the lens of how [Yousuf Karsh portrayed cellist Pablo Casals](https://karsh.org/photographs/pablo-casals/)—his portrait capturing not physical likeness, but artistic essence.

> The point I want to make with painting is that abstraction, as we know it, can be
> directed towards the specifics of subject—a person, a thing, an experience. [^8]

Whitten states in his studio journal that his paint is the "synthesis of concreteness plus abstraction," [^9] the composition plus improvisation. If we view abstraction as the spontaneous element, we can see that Whitten directs his improvisations towards the specifics of his subjects.

In _Homecoming_, the subject _is_ Miles Davis, and the specifics are the essence and soul he put into his jazz. These specifics play the same role that the **key** does in a musical composition.

Whitten's improvisation breaks us out of Davis's figurative key by portraying his soul through a **visual medium**, pushing away from and abstracting on top of its traditionally **auditory nature**. It is the structural element necessary for spontaneity to be achieved. But like all good jazz, there is a push and pull away from form.

Given that Davis's ability to improvise was also something so central to his style and identity as a jazz musician, he and Whitten become bound through that shared form of artistic expression. Again, we see the [visual and the audible bridged together with jazz as their conduit](#double-shift-originality).
This expression, previously informing the _contradiction of structure_ within Whitten's portrayal of Davis's soul, becomes the very thing that pulls it _back into_ key.

---

Despite the differences in the mode and medium in which Whitten and Davis navigated jazz, I see their improvisations as the breathtaking foundation behind _Homecoming_.

> Experimentation is the key. I believe that there are sounds we have not heard.      
> I believe that there are colors we have not seen. And I believe that there are feelings yet to be felt. [^10]

<iframe
  src="https://open.spotify.com/embed/track/4vLYewWIvqHfKtJDk8c8tq?utm_source=generator"
  width="100%"
  height="152"
  frameBorder="0"
  allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture"
  loading="lazy"
/>

<StarDivider />

[^1]: Museum of Modern Art. Wall caption for Jack Whitten, Homecoming: For Miles Davis, 1992.

[^2]: Museum of Modern Art. ["Jack Whitten Light Sheet I. 1969."](https://www.moma.org/audio/playlist/345/4699)

[^3]: Whitten, Jack. _Notes from the Woodshed_, p. 316.

[^4]: Whitten, Jack. _Notes from the Woodshed_, p. 411.

[^5]: Whitten, Jack. _Notes from the Woodshed_, p. 283.

[^6]: Sortor, Emily. ["Jack Whitten's Memorial Paintings."](https://walkerart.org/magazine/jack-whittens-memorial-paintings-2/) _Walker Art Center Magazine_.

[^7]: Sung, Victoria. ["Jack Whitten and the Philosophy of Jazz."](https://walkerart.org/magazine/jack-whitten-and-the-philosophy-of-jazz/) _Walker Art Center Magazine_.

[^8]: Alexander Gray Associates. ["Jack Whitten in Conversation."](http://prod-images.exhibit-e.com/www_alexandergray_com/Whitten_Exhibition_Catalogue_9_11_2013.pdf), p. 3.

[^9]: Whitten, Jack. _Notes from the Woodshed_, p. 367.

[^10]: Whitten, Jack. _Notes from the Woodshed_, p. 410.]]></content:encoded>
            <category>Jazz</category>
            <category>Art</category>
        </item>
        <item>
            <title><![CDATA[Dijkstra's Algorithm]]></title>
            <link>https://stanleywang.dev/writing/dijkstras-algorithm</link>
            <guid>https://stanleywang.dev/writing/dijkstras-algorithm</guid>
            <pubDate>Mon, 21 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[COMP 251 Notes]]></description>
            <content:encoded><![CDATA[<TOC title="Dijkstra's Algorithm" />

{/* #### Quick Definition */}

**Quick Recap on Dijkstra's:**

- no negative-weight edges
- weighted version of breadth-first-search
  - instead of FIFO queue, use priority queue
  - keys are shortest-path weights $d[v]$
- two sets of vertices
  - $S$ vertices whose final shortest-path weights are already found
  - $Q$ priority queue = vertices still in the queue, i.e. those whose shortest-path weights we still need to determine
- greedy-choice: at each step we choose the light edge

<StarDivider />

## Code

```python showLineNumbers
def init_single_source(vertices, s):
	for v in vertices:
		d[v] = inf    # shortest-path estimate
		p[v] = None   # predecessor
	d[s] = 0

def relax(u, v, w):
	if d[v] > d[u] + w(u, v):
		d[v] = d[u] + w(u, v)
		p[v] = u

def dijkstra(V, E, w, s):
	init_single_source(V, s)
	S = set()              # vertices with finalized estimates
	Q = priority_queue(V)  # min-priority queue keyed on d[v]

	while Q is not empty:
		u = extract_min(Q)
		S.add(u)
		for v in adj_list[u]:
			relax(u, v, w)
```
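
The block above is lecture-style pseudocode: `priority_queue{:python}`, `extract_min{:python}`, and `adj_list{:python}` are left abstract. As a runnable point of reference (my own sketch using `heapq{:python}` and an adjacency dict, not part of the course notes), the same algorithm looks like this:

```python
import heapq
from math import inf

def dijkstra(adj, s):
    """adj maps each vertex to a list of (neighbor, weight) pairs."""
    d = {v: inf for v in adj}    # shortest-path estimates
    p = {v: None for v in adj}   # predecessors
    d[s] = 0
    S = set()                    # vertices with finalized estimates
    pq = [(0, s)]                # min-heap keyed on d[v]
    while pq:
        _, u = heapq.heappop(pq)             # extract_min
        if u in S:
            continue                         # skip stale heap entries
        S.add(u)
        for v, weight in adj[u]:             # relax every edge leaving u
            if d[v] > d[u] + weight:
                d[v] = d[u] + weight
                p[v] = u
                heapq.heappush(pq, (d[v], v))
    return d, p

adj = {'a': [('b', 1), ('c', 4)], 'b': [('c', 2)], 'c': []}
print(dijkstra(adj, 'a')[0])   # {'a': 0, 'b': 1, 'c': 3}
```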

## Mathematical Context For Path Properties

First, let's cover the working parts & properties:

### Triangle Inequality

The **Triangle Inequality** states that for all $(u,v)\in E$, we have $\delta(s,v) \leq \delta(s,u)+w(u,v)$.

#### Proof

- we have a path $s\leadsto u\to v$, as well as a shortest path $s\leadsto v$
  - the weight of the shortest path $s\leadsto v$ is $\leq$ the weight of **any** path $s\leadsto v$
- take the particular path that follows a shortest path $s\leadsto u$ (weight $\delta(s,u)$) and then the edge $u\to v$
- the weight along $s\leadsto u\to v$ is $\delta(s,u)+w(u,v)$, so $\delta(s,v)\leq\delta(s,u)+w(u,v)$

<Img
  src="/images/DijkstraProof-attachment.jpeg"
  alt="Dijkstra Proof"
  caption="Triangle Inequality"
  invert
  size={60}
/>

### Upper Bound Property

The **Upper Bound Property** states that we always have $d[v]\geq\delta(s,v),\forall v\in V$. Once $d[v]=\delta(s,v)$, it never changes.

#### Proof by Contradiction on Inequality

- let's assume this starts initially true
- then, assume $\exists v\in V;d[v]<\delta(s,v)$
  - this is the first instance of it happening

We know that this can't have happened at initialization, because `init_single_source()` sets all $d[v]=\infty$, therefore this must have happened at some point during the algorithm's run time.

Let $u$ be the vertex that causes $d[v]$ to change, since in order for us to have altered $d[v]$, `relax(u, v, w){:python}` must have been called.

Within `relax(u, v, w){:python}`, $d[v]$ is altered only if:

- `d[v] > d[u] + w(u,v)` evaluates to true.
- if so, `d[v] = d[u] + w(u, v)` is the change made to $d[v]$

Recall our initial assumption, we have: $d[v]<\delta(s,v)$

1. via [Triangle Inequality](#triangle-inequality), $\delta(s,v)\leq\delta(s,u)+w(u,v)$
2. $\delta(s,u)\leq d[u]$, since $v$ was the first vertex whose estimate dropped below its shortest-path weight, meaning:
   - $\delta(s,u)\leq d[u]\implies \delta(s,u)+w(u,v)\leq d[u]+w(u,v)$
3. this results in the full inequality:

$$
d[v]<\delta(s,v)\leq\delta(s,u)+w(u,v)\leq d[u]+w(u,v)
$$

However, this is impossible: `relax(u, v, w){:python}` set $d[v]=d[u]+w(u,v)$, and $d[v]$ cannot be equal to $d[u]+w(u,v)$ and strictly less than it at the same time.

Thus, we have proved $\delta(s,v)\leq d[v],\forall v\in V$. $\blacksquare$

<StarDivider />

### No-Path Property

The **No-Path Property** states that if $\delta(s,v)=\infty$, then $d[v]$ will **always** equal $\infty$

#### Proof

- via the [Upper Bound Property](#upper-bound-property), $d[v]\geq\delta(s,v)$
- this means $\delta(s,v)=\infty \implies d[v]=\infty$

$\blacksquare$

<StarDivider />

### Convergence Property

The **Convergence Property** states that if:

1. we have a path $s\leadsto u\to v$ whose weight is $\delta(s,v)$ – (it is a shortest path)
2. $d[u]=\delta(s,u)$
3. we call `relax(u, v, w){:python}`,

then $d[v]=\delta(s,v)$ afterward.

#### Proof

We relax $v$ within this code:

```python
if d[v] > d[u] + w(u, v):
	d[v] = d[u] + w(u, v)
	p[v] = u
```

After this code, $d[v]\leq d[u]+w(u,v)$, because when entering `relax(u, v, w){:python}`:

1. if $d[v]$ was $\leq d[u]+w(u,v)$ – we would bypass the if-condition, and nothing happens
2. if $d[v]$ was $>$, then it is set $=d[u]+w(u,v)$

These are the only two cases, and both result in $d[v]\leq d[u]+w(u,v)$.

We can take the RHS and simplify it, as we have defined $d[u]=\delta(s,u)$:

$$
d[v]\leq\;\;\;\delta(s,u)+w(u,v)
$$

Since we defined $s\leadsto u\to v$ to be a shortest path, meaning:

$$
d[v]\leq\;\;\;\delta(s,v)=\delta(s,u)+w(u,v)
$$

Finally, by the [Upper Bound Property](#upper-bound-property), we know that $d[v]\geq \delta(s,v)$. This means we must have $d[v]=\delta(s,v)$. $\blacksquare$

<StarDivider />

### Path Relaxation Property

Let $p = \langle v_{0},v_{1},\dots,v_{k} \rangle$ be a shortest path from $s=v_{0}$ to $v_{k}$. Relaxing these edges, **in order**, will ensure that $d[v_{k}]=\delta(v_{0},v_{k})$. (The shortest path estimate at $v_{k}$ is the correct one).

#### Proof by Induction

We will show via induction on $i$ that $d[v_{i}]=\delta(s,v_{i})$ after the edge $(v_{i-1},v_{i})$ is relaxed.

**Base Case:** $i=0$, and $v_{0}=s$.

At initialization in `init_single_source(){:python}`, we set $d[s]=0$, and since there are no negative-weight edges, $\delta(s,s)=0$, so $d[s]=\delta(s,s)$.

**Inductive Step:** Assume $d[v_{i-1}]=\delta(s, v_{i-1})$.

As we relax edge $(v_{i-1},v_{i})$, note that we have met the pre-conditions for the [Convergence Property](#convergence-property):

1. we have a shortest path $s\leadsto v_{i-1} \to v_{i} \leadsto v_{k}$
   - by optimal substructure, the path $s\leadsto v_{i-1} \to v_{i}$ must also be a shortest path $\delta(s,v_{i})$
2. we have $d[v_{i-1}]=\delta(s,v_{i-1})$
3. we are now calling `relax` on $(v_{i-1},v_{i})$

hence, $d[v_{i}]$ converges to $\delta(s,v_{i})$ and never changes.

We have proved by induction that $d[v_{k}]=\delta(v_{0},v_{k})$ if we relax the edges in order. $\blacksquare$

<StarDivider />

## Dijkstra's Proof

### via Loop Invariant

We will prove via a **Loop Invariant** that Dijkstra's Algorithm is correct.

```python
def dijkstra(V, E, w, s):
	init_single_source(V, s)
	S = set() # init empty
	Q = priority_queue(V)

	while Q is not empty:
		u = extract_min(Q)
		S.add(u) # [!code hl]
		for v in adj_list[u]:
			relax(u, v, w)
```

**Loop Invariant:** At the end of each iteration of the while loop, $d[v]=\delta(s,v), \forall v\in S$

#### Initialization

At initialization, $S$ is an empty set, and so the loop invariant holds as a by-product of having no $v\in S$ yet.

#### Maintenance

Show that $d[v] = \delta(s,v)$ when $v$ is added to $S$ in each iteration.

We will prove the maintenance property through contradiction:

Assume that for the first time, after an iteration on some vertex $v$, we have added $v$ to $S$, and $d[v]\neq\delta(s,v)$.

What do we know?

For starters, we know that $v\neq s$, as $d[s]=\delta(s,s)=0$. This means that $s\in S$ and $S\neq \emptyset$ when $v$ is added.

We also know, by the [No-Path Property](#no-path-property), there **exists some path** $s\leadsto v$. Otherwise, the property states that:

$$
\{\;\delta(s,v)=\infty\; \} \implies \{ \; d[v] \text{ will always }= \infty \;\} \implies \{ \;d[v]=\infty=\delta(s,v)\;\}
$$

which contradicts our assumption that $d[v]\neq\delta(s,v)$.

Since there exists a path $s\leadsto v$, there **must** exist a shortest path, $p$, from $s\leadsto v$.

Allow us to decompose $p$ into $s\leadsto^{p_{1}} x\to y\leadsto^{p_{2}}v$, such that $s,x\in S$, $y,v\in Q$, and edge $(x,y)$ is the edge crossing the two sets $S,Q$.

**_Claim:_** $d[y]=\delta(s,y)$ when $v$ is added to $S$

**Proof:**

1. by optimal substructure, any subpath within $s\leadsto v$, such as $s\leadsto x\to y$, is a shortest path as well
2. $x\in S \implies d[x] = \delta(s,x)$
3. we called `relax` on edge $(x,y)$ at the time of adding $x$ to $S$
   - so by the [Convergence Property](#convergence-property), $d[y]=\delta(s,y)$

This means that if $y=v$, we have already reached a contradiction, as our initial assumption was:

$$
d[v] \neq\delta(s,v)
$$

and we have just proved that the estimate **is** the correct delta, and the proof is finished.

However, what if $y\neq v$? Can we still reach a contradiction?

Once again, we know that there exists a shortest path $p$ from $s\leadsto v$ via the [No-Path Property](#no-path-property), and that any subpath along $p$ is also a shortest path by optimal substructure. This implies a chain of logic:

1. $s\leadsto y$ is a shortest path
2. by our **_Claim_**, $d[y]=\delta(s,y)$
3. since there are no _negative_ edge weights, a shortest path $s\leadsto y\leadsto v$ must weigh at least $\delta(s,y)$, meaning:
4. $s\leadsto y\leadsto v\implies\delta(s,y)\leq\delta(s,v)$
   - this is because the shortest path $p$ to $v$ passes through $y$
5. $\delta(s,v)\leq d[v]$ by the [Upper Bound Property](#upper-bound-property)

Putting this all together:

$$
\begin{align*}
\ d[y]&=\delta(s,y)\quad\text{(2)}\\
\ &\leq \delta(s,v)\quad\text{(4)}\\
\ &\leq d[v]\quad \quad \, \text{(5)}\\
\ &\implies d[y] \leq d[v]\\
\end{align*}
$$

Lastly, we know that:

- we are in the iteration of the while loop where we **choose $v$**
- $Q$ stores a vertex $v$ as a key-value pair `{ v : d[v] }{:python}`
- `extract_min(Q){:python}` chooses to extract the vertex `v` if `d[v]{:python}` is minimum across all estimates it finds in `Q`
- both $v$ and $y$ were in $Q$ when we chose $v$

This means in order for $v$ to have been chosen, $d[v]\leq d[y]$. We can conclude:

$$
d[v]\leq d[y] \land d[y]\leq d[v]\implies d[v]=d[y]
$$

The estimate $d[v]$ **must be equal** to the estimate $d[y]$ if it is both $\leq$ and $\geq$ $d[y]$. This, again, contradicts our initial assumption that $d[v]\neq\delta(s,v)$ as $d[v]=d[y]$, and $d[y]=\delta(s,y)$ by our initial **_Claim_**. $\blacksquare$

#### Termination

At the end of the while loop, $Q$ (which started out equal to $V$) is now $\emptyset$. At each iteration, we added the extracted vertex to $S$, meaning that now, $S=V$. This implies that:

$$
d[v]=\delta(s,v),\;\forall v\in V
$$

The loop invariant has been shown to hold across initialization, maintenance, and termination, thus proving Dijkstra's Algorithm. $\blacksquare$]]></content:encoded>
            <category>Algorithms</category>
        </item>
    </channel>
</rss>