<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Stanley Wang's Blog</title>
        <link>https://stanleywang.dev</link>
        <description>Writing about tech, algorithms, and more</description>
        <lastBuildDate>Thu, 07 May 2026 11:40:21 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>All rights reserved 2026</copyright>
        <item>
            <title><![CDATA[Exploring a Recurrent Neural Network]]></title>
            <link>https://stanleywang.dev/writing/exploring-a-recurrent-neural-network</link>
            <guid>https://stanleywang.dev/writing/exploring-a-recurrent-neural-network</guid>
            <pubDate>Sat, 28 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[COMP 551 Notes]]></description>
            <content:encoded><![CDATA[<TOC title="Exploring a Recurrent Neural Network" variant="vim" />

These notes walk through Andrej Karpathy's minimal character-level **Recurrent Neural Network (RNN)** implementation. [^1]

RNNs are special <Note aside="they reuse the same weights across every timestep while carrying a hidden state, so context flows forward through the sequence">'lil guys</Note> because, unlike feedforward nets whose computation graphs are acyclic, they feed the hidden state back into the next step and reuse the same weights at every timestep, so each prediction can use context from previous steps. That makes them especially useful for _sequence tasks_ such as text generation, speech modeling, and frame-by-frame video classification, where earlier inputs should influence later predictions.

A **sequence** is just an ordered list over time: characters, words, audio frames, video frames. A **timestep** is one position in that list. At timestep $t$, the model reads $x_t$, combines it with $h_{t-1}$, and produces $h_t$ plus a prediction.

Because the same update rule is reused at every step, one RNN can map many different sequence lengths using the same parameters. In this note, we use a character-level **many-to-many** mapping: each input character predicts the next one. Example: for the chunk `hello`, the inputs are `hell` and the shifted targets are `ello`, so each timestep contributes one prediction and one loss term.
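
To make that concrete, here is a throwaway sketch (not part of the original gist) of how the shifted inputs and targets line up for that chunk:

```python
# toy illustration of the shifted-target setup (not from the gist itself)
text = "hello"
inputs, targets = list(text[:-1]), list(text[1:])
print(inputs)   # ['h', 'e', 'l', 'l']
print(targets)  # ['e', 'l', 'l', 'o']
# at timestep t, the model reads inputs[t] and is trained to predict targets[t]
```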

## Data I/O

```python showLineNumbers
import numpy as np
# data I/O
data = open('input.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print("data has %d characters, %d unique." % (data_size, vocab_size))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}
```

- Read all characters in the corpus.
- Extract the unique character set, the vocabulary.
- Build lookup maps: character to index and index to character.

```text
Small toy example:

input.txt:
hello

chars = ['h', 'e', 'l', 'o']
data_size = 5
vocab_size = 4
char_to_ix = {'h': 0, 'e': 1, 'l': 2, 'o': 3}
ix_to_char = {0: 'h', 1: 'e', 2: 'l', 3: 'o'}

Larger running example:

hello world is the most commonly used starter program in code today.
it is taught world wide in many many languages which is why is it so famous.
beginners often write it as their very first line of code.
seeing those words print to the screen gives a great sense of accomplishment.
from there, developers move on to more complex topics like loops and functions.
it truly serves as the universal greeting of the programming community.

data_size = 436 characters
vocab_size = 27 unique characters
```

## Initializations

```python showLineNumbers
# hyperparameters
hidden_size = 100  # size of hidden layer of neurons
seq_length = 25  # number of steps to unroll the RNN for
learning_rate = 1e-1

# model parameters
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01  # input to hidden 100 x 27
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden 100 x 100
Why = np.random.randn(vocab_size, hidden_size) * 0.01  # hidden to output 27 x 100
bh = np.zeros((hidden_size, 1))  # hidden bias, 100 x 1
by = np.zeros((vocab_size, 1))  # output bias, 27 x 1
```

`seq_length{:python}` is the number of steps we unroll the RNN before doing
backpropagation. The full dataset can be much larger, so we train on chunks of
25 characters at a time in this implementation.

The weight matrices are initialized to small random values so the network starts with small, non-identical activations instead of a symmetric state.

## Main Loop

```python showLineNumbers
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by)  # memory variables for Adagrad
smooth_loss = -np.log(1.0 / vocab_size) * seq_length  # loss at iteration 0
while True:
    # prepare inputs (we're sweeping from left to right in steps seq_length long)
    if p + seq_length + 1 >= len(data) or n == 0:
        hprev = np.zeros((hidden_size, 1))  # reset RNN memory
        p = 0  # go from start of data
    inputs = [char_to_ix[ch] for ch in data[p : p + seq_length]]
    targets = [char_to_ix[ch] for ch in data[p + 1 : p + seq_length + 1]]

    # sample from the model now and then
    if n % 100 == 0:
        sample_ix = sample(hprev, inputs[0], 200)
        txt = "".join(ix_to_char[ix] for ix in sample_ix)
        print("----\n %s \n----" % (txt,))
```

If the next chunk would run past the end of the data (or if this is the very first iteration), we reset $h_{t-1}$ to zeros and start sweeping from the beginning of the dataset again.

This first block takes the next chunk of 25 characters from the dataset:

- Inputs: the indices for `data[p : p + seq_length]{:python}`
- Targets: the indices for the next characters in `data[p + 1 : p + seq_length + 1]{:python}`

At every timestep, we want the network to predict the character directly following the current one.

Every 100 iterations we call `sample{:python}` to [visualize the current predictive power](#sample-function) of the RNN.

We then call the [loss function](#loss-function):

```python showLineNumbers {2}#c {3}#b
    # forward seq_length characters through the net and fetch gradient
    loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
    if n % 100 == 0:
        print("iter %d, loss: %f" % (n, smooth_loss))  # print progress
```

Because training runs on short 25-character chunks, the raw loss from one chunk to the next is noisy. We therefore track an [exponential moving average (EMA)](https://towardsdatascience.com/intuitive-explanation-of-exponential-moving-average-2eb9693ea4dc/) so the trend is easier to read.

Conceptually, this line says: keep 99.9% of our historical average, and mix in 0.1% of the new raw loss.
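
To see the effect with some made-up numbers (purely illustrative, not values from a real run), a few noisy chunk losses barely move the average:

```python
smooth_loss = 82.4
for loss in [80.0, 85.0, 60.0, 90.0]:   # hypothetical raw chunk losses
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
    print(round(smooth_loss, 3))
# 82.398, 82.4, 82.378, 82.385
```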

Lastly, we update the parameters with [Adagrad](https://optimization.cbe.cornell.edu/index.php?title=AdaGrad):

```python showLineNumbers
    # perform parameter update with Adagrad
    for param, dparam, mem in zip(
        [Wxh, Whh, Why, bh, by],
        [dWxh, dWhh, dWhy, dbh, dby],
        [mWxh, mWhh, mWhy, mbh, mby],
    ):
        mem += dparam * dparam
        param += -learning_rate * dparam / np.sqrt(mem + 1e-8)  # adagrad update

    p += seq_length  # move data pointer
    n += 1  # iteration counter
```
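
Adagrad keeps a running sum of squared gradients per parameter in `mem{:python}`, so parameters that keep receiving large gradients take progressively smaller steps. A toy scalar version of the same update (illustrative only, not part of the gist):

```python
import numpy as np

param, mem, lr = 0.0, 0.0, 1e-1
for grad in [1.0, 1.0, 1.0, 1.0]:              # pretend the gradient stays constant
    mem += grad * grad
    step = -lr * grad / np.sqrt(mem + 1e-8)    # same formula as the update above
    param += step
    print(round(step, 4))
# -0.1, -0.0707, -0.0577, -0.05: the step shrinks as mem accumulates
```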

## Sample Function

This function acts as a preview of the RNN's current predictive power.

```python showLineNumbers
def sample(h, seed_ix, n):
    """
    sample a sequence of integers from the model
    h is memory state, seed_ix is seed letter for first time step
    """
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1
    ixes = []
    for t in range(n):
        h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
        y = np.dot(Why, h) + by
        p = np.exp(y) / np.sum(np.exp(y))
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        x = np.zeros((vocab_size, 1))
        x[ix] = 1
        ixes.append(ix)
    return ixes
```

- At any point during training, we can generate what the model currently predicts the sequence should look like.
- We let it freestyle 200 characters using its current weights.

Operationally, we run a forward pass on the current state of the learned parameters to get a probability distribution over the next character. We sample one index from that distribution, feed that predicted character back in as the next input, and repeat.

In a standard non-recurrent neural network (e.g. [Convolutional Neural Networks](https://en.wikipedia.org/wiki/Convolutional_neural_network)), if you wanted to process 25 characters, you would need 25 separate hidden layers, each with its own set of weights.

But in an RNN, we use the **exact same three weight matrices** `Wxh{:python}`, `Whh{:python}`, and `Why{:python}` at step 1, step 2, step 15, and step 25. This is **weight sharing**. We are trying to learn the best overall parameters to make inferences at any timestep.

<Img
  src="/images/exploring-a-recurrent-neural-network/forward-pass-prediction.webp"
  alt="Diagram of the RNN hidden-state forward pass showing W_hh h_(t-1) and W_xh x_t being added to produce h_t"
  title="RNN Hidden State Forward Pass"
  caption="Computing the hidden-state forward pass with shared weights across timesteps."
  size={90}
/>
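
A back-of-the-envelope count with the running sizes (`hidden_size = 100{:python}`, `vocab_size = 27{:python}`) makes the savings from weight sharing concrete:

```python
hidden_size, vocab_size, seq_length = 100, 27, 25
per_step = (hidden_size * vocab_size      # Wxh
            + hidden_size * hidden_size   # Whh
            + vocab_size * hidden_size    # Why
            + hidden_size + vocab_size)   # bh, by
print(per_step)               # 15527 parameters, reused at every timestep
print(per_step * seq_length)  # 388175 if each of the 25 steps had its own copy
```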

When we first start training, we can glimpse what the RNN is currently capable of predicting:

```text
----
 w nkuebtakl.mx,vbrpra.wfcdvne
r.ymdsubiugslp,ibtc.pstv
bo khd hwp.oknypaefgius.s ksvhxfhhuy,.d bxrt
rueto y,ic
gfhfheui,pcxvxyx bnou,hcvrkuhtvreur.lvd,dfbcxn
 mcblnhtellgey.avfnyyrltgeu,yot.cocposuhrc
----
iter 0, loss: 82.395918
----
```

It's nothing to write home about, but as one might expect, we see a marked improvement down the line after **heavily minimizing our loss**:

```text
----
 taught world program in code today.
it is taught world wide in many many languages which is why is it so famous.
beginners print to taugr ag s theit very firse of it as their very farstaline ofveens.
----
iter 33000, loss: 1.283691
----
```

## Loss Function

In the loss function we compute the forward pass to get the loss, and the backward pass to compute parameter gradients.

```python showLineNumbers
def lossFun(inputs, targets, hprev):
    """
    inputs,targets are both list of integers.
    hprev is Hx1 array of initial hidden state
    returns the loss, gradients on model parameters, and last hidden state
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0

```

### Forward

```python showLineNumbers
    # forward pass
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))  # encode in 1-of-k representation
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(
            np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t - 1]) + bh
        )  # hidden state
        ys[t] = np.dot(Why, hs[t]) + by  # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # probabilities for next chars
        loss += -np.log(ps[t][targets[t], 0])  # softmax (cross-entropy loss)
```

We receive the 25 input indices and their 25 target indices, and iterate through the inputs from left to right.

- Start at `t = 0{:python}`
- Create `xs[t]{:python}` as a zero vector
- One-hot encode the current character index into that vector

Then we compute the recurrence:

$$
\begin{align*}
h_{t} &= \text{tanh}(W_{hh}h_{t-1} + W_{xh}x_{t} + b_{h}) \\
\\
y_{t} &= W_{hy}h_{t} + b_{y}
\end{align*}
$$

Using the dictionaries defined at the start of the function, we cache hidden states for every timestep in the 25-step window. Backpropagation needs these cached values, while inference during sampling does not. In each iteration, gradients are computed only over the current 25-character window; the next iteration advances to the next window. This is **Truncated Backpropagation Through Time**.

We now have the logits in `ys[t]{:python}`, which are normalized by softmax and turned into a distribution over the next character.

The total chunk loss is just the sum of the per-timestep prediction errors, and that accumulated loss is what drives one parameter update for the whole window.
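
As a sanity check, an untrained model is roughly uniform over the 27 characters, so each timestep costs about $-\log(1/27)\approx 3.30$ and a 25-step chunk costs about $82.4$. That is exactly how `smooth_loss{:python}` is initialized in the main loop, and it matches the loss printed at iteration 0:

```python
import numpy as np

vocab_size, seq_length = 27, 25
per_step = -np.log(1.0 / vocab_size)   # ~3.2958 nats for a uniform guess
print(per_step * seq_length)           # ~82.4, cf. "iter 0, loss: 82.395918"
```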

### Backward

```python showLineNumbers
    # backward pass: compute gradients going backwards
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1  # backprop into y
        dWhy += np.dot(dy, hs[t].T)
        dby += dy
        dh = np.dot(Why.T, dy) + dhnext  # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh  # backprop through tanh nonlinearity
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t - 1].T)
        dhnext = np.dot(Whh.T, dhraw)
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)  # clip to mitigate exploding gradients
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs) - 1]
```

We iterate backward through timesteps, propagating gradients through the unrolled network.

#### Backprop Through Softmax

```python
dy = np.copy(ps[t])
dy[targets[t]] -= 1
```

Recall that in this example we have $K = 27$ total classes, one for each unique character in the dataset.

During the forward pass, at each timestep $t$, we compute logits over all classes $k \in \{1, \dots, K\}$:

$$
\hat p_{k} = \frac{\exp(y_{k})}{\sum_{j=1}^{K}\exp(y_{j})}
\quad \text{(softmax)}
$$

If the target class index is $c = \texttt{targets}[t]$, the per-timestep loss is:

$$
L_t = -\log(\hat p_{c})
\quad \text{(cross-entropy)}
$$

The total sequence loss accumulates across timesteps as $L = \sum_t L_t$, which is exactly what `loss += -np.log(ps[t][targets[t], 0]){:python}` computes.

The generic categorical cross-entropy expression is:

$$
L = -\sum_{k=1}^{K} p_k \log(\hat p_k)
$$

Because the target distribution is one-hot, all terms vanish except the correct class term, giving us:

$$
L = -\log(\hat p_c)
$$

Now, during backprop for `dy{:python}`, we chain gradients from the loss back to the logits at timestep $t$:

$$
\begin{align*}
\dots &\to y_t \to \hat p_t \to L_t \quad \text{(forward pass)} \\
\\
\dots &\gets \frac{\partial L_t}{\partial \hat p_t}
\gets \frac{\partial \hat p_t}{\partial y_t}
\Rightarrow \frac{\partial L_t}{\partial y_t}
\quad \text{(backward pass for } dy\text{)}
\end{align*}
$$

Keeping $t$ fixed and only writing class indices $k$ and $c$:

**Step 01:** Gradient of the loss with respect to the softmax probability of the correct class $c$:

$$
\frac{\partial L_t}{\partial \hat p_c}
= \frac{\partial}{\partial \hat p_c}(-\log \hat p_c)
= -\frac{1}{\hat p_c}
$$

Only the correct-class term contributes directly to the loss because the targets are one-hot.

**Step 02:** Gradient of the softmax output $\hat p_c$ with respect to logit $y_k$:

The softmax Jacobian has two cases:

$$
\frac{\partial \hat p_c}{\partial y_k} =
\begin{cases}
\hat p_c(1 - \hat p_c), & k = c \\
\\
-\hat p_c \hat p_k, & k \neq c
\end{cases}
$$

If $k$ is the target class, increasing $y_k$ increases $\hat p_c$, so the derivative is positive. If $k$ is not the target class, increasing $y_k$ pulls probability mass away from the target class, so the sign flips.

**Step 03:** Combine the pieces with the chain rule:

$$
\frac{\partial L_t}{\partial y_k}
= \frac{\partial L_t}{\partial \hat p_c}
\cdot
\frac{\partial \hat p_c}{\partial y_k}
$$

Substituting the two cases gives:

$$
\frac{\partial L_t}{\partial y_k} =
\begin{cases}
-\frac{1}{\hat p_c} \cdot \hat p_c(1 - \hat p_c) = \hat p_c - 1, & k = c \\
\\
-\frac{1}{\hat p_c} \cdot (-\hat p_c \hat p_k) = \hat p_k, & k \neq c
\end{cases}
$$

So the target class gets $\hat p_c - 1$, and every non-target class gets $\hat p_k$. That is exactly the compact update used in code.

Visually:

$$
\frac{\partial L_t}{\partial y} =
\begin{bmatrix}
\hat p_1 \\
\vdots \\
\hat p_c - 1 \\
\vdots \\
\hat p_K
\end{bmatrix}
$$
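
If you want to double-check this result, a quick finite-difference comparison on toy logits (made-up values, not taken from the model) agrees with the $\hat p_k - \mathbb{1}[k = c]$ formula to numerical precision:

```python
import numpy as np

y = np.array([1.0, 2.0, 0.5, -1.0])    # toy logits, K = 4
c = 1                                   # pretend class 1 is the target

def loss(y):
    p = np.exp(y) / np.sum(np.exp(y))
    return -np.log(p[c])

p = np.exp(y) / np.sum(np.exp(y))
analytic = p.copy()
analytic[c] -= 1                        # dy = p_hat, minus 1 at the target class

eps = 1e-5
numeric = np.array([
    (loss(y + eps * np.eye(4)[k]) - loss(y - eps * np.eye(4)[k])) / (2 * eps)
    for k in range(4)
])
print(np.max(np.abs(analytic - numeric)))   # ~1e-10: the two gradients match
```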

From there, the backward pass continues through `Why{:python}`, the `tanh{:python}` nonlinearity, and finally the hidden-to-hidden and input-to-hidden pathways. The implementation then clips gradients to the range `[-5, 5]{:python}` to reduce the risk of exploding gradients.
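
One step worth spelling out: `dhraw = (1 - hs[t] * hs[t]) * dh{:python}` uses the identity $\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$. Since `hs[t]{:python}` already stores the $\tanh$ of the pre-activation, the derivative can be written entirely in terms of the cached hidden state:

$$
\frac{\partial L}{\partial a_t} = (1 - h_t \odot h_t) \odot \frac{\partial L}{\partial h_t},
\qquad h_t = \tanh(a_t),\quad a_t = W_{xh}x_t + W_{hh}h_{t-1} + b_h
$$

where $\odot$ denotes elementwise multiplication.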

## Closing Notes

I kinda got lazy at the end and didn't wanna keep writing latex, so I decided to call it a night, but hopefully this was useful ;P

<StarDivider />

[^1]: Implementation: https://gist.github.com/karpathy/d4dee566867f8291f086]]></content:encoded>
            <category>Machine Learning</category>
            <category>RNNs</category>
        </item>
        <item>
            <title><![CDATA[Improvisation on the Artist]]></title>
            <link>https://stanleywang.dev/writing/improvisation-on-the-artist</link>
            <guid>https://stanleywang.dev/writing/improvisation-on-the-artist</guid>
            <pubDate>Sun, 15 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[On Jack Whitten and the adhesion of visual art & jazz.]]></description>
            <content:encoded><![CDATA[<TOC title="Improvisation on the Artist" variant="staffline" />

> My environment is unnatural, unsensual, tough and uncompromising. Within this milieu I have decided to create my art. The painting is not the conduit. I am the conduit.

As the hour ticked onward, my family grew more and more tired of me. Apart from a painting of an evil baby that my mom really seemed to enjoy, it was clear they'd had their fair share of being dragged along through the Museum of Modern Art.

So, I made promises to be hasty with my exploration of the sixth floor, where a new exhibition had recently been launched: [Jack Whitten: The Messenger](https://www.moma.org/calendar/exhibitions/5785).

---

I walked up the stairs, hearing the faint whisper of _So What_ by Miles Davis playing at the top. Reaching the landing, I became immediately drawn in by the sight of a cosmic window adorning the wall.

<Img
  src="/images/improvisation-on-the-artist/Homecoming.jpg"
  alt="Homecoming: For Miles Davis"
  caption="Jack Whitten, Homecoming: For Miles Davis, 1992. Image courtesy of Julian Myers and Joanna Szupinska"
/>

As my steps fell closer and closer, the painting's outer constellations began to exceed the boundaries of my vision. My burgeoning heartbeat burned up as I reached into the expanse of stars, ultimately silenced in the face of infinity portrayed behind each individual tile. Soon, the rest of me followed.

---

American artist **Jack Whitten** (1939-2018) created _Homecoming: For Miles Davis_ in commemoration of the jazz musician, with whom he had been close friends. The goal of the painting was to portray Davis's soul in accordance with how he saw it.

Whitten started by laying down sheets of acrylic and splattering individual droplets of white paint over them. After they had dried, he would carve each individual tile and rearrange them to form a "cosmic net." [^1]
After all was said and done, he would be left with what he called an _Acrylic Tesserae_, a mosaic composed of paint.

As I explored the rest of his exhibit, it became clear that Whitten had a deep connection to jazz, not just in his relationships, but in all aspects of his life and his art.

I believe that avenues exist between conceptual awe & understanding of a subject, collapsed in both directions by our experiences with it. It is up to us to color the entrances with curiosity and the ends with comprehension.

_And so, I let my curiosity carry me forward._

## Double Shift Originality

<Img
  src="/images/improvisation-on-the-artist/Jack Whitten 2.jpg"
  alt="Jack Whitten Portrait"
  caption="Portrait of Jack Whitten, 1973. © Jack Whitten Estate. Courtesy the Estate and Hauser & Wirth"
/>

For all of his life, Jack Whitten had felt a connection to jazz music.

He started as a sax player in junior high, years before he ever started painting. But after moving to New York, he knew his musicianship was not strong enough to participate in the experimental jazz scene.

<Quote mark>
The music was a way of me defining myself. I couldn't do it with the horn, so I figured I could deal with it in paint. [^2]
</Quote>

It became clear to me, as I explored how Whitten talked about music, that he saw jazz as a conduit which bound both forms of artistry—the visual and the audible. To him, his paintings were another way to create "the music"; they were an abstraction on how he wanted art to define him.

## Philosophy of Jazz

To Jack Whitten, jazz represented an "expansion of freedom" [^3] due to its embrace of improvisation; its spontaneous nature imposes no limits on feeling or expression.

Whitten saw improvisation as a necessary element of art. Without it, the "spirit" of art and jazz alike would, in essence, be stale and unmoving. But he also believed that spontaneity alone would not allow for art to connect with us. It needed help in the form of the "conceptual":

> The acceptance of spontaneity/improvisation does not **reject** the value of conceptual thought. Conceptualism is a tool in the service of spontaneity/improvisation. The multidimensional sheets of sound in John Coltrane's music could not reach cognition without the conceptual. [^4]

Whitten parallels "conceptualism" to composition: the structural elements that underlie all forms of art and music. There would be nothing to improvise over or incite spontaneity if not for that compositional skeleton. Hence, art cannot be truly felt without both elements.

Imagine, for example, what a musician might need when improvising. The forming parts of a song like the key, the time, the motifs, and the melodies serve as the structure intended for the musician to push and pull away from. Without these, the improvisation would have nothing to contradict and nothing to magnify.

Whitten wished for his own work to represent these ideas, desiring his "color to be improvisational." [^5]

Knowing this, it's not surprising that Whitten's philosophy on jazz can be witnessed in every aspect of his craft. His paintings were not separated from the ways in which he thought about improvisation and conceptualism. They were, if nothing else, an embodiment of those principles.

To him, he _was_ creating jazz.

### Russian Speedway

Enter _Russian Speedway_, an oil painting that Whitten created in 1971.

<Img
  src="/images/improvisation-on-the-artist/Russian Speedway.jpg"
  alt="Russian Speedway"
  caption="Jack Whitten, Russian Speedway, 1971. Image courtesy of Hauser & Wirth"
/>

He started by laying down the compositional foundation of the piece, pooling layer after layer of paint on top of one another. Then, using a T-Shaped tool he fashioned called the 'Developer,' he swiftly spread the topmost layer of paint in a _single improvised gesture_ carried across the canvas.

<Img
  src="/images/improvisation-on-the-artist/the developer.webp"
  alt="Jack Whitten's Developer Tool"
  caption="Jack Whitten's 'Developer,' 1970. Image courtesy of The New York Times"
/>

Whitten transposed this technique onto other paintings like _Chinese Sincerity_ (1974). The process of layering paint would always remain the same between these instances, but the moment in which it was ruptured was always executed improvisationally.

According to Whitten, it's in these moments that he has been programmed to act—to "purely conceptualize it"—like how a jazz musician can play without thinking after years of technical training. [^6] It's in these moments that his paintings are made. [^7]

## Homecoming

I view Jack Whitten's expression of Miles Davis through the lens of how [Yousuf Karsh portrayed cellist Pablo Casals](https://karsh.org/photographs/pablo-casals/)—his portrait capturing not physical likeness, but artistic essence.

> The point I want to make with painting is that abstraction, as we know it, can be
> directed towards the specifics of subject—a person, a thing, an experience. [^8]

Whitten states in his studio journal that his paint is the "synthesis of concreteness plus abstraction," [^9] the composition plus improvisation. If we view abstraction as the spontaneous element, we can see that Whitten directs his improvisations towards the specifics of his subjects.

In _Homecoming_, the subject _is_ Miles Davis, and the specifics are the essence and soul he put into his jazz. These specifics play the same role that the **key** does in a musical composition.

Whitten's improvisation breaks us out of Davis's figurative key by portraying his soul through a **visual medium**, pushing away from and abstracting on top of its traditionally **auditory nature**. It is the structural element necessary for spontaneity to be achieved. But like all good jazz, there is a push and pull away from form.

Given that Davis's ability to improvise was also something so central to his style and identity as a jazz musician, he and Whitten become bound through that shared form of artistic expression. Again, we see the [visual and the audible bridged together with jazz as their conduit](#double-shift-originality).
This expression, previously informing the _contradiction of structure_ within Whitten's portrayal of Davis's soul, becomes the very thing that pulls it _back into_ key.

---

Despite the differences in the mode and medium in which Whitten and Davis navigated jazz, I see their improvisations as the breathtaking foundation behind _Homecoming_.

> Experimentation is the key. I believe that there are sounds we have not heard.      
> I believe that there are colors we have not seen. And I believe that there are feelings yet to be felt. [^10]

<iframe
  src="https://open.spotify.com/embed/track/4vLYewWIvqHfKtJDk8c8tq?utm_source=generator"
  width="100%"
  height="152"
  frameBorder="0"
  allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture"
  loading="lazy"
/>

<StarDivider />

[^1]: Museum of Modern Art. Wall caption for Jack Whitten, Homecoming: For Miles Davis, 1992.

[^2]: Museum of Modern Art. ["Jack Whitten Light Sheet I. 1969."](https://www.moma.org/audio/playlist/345/4699)

[^3]: Whitten, Jack. _Notes from the Woodshed_, p. 316.

[^4]: Whitten, Jack. _Notes from the Woodshed_, p. 411.

[^5]: Whitten, Jack. _Notes from the Woodshed_, p. 283.

[^6]: Sortor, Emily. ["Jack Whitten's Memorial Paintings."](https://walkerart.org/magazine/jack-whittens-memorial-paintings-2/) _Walker Art Center Magazine_.

[^7]: Sung, Victoria. ["Jack Whitten and the Philosophy of Jazz."](https://walkerart.org/magazine/jack-whitten-and-the-philosophy-of-jazz/) _Walker Art Center Magazine_.

[^8]: Alexander Gray Associates. ["Jack Whitten in Conversation."](http://prod-images.exhibit-e.com/www_alexandergray_com/Whitten_Exhibition_Catalogue_9_11_2013.pdf), p. 3.

[^9]: Whitten, Jack. _Notes from the Woodshed_, p. 367.

[^10]: Whitten, Jack. _Notes from the Woodshed_, p. 410.]]></content:encoded>
            <category>Jazz</category>
            <category>Art</category>
        </item>
        <item>
            <title><![CDATA[Dijkstra's Algorithm]]></title>
            <link>https://stanleywang.dev/writing/dijkstras-algorithm</link>
            <guid>https://stanleywang.dev/writing/dijkstras-algorithm</guid>
            <pubDate>Mon, 21 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[COMP 251 Notes]]></description>
            <content:encoded><![CDATA[<TOC title="Dijkstra's Algorithm" />

{/* #### Quick Definition */}

**Quick Recap on Dijkstra's:**

- no negative-weight edges
- weighted version of breadth-first-search
  - instead of FIFO queue, use priority queue
  - keys are shortest-path weights $d[v]$
- two sets of vertices
  - $S$ vertices whose final shortest-path weights are already found
  - $Q$ priority queue = vertices still in the queue, i.e. those whose shortest-path weights we still need to determine
- greedy-choice: at each step we choose the light edge

<StarDivider />

## Code

```python showLineNumbers
def init_single_source(vertices, s):
	for v in vertices:
		d[v] = inf    # shortest-path estimate
		p[v] = None   # predecessor
	d[s] = 0

def relax(u, v, w):
	if d[v] > d[u] + w(u, v):
		d[v] = d[u] + w(u, v)
		p[v] = u

def dijkstra(V, E, w, s):
	init_single_source(V, s)
	S = set()              # vertices with finalized estimates
	Q = priority_queue(V)  # min-priority queue keyed on d[v]

	while Q is not empty:
		u = extract_min(Q)
		S.add(u)
		for v in adj_list[u]:
			relax(u, v, w)
```
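
The block above is lecture-style pseudocode: `priority_queue{:python}`, `extract_min{:python}`, and `adj_list{:python}` are left abstract. As a runnable point of reference (my own sketch using `heapq{:python}` and an adjacency dict, not part of the course notes), the same algorithm looks like this:

```python
import heapq
from math import inf

def dijkstra(adj, s):
    """adj maps each vertex to a list of (neighbor, weight) pairs."""
    d = {v: inf for v in adj}    # shortest-path estimates
    p = {v: None for v in adj}   # predecessors
    d[s] = 0
    S = set()                    # vertices with finalized estimates
    pq = [(0, s)]                # min-heap keyed on d[v]
    while pq:
        _, u = heapq.heappop(pq)             # extract_min
        if u in S:
            continue                         # skip stale heap entries
        S.add(u)
        for v, weight in adj[u]:             # relax every edge leaving u
            if d[v] > d[u] + weight:
                d[v] = d[u] + weight
                p[v] = u
                heapq.heappush(pq, (d[v], v))
    return d, p

adj = {'a': [('b', 1), ('c', 4)], 'b': [('c', 2)], 'c': []}
print(dijkstra(adj, 'a')[0])   # {'a': 0, 'b': 1, 'c': 3}
```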

## Mathematical Context For Path Properties

First, let's cover the working parts & properties:

### Triangle Inequality

The **Triangle Inequality** states that for all $(u,v)\in E$, we have $\delta(s,v) \leq \delta(s,u)+w(u,v)$.

#### Proof

- we have a path $s\leadsto u\to v$, as well as a shortest path $s\leadsto v$
  - the weight of the shortest path $s\leadsto v$ is $\leq$ the weight of **any** path $s\leadsto v$
- take the particular path that follows a shortest path $s\leadsto u$ (weight $\delta(s,u)$) and then the edge $u\to v$
- the weight along $s\leadsto u\to v$ is $\delta(s,u)+w(u,v)$, so $\delta(s,v)\leq\delta(s,u)+w(u,v)$

<Img
  src="/images/DijkstraProof-attachment.jpeg"
  alt="Dijkstra Proof"
  caption="Triangle Inequality"
  invert
  size={60}
/>

### Upper Bound Property

The **Upper Bound Property** states that we always have $d[v]\geq\delta(s,v),\forall v\in V$. Once $d[v]=\delta(s,v)$, it never changes.

#### Proof by Contradiction on Inequality

- let's assume this starts initially true
- then, assume $\exists v\in V;d[v]<\delta(s,v)$
  - this is the first instance of it happening

We know that this can't have happened at initialization, because `init_single_source()` sets all $d[v]=\infty$, therefore this must have happened at some point during the algorithm's run time.

Let $u$ be the vertex that causes $d[v]$ to change, since in order for us to have altered $d[v]$, `relax(u, v, w){:python}` must have been called.

Within `relax(u, v, w){:python}`, $d[v]$ is altered only if:

- `d[v] > d[u] + w(u,v)` evaluates to true.
- if so, `d[v] = d[u] + w(u, v)` is the change made to $d[v]$

Recall our initial assumption, we have: $d[v]<\delta(s,v)$

1. via [Triangle Inequality](#triangle-inequality), $\delta(s,v)\leq\delta(s,u)+w(u,v)$
2. $\delta(s,u)\leq d[u]$, since $v$ was the first vertex whose estimate dropped below its shortest-path weight, meaning:
   - $\delta(s,u)\leq d[u]\implies \delta(s,u)+w(u,v)\leq d[u]+w(u,v)$
3. this results in the full inequality:

$$
d[v]<\delta(s,v)\leq\delta(s,u)+w(u,v)\leq d[u]+w(u,v)
$$

However, this is impossible: `relax(u, v, w){:python}` set $d[v]=d[u]+w(u,v)$, and $d[v]$ cannot be equal to $d[u]+w(u,v)$ and strictly less than it at the same time.

Thus, we have proved $\delta(s,v)\leq d[v],\forall v\in V$. $\blacksquare$

<StarDivider />

### No-Path Property

The **No-Path Property** states that if $\delta(s,v)=\infty$, then $d[v]$ will **always** equal $\infty$

#### Proof

- via the [Upper Bound Property](#upper-bound-property), $d[v]\geq\delta(s,v)$
- this means $\delta(s,v)=\infty \implies d[v]=\infty$

$\blacksquare$

<StarDivider />

### Convergence Property

The **Convergence Property** states that if:

1. we have a path $s\leadsto u\to v$ whose weight is $\delta(s,v)$ – (it is a shortest path)
2. $d[u]=\delta(s,u)$
3. we call `relax(u, v, w){:python}`,

then $d[v]=\delta(s,v)$ afterward.

#### Proof

We relax $v$ within this code:

```python
if d[v] > d[u] + w(u, v):
	d[v] = d[u] + w(u, v)
	p[v] = u
```

After this code, $d[v]\leq d[u]+w(u,v)$, because when entering `relax(u, v, w){:python}`:

1. if $d[v]$ was $\leq d[u]+w(u,v)$ – we would bypass the if-condition, and nothing happens
2. if $d[v]$ was $>$, then it is set $=d[u]+w(u,v)$

These are the only two cases, and both result in $d[v]\leq d[u]+w(u,v)$.

We can take the RHS and simplify it, as we have defined $d[u]=\delta(s,u)$:

$$
d[v]\leq\;\;\;\delta(s,u)+w(u,v)
$$

Since we defined $s\leadsto u\to v$ to be a shortest path, meaning:

$$
d[v]\leq\;\;\;\delta(s,v)=\delta(s,u)+w(u,v)
$$

Finally, by the [Upper Bound Property](#upper-bound-property), we know that $d[v]\geq \delta(s,v)$. This means we must have $d[v]=\delta(s,v)$. $\blacksquare$

<StarDivider />

### Path Relaxation Property

Let $p = \langle v_{0},v_{1},\dots,v_{k} \rangle$ be a shortest path from $s=v_{0}$ to $v_{k}$. Relaxing these edges, **in order**, will ensure that $d[v_{k}]=\delta(v_{0},v_{k})$. (The shortest path estimate at $v_{k}$ is the correct one).

#### Proof by Induction

We will show via induction on $i$ that $d[v_{i}]=\delta(s,v_{i})$ after the edge $(v_{i-1},v_{i})$ is relaxed.

**Base Case:** $i=0$, and $v_{0}=s$.

At initialization in `init_single_source(){:python}`, we set $d[s]=0$, and since there are no negative-weight edges, $\delta(s,s)=0$, so $d[s]=\delta(s,s)$.

**Inductive Step:** Assume $d[v_{i-1}]=\delta(s, v_{i-1})$.

As we relax edge $(v_{i-1},v_{i})$, note that we have met the pre-conditions for the [Convergence Property](#convergence-property):

1. we have a shortest path $s\leadsto v_{i-1} \to v_{i} \leadsto v_{k}$
   - by optimal substructure, the path $s\leadsto v_{i-1} \to v_{i}$ must also be a shortest path $\delta(s,v_{i})$
2. we have $d[v_{i-1}]=\delta(s,v_{i-1})$
3. we are now calling `relax` on $(v_{i-1},v_{i})$

hence, $d[v_{i}]$ converges to $\delta(s,v_{i})$ and never changes.

We have proved by induction that $d[v_{k}]=\delta(v_{0},v_{k})$ if we relax the edges in order. $\blacksquare$

<StarDivider />

## Dijkstra's Proof

### via Loop Invariant

We will prove via a **Loop Invariant** that Dijkstra's Algorithm is correct.

```python
def dijkstra(V, E, w, s):
	init_single_source(V, s)
	S = set() # init empty
	Q = priority_queue(V)

	while Q is not empty:
		u = extract_min(Q)
		S.add(u) # [!code hl]
		for v in adj_list[u]:
			relax(u, v, w)
```

**Loop Invariant:** At the end of each iteration of the while loop, $d[v]=\delta(s,v), \forall v\in S$

#### Initialization

At initialization, $S$ is an empty set, and so the loop invariant holds as a by-product of having no $v\in S$ yet.

#### Maintenance

Show that $d[v] = \delta(s,v)$ when $v$ is added to $S$ in each iteration.

We will prove the maintenance property through contradiction:

Assume that for the first time, after an iteration on some vertex $v$, we have added $v$ to $S$, and $d[v]\neq\delta(s,v)$.

What do we know?

For starters, we know that $v\neq s$, as $d[s]=\delta(s,s)=0$. This means that $s\in S$ and $S\neq \emptyset$ when $v$ is added.

We also know, by the [No-Path Property](#no-path-property), there **exists some path** $s\leadsto v$. Otherwise, the property states that:

$$
\{\;\delta(s,v)=\infty\; \} \implies \{ \; d[v] \text{ will always }= \infty \;\} \implies \{ \;d[v]=\infty=\delta(s,v)\;\}
$$

which contradicts our assumption that $d[v]\neq\delta(s,v)$.

Since there exists a path $s\leadsto v$, there **must** exist a shortest path, $p$, from $s\leadsto v$.

Allow us to decompose $p$ into $s\leadsto^{p_{1}} x\to y\leadsto^{p_{2}}v$, such that $s,x\in S$, $y,v\in Q$, and edge $(x,y)$ is the edge crossing the two sets $S,Q$.

**_Claim:_** $d[y]=\delta(s,y)$ when $v$ is added to $S$

**Proof:**

1. by optimal substructure, any subpath within $s\leadsto v$, such as $s\leadsto x\to y$, is a shortest path as well
2. $x\in S \implies d[x] = \delta(s,x)$
3. we called `relax` on edge $(x,y)$ at the time of adding $x$ to $S$
   - so by the [Convergence Property](#convergence-property), $d[y]=\delta(s,y)$

This means that if $y=v$, we have already reached a contradiction, as our initial assumption was:

$$
d[v] \neq\delta(s,v)
$$

and we have just proved that the estimate **is** the correct delta, and the proof is finished.

However, what if $y\neq v$? Can we still reach a contradiction?

Once again, we know that there exists a shortest path $p$ from $s\leadsto v$ via the [No-Path Property](#no-path-property), and that any subpath along $p$ is also a shortest path by optimal substructure. This implies a chain of logic:

1. $s\leadsto y$ is a shortest path
2. by our **_Claim_**, $d[y]=\delta(s,y)$
3. since there are no _negative_ edge weights, a shortest path $s\leadsto y\leadsto v$ must weigh at least $\delta(s,y)$, meaning:
4. $s\leadsto y\leadsto v\implies\delta(s,y)\leq\delta(s,v)$
   - this is because the shortest path $p$ to $v$ passes through $y$
5. $\delta(s,v)\leq d[v]$ by the [Upper Bound Property](#upper-bound-property)

Putting this all together:

$$
\begin{align*}
\ d[y]&=\delta(s,y)\quad\text{(2)}\\
\ &\leq \delta(s,v)\quad\text{(4)}\\
\ &\leq d[v]\quad \quad \, \text{(5)}\\
\ &\implies d[y] \leq d[v]\\
\end{align*}
$$

Lastly, we know that:

- we are in the iteration of the while loop where we **choose $v$**
- $Q$ stores a vertex $v$ as a key-value pair `{ v : d[v] }{:python}`
- `extract_min(Q){:python}` chooses to extract the vertex `v` if `d[v]{:python}` is minimum across all estimates it finds in `Q`
- both $v$ and $y$ were in $Q$ when we chose $v$

This means in order for $v$ to have been chosen, $d[v]\leq d[y]$. We can conclude:

$$
d[v]\leq d[y] \land d[y]\leq d[v]\implies d[v]=d[y]
$$

The estimate $d[v]$ **must be equal** to the estimate $d[y]$ if it is both $\leq$ and $\geq$ $d[y]$. This, again, contradicts our initial assumption that $d[v]\neq\delta(s,v)$ as $d[v]=d[y]$, and $d[y]=\delta(s,y)$ by our initial **_Claim_**. $\blacksquare$

#### Termination

At the end of the while loop, $Q$ (which started out equal to $V$) is now $\emptyset$. At each iteration, we added the extracted vertex to $S$, meaning that now, $S=V$. This implies that:

$$
d[v]=\delta(s,v),\;\forall v\in V
$$

The loop invariant has been shown to hold across initialization, maintenance, and termination, thus proving Dijkstra's Algorithm. $\blacksquare$]]></content:encoded>
            <category>Algorithms</category>
        </item>
    </channel>
</rss>