A proposal to build new hardware and thermodynamic algorithms for stochastic computing

Patrick Coles writes:

Modern AI has moved away from the absolute, deterministic procedures of early machine learning models. Nowadays, probability and randomness are fully embraced and utilized in AI. Some simple examples of this are avoiding overfitting by randomly dropping out neurons (i.e., dropout), and escaping local minima during training thanks to noisy gradient estimates (i.e., stochastic gradient descent). A deeper example is Bayesian neural networks, where the network’s weights are sampled from a probability distribution and Bayesian inference is employed to update the distribution in the presence of data . . .

Another deep example is generative modeling with diffusion models. Diffusion models add noise to data in a forward process, and then reverse the process to generate a new datapoint (see figure illustrating this for generating an image of a leaf). These models have been extremely successful not only in image generation, but also in generating molecules, proteins and chemically stable materials . . .

AI is currently booming with breakthroughs largely because of these modern AI algorithms that are inherently random. At the same time, it is clear that AI is not reaching its full potential, because of a mismatch between software and hardware. For example, sample generation rate can be relatively slow for diffusion models, and Bayesian neural networks require approximations for their posterior distributions to generate samples in reasonable time.

Then comes the punchline:

There’s no inherent reason why digital hardware is well suited for modern AI, and indeed digital hardware is handicapping these exciting algorithms at the moment.

For production AI, Bayesianism in particular has been stifled from evolving beyond a relative niche because of its lack of mesh with digital hardware. . . . The next hardware paradigm should be specifically tailored to the randomness in modern AI. Specifically, we must start viewing stochasticity as a computational resource. In doing so, we could build a hardware that uses the stochastic fluctuations produced by nature.

Coles continues:

The aforementioned building blocks are inherently static. Ideally, the state does not change over time unless it is intentionally acted upon by a gate, in these paradigms.

However, modern AI applications involve accidental time evolution, or in other words, stochasticity. This raises the question of whether we can construct a building block whose state randomly fluctuates over time. This would be useful for naturally simulating the fluctuations in diffusion models, Bayesian inference, and other algorithms.

The key is to introduce a new axis when plotting the state space: time. Let us define a stochastic bit (s-bit) as a bit whose state stochastically evolves over time according to a continuous time Markov chain . . .

Ultimately this involves a shift in perspective. Certain computing paradigms, such as quantum and analog computing, view random noise as a nuisance. Noise is currently the biggest roadblock to realizing ubiquitous commercial impact for quantum computing. On the other hand, Thermodynamic AI views noise as an essential ingredient of its operation. . . .
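
To make the s-bit idea concrete, here is a minimal simulation sketch (mine, not from the report): a single s-bit modeled as a two-state continuous-time Markov chain, i.e., a random telegraph process. The flip rates below are invented for illustration.

```python
# One "s-bit" as a two-state continuous-time Markov chain (random telegraph
# process). The rates are made up; Coles's report defines its own dynamics.
import numpy as np

rng = np.random.default_rng(0)
rate_up, rate_down = 1.0, 2.0  # hypothetical flip rates (events per unit time)

def sample_sbit_path(t_max):
    """One trajectory: exponential holding times, then flip the bit."""
    t, state = 0.0, 0
    times, states = [0.0], [0]
    while t < t_max:
        rate = rate_up if state == 0 else rate_down
        t += rng.exponential(1.0 / rate)  # holding time is exponential
        state = 1 - state                 # flip the bit
        times.append(t)
        states.append(state)
    return np.array(times), np.array(states)

times, states = sample_sbit_path(1000.0)
# Long-run fraction of time in state 1 approaches rate_up/(rate_up + rate_down) = 1/3.
occupancy = np.sum(np.diff(times) * states[:-1]) / times[-1]
print(f"fraction of time in state 1: {occupancy:.3f}")
```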

I think that when Coles says “AI,” he means what we would call “Bayesian inference.” Or maybe AI represents some particularly challenging applications of Bayesian computation.

Analog computing

OK, the above is all background. Coles’s key idea here is to build new hardware that physically implements these stochastic bits, so that the continuous-time stochastic computation gets done directly.

This is reminiscent of what in the 1950s and 1960s was called “analog computation” or “hybrid computation.” An analog computer is something you build with a bunch of resistors, capacitors, and op-amps to solve a differential equation. You plug it in, turn on the power, and the voltage tells you the solution. Turn some knobs to change the parameters in the model, or set it up in a circuit with a sawtooth input and plug it into an oscilloscope to get the solution as a function of the input, etc. A hybrid computer mixes analog and digital elements. Coles is proposing something different, in that he’s interested in the time evolution of the state (which, when marginalized over time, can be mapped to a posterior distribution), whereas with a traditional analog computer you just look at the end state and are not interested in the transient period it takes to get there.
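
As a toy illustration (my example, not anything from the post or report): the “program” of an analog computer is the circuit itself. A single RC stage realizes dV/dt = −V/(RC), and the voltage trace is the solution; here is the same computation done digitally.

```python
# Digital stand-in for what an RC analog stage computes: dV/dt = -V/(R*C).
import numpy as np

R, C = 1e3, 1e-6          # 1 kilohm, 1 microfarad -> time constant RC = 1 ms
tau = R * C
dt = tau / 100.0
t = np.arange(0.0, 5 * tau, dt)

V = np.empty_like(t)
V[0] = 5.0                # initial capacitor voltage: the "initial condition knob"
for i in range(1, len(t)):
    V[i] = V[i - 1] + dt * (-V[i - 1] / tau)  # Euler step of dV/dt = -V/tau

# The circuit's voltage decays as 5*exp(-t/tau); the simulation agrees:
print(np.allclose(V, 5.0 * np.exp(-t / tau), atol=0.05))  # True
```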

Here’s the technical report from Coles. I have not read it carefully or tried to evaluate it. That would be hard work! Could be of interest to many of you, though.

6 thoughts on “A proposal to build new hardware and thermodynamic algorithms for stochastic computing”

  1. > I have not read it carefully or tried to evaluate it.

    Same boat. I have trouble evaluating these upcoming technologies.

    A friend wrote a paper, “A quantum parallel Markov chain Monte Carlo,” that I liked. I have trouble thinking about the quantum computing security stuff — no skin in the game — I don’t know quantum or security. I do know MCMC applications, though, so I think that made digging around in the confusing parts more interesting.

    That said, don’t quiz me on it now.

  2. I wouldn’t say you’re not interested in the transient: analog computers could model an entire time series. You might be interested in, for example, the vibration of the arm of a bucket loader while driving the bucket loader around on rough ground. You could build an op-amp, capacitor, and inductor network that models the relationship between the tires, the frame, and the loader arm; drive the input to the “wheels” with some rough multi-sine-wave input, and see how much the bucket shakes. The entire trajectory could be of interest.
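
    Something like this, digitally (all parameter values invented):

    ```python
    # Mass-spring-damper "loader arm" driven through the spring by a rough,
    # multi-sine "ground" signal (simplified base excitation; numbers made up).
    import numpy as np

    m, k, c = 200.0, 5e4, 800.0                # mass (kg), stiffness (N/m), damping (N*s/m)
    freqs = np.array([0.7, 1.3, 2.9, 4.1])     # Hz components of the rough ground
    amps = np.array([0.05, 0.03, 0.02, 0.01])  # meters

    def ground(t):
        return np.sum(amps * np.sin(2 * np.pi * freqs * t))

    dt, T = 1e-3, 20.0
    x, v = 0.0, 0.0                            # arm displacement and velocity
    trajectory = np.empty(int(T / dt))
    for i in range(len(trajectory)):
        a = (k * (ground(i * dt) - x) - c * v) / m  # spring to ground, damper to frame
        v += a * dt                                 # semi-implicit Euler step
        x += v * dt
        trajectory[i] = x                           # the whole trajectory is the output

    print(f"peak arm displacement: {trajectory.max():.3f} m")
    ```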

  3. He should make one of those 8-bit breadboard CPUs as a demo. Actually, are there even bits being stored in the RAM/cache/registers?

    Is the device adding 2 + 2 = 4 + e? I.e., noise is added only during computation.

    Or is it (2 + e1) + (2 + e2) = 4 + (e1 + e2)? I.e., the storage itself is noisy.

    Or is it both, so there is also an e3 on the RHS?

    Does this randomness amount to flipping the least significant bit(s) every now and then?

    Maybe it is my fault for not seeing how it would work at the hardware level, but I feel the article was meant to be accessible.
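
    In toy numerical form, here are the first two cases I mean (noise scale invented):

    ```python
    # Case 1: clean storage, noisy operation: 2 + 2 = 4 + e
    # Case 2: noisy storage: (2 + e1) + (2 + e2) = 4 + (e1 + e2)
    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 0.01  # made-up noise scale

    def noisy_op(a, b):
        return a + b + rng.normal(0.0, sigma)  # the adder injects noise

    def noisy_storage(a, b):
        return (a + rng.normal(0.0, sigma)) + (b + rng.normal(0.0, sigma))

    ops = np.array([noisy_op(2.0, 2.0) for _ in range(10_000)])
    sto = np.array([noisy_storage(2.0, 2.0) for _ in range(10_000)])
    print(ops.std(), sto.std())  # ~sigma versus ~sigma*sqrt(2)
    ```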

  4. I’d really like to understand what all the people in this space are doing. Is this all dependent on quantum computing to make sense? Is there anything here for me fitting Bayesian models this year or next or the year after that?

    AI is not reaching its full potential, because of a mismatch between software and hardware. For example, sample generation rate can be relatively slow for diffusion models, and Bayesian neural networks require approximations for their posterior distributions to generate samples in reasonable time.

    While sample generation is slow for diffusion models, the bottleneck is solving stochastic differential equations, not generating (pseudo) random numbers. For large language models and most neural nets, the bottleneck is computing gradients (i.e., matrix multiplication), not generating (pseudo) random numbers.
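
    Here is a toy version of the first point, with a big matrix multiply standing in for the trained score network inside an Euler–Maruyama sampling loop:

    ```python
    # Per-step cost of SDE sampling: "network" evaluation vs. drawing the noise.
    import time
    import numpy as np

    rng = np.random.default_rng(0)
    d, steps = 2048, 100
    W = rng.normal(size=(d, d)) / np.sqrt(d)  # stand-in for a trained drift/score net
    x = rng.normal(size=d)
    dt = 1.0 / steps

    t_net = t_rng = 0.0
    for _ in range(steps):
        t0 = time.perf_counter()
        drift = W @ x                         # the expensive part
        t_net += time.perf_counter() - t0

        t0 = time.perf_counter()
        noise = rng.normal(size=d)            # the (pseudo) random-number part
        t_rng += time.perf_counter() - t0

        x = x + drift * dt + np.sqrt(dt) * noise  # Euler-Maruyama step

    print(f"network: {t_net:.4f}s, RNG: {t_rng:.4f}s")
    ```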

    Modern AI has moved away from the absolute, deterministic procedures of early machine learning models.

    It’s been an ongoing move. N-gram language models (random process) were introduced in 1948. Stochastic gradient descent (random algorithm) was introduced in 1951. The K-means clustering algorithm (random algorithm for a random process) was introduced in 1956. Random algorithms have been prevalent in ML at least since the 1990s.

    Diffusion models add noise to data in a forward process, and then reverse the process to generate a new datapoint

    In stats, we’d reverse the labels here and call the process that maps “white noise” to images the “forward model” or “data generating process.” For diffusion models, it’s a stochastic differential equation; for normalizing flows, it’s a neural network. The “inverse problem” is that of mapping from data (observed images and/or text) to model parameters (of the SDE for a diffusion or of the NN for a normalizing flow). The utility of inverting the diffusion or flow is to evaluate the density of an image, which is often required for non-parametric density estimation (the alternative is to estimate a Bayesian posterior with variational inference using the diffusion or flow as the variational family).
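
    To make the density-evaluation point concrete, here it is in the simplest possible case: an affine “flow,” where the answer is known in closed form (my toy example):

    ```python
    # Change of variables: x = mu + sigma * z with z ~ N(0, 1), so
    # log p(x) = log p_base(f_inv(x)) + log |d f_inv / dx|.
    import numpy as np
    from scipy.stats import norm

    mu, sigma = 3.0, 2.0  # toy "flow" parameters

    def log_density(x):
        z = (x - mu) / sigma                   # invert the flow
        return norm.logpdf(z) - np.log(sigma)  # base log-density + log|det J|

    # Cross-check against the known closed form N(mu, sigma^2):
    print(np.isclose(log_density(4.5), norm.logpdf(4.5, loc=mu, scale=sigma)))  # True
    ```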

    • Perhaps he’s saying we need algorithms that approximate something like a gradient with noise directly and cheaply? Or give samples from an SDE directly and cheaply, without calculating a lot of intermediate steps (at least approximately)? A little like saying: sure, you can calculate pi using 10 terms in a series to get 15 digits of accuracy, but if you need one sample from a normal distribution whose mean is pi and whose standard deviation is 1, then you can maybe approximate that one sample by a uniform random number mapped through a cheap function, and for that purpose it’s fine.

      I’ve often thought that sampling from a cheap surrogate process and then squishing your sample afterwards was a good idea. I think exploring cheap inexact computation is valuable compared to running more expensive exact sampling procedures. The key is getting probabilistic bounds on how inexact you are.
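
      For example (the logit(u)/1.702 map is the classic logistic approximation to the standard normal inverse CDF; the target here is made up):

      ```python
      # One uniform draw pushed through a cheap closed-form map approximates
      # one draw from N(pi, 1); good enough if that's all you need.
      import numpy as np

      rng = np.random.default_rng(0)
      u = rng.uniform(size=100_000)
      samples = np.pi + np.log(u / (1.0 - u)) / 1.702  # logit(u)/1.702 ~ Phi^{-1}(u)

      print(samples.mean(), samples.std())  # ~pi and ~1.07 (tails slightly heavy)
      ```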

  5. Last year Veritasium presented an argument that analog computers are particularly suited for the large amounts of matrix multiplication used in machine learning:
    https://www.youtube.com/watch?v=GVsUOuSjvcg
    The emphasis on avoiding the waste from the “von Neumann bottleneck” of the typical computer architecture, at the cost of sacrificing accuracy/precision, struck me as the opposite (from a J. Storrs Hall “Where’s My Flying Car” perspective) of the reform of increasing accuracy via “posits”:
    https://twitter.com/TeaGeeGeePea/status/1598549962324938752
