*Pryor unhooks the deer’s skull from the wall above his still-curled-up companion. Examines it. Not a good specimen –the back half of the lower jaw’s missing, a gap that, with the open cranial cavity, makes room enough for Pryor’s head.*

*He puts it on. – Will Eaves, Murmur*

So as we roll into the last dying embers of the flaming glory that is (North American) Pride Month (Europe you’ve still got some fun ahead), I’ve been doing my best to make both good and questionable decisions.

The good decision was to read an absolutely fabulous book by a British author named Will Eaves (on a run of absolutely stunning books) who fairly recently released a book called Murmur. Murmur is an odd beast, I guess you could say it’s a novel about Alan Turing but that would be moderately inaccurate and probably unfair.

Firstly, because even from one of my favourite authors, I am basically allergic to people writing about maths and mathematicians. It’s always so stodgy and wooden, as if they are writing about a world they don’t understand but also can’t convincingly fake. (Pride month analogy: I mostly avoid straight people writing queer stories for the same reason.) And Turing is a particular disaster for cultural portrayals: he intersects with too many complicated things (World War 2, cryptography, computers, pre-1967 British homosexuality) for him to ever be anything but an avatar.

So this is not a book about Alan Turing being a tortured genius. It’s a book about a guy called Alec Pryor who just happens to share a bunch of biographical details with Turing (Bletchley, Cambridge, ex-fiancée, arrest and chemical castration, Jungian therapist). And it’s not a book about him being a sad, wronged, gay genius who kills himself. It’s a story about him living, and about him processing the changes in his internal life due to his punishment, his interactions with the outside world and his past, and his musings on consciousness and computation.

All of which is to say Murmur is a complex, wonderfully written book that over its hundred and seventy something pages sketches out a fully realized world that doesn’t make me want to hide under the sofa in despair. And it does that rare thing for people telling this story: it doesn’t flatten out the story by focusing on the punchline but rather the person and the life behind it.

(As an aside, I’d strongly recommend Hannah Gadsby’s Netflix special Nanette, which talks about the damage we do by truncating our stories to make other people happy. It’s the only cultural document so far of 2018 worth going out of your way to see.)

I would recommend you find yourself a copy. It’s published by a small press, so order it from them or, if you’re in London, pop into Gays The Word and get yourself a copy.

**And now that you’ve made your way through the unexpected book review, let’s get to the point**

But the point of writing this post wasn’t to do a short review of a wonderful book (it was to annoy Aki who is waiting for me to finish something). But Murmur is a book that spends some time (as you inevitably do when considering an ersatz Turing) considering the philosophical implications of artificial intelligence. (Really selling it there Daniel.) And this parallels some discussion that I’ve been seeing around the traps about what we talk about when we talk about neural networks.

Also because the quote that I ripped from an absolutely wonderful run in the novel to unceremoniously shove at the top of this post made me think of how we use methods that we don’t fully understand.

The first paper I fell into (and incidentally reading papers on neural nets is my aforementioned questionable decision) has the direct title Polynomial Regression As an Alternative to Neural Nets, where Cheng, Khomtchouk, and Matloff argue that we might as well use polynomial regression as it’s easier to interpret than a NN and basically gives the same answer.

The main mathematical argument in the paper is that if you build a NN with a polynomial activation function, each layer gives a higher-order polynomial. They argue that the Stone-Weierstrass approximation theorem suggests that any activation function will lead to a NN that can be well approximated by a high-order polynomial.
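To make the degree argument concrete, here is a minimal sketch of a toy one-input network with a quadratic activation. The helper names (`poly_mul`, `quadratic_layer`, `degrees_by_layer`) and the weights are made up for illustration and are not taken from the paper; the only point is that each layer squares a polynomial, so the degree doubles.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def quadratic_layer(p, weight, bias):
    """Apply one toy layer: p(x) -> (weight * p(x) + bias) ** 2."""
    pre = [weight * c for c in p]
    pre[0] += bias
    return poly_mul(pre, pre)


def degree(p):
    """Degree of a polynomial, ignoring trailing zero coefficients."""
    d = len(p) - 1
    while d > 0 and p[d] == 0:
        d -= 1
    return d


def degrees_by_layer(n_layers, weight=1.0, bias=1.0):
    """Track the polynomial degree of the network output after each layer."""
    p = [0.0, 1.0]  # the network input, i.e. the polynomial x
    degs = []
    for _ in range(n_layers):
        p = quadratic_layer(p, weight, bias)
        degs.append(degree(p))
    return degs


print(degrees_by_layer(5))  # [2, 4, 8, 16, 32]
```

With several inputs and several units per layer the bookkeeping gets messier, but the degree still doubles at each layer.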

Now as a general rule, anytime someone whips out Stone-Weierstrass I feel a little skeptical. Because the bit of me that remembers my approximation theory remembers that the construction in this theorem is very slow to converge. I’m also alarmed by the use of high-degree polynomial regression using the natural basis and no regularization. Both of these things are a **very bad idea**.
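One way to see the problem with the natural basis is to compare condition numbers of the design matrix. The sketch below assumes a single hypothetical covariate on [0, 1] and a degree-10 fit (both choices are mine, not the paper’s), and compares the monomial columns 1, x, ..., x^10 with Chebyshev columns on the rescaled variable. It says nothing about regularization; it only illustrates the basis issue.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def design_condition_numbers(degree=10, n_points=200):
    """Condition numbers of a natural-basis and a Chebyshev-basis design matrix."""
    x = np.linspace(0.0, 1.0, n_points)  # hypothetical covariate on [0, 1]
    natural = np.vander(x, degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    t = 2.0 * x - 1.0  # rescale to [-1, 1], where Chebyshev polynomials live
    cheb = C.chebvander(t, degree)  # columns T_0(t), ..., T_degree(t)
    return float(np.linalg.cond(natural)), float(np.linalg.cond(cheb))


natural_cond, cheb_cond = design_condition_numbers()
print(f"natural basis:   {natural_cond:.3g}")
print(f"Chebyshev basis: {cheb_cond:.3g}")
```

The natural-basis matrix comes out many orders of magnitude worse conditioned, which is the numerical face of the multicollinearity that comes up in the comments.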

But the whole story–that neural networks can also be understood as being quite like polynomial regression and that analogy can allow us to improve our NN techniques–is the sort of story that we need to tell to understand how to make these methods better in a principled way.

(Honestly, I’m not a giant fan of a lot of the paper–it does polynomial regression in exactly the way people have been telling applied people they should never do it. But hey, there’s some interesting stuff in there.)

The other paper I read was a whole lot more interesting. Rather than trying to characterize a neural network as something else, it instead tries to argue that NNs should be used to process electronic health records. And it gets good results compared to standard methods, which is always a good sign. But that’s not what’s interesting.

The interesting thing comes via Twitter. Uri Shalit, who’s a prof at Technion, noticed something very interesting in the appendix. Table 1 in the appendix showed that regularized logistic regression performs almost exactly as well as the complicated deep net.

Once again, the Goliath of AI is slain by the David of “really boring statistical methods”.

But again, there’s more to this than that. Firstly, the logistic regression required some light “feature engineering” (ie people who knew what they were doing had to do something to the covariates). In particular, they had to separate them into time bins to allow the model to do a good job at modelling time. The Deep Net didn’t need that.
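For a sense of how light this kind of feature engineering is, here is a rough sketch of time binning. The event codes, bin edges and function name are all hypothetical; the paper’s actual binning scheme is not reproduced here.

```python
from collections import Counter


def bin_events(events, prediction_time, bin_edges_hours):
    """Count events per (code, time bin), looking back from prediction_time.

    events: list of (code, time_in_hours) pairs.
    bin_edges_hours: ascending hours-before-prediction edges, e.g. [0, 24, 168].
    """
    counts = Counter()
    for code, t in events:
        age = prediction_time - t
        if age < 0:
            continue  # skip anything recorded after the prediction time
        for lo, hi in zip(bin_edges_hours[:-1], bin_edges_hours[1:]):
            if lo <= age < hi:
                counts[(code, f"{lo}-{hi}h")] += 1
                break
    return dict(counts)


# Made-up record: two glucose labs and one insulin order before hour 100.
events = [("lab_glucose", 95), ("lab_glucose", 50), ("rx_insulin", 99), ("lab_glucose", 120)]
features = bin_events(events, prediction_time=100, bin_edges_hours=[0, 24, 168])
print(features)
```

Each (code, bin) count then becomes one column of the design matrix for the regularized logistic regression.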

This particular case of feature engineering is trivial, but in a lot of cases it’s careful understanding of the structure of the problem that lets us use simple statistical techniques instead of something weird and complex. My favourite example of this is Bin Yu’s work where she (read: her and a lot of collaborators) basically reconstructed movies a person was watching from an fMRI scan! The way they did it was to understand how certain patterns excite certain pathways (basically using existing science) and to put those activation patterns in as covariates in a LASSO regression. So the simplest modern technique + feature engineering (aka science) gave *fabulous* results.

The argument for all of these complex AI and deep learning methods is that they allow us to be a little more sloppy with the science. And it seems to work quite well for images and movies, which are extremely structured. But electronic health records are not particularly structured and can have some really weird missingness problems, so it’s not clear that the same methods will have as much room to move. In this case a pretty boring regularized logistic regression does almost exactly as well, which suggests that the deep net is not able to really fly.

The path forward now is to start understanding these cases, working out how things work, when things work, and what techniques we can beg, borrow, and steal from other areas of stats and machine learning. Because deep learning is not a panacea, it’s just a boy standing in front of a girl asking her to love him.

Dan:

Thanks. This blog can always use a bit more literary criticism.

I usually only read trash. Definitely not stuff I can shoehorn into a discussion about NNs

“Now as a general rule, anytime someone whips out Stone-Weierstrass I feel a little skeptical. Because the bit of me that remembers my approximation theory remembers that the construction in this theorem is very slow to converge. I’m also alarmed by the use of high-degree polynomial regression using the natural basis and no regularization. Both of these things are a very bad idea.”

This is not quite how I reacted to the paper. First, the authors are mostly using _quadratic_ polynomials and arguing that using NNs is like using high-degree polynomials. See last slide of

http://heather.cs.ucdavis.edu/polygrail.pdf

I guess limiting the degree is a form of regularization, but I agree they should also be regularizing the coefficients for any given degree polynomial.

Second, whenever someone whips out a criticism of whipping out the Stone-Weierstrass theorem (Andrew did also in an email to me and Bob) I whip out my goto criticism of the criticism of the Stone-Weierstrass theorem:

http://www.chebfun.org/ATAP/atap-first6chapters.pdf (last paragraph of chapter 6)

“The trouble with this line of research is that for almost all the functions encountered in practice, Chebyshev interpolation works beautifully! Weierstrass’s theorem has encouraged mathematicians over the years to give too much of their attention to pathological functions at the edge of discontinuity, leading to the bizarre and unfortunate situation where many books on numerical analysis caution their readers that interpolation may fail without mentioning that for functions with a little bit of smoothness, it succeeds outstandingly.”

The end of https://www.boost.org/doc/libs/1_67_0/libs/math/doc/html/math_toolkit/sf_poly/chebyshev.html references this book when discussing its chebyshev_transform function, which inputs a function on the (a, b) interval and outputs the near-minimax polynomial for that function. It worked well for me when doing an integral that could otherwise only be done with the integrate_1d() function in Stan that Ben Bales is close to merging. It would be interesting to see someone plug in their NN activation function and see how mini the near-minimax approximation to it is.
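A rough version of that experiment can be sketched in Python. This uses NumPy’s `Chebyshev.interpolate` as a stand-in for Boost’s `chebyshev_transform` (it gives a Chebyshev interpolant, which is near-minimax for smooth functions, not the Boost routine itself). The choice of `tanh` as the activation and of [-3, 3] as the interval are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Chebyshev


def max_interp_error(func, deg, domain=(-3.0, 3.0), n_check=2001):
    """Largest absolute error of a degree-`deg` Chebyshev interpolant on `domain`."""
    p = Chebyshev.interpolate(func, deg, domain=list(domain))
    x = np.linspace(domain[0], domain[1], n_check)
    return float(np.max(np.abs(p(x) - func(x))))


for deg in (5, 10, 20, 40):
    print(deg, max_interp_error(np.tanh, deg))
```

For a smooth activation like `tanh` the error should fall off geometrically with the degree; a kinked one such as ReLU would be expected to converge far more slowly.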

Two things:

1) I don’t think most NNs in practice use quadratic activation functions, so S-W is definitely important. The network itself will be much higher degree (doubling on every layer). So the polynomial basis they’re using for regression is of a very high degree (see the tables).

2) Chebfun work typically assumes functions are (piecewise) analytic. In that case I’m skeptical of people pulling out S-W because it’s the wrong tool. My PhD was pretty heavy on approximation theory. I am aware of the literature. My reading of the paper was that it was being applied only to the activation function not to the response surface, so smoothness may be sensible.

Re: Firstly, because even from one of my favourite authors, I am basically allergic to people writing about maths and mathematicians. It’s always so stodgy and wooden, as if they are writing about a world they don’t understand but also can’t convincingly fake.

—

I don’t know, I am on the fence about whether Frank Plumpton Ramsey’s sister did justice to Frank’s life. I can say, being related to the family, that I was shocked that Frank was engaged in an open marriage. I guess as a sister, it’s hard to get into the steamy stuff. But our family, as probably nearly all families, has its share of iconoclasts and eccentrics.

Like I still smile when a hunk walks past me on the patio. I couldn’t resist that one. Seriously, my patio is inundated with hunks. LOLLLLLLL

Someone will get the joke who might be reading. I am a huge tease and joker.

Smile

I sure picked up some great lines for my new script. Smile.

I think my main problem with the Matloff et al. paper is that things like RELU or a sigmoid function are morally very far from being polynomials (sigmoids are bounded by (-1,1), and RELU has too many 0s), and to me the problem with ‘well you can approximate them anyway on an interval’ is that you need to hope really hard that the output from your previous level isn’t too far outside that interval.
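The interval worry is easy to illustrate. The sketch below fits a degree-15 Chebyshev interpolant to `tanh` on [-3, 3] (interval, degree and test points are arbitrary choices for illustration) and then evaluates it both inside and outside that interval.

```python
import numpy as np
from numpy.polynomial import Chebyshev

# Polynomial stand-in for tanh, built only from values on [-3, 3].
fit = Chebyshev.interpolate(np.tanh, 15, domain=[-3.0, 3.0])


def compare(x):
    """Return (true activation, polynomial approximation) at the point x."""
    return float(np.tanh(x)), float(fit(x))


print("inside  x=2:", compare(2.0))
print("outside x=6:", compare(6.0))
```

Inside the interval the two values agree closely. Outside it `tanh` stays below 1 while the polynomial takes off, which is exactly what happens if a previous layer hands over a value the approximation was never fitted for.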

Deja vu (sort of)

https://www.ices.on.ca/Publications/Journal-Articles/1998/January/A-comparison-of-statistical-learning-methods-on-the-GUSTO-database

(I was a bystander in that work and predicted the result.)

I came across approximation theory as a side effect of an interest in smart Monte Carlo and numerical integration, and I remember being blown away that it was never part of any of my statistics courses. I think it should be required for getting a stats degree.

Re: ‘bad idea’ of using the natural basis: polyreg (the package we introduce in ‘Polynomial Regression As an Alternative to Neural Nets’) lets the user do PCA first. Re: no regularization, we note plans to include such comparisons in the next draft. More importantly, we do not recommend this as a tool used for inference; in fact, polyreg does not even report coefficient estimates. As the paper shows, polynomial regression is surprisingly robust for predictive purposes long after multicollinearity has set in such that one would not trust individual coefficient estimates.

It’s not really that surprising that polynomial regression continues to predict well even after multicollinearity sets in. Multicollinearity means that the likelihood doesn’t have a unique maximum point; rather, there is a whole subspace of parameter space that achieves the maximum value, and you get the same predicted values *no matter where you are in the optimal subspace*. This is true of linear regression in general; one just has to be mindful of the injunction against extrapolating outside of the fitted data (and mindful of the fact that when multicollinearity occurs the dimensionality of the fitted data is smaller than the cardinality of the set of predictors).
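A tiny made-up example shows both halves of this. Here the second predictor is exactly twice the first, so any coefficients with b1 + 2*b2 = 3 fit the toy data perfectly; the numbers are invented for illustration.

```python
def predict(coefs, rows):
    """Linear predictions (no intercept) for each row of predictors."""
    return [sum(b * x for b, x in zip(coefs, row)) for row in rows]


# Perfectly collinear training data: x2 = 2 * x1, and y = 3 * x1.
train = [(1, 2), (2, 4), (3, 6), (4, 8)]
y = [3, 6, 9, 12]

# Three points in the optimal set, all satisfying b1 + 2 * b2 = 3.
candidates = [(3, 0), (1, 1), (-1, 2)]

fitted = [predict(c, train) for c in candidates]
print(fitted)  # every candidate reproduces y exactly

# A new point that breaks the x2 = 2 * x1 pattern of the fitted data.
off_pattern = [(1, 0)]
extrapolated = [predict(c, off_pattern)[0] for c in candidates]
print(extrapolated)  # [3, 1, -1]
```

On the fitted data the three coefficient vectors are indistinguishable; off the line x2 = 2 * x1 they disagree completely, which is the extrapolation caveat in action.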

Thank you for this properly thoughtful response to Will’s brilliant book. I’m a friend of his and completely biased about this work. And it brings so much pleasure to know he has readers out there like you, Dan. He must be chuffed. I am sure he is. Have a wonderful weekend!