I sure picked up some great lines for my new script. Smile.

—

Smile

—

Re: Firstly, because even from one of my favourite authors, I am basically allergic to people writing about maths and mathematicians. It’s always so stodgy and wooden, as if they are writing about a world they don’t understand but also can’t convincingly fake.

—

I don’t know; I am on the fence about whether Frank Plumpton Ramsey’s sister did justice to Frank’s life. I can say, being related to the family, that I was shocked that Frank was engaged in an open marriage. I guess as a sister, it’s hard to get into the steamy stuff. But our family, as probably nearly all families, has its share of iconoclasts and eccentrics.

Like I still smile when a hunk walks past me on the patio. I couldn’t resist that one. Seriously, my patio is inundated with hunks. LOLLLLLLL

Someone who might be reading will get the joke. I am a huge tease and joker.

—

It’s not really that surprising that polynomial regression continues to predict well even after multicollinearity sets in. Multicollinearity means that the likelihood doesn’t have a unique maximum point — rather, there is a whole subspace of parameter space that achieves the maximum value, and you get the same predicted values *no matter where you are in the optimal subspace*. This is true of linear regression in general; one just has to be mindful of the injunction against extrapolating outside of the fitted data (and of the fact that when multicollinearity occurs, the dimensionality of the fitted data is smaller than the number of predictors).
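A minimal numpy sketch of the point (the data and the null-space shift are made up for illustration): with two perfectly collinear predictors the least-squares coefficients are not unique, but every solution in the optimal subspace produces identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([x, 2 * x])          # two perfectly collinear predictors
y = 3 * x + rng.normal(scale=0.1, size=50)

# One least-squares solution (lstsq returns the minimum-norm one).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Another point in the optimal subspace: shift along the null-space
# direction (2, -1), which satisfies X @ (2, -1) = 0.
beta2 = beta + 5.0 * np.array([2.0, -1.0])

# Different coefficients, identical predictions.
print(np.allclose(X @ beta, X @ beta2))  # True
```

Any amount of shifting along the null space changes the coefficients without moving the predictions, which is why prediction survives multicollinearity even though the individual coefficients are meaningless.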

—

I think my main problem with the Matloff et al. paper is that things like ReLU or a sigmoid function are morally very far from being polynomials (sigmoids are bounded by (-1, 1), and ReLU has too many zeros), and to me the problem with “well, you can approximate them anyway on an interval” is that you need to hope really hard that the output from your previous layer isn’t too far outside that interval.
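To illustrate that worry (a toy numpy sketch; the degree and the fitting interval are my choices, not the paper’s): a polynomial fit to ReLU behaves fine on the interval it was fit on, but evaluate it just outside and the error blows up — exactly the “hope the previous layer’s output stays in range” problem.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

# Least-squares degree-6 polynomial fit to ReLU on [-1, 1].
xs = np.linspace(-1.0, 1.0, 401)
coefs = np.polyfit(xs, relu(xs), deg=6)
p = np.poly1d(coefs)

inside_err = np.max(np.abs(p(xs) - relu(xs)))   # small on the interval
outside_err = abs(p(3.0) - relu(3.0))           # extrapolate past it

print(inside_err)    # modest approximation error on [-1, 1]
print(outside_err)   # orders of magnitude larger
```

High-degree polynomials diverge like their leading term once you leave the approximation interval, so the quality of the fit on [-1, 1] says nothing about behaviour at 3.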

—

https://www.ices.on.ca/Publications/Journal-Articles/1998/January/A-comparison-of-statistical-learning-methods-on-the-GUSTO-database

(I was a bystander in that work and predicted the result.)

I came across approximation theory as a side effect of an interest in smart Monte Carlo and numerical integration, and I remember being blown away that it was never part of any of my statistics courses. I think it should be required for a stats degree.

—

Two things:

1) I don’t think most NNs in practice use quadratic activation functions, so S-W is definitely important. The network itself will be of much higher degree (the degree doubles with every layer). So the polynomial basis they’re using for regression is of a very high degree (see the tables).

2) Chebfun work typically assumes functions are (piecewise) analytic. In that case I’m skeptical of people pulling out S-W, because it’s the wrong tool. My PhD was pretty heavy on approximation theory, and I am aware of the literature. My reading of the paper was that S-W was being applied only to the activation function, not to the response surface, so smoothness may be sensible.
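The degree-doubling point in 1) can be checked directly (a one-dimensional numpy sketch with made-up layer weights): compose an affine map with a quadratic activation L times and the result is a polynomial of degree 2^L.

```python
import numpy as np
from numpy.polynomial import Polynomial

# A 1-d "network" whose activation is x**2: each layer squares an
# affine function of the previous layer's output.
p = Polynomial([0.0, 1.0])                        # identity: p(x) = x
weights = [(0.7, 0.1), (-1.2, 0.5), (0.3, -0.4)]  # made-up (w, b) per layer
for w, b in weights:
    p = (w * p + b) ** 2                          # affine map, then square

print(p.degree())  # 2**3 == 8
```

Three layers already give degree 8; ten layers would give degree 1024, which is why an exactly-polynomial reading of a deep network implies an enormous polynomial basis.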

—

This is not quite how I reacted to the paper. First, the authors are mostly using _quadratic_ polynomials and arguing that using NNs is like using high-degree polynomials. See the last slide of

http://heather.cs.ucdavis.edu/polygrail.pdf

I guess limiting the degree is a form of regularization, but I agree they should also be regularizing the coefficients of any polynomial of a given degree.

Second, whenever someone whips out a criticism of whipping out the Stone-Weierstrass theorem (Andrew did so in an email to me and Bob), I whip out my go-to criticism of the criticism of the Stone-Weierstrass theorem:

http://www.chebfun.org/ATAP/atap-first6chapters.pdf (last paragraph of chapter 6)

“The trouble with this line of research is that for almost all the functions encountered in practice, Chebyshev interpolation works beautifully! Weierstrass’s theorem has encouraged mathematicians over the years to give too much of their attention to pathological functions at the edge of discontinuity, leading to the bizarre and unfortunate situation where many books on numerical analysis caution their readers that interpolation may fail without mentioning that for functions with a little bit of smoothness, it succeeds outstandingly.”

The end of https://www.boost.org/doc/libs/1_67_0/libs/math/doc/html/math_toolkit/sf_poly/chebyshev.html references this book when discussing its chebyshev_transform function, which takes a function on an interval (a, b) and returns the near-minimax polynomial for that function. It worked well for me on an integral that could otherwise only be done with the integrate_1d() function in Stan that Ben Bales is close to merging. It would be interesting to see someone plug in their NN activation function and see how low the degree of its near-minimax approximation can be.
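As a rough Python stand-in for Boost’s chebyshev_transform (using numpy’s Chebyshev class; tanh as the activation and degree 20 are my choices): on an interval where the activation is smooth, the Chebyshev interpolant is already close to minimax and the error at modest degree is tiny, which is Trefethen’s point about functions with a little smoothness.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# Degree-20 Chebyshev interpolant of tanh on [-1, 1].
f = np.tanh
p = Chebyshev.interpolate(f, deg=20)

xs = np.linspace(-1.0, 1.0, 1001)
max_err = np.max(np.abs(p(xs) - f(xs)))
print(max_err)  # tiny: tanh is analytic in a neighbourhood of [-1, 1]
```

Because tanh is analytic near the interval, the error decays geometrically in the degree — the opposite regime from the pathological functions Weierstrass-style arguments worry about.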

—

I usually only read trash. Definitely not stuff I can shoehorn into a discussion about NNs.

—

Thanks. This blog can always use a bit more literary criticism.
