## Calibrating patterns in structured data: No easy answers here.

“No easy answers” . . . Hey, that’s a title that’s pure anti-clickbait, a veritable kryptonite for social media . . .

Anyway, here’s the story. Adam Przedniczek writes:

I am trying to devise new or tune up already existing statistical tests assessing rate of occurrences of some bigger compound structures, but the most tricky part is to take into account their substructures and building blocks.

To make it as simple as possible, let’s say we are particularly interested in a test for enrichment or over-representation of given structures, e.g. quadruples, over two groups. Everything is clearly depicted in this document.

And here the doubts arise: I have strong premonition that I should take into consideration their inner structure and constituent pairs. In the attachment I show such an adjustment for enrichment of pairs, but I don’t know how to extend this approach properly over higher (more compound) structures.

Hey—this looks like a fun probability problem! (Readers: click on the above link if you haven’t done so already.) The general problem reminds me of things I’ve seen in social networks, where people summarize a network by statistics such as the diameter, the number of open and closed triplets, the number of loops and disconnected components, etc.

My quick answer is that there are two completely different ways to approach the problem. It’s not clear which is best; I guess it could make sense to do both.

The first approach is with a generative model. The advantage of the generative model is that you can answer any question you’d like. The disadvantage is that with structured dependence, it can be really hard to come up with a generative model that captures much of the data features that you care about. With network data, they’re still playing around with variants of that horribly oversimplified Erdos-Renyi model of complete independence. Generative modeling can be a great way to learn, but any particular generative model can be a trap if there are important aspects of the data it does not capture.

The second approach is more phenomenological, where you compare different groups using raw data and then do some sort of permutation testing or bootstrapping to get a sense of the variation in your summary statistics. This approach has problems, too, though, in that you need to decide how to do the permutations or sampling. Complete randomness can give misleading answers, and there’s a whole literature, with no good answers, on how to bootstrap or perform permutation tests on time series, spatial, and network data. Indeed, when you get right down to it, a permutation test or a bootstrapping rule corresponds to a sampling model, and that gets you close to the difficulties of generative models that we’ve already discussed.

So . . . no easy answers! But, whatever procedure you do, I recommend you check it using fake-data simulation.

1. Given some experience trying to explain modern statistics (post p < .05) to non statisticians (and them not seeming to hate it), I am playing with the ideas and wording of fake universes.

"To address the challenge uncertainty about what would repeatedly happen, statistics need to utilize abstract (fake) representations of reality. Literally making fake universes to represent for instance side effects to a new treatment in which we can discern what would repeatedly happen in such a fake universe. That will enable us to make sense of what happens in our universe (the reality of what actually happens to the treated patients) and especially the uncertainty of that.

Statisticians usually have used probability models to make these fake universes. Probability models not only allow the creation of almost any fake universes they also simply answer what would repeated happen (e.g. with simulation). Unfortunately, that creation of almost any fake universes means we can inadvertently choose to represent reality in very silly and damaging ways.

For statistics to do more harm than good, the chosen fake universes must reasonably represent reality. However statisticians will seldom have a good sense on their own as to what is reasonable in other’s areas of expertise. "

• Opps “more good than harm”.

• I think that’s a great idea and have been doing the same myself.

Statistical reasoning is intrinsically counterfactual in that we consider that an outcome could have come out differently even after we’ve observed that outcome. I’ve always found this the hardest part of stats to teach people. Sure, Rapinoe scored the winning goal in the recent world cup match. It’s in the record book. But as statisticians, we like to pretend she might not have and talk about the chance of her scoring that goal. The complicated thing is that we talk about that chance conditionally, not absolutely. So we might ask the question before the match started, when there were 30 minutes left, or in the very situation where she goes to make the kick. The answer is a different probability each time, so the probability isn’t part of the event per se, but conditional on how much we know about it.

I worked on modal and temporal logics for natural language semantics before I learned probability theory. Thus I tend to think of points in the sample space as possible worlds. That is, each point in the sample space determines everything (relevant) about a possible way the world can be. Then evaluating a random variable at a point in the sample space feels is very much taking the extension of a term in an intensional logic. The indexical logic is identical, evaluating operators like conjunction pointwise in both systems. The only thing new in probability theory is that the possible worlds are assumed to form a particular kind of algebra with a particular form of measure on top.

• > I’ve always found this the hardest part of stats to teach people.
Agree, why its important and why traditional stats curriculum will likely continue to fail for most people.

> but conditional on how much we know about it.
I would rather say, determined by how much of what we know that we choose to represent in a probability model (reserving the word conditioning for conditioning on the data (i.e. using ; notation rather than | notation).

> each point in the sample space determines everything (relevant) about a possible way the world can be
I would put that in the parameter space – each point in the parameter space determines everything (relevant) about how often observations will repeatedly occur in that possible way the world can be. The prior then represents how often one chooses to represent how often various worlds reasonably compatible with our knowledge of our world would occur (Bayesian reference set).

> modal and temporal logics for natural language semantics and
> indexical logic is identical, evaluating operators like conjunction pointwise in both systems
Stuff I likely need to learn about.

Thanks for the input.

• Martha (Smith) says:

“I would rather say, determined by how much of what we know that we choose to represent in a probability model (reserving the word conditioning for conditioning on the data (i.e. using ; notation rather than | notation)”

“I would put that in the parameter space – each point in the parameter space determines everything (relevant) about how often observations will repeatedly occur in that possible way the world can be. The prior then represents how often one chooses to represent how often various worlds reasonably compatible with our knowledge of our world would occur (Bayesian reference set).”

These both make sense to me.

2. This looks like some kind of abstracted genetic regulation problem, but I couldn’t follow the details. I got lost without a definition of G[i], which kept recurring in subsequent definitions.

In the very first picture, it looks like arcs are forbidden from crossing blue balls, but allowed to cross at least one red ball (please excuse my obsession with boundary conditions if it’s irrelevant here).

To allay your suspicions, you need to read your hypothesis tests very literally. And be careful that they only reject the null, which is not usually the hypothesis of interest. There may be nothing special about pairs (biologically?), but that may be what the hypothesis test is testing. If you don’t like that, you need to change the test. That may cause difficulty in either formulating an appropriate likelihood or computing it (hence all the simple approximations Andrew’s talking about). The hypothesis tests (or Bayesian decision-theoretic equivalents thereof) may also be problematic (hence the suggestion to test with simulated data).

And the hypothesis tests, which is why Andrew suggests using simulated data to see if they do what you think they’re doing in controlled cases.

3. Michael Nelson says:

As an applied statistician, I know nothing about this type of question. As an amateur mathematician, I know nothing about this type of question…but I do have an instinct that it might be easier to answer if translated into the language of pure mathematics. Pure mathematicians know far more than we do about these kinds of structures, partly because often they can be better understood in terms of polynomials. “There are several known results that assert that a combinatorial structure satisfies certain combinatorial property if and only if an appropriate polynomial associated with it lies in a properly defined ideal” (Noga Alon, 1999).

I may be completely misreading the question (see above, “I know nothing…”), but I see some similarities between this problem and the multidimensional necklace splitting problem (https://arxiv.org/pdf/math/0610800.pdf). If, in the problem at hand, you were to limit each cell to containing only one color of ball, you would have what mathematicians call a one-dimensional necklace; the necklace-splitting problem is to determine how many sub-necklaces (continuous subsets of cells) you must split the structure into so that the sub-necklaces can be put into g groups so that each group has an identical number of balls of the same color. Allowing multiple colors of balls in a cell (as in this case) extends the problem to multiple dimensions; both the 1D and 2D+ cases have been solved, the latter in terms of probability measures.

The statistical question at hand is more complex than this, because Przedniczek wants to preserve a specific structure in the necklace when forming the groups, and I assume he also wants to allow for overlapping sub-necklaces. But I conjecture that knowing the solution to the necklace splitting problem for a given set of cells will also tell you something about the complexity of that sequence in terms of repetitions. (The more compound structures a necklace contains, the easier it is to separate in into g groups.)

This is obviously a completely different approach, but it may be that if Przedniczek can talk to someone with expertise in this area, they will know of a very similar one that has been addressed in some way that is relevant. Or they may find it intriguing enough to pursue themselves!