## A question of experimental design (more precisely, design of data collection)

An economist colleague writes in with a question:

What is your instinct on the following. Consider at each time t, 1999 through 2019, there is a probability P_t for some event (e.g., it rains on a given day that year). Assume that P_t = P_1999 + (t-1999)A. So P_t has a linear time trend. What we wish to understand is whether P_t is growing or shrinking over time and by how much.

Gathering data is manual and costly. Is it better to gather 100 observations from 1999 and 100 from 2019, or is it better to gather 10 observations per year for 20 years? My instinct is to gather 100 from 1999 and 100 from 2019, on the grounds that this will have more power.

Yes, this is a standard problem in experimental design, and to a first approximation it is best to gather data from the extremes; i.e., your intuition is correct. You can see this through the formula for the standard error of the regression coefficient (it’s all in closed form when you’re just estimating a linear regression with one predictor). There is one complication, though: you have discrete data, so if the probabilities are close to 0 or 1 you’ll need more data to compensate for the additional variation.
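As a rough sketch of that closed-form comparison, assume (for simplicity only) a common residual standard deviation $\sigma$ across years; for binary outcomes the variance is really $P_t(1-P_t)$, so this is an approximation. The spread-out design is taken here to be 10 observations at each of 20 evenly spaced times on $[1999, 2019]$, which is an assumption, since the question's range of 1999 through 2019 actually covers 21 calendar years.

$$
\operatorname{se}(\hat A) \approx \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (t_i - \bar t)^2}}
$$

$$
\text{Endpoints: } \sum_i (t_i - \bar t)^2 = 200 \times 10^2 = 20000,
\qquad
\text{Spread: } \sum_i (t_i - \bar t)^2 = 10 \times \left(\tfrac{20}{19}\right)^2 \times \frac{20\,(20^2 - 1)}{12} \approx 7368
$$

$$
\frac{\operatorname{se}_{\text{spread}}(\hat A)}{\operatorname{se}_{\text{endpoints}}(\hat A)} \approx \sqrt{\frac{20000}{7368}} \approx 1.65
$$

Under these assumptions, the spread-out design would need roughly 2.7 times as many observations to match the precision of the endpoint design for the slope.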

Simplest way to figure out the optimal solution (and also to make it understandable) is to do a simulation.
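A minimal sketch of such a simulation follows. The starting probability (0.3), the slope (0.01 per year), the use of 20 evenly spaced time points for the spread-out design, and the function name `simulate_slope` are all illustrative assumptions, not values from the question. The slope is estimated by ordinary least squares on the 0/1 outcomes.

```python
import numpy as np

def simulate_slope(years, n_per_year, p_start=0.3, slope=0.01,
                   n_sims=2000, seed=0):
    """Return (mean, sd) of OLS slope estimates across simulated datasets."""
    rng = np.random.default_rng(seed)
    t = np.repeat(np.asarray(years, dtype=float), n_per_year)
    p = p_start + slope * (t - 1999.0)       # P_t = P_1999 + (t - 1999) A
    y = rng.random((n_sims, t.size)) < p     # one row of 0/1 outcomes per dataset
    tc = t - t.mean()                        # centered time, so sum(tc) == 0
    slopes = (y * tc).sum(axis=1) / (tc ** 2).sum()   # OLS slope per dataset
    return slopes.mean(), slopes.std(ddof=1)

# Design 1: 100 observations in 1999 and 100 in 2019.
mean_end, sd_end = simulate_slope([1999, 2019], 100)
# Design 2 (assumed layout): 10 observations at each of 20 evenly spaced times.
mean_spread, sd_spread = simulate_slope(np.linspace(1999, 2019, 20), 10)

print(f"endpoints: mean slope {mean_end:.4f}, sd {sd_end:.4f}")
print(f"spread:    mean slope {mean_spread:.4f}, sd {sd_spread:.4f}")
```

Both designs should recover the assumed slope on average; the comparison of interest is the standard deviation of the estimates, which plays the role of the standard error. Changing `p_start` and `slope` lets you see how the answer shifts when the probabilities approach 0 or 1.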

Also perhaps relevant is this article I wrote a long time ago.

P.S. As with all problems, this all gets more and more interesting the more you look into it. In practice, yes, you can best estimate the slope from points at the edge of the range, but that seems wrong, somehow. But, as discussed in my linked article, even if you gather data from the middle of the curve, it can be hard to pick up on nonlinearity. So much depends on the ultimate goals of your data collection and analysis.

P.P.S. I looked through the comments, and . . . please read the article I linked to above, Should we take measurements at an intermediate design point? It directly addresses many of the things you’re talking about!!!

1. Dale Lehman says:

As many people have pointed out repeatedly (on this blog), you must have a model, and your model contains assumptions which may or may not be correct. The question assumes there is a linear time trend. Then, surely having 100 data points from the first and last years is better than 10 data points for each year. However, I suspect there is little practical reason for it to be safe to assume there is a linear time trend. More commonly, you would want to see if there is a linear trend, or indeed any trend, in the data. Then, the beginning and end points tell you very little. This does not contradict anything Andrew is saying, but the question seems to run the risk of implicitly assuming something that needs to be investigated. I could be wrong – perhaps it is a simple mathematical question, but too often I’ve seen assumptions (such as a linear time trend) made implicitly without evaluation.

• Brent Hutto says:

I share the same thoughts, Dale, and at the risk of getting nitpicky over semantics I’d say comparing 100 measurements in 1999 to 100 measurements in 2019 answers a slightly different question than “growing or shrinking over time”. Comparing just the endpoints answers the question “Does P_t differ at 2019 versus 1999?”. Imagine a scatter plot with, say, 10 measurements per year on which you superimpose a fitted trend and uncertainty bounds. Now imagine a scatter plot with 100 measurements each at 1999 and 2019 with superimposed indications of central tendency and uncertainty at each of those two times. I personally find the first scatter plot more informative in a situation where the true behavior of P_t is poorly understood. Who knows what might happen year to year, and for all we know (as the problem is stated) there may be things that make certain years unusual.

• Daniel H. says:

This sounds like something you can check with fake-data simulation (a proposal I only can make because of reading this blog).
Really, set up a generative model that includes various effects, including a (quasi)-linear trend, a year-wise variation and some other fun things (maybe confounders to other variables)? I could imagine doing a game-theory-related approach where it’s a good idea to throw more weight on the outer years, but not limit it to the extremes only.

• Z says:

This is what I came to the comment section to write, and you did it better than I was going to.

2. Although this isn’t an interpolation problem due to the noise in measurements, it’s useful to think about the Chebyshev interpolation points for nonlinear fitting. The way you get those is to do uniform angle increments and take the cos to find the x coord.

So with two points it’s the endpoints. With three you include the midpoint. With four you take the endpoints and the cos(pi/3) and cos(2pi/3) points, i.e. ±0.5…

This is for the interval -1,1 so you shift and scale that interval…

Why Chebyshev points? For interpolation, because it avoids the Runge phenomenon of wild oscillations. Basically it weights points near the edges higher, since the endpoints provide information that’s constrained to the interval.
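The construction described in this comment can be sketched in a few lines. This version uses uniform angle increments from 0 to pi including both ends (the Chebyshev extrema points), so with four points the increment is pi/3 and the interior points on [-1, 1] are at ±0.5. The function name `chebyshev_points` is a hypothetical helper, and mapping the points onto 1999 through 2019 is just the shift and scale mentioned above.

```python
import numpy as np

def chebyshev_points(n, lo=-1.0, hi=1.0):
    """n points from uniform angle increments on [0, pi], mapped to [lo, hi]."""
    theta = np.linspace(0.0, np.pi, n)   # uniform angles, endpoints included
    x = np.cos(theta)[::-1]              # cos gives x in [-1, 1]; reverse to ascending
    return lo + (hi - lo) * (x + 1.0) / 2.0   # shift and scale the interval

for n in (2, 3, 4):
    print(n, np.round(chebyshev_points(n), 3))

# Candidate sampling years on the 1999-2019 range for a four-point design.
years4 = chebyshev_points(4, 1999, 2019)
print(np.round(years4, 1))
```

Compared with evenly spaced years, the points cluster toward the two ends of the range while still leaving some observations in the interior.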

3. The following paper by List, Sadoff, Wagner has answers to this specific question and to many more questions like that, specifically targeted to experiments in economics: https://www.nber.org/papers/w15701

• Dale Lehman says:

What I find interesting in the paper you cite are statements such as the following: “For example, if the analyst is interested in estimating the effect of treatment and has strong priors that the treatment has a linear effect, then the sample should be equally divided on the endpoints of the feasible treatment range, with no intermediate points sampled.” I think the emphasis should be on the “priors.” Too often, the emphasis is on the mathematically correct conclusion and the importance of these priors is not appreciated. It may be that experimental economists are better practitioners of this than other fields, but I think economists are often guilty of glossing over such critical assumptions. Again, I can think of very few practical applications where it is safe to assume a linear time trend. Indeed, I think the question of whether (and what type of) a time trend exists is more interesting than estimating its average slope over the entire time period.

• Sandro Ambuehl says:

Generally, in economics, if you make functional form assumptions, you need to be prepared to have good arguments to justify them; you’ll certainly be asked in seminars. This holds for theorists as much as for empiricists (though Nobel prize winners like Heckman get away with more…). At least that’s been the case since I’ve started (around 2012). Things might have been quite different before the “credibility revolution” and the empirical turn in economics.

• Andrew says:

Sandro, Dale:

See P.P.S. above.

• Dale Lehman says:

Let’s use the context of the question you were asked about: the probability of an event over time. Your demonstration of the superiority of collecting end point data over intermediate data contains in part “What prior information is available on δ and T? We first consider δ. If the treatment effect is monotone, then δ must be …” I can think of problems where monotonicity is a reasonable prior: e.g., the proportion of eCommerce sales over time. Then we might wish to estimate the rate at which this is growing. More interesting would be whether the rate itself is growing over time, or not, and data on the endpoints may not be sufficient for this. I realize we might assume a quadratic relationship, and that might get us somewhere. However, I think the more interesting question would be whether the rate at which the eCommerce proportion of retail sales grows is monotonic or not. Similarly, most of the interesting time series questions I can easily think of involve questions about whether or not the trend is monotonic.

I realize that your paper is talking about treatment effects and the original question is about time trends. But I think these contexts are different in terms of what a reasonable prior might be. Treatment effects would normally either be monotonic or quadratic (there are exceptions, but I think this covers the majority of cases). Time trends do not seem so readily characterized to me, at least the interesting ones I can think of. Maybe my imagination is too limited.

4. Art Owen says:

For plain regression the optimal but not always wise thing is to put half the points at the left and half at the right. Points in the middle allow you to check up on linearity at the cost of higher variance for the slope. There is old work (Peter Huber I think) on uniform spacing having some sort of optimal robustness.

For logistic regression the optimal design actually depends on the true unknown parameter values. The optimum is to put half the x’s where P(Y=1|x) is something like 15% and half where it is something like 85%. Of course we don’t know those points. This leads to sequential Bayesian methods. Chaloner and Verdinelli I think.
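To illustrate where a figure in that neighborhood can come from, here is a small sketch for a two-parameter logistic regression under the D-optimality criterion, restricted to designs that put half the points where P(Y=1|x) = p and half where it equals 1 - p. Under that restriction the determinant of the Fisher information is proportional to (p(1-p) logit(p))^2, which the code maximizes on a grid. The coefficients `alpha` and `beta` used to translate the probabilities into x values are made-up numbers; as the comment says, in practice they are unknown.

```python
import numpy as np

def d_criterion(p_hi):
    """Proportional to det of Fisher information for a symmetric two-point design."""
    z = np.log(p_hi / (1.0 - p_hi))   # logit of the upper design probability
    w = p_hi * (1.0 - p_hi)           # binomial variance weight at each point
    return (w * z) ** 2

p_grid = np.linspace(0.51, 0.99, 4801)
p_best = p_grid[np.argmax(d_criterion(p_grid))]
print(f"best upper probability: {p_best:.3f}, lower: {1 - p_best:.3f}")

def design_x(p, alpha, beta):
    """x at which P(Y=1|x) = p when logit P = alpha + beta * x."""
    return (np.log(p / (1.0 - p)) - alpha) / beta

# Hypothetical parameter values, purely for illustration.
alpha, beta = -1.0, 0.1
x_lo, x_hi = design_x(1 - p_best, alpha, beta), design_x(p_best, alpha, beta)
print(f"design points: x = {x_lo:.2f} and x = {x_hi:.2f}")
```

Because `design_x` depends on `alpha` and `beta`, a different guess at the parameters moves the design points, which is the motivation for the sequential Bayesian approaches mentioned above.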

It is usual for the optimum to involve the same number of distinct x’s as there are parameters in the model, making it hard to test the model.

5. Zhou Fang says:

Counterpoint: In most real world cases, a much more reasonable model for this kind of data would be something like

P_t = P_1999 + (t-1999)A + E_t

where E_t is a year-specific random variation term. In that kind of construction, just piling on measurements at the end-points will run into serious pseudo-replication issues, so you would indeed be better off spreading out your measurements over time. This is independent of any effects of nonlinearity….
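A fake-data sketch of this point, under assumed values: E_t is drawn as independent normal noise with standard deviation `tau`, probabilities are clipped to [0, 1], and the baseline (0.4), slope (0.01), and the 20 evenly spaced sampling times are illustrative choices rather than anything from the original question. The function name `slope_sd_with_year_effects` is hypothetical.

```python
import numpy as np

def slope_sd_with_year_effects(years, n_per_year, tau, p_start=0.4,
                               slope=0.01, n_sims=4000, seed=1):
    """Sd of the OLS slope when each sampled year has its own random shock E_t."""
    rng = np.random.default_rng(seed)
    years = np.asarray(years, dtype=float)
    # One shock per (simulation, year); all observations in a year share it.
    e = rng.normal(0.0, tau, size=(n_sims, years.size))
    p_year = np.clip(p_start + slope * (years - 1999.0) + e, 0.0, 1.0)
    counts = rng.binomial(n_per_year, p_year)   # events observed in each year
    ybar = counts / n_per_year                  # yearly observed proportions
    tc = years - years.mean()
    # With equal n per year, OLS on raw 0/1 data equals OLS on yearly means.
    slopes = (ybar * tc).sum(axis=1) / (tc ** 2).sum()
    return slopes.std(ddof=1)

endpoints = [1999, 2019]
spread = np.linspace(1999, 2019, 20)

results = {}
for tau in (0.0, 0.15):
    results[tau] = (slope_sd_with_year_effects(endpoints, 100, tau),
                    slope_sd_with_year_effects(spread, 10, tau))
    print(f"tau={tau}: endpoints sd {results[tau][0]:.4f}, "
          f"spread sd {results[tau][1]:.4f}")
```

With no year effects the endpoint design gives the more precise slope, but once the year-to-year variation is large enough, the extra observations within 1999 and 2019 mostly re-measure the same two shocks and the spread-out design comes out ahead.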

6. Daniel L Speyer says:

The dependent variable is a *probability*. Therefore we know the relationship is extremely nonlinear. Also, therefore, the concept of delta as defined in the linked paper doesn’t really apply. The difference between 0.9999 and 0.8 may be less than 25%, but it’s qualitatively completely different.

• It’s nonlinear, but not necessarily over the time period in question. Like if the probability starts at 0.02 and increases by 0.003 per year for 20 years, up to 0.08 that could definitely be linear-ish over that time frame.

7. Yuling Yao says:

If the experiment is adaptive, is it useful to use loo or “gradient of data” to optimize the next stage design?

8. Ron Kenett says:

Andrew – nice 1999 paper. The discussion, so far, has ignored experimental design optimality properties (except Art Owen). The design layout should be related to the study goals. A quick review follows:

D-optimal designs are used in experiments conducted to estimate effects in a model. Their main application is in designs with an experimental goal of identifying the active factors.

A design is A-optimal if it minimizes the sum of the variances of the regression coefficients.

I-optimal designs minimize the average variance of prediction over the design space. If the primary experimental goal is to predict a response or determine regions in the design space in which the response falls within an acceptable range, the I-optimality criterion is more appropriate than the D-optimality criterion.

A related approach is G-optimal designs, which minimize the maximum prediction variance over the design region.
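For the straight-line model in the original question, these four criteria can be computed directly from the design matrix. The sketch below rescales time to [-1, 1], treats the error variance as 1, and compares the endpoint design with an assumed spread-out design (10 observations at each of 20 evenly spaced points). It uses the usual definition of D-optimality as maximizing det(X'X); the function name `design_criteria` is a hypothetical helper.

```python
import numpy as np

def design_criteria(x, n_grid=2001):
    """D, A, I, G criteria for the model y = a + b*x with unit error variance."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    M = X.T @ X                       # information matrix X'X
    M_inv = np.linalg.inv(M)          # covariance of (a, b) up to sigma^2
    grid = np.linspace(-1.0, 1.0, n_grid)
    G = np.column_stack([np.ones_like(grid), grid])
    # Prediction variance g' (X'X)^{-1} g at each grid point g = (1, x).
    pred_var = np.einsum("ij,jk,ik->i", G, M_inv, G)
    return {
        "D": np.linalg.det(M),        # larger is better
        "A": np.trace(M_inv),         # smaller is better
        "I": pred_var.mean(),         # average prediction variance (grid approximation)
        "G": pred_var.max(),          # maximum prediction variance
    }

endpoints = np.repeat([-1.0, 1.0], 100)
spread = np.repeat(np.linspace(-1.0, 1.0, 20), 10)

crit_end = design_criteria(endpoints)
crit_spread = design_criteria(spread)
for name in ("D", "A", "I", "G"):
    print(f"{name}: endpoints {crit_end[name]:.5g}, spread {crit_spread[name]:.5g}")
```

If the linear model is taken as given, the endpoint design does at least as well on all four criteria here; none of these criteria measures the ability to detect a departure from linearity.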

9. Cat says:

Please, explain the connection between linear regression and ‘Causal Inference’.

• Andrew says:

Cat:

See our book, Regression and Other Stories, which has four chapters on causal inference and other discussion of causal interest throughout.

10. John N-G says:

Since I deal with trends in rainfall events in my research, my immediate reaction was to worry about autocorrelation. Temporal autocorrelation (in the departures from the statistical model) would tend to favor choosing many observations at the period endpoints. Autocorrelation among synchronous samples (such as spatial autocorrelation for rainfall) would tend to favor observations distributed throughout the time period.

Depending on the magnitude of the autocorrelations, this consideration can be controlling.

11. Mendel says:

My intuition is that confining the sampling to the endpoints reduces redundancy to the point where you’ll be unable to discover errors in your experimental design (which can be a good thing if you’re a cynic and want to make your study hard to criticize).
If you limit yourself to sampling the ends, wouldn’t it be harder to discover any confounding factors that you might have missed? The data you’re getting can’t deviate from your model, there is no possibility of a misfit in any way.

For example, use a differently/wrongly calibrated instrument ten years later, and you may see a change where there is none.

12. Daniel Trejo says:

I once had a project in which we needed to identify the parameters of a bilogistic fit to a growth curve from an automated plate reader. The goal was to find the optimal experimental design; my idea was to set up a Bayesian A-optimal experimental design and let the posterior identify the best sampling points for estimating the curve parameters (based on a discrete list of possible sample points, given by the timing of the robot’s actions).

The project took other turns and I never really had a chance to work on this idea. I would be interested if someone has a reference on an idea like the one I described above.