Building a regression model . . . with only 27 data points

Dan Silitonga writes:

I was wondering whether you would have any advice on building a regression model on a very small dataset. I’m in the midst of revamping the model to predict tax collections from unincorporated businesses. But I only have 27 data points, 27 years of annual data. Any advice would be much appreciated.

My reply:

This sounds tough, especially given that 27 years of annual data isn’t even 27 independent data points.
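
To make the non-independence point concrete, here is a rough sketch in Python (not part of the original exchange). It estimates the lag-1 autocorrelation of an annual series and plugs it into the common AR(1) approximation n_eff ≈ n(1 − ρ)/(1 + ρ) for the effective number of independent observations. The series `collections` is made-up illustrative data, and the AR(1) form is an assumption, so treat the result as a heuristic rather than a diagnosis.

```python
def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a sequence of numbers."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    den = sum((x - mean) ** 2 for x in series)
    return num / den


def effective_sample_size(series):
    """AR(1) heuristic: n_eff = n * (1 - rho) / (1 + rho)."""
    rho = lag1_autocorr(series)
    return len(series) * (1 - rho) / (1 + rho)


# Hypothetical data: 27 years of steadily growing collections.
collections = [100 + 5 * t for t in range(27)]

rho = lag1_autocorr(collections)
n_eff = effective_sample_size(collections)
print(f"lag-1 autocorrelation: {rho:.2f}")
print(f"effective sample size: {n_eff:.1f} (nominal n = {len(collections)})")
```

For a strongly trending series like this one, the heuristic reports far fewer than 27 effective observations.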

I have various essentially orthogonal suggestions:

1 [added after seeing John Cook’s comment below]. Do your best, making as many assumptions as you need. In a Bayesian context, this means that you’d use a strong and informative prior and let the data update it as appropriate. In a less formal setting, you’d start with a guess of a model and then alter it to the extent that your data contradict your original guess.
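
As a toy illustration of an informative prior being updated by a few data points, here is a minimal sketch, assuming a one-predictor regression through the origin with a known noise standard deviation and a normal prior on the slope. The function name `posterior_slope`, the prior settings, and the data are all hypothetical; a real analysis would use a richer model and a tool such as Stan.

```python
def posterior_slope(x, y, prior_mean, prior_sd, noise_sd):
    """Conjugate update for beta in y = beta * x + noise.

    Assumes noise ~ Normal(0, noise_sd) with noise_sd known,
    and prior beta ~ Normal(prior_mean, prior_sd).
    Returns the posterior mean and standard deviation of beta.
    """
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = sum(xi * xi for xi in x) / noise_sd ** 2
    post_prec = prior_prec + data_prec
    weighted_sum = prior_mean * prior_prec + sum(
        xi * yi for xi, yi in zip(x, y)
    ) / noise_sd ** 2
    return weighted_sum / post_prec, post_prec ** -0.5


# Hypothetical data and prior.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
post_mean, post_sd = posterior_slope(x, y, prior_mean=1.5, prior_sd=0.5, noise_sd=1.0)

# Least-squares slope through the origin, for comparison.
ols_slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
print(f"prior mean 1.50, least-squares slope {ols_slope:.2f}, posterior mean {post_mean:.2f}")
```

The posterior mean lands between the prior guess and the least-squares slope, and moves closer to the data as more observations arrive or as the prior is made weaker.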

2. Get more data. Not by getting information on more years (I assume you can’t do that) but by breaking up the data you do have, for example by geography, or class of business, or size of business, or some other factor. Or could each business be a data point? What I’m getting at is, it seems that you must have a lot more than 27 pieces of information you could analyze.

3. With a small n and many predictors, you often can’t come to a good story about what is happening but you can still rule out a lot of potential stories. For example, suppose you have 20 candidate predictors. You can’t just throw these into a regression. But you can correlate each of the predictors with the outcome, one at a time, and discover either a very close predictive relation with one or more of the separate predictors, or no such relation. Either way, you’ve learned something. It ain’t nothing to know that none of these 20 inputs determines the output all by itself.
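
A minimal sketch of the one-at-a-time screening in point 3, using only the Python standard library. The outcome and the predictor names (`payroll`, `rainfall`, `permits`) are invented for illustration; only the procedure is the point.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical outcome and candidate predictors (six years shown for brevity).
outcome = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
predictors = {
    "payroll": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "rainfall": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
    "permits": [7.0, 6.0, 8.0, 5.0, 9.0, 7.0],
}

# Correlate each predictor with the outcome separately, never all at once.
correlations = {name: pearson(values, outcome) for name, values in predictors.items()}

# List predictors from strongest to weakest marginal relation.
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>10s}  r = {r:+.2f}")
```

With 20 candidates and 27 observations, some correlations will look sizable by chance alone, so this is a tool for ruling stories out more than for ruling them in.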

4. You can combine predictors. For example, if you have 5 similar predictors, each measuring some aspect of a common input, you can average them (after rescaling, if necessary) and then use that average as a single predictor. Bill James did that sort of thing in his baseball analyses. Instead of throwing all his variables into a regression, he’d use theory (of a sort) and some data analysis to compute composite scores such as “runs created” and then use these composites in his further analyses.
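
The averaging step in point 4 can be sketched as follows. This is one simple way to do it, assuming z-score rescaling and equal weights; the three input series and the name `activity_index` are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale a series to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite(predictors):
    """Average several rescaled predictors into a single score per observation."""
    scaled = [zscore(p) for p in predictors]
    return [mean(column) for column in zip(*scaled)]


# Hypothetical related measures of business activity, on very different scales.
employment = [50, 52, 55, 53, 58]
payroll = [1.0e6, 1.1e6, 1.3e6, 1.2e6, 1.5e6]
new_firms = [200, 210, 260, 230, 300]

activity_index = composite([employment, payroll, new_firms])
print([round(v, 2) for v in activity_index])
```

The regression then uses `activity_index` as one predictor instead of three, which spends one degree of freedom rather than three.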


  1. John says:

    We build regression models with fewer data points all the time in dose-finding methods. We start with no data and update the model every time a patient outcome becomes available. Obviously we’re grasping at straws, but doing the best we can.

  2. K? O'Rourke says:

    The biggest thing you don’t have is random assignment – right?
    (Pretty sure John has this in the dose-finding methods)

    With small amounts of data, I kind of think you need some sense of explanation, even if just to get those good assumptions, so there is some hope of out-of-sample prediction.

    Point 4 was nicely covered by Mosteller and Tukey in their regression book and is often a sensible strategy (one that failed for me when I tried to get quality scores in meta-analysis).

    Point 3 is often easy to forget – there might be something obvious about some related aspect that’s worth pinning down.

    • jimmy says:

      hi keith, many (most?) dose-finding trials are not randomized, i believe. consider that the 3+3 design is probably the most popular dose-finding trial design. (much better designs exist, which are also not randomized.)

      • K? O'Rourke says:

        I understood John’s dose finding to be for an optimal dose (somehow balancing benefits and side effects) rather than phase one finding of an acceptably safe non-fatal dose.

  3. Vlad says:

    You could also use Charles Ragin’s comparative analysis instead of the regression.


    Advice on a regression model for a small sample (27 data points):

    As we know, the purpose of a regression model is to predict the dependent variable (here, tax collections from unincorporated businesses).

    With a small sample it would be better to analyze the functional relationship, but if the goal is prediction, the following should be kept in mind:

    Some very common measures, such as the correlation coefficient and the coefficient of determination R², can give a misleading idea of the predictive ability of the estimated model. R² is a criterion for assessing the explanatory power of regression models and represents the percentage of the variance accounted for by the independent variable. The coefficient R² is a measure of the linear relationship between two variables. Neither R² nor the correlation coefficient is well suited to evaluating a model’s predictions; at best they measure how well the equation fits the data, not the model’s predictive ability.

    Therefore, clearly predictive measures (defined in Conte et al., 1986) should also be examined, such as:

    • Mean Magnitude of Relative Error, MMRE:
    It is defined as $\mathrm{MMRE} = \frac{1}{n}\sum_{i=1}^{n} \frac{|e_i - \hat{e}_i|}{e_i}$, where $e_i$ is the actual value of the variable, $\hat{e}_i$ is its estimated value, and $n$ is the number of projects. Thus, if the MMRE is small, we have a good set of predictions. A common criterion for considering a model good is MMRE < 0.25.
    • Prediction at Level l, PRED(l):
    Where l is a percentage, it is defined as the number of cases in which the estimates fall within the limit l of the actual values, in relative terms, divided by the total number of cases. For example, PRED(0.1) = 0.9 means that 90% of cases have estimates within 10% of their actual values; PRED(0.25) = 0.9 means that 90% of cases have estimates within 25% of their actual values. A common criterion for accepting a model is PRED(0.25) ≥ 0.75, although some authors relax this requirement.

    In addition, one should always take into account the presence of anomalous cases (outliers), the normality of the data (hard to achieve), collinearity among the independent variables (if the study uses more than one), and so on.

    In short, what we want is to make predictions that are as accurate as possible, regardless of the method. And to measure that predictive ability, one should rely mainly on predictive measures, not only explanatory ones.
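
The MMRE and PRED(l) measures described in this comment can be computed with a few lines of Python. This is a sketch based on the usual definitions (mean of |actual − estimate| / actual, and the share of cases whose relative error is at most l); the function names and the numbers are invented for illustration, and in practice the estimates should come from held-out years.

```python
def relative_errors(actual, predicted):
    """Magnitude of relative error for each case: |a - p| / |a|."""
    return [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]


def mmre(actual, predicted):
    """Mean Magnitude of Relative Error."""
    errors = relative_errors(actual, predicted)
    return sum(errors) / len(errors)


def pred(actual, predicted, level):
    """PRED(level): share of cases whose relative error is at most `level`."""
    errors = relative_errors(actual, predicted)
    return sum(1 for e in errors if e <= level) / len(errors)


# Hypothetical actual values and model estimates.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [105.0, 180.0, 330.0, 520.0]

mmre_value = mmre(actual, predicted)
pred_25 = pred(actual, predicted, 0.25)
print(f"MMRE = {mmre_value:.4f}, PRED(0.25) = {pred_25:.2f}")
```

Here three of the four estimates fall within 25% of the actual value, so this toy model would just meet the PRED(0.25) ≥ 0.75 criterion while also satisfying MMRE < 0.25.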

  5. Gabe says:

    “Get more data. Not by getting information on more years (I assume you can’t do that) but by breaking up the data you do have, for example by geography, or class of business, or size of business, or some other factor.”

    If n is small already, why add more predictor variables to the mix?