Anoop Balachandran writes:

This is the abstract of a paper I am about to publish. My question is: can I really say both training programs were effective for increasing power and function? Studies of similar duration employing sedentary controls showed either negative or 1-2% changes. Also, I don’t think strength and function would improve in older adults due to a placebo effect or natural history. What is your opinion?

I know within-group changes may not mean much. But usually power and function decline or stay the same in older adults. Also, the between-group CI shows at least a small effect in favor of both groups. I know I didn’t have a control group, since studies using one showed negative or 1-2% changes. I just want to make sure I am not exaggerating my case, so any feedback would be helpful.

Objectives: Power training has been shown to be more effective than conventional resistance training for improving physical function in older adults; however, most trials used pneumatic machines for power training. Considering that the general public only has access to plate-loaded machines, the effectiveness and safety of power training using plate-loaded machines compared to pneumatic machines remain uncertain. The purpose of this investigation was to compare the effects of high-velocity training using pneumatic machines (Pn) versus standard plate-loaded machines (PL).

Design: Single-blind, randomized controlled trial

Participants: Independently-living older adults, 60 years or older.

Intervention: Participants were randomized into two groups: pneumatic machine (Pn, n=19) and plate-loaded machine (PL, n=17). After 12 weeks of high-velocity training twice per week, groups were analyzed using an intention-to-treat approach.

Measurements: Primary outcomes were lower body power, measured using a linear transducer, and upper body power, measured using a medicine ball throw. Secondary outcomes included lower and upper body strength, the Physical Performance Battery (PPB), the gallon jug test, the get up and go test, and self-reported function using the Patient Reported Outcomes Measurement Information System (PROMIS) and an online video questionnaire. Outcome assessors were blinded to group membership.

Results: Lower body power significantly improved in both groups (Pn: 19%, PL: 31%), with no significant difference between the groups (Cohen’s d = 0.4, 95% CI (−1.1, 0.3)). Upper body power significantly improved only in the PL group, but showed no significant difference between the groups (Pn: 3%, PL: 6%). For balance, there was a significant difference between the groups favoring the Pn group (d = 0.7, 95% CI (0.1, 1.4)); however, there were no statistically significant differences between groups for PPB, gallon jug transfer, strength, get up and go, or self-reported function. No serious adverse events were reported in either group.

Conclusions: Pneumatic machines were not superior to plate-loaded machines in improving power in older adults. Pneumatic and plate-loaded machines were both effective in improving lower body power and physical function in older adults. The results suggest that power training can be safely and effectively performed using either pneumatic machines or plate-loaded machines among older adults.

I don’t do any power training myself but I thought this could be interesting to share, not so much because of the subject matter, but because it represents the sort of everyday research that goes on all the time, but which we don’t think so much about.

If you have any suggestions for Anoop, just put them in the comments.

Anoop:

This is probably not the kind of feedback you are looking for at this point, but speaking as an older adult (early seventies), I think the practicality and safety of the exercises are important to consider. You are to some extent considering practicality by taking into account that plate-loaded machines are more readily available than pneumatic ones, but your abstract has not addressed the question of whether appropriate exercise without machines (which would be even more accessible than plate-loaded machines) would be as good as using machines. And although you mention safety, you do not mention any outcome variable measuring safety. These are some factors I hope you will take into account in designing future studies.

A colleague did a PhD (2004ish) on a high intensity leg strength exercise for the frail elderly done in the home with basic equipment. Unfortunately, there were quite a few injuries.

http://www.ncbi.nlm.nih.gov/pubmed/12588571

From her webpage it still looks like it’s a continuing interest

http://profiles.bu.edu/display/153464

My population is independently living. I haven’t read the paper, so I’m not sure what their exercise protocol was.

hi Martha,

Great question! In fact, that is my next question. The NIA (National Institute on Aging) has a program using dumbbells and bands. I think it will be interesting to use power training in that program.

I didn’t get space to write more about those in the abstract. I used the outcomes (falls, musculoskeletal, cardiac, doctor visits) suggested for physical activity assessments.

Thank you!

How to interpret null results is one of the things causing a lot of confusion in all areas. The conclusion here is that “Pneumatic machines were not superior to plate-loaded machines in improving power in older adults.” I would have said: “We found no evidence that pneumatic machines are superior to plate-loaded machines in improving power in older adults.” Also, how about repeating the study with a larger sample size? How were these adults selected, and what does 60 years or older mean? What was their baseline fitness level? Were these comparable across the groups?

+1

Psycholinguists especially excel at this (misinterpreting null results). I’ve been politely pointing this out for some years now to specific people but they generally ignore me and call me a cranky data cop who doesn’t understand the important theoretical issues. I have also been accused of being on a witch hunt. What I have learnt so far is that polite reminders don’t work. Maybe Andrew’s way, in-your-face-and-loud-as-hell, is the right way. It may get results fast.

I’d personally go with something like “we found that pneumatic and plate-loaded machines were similarly effective in improving power in older adults.”

Focus on the effect sizes and estimate intervals. Report an interval for the ratio of the two improvements. Obviously I’d like to see a Bayesian estimate of that. Something like “relative effectiveness of Pl (Pl/Pn) was estimated to be between 0.8 and 2.1 (or whatever the numbers would be)”

Since you have prior evidence for Pn causing improvements you can assume it’s a well defined positive denominator so that the ratio is meaningful.
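The ratio-of-improvements idea can be sketched numerically even without Bayesian machinery. The sketch below uses invented fractional improvements for the two groups (the trial’s actual per-participant changes would be substituted) and a nonparametric bootstrap, a non-Bayesian shortcut, to get an interval for the PL/Pn ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented fractional improvements (0.15 = 15%); the real trial's
# per-participant changes for Pn (n=19) and PL (n=17) would go here.
pn = rng.normal(0.19, 0.10, size=19)   # pneumatic group
pl = rng.normal(0.31, 0.10, size=17)   # plate-loaded group

# Nonparametric bootstrap for the ratio of mean improvements (PL/Pn):
ratios = []
for _ in range(10_000):
    pl_mean = pl[rng.integers(0, len(pl), len(pl))].mean()
    pn_mean = pn[rng.integers(0, len(pn), len(pn))].mean()
    ratios.append(pl_mean / pn_mean)

lo, hi = np.percentile(ratios, [2.5, 97.5])
print(f"relative effectiveness PL/Pn: {pl.mean() / pn.mean():.2f} "
      f"(95% bootstrap interval {lo:.2f} to {hi:.2f})")
```

An interval like this directly answers the “how much better, if at all?” question that a significance test does not.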

I think the problem with saying “similar” is that I run the risk of implying it was an equivalence trial. Right?

Hmm, I would love to use a Bayesian estimate, but I have no clue how. Any beginner-level practical book?

Thank you!

The concept of an “equivalence trial” is inherently a null hypothesis testing framework (i.e., testing to see if two things are within a practical epsilon of each other). From the Bayesian perspective, this is just a trial that tries to estimate two quantities, and/or the ratio of those two quantities. Whether something is “equivalent” comes down to whether the estimates are close, or the ratio is estimated to be very close to 1. There’s no testing involved.

Let’s assume you have some vector of measured improvements for Pl and some vector of measured improvements for Pn. Suppose these are scaled as fractional improvements, so that 15% improvement is 0.15 for example. That is, you’re calculating (FinalScore/InitialScore – 1) from the measured data.

Your next step is to decide on a distribution that represents the range of outcomes that would be expected. For example, a normal distribution, but there might be reasons to choose something else. You can then build a model like this (in Stan):

Just based on our intuition that few people double or triple their performance on this kind of task, we can assume that these improvement numbers are on the order of tens of percent:

data {
  int<lower=1> Npl;                 // number of plate-loaded (Pl) participants
  int<lower=1> Npn;                 // number of pneumatic (Pn) participants
  vector[Npl] plimp;                // fractional improvements, Pl group
  vector[Npn] pnimp;                // fractional improvements, Pn group
}
parameters {
  real<lower=0> plavg;
  real<lower=0> pnavg;
  real<lower=0> plsd;
  real<lower=0> pnsd;
}
transformed parameters {
  real ratio = plavg / pnavg;       // relative effectiveness, Pl/Pn
}
model {
  /* we expect positive improvements on average with order of magnitude around 20% */
  plavg ~ gamma(2, 2 / 0.2);
  pnavg ~ gamma(2, 2 / 0.2);

  /* variability in improvement could easily be +- on order 10% */
  plsd ~ gamma(2.0, 2.0 / 0.1);
  pnsd ~ gamma(2.0, 2.0 / 0.1);

  /* observed values assumed to have normal distribution around average */
  plimp ~ normal(plavg, plsd);
  pnimp ~ normal(pnavg, pnsd);
}

And then in the transformed parameters section calculate plavg/pnavg and use those samples to see the posterior distribution of the ratio of the average improvements.

Examples in the Stan reference manual would be where I start.

The parts I’d be concerned about are whether a normal distribution makes sense or if you have some better information. For example, the performance scores are positive numbers, so if you start with a value Foo and the final value is 0 (at the end of things they couldn’t do ANYTHING), then 0/Foo − 1 = −1 is the farthest left your score could be, so a normal distribution could be problematic. Perhaps you should instead work with Final/Initial rather than Final/Initial − 1; Final/Initial is certainly a positive number, so you could use a gamma distribution for Final/Initial as the sampling distribution. These are the kinds of things I’d be concerned about, rather than testing anything.
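As a quick illustration of that last point, here is a small Python sketch with simulated scores (all numbers invented): the ratio Final/Initial is guaranteed positive whenever the scores are, so a gamma sampling distribution has the right support, unlike a normal on Final/Initial − 1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated per-participant scores (invented numbers, n = 19 as in the Pn group).
initial = rng.normal(100, 15, size=19).clip(min=1)
final = initial * rng.gamma(shape=20, scale=1.2 / 20, size=19)  # ~20% mean gain

ratio = final / initial
assert (ratio > 0).all()  # Final/Initial is positive whenever scores are positive

# Fit a gamma to the ratios; fixing loc at 0 keeps the support on (0, inf).
shape, loc, scale = stats.gamma.fit(ratio, floc=0)
print(f"fitted gamma mean: {shape * scale:.2f}")  # close to the sample mean
```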

Hi Shravan,

Thanks for the question. Good suggestion!

And what is a ‘large’ sample size? I think any difference can be made significant with a large sample size. The sample size was selected based on a power calculation prior to the study; the ES selected for the power calculation is debatable, I think. I have the inclusion criteria in the paper, and independently living means they can function without assistance and are not limited in function. Yes, they were comparable, and we adjusted for baseline differences using ANCOVA.
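For context on the sample-size point, an a-priori two-sample power calculation of the kind described can be sketched with statsmodels (the effect sizes here are illustrative, not the trial’s actual ES):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Large assumed effect (Cohen's d = 0.8), alpha = 0.05, 80% power:
n_large = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(round(n_large))   # roughly 26 per group

# Halving the detectable effect roughly quadruples the required n,
# which is why "just run a larger study" is not free:
n_half = analysis.solve_power(effect_size=0.4, alpha=0.05, power=0.8)
print(round(n_half))    # roughly 100 per group
```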

To answer the main question: no, you can’t conclude that both programmes were effective because the experiment did not test that. It may be true though, and you can make an informal comparison with what might have been expected without treatment, but that probably shouldn’t be in the abstract.

Other things I would suggest:

1. Don’t use the word significant. I’d just give the 95% CI (or better still, do a Bayesian analysis and give a 95% credible interval, but that would also involve thinking through what a sensible prior would be – I would expect differences between two methods like this to be small). You could give exact p-values if you’re keen on them, but that risks everyone who reads it automatically dichotomising the result into significant/non-significant. I suppose the confidence interval has that problem too though (so you can’t win!). Another approach would be to use Bayes factors to measure the strength of evidence in favour of one approach being superior – this might be a good thing to do and makes results more informative. There isn’t “no difference” between the groups, so it is of interest to look at what the data suggest about how likely each is to be better.

2. Start the results in the abstract with primary outcomes. They’re supposed to be the most important things.

3. I wouldn’t give a list of secondary outcomes that were “not significant” – that really tells us nothing. The differences might have been tiny or huge. Either include the results or omit.

A few suggestions… hope it’s some help.

They didn’t have a control group getting no treatment, but what happens when people don’t do either kind of exercise is something we have a lot of prior evidence for. In fact, very likely some previous study has tested Pneumatic machines against controls with no treatment (and the author probably knows this literature). So, the point of this study is to compare how effective the two methods are. Given that, I’d say estimate the fundamental thing of interest, the *relative* size of the improvements, in other words (Improvements vs Baseline for Pl)/(improvement vs baseline for Pn)

give a Bayesian high probability density interval for that ratio. It’s the main point of the study.

> Another approach would be to use Bayes factors to measure the strength of evidence in favour of one approach being superior – this might be a good thing to do and makes results more informative.

Really? (Just because it has Bayes labeling does not make it appropriate or defensible.)

Maybe Benjamin and Berger http://ssgac.org/documents/p-value-comments.pdf ?

I am assuming Anoop will be doing their own analyses?

“Proponents of the “Bayesian revolution” should be wary of chasing yet another chimera: an apparently universal inference procedure. A better path would be to promote both an understanding of the various devices in the “statistical toolbox” and informed judgment to select among these.” http://jom.sagepub.com/content/41/2/421.short

I really like the idea of a universal inference procedure, and chasing one. It would undoubtedly not work in many cases. But if it even worked in 50% of cases, that would be a fantastic improvement. I think the lack of high quality routine data analysis procedures can be interpreted as a primary cause of the replication crises. In the long term, I am hopeful for a future where checklists are common in social science and data analysis, as they have become in medicine. Universality may be beyond our grasp, but I think we might eventually be able to approximate it.

And such a thing actually exists!… in a certain sense. You can define it and prove theorems about it (e.g., its total summed expected prediction error on any computable sequence is bounded by a constant) but it is not itself computable.

If perpetual motion even worked in 1% of cases, that would be a fantastic ;-)

More seriously we want the “statistical toolbox” to be full of default methods that are easy to use but also that _scream_ when applied in inappropriate situations – see http://statmodeling.stat.columbia.edu/2010/07/15/quote_of_the_da/

Unfortunately we _know_ this is impossible even for the simple case of the 2by2 table – The Simpson’s paradox unraveled http://ije.oxfordjournals.org/content/40/3/780.full.pdf (Here though I think causality is sufficient but it’s not necessary http://statmodeling.stat.columbia.edu/2016/09/08/its-not-about-normality-its-all-about-reality/ ).

I really think we need to represent the _reality_ that generated the data – not too wrongly – in order to have any assurance of a sensible statistical analysis.

Hi Simon,

that’s what I thought. Thank you!

1. I used it because my prof needs it. I think Dr. Cummings mentioned it to me too :). Any book suggestions for Bayesian analysis?

2. My primary outcome was power.

3. Why not? Because we didn’t power for those?

Thanks a lot!

Whether something is “statistically significant” or not tells you more about how well you can measure it than about what the value is. Suppose you test two people, one from group A and one from group B, to see how much they can lift; one lifts 100 pounds, one lifts 500 pounds. Suppose some significance test is performed. There are only TWO people, one in each group, so not much can be learned from that using null hypothesis tests, and no statistically significant difference can be detected. Should you report “no statistically significant difference in ability to lift was found,” or should you report “person A lifted 100 pounds and person B lifted 500”?

If you can see the difference between the two approaches, you will see some of what Bayesian statistics is about. The fact that 500 seems like a lot more than 100 is down to our prior knowledge that very few people in the whole world can lift 500 pounds! Given our information, that difference is substantial, but given the rather silly assumption that both values came from some anonymous random number generator whose parameters we have no knowledge of, we would be unable to say much. Reporting the actual lifts, interpreted in the light of what we already know, is Bayesian estimation; reporting only whether a test detected a difference is Frequentist testing of a null hypothesis.

1. Richard McElreath’s Statistical Rethinking

2. Sorry, didn’t express that well – I meant best to give the results for the comparison of the groups first (as that’s the main point) but I guess it’s OK to point out that both groups improved. I think I was really just getting hung up on the “significantly”.

3. I meant in the abstract – it seems important to me to put in effect sizes and numerical results if results are to be included. In the main paper obviously all the results would appear in full.

This seems like a good place to do a non-inferiority or equivalence analysis. The key trick will be choosing the largest difference which would be considered as not meaningfully different — and you’ll need to lock this in before doing the analysis (no fair changing it later).
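A minimal sketch of such an equivalence analysis, using the two one-sided tests (TOST) procedure with an invented margin and invented data (as noted above, the margin must be locked in before seeing the real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Invented fractional improvements; the margin must be chosen before analysis.
pn = rng.normal(0.19, 0.10, size=19)   # reference: pneumatic
pl = rng.normal(0.31, 0.10, size=17)   # candidate: plate-loaded
margin = 0.10                          # largest difference treated as "not meaningful"

# TOST for equivalence: reject both one-sided nulls
#   H01: mean(pl) - mean(pn) <= -margin
#   H02: mean(pl) - mean(pn) >= +margin
p_lower = stats.ttest_ind(pl + margin, pn, alternative="greater").pvalue
p_upper = stats.ttest_ind(pl - margin, pn, alternative="less").pvalue
p_tost = max(p_lower, p_upper)

print(f"TOST p = {p_tost:.3f}; equivalence declared: {p_tost < 0.05}")
```

Shifting each sample by the margin turns the two one-sided equivalence hypotheses into ordinary one-sided t-tests; equivalence is declared only if both are rejected.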

That seems sensible – Anoop is missing the placebo group and somehow must impute one (the whole group) using judgement, or meta-analyses of trials that had placebo groups, or some mix of the two (judgement and empirical evidence).

This approach is also referred to as network meta-analysis or indirect comparisons.

It is very tricky to do a credible and convincing analysis when the placebo group is completely missing, but in areas where a placebo group is unethical, it’s unavoidable.

I agree. Also, just harder with 3 groups to get enough sample size.

Hey Clark,

You are 100% right! And that was my problem. I wanted to do a non-inferiority design, but there is no clinical difference reported for power. I used a large ES difference to calculate power since, considering the cost of pneumatic machines, only a large ES can justify their use. If I had to do a non-inferiority trial, what ES would I choose? Half of 0.8?

Thank you!

I’m not exactly an expert on this sort of thing, but Wellek’s book “Testing Statistical Hypotheses of Equivalence and Noninferiority” suggests (Table 1.1, p. 16) using something corresponding to standardized effect sizes of .36 as a strict tolerance or .74 as a liberal tolerance when doing equivalence or non-inferiority testing based upon a two-sample t-test. As you suggest, this sounds like a good case for a non-inferiority test.

I strongly disagree with the general concept of a noninferiority test. If I am someone who is thinking of doing a new exercise regimen if you tell me

“we were unable to detect a difference as big as 0.25 in a trial of 50 people 25 of whom tried pneumatic and 25 of whom tried plate machines” that tells me something, but not much.

For example, suppose the pneumatic people declined by 0.11 and the plate machine people declined by 0.18, and so therefore plate machines are not statistically inferior to pneumatic machines… (but, by the way, you didn’t bother to tell me that both were worse than useless).

Whereas if you tell me “on average people using pneumatic machines improved by .19 and people using plate machines improved by somewhere between 0.75 and 2.2 times as much as the pneumatic machines” then I’m going to say to myself “well, no matter which machine I use I’m going to improve by around 20% and maybe with the plate machine I might even improve a little more” and that’s going to give me the information I want.

The information the patient wants is “how much does each of my possibilities do for me on average?” not “is there less than 95% chance that if these two processes are the output of a random number generator, the average value of each random generator is within 0.2 of the other?”

What the non-inferiority test will tell the patient is whether there is a practical difference between the results they would get with the different machines (assuming we can agree on what the smallest practical difference is); actually I’m describing an equivalence test, non-inferiority would indicate that one machine is at least equivalent to the other, if not superior. If there is no practical difference in the results, and one machine costs twice as much (or takes twice as much effort or time), then a well-informed patient would choose the machine with the smaller cost. I as a consumer would find this valuable information. If there is a difference between the results, then I’d want to know something about how the results differ. Of course, there’s no reason not to present the patient with all the information.

But non-inferiority or equivalence tests (or hypothesis tests in general) are a terrible way to do decision making, and the information from a non-inferiority test is technically a single bit (yes or no). The information from the estimate of a ratio of the improvements is exactly the information you’d need to plug into a cost-benefit analysis. If the cost difference is 3% more what should I do? If the cost is 5% more? if the cost is 10% more? what if the cost is 3% less? etc etc.

An informed patient should make cost benefit tradeoffs, and hypothesis testing of non-inferiority is bad specifically because it only gives really crappy information for the purposes of decision making.

Yes, if one costs a million dollars and the other costs $1 and you can’t detect a practical difference you can easily make a decision, but outside of the really simple boneheaded case, a “yes” or “no” answer to a noninferiority test is of extremely limited usefulness.

> at least equivalent to the other, if not superior.

“Not much worse” is a more apt phrase for this.

In drug regulation, it is considered especially bad to approve something that is worse than placebo and the non-inferiority machinery is to help avoid doing just that. If the new drug is not much worse than the approved drug (which was accepted as better than placebo) then the new drug won’t be worse than placebo.

In other contexts (actually all contexts really) Daniel is right – you need to consider cost-benefits.

Here, if one thought it would be OK to lose 10% of the pneumatic effect over placebo – that would set out the non-inferiority analysis (but of course one should do this for various % of the pneumatic effect over placebo).

Maybe a place to start (with worked examples) would be http://www.fda.gov/downloads/Drugs/…/Guidances/UCM202140.pdf but this is far from exemplary and there are reasons it has been stuck in comments for 6 years.

Just came across this, which may be a better first read for anyone interested in doing a non-inferiority analysis:

Through the looking glass: understanding non-inferiority, by Jennifer Schumi and Janet T. Wittes: https://trialsjournal.biomedcentral.com/articles/10.1186/1745-6215-12-106?utm_source=BMCSite&utm_medium=LP&utm_campaign=10thAnnoi

The Bayesian version is just better in every way, including that it agrees with Wald’s theorem.

Suppose you have some outcome which is a positive number, where bigger is better. We’re going to compare the outcome under condition A (a pre-existing reference treatment) vs condition B (a proposed alternative treatment).

p(R) is a density over the range [0,inf) for the ratio of effects under B vs under A

What is the probability that B is superior? that’s the same as integrate(p(R), 1,inf)

what is the probability that B is better than 10% inferior? that’s the same as integrate(p(R),0.9,inf)

what is the probability that B is 10% inferior or worse? integrate(p(R),0,0.9)

which treatment should we choose? use posterior on effect size for B, A and net cost function for a given effect size/treatment to choose the one with minimum expected cost (or max expected benefit if you prefer)

Expectedcost(A) = integrate(p(A)costA(A),0,inf)

Expectedcost(B) = integrate(p(B)costB(B),0,inf)

choose whichever is smaller. Yes, costA and costB must be decided on by each patient to trade off dollars, side effects, expected benefits, etc. But even a too-simple version of this analysis will be much more nuanced than a decision based on “yes or no B is non-inferior to A”
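With posterior draws in hand (e.g., from a fitted Stan model), every integral above reduces to a simple average over samples. A sketch with stand-in draws and toy cost functions (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in posterior draws; in practice these come from the fitted model.
R = rng.lognormal(mean=0.2, sigma=0.3, size=20_000)  # ratio of effects, B/A
A = rng.normal(0.19, 0.04, size=20_000)              # effect size under A
B = A * R                                            # implied effect under B

# Each integral against the posterior density is a Monte Carlo average:
p_superior = np.mean(R > 1.0)   # integrate(p(R), 1, inf)
p_ok = np.mean(R > 0.9)         # integrate(p(R), 0.9, inf)
p_bad = np.mean(R <= 0.9)       # integrate(p(R), 0, 0.9)

# Toy cost functions: benefit is -effect; B carries a small extra cost.
cost_A = np.mean(-A)
cost_B = np.mean(-B + 0.02)

choice = "A" if cost_A < cost_B else "B"
print(p_superior, p_ok, p_bad, choice)
```

Changing the extra-cost term and re-running answers the whole family of “what if it costs 3% more? 5%? 10%?” questions that a single yes/no non-inferiority verdict cannot.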

The non-inferiority machinery addresses a different decision problem – one in which the loss if B is worse than placebo is infinite, serious if B has less than x% of A’s superiority over placebo and zero otherwise.

Whether one wants to use Bayes is another matter but even here I would argue for Bayes with an informative prior about the missing placebo group.

Now there are better ways to make decisions about A versus B – but that is not what non-inferiority machinery is about.

The non-inferiority machinery attempts to substitute the real and continuous problem we have, with a fake-deterministic binary decision. If B is found to be inferior to A then B should never be used. If B is found to be non-inferior, then use the cheaper one. Never mind anything else, never mind the 5% chance that 95% intervals don’t contain the truth, never mind the patient’s actual tradeoffs, etc. It’s a kind of numeracy poison with a delicious candy coating soaked in the pheromones of bureaucrats.

I share that general dislike for non-inferiority testing. To me non-inferiority would be a conclusion that you come to informed by the data. People seem to have an ingrained habit of wanting a “test” which (as Daniel says) reduces the information to a binary yes/no. But a result that just tells you whether something is “non-inferior” or “not non-inferior” is not very useful.

Plus there is the issue that non-inferiority margins are often unrealistic – I’ve seen clinical trials where a 5% difference in mortality is regarded as non-inferior. Nobody actually thinks 5% worse mortality is OK, but you just need to assume this to make the sample size sums look OK. To me that is just crazy, and symptomatic of a situation where the limitations of the methodology are overriding the important clinical issues.

> the limitations of the methodology are overriding the important clinical issues.

I would certainly agree with that even in the research literature on non-inferiority.

What likely underlies that is not explicitly realizing that the missing placebo group is somehow being imputed to get at “rather than reducing mortality by 10% by using the active (the believed-to-be-effective) treatment, one might only get (at 90% confidence/credibility) a 5% reduction.”