Fascinating. Is there somewhere I can look to see how this could be worked out?

]]>Thanks!

]]>Most hatchery fish in Oregon and Washington are marked with a fin clip and/or a coded wire tag/passive integrated transponder (PIT) tag. It’s kind of amazing how quickly fish can be tagged and the massive number of fish that are tagged. While they’re only inserted in a subset of hatchery (and natural origin) fish, PIT tags are kind of amazing to consider as a dataset. There are something like 12 million fish tags in the PTAGIS database. (www.ptagis.org)

Regarding avian predation, this is not such a big problem at the hatcheries themselves (netting and other mitigations can take care of it), but it is a huge issue in the Columbia River hydropower system. Tags don’t wind up in trees, but they do wind up on island bird colonies. That’s where some really cool Bayesian modelling comes in. Here’s a shameless plug for some buddies of mine, but there are some pretty great papers on this:

Quantifying Avian Predation on Fish Populations: Integrating Predator-Specific Deposition Probabilities in Tag Recovery Studies

http://www.birdresearchnw.org/Hostetter%20et%20al.%20-%202015%20-%20Quantifying%20Avian%20Predation%20on%20Fish%20Populations%20.pdf

And this one just came out:

Avian Predation on Juvenile Salmonids: Spatial and Temporal Analysis Based on Acoustic and Passive Integrated Transponder Tags

http://www.tandfonline.com/doi/full/10.1080/00028487.2016.1150881

Is this correct?”

No, not in the least. In fact, for any single observed confidence interval, there is almost a 1/3 probability that one of the endpoints is closer to the “true” value than is the point estimate. There is no normal distribution within the confidence interval.

]]>Once you have the posterior distribution you can answer lots of questions such as those you mention. The point is that the Bayesian model, using both prior data and a probability model that gives a plausibility for the very specific result we are interested in, answers the real-world research questions that scientists actually have, such as “how much credence should I give to the idea that there really are less than 5% escaped fish?” Answer: a 7% chance that less than 5% of the fish in the river are escapees.

]]>>”using qbeta(.95,7,67) we’d get that there’s a 95% Bayesian probability that there is less than 15.6% farm escapees”

To compare to the CIs shouldn’t you instead calculate this interval?:

> qbeta(.025,7,67)

[1] 0.03942901

> qbeta(.975,7,67)

[1] 0.1703611

Also, shouldn’t we do:

> pbeta(0.05,7,67)

[1] 0.07231941

It looks like, given this model/data, the probability the proportion of farmed fish is less than the 5% threshold equals 0.07. This tells quite a different story than the binary “is the cutoff inside the interval or not” method.

]]>I looked through the docs for that function and it didn’t look like any of the methods was a bootstrap CI. Which one are you referring to?

]]>That’s the usual way confidence intervals are interpreted, i.e. as if they were credible intervals from a posterior probability distribution. I see this all the time in medical things. It’s in the original post too: “…they reply that it is more likely the real value lies near the 10% point estimate since the confidence has the shape of a normal distribution.”

Why? Because that makes sense to people, and it’s really hard for people to get their heads around the idea that a confidence interval is just an interval and doesn’t represent a probability distribution.

Thanks for the reply!

I especially like the idea of the modeling approach using Stan.

]]>Finite-sample correction – that is the name I was looking for. Thanks!

Do you think this could be of any help with respect to the original question?

]]>Before moving to Oregon I had never been to a fish hatchery, and I would have thought your question was a very reasonable one. Now, however, I’ve seen vast pools densely filled with jumping, darting fish, and I would shudder to try to count them. But, according to a comment above, the farmed fish may all be tagged, which is amazing, and so a count could in fact be easy to do. However: losing a fish from the farm certainly doesn’t mean it’s in the water in the wild. At least here, I’d guess that the majority of lost fish are eaten by raptors, and so are likely to be found in trees! (The big mystery to me is why the hatcheries aren’t constantly being invaded by swarms of eagles and ospreys…) Fish hatcheries are fascinating, by the way.

]]>Did I miss something here? Why don’t they just count the farm fish to find out how many escaped? Isn’t that under the farmer’s control? It seems crazy to estimate the fish in the wild when you have them in an enclosed space (which is what I assume a farm means).

]]>Oops you already said that!

]]>You could also bootstrap the CI.

]]>Jrc:

No need to cite recent work on this. The finite-sample correction is well known in survey sampling, it’s in every textbook on the subject.

]]>Paul, suppose you’re sampling from a finite population and trying to determine the average of some measurement in the population, for example the age of the population.

So, you have N people and have sampled n of them. The average age of the n in your sample is x which is observed, and the average age in the rest of the population is X which is not observed.

Then the average age of the full population is XX = (nx + (N-n)X)/N

You can use the information in your sample of n people with average age x to get a Bayesian uncertainty interval over X. You may also have uncertainty about the exact size N of the population. You may have information about the approximate size of some biasing factors in your sampling procedure, you may have information about roundoff errors in the calculation of the sample average x, etc. Each of these factors can be quantified, at least approximately, using Bayesian probability intervals. Using software like Stan you can easily calculate a posterior over the overall average XX as defined above, either using X and N as parameters, or treating them as generated quantities based on some other model that does have parameters; for example, you might fit a distributional form to the data and then generate a plausible sample for the unknown values from the fitted distribution.
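The combination above can be sketched with a quick Monte Carlo. This is a minimal sketch with hypothetical numbers, assuming numpy is available; a simple normal approximation for the uncertainty in X stands in for a full Stan model:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1000        # assumed total population size (hypothetical)
n = 100         # sample size (hypothetical)
x = 40.0        # observed sample mean age (hypothetical)
s = 12.0        # observed sample standard deviation (hypothetical)

# Posterior draws for X, the mean age of the N - n unsampled people.
# Uncertainty about X scales roughly like s / sqrt(n).
X_draws = rng.normal(loc=x, scale=s / np.sqrt(n), size=50_000)

# Full-population mean XX = (n*x + (N - n)*X) / N combines the
# exactly known sampled part with the uncertain unsampled part.
XX_draws = (n * x + (N - n) * X_draws) / N

print(XX_draws.mean(), XX_draws.std(), X_draws.std())
```

Because the sampled fraction n/N is known exactly, the spread of XX is only (N − n)/N times the spread of X, which is the finite-population effect discussed elsewhere in this thread.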

When you’re dealing with explicit sampling from a finite population, resampling/bootstrap procedures can make good sense. There’s a sense in which a bootstrap resample is like a sample from one of the high-probability models for the distribution of the values in the whole population. I think this can be made formal with the appropriate assumptions.

]]>A few no-names (Alberto Abadie, Susan Athey, Guido W. Imbens, Jeffrey Wooldridge) have some recent work on that:

http://www.cemmap.ac.uk/uploads/220114_Imbens.pdf

“In this paper we investigate the justification for positive standard errors in cases where the researcher estimates regression functions with data from the entire population. We take the perspective that the regression function is intended to capture causal effects, and that standard errors can be justified using a generalization of randomization inference. We show that these randomization-based standard errors in some cases agree with the conventional robust standard errors, and in other cases are smaller than the conventional ones.”

Also, in your toy example of sampling 99 out of 100, you can just use arithmetic. There is a 100% chance that the true population percentage is within 1 percentage point of your estimate! But that isn’t about “causal effects”; that is just about estimating proportions in a population (which is closer to the current fish example than the framework above, but I think that paper is an interesting enough thought experiment that it was worth passing along).

]]>If you’re sampling 60 and see 6, and there really are only around P=500 total population in a given year, then a small finite-sample correction would be in order. Basically you know there are 6 + f*(P-60) escapees, where f is your uncertain fraction and P is your uncertain population. But with f on the order of 10%, and 60 on the order of 10% of P, your finite-sample error is on the order of 1%, which is a lot less than the sampling error in f or the estimation error in P (even with a camera system you could easily imagine missing maybe 10% of the fish), so it’s marginal whether it matters that much. On the other hand, in a year where you only get P = 150 or 200 fish, 60 starts to be a big deal!
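For concreteness, the standard textbook finite population correction multiplies the standard error by sqrt((N−n)/(N−1)). Plugging in the population sizes mentioned above (plain Python, just arithmetic):

```python
import math

def fpc(N, n):
    """Finite population correction factor applied to the standard error."""
    return math.sqrt((N - n) / (N - 1))

n = 60  # fish sampled
# At P ~ 500 the correction barely moves the standard error...
print(round(fpc(500, 60), 3))  # ~0.939, about a 6% reduction
# ...but at P ~ 150 it is no longer negligible.
print(round(fpc(150, 60), 3))  # ~0.777, about a 22% reduction
```

At P around 500 the correction is a few percent; at P around 150 it cuts the standard error by over a fifth, which is the “big deal” regime.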

As to the catching and associated mortality, I think that’s a really good reason to be smart about the data analysis, and try to use your information as efficiently as you can. As Rahul says, looking for clipped fins in the camera data, or including time-series information on fish counts. There is probably a “wave” of fish, they don’t all arrive on the same exact day, and perhaps wild and escaped salmon tend to arrive at different times within the season for example, so instead of capturing 60 fish at once you might be better off capturing 20 fish at 3 different times and using your background information about time-series to get better estimates with the same sample size.

]]>That’s a good point. And if it’s an internal mark, like a cheap PIT tag, you could get an electronic signal of each farmed fish passing upstream. It seems like one solution to the problem would simply be mandating marking of all farmed fish in such a way that they can be enumerated moving upstream either by a PIT detector or by camera identification at the existing enumeration point. Then it’s a question of a census and not a sampling problem. Less invasive than netting for the fish as well.

]]>That doesn’t surprise me at all considering how important both the salmon fishing and the hydroelectric power businesses are in the northwest, but it does seem that the Pacific situation is perhaps quite different from Norway’s, where only 500 to 3000 wild salmon visit the river. I went to a salmon hatchery outside Eugene, Oregon for a half-day tour about a decade ago. There they were doing things like intentionally dropping the eggs down a rough sloped bed to try to kill off the weaker embryos and select for stronger juveniles… But there they were actually trying to bolster the natural populations as you say, not produce human food for market.

We’ve had several fisheries questions on this blog over the last 4 or 5 years, and my impression is that the use of Bayesian fitting of mechanistic models is becoming “a thing” at least among a certain population of researchers, so that’s encouraging.

]]>If it is an external mark, are the camera-acquired images not good enough to distinguish it without a catch?

Alternatively, are these tags not readable from a distance, at least in those narrow river portions?

]]>“In the 2014 BiOp, NOAA Fisheries assumes very specific numerical benefits from habitat improvement. These benefits, however, are too uncertain and do not allow any margin of error. Further, a key measure of survival and recovery employed in the 2014 BiOp already shows a decline, but NOAA Fisheries has discounted this measurement, concluding that it falls within the 2008 BiOp’s “confidence intervals.” Those confidence intervals, however, were so broad, that falling within them is essentially meaningless.”

http://earthjustice.org/sites/default/files/files/1404%202065%20Opinion%20and%20Order.pdf

]]>This is a multi-million dollar business in the Pacific Northwest. Lots of people are looking into just these types of questions and there is already a rich literature. Not to mention the ongoing experiment in using hatcheries to recover natural populations.

]]>Additionally, shouldn’t there be a finite population correction in here?

]]>Selective pressures being different is the main thing I’d think of as well. The information they’re collecting might actually be useful to evaluate whether the captive-bred salmon really are less successful as juveniles, and how much that impacts things in general. Perhaps they are less effective, but then, because of that, they tend to select out of the population anyway, so you can tolerate them. Or, perhaps they are too effective as adults and compete in the oceans for food, wiping out the natives, and then when they come back and lay eggs their offspring die off… so that they tend to push the population towards extinction. Perhaps their mixing in of genes selected for disease resistance actually helps bolster the wild salmon over the medium to long term.

There’s lots of potentially interesting modeling to do here in terms of iterated seasons of competition and cross-breeding, and with a detailed dataset of fish counts you could do a lot more effective model fitting than just with a few small sampling surveys.

]]>Again, I can’t speak to Norway, but for Pacific salmon hatchery fish are typically identified with either an external mark (clipped adipose fin) or an internal mark (coded wire tag or PIT tag).

Given the imperiled status of wild stocks of Atlantic salmon, I imagine the wild fish are gently released back to the river. Farmed fish are likely euthanized.

]]>Regarding how the threshold was defined: I can’t speak to Norway, but I can speak to Pacific salmon in the Columbia basin. One paper I see often cited is this one by Mike Ford: https://swfsc.noaa.gov/uploadedFiles/Events/Meetings/Fish_2015/Document/8.2_Ford_2002_Cons_Biol.pdf

The idea is that fitness in the captive populations is selected differently than fitness in wild populations. Specifically, hatchery fish don’t have to compete for food in a natural river during the juvenile lifestage. All these fish have to do is eat their food pellets, grow nice and fat, and resist the diseases that inevitably occur when you crowd a bunch of fish in pens. But as a consequence these fish may be more fit as adults than wild fish because they had the opportunity to grow large as juveniles. So the idea is that they may out-compete wild fish on the spawning grounds, but that the resulting offspring will be less fit as juveniles because they do not have the wild genes that help optimize survival as a wild fish.

In practice though, these thresholds are set somewhat arbitrarily as a result of negotiation in stakeholder meetings (often including federal and state regulators, tribal representatives, and, if in a river with a hydroelectric project, the dam owner or other landowners).

]]>You should hire him Andres! :) He really is damn good (I think).

]]>That’s fine, and expert guesses can be helpful, but the point is that the existing threshold is maybe not based on much information compared to the information currently being collected. So if you’re going to the trouble to collect huge amounts of information with cameras at narrow points in the river etc., you might as well actually do some analysis and come up with some new science-based decision making.

Normally I don’t, but I’ll plug my little consulting company here http://www.lakelandappliedsciences.com/ because this sounds like something I could really help Andres with and I have various background that is relevant.

]]>To get a Bayesian probability interval for this problem I’d probably use a beta-binomial model with a fairly strong prior. I mean, you’re pretty sure that there are some escapees, but nowhere near 50% of the fish are farm escapees, right? So you could start with something like a beta(1,7) prior (which has 99.2% probability that the frequency is less than 0.5), and then if you see say 6 out of 60 in your sample, your new probability over the frequency is given by the curve dbeta(x,7,67), and using qbeta(.95,7,67) we’d get that there’s a 95% Bayesian probability that there are less than 15.6% farm escapees.

The prior information is very real here, and most likely you have even more prior information than beta(1,7) expresses. Most likely MUCH more. beta(3,30) sounds like it’d be a reasonable choice based on this kind of background info and the idea that those numbers are about what has been seen in the past.

]]>How does one tell them apart? Phenotype? Or genetic testing?

Also: Do those 60 end up as dinner or is it catch-and-release?

]]>You are too harsh. :) I’d rephrase: *“Someone, somewhere had to set a threshold &, lacking the resources (time, money) to create a model or a detailed analysis, used his prior knowledge & intuition to select a number for the threshold”*

There are many approaches to calculating such intervals:

> require(binom)

Loading required package: binom

Warning message:

package ‘binom’ was built under R version 3.2.5

> binom.confint(6,60)

method x n mean lower upper

1 agresti-coull 6 60 0.1000000 0.04320331 0.2049342

2 asymptotic 6 60 0.1000000 0.02409092 0.1759091

3 bayes 6 60 0.1065574 0.03645465 0.1843301

4 cloglog 6 60 0.1000000 0.04069043 0.1909142

5 exact 6 60 0.1000000 0.03759127 0.2050577

6 logit 6 60 0.1000000 0.04562248 0.2052514

7 probit 6 60 0.1000000 0.04325646 0.1979359

8 profile 6 60 0.1000000 0.04102707 0.1923304

9 lrt 6 60 0.1000000 0.04100430 0.1923297

10 prop.test 6 60 0.1000000 0.04131130 0.2116995

11 wilson 6 60 0.1000000 0.04664283 0.2014946

I haven’t investigated the properties of these methods at all, but searching around you can find this paper, in which the author says the method you used (apparently called “asymptotic” in the R function above) does not have the nominal coverage:

“Method 1, the simplest and most widely used, is very anti-conservative on average, with arbitrarily low CP for low p. Indeed, the maximum coverage probability is only 0.959; min DNCP is 0 and min MNCP is 0.0205. In this evaluation with p ≤ 0.5, the deficient coverage probability stems from right non-coverage; the interval does not extend sufficiently far to the right, as evidenced by the high frequency of ZWIs and the fact that a large part of the calculated interval may lie beyond the nearer boundary, 0.”

http://www.ncbi.nlm.nih.gov/pubmed/9595616 (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.7107&rep=rep1&type=pdf)
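The anti-conservatism of the Wald (“asymptotic”) interval can be checked exactly for the case discussed in this thread, since the coverage is just a finite sum over binomial outcomes. A sketch assuming scipy is available, with n = 60 and a true proportion of 0.1:

```python
from scipy.stats import binom, norm

n, p = 60, 0.10
z = norm.ppf(0.975)  # ~1.96

# Exact coverage: add up the binomial probability of every outcome x
# whose Wald interval phat +/- z*sqrt(phat*(1-phat)/n) contains the true p.
coverage = 0.0
for x in range(n + 1):
    phat = x / n
    half = z * (phat * (1 - phat) / n) ** 0.5
    if phat - half <= p <= phat + half:
        coverage += binom.pmf(x, n, p)

print(coverage)  # roughly 0.94, below the nominal 0.95
```

No simulation noise here: the loop enumerates all 61 possible outcomes, so the shortfall from 95% is exact for this (n, p).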

>”We have asked this question to the government, but they reply that it is more likely the real value lies near the 10% point estimate since the confidence has the shape of a normal distribution.”

The way confidence intervals are calculated means that the coverage (eg 95%) refers to the percent of intervals that are supposed to include the “real” value upon repeatedly sampling and calculating the CIs. They seem to be thinking of Bayesian credible intervals. Many times for simple estimates like this the confidence interval approximates the credible interval using a uniform prior, so this confusion does not cause much practical issue. I am not sure about the case of proportions though.
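That last point is easy to probe numerically. A sketch assuming scipy is available: compare the Wald interval from the table above with the central credible interval from a uniform beta(1,1) prior. For this n they are in the same ballpark but not identical, consistent with the caveat about proportions.

```python
from scipy.stats import beta, norm

x, n = 6, 60
phat = x / n

# Wald ("asymptotic") interval, as in the binom.confint table
z = norm.ppf(0.975)
half = z * (phat * (1 - phat) / n) ** 0.5
wald = (phat - half, phat + half)

# Central credible interval from a uniform beta(1, 1) prior
post = beta(1 + x, 1 + n - x)  # posterior beta(7, 55)
cred = (post.ppf(0.025), post.ppf(0.975))

print(wald)  # ~(0.024, 0.176)
print(cred)  # similar ballpark, but shifted right at this sample size
```

The two intervals agree to within a few percentage points here; the credible interval sits a bit to the right because the posterior respects the skew near the 0 boundary.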

]]>or, was there some genetic population dynamics model that said “when we put 5% into our model then we have an unacceptable rate of genetic mixing, so we should try to keep the population in the river below 5%”? (5% is a population-level parameter based on theory with no sampling error included; now a decision analysis is needed to balance risk given limited information from a sample)

Because, in the first case the 5% is chosen as a sample proportion and could be based on the uncertainty already, whereas in the second case it’s the population parameter that is important, and so the uncertainty is not built in to the decision.

Though, knowing how this stuff is usually done, my guess is that someone simply set 5% in some rule-book somewhere based on no analysis whatsoever just that it “seemed like a low enough value”.

]]>