Distinguishing logistic curves

https://www.lesswrong.com/posts/fRSf65CxthNMaLnCP/distinguishing-logistic-curves

Contents

Logistic curves

The logistic curves look like this:

Logistic curves can be specified by three parameters, c, l > 0, and k > 0. Their equation is then:

F_{c,l,k}(x) = l\frac{e^{kx}}{e^{k(x-c)}+1}.

Note that this l is different from that in this article. The turning point of this curve is at x=c (where it takes the value of l e^{kc}/2) while its supremum is le^{kc}; it tends to this value as x\to\infty. Take the limit as c\to\infty as being the exponentials:

\Fi(x) = l e^{kx}.

Figuring out the right curve

We’ll imagine a simple Bayesian setup. An analyst of logistic curves is seeing data from one distribution, and has two hypotheses about it: F_{C,L,K}, for values C, L, and K, and F_{c,l,k} with values c, l, and k. We’ll designate F_{C,L,K} by \FC and F_{c,l,k} by \Fc.

Now, the true distribution is \FC, but the analyst doesn’t know that. The question we’re asking is thus:

Noisy Sampling

If the analyst can sample noiselessly from the curve, then three samples should generally suffice to fully establish \FC, and one sample should generally suffice to distinguish \FC from \Fc. So we’ll consider the (realistic) situation where there is noise in the samples.

So assume the analyst samples n points, at \vec{x}=(x_1,x_2,\ldots, x_n). In return, it gets n values, \vec{y}=(y_1,y_2,\ldots, y_n); these are sampled independently from N(\FC(x_i),\sigma_i^2). This is a normal distribution with mean \FC(x_i) and standard deviation \sigma_i.

The analyst is assumed to know the vector \vec{\sigma}=(\sigma_1,\ldots \sigma_n), and indeed everything about this setup, with one exception: whether the means of these normal distributions are \FC(x_i) or \Fc(x_i).

Let P_a be the analyst’s probability distribution. Their prior gives equal weight to both hypotheses: P_a(\FC)=P_a(\Fc)=1/2. Let \Ov be the analyst observing \vec{y} after sampling from \vec{x}; their posterior is then P_a(\Fc \mid \Ov).

Note that, from our perspective, P_a(\FC \mid \vec{x}) is a random variable whose distribution we know. Say that:

We could choose other criteria, and this a relatively loose one. It only assumes three bits of information in favour of \FC over \Fc. Note that since P_a \geq 0, we can get probability bounds on P_a as well, from this result; for instance:

So, for instance, our criteria above ensures that with probability at least 3/4, P_a(\Fc \mid \vec{x}) \leq 1/4. Conversely, since P_a \leq 1, probability bounds on P_a translate into expectation bounds, making the two approaches loosely equivalent. We’ll use expectation bounds, as they are more natural for this random variable.

Bounding results

Our first result, proven in later sections, is a lower bound on the expectation of P_a(\Fc \mid \vec{x}):

\begin{array}{rlr} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \prod_{i=1}^n \left[ 1 - \operatorname{erf}\left(\frac{\delta_i}{\sigma_i 2\sqrt{2}}\right) \right].& (1) \end{array}

Here \operatorname{erf} is the error function and \delta_i is the absolute difference between \FC(x_i) and \Fc(x_i). We can then get the slightly looser but more easily computable bound:

\begin{array}{rlr} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \left[ 1 - \sum_{i=1}^n \frac{\delta_i}{\sigma_i \sqrt{2\pi}} \right].& (2) \end{array}

How to sample

Sampling very large positive or negative values

Note that:

0\leq F_{c,l,k} = l\frac{e^{kx}}{e^{k(x-c)}+1} \leq l e^{kx}.

Hence we can bound the \delta_i via:

\delta_i = | F(x_i)-f(x_i) | \leq \max(L e^{Kx_i}, le^{kx_i}).

Let m(x_i)=\max(L e^{Kx_i}, le^{kx_i}); note this is an increasing function, exponential for very negative x_i.

Assume we sample n' different x_i values below a very negative X; then if \sigma^- is the minimum of all the \sigma_i for x_i \leq X, the contribution of these n' points to the expectation bound is at most 0 and at least:

This gives our result for very negative values:

The behaviour for large positive x_i is also clear: unless le^{kc}=Le^{KC}, f and F must have different asymptotes. So as long as there is an upper bound \sigma^+ on the noise, sampling the curve at large values will cause the expectation of P_a(f\mid \vec{x}) to converge to 0. For large x_i, this is essentially trying to distinguish N(le^{kc},\sigma^+) from N(Le^{KC},\sigma^+), so each extra sample applies a multiplicative factor to the expected value of P_a(\Fc \mid \vec{x}). So, for large samples, the probability of the wrong function converges geometrically to zero in the number of samples.

Finding (any) turning point

So, distinguishing F and f for very low samples is very hard, but distinguishing them for very high samples is generally not very useful. But enough about asymptotic behaviour. The question is: what happens in between, closer to the turning points C and c of \FC and \Fc?

We can make some scaling and translation choices to simplify \FC, setting c=0 and l=k=1. So the turning point is at 0 (y value 1/2) and the supremum is 1:

\FC(x)=\frac{e^x}{e^x+1} = \frac{1}{1+e^{-x}}.

Assume now that the noise \sigma_i is a constant \sigma. We want \Fc to have a different turning point, so that can see how easy it is to identify this turning point. Let’s choose the worst possible scenario: \Fc is an exponential function with no turning point:

\Fc(x)=le^{kx}.

So, how can the analyst sample so that they have the greatest possible chance of distinguishing between a true function with a turning point at 0, and a false function with no turning point at all?

We have two free variables: the k and l of \Fc, and we typically want to see how well we can do when sampling below a given X. For constant \sigma, the elements of the bound are given by:

\frac{\delta_i}{\sigma \sqrt{2\pi}} = \frac{|\FC(x_i) - \Fc(x_i)|}{\sigma \sqrt{2\pi}} =\left|\frac{le^{(k+1)x}+le^{kx}-e^x}{(e^x + 1)\sqrt{2\pi}}\right|\frac{1}{\sigma}.

Define d(l,k)(x) as this function, without the \sigma term. We’ll now consider X=0, ie we are sampling at any point before the turning point. Then some experimentation allows us to minimize d(l,k)(x) for negative values, by setting l=0.51 and k=0.69; given these values, d(l,k)(x) is bounded above by 0.007:

Consequently we can use equation (2) to get a bound:

\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \left[ 1 - \frac{n}{\sigma} 0.007 \right] \end{array}

To establish the difference between \FC and \Fc, we need this below 1/16. Consequently, we need \frac{n}{\sigma}0.007 \geq \frac{7}{8}, or

\frac{n}{\sigma} \geq 125.

So if the noise is 1/200, ie 1% of the value at the turning point, a single data point might suffice. But if the noise is 10% of the value at the turning point, then at least seven samples are needed.

Anyway, that’s all the way to the turning point; what about if X is chosen so that the value \FC(X) is 1/3 (two thirds of the value at the turning point) or 1/4 (a half of the value at the turning point)? To get these, we need X=-\log(2) and X=-\log(3), respectively. We’ll also look at past the turning point, X=\log(2) and \log(3).

Optimising l and k for all five situations give:

But equation (2) gives poor bounds for low \sigma. Using equation (1) instead, for \sigma=1/200 (1% of turning point y-value) and \sigma=1/20 (10% of turning point y-value), gives the number n of samples needed as:

\begin{array}{|c|c|c|} \hline X & \sigma=1/200 & \sigma=1/20 \ \hline\hline -\log(3) & 6 & 69 \ \hline -\log(2) & 3 & 38 \ \hline 0 & 1 & 14\ \hline \log(2)& 1 & 6 \ \hline \log(3)& 1 & 5 \ \hline \end{array}

Other difficulties

The bounds above are only good if the values are sampled independently and close to the peak of the d(l,k) function. If the values are not independent—as values sampled close to each other tend not to be—then more must be sampled, and the same goes if the values are sampled away from the peaks.

The other issue is that, here, we’ve first optimised l and k for minimal peak of d(l,k), then assumed the best x_i were sampled. We need to consider the opposite situations, too: given the sampled x_i, optimise l and k. So, even if n samples are enough to distinguish \FC from this specific \Fc, there are other exponential functions \Fi that would be harder to distinguish from \FC.

Proof

This section will prove the bounds in equation (1) and (2).

By Bayes rule:

\begin{array}{rl} P_a(\Fc \mid \Ov) &= \frac{P_a(\Ov \mid \Fc) P_a(\Fc)}{P_a(\Ov)}\ &= \frac{P_a(\Ov \mid \Fc) P_a(\Fc)}{P_a(\Ov \mid \Fc) P_a(\Fc) + P_a(\Ov \mid \FC) P_a(\FC)}\ &= \frac{P_a(\Ov \mid \Fc)}{P_a(\Ov \mid \Fc) + P_a(\Ov \mid \FC)}, \end{array}

since the prior probabilities are equal. Since the analyst knows the true variances, P_a(\Ov \mid \Fc)=P(\Ov \mid \Fc) and similarly for \FC: we can replace the analyst’s probabilities with the true probabilities. So, contracting P(\Ov\mid\FC) as p_1(\vec{y}) and P(\Ov\mid\Fc) as p_2(\vec{y}), we get:

\begin{array}{rl} P_a(\Fc \mid \Ov) &= \frac{p_2(\vec{y})}{p_2(\vec{y}) + p_1(\vec{y}))} \ &= \frac{1}{1 + p_1(\vec{y})p_2(\vec{y})^{-1}}. \end{array}

To get the true expectation of this P_a, we need to integrate over the possible values of \vec{y}, weighted by the true probability P(\Ov\mid\ \FC)=p_1(\vec{y}) of this happening:

\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &= \int \frac{1}{1 + p_1(\vec{y})p_2(\vec{y})^{-1}} p_1(\vec{y}) d\vec{y} \ &= \int \frac{1}{p_1(\vec{y})^{-1} + p_2(\vec{y})^{-1}} d\vec{y}. \end{array}

Note that p_1(\vec{y}) and p_2(\vec{y})^{-1} are both (strictly) positive, and that 1/(p_1(\vec{y})^{-1} + p_2(\vec{y})^{-1}) is half the harmonic mean of the two.

The harmonic mean of any number of positive elements is bounded below by the minimum value of its arguments. Hence:

\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \int \min(p_1(\vec{y}), p_2(\vec{y})) d\vec{y}. \end{array}

Now, since the noise is independent, p_j(\vec{y})=\prod_{i=1}^n p_j(y_i) where p_1(y_i)=P(y_i \mid x_i, \FC) and p_2(y_i)=P(y_i \mid x_i, \Fc). For positive elements, the minimum of two products is greater than or equal to the product of minimums, so

\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \int_{-\infty}^{\infty} \prod_{i=1}^n \min(p_1(y_i), p_2(y_i)) d\vec{y} \ &\geq \frac{1}{2} \prod_{i=1}^n \int_{-\infty}^{\infty} \min(p_1(y_i), p_2(y_i)) dy_i. \end{array}

The expressions \min(p_1(y_i), p_2(y_i)) can be expressed analytically. If \varphi is the probability density function of N(0,1), the normal distribution with mean 0 and variance 1, then

\begin{array}{rl} p_1(y_i) &= \frac{1}{\sigma_i} \varphi\left(\frac{\FC(x_i)-y_i}{\sigma_i}\right),\ p_2(y_i) &= \frac{1}{\sigma_i} \varphi\left(\frac{\Fc(x_i)-y_i}{\sigma_i}\right). \end{array}

So the two curves are normal curves with the same variance and means \FC(x_i) and \Fc(x_i). Assume, without loss of generality, that \FC(x_i)\leq\Fc(y_i). Then the two functions will be equal at the midpoint \mu_i=(\FC(x_i)+\Fc(x_i))/2, and for y_i \leq \mu_i, p_1(y_i) is higher, while for y_i\geq \mu_i, p_2(y_i) is higher.

Thus

\min(p_1(y_i), p_2(y_i)) = \left{\begin{array}{rl} \frac{1}{\sigma_i} \varphi\left(\frac{\Fc(x_i)-y_i}{\sigma_i}\right), \ y_i \leq \mu_i, \ \frac{1}{\sigma_i} \varphi\left(\frac{\FC(x_i)-y_i}{\sigma_i}\right), \ y_i \geq \mu_i. \end{array}\right.

If \delta_i = |\FC(x_i)-\Fc(x_i)| is the distance between the two peaks, this becomes:

\min(p_1(y_i), p_2(y_i)) = \left{\begin{array}{rl} \frac{1}{\sigma_i} \varphi\left(\frac{\mu_i+ \delta_i/2-y_i}{\sigma_i}\right), \ y_i \leq \mu_i, \ \frac{1}{\sigma_i} \varphi\left(\frac{\mu_i- \delta_i/2 -y_i}{\sigma_i}\right), \ y_i \geq \mu_i. \end{array}\right.

Since the integral of \varphi is 1/2\left[ 1 + \operatorname{erf}(y/\sqrt{2}) \right], for \operatorname{erf} the error function, we can bound the expected probability by:

\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \prod_{i=1}^n \int_{-\infty}^{\infty} \min(p_1(y_i), p_2(y_i)) dy_i \ &\geq \frac{1}{2} \prod_{i=1}^n \left(\int_{-\infty}^{\mu_i} \frac{1}{\sigma_i} \varphi\left(\frac{\mu_i+ \delta_i/2-y_i}{\sigma_i}\right)dy_i + \int_{\mu_i}^{\infty} \frac{1}{\sigma_i} \varphi\left(\frac{\mu_i- \delta_i/2-y_i}{\sigma_i}\right)dy_i \right)\ &\geq \frac{1}{2} \prod_{i=1}^n \left(\int_{-\infty}^{-\delta_i/2} \frac{1}{\sigma_i} \varphi\left(\frac{-y_i}{\sigma_i}\right)dy_i + \int_{\delta_i/2}^{\infty} \frac{1}{\sigma_i} \varphi\left(\frac{-y_i}{\sigma_i}\right)dy_i \right)\ &\geq \frac{1}{2} \prod_{i=1}^n \left( \frac{1}{2}\left[ 1 + \operatorname{erf}\left(\frac{-\delta_i/2}{\sigma_i \sqrt{2}}\right) \right] + \frac{1}{2}\left[ 1 + \operatorname{erf}\left(\frac{-\delta_i/2}{\sigma_i \sqrt{2}}\right) \right] \right)\ &\geq \frac{1}{2} \prod_{i=1}^n \left[ 1 - \operatorname{erf}\left(\frac{\delta_i}{\sigma_i 2\sqrt{2}}\right) \right]. \end{array}

For positive values, the error function is concave, and it has derivative 2/\sqrt{\pi} at the origin, so

\operatorname{erf}\left(\frac{\delta_i}{\sigma_i 2\sqrt{2}}\right) \leq \frac{2}{\sqrt{\pi}}\frac{\delta_i}{\sigma_i 2\sqrt{2}} = \frac{\delta_i}{\sigma_i \sqrt{2\pi}}.

Consequently

\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \prod_{i=1}^n \left[ 1 - \frac{\delta_i}{\sigma_i \sqrt{2\pi}} \right]. \end{array}

Using the fact that for x,y positive, (1-x)(1-y) \geq 1-(x+y), we get a final bound:

\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \left[ 1 - \sum_{i=1}^n \frac{\delta_i}{\sigma_i \sqrt{2\pi}} \right] \end{array}