Contents
- Logistic curves
- Figuring out the right curve
- Noisy Sampling
- Bounding results
- How to sample
- Sampling very large positive or negative values
- Finding (any) turning point
- Other difficulties
- Proof \newcommand{\FC}{F}\newcommand{\Fc}{f}\newcommand{\Fi}{F_{\infty,l,k}}\newcommand{\Ov}{O_{\vec{x}}^{\vec{y}}} This post will attempt to formalise the intuition that "it’s hard to figure out the turning point of a logistic curve, at least until after that turning point". Ashort visual "proof" can also be found here.
Logistic curves
The logistic curves look like this:
Logistic curves can be specified by three parameters, c, l > 0, and k > 0. Their equation is then:
F_{c,l,k}(x) = l\frac{e^{kx}}{e^{k(x-c)}+1}.
Note that this l is different from that in this article. The turning point of this curve is at x=c (where it takes the value of l e^{kc}/2) while its supremum is le^{kc}; it tends to this value as x\to\infty. Take the limit as c\to\infty as being the exponentials:
\Fi(x) = l e^{kx}.
Figuring out the right curve
We’ll imagine a simple Bayesian setup. An analyst of logistic curves is seeing data from one distribution, and has two hypotheses about it: F_{C,L,K}, for values C, L, and K, and F_{c,l,k} with values c, l, and k. We’ll designate F_{C,L,K} by \FC and F_{c,l,k} by \Fc.
Now, the true distribution is \FC, but the analyst doesn’t know that. The question we’re asking is thus:
- Starting from an equal prior on \FC and \Fc, how much of what kind of observation will the analyst need to establish that \FC is the true underlying distribution?
Noisy Sampling
If the analyst can sample noiselessly from the curve, then three samples should generally suffice to fully establish \FC, and one sample should generally suffice to distinguish \FC from \Fc. So we’ll consider the (realistic) situation where there is noise in the samples.
So assume the analyst samples n points, at \vec{x}=(x_1,x_2,\ldots, x_n). In return, it gets n values, \vec{y}=(y_1,y_2,\ldots, y_n); these are sampled independently from N(\FC(x_i),\sigma_i^2). This is a normal distribution with mean \FC(x_i) and standard deviation \sigma_i.
The analyst is assumed to know the vector \vec{\sigma}=(\sigma_1,\ldots \sigma_n), and indeed everything about this setup, with one exception: whether the means of these normal distributions are \FC(x_i) or \Fc(x_i).
Let P_a be the analyst’s probability distribution. Their prior gives equal weight to both hypotheses: P_a(\FC)=P_a(\Fc)=1/2. Let \Ov be the analyst observing \vec{y} after sampling from \vec{x}; their posterior is then P_a(\Fc \mid \Ov).
Note that, from our perspective, P_a(\FC \mid \vec{x}) is a random variable whose distribution we know. Say that:
- \vec{x} establishes the difference between \FC and \Fc if the expectation of P_a(\FC \mid \vec{x}) is less than 1/16.
We could choose other criteria, and this a relatively loose one. It only assumes three bits of information in favour of \FC over \Fc. Note that since P_a \geq 0, we can get probability bounds on P_a as well, from this result; for instance:
- If \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] \leq q/4, then with probability at least 3/4, P_a(\Fc \mid \vec{x}) \leq q.
So, for instance, our criteria above ensures that with probability at least 3/4, P_a(\Fc \mid \vec{x}) \leq 1/4. Conversely, since P_a \leq 1, probability bounds on P_a translate into expectation bounds, making the two approaches loosely equivalent. We’ll use expectation bounds, as they are more natural for this random variable.
Bounding results
Our first result, proven in later sections, is a lower bound on the expectation of P_a(\Fc \mid \vec{x}):
\begin{array}{rlr} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \prod_{i=1}^n \left[ 1 - \operatorname{erf}\left(\frac{\delta_i}{\sigma_i 2\sqrt{2}}\right) \right].& (1) \end{array}
Here \operatorname{erf} is the error function and \delta_i is the absolute difference between \FC(x_i) and \Fc(x_i). We can then get the slightly looser but more easily computable bound:
\begin{array}{rlr} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \left[ 1 - \sum_{i=1}^n \frac{\delta_i}{\sigma_i \sqrt{2\pi}} \right].& (2) \end{array}
How to sample
Sampling very large positive or negative values
Note that:
0\leq F_{c,l,k} = l\frac{e^{kx}}{e^{k(x-c)}+1} \leq l e^{kx}.
Hence we can bound the \delta_i via:
\delta_i = | F(x_i)-f(x_i) | \leq \max(L e^{Kx_i}, le^{kx_i}).
Let m(x_i)=\max(L e^{Kx_i}, le^{kx_i}); note this is an increasing function, exponential for very negative x_i.
Assume we sample n' different x_i values below a very negative X; then if \sigma^- is the minimum of all the \sigma_i for x_i \leq X, the contribution of these n' points to the expectation bound is at most 0 and at least:
- n'\frac{1}{\sigma^- \sqrt{2\pi}} m(X).
This gives our result for very negative values:
- If noise is irreducible below \sigma^-, then sampling below a very negative X will have very little impact on the analyst’s posterior. To get a better result, increasing the X (exponential effect) is generally more powerful than decreasing \sigma^- (inverse linear effect), and much more powerful than getting more samples (linear effect).
The behaviour for large positive x_i is also clear: unless le^{kc}=Le^{KC}, f and F must have different asymptotes. So as long as there is an upper bound \sigma^+ on the noise, sampling the curve at large values will cause the expectation of P_a(f\mid \vec{x}) to converge to 0. For large x_i, this is essentially trying to distinguish N(le^{kc},\sigma^+) from N(Le^{KC},\sigma^+), so each extra sample applies a multiplicative factor to the expected value of P_a(\Fc \mid \vec{x}). So, for large samples, the probability of the wrong function converges geometrically to zero in the number of samples.
Finding (any) turning point
So, distinguishing F and f for very low samples is very hard, but distinguishing them for very high samples is generally not very useful. But enough about asymptotic behaviour. The question is: what happens in between, closer to the turning points C and c of \FC and \Fc?
We can make some scaling and translation choices to simplify \FC, setting c=0 and l=k=1. So the turning point is at 0 (y value 1/2) and the supremum is 1:
\FC(x)=\frac{e^x}{e^x+1} = \frac{1}{1+e^{-x}}.
Assume now that the noise \sigma_i is a constant \sigma. We want \Fc to have a different turning point, so that can see how easy it is to identify this turning point. Let’s choose the worst possible scenario: \Fc is an exponential function with no turning point:
\Fc(x)=le^{kx}.
So, how can the analyst sample so that they have the greatest possible chance of distinguishing between a true function with a turning point at 0, and a false function with no turning point at all?
We have two free variables: the k and l of \Fc, and we typically want to see how well we can do when sampling below a given X. For constant \sigma, the elements of the bound are given by:
\frac{\delta_i}{\sigma \sqrt{2\pi}} = \frac{|\FC(x_i) - \Fc(x_i)|}{\sigma \sqrt{2\pi}} =\left|\frac{le^{(k+1)x}+le^{kx}-e^x}{(e^x + 1)\sqrt{2\pi}}\right|\frac{1}{\sigma}.
Define d(l,k)(x) as this function, without the \sigma term. We’ll now consider X=0, ie we are sampling at any point before the turning point. Then some experimentation allows us to minimize d(l,k)(x) for negative values, by setting l=0.51 and k=0.69; given these values, d(l,k)(x) is bounded above by 0.007:
Consequently we can use equation (2) to get a bound:
\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \left[ 1 - \frac{n}{\sigma} 0.007 \right] \end{array}
To establish the difference between \FC and \Fc, we need this below 1/16. Consequently, we need \frac{n}{\sigma}0.007 \geq \frac{7}{8}, or
\frac{n}{\sigma} \geq 125.
So if the noise is 1/200, ie 1% of the value at the turning point, a single data point might suffice. But if the noise is 10% of the value at the turning point, then at least seven samples are needed.
Anyway, that’s all the way to the turning point; what about if X is chosen so that the value \FC(X) is 1/3 (two thirds of the value at the turning point) or 1/4 (a half of the value at the turning point)? To get these, we need X=-\log(2) and X=-\log(3), respectively. We’ll also look at past the turning point, X=\log(2) and \log(3).
Optimising l and k for all five situations give:
-
For X=-\log(3), \frac{n}{\sigma}0.0015 \geq \frac{7}{8} or \mathbf{\frac{n}{\sigma} \geq 583}.
-
For X=-\log(2), \frac{n}{\sigma}0.0027 \geq \frac{7}{8} or \mathbf{\frac{n}{\sigma} \geq 324}.
-
For X=0, \frac{n}{\sigma}0.007 \geq \frac{7}{8} or \mathbf{\frac{n}{\sigma} \geq 125}.
-
For X=\log(2), \frac{n}{\sigma}0.015 \geq \frac{7}{8} or \mathbf{\frac{n}{\sigma} \geq 58}.
-
For X=\log(3), \frac{n}{\sigma}0.021 \geq \frac{7}{8} or \mathbf{\frac{n}{\sigma} \geq 41}.
But equation (2) gives poor bounds for low \sigma. Using equation (1) instead, for \sigma=1/200 (1% of turning point y-value) and \sigma=1/20 (10% of turning point y-value), gives the number n of samples needed as:
\begin{array}{|c|c|c|} \hline X & \sigma=1/200 & \sigma=1/20 \ \hline\hline -\log(3) & 6 & 69 \ \hline -\log(2) & 3 & 38 \ \hline 0 & 1 & 14\ \hline \log(2)& 1 & 6 \ \hline \log(3)& 1 & 5 \ \hline \end{array}
Other difficulties
The bounds above are only good if the values are sampled independently and close to the peak of the d(l,k) function. If the values are not independent—as values sampled close to each other tend not to be—then more must be sampled, and the same goes if the values are sampled away from the peaks.
The other issue is that, here, we’ve first optimised l and k for minimal peak of d(l,k), then assumed the best x_i were sampled. We need to consider the opposite situations, too: given the sampled x_i, optimise l and k. So, even if n samples are enough to distinguish \FC from this specific \Fc, there are other exponential functions \Fi that would be harder to distinguish from \FC.
Proof
This section will prove the bounds in equation (1) and (2).
By Bayes rule:
\begin{array}{rl} P_a(\Fc \mid \Ov) &= \frac{P_a(\Ov \mid \Fc) P_a(\Fc)}{P_a(\Ov)}\ &= \frac{P_a(\Ov \mid \Fc) P_a(\Fc)}{P_a(\Ov \mid \Fc) P_a(\Fc) + P_a(\Ov \mid \FC) P_a(\FC)}\ &= \frac{P_a(\Ov \mid \Fc)}{P_a(\Ov \mid \Fc) + P_a(\Ov \mid \FC)}, \end{array}
since the prior probabilities are equal. Since the analyst knows the true variances, P_a(\Ov \mid \Fc)=P(\Ov \mid \Fc) and similarly for \FC: we can replace the analyst’s probabilities with the true probabilities. So, contracting P(\Ov\mid\FC) as p_1(\vec{y}) and P(\Ov\mid\Fc) as p_2(\vec{y}), we get:
\begin{array}{rl} P_a(\Fc \mid \Ov) &= \frac{p_2(\vec{y})}{p_2(\vec{y}) + p_1(\vec{y}))} \ &= \frac{1}{1 + p_1(\vec{y})p_2(\vec{y})^{-1}}. \end{array}
To get the true expectation of this P_a, we need to integrate over the possible values of \vec{y}, weighted by the true probability P(\Ov\mid\ \FC)=p_1(\vec{y}) of this happening:
\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &= \int \frac{1}{1 + p_1(\vec{y})p_2(\vec{y})^{-1}} p_1(\vec{y}) d\vec{y} \ &= \int \frac{1}{p_1(\vec{y})^{-1} + p_2(\vec{y})^{-1}} d\vec{y}. \end{array}
Note that p_1(\vec{y}) and p_2(\vec{y})^{-1} are both (strictly) positive, and that 1/(p_1(\vec{y})^{-1} + p_2(\vec{y})^{-1}) is half the harmonic mean of the two.
The harmonic mean of any number of positive elements is bounded below by the minimum value of its arguments. Hence:
\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \int \min(p_1(\vec{y}), p_2(\vec{y})) d\vec{y}. \end{array}
Now, since the noise is independent, p_j(\vec{y})=\prod_{i=1}^n p_j(y_i) where p_1(y_i)=P(y_i \mid x_i, \FC) and p_2(y_i)=P(y_i \mid x_i, \Fc). For positive elements, the minimum of two products is greater than or equal to the product of minimums, so
\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \int_{-\infty}^{\infty} \prod_{i=1}^n \min(p_1(y_i), p_2(y_i)) d\vec{y} \ &\geq \frac{1}{2} \prod_{i=1}^n \int_{-\infty}^{\infty} \min(p_1(y_i), p_2(y_i)) dy_i. \end{array}
The expressions \min(p_1(y_i), p_2(y_i)) can be expressed analytically. If \varphi is the probability density function of N(0,1), the normal distribution with mean 0 and variance 1, then
\begin{array}{rl} p_1(y_i) &= \frac{1}{\sigma_i} \varphi\left(\frac{\FC(x_i)-y_i}{\sigma_i}\right),\ p_2(y_i) &= \frac{1}{\sigma_i} \varphi\left(\frac{\Fc(x_i)-y_i}{\sigma_i}\right). \end{array}
So the two curves are normal curves with the same variance and means \FC(x_i) and \Fc(x_i). Assume, without loss of generality, that \FC(x_i)\leq\Fc(y_i). Then the two functions will be equal at the midpoint \mu_i=(\FC(x_i)+\Fc(x_i))/2, and for y_i \leq \mu_i, p_1(y_i) is higher, while for y_i\geq \mu_i, p_2(y_i) is higher.
Thus
\min(p_1(y_i), p_2(y_i)) = \left{\begin{array}{rl} \frac{1}{\sigma_i} \varphi\left(\frac{\Fc(x_i)-y_i}{\sigma_i}\right), \ y_i \leq \mu_i, \ \frac{1}{\sigma_i} \varphi\left(\frac{\FC(x_i)-y_i}{\sigma_i}\right), \ y_i \geq \mu_i. \end{array}\right.
If \delta_i = |\FC(x_i)-\Fc(x_i)| is the distance between the two peaks, this becomes:
\min(p_1(y_i), p_2(y_i)) = \left{\begin{array}{rl} \frac{1}{\sigma_i} \varphi\left(\frac{\mu_i+ \delta_i/2-y_i}{\sigma_i}\right), \ y_i \leq \mu_i, \ \frac{1}{\sigma_i} \varphi\left(\frac{\mu_i- \delta_i/2 -y_i}{\sigma_i}\right), \ y_i \geq \mu_i. \end{array}\right.
Since the integral of \varphi is 1/2\left[ 1 + \operatorname{erf}(y/\sqrt{2}) \right], for \operatorname{erf} the error function, we can bound the expected probability by:
\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \prod_{i=1}^n \int_{-\infty}^{\infty} \min(p_1(y_i), p_2(y_i)) dy_i \ &\geq \frac{1}{2} \prod_{i=1}^n \left(\int_{-\infty}^{\mu_i} \frac{1}{\sigma_i} \varphi\left(\frac{\mu_i+ \delta_i/2-y_i}{\sigma_i}\right)dy_i + \int_{\mu_i}^{\infty} \frac{1}{\sigma_i} \varphi\left(\frac{\mu_i- \delta_i/2-y_i}{\sigma_i}\right)dy_i \right)\ &\geq \frac{1}{2} \prod_{i=1}^n \left(\int_{-\infty}^{-\delta_i/2} \frac{1}{\sigma_i} \varphi\left(\frac{-y_i}{\sigma_i}\right)dy_i + \int_{\delta_i/2}^{\infty} \frac{1}{\sigma_i} \varphi\left(\frac{-y_i}{\sigma_i}\right)dy_i \right)\ &\geq \frac{1}{2} \prod_{i=1}^n \left( \frac{1}{2}\left[ 1 + \operatorname{erf}\left(\frac{-\delta_i/2}{\sigma_i \sqrt{2}}\right) \right] + \frac{1}{2}\left[ 1 + \operatorname{erf}\left(\frac{-\delta_i/2}{\sigma_i \sqrt{2}}\right) \right] \right)\ &\geq \frac{1}{2} \prod_{i=1}^n \left[ 1 - \operatorname{erf}\left(\frac{\delta_i}{\sigma_i 2\sqrt{2}}\right) \right]. \end{array}
For positive values, the error function is concave, and it has derivative 2/\sqrt{\pi} at the origin, so
\operatorname{erf}\left(\frac{\delta_i}{\sigma_i 2\sqrt{2}}\right) \leq \frac{2}{\sqrt{\pi}}\frac{\delta_i}{\sigma_i 2\sqrt{2}} = \frac{\delta_i}{\sigma_i \sqrt{2\pi}}.
Consequently
\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \prod_{i=1}^n \left[ 1 - \frac{\delta_i}{\sigma_i \sqrt{2\pi}} \right]. \end{array}
Using the fact that for x,y positive, (1-x)(1-y) \geq 1-(x+y), we get a final bound:
\begin{array}{rl} \mathbb{E}\left[P_a(\Fc \mid \vec{x})\right] &\geq \frac{1}{2} \left[ 1 - \sum_{i=1}^n \frac{\delta_i}{\sigma_i \sqrt{2\pi}} \right] \end{array}