Separation (statistics)
In statistics, separation is a phenomenon associated with models for dichotomous or categorical outcomes, including logistic and probit regression. Separation occurs if a predictor (or a linear combination of some subset of the predictors) is associated with only one outcome value whenever that predictor or combination is greater than some constant.
For example, suppose the predictor x is continuous and the outcome y = 1 for all observed x > 2. If the outcome values are perfectly determined by the predictor (e.g., y = 0 whenever x ≤ 2), then "complete separation" is said to occur. If instead there is some overlap (e.g., y = 0 when x < 2, but y has observed values of both 0 and 1 when x = 2), then "quasi-complete separation" occurs. A 2 × 2 table with an empty cell is an example of quasi-complete separation.
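The distinction can be illustrated with a minimal sketch for the simplest case of a single predictor and a binary outcome coded 0/1. The function name `classify_separation` and the sample data are hypothetical and chosen only to mirror the x > 2 example; the check does not cover separation by a linear combination of several predictors.

```python
def classify_separation(x, y):
    """Classify separation for one predictor x and a binary outcome y (0/1).

    Illustrative only: assumes a single predictor that takes at least
    two distinct values and an outcome coded as 0 or 1.
    """
    if len(set(x)) < 2:
        raise ValueError("the predictor must take at least two distinct values")
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        raise ValueError("both outcome values must be observed")
    # A gap between the two groups: the outcome is perfectly determined by x.
    if max(x0) < min(x1) or max(x1) < min(x0):
        return "complete"
    # The groups meet only at one boundary value, where both outcomes occur.
    if max(x0) == min(x1) or max(x1) == min(x0):
        return "quasi-complete"
    return "none"


# y = 0 whenever x <= 2 and y = 1 whenever x > 2
print(classify_separation([0, 1, 2, 3, 4], [0, 0, 0, 1, 1]))
# both outcomes are observed at x = 2
print(classify_separation([0, 1, 2, 2, 3, 4], [0, 0, 0, 1, 1, 1]))
# a binary predictor whose 2 x 2 table has an empty cell (no y = 0 at x = 1)
print(classify_separation([0, 0, 1, 1, 1], [0, 1, 1, 1, 1]))
```

With several predictors, detecting separation generally requires checking whether any linear combination separates the outcomes, which this one-dimensional comparison does not attempt.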
This feature of the observed data is important because it causes problems when regression coefficients are estimated. Loosely, a parameter in the model "wants" to be infinite if complete separation is observed: the likelihood keeps increasing as the parameter grows, so no finite maximum likelihood estimate exists. Under quasi-complete separation a finite maximum likelihood estimate likewise does not exist for that parameter (Albert and Anderson 1984), although the likelihood then approaches a limit below the upper bound it approaches under complete separation. Computer programs will often output an arbitrarily large parameter estimate with a very large standard error. Methods for fitting models to such data include exact logistic regression and Firth logistic regression, a bias-reduction method based on a penalized likelihood.
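The behaviour of the likelihood can be sketched numerically. The example below assumes a deliberately simplified one-parameter model, logit P(y = 1) = beta * x with no intercept, and a small completely separated data set centred on the separating value; the data, the function names and the grid search are illustrative assumptions, not a production fitting routine. The penalized version adds half the log of the Fisher information to the log-likelihood, which is the form of Firth's penalty for a single parameter.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_likelihood(beta, x, y):
    """Log-likelihood of the model logit P(y = 1) = beta * x."""
    total = 0.0
    for xi, yi in zip(x, y):
        sign = 1.0 if yi == 1 else -1.0
        total += log_sigmoid(sign * beta * xi)
    return total


def firth_log_likelihood(beta, x, y):
    """Log-likelihood plus the penalty 0.5 * log I(beta).

    For this one-parameter model the Fisher information is
    I(beta) = sum of x_i^2 * p_i * (1 - p_i).
    """
    info = 0.0
    for xi in x:
        z = beta * xi
        # p * (1 - p), computed on the log scale to avoid overflow
        info += xi * xi * math.exp(log_sigmoid(z) + log_sigmoid(-z))
    return log_likelihood(beta, x, y) + 0.5 * math.log(info)


# Hypothetical, completely separated data: y = 1 exactly when x > 0.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

# The ordinary log-likelihood keeps rising towards 0 as beta grows.
for beta in (1, 5, 10, 50):
    print(beta, log_likelihood(beta, x, y))

# The penalized log-likelihood has an interior maximum; a coarse grid
# search over [0, 10] is enough to locate it for this illustration.
grid = [i / 100 for i in range(0, 1001)]
beta_hat = max(grid, key=lambda b: firth_log_likelihood(b, x, y))
print("penalized estimate:", beta_hat)
```

In this sketch the unpenalized log-likelihood has no finite maximizer, while the penalized criterion is maximized at a finite value of beta. Practical implementations of Firth's method handle several parameters and use iterative algorithms rather than a grid.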
References
- Albert, A.; Anderson, J. A. (1984). "On the existence of maximum likelihood estimates in logistic regression models". Biometrika. 71 (1): 1–10. doi:10.1093/biomet/71.1.1.
- Heinze, G.; Schemper, M. (2002). "A solution to the problem of separation in logistic regression". Statistics in Medicine. 21 (16): 2409–2419. doi:10.1002/sim.1047.
- Heinze, G.; Ploner, M. (2003). "Fixing the nonconvergence bug in logistic regression with SPLUS and SAS". Computer Methods and Programs in Biomedicine. 71 (2): 181–187. doi:10.1016/S0169-2607(02)00088-3.
- Heinze, G. (2006). "A comparative investigation of methods for logistic regression with separated or nearly separated data". Statistics in Medicine. 25 (24): 4216–4226. doi:10.1002/sim.2687.