Tag Archives: Statistics

Analysis of variance (ANOVA)

Analysis of variance (ANOVA) is a collection of statistical models, developed by the statistician and evolutionary biologist Ronald Fisher, used to analyze the differences among group means and their associated procedures (such as “variation” among and between groups). In the ANOVA setting, the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. ANOVAs are useful for testing three or more means (groups or variables) for statistical significance. This is conceptually similar to performing multiple two-sample t-tests, but is more conservative (results in less type I error) and is therefore suited to a wide range of practical problems.
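The partitioning of variance described above can be sketched in a few lines of Python. The three groups of measurements are hypothetical, made up for illustration; the total variation is split into a between-group and a within-group component, and their ratio of mean squares gives the F statistic:

```python
# A minimal one-way ANOVA sketch in pure Python, using three small
# hypothetical groups of measurements (not real data).

groups = [
    [24.1, 25.3, 26.2, 24.8, 25.9],   # group A
    [28.4, 27.9, 29.1, 28.8, 27.5],   # group B
    [24.9, 25.6, 24.4, 26.0, 25.2],   # group C
]

n_total = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n_total

# Between-group sum of squares: variation of group means around the grand mean.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: variation of observations around their group mean.
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F (relative to the F distribution with these degrees of freedom) indicates that the group means are unlikely to all be equal; in practice one would use a library routine such as `scipy.stats.f_oneway` rather than hand-rolling this.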

Latin square



The “Gamma plus two” method for generating “odd order” magic squares, the “Gamma plus two plus swap” method for generating “singly even order” magic squares, and Dürer’s method for generating “doubly even order” magic squares.
By Professor Edward Brumgnach, P.E.
City University of New York
Queensborough Community College


boy girl paradox

Published on Mar 8, 2016

TED-Ed presented a riddle last week based on a classic probability problem. However, in the riddle there is a small and seemingly insignificant detail that changes the calculation. In this video I present the pertinent details of the frog riddle, explain its connection to the boy or girl paradox, and then do a detailed calculation of what I believe is the correct probability.

TED-ED frog riddle: https://www.youtube.com/watch?v=cpwSG…

Blog post (another calculation if the probability a male frog croaks is p): http://wp.me/p6aMk-4wD

Ron Niles made a video that shows the probability visually and explains an interpretation of a male frog croaking with probability p: https://www.youtube.com/watch?v=K53P5…
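The role of the croaking probability p can be checked by simulation. The sketch below is a Monte Carlo estimate, assuming each frog is independently male or female with probability 1/2 and each male croaks with probability p during the listening window; it conditions on hearing exactly one croak, which is one of the interpretations the videos discuss, so treat it as an illustrative choice rather than the definitive reading of the riddle:

```python
import random

def p_female_given_one_croak(p, trials=200_000, seed=1):
    """Estimate P(at least one female in the pair | exactly one croak heard)."""
    random.seed(seed)
    hits = kept = 0
    for _ in range(trials):
        pair = [random.random() < 0.5 for _ in range(2)]   # True = male
        croaks = sum(1 for male in pair if male and random.random() < p)
        if croaks == 1:                # keep trials where exactly one croak is heard
            kept += 1
            if not all(pair):          # at least one frog is female
                hits += 1
    return hits / kept

# Under this conditioning the exact answer is 1 / (2 - p):
# 2/3 at p = 0.5, and 1 when p = 1 (two males would produce two croaks).
print(round(p_female_given_one_croak(0.5), 3))
```

Varying p shows how the seemingly insignificant detail shifts the answer between 1/2 (as p approaches 0) and 1 (at p = 1).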

Hill’s criteria for causation

The Bradford Hill criteria, otherwise known as Hill’s criteria for causation, are a group of minimal conditions necessary to provide adequate evidence of a causal relationship between a presumed cause and an observed effect. They were established in 1965 by the English epidemiologist Sir Austin Bradford Hill (1897–1991).

The list of the criteria is as follows:

  1. Strength (effect size): A small association does not mean that there is not a causal effect, though the larger the association, the more likely that it is causal.[1]
  2. Consistency (reproducibility): Consistent findings observed by different persons in different places with different samples strengthens the likelihood of an effect.[1]
  3. Specificity: Causation is likely if there is a very specific population at a specific site and disease with no other likely explanation. The more specific an association between a factor and an effect is, the bigger the probability of a causal relationship.[1]
  4. Temporality: The effect has to occur after the cause (and if there is an expected delay between the cause and expected effect, then the effect must occur after that delay).[1]
  5. Biological gradient: Greater exposure should generally lead to greater incidence of the effect. However, in some cases, the mere presence of the factor can trigger the effect. In other cases, an inverse proportion is observed: greater exposure leads to lower incidence.[1]
  6. Plausibility: A plausible mechanism between cause and effect is helpful (but Hill noted that knowledge of the mechanism is limited by current knowledge).[1]
  7. Coherence: Coherence between epidemiological and laboratory findings increases the likelihood of an effect. However, Hill noted that “… lack of such [laboratory] evidence cannot nullify the epidemiological effect on associations”.[1]
  8. Experiment: “Occasionally it is possible to appeal to experimental evidence”.[1]
  9. Analogy: The effect of similar factors may be considered.[1]

Debate in modern epidemiology

Bradford Hill’s criteria are still widely accepted in the modern era as a logical structure for investigating and defining causality in epidemiological study. However, their method of application is debated. Some proposed options include:

  1. using a counterfactual consideration as the basis for applying each criterion.[2]
  2. subdividing them into three categories: direct, mechanistic and parallel evidence, expected to complement each other. This operational reformulation of the criteria has been recently proposed in the context of evidence based medicine.[3]
  3. considering confounding factors and bias.[4]
  4. using Hill’s criteria as a guide but not considering them to give definitive conclusions.[5]
  5. separating causal association and interventions, because interventions in public health are more complex than can be evaluated by use of Hill’s criteria.[6]

Arguments against the use of Bradford Hill criteria as exclusive considerations in proving causality also exist. Some argue that the basic mechanism of proving causality is not in applying specific criteria—whether those of Bradford Hill or counterfactual argument—but in scientific common sense deduction.[7] Others also argue that the specific study from which data has been produced is important, and while the Bradford Hill criteria may be applied to test causality in these scenarios, the study type may rule out deducing or inducing causality, and the criteria are only of use in inferring the best explanation of this data.[8]

Debate over the scope of application of the criteria includes whether they can be applied to social sciences.[9] The argument proposed in this line of thought is that when considering the motives behind defining causality, the Bradford Hill criteria are important to apply to complex systems such as health sciences because they are useful in prediction models where a consequence is sought; explanation models as to why causation occurred are deduced less easily from Bradford Hill criteria as the instigation of causation, rather than the consequence, is needed for these models.

Researchers have applied Hill’s criteria for causality in examining the evidence in several areas of epidemiology, including connections between ultraviolet B radiation, vitamin D and cancer,[10][11] vitamin D and pregnancy and neonatal outcomes,[12] alcohol and cardiovascular disease outcomes,[13] infections and risk of stroke,[14] nutrition and biomarkers related to disease outcomes,[15] and sugar-sweetened beverage consumption and the prevalence of obesity and obesity-related diseases.[16] Referenced papers can be read to see how Hill’s criteria have been applied.

The Will Rogers phenomenon

The Will Rogers phenomenon occurs when moving an element from one set to another raises the average value of both sets. It is based on the following quote, attributed (perhaps incorrectly)[1] to comedian Will Rogers:

When the Okies left Oklahoma and moved to California, they raised the average intelligence level in both states.

The effect will occur when both of these conditions are met:

  • The element being moved is below average for its current set. Removing it will, by definition, raise the average of the remaining elements.
  • The element being moved is above the current average of the set it is entering. Adding it to the new set will, by definition, raise the average.
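Both conditions can be verified with a tiny numeric example (the values below are made up for illustration): the element 5 is below the mean of the first set and above the mean of the second, so moving it raises both means.

```python
def mean(xs):
    return sum(xs) / len(xs)

set_a = [5, 6, 7, 8, 9]   # mean 7.0
set_b = [1, 2, 3]         # mean 2.0

before = (mean(set_a), mean(set_b))
set_a.remove(5)           # 5 is below A's mean ...
set_b.append(5)           # ... and above B's mean
after = (mean(set_a), mean(set_b))
print(before, "->", after)   # (7.0, 2.0) -> (7.5, 2.75): both means rise
```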

forest plot

A forest plot (or blobbogram[1]) is a graphical display designed to illustrate the relative strength of treatment effects in multiple quantitative scientific studies addressing the same question. It was developed for use in medical research as a means of graphically representing a meta-analysis of the results of randomized controlled trials. In the last twenty years, similar meta-analytical techniques have been applied in observational studies (e.g. environmental epidemiology), and forest plots are often used in presenting the results of such studies as well.

Although forest plots can take several forms, they are commonly presented with two columns. The left-hand column lists the names of the studies (frequently randomized controlled trials or epidemiological studies), commonly in chronological order from the top downwards. The right-hand column is a plot of the measure of effect (e.g. an odds ratio) for each of these studies (often represented by a square) incorporating confidence intervals represented by horizontal lines. The graph may be plotted on a natural logarithmic scale when using odds ratios or other ratio-based effect measures, so that the confidence intervals are symmetrical about the means from each study and to ensure undue emphasis is not given to odds ratios greater than 1 when compared to those less than 1. The area of each square is proportional to the study’s weight in the meta-analysis. The overall meta-analysed measure of effect is often represented on the plot as a dashed vertical line. This meta-analysed measure of effect is commonly plotted as a diamond, the lateral points of which indicate confidence intervals for this estimate.

A vertical line representing no effect is also plotted. If the confidence intervals for individual studies overlap with this line, it demonstrates that at the given level of confidence their effect sizes do not differ from no effect for the individual study. The same applies for the meta-analysed measure of effect: if the points of the diamond overlap the line of no effect the overall meta-analysed result cannot be said to differ from no effect at the given level of confidence.
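The reason for the logarithmic axis mentioned above can be shown numerically. A 95% confidence interval for an odds ratio is conventionally built from the standard error of the log odds ratio (a Wald interval), so it is symmetric about the point estimate only on the log scale; the 2 × 2 cell counts below are hypothetical:

```python
import math

# Hypothetical 2 x 2 counts for a single study (not from the text).
a, b, c, d = 20, 80, 10, 90
or_point = (a / b) / (c / d)                   # odds ratio point estimate
se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)      # standard error of log(OR)
lo = math.exp(math.log(or_point) - 1.96 * se_log)
hi = math.exp(math.log(or_point) + 1.96 * se_log)

# Asymmetric on the raw scale, symmetric on the log scale:
print(round(lo, 2), round(or_point, 2), round(hi, 2))
```

On a linear axis the upper arm of this interval is much longer than the lower one; on a log axis the two arms are equal, which is the symmetry the forest-plot convention exploits.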

Forest plots date back to at least the 1970s. One plot is shown in a 1985 book about meta-analysis.[2]:252 The first use in print of the term “forest plot” may be in an abstract for a poster at the Pittsburgh (USA) meeting of the Society for Clinical Trials in May 1996.[3] An informative investigation into the origin of the notion “forest plot” was published in 2001.[4] The name refers to the forest of lines produced. In September 1990, Richard Peto joked that the plot was named after a breast cancer researcher called Pat Forrest, and as a result the name has sometimes been spelled “forrest plot”.

logit regression

In statistics, logistic regression (also logit regression or the logit model)[1] is a type of probabilistic statistical classification model.[2] It is used to predict the outcome of a categorical dependent variable (i.e., a class label) from one or more predictor variables (features); that is, it is used in estimating the parameters of a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Frequently (and hereafter in this article) “logistic regression” is used to refer specifically to the problem in which the dependent variable is binary—that is, the number of available categories is two—while problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordered logistic regression.

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable.[3] Thus, it treats the same set of problems as does probit regression using similar techniques; the first assumes a logistic function and the second a standard normal distribution function.

Logistic regression can be seen as a special case of the generalized linear model and is thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between the dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional distribution p(y \mid x) is a Bernoulli distribution rather than a Gaussian distribution, because logistic regression is a classifier. Second, the linear combination of the inputs, w^T x \in \mathbb{R}, is mapped into [0,1] by the logistic function, because logistic regression predicts the probability of the instance being positive.
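The second point above can be sketched directly: the logistic (sigmoid) function squashes the real-valued linear score w·x + b into a probability in (0, 1). The weights and bias below are hypothetical, standing in for values a fitting procedure would produce:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted weights and bias (illustration only, not a fitted model).
w = [0.8, -1.2]
bias = 0.3

def predict_proba(x):
    z = bias + sum(wi * xi for wi, xi in zip(w, x))   # linear score w.x + b
    return sigmoid(z)                                 # modeled P(y = 1 | x)

print(round(predict_proba([2.0, 1.0]), 3))
```

Thresholding this probability (commonly at 0.5) turns the model into the binary classifier described above.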

the odds ratio

In statistics, the odds ratio[1][2][3] (usually abbreviated “OR”) is one of three main ways to quantify how strongly the presence or absence of property A is associated with the presence or absence of property B in a given population. If each individual in a population either does or does not have a property “A”, (e.g. “high blood pressure”), and also either does or does not have a property “B” (e.g. “moderate alcohol consumption”) where both properties are appropriately defined, then a ratio can be formed which quantitatively describes the association between the presence/absence of “A” (high blood pressure) and the presence/absence of “B” (moderate alcohol consumption) for individuals in the population. This ratio is the odds ratio (OR) and can be computed following these steps:

  1. For a given individual that has “B”, compute the odds that the same individual has “A”.
  2. For a given individual that does not have “B”, compute the odds that the same individual has “A”.
  3. Divide the odds from step 1 by the odds from step 2 to obtain the odds ratio (OR).
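The three steps can be sketched with hypothetical counts (made up for illustration): among 100 individuals with “B”, 30 have “A”; among 100 individuals without “B”, 10 have “A”.

```python
# Hypothetical counts: among 100 people with B, 30 have A;
# among 100 people without B, 10 have A.
with_b_has_a, with_b_no_a = 30, 70
no_b_has_a, no_b_no_a = 10, 90

odds_a_given_b = with_b_has_a / with_b_no_a           # step 1: 30/70
odds_a_given_not_b = no_b_has_a / no_b_no_a           # step 2: 10/90
odds_ratio = odds_a_given_b / odds_a_given_not_b      # step 3
print(round(odds_ratio, 2))
```

Here the OR is well above 1, so having “B” is associated with raised odds of having “A” in these (invented) data.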

The term “individual” in this usage does not have to refer to a human being, as a statistical population can measure any set of entities, whether living or inanimate.

If the OR is greater than 1, then having “A” is considered to be “associated” with having “B” in the sense that the having of “B” raises (relative to not-having “B”) the odds of having “A”. Note that this is not enough to establish that B is a contributing cause of “A”: it could be that the association is due to a third property, “C”, which is a contributing cause of both “A” and “B” (Confounding).

The two other major ways of quantifying association are the risk ratio (“RR”) and the absolute risk reduction (“ARR”). In clinical studies and many other settings, the parameter of greatest interest is often actually the RR, which is determined in a way that is similar to the one just described for the OR, except using probabilities instead of odds. Frequently, however, the available data only allows the computation of the OR; notably, this is so in the case of case-control studies, as explained below. On the other hand, if one of the properties (say, A) is sufficiently rare (the “rare disease assumption“), then the OR of having A given that the individual has B is a good approximation to the corresponding RR (the specification “A given B” is needed because, while the OR treats the two properties symmetrically, the RR and other measures do not).

In more technical language, the OR is a measure of effect size, describing the strength of association or non-independence between two binary data values. It is used as a descriptive statistic, and plays an important role in logistic regression.

In statistics and epidemiology, relative risk (RR) is the ratio of the probability of an event occurring (for example, developing a disease, being injured) in an exposed group to the probability of the event occurring in a comparison, non-exposed group. Relative risk includes two important features: (i) a comparison of risk between two “exposures” puts risks in context, and (ii) “exposure” is ensured by having proper denominators for each group representing the exposure.[1][2]

RR = \frac{p_\text{event when exposed}}{p_\text{event when non-exposed}}

Risk         Disease present   Disease absent
Smoker       a                 b
Non-smoker   c                 d

Consider an example where the probability of developing lung cancer among smokers is 20% and among non-smokers 1%. This situation is expressed in the 2 × 2 table above.

Here, a = 20, b = 80, c = 1, and d = 99. Then the relative risk of cancer associated with smoking would be

RR=\frac {a/(a+b)}{c/(c+d)} = \frac {20/100}{1/100} = 20.

Smokers would be twenty times as likely as non-smokers to develop lung cancer.
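The same arithmetic, with the odds ratio for this table computed alongside for contrast (a standard comparison, not part of the original example):

```python
a, b, c, d = 20, 80, 1, 99           # cell counts from the 2 x 2 table above

rr = (a / (a + b)) / (c / (c + d))   # risk ratio: 0.20 / 0.01
or_ = (a / b) / (c / d)              # odds ratio for the same table
print(round(rr, 1), round(or_, 2))
```

Because the outcome is not rare among smokers here, the OR (24.75) overstates the RR (20); when the outcome is rare in both groups, the two measures nearly coincide (the rare disease assumption mentioned earlier).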

Another term for the relative risk is the risk ratio, because it is the risk in the exposed divided by the risk in the unexposed. Relative risk contrasts with the actual or absolute risk, and may be confused with it in the media or elsewhere.

maximum-likelihood estimation (MLE)

In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model’s parameters.

The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, one may be interested in the heights of adult female penguins, but be unable to measure the height of every single penguin in a population due to cost or time constraints. Assuming that the heights are normally (Gaussian) distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE would accomplish this by taking the mean and variance as parameters and finding particular parametric values that make the observed results the most probable (given the model).

In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the “agreement” of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution. Maximum-likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems. However, in some complicated problems, difficulties do occur: in such problems, maximum-likelihood estimators are unsuitable or do not exist.
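For the normal model in the penguin-height example, the MLEs have a closed form: the sample mean and the 1/n (biased) sample variance. The sketch below uses a small made-up sample and checks that nearby parameter values do not beat the MLE on log-likelihood:

```python
import math

heights = [61.2, 58.7, 63.1, 60.4, 59.8, 62.0]   # hypothetical sample (cm)
n = len(heights)

# Closed-form MLEs for a normal model: sample mean and 1/n variance.
mu_hat = sum(heights) / n
var_hat = sum((x - mu_hat) ** 2 for x in heights) / n   # note 1/n, not 1/(n-1)

def log_likelihood(mu, var):
    """Log-likelihood of the sample under Normal(mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in heights)

print(f"mu_hat = {mu_hat:.2f}, var_hat = {var_hat:.2f}")
```

Perturbing either parameter away from (mu_hat, var_hat) lowers the log-likelihood, which is exactly the “most probable given the model” criterion described above.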