In statistics, an outlier is an observation that is numerically distant from the rest of the data. Statistics derived from data sets that include outliers may be misleading. For example, if one is calculating the average temperature of 10 objects in a room, and most are between 20 and 25 degrees Celsius, but an oven is at 350 °C, the median of the data may be 23 °C while the mean will be about 55 °C. In this case, the median better reflects the temperature of a randomly sampled object than the mean. Outliers may be indicative of data points that belong to a different population than the rest of the sample set.
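The arithmetic is easy to check. Below is a short Python sketch; the ten readings are hypothetical values consistent with the ranges above:

```python
import statistics

# Nine room-temperature objects (20-25 °C) plus one oven at 350 °C.
temps = [20, 21, 22, 22, 23, 23, 24, 24, 25, 350]

print(statistics.mean(temps))    # 55.4 -- dominated by the single outlier
print(statistics.median(temps))  # 23.0 -- typical of the room objects
```

A single extreme value shifts the mean far outside the range of the other nine observations, while the median is unaffected.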
In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable. This can be due to incidental systematic error or flaws in the theory that generated an assumed family of probability distributions, or it can simply be the case that some observations happen to be a long way from the center of the data. Outlier points can therefore indicate faulty data, erroneous procedures, or areas where a certain theory might not be valid. However, a small number of outliers not due to any anomalous condition is to be expected in large samples.
Estimators not sensitive to outliers are said to be robust.
In the case of normally distributed data, roughly 1 in 22 observations will differ by twice the standard deviation or more from the mean, and 1 in 370 will deviate by three times the standard deviation. In a sample of 1000 observations, the presence of up to five observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected and not indicative of an anomaly. If the sample size is only 100, however, just three such outliers are already reason for concern. In general, if the nature of the population distribution is known a priori, it is possible to test if the outliers deviate significantly from what can be expected.
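These frequencies follow from the normal tail probabilities and can be verified with the standard library's complementary error function (a quick sketch, not a full outlier test):

```python
import math

def two_sided_tail(z):
    """P(|X - mu| > z*sigma) for a normally distributed X."""
    return math.erfc(z / math.sqrt(2))

p2 = two_sided_tail(2)   # ~0.0455
p3 = two_sided_tail(3)   # ~0.0027
print(round(1 / p2))     # 22  -- about 1 in 22 beyond two sigma
print(round(1 / p3))     # 370 -- about 1 in 370 beyond three sigma

# In 1000 observations one expects about 1000 * p3 = 2.7 points beyond
# three standard deviations, so seeing up to five is unremarkable.
```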
Outliers can have many anomalous causes. A physical apparatus for taking measurements may have suffered a transient malfunction. There may have been an error in data transmission or transcription. A sample may have been contaminated with elements from outside the population being examined. Alternatively, an outlier could be the result of a flaw in the assumed theory, calling for further investigation by the researcher.
Unless it can be ascertained that the deviation is not significant, it is ill-advised to ignore the presence of outliers. Outliers that cannot be readily explained demand special attention.
There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.
Some methods which are commonly used for identification assume that the data are from a normal distribution, and identify observations which are deemed "unlikely" based on mean and standard deviation; examples include Chauvenet's criterion and Peirce's criterion. Other methods flag observations based on measures such as the interquartile range. For example, if Q1 and Q3 are the lower and upper quartiles respectively, then one could define an outlier to be any observation outside the range [Q1 - k(Q3 - Q1), Q3 + k(Q3 - Q1)] for some constant k; a common choice is k = 1.5.
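A minimal Python sketch of this interquartile-range rule with k = 1.5 (the helper name and the sample data are illustrative):

```python
import statistics

def iqr_outliers(xs, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(xs, n=4)  # exclusive method by default
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lo or x > hi]

data = [20, 21, 22, 22, 23, 23, 24, 24, 25, 350]
print(iqr_outliers(data))  # [350]
```

Unlike rules based on the mean and standard deviation, the fences here are built from quartiles, so the extreme value itself barely influences the thresholds used to flag it.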
The choice of how to deal with an outlier should depend on the cause.
Even when a normal distribution model is appropriate to the data being analyzed, outliers are expected for large sample sizes and should not automatically be discarded.
Deletion of outlier data is a controversial practice frowned on by many scientists and science instructors; while mathematical criteria provide an objective and quantitative method for data rejection, they do not make the practice more scientifically or methodologically sound, especially in small sets or where a normal distribution cannot be assumed. Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known.
In regression problems, an alternative approach may be to only exclude points which exhibit a large degree of influence on the estimated parameters, using a measure such as Cook's distance, a commonly used estimate of the influence of a data point in least-squares regression.
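A minimal numpy sketch of Cook's distance from its textbook formula D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2; the function name and the simulated data are illustrative:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an ordinary least-squares fit.

    X is the (n, p) design matrix (include a column of ones for the
    intercept); y is the (n,) response vector.
    """
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    h = np.diag(X @ xtx_inv @ X.T)   # leverages (hat-matrix diagonal)
    beta = xtx_inv @ X.T @ y         # OLS coefficients
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)     # residual variance estimate
    return (resid ** 2 / (p * s2)) * h / (1.0 - h) ** 2

# A straight line plus one high-leverage point pulled off the line.
x = np.array([0., 1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
y = 2 * x
y[-1] -= 15
X = np.column_stack([np.ones_like(x), x])
d = cooks_distance(X, y)
print(np.argmax(d))  # 10 -- the influential point dominates
```

The far point's raw residual is comparable to the others', but its leverage is large, so Cook's distance singles it out where residual-only screening would not.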
The possibility should be considered that the underlying distribution of the data is not approximately normal, having "fat tails". For instance, when sampling from a Cauchy distribution, the sample variance increases with the sample size, the sample mean fails to converge as the sample size increases, and outliers are expected at far larger rates than for a normal distribution.
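This behaviour is easy to observe by simulation; standard Cauchy draws can be generated by inverse-CDF sampling (the sample sizes below are arbitrary):

```python
import math
import random

random.seed(42)

def cauchy_sample(n):
    """Standard Cauchy draws: tan(pi*(U - 1/2)) for uniform U."""
    return [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

for n in (100, 10_000, 100_000):
    xs = cauchy_sample(n)
    print(n, sum(xs) / n, sorted(xs)[n // 2])
```

In typical runs the printed sample median settles near 0, the true median, as n grows, while the sample mean does not stabilize: the Cauchy distribution has no mean, and occasional enormous draws keep dominating the average.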
In cases where the cause of the outliers is known, it may be possible to incorporate this effect into the model structure, for example by using a hierarchical Bayes model or a mixture model.
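As a sketch of the mixture-model idea, a two-component one-dimensional Gaussian mixture can be fit by expectation-maximization, with one component left free to absorb the outliers. Everything below (the initialization heuristic, the simulated data) is illustrative, not a prescribed method:

```python
import math
import random
import statistics

def em_two_component(xs, iters=100):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch).

    Initializing one mean at the median and the other at the maximum is
    a heuristic meant to let the second component capture the outliers.
    """
    mu = [statistics.median(xs), max(xs)]
    var = [statistics.pvariance(xs)] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = (sum(r[k] * (x - mu[k]) ** 2
                          for r, x in zip(resp, xs)) / nk) + 1e-9
    return w, mu, var

random.seed(1)
data = ([random.gauss(0, 1) for _ in range(95)]
        + [random.gauss(10, 1) for _ in range(5)])
w, mu, var = em_two_component(data)
print(mu)  # one mean near 0 (bulk), one near 10 (outlier component)
```

Rather than deleting the outlying points, the fitted mixture assigns them their own component, so the parameters of the main component describe the bulk of the data.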