

Introduction to Outliers
In a dataset, outliers are stragglers: values that are extremely high or extremely low compared with the rest. In simple words, an outlier is a data point that lies far outside the other values in the set.
For example, consider the following set of numbers:
2, 98, 101, 103, 106, 109, 112, 205
Here, 2 and 205 are the outliers.
On a scatter plot, for instance, most of the data points typically cluster closely along a straight line, while an outlier lies far from the other points.
Outlier Meaning
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a way, this definition leaves it up to the analyst to determine what counts as abnormal. Normal observations must be characterised before abnormal observations can be picked out.
Defining Outliers
Outliers are identified by examining the data in two ways:
1. Examining the overall shape of the graphed data for important features, including symmetry and departures from assumptions.
2. Examining the data for unusual observations that lie far from the mass of the data. Such points are often classified as outliers.
Inliers
An inlier, on the other hand, is an erroneous data value that lies within the interior of a statistical distribution, which makes it difficult to distinguish from good data values. A simple example of an inlier is a value recorded in the wrong units, say degrees Fahrenheit rather than degrees Celsius.
Extreme and Mild Outlier
Mild Outlier:
Data values that lie between 1.5 and 3.0 times the interquartile range below the first quartile or above the third quartile are mild outliers.
Extreme Outlier:
Any data values that lie more than 3.0 times the interquartile range below the first quartile or above the third quartile are extreme outliers.
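The two definitions above can be sketched in Python. This is a minimal illustration rather than a definitive implementation: the function name classify_outliers is hypothetical, and the quartiles come from statistics.quantiles with method="inclusive", which is only one of several quartile conventions, so other tools may place the fences slightly differently.

```python
from statistics import quantiles


def classify_outliers(values):
    """Split values into mild and extreme outliers using IQR fences.

    Assumption: quartiles use the "inclusive" interpolation method; other
    quartile conventions can shift the fences slightly.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    # Inner fences sit 1.5 * IQR beyond the quartiles, outer fences 3.0 * IQR.
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    mild, extreme = [], []
    for x in values:
        if x < outer_low or x > outer_high:
            extreme.append(x)
        elif x < inner_low or x > inner_high:
            mild.append(x)
    return {"mild": mild, "extreme": extreme}


data = [2, 98, 101, 103, 106, 109, 112, 205]
print(classify_outliers(data))
```

With the data set from the introduction, both 2 and 205 fall beyond the outer fences under this quartile convention, so they are reported as extreme outliers and no value is reported as mild.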
How to Find Outliers?
Extreme Value Analysis: Determine the statistical tails of the underlying distribution of the data.
Probabilistic and Statistical Models: Identify instances that are unlikely under a probabilistic model of the data.
Linear Models: Projection techniques that use linear correlations to model the data in lower dimensions, for instance principal component analysis. Data points with large residual errors may be outliers.
Proximity-based Models: Data instances that are isolated from the mass of the data, as determined by cluster, density, or nearest neighbor analysis.
Information-Theoretic Models: Outliers are detected as data instances that increase the complexity (minimum code length) of the dataset.
High-Dimensional Outlier Detection: Methods that search subspaces for outliers, because distance-based measures break down in higher dimensions.
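As one illustration of the proximity-based idea above, the Python sketch below scores each value in a one-dimensional data set by the distance to its nearest neighbor and flags values whose distance is much larger than is typical. The function name, the use of the median distance as the baseline, and the factor of 5 are all assumptions chosen for illustration, not a standard rule.

```python
from statistics import median


def nearest_neighbor_outliers(values, factor=5.0):
    """Flag values that are isolated from the mass of the data.

    Each value is scored by the distance to its closest other value.
    A value is flagged when that distance exceeds `factor` times the
    median nearest-neighbor distance (an illustrative threshold).
    """
    if len(values) < 2:
        return []

    distances = []
    for i, x in enumerate(values):
        # Distance from x to the closest of all the other values.
        nearest = min(abs(x - y) for j, y in enumerate(values) if j != i)
        distances.append(nearest)

    threshold = factor * median(distances)
    return [x for x, d in zip(values, distances) if d > threshold]


data = [2, 98, 101, 103, 106, 109, 112, 205]
print(nearest_neighbor_outliers(data))
```

For the data set from the introduction, this flags 2 and 205. Two outliers that sit close to each other would shield one another from a single-neighbor score like this one, which is one reason practical methods usually consider more than one neighbor.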
Causes of Inliers and Outliers
1. Human Mistakes: Errors in data entry.
2. Instrument Mistakes: Errors in measurement.
3. Experimental Errors: Errors in data extraction or in planning or executing an experiment.
4. Intentional: Dummy outliers created to evaluate detection methods.
5. Errors in Data Processing: Data manipulation or unintended mutations of the dataset.
6. Errors in Sampling: Collecting or combining data from incorrect or different sources.
Uses of Outliers
Outlier detection is used in fraud detection (for example, spotting fraudulent loan applications), network intrusion detection, activity monitoring, network performance monitoring, satellite image analysis, detecting novelties in images, detecting mislabelled data, and many more areas.
Fun Fact
Did you know that Outlier is also the name of a clothing company? It sells different kinds of jeans, and its chinos are especially popular.
Conclusion
Outliers should be investigated carefully. They often provide useful information about the process under review or about how the data was collected and recorded. Before considering removing these points from the data, one should try to understand why they occurred and whether similar values are likely to keep occurring. That said, outliers often do turn out to be bad data points.
FAQs on Outlier
1. What is an outlier in Maths?
In mathematics, specifically in statistics, an outlier is a data point that is significantly different from the other observations in a dataset. It is an observation that lies an abnormal distance from other values. An outlier can be either extremely high or extremely low compared to the main cluster of data.
2. Can you provide an example of an outlier?
Certainly. Consider the test scores of a group of students: {85, 90, 88, 92, 91, 25}. In this dataset, most scores are clustered between 85 and 92. The score of 25 is significantly lower than the others and is therefore considered an outlier. It stands out from the general pattern of the data.
3. How does an outlier affect the mean, median, and mode?
An outlier has a very different impact on these three measures of central tendency:
- Mean: The mean is highly sensitive to outliers. A single extreme value can significantly pull the mean towards it, making it a potentially misleading representation of the data's centre.
- Median: The median is resistant, or robust, to outliers. Since it is the middle value, an extremely high or low value will not substantially change its position.
- Mode: The mode is generally not affected by an outlier, unless the outlier itself is repeated multiple times to become the new mode, which is rare.
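The contrast between the mean and the median can be checked with a short Python sketch. It uses the test scores from the previous answer and compares each measure with and without the score of 25; the variable names are only illustrative.

```python
from statistics import mean, median

scores = [85, 90, 88, 92, 91, 25]
# Same scores with the outlier (25) left out.
without_outlier = [s for s in scores if s != 25]

# The mean moves by more than 10 points when the outlier is dropped...
print(mean(scores), mean(without_outlier))      # 78.5 89.2
# ...while the median moves by only 1 point.
print(median(scores), median(without_outlier))  # 89.0 90
```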
4. What is the most common method to identify outliers in a dataset?
The most common visual method for identifying outliers is the box plot (or box-and-whisker plot). In a box plot, data points that fall outside the 'whiskers' are flagged as potential outliers. A common statistical method uses the interquartile range (IQR): any data point that falls below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is typically considered an outlier.
5. What are the typical causes of outliers in data?
Outliers can appear in a dataset for several reasons, which are important to understand before taking any action. Common causes include:
- Data Entry Errors: Mistakes made during the manual input of data, such as a typo.
- Measurement Errors: Issues with the instruments or procedures used to collect the data.
- Genuine Extreme Values: The data point is not an error but a true, rare observation that is different from the rest of the population (e.g., the height of a professional basketball player in a dataset of average adults).
- Sampling Errors: When a data point is accidentally included from a different population.
6. Should you always remove an outlier from your data analysis?
No, you should not automatically remove an outlier. The decision depends entirely on its cause. If an outlier is confirmed to be the result of a data entry or measurement error, it is appropriate to correct it or remove it. However, if the outlier is a genuine, albeit extreme, data point, removing it can lead to a loss of valuable information and a misleading analysis. In such cases, the outlier should be investigated further to understand its significance.

















