What is an outlier in data

By Fenritaxe | 29.12.2020

What happens when you have outliers in your data?

Last modified: April 05.

An outlier is a value or point that differs substantially from the rest of the data. Sometimes an outlier is an error that we want to exclude, or an anomaly that we do not want to include in our analysis.

Definition of outliers

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal.

Statistical Indicators

Outliers are data values that differ greatly from the majority of a set of data. These values fall outside of an overall trend that is present in the data. Looking for outliers by carefully examining a data set is not straightforward. Although it is easy to see, possibly with a stemplot, that some values differ from the rest of the data, how different does a value have to be to be considered an outlier?

An outlier is simply a data point that is drastically different or distant from other data points. A set of data can have just one outlier or several. To be an outlier, a data point must not correspond with the general trend of the data set. It must be very noticeably outside the pattern.
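One widely used convention for deciding how different a value has to be is the 1.5 × IQR rule: a value is flagged when it lies more than 1.5 interquartile ranges below the first quartile or above the third quartile. The text above does not prescribe this rule, so the sketch below is only an illustration. The function name `iqr_outliers`, the multiplier `k` and the sample values are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # "inclusive" treats the data as the whole population of interest;
    # other quartile conventions can give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: nine similar values and one extreme value.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
flagged = iqr_outliers(sample)
print(flagged)
```

A larger `k` (3 is sometimes used) flags only the most extreme points. The choice remains a judgment call for the analyst.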

What are Outliers?

They are data records that differ dramatically from all others and distinguish themselves in one or more characteristics. In other words, an outlier is a value that escapes normality and can (and probably will) cause anomalies in the results obtained through algorithms and analytical systems.

Therefore, they always need some degree of attention. Understanding outliers is critical in analyzing data for at least two reasons: outliers may negatively bias the entire result of an analysis, and the behavior of outliers may be precisely what is being sought.

Depending on the context, outliers go by many other names: aberration, oddity, deviation, anomaly, eccentric, nonconformist, exception, irregularity, dissent, original and so on.

Below are some common situations in which outliers arise in data analysis, along with suggested approaches for dealing with them in each case. The simplest way to find outliers in your data is to look directly at the data table or worksheet (the dataset, as data scientists call it). The table in this example clearly shows a typing error, that is, a mistake made when the data was entered.

Looking at the table, it is possible to identify the outlier, but it is difficult to say what the correct age would be. Several values could be the right age, such as 47, 70 or even 40 years. In a small sample, finding outliers by inspecting tables can be easy. But when the number of observations goes into the thousands or millions, it becomes impossible. The task becomes even more difficult when many variables (the worksheet columns) are involved.
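When a table is too large to scan by eye, a simple plausibility rule can do the same job. The sketch below is only an illustration. The records, the value 470 and the 0 to 120 age range are assumptions, not data taken from the original table.

```python
# Hypothetical records; 470 stands in for a typing error in the age column.
records = [
    {"name": "Patient A", "age": 34},
    {"name": "Patient B", "age": 470},
    {"name": "Patient C", "age": 52},
    {"name": "Patient D", "age": 29},
]


def implausible_ages(rows, min_age=0, max_age=120):
    """Return the rows whose age falls outside an assumed plausible range."""
    return [row for row in rows if not (min_age <= row["age"] <= max_age)]


suspects = implausible_ages(records)
for row in suspects:
    print(f"{row['name']}: age {row['age']} looks like a data-entry error")
```

A rule like this finds the suspicious record but, as noted above, cannot say which age is correct. That still requires going back to the source of the data.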

For this, there are other methods. One of the best ways to identify outliers is by using charts. When plotting a chart, the analyst can clearly see that something different exists. Here are some examples that illustrate how outliers show up in charts.

In one dataset of scheduled appointments, several patterns were found, for example: children practically never miss their appointments, and women attend consultations much more often than men.

However, one curious case was an outlier: a woman who, at age 79, scheduled a consultation days in advance and actually showed up for her appointment. This is an example of an outlier that deserves to be studied, because this lady's behavior may point to measures that could be adopted to increase the attendance rate for scheduled appointments.

See the case in the chart below.

On May 17, Petrobras shares fell. Most of the shares on the Brazilian stock exchange saw their price plummet that day. The main cause of this strong negative variation was an event involving Joesley Batista, one of the most shocking political events of the first half of that year. In the chart below, even with many observations, it is easy to identify the point that disagrees with the others.

The same chart also shows that, although this point differs from the others, it is not exactly outside the curve. In another case, also with data from the Brazilian stock market, the stock of the company Magazine Luiza appreciated. This data point, besides being atypical and distant from the others, also represents an outlier.
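As a rough illustration of how such a day could be flagged automatically rather than spotted on a chart, the sketch below computes daily returns from closing prices and marks those whose modified z-score (based on the median and the median absolute deviation) exceeds 3.5. The prices are hypothetical, not actual Petrobras or Magazine Luiza quotes. The 3.5 cut-off is a common convention, not something this article prescribes.

```python
import statistics


def daily_returns(prices):
    """Simple day-over-day returns: (today - yesterday) / yesterday."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def flag_unusual_returns(prices, cutoff=3.5):
    """Return (day_index, return) pairs whose modified z-score exceeds cutoff."""
    returns = daily_returns(prices)
    median = statistics.median(returns)
    mad = statistics.median([abs(r - median) for r in returns])
    if mad == 0:
        return []  # no spread at all, so nothing can be scored
    flagged = []
    # day 1 is the first day that has a previous close to compare against
    for day, r in enumerate(returns, start=1):
        score = 0.6745 * (r - median) / mad
        if abs(score) > cutoff:
            flagged.append((day, r))
    return flagged


# Hypothetical closing prices with one sharp single-day drop.
closing_prices = [100.0, 101.0, 100.5, 101.5, 102.0, 86.0, 87.0, 86.5]
unusual_days = flag_unusual_returns(closing_prices)
for day, r in unusual_days:
    print(f"day {day}: return {r:.1%}")
```

The median and the median absolute deviation are used here instead of the mean and standard deviation because a single extreme return would otherwise inflate the very statistics used to judge it.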

See the chart.

This is an outlier that can harm not only descriptive statistics such as the mean (the median is far less sensitive to a single extreme value), but also the calibration of predictive models.

A more complex but quite precise way of finding outliers is to find the statistical distribution that most closely approximates the distribution of the data and then use statistical methods to detect discrepant points.
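The effect of an outlier on descriptive statistics can be seen in a small sketch. The numbers are hypothetical and chosen only to show how differently a single extreme value moves the mean and the median.

```python
import statistics

# Hypothetical observations, with and without one extreme value.
typical = [21, 22, 23, 24, 25]
with_outlier = typical + [250]

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

In this example the mean more than doubles while the median barely moves. This is why the median is often reported alongside the mean when outliers are suspected.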

The dataset used for this example is a public dataset widely used in statistical tests by data scientists. The histogram is one of the main and simplest charting tools an analyst can use to understand the behavior of the data.

In the histogram below, the blue line represents the normal (Gaussian) distribution implied by the mean, standard deviation and sample size, contrasted with the histogram bars. The red vertical lines mark units of standard deviation. Cars with outlier performance for their time could average more than 14 kilometers per liter, which is more than 2 standard deviations above the average.

In this video, in English with subtitles, we present the identification of outliers in a visual way, using a visual clustering process with national flags.
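The two-standard-deviation reading of the histogram can be sketched as a z-score check. The fuel-economy values below are hypothetical and are not taken from the public dataset mentioned above. The function name `zscore_outliers` and the threshold of 2 are assumptions based on the description in the text.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical fuel economy in kilometers per liter.
km_per_liter = [8.0, 8.5, 9.0, 9.2, 9.5, 9.8, 10.0, 10.3, 10.5, 11.0, 11.2, 14.5]
flagged = zscore_outliers(km_per_liter)
print(flagged)
```

This check assumes the data is roughly normal, as the blue curve in the histogram suggests. The mean and standard deviation are themselves pulled by extreme values, so on small or heavily skewed samples the check can miss outliers.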

We have seen that it is imperative to pay attention to outliers because they can bias data analysis. Beyond identifying outliers, the analyst also has to decide how best to treat them.

Aquarela Analytics is a pioneering Brazilian company and a reference in the application of Artificial Intelligence in industry and large companies.

Doctor and Master of Business Administration in Finance.

Specialist in financial econometrics, behavioral finance, quantitative methods and capital markets.

What are outliers and how to treat them in Data Analytics? Apr 9. Understanding outliers is critical in analyzing data for at least two reasons: outliers may negatively bias the entire result of an analysis, and the behavior of outliers may be precisely what is being sought.

Joni Hoppen. Wlademir Ribeiro Prates.

