Understanding the function: pandas.DataFrame.describe

What does .describe do?

It generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Parameters:

  • percentiles:

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
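As a minimal sketch of the percentiles parameter, the snippet below asks for the 10th and 90th percentiles instead of the defaults. The column name One and the values 4, 6, 8, 10 are assumptions chosen purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({"One": [4, 6, 8, 10]})

# Request the 10th and 90th percentiles (values must lie between 0 and 1).
summary = df.describe(percentiles=[0.1, 0.9])
print(summary)

# Individual statistics can be looked up by their row label.
p10 = summary.loc["10%", "One"]
p90 = summary.loc["90%", "One"]
print(p10, p90)
```

The requested percentiles appear as rows labelled 10% and 90% in the result.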

Now, some of you might wonder how percentiles are calculated. Let’s see:

pandas computes them with linear interpolation, which is also the default behaviour of the NumPy function numpy.percentile.

For example, let’s say we have this output:

             One
count   4.000000
mean    7.000000
std     2.581989
min     4.000000
25%     5.500000
50%     7.000000
75%     8.500000
max    10.000000

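25% is calculated as: 4 + (10 - 4) * (1/4) = 5.5
75% is calculated as: 4 + (10 - 4) * (3/4) = 8.5

This min-and-max shortcut works here only because the underlying values are evenly spaced. In general, linear interpolation is done between the two sorted values nearest to the requested position.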
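Here is a rough sketch of linear interpolation between neighbouring sorted values. It assumes the data behind the summary is 4, 6, 8, 10; the original values are not shown, but these reproduce the statistics. The helper linear_percentile is hypothetical and written only for illustration; it is compared against numpy.percentile and describe.

```python
import numpy as np
import pandas as pd


def linear_percentile(sorted_vals, q):
    """Hypothetical helper: percentile q (between 0 and 1) by linear interpolation."""
    pos = q * (len(sorted_vals) - 1)               # fractional position in the sorted data
    lower = int(pos)                               # index of the value just below that position
    upper = min(lower + 1, len(sorted_vals) - 1)   # index of the value just above
    frac = pos - lower                             # how far to move towards the upper value
    return sorted_vals[lower] + (sorted_vals[upper] - sorted_vals[lower]) * frac


values = [4, 6, 8, 10]  # assumed data, already sorted
q25 = linear_percentile(values, 0.25)
q75 = linear_percentile(values, 0.75)
print(q25, q75)

# The same numbers from NumPy and from describe().
np_q25, np_q75 = np.percentile(values, [25, 75])
described = pd.Series(values).describe()
print(np_q25, np_q75, described["25%"], described["75%"])

# Unevenly spaced data: the min/max shortcut no longer agrees.
uneven = [1, 2, 3, 100]
interpolated = linear_percentile(uneven, 0.25)
shortcut = min(uneven) + (max(uneven) - min(uneven)) * 0.25
print(interpolated, shortcut)
```

For the unevenly spaced list, interpolation gives 1.75 while the min/max shortcut gives 25.75, so the shortcut matches only when the values are evenly spaced.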

Bonus info:
The 25th percentile is also called the first quartile.
The 50th percentile is the median.
The 75th percentile is also called the third quartile.
The difference between the third and first quartiles is the interquartile range.

  • include:

‘all’, list-like of dtypes, or None (default)

‘all’: All columns of the input will be included in the output.
A list-like of dtypes: Limits the results to the provided data types.
None (default): The result will include all numeric columns.

  • exclude:

list-like of dtypes or None (default)

A list-like of dtypes: Excludes the provided data types from the result.
None (default): The result will exclude nothing.
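The include and exclude options can be compared side by side. This is only a sketch: the column names price and city and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one text column.
df = pd.DataFrame({
    "price": [4.0, 6.0, 8.0, 10.0],
    "city": ["Pune", "Delhi", "Pune", "Agra"],
})

default_summary = df.describe()                      # None: numeric columns only
all_summary = df.describe(include="all")             # every column
numbers_only = df.describe(include=[np.number])      # only numeric dtypes
without_numbers = df.describe(exclude=[np.number])   # everything except numeric dtypes

print(list(default_summary.columns))
print(list(all_summary.columns))
print(list(numbers_only.columns))
print(list(without_numbers.columns))
```

With include=‘all’, statistics that do not apply to a column (for example the mean of city) are filled with NaN.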

Returns:

Series or DataFrame

Summary statistics of the Series or DataFrame provided.
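A quick way to see the two return types, again using assumed sample values:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"One": [4, 6, 8, 10]})

frame_summary = df.describe()          # describing a DataFrame gives a DataFrame
series_summary = df["One"].describe()  # describing a single column gives a Series

print(type(frame_summary).__name__)
print(type(series_summary).__name__)
```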

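Note:

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq.
The top is the most common value.
The freq is the most common value’s frequency. Timestamps also include the first and last items. (This timestamp behaviour applies to older pandas versions; from pandas 2.0, datetime columns are summarized like numeric ones.)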
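A small sketch of describe on text data; the city names are made-up sample values.

```python
import pandas as pd

# Hypothetical text data with one repeated value.
cities = pd.Series(["Pune", "Delhi", "Pune", "Agra"])

summary = cities.describe()
print(summary)

# top is the most common value and freq is how often it occurs.
print(summary["top"], summary["freq"])
```

Here the most common value is Pune, which occurs twice among three distinct values.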

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

Happy to help 🙂 For any queries, you can reach out to me at harjotsaini69@gmail.com.

See you in the next blog with another data science topic 🙂
