What is ‘Kernel Density Estimation’ (KDE)?

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable.
Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample.

Kernel density estimation is a really useful statistical tool with an intimidating name. Often shortened to KDE, it’s a technique that lets you create a smooth curve given a set of data.

This can be useful if you want to visualize just the “shape” of some data, as a kind of continuous replacement for the discrete histogram.

How does it work?

The KDE algorithm takes a parameter, bandwidth, that affects how “smooth” the resulting curve is. 

Changing the bandwidth changes the shape of the kernel: a lower bandwidth means only points very close to the current position are given any weight, which leads to the estimate looking squiggly; a higher bandwidth means a wide, shallow kernel where distant points can contribute too, which leads to a smoother estimate.
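
To make the role of the bandwidth concrete, here is a minimal sketch of a Gaussian KDE written by hand with NumPy. The sample values, the helper names gaussian_kde_manual and count_peaks, and the choice of a Gaussian kernel are all assumptions made for illustration; this is not the code pandas runs internally.

```python
import numpy as np

def gaussian_kde_manual(data, grid, bandwidth):
    """Average one Gaussian bump per data point, evaluated on `grid`."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid position to every data point.
    u = (grid[:, None] - data[None, :]) / bandwidth
    # Standard normal kernel: the weight shrinks as the distance grows.
    weights = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    # Average over the data points and rescale so the curve integrates to 1.
    return weights.sum(axis=1) / (len(data) * bandwidth)

def count_peaks(y):
    """Count strict local maxima in a sampled curve (a rough 'squiggliness' measure)."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

sample = np.array([1.0, 1.5, 2.0, 4.0, 4.2, 7.0])  # made-up data
grid = np.linspace(-10.0, 18.0, 2801)               # evaluation positions
dx = grid[1] - grid[0]

narrow = gaussian_kde_manual(sample, grid, bandwidth=0.2)  # low bandwidth
wide = gaussian_kde_manual(sample, grid, bandwidth=2.0)    # high bandwidth

print("peaks with low bandwidth: ", count_peaks(narrow))
print("peaks with high bandwidth:", count_peaks(wide))
print("area under each curve:", narrow.sum() * dx, wide.sum() * dx)
```

Both curves are valid density estimates of the same data. The low-bandwidth curve reacts to individual points, while the high-bandwidth curve blends them together.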

What is a kernel, in non-parametric statistics?

In non-parametric statistics, a kernel is a weighting function used in non-parametric estimation techniques. Kernels are used in kernel density estimation to estimate random variables’ density functions, or in kernel regression to estimate the conditional expectation of a random variable.

(Source: https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Kernel_(statistics).html)
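
As a rough illustration of what "weighting function" means here, the sketch below defines two commonly used kernels, the Gaussian and the Epanechnikov, and checks numerically that each one is non-negative, symmetric around zero, and has an area of about 1. The function names and the evaluation grid are my own choices for this example.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: every distance gets some (possibly tiny) weight."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def epanechnikov_kernel(u):
    """Parabolic kernel: zero weight for anything further than 1 unit away."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

u = np.linspace(-6.0, 6.0, 12001)  # scaled distances from the current position
du = u[1] - u[0]

areas = {}
for name, kernel in [("gaussian", gaussian_kernel),
                     ("epanechnikov", epanechnikov_kernel)]:
    w = kernel(u)
    areas[name] = w.sum() * du  # simple Riemann-sum approximation of the area
    print(name,
          "| non-negative:", bool((w >= 0).all()),
          "| symmetric:", bool(np.allclose(w, w[::-1])),
          "| area:", round(float(areas[name]), 4))
```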

How to plot KDE using Pandas Series?

Let’s see:

Series.plot.kde(bw_method=None, ind=None, **kwds)

Parameters:

  • bw_method

The method used to calculate the estimator bandwidth. This can be ‘scott’, ‘silverman’, a scalar constant or a callable. If None (default), ‘scott’ is used.

  • ind

Evaluation points for the estimated PDF:
If None (default), 1000 equally spaced points are used.
If ind is a NumPy array, the KDE is evaluated at the points passed.
If ind is an integer, ind number of equally spaced points are used.

Returns:

axes : matplotlib.axes.Axes or numpy.ndarray of them

Note - A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting (a jagged curve that follows every data point), while using a large bandwidth value may result in under-fitting (a curve so smooth that real features are lost).
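
Here is a small sketch that puts the parameters above together. The data is synthetic (two normal clusters generated with a fixed seed), and the bandwidth values 0.05 and 1.5 are arbitrary picks meant to exaggerate the effect. It assumes matplotlib and SciPy are installed, since pandas relies on them for KDE plots.

```python
import numpy as np
import pandas as pd

# Synthetic two-cluster sample, made up for this example.
rng = np.random.default_rng(0)
s = pd.Series(np.concatenate([rng.normal(0.0, 1.0, 200),
                              rng.normal(5.0, 1.0, 200)]))

# Small scalar bandwidth: the curve tends to look jagged (over-fitting).
ax = s.plot.kde(bw_method=0.05, label="bw_method=0.05")

# Default: bw_method=None falls back to 'scott', ind=None uses 1000 points.
s.plot.kde(ax=ax, label="default (scott)")

# Large scalar bandwidth, evaluated on our own 300-point grid via `ind`.
s.plot.kde(bw_method=1.5, ind=np.linspace(-5.0, 10.0, 300),
           ax=ax, label="bw_method=1.5")

ax.legend()
# In a plain script, call matplotlib.pyplot.show() to display the figure;
# notebooks usually render it automatically.
```

All three curves are drawn on the same axes so the difference in smoothness is easy to compare.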

Happy to help 🙂 You can reach out to me at harjotsaini69@gmail.com for any questions.

Happy Learning!!!
