
The Journey Begins

Hello Everyone

This is the first blog post of my life. Please excuse my grammatical errors and rough writing; I hope to write better and better with time and practice.

A short introduction about me

My name is Harjot Singh. I come from Ludhiana, situated in the heart of Punjab and popularly known as India's Manchester.

I did my B.Tech in Production & Industrial Engineering from Guru Nanak Dev Engineering College, Ludhiana. Currently, I'm pursuing my M.Tech in Industrial Engineering from Punjab Engineering College (also called PEC), Chandigarh. Some of my hobbies include singing, doing Bhangra (Punjabi folk dance; I love it, pure Punjabi blood), and playing cricket, table tennis, and badminton. I love helping people and animals in every way I can.

Motivation behind starting to blog

So, it is 10:40 pm as I write this, on 16th March, 2019. I'm in Bengaluru for my three-month internship (Feb–May 2019) at the ABB Ability Innovation Center, Bengaluru, working on a project, 'Inventory Optimization in Discrete Manufacturing'.

As I come from an Industrial Engineering background, I was interviewed on my knowledge of industrial engineering topics, and I didn't know that I would be given a project that also involves data science/data analytics. I had a month before coming here, and I had already done a basic course on Python programming, thinking it might come in handy sometime. So I used that month to learn basic Python libraries like pandas, Matplotlib, and NumPy. I also did a course (https://www.udemy.com/data-analysis-with-pandas/) taught by Boris Paskhaver.

Currently, I’m working on time series forecasting using ARIMA models. We’re still working on exploratory data analysis (EDA) on the time series data. I should mention here that I have worked for 5 months at Accenture (got placed in it after my B.tech) as an Assistant Software Engineer, a job that I didn’t find very interesting. That is why I left it and started doing M.tech to explore more and stick to my area of interest.

Coming back: I am very new to the data science field, but the good part is that I have started loving what I do at the office. I use Python, and all of a sudden I don't hate writing code anymore. In fact, I love it. Python is very easy to learn even when you come from a non-circuital background. My project involves domain knowledge of industrial engineering and the use of data analytics as a problem-solving and optimization tool.

This happy, inspiring feeling of exploring more in data science has really grown exponentially these days, and just to explore more, I've just completed this course (https://www.udemy.com/careers-in-data-science-a-ztm/) taught by Kirill Eremenko and Hadelin de Ponteves.

Having learned how broadly ML algorithms and data science in general can be applied, I'm really excited to start my journey in the field of data science.

It would be very unfair not to mention the person I look up to, who has motivated me so much to start this learning journey: Manja Bogicevic (https://www.linkedin.com/in/manjabogicevic/). If you're reading my blog and don't know about this superwoman, I recommend you go read her blogs, see her data science journey, and come back. No one can inspire you if she can't.

Also, recently I was lucky to come across a great human and a great mentor, Mr. Blaine Bateman (https://www.linkedin.com/in/blainebateman/), a mentor at Springboard (https://www.linkedin.com/school/springboard/).
He has been one of the reasons behind my motivation. I hope to come across more like-minded individuals with whom I can grow, learn, and apply more.

Also, thanks to my college junior Jagjeet Singh Ubhi, a passionate IT engineer, who motivated me to start writing blogs and who is a constant source of help and knowledge.

I am really excited to start this journey, and I will try to keep posting my weekly progress through my blogs. If you are still thinking about starting your own journey, let's start together šŸ™‚

We’ll walk this road together, through the storm
Whatever weather, cold or warm
Just letting you know that, you’re not alone
Holla if you feel like you’ve been down the same road

Thanks for reading through and bearing with my bad writing skills.


Let’s make it happen.

How do you feel?

Hypothesis Testing


ā€œIf it’s true what is said, that only the wise discover the wise, then it must also be true that the lone wolf symbolizes either the biggest fool on the planet or the biggest Einstein on the planet.ā€ 


ā€• Criss Jami, Diotima, Battery, Electric Personality


What is a Hypothesis?
It is a premise or claim that we want to test.

Null Hypothesis – Ho – Currently accepted value for a parameter.
Alternative Hypothesis (Research Hypothesis) – Ha – Claim to be tested.

Let’s understand it with an example.

Example – It is believed that a candy machine makes chocolate bars that weigh 5 g on average. A worker claims that, after maintenance, the machine no longer makes 5 g bars. What would Ho and Ha be here?

Ho: Ī¼ = 5g
Ha: Ī¼ ā‰  5g

  • Ho and Ha are mathematical opposites.
  • We assume the null hypothesis to be true unless the evidence points otherwise.

Possible outcomes of this test:
– Reject the null hypothesis.
– Fail to reject the null hypothesis.

Next: how do we do that?

Test statistic – a number calculated from sample data, used to make the decision.
Let’s continue to understand it with our example above.

We sample 50 chocolate bars (it is not practical to weigh every bar produced).
– Get the average value for the 50 bars.
– Use this information to calculate the test statistic, as in the sketch below.
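For a mean like ours, a common choice is a t-statistic: the distance between the sample mean and the claimed 5 g, measured in standard errors. Here is a minimal sketch in Python; the notes above don't name a specific test, and the 50 bar weights below are simulated, so treat it as an illustration:

# Hand-computing a one-sample t statistic: t = (x_bar - mu0) / (s / sqrt(n)).
# Assumption: a t-test is appropriate; the bar weights are simulated.
import numpy as np

np.random.seed(0)
sample = np.random.normal(loc=5.1, scale=0.4, size=50)  # hypothetical weights (g)

mu0 = 5.0                  # claimed mean under Ho
x_bar = sample.mean()      # sample mean
s = sample.std(ddof=1)     # sample standard deviation
n = len(sample)

t = (x_bar - mu0) / (s / np.sqrt(n))
print(f"sample mean = {x_bar:.3f} g, test statistic t = {t:.3f}")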

What do we mean when we say 'statistically significant'?
– It is where we draw the line to make a decision.

Continuing…
Let's say the average of a sample of 50 bars comes out to be 5.12 g (sampled on Monday), 5.72 g (sampled on Wednesday), and 7.23 g (sampled on Friday).

Now, most of us will look at these averages and form different opinions. Some might say we should reject the null hypothesis based on the third sample, which averages 7.23 g. Others might say we can accept the null hypothesis based on the first sample, which averages 5.12 g.

But there is no concreteness here; we're just talking.

Statistics is not about how you think it should be. We need a concrete method: state the null hypothesis, collect the data, and decide by a fixed rule when to reject the null hypothesis and when not to.

And that is what a hypothesis test does!

A hypothesis test collects data, puts it into an equation, and gets a number back, and that number tells you (based on concrete boundaries) when the test statistic is too high or too low, when you reject the null hypothesis and when you don't.

Level of confidence (LOC): C – typically 95% or 99%.
It is how confident we are in our decision.

Don’t forget we are doing a hypothesis test. We are testing something and we are deciding to reject the null hypothesis, or to fail to reject a null hypothesis. The level of confidence is telling us how sure we are that we did the right thing (rejecting or failing to reject the null hypothesis).

Level of significance:
denoted by 'α'.
α = 1 – C
If LOC = 95%, then C = 0.95,
so α = 1 – 0.95 = 0.05.

YOU DON'T HAVE TO PROVE THAT THE NULL HYPOTHESIS IS TRUE; HYPOTHESIS TESTING ALREADY ASSUMES THE NULL HYPOTHESIS TO BE TRUE. YOU EITHER REJECT THE NULL HYPOTHESIS OR FAIL TO REJECT IT.
This is what statistics does.
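To make the whole procedure concrete, here is a minimal sketch using scipy.stats. I'm assuming a two-sided one-sample t-test (the notes above don't name the exact test), and the weights are simulated, so it's an illustration rather than the real factory data:

# Full hypothesis test for the chocolate-bar example, assuming a
# two-sided one-sample t-test; the data are simulated for illustration.
import numpy as np
from scipy import stats

np.random.seed(42)
sample = np.random.normal(loc=5.12, scale=0.4, size=50)  # hypothetical weights (g)

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)  # Ho: mu = 5 g

alpha = 0.05  # level of significance for a 95% level of confidence
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject Ho: evidence the machine no longer makes 5 g bars.")
else:
    print("Fail to reject Ho: not enough evidence against mu = 5 g.")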

What is 'Kernel Density Estimation' (KDE)?

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable.
Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made based on a finite data sample.

Kernel density estimation is a really useful statistical tool with an intimidating name. Often shortened to KDE, it's a technique that lets you create a smooth curve from a given set of data.

This can be useful if you want to visualize just the "shape" of some data, as a kind of continuous replacement for the discrete histogram.

How does it work?

The KDE algorithm takes a parameter, bandwidth, that affects how "smooth" the resulting curve is.

Changing the bandwidth changes the shape of the kernel: a lower bandwidth means only points very close to the current position are given any weight, which leads to the estimate looking squiggly; a higher bandwidth means a shallow kernel where distant points can contribute.
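To see those mechanics directly, here is a minimal from-scratch sketch: one Gaussian kernel is centered on each data point, and the estimate is the average of the kernels. The data points and the two bandwidths are made up purely for illustration:

# A from-scratch Gaussian KDE; data and bandwidths are made up.
import numpy as np
import matplotlib.pyplot as plt

data = np.array([2.1, 2.4, 3.0, 4.5, 4.7, 5.1, 6.8])
xs = np.linspace(0, 9, 200)  # points at which to evaluate the estimate

def gaussian_kde(xs, data, bandwidth):
    # One normalized Gaussian per data point, evaluated at every x.
    z = (xs[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)  # the KDE is the average of the kernels

for bw in (0.2, 1.0):  # low bandwidth -> squiggly; high -> smooth
    plt.plot(xs, gaussian_kde(xs, data, bw), label=f"bandwidth = {bw}")
plt.legend()
plt.show()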

What is a kernel, in non-parametric statistics?

In non-parametric statistics, a kernel is a weighting function used in non-parametric estimation techniques. Kernels are used in kernel density estimation to estimate random variables' density functions, or in kernel regression to estimate the conditional expectation of a random variable.

(https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Kernel_(statistics).html)

How to plot a KDE using a pandas Series?

Let’s see..

Series.plot.kde(bw_method=None, ind=None, **kwds)

Parameters:

  • bw_method

The method used to calculate the estimator bandwidth. This can be 'scott', 'silverman', a scalar constant or a callable. If None (default), 'scott' is used.

  • ind

Evaluation points for the estimated PDF:
If None (default), 1000 equally spaced points are used.
If ind is a NumPy array, the KDE is evaluated at the points passed. If ind is an integer, ind number of equally spaced points are used.

Returns:

axes : matplotlib.axes.Axes or numpy.ndarray of them

Note – A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while using a large bandwidth value may result in under-fitting.
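Putting the parameters together, a quick sketch; the data are random and the scalar bandwidths are chosen only to exaggerate the over- and under-fitting effect:

# Series.plot.kde with the bw_method and ind parameters;
# the data are random, purely for illustration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(np.random.randn(500))  # 500 made-up standard-normal points

ax = s.plot.kde()                              # default: 'scott' bandwidth
s.plot.kde(bw_method=0.05, ax=ax)              # small scalar -> over-fit, squiggly
s.plot.kde(bw_method=2.0, ax=ax)               # large scalar -> under-fit, flat
s.plot.kde(ind=np.linspace(-4, 4, 50), ax=ax)  # evaluate at 50 chosen points
ax.legend(["scott (default)", "bw = 0.05", "bw = 2.0", "custom ind"])
plt.show()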

Happy to help šŸ™‚ You can reach out to me at harjotsaini69@gmail.com for any questions.

Happy Learning!!!

Understanding the function – pandas.DataFrame.describe

What does .describe do?

It generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

Parameters:

  • percentiles

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

Now, some of you might wonder how percentiles are calculated. Let's see:

Behind the scenes they are computed the way NumPy's numpy.percentile function does it, and its 'interpolation' parameter defaults to linear interpolation.

For example, let's say we have this output:

             One
count   4.000000
mean    7.000000
std     2.581989
min     4.000000
25%     5.500000
50%     7.000000
75%     8.500000
max    10.000000

With linear interpolation, the p-th percentile sits at index (n – 1) Ɨ p in the sorted data. Assuming the column held [4, 6, 8, 10] (consistent with the count, mean, std, min and max above):

25% : index 3 Ɨ 0.25 = 0.75, so 4 + 0.75 Ɨ (6 – 4) = 5.5
75% : index 3 Ɨ 0.75 = 2.25, so 8 + 0.25 Ɨ (10 – 8) = 8.5

(The shortcut 4 + (10 – 4) Ɨ 1/4 = 5.5 agrees here only because these values are evenly spaced.)
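You can verify this arithmetic with NumPy directly. I'm assuming the column held [4, 6, 8, 10], which matches the summary shown above:

# Verifying the 25th/75th percentiles; the values [4, 6, 8, 10]
# are assumed (they are consistent with the summary shown above).
import numpy as np
import pandas as pd

one = [4, 6, 8, 10]

print(np.percentile(one, 25))  # 5.5, via linear interpolation (the default)
print(np.percentile(one, 75))  # 8.5

df = pd.DataFrame({"One": one})
print(df.describe())           # reproduces the table above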

Bonus info –
The 25th percentile is also called the first quartile.
The 50th percentile is the median.
The 75th percentile is also called the third quartile.
The difference between the third and first quartiles is the interquartile range.

  • include

‘all’, list-like of dtypes or None (default)

'all' : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the provided data types.
None (default) : The result will include all numeric columns.

  • exclude

list-like of dtypes or None (default)

A list-like of dtypes : Excludes the provided data types from the result.
None (default) : The result will exclude nothing.

Returns:

Series or DataFrame

Summary statistics of the Series or DataFrame provided.

Note –

For object data (e.g. strings or timestamps), the result's index will include count, unique, top, and freq.
The top is the most common value.
The freq is the most common value's frequency. Timestamps also include the first and last items.
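A short sketch with made-up mixed-type data to show these fields:

# describe() on mixed dtypes; the DataFrame is made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "weight": [4.0, 6.0, 8.0, 10.0],
    "flavor": ["milk", "dark", "milk", "white"],
})

print(df.describe())                    # default: numeric columns only
print(df.describe(include="all"))       # adds count, unique, top, freq for 'flavor'
print(df.describe(include=["object"]))  # only the object (string) column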

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

Happy to help šŸ™‚ For any queries, you can reach out to me at harjotsaini69@gmail.com.

See you in the next blog with another data science topic šŸ™‚
