
# Machine Learning - II

Posted on 11th May 2019

Machine learning has two objectives:

1. Statistics - Prediction => Y = a + b1*x1 + b2*x2 + ... + bn*xn + error, for numerical data

2. Mathematics - Classification => 1/(1 + exp(-Y)), for categorical data
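The two formulas above can be sketched in a few lines of Python. This is only an illustration, not a fitted model: the coefficients `a`, `b` and the inputs `x` are made-up values, and the 0.5 cut-off used to turn the probability into a class label is a common convention assumed here.

```python
import math

def linear_predict(a, b, x):
    """Prediction: Y = a + b1*x1 + ... + bn*xn (error term omitted)."""
    return a + sum(bi * xi for bi, xi in zip(b, x))

def sigmoid(y):
    """Classification: map a linear score Y to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical coefficients and inputs, for illustration only
a, b, x = 0.5, [2.0, -1.0], [1.0, 3.0]

score = linear_predict(a, b, x)        # 0.5 + 2.0*1.0 - 1.0*3.0
probability = sigmoid(score)           # value between 0 and 1
label = 1 if probability >= 0.5 else 0 # assumed 0.5 threshold

print(score, probability, label)
```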

Statistics is of two types:

1. Descriptive Statistics - Yesterday - EDA methods

2. Inferential Statistics - Tomorrow - Models

Problem statement:

iPhone sales data (customer ages) is as follows: 18,18,19,20,20,30,30,30,30,30,31,32,33,34,40,45,46,47,47,50,50,50,59,60,60,61

1. Summarise and visualise Customer age data

2. Increase sales by 5%, guided by the customer data.

How do we approach such a question?

Identify the approaches and techniques available to you. Then do the data validation, data cleaning and data munging.

Example below:

Summary Approach:

1. Continuous/Discrete Data Approach:

Numerical Data Approach - all statistics except count and percentage

| Statistic | Value |
| --- | --- |
| Mean | 38.30769231 |
| Standard Error | 2.70371211 |
| Median | 33.5 |
| Mode | 30 |
| Standard Deviation | 13.78628081 |
| Sample Variance | 190.0615385 |
| Kurtosis | -1.124971234 |
| Skewness | 0.189941371 |
| Range | 43 |
| Minimum | 18 |
| Maximum | 61 |
| Sum | 996 |
| Count | 26 |
| Geomean | 34.90718932 |
| Harmean | 32.28076389 |
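A minimal sketch of how such a summary could be computed with Python's standard `statistics` module (Python 3.8+ for `geometric_mean`). It uses the 26 values as typed in the problem statement, so the printed figures may differ slightly from the table if the table was produced from a slightly different copy of the data. Kurtosis and skewness are left out to keep the sketch short.

```python
import math
import statistics as st

# The 26 values as typed in the problem statement
ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

n = len(ages)
sample_sd = st.stdev(ages)  # sample form, divides by n - 1

summary = {
    "Mean": st.mean(ages),
    "Standard Error": sample_sd / math.sqrt(n),
    "Median": st.median(ages),
    "Mode": st.mode(ages),
    "Standard Deviation": sample_sd,
    "Sample Variance": st.variance(ages),
    "Range": max(ages) - min(ages),
    "Minimum": min(ages),
    "Maximum": max(ages),
    "Sum": sum(ages),
    "Count": n,
    "Geomean": st.geometric_mean(ages),
    "Harmean": st.harmonic_mean(ages),
}

for name, value in summary.items():
    print(f"{name:20s} {value}")
```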

2. Ordinal/Nominal Data Approach:

Categorical Data Approach - only count and percentage.
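For categorical data, count and percentage can be obtained with `collections.Counter`. The sketch below turns the ages into a categorical column first; the three age bands and the `age_band` helper are hypothetical, chosen only to have some categories to count.

```python
from collections import Counter

ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

def age_band(age):
    """Hypothetical banding, for illustration only."""
    if age < 30:
        return "under 30"
    if age < 45:
        return "30-44"
    return "45+"

counts = Counter(age_band(a) for a in ages)
total = sum(counts.values())
percentages = {band: 100 * c / total for band, c in counts.items()}

for band, c in counts.items():
    print(f"{band:10s} count={c:2d} percentage={percentages[band]:.1f}%")
```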

To make the decision, go with this approach:

1. Histogram - for data quality check only

2. Stem-and-leaf technique: for summarization, visualization and data quality check

https://www.rosettacode.org/wiki/Stem-and-leaf_plot

For the values below:

 18 18 19 20 20 21 30 30 30 30 31 32 33 33 34 40 45 46 47 47 50 50 50 59 60 60 61

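| Stem | Leaf | Frequency | Cumulative Frequency | Frequency % | Cumulative Frequency % |
| --- | --- | --- | --- | --- | --- |
| 1 | 8 8 9 | 3 | 3 | 0.111111111 | 0.111111111 |
| 2 | 0 0 1 | 3 | 6 | 0.111111111 | 0.222222222 |
| 3 | 0 0 0 0 1 2 3 3 4 | 9 | 15 | 0.333333333 | 0.555555556 |
| 4 | 0 5 6 7 7 | 5 | 20 | 0.185185185 | 0.740740741 |
| 5 | 0 0 0 9 | 4 | 24 | 0.148148148 | 0.888888889 |
| 6 | 0 0 1 | 3 | 27 | 0.111111111 | 1 |
| Total | | 27 | | | |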

From the table above we can draw the histogram, which is useful when deciding how to increase sales.
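A stem-and-leaf table like this one can be built by splitting each value into its tens digit (stem) and units digit (leaf). The sketch below assumes two-digit values, as in this data set; the function name `stem_and_leaf` is just an illustrative choice.

```python
from collections import defaultdict

values = [18, 18, 19, 20, 20, 21, 30, 30, 30, 30, 31, 32, 33, 33, 34,
          40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

def stem_and_leaf(data):
    """Group each value by its tens digit (stem) and units digit (leaf)."""
    stems = defaultdict(list)
    for v in sorted(data):
        stems[v // 10].append(v % 10)
    return dict(stems)

plot = stem_and_leaf(values)
rows = []
cumulative = 0
for stem in sorted(plot):
    leaves = plot[stem]
    cumulative += len(leaves)
    # stem, leaves, frequency, cumulative frequency, and both as fractions
    rows.append((stem, leaves, len(leaves), cumulative,
                 len(leaves) / len(values), cumulative / len(values)))

for stem, leaves, freq, cum, pct, cum_pct in rows:
    leaf_text = " ".join(str(leaf) for leaf in leaves)
    print(f"{stem} | {leaf_text:18s} {freq:2d} {cum:2d} {pct:.3f} {cum_pct:.3f}")
```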

Next, let's look at the box plot.

3. Box Plot: for summarization, visualization and data quality check

The histogram bins are built from the range:

Range = Max Value - Min Value = 61 - 18 = 43

Number of bins = 5 (suppose)

So, 43 / 5 = 8.6 is the width of each bin.

| Bin | Lower | Upper | Frequency |
| --- | --- | --- | --- |
| 1 | 18 | 26.6 | 6 |
| 2 | 26.6 | 35.2 | 9 |
| 3 | 35.2 | 43.8 | 1 |
| 4 | 43.8 | 52.4 | 7 |
| 5 | 52.4 | 61 | 4 |
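The binning above can be reproduced with a short loop. One assumption in this sketch: each bin includes its lower edge and excludes its upper edge, except the last bin, which also includes the maximum value.

```python
values = [18, 18, 19, 20, 20, 21, 30, 30, 30, 30, 31, 32, 33, 33, 34,
          40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

num_bins = 5                       # chosen number of bins
low, high = min(values), max(values)
width = (high - low) / num_bins    # range / number of bins

counts = [0] * num_bins
for v in values:
    # Clamp so the maximum value falls into the last bin
    index = min(int((v - low) / width), num_bins - 1)
    counts[index] += 1

for i, count in enumerate(counts):
    lower = low + i * width
    upper = low + (i + 1) * width
    print(f"Bin {i + 1}: {lower:.1f} to {upper:.1f} -> {count}")
```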

### Summary:

Data Types:

- Continuous
- Discrete
- Ordinal
- Nominal
- Interval
- Ratio

Central Tendency:

- Mean
- Median
- Mode
- Geometric mean
- Harmonic mean
- Trimmed mean
- Weighted average
- 95% upper mean = Mean + Zscore * Std Error = Mean + 1.96 * Std Deviation / Sqrt(Sample Size)
- 95% lower mean = Mean - Zscore * Std Error = Mean - 1.96 * Std Deviation / Sqrt(Sample Size)
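As a rough sketch, the 95% lower and upper mean can be computed as follows. It uses the values from the problem statement and assumes the normal-approximation critical value Z = 1.96; for a small sample a t-based value would be slightly wider.

```python
import math
import statistics as st

ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

n = len(ages)
mean = st.mean(ages)
std_error = st.stdev(ages) / math.sqrt(n)  # sample SD / sqrt(sample size)

z = 1.96  # assumed critical value for a 95% interval
lower_mean = mean - z * std_error
upper_mean = mean + z * std_error

print(f"95% interval for the mean: {lower_mean:.2f} to {upper_mean:.2f}")
```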

Dispersion:

- Standard Deviation (should not dominate the central tendency) = Sqrt(1/N * Sum((X - Mean)^2))
- Variance = 1/N * Sum((X - Mean)^2) (divide by N - 1 instead of N for the sample versions)
- CV (coefficient of variation; the lower the better) = Standard Deviation / Arithmetic Mean
- Range = Max - Min
- Min
- Max
- Skewness (within ±0.80) ~ average of (Zscore)^3, see https://en.wikipedia.org/wiki/Skewness
- Kurtosis (within ±3.00)
- IQR (Interquartile Range) = Q3 - Q1
- Std Error (< 0.05 indicates a good sample) = Standard Deviation / Sqrt(Sample Size)
- DQ (the higher, the better the data) = Harmonic Mean / Arithmetic Mean
- Zscore (beyond ±1.96 indicates an outlier) = (X - Xbar) / Standard Deviation
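Several of these measures follow directly from the mean and the standard deviation. The sketch below uses the 1/N (population) form of the standard deviation, matching the formula as written here, and applies the ±1.96 z-score rule to flag outliers; the variable names are illustrative.

```python
import statistics as st

ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

mean = st.mean(ages)
sd = st.pstdev(ages)                     # population form: divides by N

cv = sd / mean                           # coefficient of variation
z_scores = [(x - mean) / sd for x in ages]
outliers = [x for x, z in zip(ages, z_scores) if abs(z) > 1.96]
skewness = sum(z ** 3 for z in z_scores) / len(ages)  # average cubed z-score
dq = st.harmonic_mean(ages) / mean       # harmonic mean / arithmetic mean

print(f"CV={cv:.3f} skewness={skewness:.3f} DQ={dq:.3f} outliers={outliers}")
```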

Five-number summary:

| Quartile | Percentile |
| --- | --- |
| Q0 | 0% |
| Q1 | 25% |
| Q2 | 50% (median) |
| Q3 | 75% |
| Q4 | 100% |
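The five-number summary is also what a box plot draws. A minimal sketch with `statistics.quantiles` (Python 3.8+) follows. Two assumptions: the `"inclusive"` interpolation method is used (other tools may give slightly different quartiles), and the 1.5 * IQR fences are the usual box-plot convention rather than something defined in this post.

```python
import statistics as st

ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

# Q1, Q2 (median) and Q3 from the three quartile cut points
q1, q2, q3 = st.quantiles(ages, n=4, method="inclusive")
q0, q4 = min(ages), max(ages)
iqr = q3 - q1

# Conventional box-plot fences (assumed 1.5 * IQR rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outside = [x for x in ages if x < lower_fence or x > upper_fence]

print(f"Q0={q0} Q1={q1} Q2={q2} Q3={q3} Q4={q4} IQR={iqr}")
print(f"Values outside the fences: {outside}")
```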
