Machine Learning -I

Posted on 11th May 2019

What is Machine Learning?

Machine learning is about making machines (computers) learn or understand the patterns in data without being explicitly programmed. It uses mathematical and statistical algorithms to predict a result or to make a decision about the data.

ML Model = Algorithm(Data), where Data = X + Y

X -> Set of independent variables or features

Y -> Output variable or response

The objective of ML is to estimate the target function (f) that best maps the input variables (X) to an output variable (Y):

Y = f(X) + e

Here “e” is the irreducible error: no matter how good we get at estimating the target function (f), we cannot reduce this error.
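To make Y = f(X) + e concrete, here is a minimal sketch in Python. The target function f(X) = 2X + 1, the noise level, and all variable names are assumptions chosen only for illustration. It simulates data, estimates f with a least-squares line, and shows that the residual spread stays close to the spread of e even when f is estimated well.

```python
import random

# Hypothetical target function; f(X) = 2*X + 1 is an illustrative assumption.
def f(x):
    return 2.0 * x + 1.0

random.seed(0)
noise_sd = 0.5  # assumed spread of the irreducible error e
xs = [i / 10 for i in range(200)]
ys = [f(x) + random.gauss(0.0, noise_sd) for x in xs]  # Y = f(X) + e

# Estimate f with ordinary least squares for a straight line.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# The residuals keep roughly the spread of e, however good the fit is.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
residual_sd = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5

print(f"estimated f(X) = {slope:.2f} * X + {intercept:.2f}")
print(f"residual standard deviation = {residual_sd:.2f}")
```

The estimated slope and intercept land near 2 and 1, while the residual standard deviation stays near the assumed 0.5. Collecting more data would tighten the estimate of f but not shrink e.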


The Analytics Life cycle:

1. Evaluate/Monitor Results - Evaluate Results - Business Manager

2. Identify/Formulate Problem - Identify the problem

3. Data Preparation - Business Analyst

4. Data Exploration

5. Transform & Select - Data Scientist

6. Build Model

7. Validate Model - IT System

8. Deploy Model

Data Science Project Life Cycle:

What is CRISP-DM?

CRISP-DM was conceived in late 1996 by three veterans of the young and immature data mining market. CRISP-DM stands for "CRoss-Industry Standard Process for Data Mining."

The process model for data mining provides an overview of the life cycle of a data mining project. It contains all the phases of a project, their respective tasks, and the relationships between these tasks. Relationships could exist between any data mining tasks depending on the goals, the background, and the interest of the user and, most importantly, on the data.

CRISP-DM Phases:

1. Business Understanding - This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

2. Data Understanding - The data understanding phase starts with initial data collection and proceeds with activities that enable you to become familiar with the data. ETL developers, Hadoop developers, data processing engineers, or data scientists usually work in this phase. Connecting to and collecting the data are often done here.

3. Data Preparation - The data preparation phase covers all activities needed to construct the final dataset (the data that will be fed into the modeling tools) from the initial raw data. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools. It requires statistical techniques.

4. Modeling - In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. All kinds of machine learning algorithms and techniques are used here.

5. Evaluation - At this stage you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives. A key objective is to determine whether there is some important business issue that has not been sufficiently considered.

6. Deployment - Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. It often involves applying "live" models within an organization's decision-making processes. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.

Steps involved in the Data Science Project Life Cycle

1. Business Understanding

> Determine Business Objectives - Background, Business Objectives, Business Success Criteria

> Assess Situation - Inventory of Resources, Requirements, Assumptions, and Constraints, Risks & Contingencies, Terminology, Costs and Benefits

> Determine Data Mining Goals - Data Mining Goals, Data Mining Success Criteria

> Produce Project Plan - Project Plan, Initial Assessment of Tools and Techniques

2. Data Understanding

> Collect Initial Data - Initial Data Collection Report

> Describe Data - Data Description Report

> Explore Data - Data Exploration Report

> Verify Data Quality - Data Quality Report

3. Data Preparation

> Select Data - Rationale for Inclusion/Exclusion

> Clean Data - Data Cleaning Report

> Construct Data - Derived Attributes, Generated Records

> Integrate Data - Merged Data

> Format Data - Reformatted Data

> Dataset - Dataset Description

4. Modeling

> Select Modeling Techniques - Modeling Technique, Modeling Assumptions

> Generate Test Design - Test Design

> Build Model - Parameter Settings, Models, Model Descriptions

> Assess Model - Model Assessment, Revised Parameter Settings

5. Evaluation

> Evaluate Results - Assessment of Data Mining Results with respect to Business Success Criteria, Approved Models

> Review Process - Review of Process

> Determine Next Steps - List of Possible Actions, Decision

6. Deployment

> Plan Deployment - Deployment plan

> Plan Monitoring and Maintenance - Monitoring and Maintenance Plan

> Produce Final Report - Final Report, Final Presentation

> Review Project - Experience Documentation

A few examples of applied Machine Learning techniques:

  • Regression Analysis – Finding the relationship between a dependent variable and one or more independent variables - Predicting Diamond price based on Carat, Cut & Clarity

  • Classification Analysis – Dividing objects into 2 or more known classes - Distinguishing cancer and normal cells

  • Outlier Analysis - Finding unusual observations - Credit card transactions

  • Association Analysis – Finding links - Shopping cart analysis

  • Cluster Analysis (Segmentation) – Grouping similar objects together - Grouping customers into different clusters based on their previous shopping data/transactions.

  • Time Series Analysis – Time dependent Data - Stock prediction


Basics of Statistics:

  • Random Variable
  • Types of Random Variables
  • Central Tendencies - Mean, Mode, Median etc
  • Probability, Probability Distribution of Random Variables.

For example:

Mean - Sum of values of a data set divided by number of values: (1+2+2+3+4+7+9)/7 = 4

Median - Middle value separating the greater and lesser halves of a data set: 1, 2, 2, 3, 4, 7, 9 : 3

Mode - Most frequent value in a data set: 1, 2, 2, 3, 4, 7, 9 : 2
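The three values above can be checked with Python's standard `statistics` module. This is a small sketch using the same data set as the examples:

```python
import statistics

data = [1, 2, 2, 3, 4, 7, 9]

mean = statistics.mean(data)      # (1+2+2+3+4+7+9)/7
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # prints: 4 3 2
```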


Statistics is a branch of Mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data.

  1. Descriptive Statistics - It talks about Yesterday - EDA (Exploratory Data Analysis)
  2. Inferential Statistics - It talks about Tomorrow - Modeling - This includes Predictions (Statistics), Forecasting (Mathematics), and Estimation (Economics)

1. Descriptive Statistics: Yesterday's data

  • Data Types
  • Central Tendency
  • Dispersion
  • Five number summary
  • Distribution
  • Cross Tabulation

How do we collect the data related to the problem?

  • Transactional Data
  1. Enterprise Resource Planning Systems (ERP): Help in facilitating business

        > Financial/Accounting (Payroll, general ledger, and cash management)

        > Customer Relationship Management: Sales and Marketing, Commissions, Call centre

        > e-Commerce Applications: Online Shopping, Online Ticketing

  • External Sources
  1. Market Survey: Email questionnaire, Paper questionnaire
  2. Reports: Government reports, Agency reports

How is the data processed?

Sources (Legacy Data, Operational Data, Flat Files) ---> Staging Area ---> Data Warehouse (Raw Data, Summary Data, Meta Data) ---> Data Information ---> Target Info


This Target Information is then used for fact-based decision making.

Target Information <--|Summary, Visualize {EDA},  Predict, Estimate, Forecast{Modelling}|--> Fact Based Decision Making


What are the EDA Techniques?

Visualization & Summary: Used for Reporting and Data Validation.



1. Visual Analytics: It's like the Harry Potter DVD - Quick overview

  • Histogram
  • Box Plot
  • Bar Chart
  • Pie Chart
  • Bubble Chart
  • Correlation Plot
  • Scatter Plot
  • Line Chart
  • Decision Tree
  • Cluster Charts

2. Summary Analytics: It's like the Harry Potter novel - Contains details

  • Central Tendency: Mean (average of the values; pulled toward large values, like Ambani), Median (middle of the values), Mode (most repeated value), Geometric Mean (compounded mean), Harmonic Mean (proportionate mean; pulled toward small values, like a rickshaw puller), Trimmed Mean, 95% Upper Mean, 95% Lower Mean, Weighted Average. - HISTOGRAM

          > Central tendency is for analysis and quality checking.

          > Central tendency gives the answer, not the solution, e.g. that you scored 34 marks in the exam.

  • Dispersion: Standard Deviation, Variance, Coefficient of Variation (CV = Standard Deviation / Arithmetic Mean; as small as possible), Range (Max - Min), Min, Max, Skewness (±0.80), Kurtosis (±3.00), IQR (interquartile range), Standard Error (Standard Deviation / sqrt(Sample Size); below 0.05 indicates a good sample), DQ (Data Quality = Harmonic Mean / Arithmetic Mean; should be high) - STANDARDIZED PLOT

           > Dispersion is for making decisions.

          > Dispersion should always be low; low dispersion means the data quality is good.

          > Dispersion should not cross or exceed the central tendency.


  • Five number/Robust Analysis, Percentiles, Quartiles Summary: Q0 (0th percentile, minimum), Q1 (25th percentile), Q2 (50th percentile, median), Q3 (75th percentile), Q4 (100th percentile, maximum) - BOX PLOT
  • Distribution
  • Cross Tabulation
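The dispersion measures and the five-number summary listed above can be sketched with Python's standard `statistics` module. The sample values are hypothetical, the sample (n - 1) form of the standard deviation is an assumption, and `dq` follows this post's own definition (harmonic mean divided by arithmetic mean), which is not a standard library statistic.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

mean = statistics.mean(values)
variance = statistics.variance(values)        # sample variance (n - 1)
std = statistics.stdev(values)                # sample standard deviation
cv = std / mean                               # coefficient of variation
value_range = max(values) - min(values)       # Max - Min
std_error = std / len(values) ** 0.5          # standard error of the mean
dq = statistics.harmonic_mean(values) / mean  # this post's data-quality ratio

# Five-number summary: Q0 (min), Q1, Q2 (median), Q3, Q4 (max).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
five_numbers = (min(values), q1, q2, q3, max(values))
iqr = q3 - q1                                 # interquartile range

print(f"std={std:.3f} variance={variance:.3f} cv={cv:.1%} range={value_range}")
print(f"std_error={std_error:.3f} dq={dq:.1%} iqr={iqr}")
print("five-number summary:", five_numbers)
print("dispersion below central tendency:", std < mean)
```

Quartile values depend on the interpolation method. `method="inclusive"` is one common choice, and other tools may report slightly different Q1 and Q3 for the same data.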

 Summary Techniques:

  1. Data Types: Continuous, Discrete, Ordinal, Nominal, Interval(Linear Regression), Ratio(Quadrant Regression)

       Numerical data: all statistics except count and percentage can be applied.

  • Continuous data: e.g. product price: 5.66, 4.76, 5.54 etc. (with decimals)
  • Discrete data: e.g. number of credit cards or books: 6, 4, 5, 7, 100 etc. (no decimals)

         Character data: only count and percentage should be applied; no other statistic.

  • Ordinal data: categories with a natural order, e.g. ratings, education, salary range; a rating scale of 1 to 9 where 1=poor, 2=average, 3=good, 4=very good, 5=excellent, 6=outstanding, 7=awesome, 8=incredible, 9=mindblowing > Ordinal Algorithm
  • Nominal data: no order to follow, e.g. Gender: 1=male, 2=female; Location: 1=Hyderabad, 2=Mumbai, 3=Pune. Binary data: only two values. Unary data (unique identifiers): Aadhaar, PAN, CUST_ID, PRODUCT_ID > Discriminative Algorithm
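As a rough illustration of the rule above (count and percentage for character data, other statistics for numerical data), here is a small Python sketch. The `locations` and `prices` lists are hypothetical sample data.

```python
from collections import Counter
import statistics

# Hypothetical sample data for illustration only.
locations = ["Hyderabad", "Mumbai", "Pune", "Mumbai", "Hyderabad", "Mumbai"]  # nominal
prices = [5.66, 4.76, 5.54, 6.10, 4.95, 5.20]                                # continuous

# Character (nominal) data: only count and percentage are meaningful.
counts = Counter(locations)
total = len(locations)
percentages = {name: 100 * count / total for name, count in counts.items()}

for name, count in counts.most_common():
    print(f"{name}: count={count}, percentage={percentages[name]:.1f}%")

# Numerical data: mean, median, standard deviation and so on apply.
print(f"mean price={statistics.mean(prices):.2f}, "
      f"median price={statistics.median(prices):.2f}, "
      f"std={statistics.stdev(prices):.2f}")
```

Averaging the location codes (1=Hyderabad, 2=Mumbai, 3=Pune) would produce a number with no meaning, which is why only counts and percentages are reported for them.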

Case Study:

CC defaulter   CC Default   Zscore ((X - Mean) / Standard Deviation)
1              5400         -0.33029
2              6500         -0.27536
3              5430         -0.32879
4              65000         2.66563
5              5430         -0.32879
6              5210         -0.33978
7              4350         -0.38272
8              5433         -0.32864
9              4980         -0.35126


Central Tendency

Mean 12014.8
Median 5430
Mode 5430
Geo Mean 7022
Harmonic Mean 5885



Standard Deviation 20027.3
Data Quality(Quality) 49%
Coefficient of Variation (Std/Mean)(Risk) 167%
  • Extreme values are affecting the business data. To find them, apply the ZScore; its value should lie within ±1.96.
  • Why is the ZScore used? The standard score (more commonly referred to as a z-score) is a very useful statistic because it (a) allows us to calculate the probability of a score occurring within our normal distribution and (b) enables us to compare two scores that are from different normal distributions.
  • ZScore = (X (Value) - Xbar (Mean)) / Standard Deviation
  • The ZScore helps in finding outliers. Whenever a value does not lie within this range, it is categorized as an outlier.
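A minimal Python sketch of the ZScore check, using the nine default amounts from the case study table. The use of the sample standard deviation (`statistics.stdev`) is an assumption, since the post does not say which form was used, so the computed scores may differ slightly from the table.

```python
import statistics

# Credit card default amounts from the case study table.
amounts = [5400, 6500, 5430, 65000, 5430, 5210, 4350, 5433, 4980]

mean = statistics.mean(amounts)
std = statistics.stdev(amounts)  # sample standard deviation (assumed)

# ZScore = (X - Mean) / Standard Deviation
z_scores = [(x - mean) / std for x in amounts]

# Anything outside the +-1.96 band is treated as an outlier.
threshold = 1.96
outliers = [x for x, z in zip(amounts, z_scores) if abs(z) > threshold]

for x, z in zip(amounts, z_scores):
    print(f"{x:>6}  z={z:+.5f}")
print("outliers:", outliers)  # prints: outliers: [65000]
```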

When to do Data Cleaning? 

  • The Standard Deviation should not dominate the central tendencies, i.e. Mean, Median, Mode, Geometric Mean, Harmonic Mean.
  • If it does dominate, go back to the data and compute the ZScore to find the outliers.
  • Whenever Mean, Mode, and Median are similar, the data is good.
  • The Mean is pulled toward large values (like Ambani); the Harmonic Mean is pulled toward small values (like a rickshaw puller).
  • DQ (Data Quality) should be high.
  • CV (Coefficient of Variation) should be low.
  • The ZScore of the data should be within ±1.96.

In the above case, remove the outlier (defaulter 4, with a default of 65000 and a ZScore of 2.66563) to clean the data.

After cleaning the data, we have:

CC defaulter   CC Default
1              5400
2              6500
3              5430
5              5430
6              5210
7              4350
8              5433
9              4980


Central Tendency

Mean 5393
Median 5430
Mode 5430
Geo Mean 5362
Harmonic Mean 5330



Standard Deviation 625.986
Data Quality(Quality) 99%
Coefficient of Variation (Std/Mean)(Risk) 12%
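The before-and-after comparison can be approximated with a short Python sketch. It works from the nine amounts listed in the case study and drops the 65000 outlier. The `summarize` helper is hypothetical, and the sample standard deviation is an assumption, so the computed figures will be close to, but not necessarily identical with, the ones quoted in this post.

```python
import statistics

def summarize(values):
    """Return the central tendency and dispersion checks used in this post."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumed)
    harmonic = statistics.harmonic_mean(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "harmonic_mean": harmonic,
        "std": std,
        "cv": std / mean,       # coefficient of variation (risk), lower is better
        "dq": harmonic / mean,  # data quality ratio, higher is better
    }

amounts = [5400, 6500, 5430, 65000, 5430, 5210, 4350, 5433, 4980]
cleaned = [x for x in amounts if x != 65000]  # drop the outlier found via ZScore

before = summarize(amounts)
after = summarize(cleaned)

for name in before:
    print(f"{name:>14}: before={before[name]:10.3f}  after={after[name]:10.3f}")
```

After cleaning, the standard deviation no longer dominates the mean, the coefficient of variation drops sharply, and the data quality ratio moves close to 1, which is the pattern the checklist above asks for.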


© SOFTHINKERS 2013-18 All Rights Reserved. Privacy policy