Interpreting K-means cluster analysis

From mtab wikisupport
Revision as of 15:57, 1 August 2013 by Mtabadmin (Talk | contribs)

Metric Knowledge Management Services, Pvt. Ltd. provided the article below, which discusses the analysis required to interpret the results of PCA (principal component analysis) or factor analysis. The same principles apply to the interpretation of mTAB's K-Means Cluster Analysis results.

For a pictorial illustration of the use of mTAB's K-Means feature, refer to the mTAB version 5.4 Newsletter.

For more information on Metric Knowledge Management's services and consulting opportunities, refer to the Metric Knowledge website at www.metricknowledge.com.

mTAB would like to acknowledge and thank Metric Knowledge Management Services for granting their permission to include the article below.

Taking Chance with PCA / Factor Analysis

All of us know that PCA (Principal Component Analysis) and Factor Analysis are popular multivariate statistical techniques that are used to summarize the information contained in a large number of variables into a smaller number of subsets or factors.

We use PCA or Factor Analysis to discover simple patterns in the complex web of relationships in multivariate data. Specifically, we try to see if the observed variables can be replaced largely or entirely by a much smaller number of variables called Components or Factors.

In Satisfaction Measurement studies, we use PCA / Factor Analysis to reduce the number of attributes to a few factors that account for the maximum variability contained in the full set of original attributes.
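The reduction described above can be sketched in Python. This is a minimal illustration, not the procedure used in the article: it assumes NumPy, runs PCA on the correlation matrix of synthetic data generated from two hidden factors, and the function name pca_loadings and all data values are hypothetical.

```python
import numpy as np

def pca_loadings(data):
    """Return (loadings, cumulative_pct) for a respondents-by-attributes array."""
    # Correlation matrix of the attributes (columns).
    corr = np.corrcoef(data, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loadings: each eigenvector scaled by the square root of its eigenvalue.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    cumulative_pct = 100.0 * np.cumsum(eigvals) / eigvals.sum()
    return loadings, cumulative_pct

# Synthetic ratings: four attributes driven by two hidden factors (assumed data).
rng = np.random.default_rng(0)
n = 500
technology = rng.normal(size=n)
service = rng.normal(size=n)
data = np.column_stack([
    technology + 0.4 * rng.normal(size=n),
    technology + 0.4 * rng.normal(size=n),
    service + 0.4 * rng.normal(size=n),
    service + 0.4 * rng.normal(size=n),
])

loadings, cumulative_pct = pca_loadings(data)
print(np.round(loadings, 2))
print(np.round(cumulative_pct, 1))
```

Because the four synthetic attributes are built from only two factors, the first two components should carry most of the variation, which is the kind of summary PCA is used for.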

However, as practicing MR professionals, we have to in turn “sell” our findings to the skeptical line managers and hard-nosed decision makers.

And that is where the problem begins.

I don’t know how many of us, or how often, have obtained a neat and beautiful table like the one below.

VARIABLE                     PC1    PC2    PC3
DESIGN                      0.81   0.19  -0.31
RELIABILITY                 0.81   0.22  -0.34
SETUP TIME                  0.76   0.21  -0.38
QUALITY                     0.76   0.14   0.05
COURTESY                   -0.21   0.93   0.04
SPEED OF RESPONSE          -0.26   0.92  -0.06
KNOWLEDGE                  -0.23   0.91   0.04
COST OF ACCESSORIES         0.54   0.05   0.75
SYSTEM COST                 0.60   0.03   0.70
% VARIANCE (CUMULATIVE)       36     66     82

It’s almost a dream situation for two reasons.

The attribute weights are such that each component (PC1, PC2 and PC3) contains a different set of attributes, because each attribute's weights in the three components are quite different from each other. The first four attributes belong to PC1 because their weights are the highest in PC1, and so on. You are spared the awkward decision of putting an attribute in one component when it might just as well belong to one or more of the others.

More importantly, the new components thus evolved (the synthetic components) nicely correspond to some distinctly different real-life categories, which are understandable and actionable for managerial decision-making (PC1 = Technology, PC2 = Service, PC3 = Cost).
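The assignment rule described above, placing each attribute in the component where its weight is highest, can be sketched in Python using the weights from the table above. Comparing absolute values (so that a large negative weight would also count) and the function name assign_to_components are assumptions made for this illustration.

```python
# Weights copied from the textbook table above, in the order (PC1, PC2, PC3).
WEIGHTS = {
    "DESIGN": (0.81, 0.19, -0.31),
    "RELIABILITY": (0.81, 0.22, -0.34),
    "SETUP TIME": (0.76, 0.21, -0.38),
    "QUALITY": (0.76, 0.14, 0.05),
    "COURTESY": (-0.21, 0.93, 0.04),
    "SPEED OF RESPONSE": (-0.26, 0.92, -0.06),
    "KNOWLEDGE": (-0.23, 0.91, 0.04),
    "COST OF ACCESSORIES": (0.54, 0.05, 0.75),
    "SYSTEM COST": (0.60, 0.03, 0.70),
}

def assign_to_components(weights):
    """Group attributes by the component on which their absolute weight is largest."""
    groups = {}
    for attribute, row in weights.items():
        # Index of the largest absolute weight (assumption: the sign is ignored).
        best = max(range(len(row)), key=lambda i: abs(row[i]))
        groups.setdefault(f"PC{best + 1}", []).append(attribute)
    return groups

groups = assign_to_components(WEIGHTS)
for component in sorted(groups):
    print(component, groups[component])
```

On this table the rule reproduces the grouping in the text: the first four attributes fall under PC1, the three service attributes under PC2, and the two cost attributes under PC3.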

What we have to say becomes quite clear when this textbook case is contrasted with the following output for real-life data.

Variable       PC1     PC2     PC3     PC4     PC5     PC6
P_1         -0.165   0.006  -0.345   0.083   0.109   0.092
P_2         -0.163   0.017  -0.267   0.166  -0.148  -0.023
P_3         -0.145  -0.040  -0.096  -0.059  -0.205   0.080
P_4         -0.166   0.294   0.004   0.083  -0.030  -0.333
P_5         -0.139   0.204  -0.063  -0.176   0.143  -0.043
P_6         -0.178   0.012  -0.100   0.165  -0.149   0.022
P_7         -0.142   0.372  -0.148  -0.257  -0.120   0.104
P_8         -0.144  -0.031  -0.111   0.084  -0.112  -0.041
P_9         -0.101  -0.147  -0.253   0.045  -0.010  -0.061
P_10        -0.186   0.179  -0.127   0.057   0.096   0.210

First, you will notice that almost every attribute could belong to more than one component, because its weights on several components are close in value.

More importantly, even if we force some division, the components that emerge do not necessarily correspond to distinctly different and meaningful real-life categories that are useful for managerial decision-making.

This is not to discard the technique altogether, but rather to understand its limitations so that we don’t make a fetish out of it.

Now let me illustrate a case study in which we managed to use PCA and come up with relevant inputs for managerial decision-making.
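One way to make the "close values of weights" problem concrete is to measure, for each attribute, the gap between its largest and second-largest absolute weight. The sketch below does this for the real-life table above. The gap measure, the 0.1 threshold, and the function names are assumptions chosen for illustration; they are not a standard rule and not part of the original analysis.

```python
# Weights copied from the real-life table above, in the order PC1 to PC6.
WEIGHTS = {
    "P_1": (-0.165, 0.006, -0.345, 0.083, 0.109, 0.092),
    "P_2": (-0.163, 0.017, -0.267, 0.166, -0.148, -0.023),
    "P_3": (-0.145, -0.040, -0.096, -0.059, -0.205, 0.080),
    "P_4": (-0.166, 0.294, 0.004, 0.083, -0.030, -0.333),
    "P_5": (-0.139, 0.204, -0.063, -0.176, 0.143, -0.043),
    "P_6": (-0.178, 0.012, -0.100, 0.165, -0.149, 0.022),
    "P_7": (-0.142, 0.372, -0.148, -0.257, -0.120, 0.104),
    "P_8": (-0.144, -0.031, -0.111, 0.084, -0.112, -0.041),
    "P_9": (-0.101, -0.147, -0.253, 0.045, -0.010, -0.061),
    "P_10": (-0.186, 0.179, -0.127, 0.057, 0.096, 0.210),
}

def top_two_gap(row):
    """Gap between the largest and second-largest absolute weights in a row."""
    largest, second = sorted((abs(w) for w in row), reverse=True)[:2]
    return largest - second

def ambiguous_attributes(weights, min_gap=0.1):
    """Attributes whose top two absolute weights differ by less than min_gap."""
    return [name for name, row in weights.items() if top_two_gap(row) < min_gap]

ambiguous = ambiguous_attributes(WEIGHTS)
print(f"{len(ambiguous)} of {len(WEIGHTS)} attributes are ambiguous:", ambiguous)
```

With the assumed threshold of 0.1, six of the ten attributes are flagged, and the rest clear it only narrowly in some cases. A different threshold would change the count, which is itself a reminder that the judgment cannot be fully automated.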


Case Study: Gas Station CS Study

CONTEXT

A large oil-sector company asked us to evaluate their 300 gas stations for customer satisfaction. Unlike in the West, in India the front end of a gas station, the customer interface, is yet to be mechanized. So the entire service component is still delivered by a team of deliverymen with the help of dispensing units, which are susceptible to tampering at most places. The price structure for oil products is such that there is a hefty incentive for mixing and adulteration.

DATA COLLECTED

A predetermined sample of visiting customers was interviewed at the exit point using a structured questionnaire. The list of 40 attributes included items such as accurate quantity of fuel, pure quality of fuel, courteous behavior of staff, value-added services like windshield cleaning, payment by credit card, availability of auto accessories, etc.

Analysis of the satisfaction data was done using METRIC’s proprietary model: The MOSTER System.

Despite being strong critics of the reckless use of traditional MVA tools for satisfaction data, we always try them out in each of our projects. In this case, we were lucky to get “some” meaningful insights using Principal Component Analysis (PCA). This is how it goes… (The table has been truncated for the sake of brevity.)

GAS STATION DATA
Variable                     PC1     PC2     PC3     PC4     PC5     PC6
Courteous Behavior        -0.304   0.898   0.272   0.005  -0.111   0.075
Easy Access               -0.387  -0.328   0.316  -0.318  -0.524  -0.163
Quick Service             -0.373  -0.275   0.442   0.546   0.237   0.486
Accurate Quantity         -0.405  -0.018  -0.047  -0.212   0.753  -0.410
Less waiting time         -0.369   0.025  -0.748  -0.030  -0.074   0.476
24 hour service           -0.394  -0.017  -0.256   0.553  -0.287  -0.555
Pure Quality              -0.405  -0.096   0.065  -0.500  -0.041   0.167
Cumulative % Explained      68.6    77.7    85.1    91.1    94.9    97.5

The first three components, which together account for 85% of the variation, were termed as the following managerial dimensions:

PC1: Product
PC2: Human Interface
PC3: Delivery

You will notice that we were unable to convert the other components, namely PC4, PC5 and PC6, into any meaningful managerial categories. In any case, they were not adding substantially to the cumulative explained variation.
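How little the later components add can be read off the "Cumulative % Explained" row of the gas station table by differencing consecutive values. The short Python sketch below does that arithmetic; the function name marginal_from_cumulative is a hypothetical helper introduced for illustration.

```python
# "Cumulative % Explained" row of the gas station table, PC1 to PC6.
CUMULATIVE = [68.6, 77.7, 85.1, 91.1, 94.9, 97.5]

def marginal_from_cumulative(cumulative):
    """Percentage of variation contributed by each component on its own."""
    previous = [0.0] + cumulative[:-1]
    # Round to one decimal to match the precision of the published table.
    return [round(now - before, 1) for now, before in zip(cumulative, previous)]

marginal = marginal_from_cumulative(CUMULATIVE)
for index, share in enumerate(marginal, start=1):
    print(f"PC{index}: {share}%")
```

The differences show that PC4, PC5 and PC6 contribute 6.0, 3.8 and 2.6 percentage points respectively, compared with 68.6 for PC1 alone.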

Based on the above study, the company initiated the following actions.

Action                                                                    Target Component            Target Attribute
Tightened the supply chain                                                Product                     Pure Quality
Installed a tight quality control regime                                  Product                     Pure Quality
Trained staff on work discipline and customer orientation                 Human Interface             Courteous Behavior & Work Discipline
Installed new-generation dispensing units that are tamper-proof and fast  Product & Delivery          Accurate Quantity
Added manpower                                                            Human Interface & Delivery  Quick Service

The moral of the story, as we see it, is…

The use of PCA and Factor Analysis is likely to be fraught with uncertainties, as explained above. We must make the user aware of these uncertainties. (Easier said than done! Users often demand the comfort of certainty.)

The practice of both PCA and Factor Analysis requires considerable human creativity and judgment at every step. It is not a blind plug-and-play software tool, as it is often made out to be.

