For example, in the case of customer data, even though we may have data from millions of customers, these customers may only belong to a few segments: customers are similar within each segment but different across segments. We may often want to analyze each segment separately, as they may behave differently e. In such situations, to identify segments in the data one can use statistical techniques broadly called Clustering techniques.
Cluster analysis is used in a variety of applications. For example it can be used to identify consumer segments, or competitive sets of products, or groups of assets whose prices co-move, or for geo-demographic segmentation, etc. In general it is often necessary to split our data into segments and perform any subsequent analysis within each segment in order to develop potentially more refined segment-specific insights.
- Handbook of Air Pollution Prevention and Control, First Edition;
- Cluster Sampling: Definition, Method and Examples;
- The Third Man.
- Cluster Analysis.
- Water Transmission and Distribution: Principles and Practices of Water Supply Operations;
In this note we discuss a process for clustering and segmentation using a simple dataset that describes attitudes of people to shopping in a shopping mall. Before reading further, do try to think what segments one could define using this example data. As always, you will see that even in this relatively simple case it is not as obvious what the segments should be, and you will most likely disagree with your colleagues about them: the goal after all is to let the numbers and statistics help us be more objective and statistically correct.
Q Research Software - Advanced Data Analysis | Market Research Guide | Q Research Software
The management team of a large shopping mall would like to understand the types of people who are, or could be, visiting their mall. They have good reasons to believe that there are a few different market segments, and they are considering designing and positioning the shopping mall services better in order to attract mainly a few profitable market segments, or to differentiate their services e. To make these decisions, the management team run a market research survey of a few potential customers.
In this case this was a small survey to only a few people, where each person answered six attitudinal questions and a question regarding how often they visit the mall, all on a scale , as well as one question regarding their household income:. We will see some descriptive statistics of the data later, when we get into the statistical analysis.
How can the company segment these 40 people? Are there really segments in this market? There is not one process for clustering and segmentation. However, we have to start somewhere, so we will use the following process:. While one can cluster data even if they are not metric, many of the statistical methods available for clustering require that the data are so: this means not only that all data are numbers, but also that the numbers have an actual numerical meaning, that is, 1 is less than 2, which is less than 3 etc.
However, one could potentially define distances also for non-metric data. For example, if our data are names of people, one could simply define the distance between two people to be 0 when these people have the same name and 1 otherwise - one can easily think of generalizations. We will show a simple example of such a manual intervention below.
It is possible e.
What is Cluster Analysis?
In our case the data are metric, so we continue to the next step. Before doing so, we see the descriptive statistics of our data to get, as always, a better understanding of the data. Our data have the following descriptive statistics:. Note that one should spend a lot of time getting a feeling of the data based on simple summary statistics and visualizations: good data analytics require that we understand our data very well.
This is an optional step. To avoid such issues, one has to consider whether or not to standardize the data by making some of the initial raw attributes have, for example, mean 0 and standard deviation 1 e. Here is for example the R code for the first approach, if we want to standardize all attributes:.
The “Business Decision”
While this is typically a necessary step, one has to always do it with care: some times you may want your analytics findings to be driven mainly by a few attributes that take large values; other times having attributes with different scales may imply something about those attributes. In many such cases one may choose to skip step 2 for some of the raw attributes. The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering.
Good exploratory research that gives us a good sense of what variables may distinguish people or products or assets or regions is critical. Moreover, we often use only a few of the data attributes for segmentation the segmentation attributes and use some of the remaining ones the profiling attributes only to profile the clusters, as discussed in Step 8. In our case, we can use the 6 attitudinal questions for segmentation, and the remaining 2 Income and Mall.
Visits for profiling later. Remember that the goal of clustering and segmentation is to group observations based on how similar they are.
- Methods and Models in Artificial and Natural Computation. A Homage to Professor Mira’s Scientific Legacy: Third International Work-Conference on the Interplay Between Natural and Artificial Computation, IWINAC 2009, Santiago de Compostela, Spain, June 22-.
- Cluster Analysis;
- Hypertension and Brain Mechanisms;
- Survey and evaluation of techniques!
- Conduct and Interpret a Cluster Analysis.
- The Political Economy of Competitiveness: Corporate Performance and Public Policy (Routledge Studies in Contemporary Political Economy).
It is therefore crucial that we have a good undestanding of what makes two observations e. If the user does not have a good understanding of what makes two observations e. Most statistical methods for clustering and segmentation use common mathematical measures of distance. Typical measures are, for example, the Euclidean distance or the Manhattan distance see help dist in R for more examples.
In our case we explore two distance metrics: the commonly used Euclidean distance as well as a simple one we define manually. The Euclidean distance between two observations in our case, customers is simply the square root of the average of the square difference between the attributes of the two observations in our case, customers. K-means cluster is a method to quickly cluster large data sets.
The researcher define the number of clusters in advance. This is useful to test different models with a different assumed number of clusters. Hierarchical cluster is the most common method. It generates a series of models with cluster solutions from 1 all cases in one cluster to n each case is an individual cluster. Hierarchical cluster also works with variables as opposed to cases; it can cluster variables together in a manner somewhat similar to factor analysis.
In addition, hierarchical cluster analysis can handle nominal, ordinal, and scale data; however it is not recommended to mix different levels of measurement. Two-step cluster analysis identifies groupings by running pre-clustering first and then by running hierarchical methods.
Because it uses a quick cluster algorithm upfront, it can handle large data sets that would take a long time to compute with hierarchical cluster methods. In this respect, it is a combination of the previous two approaches. Two-step clustering can handle scale and ordinal data in the same model, and it automatically selects the number of clusters. The hierarchical cluster analysis follows three basic steps: 1 calculate the distances, 2 link the clusters, and 3 choose a solution by selecting the right number of clusters.
First, we have to select the variables upon which we base our clusters. In the dialog window we add the math, reading, and writing tests to the list of variables. Since we want to cluster cases we leave the rest of the tick marks on the default.
In the dialog box Statistics… we can specify whether we want to output the proximity matrix these are the distances calculated in the first step of the analysis and the predicted cluster membership of the cases in our observations. Again, we leave all settings on default.
In the dialog box Plots… we should add the Dendrogram. The Dendrogram will graphically show how the clusters are merged and allows us to identify what the appropriate number of clusters is. Various heuristics have been developed that attempt to find sweet spot in this trade-off. For example, this can be useful to:. This example analyzes seven variables measuring the extent of agreement with the following statements:.
The seven variables that have been analyzed can be reduced to three variables. The real conclusion to draw from this analysis is that the principal components analysis has failed to identify much that is interesting. Perhaps a more interesting solution could be found by investigating more components e.
Cluster Analysis and Segmentation
For example, the equation above shows that TV advertising is more effective than online advertising. Common dependent variables in survey analysis applications of regression include: Overall satisfaction with a product or a service. Likelihood to recommend a product or service. Net Promoter Score. Likelihood to use a product or service again. Product quality. Frequency of buying or using a product or service. In most applications of regression to survey analysis, the independent variables are either: Demographic variables. For example, if wishing to identify high value customers, the dependent variable may be amount of money spent and the independent variables would be demographics.
Measurements of performance in different areas. For example, if the dependent variable measures satisfaction with an airline the independent variables could be things such as satisfaction with the food, satisfaction with the cabin crew, satisfaction with the in-flight entertainment, and so on. Measurements of effort in different areas. For example, expenditures on different types of advertising, such as in the above example. Viewing the tree as a table The tree above can also be expressed as an admittedly ugly table:.
The strengths and weaknesses of predictive trees The key strength of predictive trees is ease-of-use. The major limitations of predictive trees are that: They are at their best with very large sample sizes. Predictive trees with less than a few hundred observations are often not useful. This is because each time the tree splits the sample size also splits, so with small sample sizes the trees end up having only one or two variables. Predictive trees cannot be used to make conclusions about which variables are strong and which are weak predictors. To appreciate this, we can exclude marital status and re-grow the tree.
The revised tree is shown below.
Note that now Age is shown to be the first predictor of number of SMS per week and it seems to be close to be similar in its predictive accuracy to marital status, with predictions ranging from 2. Split the sample into groups that are relatively similar in terms of this variable e.