A predictive tree is an analysis that looks like an upside-down tree. Although it looks quite complicated, this 'tree' is just a graphical representation of a table. The term predictive tree is not a standard term; the same basic idea goes by many names, some of them brand names and some generic, including: Rpart (R), Tree and AnswerTree (SPSS), CHAID (Statistical Innovations), CART, regression trees, classification trees and decision trees.
The tree below predicts the number of SMS (text messages) sent by people in the Mobiles example study. The top box on the tree shows the data for the entire sample (i.e., 100%). It shows an average of 11.0 text messages per person, and the histogram (column chart) shows that the most common answer is around 0. From this tree we can see that the first predictor variable is Marital status: people who are Single sent an average of 20.7 SMS, compared to 4.0 among the non-single people.
As the Single category is not split further, this tells us that once we know that somebody is single we can make a prediction without any further information: a single person is predicted to send 20.7 SMS. From the histogram we can see that there is still considerable variation among these single people.
Looking now at the people who are not single (i.e., Married/de facto + Separated/Divorced + Widowed), this group is split by age. Those aged under 45 send an average of 9.3 SMS per month, compared to 2.2 among those aged 45 or more.
Looking further down the tree, we can see that this younger group is split further by whether or not they have children, by age again, and by occupation.
Viewing the tree as a table
The tree above can also be expressed as an (admittedly ugly) table:
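To illustrate how a tree maps onto a table, the sketch below flattens a small nested structure into one row per terminal group. The structure and the names `tree`, `rule`, `mean_sms` and `tree_to_rows` are hypothetical; only the averages quoted in the text above are used, and the deeper splits of the under-45 group are omitted because their values are not given.

```python
# Hypothetical nested representation of the top of the tree described above.
# Only averages quoted in the text are used; deeper splits are omitted.
tree = {
    "rule": "Entire sample",
    "mean_sms": 11.0,
    "children": [
        {"rule": "Marital status: Single", "mean_sms": 20.7, "children": []},
        {
            "rule": "Marital status: not Single",
            "mean_sms": 4.0,
            "children": [
                {"rule": "Age: under 45", "mean_sms": 9.3, "children": []},
                {"rule": "Age: 45 or more", "mean_sms": 2.2, "children": []},
            ],
        },
    ],
}


def tree_to_rows(node, path=()):
    """Return one (group description, predicted mean) row per terminal group."""
    path = path + (node["rule"],)
    if not node["children"]:
        # Drop the root label from the description unless the root is the only node.
        description = " AND ".join(path[1:]) or path[0]
        return [(description, node["mean_sms"])]
    rows = []
    for child in node["children"]:
        rows.extend(tree_to_rows(child, path))
    return rows


rows = tree_to_rows(tree)
for description, mean_sms in rows:
    print(f"{description:<50} {mean_sms:>5.1f}")
```

Each row is a rule (the path from the top box to a terminal box) together with its prediction, which is all the information the tree contains.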
The strengths and weaknesses of predictive trees
The key strength of predictive trees is ease of use. Provided that you apply a bit of common sense and take the time to learn your software, it is hard to go particularly wrong with a tree. Predictive trees are the closest thing that there is to an idiot-proof advanced analysis technique. By contrast, regression, which can be used for similar problems, is frequently used incorrectly even by very experienced researchers.
The major limitations of predictive trees are that:
- They are at their best with very large sample sizes. Predictive trees with fewer than a few hundred observations are often not useful. This is because each time the tree splits, the sample available to each group shrinks, so with small sample sizes the trees end up having only one or two variables.
- Predictive trees cannot be used to draw conclusions about which variables are strong and which are weak predictors. For example, looking at the Single group, a natural but incorrect interpretation is that age is irrelevant because it is not used to split this group further. This interpretation ignores that being single is correlated with being young. To appreciate this, we can exclude marital status and re-grow the tree. The revised tree is shown below. Note that Age is now the first predictor of the number of SMS, and its predictive accuracy appears to be similar to that of marital status, with predictions ranging from 2.8 to 22.3, compared to 4.0 to 20.7 for marital status.[note 1]
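The second limitation can be made concrete with a small sketch. The data below is synthetic and made up for illustration (it is not the Mobiles study), with counts chosen so that being single is correlated with being young. The names `cells`, `people`, `sse`, `gain` and `first_split` are hypothetical, and reduction in the sum of squared errors is used as an assumed measure of predictive accuracy; real programs use other criteria.

```python
from statistics import mean

# Synthetic cells: (marital status, age group, SMS sent, number of people).
# Most single people are under 45, so the two predictors are correlated.
cells = [
    ("Single", "Under 45", 22, 30),
    ("Single", "45 or more", 14, 5),
    ("Not single", "Under 45", 9, 15),
    ("Not single", "45 or more", 2, 50),
]
people = [
    {"marital": marital, "age": age, "sms": sms}
    for marital, age, sms, count in cells
    for _ in range(count)
]


def sse(rows):
    """Sum of squared errors of SMS around the group mean."""
    centre = mean(r["sms"] for r in rows)
    return sum((r["sms"] - centre) ** 2 for r in rows)


def gain(rows, variable):
    """Reduction in SSE obtained by splitting rows on every category of variable."""
    groups = {}
    for r in rows:
        groups.setdefault(r[variable], []).append(r)
    return sse(rows) - sum(sse(g) for g in groups.values())


def first_split(rows, candidates):
    """Pick the candidate variable with the largest gain."""
    return max(candidates, key=lambda v: gain(rows, v))


print("First split with both predictors:", first_split(people, ["marital", "age"]))
for variable in ("marital", "age"):
    print(variable, round(gain(people, variable), 1))
```

In this made-up data marital status wins the first split, yet age on its own achieves most of the same gain. A reader who saw only the tree would see marital status at the top and could wrongly conclude that age matters little.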
Most advanced statistical techniques are reasonably standard, with different programs providing basically the same outputs. This is not the case with predictive trees. Most software companies create their trees in different ways and the distinctions between the trees created by different programs can be large.
All of the common predictive tree programs are technically described as being recursive divisive algorithms, which is a fancy way of saying that they:
1. Find the variable that best predicts what is being predicted.
2. Split the sample into groups that are relatively similar in terms of this variable (e.g., in the tree at the top of this page, the groups were defined based on marital status).
3. For each of the new groups, repeat steps 1 and 2, splitting each group further and further until no more splitting is practical. (In computer science and logic, the idea of continually reapplying the same process is called recursion.)
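The three steps above can be sketched in a few lines of code. This is a deliberately simplified illustration, not the algorithm used by any of the programs named in this article: it assumes categorical predictors, one group per category, reduction in squared error as the measure of "best", and two simple stopping rules. The function names, the `min_size` and `max_depth` parameters, and the synthetic data are all hypothetical.

```python
from statistics import mean


def sse(rows, target):
    """Sum of squared errors of the target around the group mean."""
    centre = mean(r[target] for r in rows)
    return sum((r[target] - centre) ** 2 for r in rows)


def split_by(rows, variable):
    """Group rows by the category they have on variable."""
    groups = {}
    for r in rows:
        groups.setdefault(r[variable], []).append(r)
    return groups


def grow(rows, target, predictors, min_size=5, max_depth=3, depth=0):
    node = {"n": len(rows), "mean": mean(r[target] for r in rows), "children": {}}
    # Stop when splitting is no longer practical.
    if depth >= max_depth or len(rows) < 2 * min_size:
        return node
    # Step 1: find the predictor whose split reduces squared error the most.
    parent_sse = sse(rows, target)
    best, best_gain, best_groups = None, 0.0, None
    for variable in predictors:
        groups = split_by(rows, variable)
        if len(groups) < 2 or any(len(g) < min_size for g in groups.values()):
            continue
        improvement = parent_sse - sum(sse(g, target) for g in groups.values())
        if improvement > best_gain:
            best, best_gain, best_groups = variable, improvement, groups
    if best is None:
        return node
    node["split_on"] = best
    # Steps 2 and 3: split into groups, then repeat the process within each group.
    for category, group in best_groups.items():
        node["children"][category] = grow(
            group, target, predictors, min_size, max_depth, depth + 1
        )
    return node


# Synthetic data for illustration only: (marital, age, sms, count).
cells = [
    ("Single", "Under 45", 22, 30),
    ("Single", "45 or more", 14, 5),
    ("Not single", "Under 45", 9, 15),
    ("Not single", "45 or more", 2, 50),
]
people = [
    {"marital": m, "age": a, "sms": s} for m, a, s, count in cells for _ in range(count)
]
tree = grow(people, "sms", ["marital", "age"])
print(tree["split_on"], sorted(tree["children"]))
```

The function calls itself on each new group, which is the recursion referred to in step 3. Real programs differ mainly in how they measure "best", how they merge categories into groups, and when they stop.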
The key differences between the different bits of software are:
- What they can predict. Most programs will only predict a single categorical or a single numeric variable. The major exception to this is Q, which can simultaneously predict multiple numeric, categorical and more exotic data types (e.g., Choice Modeling data).
- How many groups they form each time they do a split. In the examples above, the program (in this case, Q) tries to work out the optimal number of groups at each level of the tree. A similar approach is used by SPSS's Tree, AnswerTree, CHAID and DataCracker. The alternative approach is to always split into two groups; this approach is used in CART.
- What rules they use to determine when to stop further splitting groups in the tree. Each of the programs has various ways of doing this, including:
  - Requiring that additional splits improve the predictive accuracy of the tree (different programs implement this in different ways; CHAID uses significance tests, DataCracker and Q use information criteria and CART usually uses cross-validation).
  - Specifying a minimum sample size after which a group cannot be further split.
  - Specifying a maximum number of levels of the tree.
- The number of options available. DataCracker, for example, only lets the user choose between a small and a bigger tree; SPSS's Tree lets the user choose between various tree algorithms (e.g., different variants of CHAID and CART); R's rpart can be customized in pretty much any way that is desired, if the user is skilled enough; and Q lets the user change the predictor variables available within each group.
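The sample-size limitation noted earlier interacts with two of the choices listed above: the number of groups formed at each split and the minimum sample size per group. The arithmetic below is a rough sketch under an optimistic assumption, namely that every split divides a group evenly. The function name `levels_supported` and the example figures are hypothetical.

```python
def levels_supported(sample_size, groups_per_split, min_group_size):
    """Count how many successive even splits fit before groups fall below the minimum."""
    levels = 0
    group_size = sample_size
    while group_size / groups_per_split >= min_group_size:
        group_size = group_size / groups_per_split
        levels += 1
    return levels


# Illustrative figures only: a minimum of 30 people per group.
print(levels_supported(200, 3, 30))   # 1: groups of about 67, then about 22
print(levels_supported(200, 2, 30))   # 2: groups of 100, 50, then 25
print(levels_supported(5000, 3, 30))  # 4: groups stay above 30 for four levels
```

Under these assumptions a sample of 200 supports only one or two levels of splitting, which is why small samples tend to produce trees with only one or two variables. Real splits are rarely even, so the smallest groups usually run out of sample sooner than this suggests.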
[note 1] Indeed, it may look as though age is the better predictor, as it has a larger range. However, in this case occupation is preferred because it has fewer categories with broadly similar predictive accuracy; this is an example of the principle of parsimony being taken into account when creating a model.