# Predictive Trees

A *predictive tree* is an analysis that looks like an upside-down tree. Although it looks quite complicated, this 'tree' is just a graphical representation of a table. The term *predictive tree* is not a standard term; the same basic idea goes by many different names and brand names, including: *Rpart* (R), *Tree* and *AnswerTree* (SPSS), *CHAID* (Statistical Innovations), *CART*, regression trees, classification trees and *decision trees*.

## Example

The tree below predicts the number of SMS messages sent by people in the Mobiles example study. The top box on the tree shows the data for the entire sample (i.e., 100%). It shows an average of 11.0 text messages per person and the histogram (column chart) shows that the most common answer is around 0. From this tree we can see that the first *predictor variable* is `Marital status`: people that are `Single` sent an average of 20.7 SMS, compared to 4.0 among the non-single people.

As the `Single` category is not split further, once we know that somebody is single we can make a prediction without any further information: a single person is predicted to send 20.7 SMS. From the histogram we can see that there is still considerable variation among these single people.

Looking now at the people that are not single (i.e., `Married/de facto + Separated/Divorced + Widowed`), this group is split by age: those aged under 45 have an average number of SMS per month of 9.3, compared to 2.2 among those aged 45 or more.

Looking further down the tree we can see that this younger group is split further by whether or not they have children, age again and occupation.

## Viewing the tree as a table

The tree above can also be expressed as an (admittedly ugly) table:
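
As a rough Python sketch of the same idea, the first two levels of the tree described in the Example section can be written as rows of a table, one row per group. Only the averages quoted in the text are used, and the deeper splits of the under-45 group are left out. The column names and the `predict` helper are illustrative assumptions, not output from any tree software.

```python
# One row per group from the first two levels of the tree.
# "any" means the variable is not used to define that group.
ROWS = [
    {"marital_status": "Single", "age": "any", "mean_sms": 20.7},
    {"marital_status": "Not single", "age": "Under 45", "mean_sms": 9.3},
    {"marital_status": "Not single", "age": "45 or more", "mean_sms": 2.2},
]


def predict(is_single: bool, age: int) -> float:
    """Look up the predicted number of SMS for one person (hypothetical helper)."""
    marital = "Single" if is_single else "Not single"
    age_band = "Under 45" if age < 45 else "45 or more"
    for row in ROWS:
        if row["marital_status"] == marital and row["age"] in ("any", age_band):
            return row["mean_sms"]
    raise ValueError("no matching group")


# Print the table, then look up one prediction.
for row in ROWS:
    print(f'{row["marital_status"]:<12} {row["age"]:<12} {row["mean_sms"]:>5}')
print(predict(is_single=False, age=30))
```

Reading a prediction off the tree and looking up a row in this table are the same operation: follow the conditions until a single group matches.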

## The strengths and weaknesses of predictive trees

The key strength of predictive trees is ease of use. Provided that you apply a bit of common sense and take the time to learn how to use your software, it is hard to go particularly wrong with a tree. Predictive trees are the closest thing that there is to an idiot-proof advanced analysis technique. By contrast, regression, which can be used for similar problems, is frequently used incorrectly by very experienced researchers.

The major limitations of predictive trees are that:

- They are at their best with very large sample sizes. Predictive trees with fewer than a few hundred observations are often not useful. This is because each time the tree splits, the sample is divided among the new groups, so with small sample sizes the trees end up having only one or two variables.
- Predictive trees cannot be used to draw conclusions about which variables are strong and which are weak predictors. For example, looking at the `Single` group, a natural but incorrect interpretation is that age is irrelevant because it is not used to split this group further. This interpretation ignores that being single is correlated with being young. To appreciate this, we can exclude marital status and re-grow the tree. The revised tree is shown below. Note that age is now the first predictor of number of SMS per week and it seems to be similar in predictive accuracy to marital status, with predictions ranging from 2.8 to 22.3, compared to 4.0 to 20.7 for marital status.^{[note 1]}
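
The point about correlated predictors can be illustrated with a small Python sketch. The data below are made up for illustration (they are not from the Mobiles study), and the function names are hypothetical. The sketch measures, for each predictor on its own, the share of the variation in SMS counts that is removed by splitting on that predictor.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def variance_explained(rows, predictor, outcome="sms"):
    """Share of total variation removed by grouping on one predictor."""
    total = sse([r[outcome] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[predictor], []).append(r[outcome])
    within = sum(sse(g) for g in groups.values())
    return 1 - within / total


# Made-up data in which most single people are young and most others are older.
data = (
    [{"single": "yes", "age": "young", "sms": s} for s in (20, 22, 21, 19)]
    + [{"single": "yes", "age": "old", "sms": 18}]
    + [{"single": "no", "age": "young", "sms": 12}]
    + [{"single": "no", "age": "old", "sms": s} for s in (3, 4, 2, 5)]
)

for predictor in ("single", "age"):
    print(predictor, round(variance_explained(data, predictor), 2))
```

In this toy data a tree would split on `single` first, yet `age` on its own still accounts for well over half of the variation. Its absence from part of a tree therefore says little about how strong a predictor it is.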

## Software

Most advanced statistical techniques are reasonably standard, with different programs providing basically the same outputs. This is not the case with predictive trees. Most software companies create their trees in different ways and the distinctions between the trees created by different programs can be large.

All of the common predictive tree programs are technically described as being *recursive divisive* algorithms, which is a fancy way of saying that they:

1. Find the variable that best predicts what is being predicted.
2. Split the sample into groups that are relatively similar in terms of this variable (e.g., in the tree at the top of this page, the groups were defined based on marital status).
3. For each of the new groups, repeat steps 1 and 2, continually splitting each group and further splitting until no more splitting is practical. (In computer science and logic the idea of continually reapplying the same process is called *recursion*.)
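
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm used by any of the programs named on this page: it assumes a numeric outcome and categorical predictors, treats "best predicts" as the largest reduction in the within-group sum of squared errors, forms one group per category, and stops at a minimum group size or a maximum depth. The function names (`sse`, `split_by`, `grow`) and the toy data are made up for illustration.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def split_by(rows, var):
    """Group rows by the category they have on one variable."""
    groups = {}
    for row in rows:
        groups.setdefault(row[var], []).append(row)
    return groups


def grow(rows, outcome, predictors, min_size=2, max_depth=3, depth=0):
    values = [r[outcome] for r in rows]
    node = {"n": len(rows), "mean": mean(values), "split_on": None, "children": {}}
    if depth >= max_depth or len(rows) < 2 * min_size:
        return node  # too deep or too small to split further
    # Step 1: find the predictor whose groups have the smallest total error.
    best_var, best_groups, best_sse = None, None, sse(values)
    for var in predictors:
        groups = split_by(rows, var)
        if len(groups) < 2 or any(len(g) < min_size for g in groups.values()):
            continue
        total = sum(sse([r[outcome] for r in g]) for g in groups.values())
        if total < best_sse:
            best_var, best_groups, best_sse = var, groups, total
    if best_var is None:
        return node  # no split improves on the current group
    # Step 2: split the sample; step 3: repeat within each new group.
    node["split_on"] = best_var
    remaining = [p for p in predictors if p != best_var]
    for level, group in best_groups.items():
        node["children"][level] = grow(
            group, outcome, remaining, min_size, max_depth, depth + 1
        )
    return node


# Made-up toy data, for illustration only.
data = [
    {"single": "yes", "age": "under 45", "sms": 20},
    {"single": "yes", "age": "under 45", "sms": 22},
    {"single": "yes", "age": "45+", "sms": 19},
    {"single": "yes", "age": "45+", "sms": 21},
    {"single": "no", "age": "under 45", "sms": 9},
    {"single": "no", "age": "under 45", "sms": 10},
    {"single": "no", "age": "45+", "sms": 2},
    {"single": "no", "age": "45+", "sms": 3},
]

tree = grow(data, outcome="sms", predictors=["single", "age"])
print(tree["split_on"], tree["mean"])
```

Because this sketch accepts any split that reduces the error at all, it will tend to over-split real data; the stopping rules discussed later in this section exist to prevent that.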

The key differences between the different bits of software are:

- What they can predict. Most programs will only predict a single categorical or a single numeric variable. The major exception to this is Q, which can simultaneously predict multiple numeric, categorical and other exotic data types (e.g., Choice Modeling data).
- How many groups they form each time they do a split. In the examples above, the program (in this case *Q*) tries to optimally work out the number of groups at each *level* of the tree. A similar approach is used by SPSS's *Tree*, *AnswerTree*, *CHAID* and *DataCracker*. The alternative approach is to always split into two groups; this approach is used in CART.
- What rules they use to determine when to stop further splitting groups in the tree. Each of the programs has various ways of doing this, including:
  - Requiring that additional splits improve the predictive accuracy of the tree (different programs implement this in different ways; *CHAID* uses significance tests, *DataCracker* and *Q* use information criteria and CART usually uses *cross-validation*).
  - Specifying a minimum sample size after which a group cannot be further split.
  - Specifying a maximum number of *levels* of the tree.

- The number of options available. *DataCracker*, for example, only lets the user choose between a small and a bigger tree, SPSS's *Tree* lets the user choose between various tree algorithms (e.g., different variants of CHAID and CART), R's `rpart` can be customized in pretty much any way that is desired if the user is skilled enough and *Q* lets the user change the predictor variables available within each group.
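
To make the three kinds of stopping rule concrete, here is a minimal Python sketch of a single check that combines them. It is an assumption-laden illustration, not a description of how any named program works: the accuracy rule uses a generic Gaussian BIC (an information criterion, computed up to a constant), and the function names and default thresholds are hypothetical.

```python
import math
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def bic(error, n, k):
    """Gaussian BIC up to a constant: n * ln(SSE / n) + k * ln(n)."""
    return n * math.log(max(error, 1e-12) / n) + k * math.log(n)


def should_split(parent_values, groups, depth, min_size=5, max_depth=4):
    """Return True only if every stopping rule allows the proposed split."""
    if depth >= max_depth:  # maximum number of levels
        return False
    if any(len(g) < min_size for g in groups):  # minimum sample size
        return False
    n = len(parent_values)
    unsplit = bic(sse(parent_values), n, 1)  # one mean for everyone
    split = bic(sum(sse(g) for g in groups), n, len(groups))  # one mean per group
    return split < unsplit  # the split must improve the criterion


# Made-up values: one split separates two clearly different groups, the other does not.
useful = [[1, 2, 3, 2, 1], [10, 11, 12, 11, 10]]
useless = [[1, 2, 3, 2, 1], [2, 1, 3, 2, 1]]
print(should_split(useful[0] + useful[1], useful, depth=0))
print(should_split(useless[0] + useless[1], useless, depth=0))
```

The penalty term `k * ln(n)` is what lets a criterion of this kind reject splits that reduce the error only slightly: each extra group has to pay for itself.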

## Notes

1. Indeed, it may look as though age is a better predictor, as it has a larger range. However, in this case occupation is preferred because it has fewer categories with broadly similar predictive accuracy; this is an example of the principle of *parsimony* being taken into account when creating a model.