Friday, September 19, 2014

Decision Tree (Data Mining)

      We can offer each customer a tailored service by using customer relationship management (CRM) to discover his or her behavior in detail. In CRM, decision trees can be used to classify existing customer records into customer segments. The process starts with data whose class is already known; for example, customers who have responded to a promotional campaign and those who have not. C4.5 is an algorithm developed by Ross Quinlan that generates decision trees. It handles both continuous and discrete attributes, as well as missing values, and prunes the tree after construction.

Algorithm C4.5(D)
Input: an attribute-valued dataset D
  Tree = {}
  if D is "pure" OR other stopping criteria are met then
      terminate
  end if
  for all attributes a ∈ D do
      compute information-theoretic criteria if we split on a
  end for
  a_best = best attribute according to the criteria computed above
  Tree = create a decision node that tests a_best in the root
  D_v = induced sub-datasets from D based on a_best
  for all D_v do
      Tree_v = C4.5(D_v)
      attach Tree_v to the corresponding branch of Tree
  end for
  return Tree
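
As a concrete illustration, here is a minimal Python sketch of the recursive procedure above, for discrete attributes only; it skips continuous attributes, missing values, and pruning, it always picks the attribute with the largest gain ratio (real C4.5 only considers attributes with above-average information gain), and all names are mine rather than Quinlan's:

import math
from collections import Counter

def entropy(labels):
    # Entropy (in bits) of a list of class labels.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    # Information gain of splitting on attr, divided by SplitINFO.
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    after = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - after
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0

def c45(rows, labels, attrs):
    # Returns a class label (leaf) or a (best_attribute, branches) pair.
    if len(set(labels)) == 1 or not attrs:            # "pure" or no attributes left
        return Counter(labels).most_common(1)[0][0]   # majority-class leaf
    a_best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    branches = {}
    for v in {row[a_best] for row in rows}:           # induce sub-datasets D_v
        sub = [(r, l) for r, l in zip(rows, labels) if r[a_best] == v]
        sub_rows, sub_labels = [r for r, _ in sub], [l for _, l in sub]
        branches[v] = c45(sub_rows, sub_labels, [a for a in attrs if a != a_best])
    return (a_best, branches)

Calling c45 on the worked example below grows essentially the tree derived by hand in the steps that follow (up to tie-breaking at the final split).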

Example
Gender | Age   | Salary      | Status  | Number of times travel per year | Interest in travel promotion
Male   | 21-40 | 30001-50000 | Single  | 1-3 | No
Female | 21-40 | 10001-30000 | Married | 4-5 | Yes
Female | > 60  | 10001-30000 | Single  | 1-3 | No
Female | 41-60 | 30001-50000 | Married | 4-5 | Yes
Male   | 41-60 | > 50001     | Single  | 1-3 | Yes
Female | 21-40 | > 50001     | Single  | 1-3 | Yes
Male   | 41-60 | 30001-50000 | Married | 4-5 | Yes
Male   | > 60  | 10001-30000 | Single  | 1-3 | No
Female | 41-60 | 10001-30000 | Married | 1-3 | No
Male   | > 60  | 30001-50000 | Married | 1-3 | Yes

a) Entropy using the frequency table of one attribute:
Interest = Yes appears in 6 records; No appears in 4.
E(travel promotion) = E(4/10, 6/10) = -0.4 log2(0.4) - 0.6 log2(0.6) = 0.971
E(Original) = 0.971
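
This value is easy to check in Python (a throwaway verification snippet, not part of any library):

import math

# Entropy of the target attribute: 4 No and 6 Yes out of 10 records.
E = -(0.4 * math.log2(0.4)) - (0.6 * math.log2(0.6))
print(round(E, 3))   # 0.971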

b) Entropy using the frequency table of two attributes:
Gender | Yes | No
Male   | 3   | 2
Female | 3   | 2
E(Interest, Gender) = P(Male)*E(3/5, 2/5) + P(Female)*E(3/5, 2/5)
= 5/10*(-0.6 log2(0.6) - 0.4 log2(0.4)) + 5/10*(-0.6 log2(0.6) - 0.4 log2(0.4)) = 0.971

c) Information gain: the entropy before the split minus the entropy after it.
Gain(Original, Gender) = 0.971 - 0.971 = 0

d) SplitINFO: the entropy of the partition itself.
SplitINFO(Interest, Gender) = -(5/10) log2(5/10) - (5/10) log2(5/10) = 1

e) GainRATIO: the information gain divided by SplitINFO.
GainRATIO(Interest, Gender) = 0/1 = 0
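
All three quantities for the Gender split can be verified in a few lines of Python (E2 is just a helper name of mine):

import math

def E2(p, q):
    # Two-class entropy; 0 * log2(0) is taken as 0.
    return -sum(x * math.log2(x) for x in (p, q) if x > 0)

e_before = E2(0.6, 0.4)                             # 0.971
e_after = 0.5 * E2(0.6, 0.4) + 0.5 * E2(0.6, 0.4)   # 0.971, entropy after the Gender split
gain = e_before - e_after                           # 0.0
split_info = E2(0.5, 0.5)                           # 1.0
print(gain / split_info)                            # 0.0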


C4.5 uses the "Gain Ratio" measure: the information gain divided by SplitINFO is called the GainRATIO.

Step 1: Calculate the entropy of the target attribute.
Step 2: Split the dataset on each attribute, and calculate the GainRATIO of each attribute.
Step 3: Choose the attribute with the largest GainRATIO as the decision node.
Step 4: Run the C4.5 algorithm recursively on the non-leaf branches, until all data is classified.

Step 1:
     E(travel promotion) = E(4/10, 6/10) = -0.4 log2(0.4) - 0.6 log2(0.6) = 0.971
Step 2:

Gender | Yes | No
Male   | 3   | 2
Female | 3   | 2
E(Interest, Gender) = P(Male)*E(3/5, 2/5) + P(Female)*E(3/5, 2/5)
= 5/10*(-0.6 log2(0.6) - 0.4 log2(0.4)) + 5/10*(-0.6 log2(0.6) - 0.4 log2(0.4)) = 0.971
Gain(Original, Gender) = 0.971 - 0.971 = 0
SplitINFO(Interest, Gender) = -(5/10) log2(5/10) - (5/10) log2(5/10) = 1
GainRATIO(Interest, Gender) = 0/1 = 0

Age   | Yes | No
21-40 | 2   | 1
41-60 | 3   | 1
> 60  | 1   | 2
E(Interest, Age) = P(21-40)*E(2/3, 1/3) + P(41-60)*E(3/4, 1/4) + P(> 60)*E(1/3, 2/3)
= 3/10*0.9183 + 4/10*0.8113 + 3/10*0.9183 = 0.876
Gain(Original, Age) = 0.971 - 0.876 = 0.095
SplitINFO(Interest, Age) = -(3/10) log2(3/10) - (4/10) log2(4/10) - (3/10) log2(3/10) = 1.571
GainRATIO(Interest, Age) = 0.095/1.571 = 0.06

Salary      | Yes | No
10001-30000 | 1   | 3
30001-50000 | 3   | 1
> 50001     | 2   | 0
E(Interest, Salary) = P(10001-30000)*E(1/4, 3/4) + P(30001-50000)*E(3/4, 1/4) + P(> 50001)*E(2/2, 0/2)
= 4/10*0.811 + 4/10*0.811 + 2/10*0 = 0.649
Gain(Original, Salary) = 0.971 - 0.649 = 0.322
SplitINFO(Interest, Salary) = -(4/10) log2(4/10) - (4/10) log2(4/10) - (2/10) log2(2/10) = 1.522
GainRATIO(Interest, Salary) = 0.322/1.522 = 0.212

Status  | Yes | No
Married | 4   | 1
Single  | 2   | 3
E(Interest, Status) = P(Married)*E(4/5, 1/5) + P(Single)*E(2/5, 3/5)
= 5/10*0.722 + 5/10*0.971 = 0.847
Gain(Original, Status) = 0.971 - 0.847 = 0.124
SplitINFO(Interest, Status) = -(5/10) log2(5/10) - (5/10) log2(5/10) = 1
GainRATIO(Interest, Status) = 0.124/1 = 0.124

Number of times travel per year | Yes | No
1-3 | 3 | 4
4-5 | 3 | 0
E(Interest, travel per year) = P(1-3)*E(3/7, 4/7) + P(4-5)*E(3/3, 0/3)
= 7/10*0.985 + 3/10*0 = 0.69
Gain(Original, travel per year) = 0.971 - 0.69 = 0.281
SplitINFO(Interest, travel per year) = -(7/10) log2(7/10) - (3/10) log2(3/10) = 0.881
GainRATIO(Interest, travel per year) = 0.281/0.881 = 0.319
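
The whole of Step 2 can be checked at once by encoding the table and reusing the entropy/gain_ratio helpers from the sketch after the pseudocode (the encoding and names here are mine):

# The ten training records from the table above, plus their class labels.
ROWS = [
    {"gender": "M", "age": "21-40", "salary": "30001-50000", "status": "single",  "travel": "1-3"},
    {"gender": "F", "age": "21-40", "salary": "10001-30000", "status": "married", "travel": "4-5"},
    {"gender": "F", "age": ">60",   "salary": "10001-30000", "status": "single",  "travel": "1-3"},
    {"gender": "F", "age": "41-60", "salary": "30001-50000", "status": "married", "travel": "4-5"},
    {"gender": "M", "age": "41-60", "salary": ">50001",      "status": "single",  "travel": "1-3"},
    {"gender": "F", "age": "21-40", "salary": ">50001",      "status": "single",  "travel": "1-3"},
    {"gender": "M", "age": "41-60", "salary": "30001-50000", "status": "married", "travel": "4-5"},
    {"gender": "M", "age": ">60",   "salary": "10001-30000", "status": "single",  "travel": "1-3"},
    {"gender": "F", "age": "41-60", "salary": "10001-30000", "status": "married", "travel": "1-3"},
    {"gender": "M", "age": ">60",   "salary": "30001-50000", "status": "married", "travel": "1-3"},
]
LABELS = ["No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes"]

for a in ("gender", "age", "salary", "status", "travel"):
    print(a, round(gain_ratio(ROWS, LABELS, a), 3))
# gender ~0.0, age 0.061, salary 0.212, status 0.125, travel 0.319
# (tiny differences from the hand calculations are rounding only;
#  travel per year has the largest gain ratio, as in Step 3 below)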
     
Step 3: The GainRATIO of number of times travel per year is the largest (0.319), so number of times travel per year becomes the root node.

Step 4: Next we consider only the records where number of times travel per year is 1-3; the 4-5 branch is pure (all Yes), so it becomes a leaf.

Gender | Age   | Salary      | Status  | Interest in travel promotion
Male   | 21-40 | 30001-50000 | Single  | No
Female | > 60  | 10001-30000 | Single  | No
Male   | 41-60 | > 50001     | Single  | Yes
Female | 21-40 | > 50001     | Single  | Yes
Male   | > 60  | 10001-30000 | Single  | No
Female | 41-60 | 10001-30000 | Married | No
Male   | > 60  | 30001-50000 | Married | Yes
Step 5: Calculate the GainRATIO of each remaining attribute on the records with travel per year = 1-3.

E(Interest[1-3]) = E(3/7, 4/7) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985

Gender | Yes | No
Male   | 2   | 2
Female | 1   | 2
E(Interest[1-3], Gender) = P(Male)*E(2/4, 2/4) + P(Female)*E(1/3, 2/3)
= 4/7*1 + 3/7*0.918 = 0.965
Gain(Original[1-3], Gender) = 0.985 - 0.965 = 0.02
SplitINFO(Interest[1-3], Gender) = -(4/7) log2(4/7) - (3/7) log2(3/7) = 0.985
GainRATIO(Interest[1-3], Gender) = 0.02/0.985 = 0.02

Age   | Yes | No
21-40 | 1   | 1
41-60 | 1   | 1
> 60  | 1   | 2
E(Interest[1-3], Age) = P(21-40)*E(1/2, 1/2) + P(41-60)*E(1/2, 1/2) + P(> 60)*E(1/3, 2/3)
= 2/7*1 + 2/7*1 + 3/7*0.918 = 0.965
Gain(Original[1-3], Age) = 0.985 - 0.965 = 0.02
SplitINFO(Interest[1-3], Age) = -(2/7) log2(2/7) - (2/7) log2(2/7) - (3/7) log2(3/7) = 1.556
GainRATIO(Interest[1-3], Age) = 0.02/1.556 = 0.013

Salary      | Yes | No
10001-30000 | 0   | 3
30001-50000 | 1   | 1
> 50001     | 2   | 0
E(Interest[1-3], Salary) = P(10001-30000)*E(0/3, 3/3) + P(30001-50000)*E(1/2, 1/2) + P(> 50001)*E(2/2, 0/2)
= 3/7*0 + 2/7*1 + 2/7*0 = 0.286
Gain(Original[1-3], Salary) = 0.985 - 0.286 = 0.699
SplitINFO(Interest[1-3], Salary) = -(3/7) log2(3/7) - (2/7) log2(2/7) - (2/7) log2(2/7) = 1.556
GainRATIO(Interest[1-3], Salary) = 0.699/1.556 = 0.449

Status  | Yes | No
Married | 1   | 1
Single  | 2   | 3
E(Interest[1-3], Status) = P(Married)*E(1/2, 1/2) + P(Single)*E(2/5, 3/5)
= 2/7*1 + 5/7*0.971 = 0.979
Gain(Original[1-3], Status) = 0.985 - 0.979 = 0.006
SplitINFO(Interest[1-3], Status) = -(2/7) log2(2/7) - (5/7) log2(5/7) = 0.863
GainRATIO(Interest[1-3], Status) = 0.006/0.863 = 0.007
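
Filtering the ROWS/LABELS encoding from Step 2 down to the travel = 1-3 subset and rerunning the same gain_ratio helper reproduces these numbers:

idx = [i for i, r in enumerate(ROWS) if r["travel"] == "1-3"]
sub_rows = [ROWS[i] for i in idx]
sub_labels = [LABELS[i] for i in idx]
for a in ("gender", "age", "salary", "status"):
    print(a, round(gain_ratio(sub_rows, sub_labels, a), 3))
# gender 0.021, age 0.013, salary 0.449, status 0.007 -> salary wins, as in Step 6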

Step 6: The GainRATIO of salary is the largest (0.449), so salary becomes the child node under the travel per year = 1-3 branch.

Step 7: We now consider only the records with travel per year = 1-3 and salary = 30001-50000; the 10001-30000 branch is pure No and the > 50001 branch is pure Yes, so both become leaves.

Gender | Age   | Status  | Interest in travel promotion
Male   | 21-40 | Single  | No
Male   | > 60  | Married | Yes
Step 8: Calculate the GainRATIO of each remaining attribute on this two-record subset (travel per year = 1-3 and salary = 30001-50000).

E(Interest[1-3][30001-50000]) = E(1/2, 1/2) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1

Gender | Yes | No
Male   | 1   | 1
Female | 0   | 0
E(Interest[1-3][30001-50000], Gender) = P(Male)*E(1/2, 1/2) + P(Female)*E(0, 0) = 2/2*1 + 0 = 1
Gain(Original[1-3][30001-50000], Gender) = 1 - 1 = 0
SplitINFO(Interest[1-3][30001-50000], Gender) = -(2/2) log2(2/2) = 0
GainRATIO(Interest[1-3][30001-50000], Gender) = 0 (the split yields no gain, so Gender is not a useful split here)

Age   | Yes | No
21-40 | 0   | 1
41-60 | 0   | 0
> 60  | 1   | 0
E(Interest[1-3][30001-50000], Age) = P(21-40)*E(0, 1) + P(41-60)*E(0, 0) + P(> 60)*E(1, 0) = 0
Gain(Original[1-3][30001-50000], Age) = 1 - 0 = 1
SplitINFO(Interest[1-3][30001-50000], Age) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
GainRATIO(Interest[1-3][30001-50000], Age) = 1/1 = 1

Status  | Yes | No
Married | 1   | 0
Single  | 0   | 1
E(Interest[1-3][30001-50000], Status) = P(Married)*E(1, 0) + P(Single)*E(0, 1) = 0
Gain(Original[1-3][30001-50000], Status) = 1 - 0 = 1
SplitINFO(Interest[1-3][30001-50000], Status) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
GainRATIO(Interest[1-3][30001-50000], Status) = 1/1 = 1

The GainRATIO of age and status are equal, so we can use either one as the child node of salary, but we should select the attribute with the fewest distinct values (status has two, age has three). Therefore we get a decision tree like this:

Number of times travel per year?
  4-5 -> Yes
  1-3 -> Salary?
    10001-30000 -> No
    > 50001 -> Yes
    30001-50000 -> Status?
      Married -> Yes
      Single -> No

After the decision tree has been fully constructed, we can use post-pruning to discard its unreliable parts.

f is the error rate on the training data at a node.
N is the number of instances covered by the leaf.
z is the deviate from the normal distribution for the chosen confidence level (C4.5's default confidence of 25% corresponds to z = 0.69).
e is the resulting error estimate for a node:

e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)

If the (weighted) error estimate of the children is greater than that of their parent node, we do not keep the children: the subtree is pruned back to a leaf.
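
A minimal Python sketch of this standard pessimistic estimate (the function name is mine):

import math

def pessimistic_error(f, N, z=0.69):
    # Upper confidence bound on a node's error rate; z = 0.69 corresponds
    # to C4.5's default 25% confidence level.
    return (f + z**2 / (2 * N)
            + z * math.sqrt(f / N - f**2 / N + z**2 / (4 * N**2))) / (1 + z**2 / N)

# A subtree is pruned when the weighted error estimate of its children
# exceeds the estimate for the parent treated as a single leaf, e.g.:
# children = [(f1, n1), (f2, n2)]   # (error rate, instance count) per child
# prune if sum(n/N * pessimistic_error(f, n) for f, n in children) \
#          > pessimistic_error(f_parent, N)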

The advantages of C4.5 are:
  • Builds models that can be easily interpreted and explained to executives.
  • Easy to apply: the decision tree is constructed once and can then classify any number of test records.
  • Can use both categorical and continuous values.
  • Deals with noise.
The disadvantages are:
  • A small variation in the data can lead to different decision trees (especially when the variables are close to each other in value).
  • Does not work very well on a small training set.
  • If the tree was initially constructed from a small training set, it cannot classify test sets accurately.
Solution 1:
When a test record is classified by the decision tree and gets Yes in the class "interest travel promotion", the customer should be interested in the new promotion, so we send an e-mail to him or her. When such customers reject the offer, the training set did not handle those cases correctly. If the rejection ratio (incorrect predictions / correct predictions) grows greater than a threshold, we combine the test records with the old training set and construct the decision tree again, ready to predict new test records.
Example
If we define threshold = 0.3 and (incorrect predictions)/(correct predictions) becomes greater than 0.3, we reconstruct the decision tree immediately.
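
A minimal sketch of this retraining trigger (every name here is hypothetical, just to make the rule concrete):

def should_retrain(incorrect, correct, threshold=0.3):
    # Trigger tree reconstruction when the rejection ratio passes the threshold.
    if correct == 0:
        return incorrect > 0
    return incorrect / correct > threshold

# e.g. 4 rejections against 10 accepted offers: 4/10 > 0.3, so rebuild.
# if should_retrain(4, 10):
#     rows, labels = old_rows + new_rows, old_labels + new_labels
#     tree = c45(rows, labels, attrs)   # c45 from the sketch above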

Solution 2:
Use C5.0 instead of C4.5. The advantages of C5.0 are:
  • Multi-threading: it can take advantage of multiple CPUs or cores, which speeds up construction of the decision tree.
  • Boosting: a technique for generating and combining multiple classifiers to improve predictive accuracy.
  • The generated decision trees are generally smaller.
