Friday, September 19, 2014

Decision Tree (Data Mining)

      We can offer each customer a tailored service by using customer relationship management (CRM) to discover his or her behavior in detail. In CRM, decision trees can be used to classify existing customer records into customer segments. The process starts with data whose class is already known; for example, customers who have responded to a promotional campaign and those who have not. C4.5 is an algorithm developed by Ross Quinlan that generates decision trees. It handles both continuous and discrete attributes, as well as missing values, and prunes the tree after construction.

Algorithm C4.5(D)
Input: an attribute-valued dataset D
  Tree = {}
  if D is "pure" OR other stopping criteria are met then
      terminate
  end if
  for all attributes a ∈ D do
      compute information-theoretic criteria if we split on a
  end for
  a_best = best attribute according to the criteria computed above
  Tree = create a decision node that tests a_best in the root
  D_v = induced sub-datasets from D based on a_best
  for all D_v do
      Tree_v = C4.5(D_v)
      attach Tree_v to the corresponding branch of Tree
  end for
  return Tree
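
As a concrete illustration, here is a minimal Python sketch of the recursive procedure above, for discrete attributes only; it skips continuous attributes, missing values, and pruning, it always picks the attribute with the largest gain ratio (real C4.5 only considers attributes with above-average information gain), and all names are mine rather than Quinlan's:

import math
from collections import Counter

def entropy(labels):
    # Entropy (in bits) of a list of class labels.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    # Information gain of splitting on attr, divided by SplitINFO.
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    after = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - after
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0

def c45(rows, labels, attrs):
    # Returns a class label (leaf) or a (best_attribute, branches) pair.
    if len(set(labels)) == 1 or not attrs:            # "pure" or no attributes left
        return Counter(labels).most_common(1)[0][0]   # majority-class leaf
    a_best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    branches = {}
    for v in {row[a_best] for row in rows}:           # induce sub-datasets D_v
        sub = [(r, l) for r, l in zip(rows, labels) if r[a_best] == v]
        sub_rows, sub_labels = [r for r, _ in sub], [l for _, l in sub]
        branches[v] = c45(sub_rows, sub_labels, [a for a in attrs if a != a_best])
    return (a_best, branches)

Calling c45 on the worked example below grows essentially the tree derived by hand in the steps that follow (up to tie-breaking at the final split).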

Example
Gender | Age   | Salary      | Status  | Number of times travel per year | Interest in travel promotion
Male   | 21-40 | 30001-50000 | Single  | 1-3 | No
Female | 21-40 | 10001-30000 | Married | 4-5 | Yes
Female | > 60  | 10001-30000 | Single  | 1-3 | No
Female | 41-60 | 30001-50000 | Married | 4-5 | Yes
Male   | 41-60 | > 50001     | Single  | 1-3 | Yes
Female | 21-40 | > 50001     | Single  | 1-3 | Yes
Male   | 41-60 | 30001-50000 | Married | 4-5 | Yes
Male   | > 60  | 10001-30000 | Single  | 1-3 | No
Female | 41-60 | 10001-30000 | Married | 1-3 | No
Male   | > 60  | 30001-50000 | Married | 1-3 | Yes

a) Entropy using the frequency table of one attribute:
Interest = Yes appears in 6 records; No appears in 4.
E(travel promotion) = E(4/10, 6/10) = -0.4 log2(0.4) - 0.6 log2(0.6) = 0.971
E(Original) = 0.971
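
This value is easy to check in Python (a throwaway verification snippet, not part of any library):

import math

# Entropy of the target attribute: 4 No and 6 Yes out of 10 records.
E = -(0.4 * math.log2(0.4)) - (0.6 * math.log2(0.6))
print(round(E, 3))   # 0.971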

b) Entropy using the frequency table of two attributes:
Gender | Yes | No
Male   | 3   | 2
Female | 3   | 2
E(Interest, Gender) = P(Male)*E(3/5, 2/5) + P(Female)*E(3/5, 2/5)
= 5/10*(-0.6 log2(0.6) - 0.4 log2(0.4)) + 5/10*(-0.6 log2(0.6) - 0.4 log2(0.4)) = 0.971

c) Information gain: the entropy before the split minus the entropy after it.
Gain(Original, Gender) = 0.971 - 0.971 = 0

d) SplitINFO: the entropy of the partition itself.
SplitINFO(Interest, Gender) = -(5/10) log2(5/10) - (5/10) log2(5/10) = 1

e) GainRATIO: the information gain divided by SplitINFO.
GainRATIO(Interest, Gender) = 0/1 = 0
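
All three quantities for the Gender split can be verified in a few lines of Python (E2 is just a helper name of mine):

import math

def E2(p, q):
    # Two-class entropy; 0 * log2(0) is taken as 0.
    return -sum(x * math.log2(x) for x in (p, q) if x > 0)

e_before = E2(0.6, 0.4)                             # 0.971
e_after = 0.5 * E2(0.6, 0.4) + 0.5 * E2(0.6, 0.4)   # 0.971, entropy after the Gender split
gain = e_before - e_after                           # 0.0
split_info = E2(0.5, 0.5)                           # 1.0
print(gain / split_info)                            # 0.0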


C4.5 uses the "Gain Ratio" measure: the information gain divided by SplitINFO is called the GainRATIO.

Step 1: Calculate the entropy of the target attribute.
Step 2: Split the dataset on each attribute, and calculate the GainRATIO of each attribute.
Step 3: Choose the attribute with the largest GainRATIO as the decision node.
Step 4: Run the C4.5 algorithm recursively on the non-leaf branches, until all data is classified.

Step 1:
     E(travel promotion) = E(4/10, 6/10) = -0.4 log2(0.4) - 0.6 log2(0.6) = 0.971
Step 2:

Gender | Yes | No
Male   | 3   | 2
Female | 3   | 2
E(Interest, Gender) = P(Male)*E(3/5, 2/5) + P(Female)*E(3/5, 2/5)
= 5/10*(-0.6 log2(0.6) - 0.4 log2(0.4)) + 5/10*(-0.6 log2(0.6) - 0.4 log2(0.4)) = 0.971
Gain(Original, Gender) = 0.971 - 0.971 = 0
SplitINFO(Interest, Gender) = -(5/10) log2(5/10) - (5/10) log2(5/10) = 1
GainRATIO(Interest, Gender) = 0/1 = 0

Age   | Yes | No
21-40 | 2   | 1
41-60 | 3   | 1
> 60  | 1   | 2
E(Interest, Age) = P(21-40)*E(2/3, 1/3) + P(41-60)*E(3/4, 1/4) + P(> 60)*E(1/3, 2/3)
= 3/10*0.9183 + 4/10*0.8113 + 3/10*0.9183 = 0.876
Gain(Original, Age) = 0.971 - 0.876 = 0.095
SplitINFO(Interest, Age) = -(3/10) log2(3/10) - (4/10) log2(4/10) - (3/10) log2(3/10) = 1.571
GainRATIO(Interest, Age) = 0.095/1.571 = 0.06

Salary      | Yes | No
10001-30000 | 1   | 3
30001-50000 | 3   | 1
> 50001     | 2   | 0
E(Interest, Salary) = P(10001-30000)*E(1/4, 3/4) + P(30001-50000)*E(3/4, 1/4) + P(> 50001)*E(2/2, 0/2)
= 4/10*0.811 + 4/10*0.811 + 2/10*0 = 0.649
Gain(Original, Salary) = 0.971 - 0.649 = 0.322
SplitINFO(Interest, Salary) = -(4/10) log2(4/10) - (4/10) log2(4/10) - (2/10) log2(2/10) = 1.522
GainRATIO(Interest, Salary) = 0.322/1.522 = 0.212

Status  | Yes | No
Married | 4   | 1
Single  | 2   | 3
E(Interest, Status) = P(Married)*E(4/5, 1/5) + P(Single)*E(2/5, 3/5)
= 5/10*0.722 + 5/10*0.971 = 0.847
Gain(Original, Status) = 0.971 - 0.847 = 0.124
SplitINFO(Interest, Status) = -(5/10) log2(5/10) - (5/10) log2(5/10) = 1
GainRATIO(Interest, Status) = 0.124/1 = 0.124

Number of times travel per year | Yes | No
1-3 | 3 | 4
4-5 | 3 | 0
E(Interest, travel per year) = P(1-3)*E(3/7, 4/7) + P(4-5)*E(3/3, 0/3)
= 7/10*0.985 + 3/10*0 = 0.69
Gain(Original, travel per year) = 0.971 - 0.69 = 0.281
SplitINFO(Interest, travel per year) = -(7/10) log2(7/10) - (3/10) log2(3/10) = 0.881
GainRATIO(Interest, travel per year) = 0.281/0.881 = 0.319
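
The whole of Step 2 can be checked at once by encoding the table and reusing the entropy/gain_ratio helpers from the sketch after the pseudocode (the encoding and names here are mine):

# The ten training records from the table above, plus their class labels.
ROWS = [
    {"gender": "M", "age": "21-40", "salary": "30001-50000", "status": "single",  "travel": "1-3"},
    {"gender": "F", "age": "21-40", "salary": "10001-30000", "status": "married", "travel": "4-5"},
    {"gender": "F", "age": ">60",   "salary": "10001-30000", "status": "single",  "travel": "1-3"},
    {"gender": "F", "age": "41-60", "salary": "30001-50000", "status": "married", "travel": "4-5"},
    {"gender": "M", "age": "41-60", "salary": ">50001",      "status": "single",  "travel": "1-3"},
    {"gender": "F", "age": "21-40", "salary": ">50001",      "status": "single",  "travel": "1-3"},
    {"gender": "M", "age": "41-60", "salary": "30001-50000", "status": "married", "travel": "4-5"},
    {"gender": "M", "age": ">60",   "salary": "10001-30000", "status": "single",  "travel": "1-3"},
    {"gender": "F", "age": "41-60", "salary": "10001-30000", "status": "married", "travel": "1-3"},
    {"gender": "M", "age": ">60",   "salary": "30001-50000", "status": "married", "travel": "1-3"},
]
LABELS = ["No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No", "No", "Yes"]

for a in ("gender", "age", "salary", "status", "travel"):
    print(a, round(gain_ratio(ROWS, LABELS, a), 3))
# gender ~0.0, age 0.061, salary 0.212, status 0.125, travel 0.319
# (tiny differences from the hand calculations are rounding only;
#  travel per year has the largest gain ratio, as in Step 3 below)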
     
Step 3: The GainRATIO of number of times travel per year is the largest (0.319), so number of times travel per year becomes the root node.

Step 4: Next we consider only the records where number of times travel per year is 1-3; the 4-5 branch is pure (all Yes), so it becomes a leaf.

Gender | Age   | Salary      | Status  | Interest in travel promotion
Male   | 21-40 | 30001-50000 | Single  | No
Female | > 60  | 10001-30000 | Single  | No
Male   | 41-60 | > 50001     | Single  | Yes
Female | 21-40 | > 50001     | Single  | Yes
Male   | > 60  | 10001-30000 | Single  | No
Female | 41-60 | 10001-30000 | Married | No
Male   | > 60  | 30001-50000 | Married | Yes
Step 5: Calculate the GainRATIO of each remaining attribute on the records with travel per year = 1-3.

E(Interest[1-3]) = E(3/7, 4/7) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.985

Gender | Yes | No
Male   | 2   | 2
Female | 1   | 2
E(Interest[1-3], Gender) = P(Male)*E(2/4, 2/4) + P(Female)*E(1/3, 2/3)
= 4/7*1 + 3/7*0.918 = 0.965
Gain(Original[1-3], Gender) = 0.985 - 0.965 = 0.02
SplitINFO(Interest[1-3], Gender) = -(4/7) log2(4/7) - (3/7) log2(3/7) = 0.985
GainRATIO(Interest[1-3], Gender) = 0.02/0.985 = 0.02

Age   | Yes | No
21-40 | 1   | 1
41-60 | 1   | 1
> 60  | 1   | 2
E(Interest[1-3], Age) = P(21-40)*E(1/2, 1/2) + P(41-60)*E(1/2, 1/2) + P(> 60)*E(1/3, 2/3)
= 2/7*1 + 2/7*1 + 3/7*0.918 = 0.965
Gain(Original[1-3], Age) = 0.985 - 0.965 = 0.02
SplitINFO(Interest[1-3], Age) = -(2/7) log2(2/7) - (2/7) log2(2/7) - (3/7) log2(3/7) = 1.556
GainRATIO(Interest[1-3], Age) = 0.02/1.556 = 0.013

Salary      | Yes | No
10001-30000 | 0   | 3
30001-50000 | 1   | 1
> 50001     | 2   | 0
E(Interest[1-3], Salary) = P(10001-30000)*E(0/3, 3/3) + P(30001-50000)*E(1/2, 1/2) + P(> 50001)*E(2/2, 0/2)
= 3/7*0 + 2/7*1 + 2/7*0 = 0.286
Gain(Original[1-3], Salary) = 0.985 - 0.286 = 0.699
SplitINFO(Interest[1-3], Salary) = -(3/7) log2(3/7) - (2/7) log2(2/7) - (2/7) log2(2/7) = 1.556
GainRATIO(Interest[1-3], Salary) = 0.699/1.556 = 0.449

Status  | Yes | No
Married | 1   | 1
Single  | 2   | 3
E(Interest[1-3], Status) = P(Married)*E(1/2, 1/2) + P(Single)*E(2/5, 3/5)
= 2/7*1 + 5/7*0.971 = 0.979
Gain(Original[1-3], Status) = 0.985 - 0.979 = 0.006
SplitINFO(Interest[1-3], Status) = -(2/7) log2(2/7) - (5/7) log2(5/7) = 0.863
GainRATIO(Interest[1-3], Status) = 0.006/0.863 = 0.007
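
Filtering the ROWS/LABELS encoding from Step 2 down to the travel = 1-3 subset and rerunning the same gain_ratio helper reproduces these numbers:

idx = [i for i, r in enumerate(ROWS) if r["travel"] == "1-3"]
sub_rows = [ROWS[i] for i in idx]
sub_labels = [LABELS[i] for i in idx]
for a in ("gender", "age", "salary", "status"):
    print(a, round(gain_ratio(sub_rows, sub_labels, a), 3))
# gender 0.021, age 0.013, salary 0.449, status 0.007 -> salary wins, as in Step 6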

Step 6: The GainRATIO of salary is the largest (0.449), so salary becomes the child node under the travel per year = 1-3 branch.

Step 7: We now consider only the records with travel per year = 1-3 and salary = 30001-50000; the 10001-30000 branch is pure No and the > 50001 branch is pure Yes, so both become leaves.

Gender | Age   | Status  | Interest in travel promotion
Male   | 21-40 | Single  | No
Male   | > 60  | Married | Yes
Step 8: Calculate the GainRATIO of each remaining attribute on this two-record subset (travel per year = 1-3 and salary = 30001-50000).

E(Interest[1-3][30001-50000]) = E(1/2, 1/2) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1

Gender | Yes | No
Male   | 1   | 1
Female | 0   | 0
E(Interest[1-3][30001-50000], Gender) = P(Male)*E(1/2, 1/2) + P(Female)*E(0, 0) = 2/2*1 + 0 = 1
Gain(Original[1-3][30001-50000], Gender) = 1 - 1 = 0
SplitINFO(Interest[1-3][30001-50000], Gender) = -(2/2) log2(2/2) = 0
GainRATIO(Interest[1-3][30001-50000], Gender) = 0 (the split yields no gain, so Gender is not a useful split here)

Age   | Yes | No
21-40 | 0   | 1
41-60 | 0   | 0
> 60  | 1   | 0
E(Interest[1-3][30001-50000], Age) = P(21-40)*E(0, 1) + P(41-60)*E(0, 0) + P(> 60)*E(1, 0) = 0
Gain(Original[1-3][30001-50000], Age) = 1 - 0 = 1
SplitINFO(Interest[1-3][30001-50000], Age) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
GainRATIO(Interest[1-3][30001-50000], Age) = 1/1 = 1

Status  | Yes | No
Married | 1   | 0
Single  | 0   | 1
E(Interest[1-3][30001-50000], Status) = P(Married)*E(1, 0) + P(Single)*E(0, 1) = 0
Gain(Original[1-3][30001-50000], Status) = 1 - 0 = 1
SplitINFO(Interest[1-3][30001-50000], Status) = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
GainRATIO(Interest[1-3][30001-50000], Status) = 1/1 = 1

The GainRATIO of age and status are equal, so we can use either one as the child node of salary, but we should select the attribute with the fewest distinct values (status has two, age has three). Therefore we get a decision tree like this:

Number of times travel per year?
  4-5 -> Yes
  1-3 -> Salary?
    10001-30000 -> No
    > 50001 -> Yes
    30001-50000 -> Status?
      Married -> Yes
      Single -> No

After the decision tree has been fully constructed, we can use post-pruning to discard its unreliable parts.

f is the error rate on the training data at a node.
N is the number of instances covered by the leaf.
z is the deviate from the normal distribution for the chosen confidence level (C4.5's default confidence of 25% corresponds to z = 0.69).
e is the resulting error estimate for a node:

e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)

If the (weighted) error estimate of the children is greater than that of their parent node, we do not keep the children: the subtree is pruned back to a leaf.
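
A minimal Python sketch of this standard pessimistic estimate (the function name is mine):

import math

def pessimistic_error(f, N, z=0.69):
    # Upper confidence bound on a node's error rate; z = 0.69 corresponds
    # to C4.5's default 25% confidence level.
    return (f + z**2 / (2 * N)
            + z * math.sqrt(f / N - f**2 / N + z**2 / (4 * N**2))) / (1 + z**2 / N)

# A subtree is pruned when the weighted error estimate of its children
# exceeds the estimate for the parent treated as a single leaf, e.g.:
# children = [(f1, n1), (f2, n2)]   # (error rate, instance count) per child
# prune if sum(n/N * pessimistic_error(f, n) for f, n in children) \
#          > pessimistic_error(f_parent, N)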

The advantages of C4.5 are:
  • Builds models that can be easily interpreted and explained to executives.
  • Easy to apply: the decision tree is constructed once and can then classify any number of test records.
  • Can use both categorical and continuous values.
  • Deals with noise.
The disadvantages are:
  • A small variation in the data can lead to different decision trees (especially when the variables are close to each other in value).
  • Does not work very well on a small training set.
  • If the tree was initially constructed from a small training set, it cannot classify test sets accurately.
Solution 1:
When a test record is classified by the decision tree and gets Yes in the class "interest travel promotion", the customer should be interested in the new promotion, so we send an e-mail to him or her. When such customers reject the offer, the training set did not handle those cases correctly. If the rejection ratio (incorrect predictions / correct predictions) grows greater than a threshold, we combine the test records with the old training set and construct the decision tree again, ready to predict new test records.
Example
If we define threshold = 0.3 and (incorrect predictions)/(correct predictions) becomes greater than 0.3, we reconstruct the decision tree immediately.
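
A minimal sketch of this retraining trigger (every name here is hypothetical, just to make the rule concrete):

def should_retrain(incorrect, correct, threshold=0.3):
    # Trigger tree reconstruction when the rejection ratio passes the threshold.
    if correct == 0:
        return incorrect > 0
    return incorrect / correct > threshold

# e.g. 4 rejections against 10 accepted offers: 4/10 > 0.3, so rebuild.
# if should_retrain(4, 10):
#     rows, labels = old_rows + new_rows, old_labels + new_labels
#     tree = c45(rows, labels, attrs)   # c45 from the sketch above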

Solution 2:
Use C5.0 instead of C4.5. The advantages of C5.0 are:
  • Multi-threading: it can take advantage of multiple CPUs or cores, which speeds up construction of the decision tree.
  • Boosting: a technique for generating and combining multiple classifiers to improve predictive accuracy.
  • The generated decision trees are generally smaller.
