We can take a unique service each customer by using customer relationship management (CRM) in order to discover his/her behavior in detail. In the CRM, Decision Trees can be used to classify existing customer records into customer segments. The process starts with data related whose behavior is already known; for example, customers who have responded to a promotional campaign and someone have not; C4.5 is an algorithm developed by Ross Quinlan that generates Decision Trees. This algorithm dealing with both continuous and discrete attributes, missing values and pruning trees after construction.
Algorithm C4.5(D)
Input: an attribute-value dataset D
1: Tree = {}
2: if D is "pure" OR other stopping criteria met then
3: terminate
4:end if
5:for all attribute a Є D do
Input: an attribute-value dataset D
1: Tree = {}
2: if D is "pure" OR other stopping criteria met then
3: terminate
4:end if
5:for all attribute a Є D do
6: Compute information-theoretic criteria if we split on a
7:end for
8:a_best = Best attribute according to above compute criteria
9:Tree = Create a decision node that test a_est in the root
10:D_v = Induce sub-datasets from D based on a_best
11:for all D_v do
12: Tree_v = C4.5(D_v)
13: Attach Tree_v to the correspoding branch of Tree
14:end for
15:return Tree
7:end for
8:a_best = Best attribute according to above compute criteria
9:Tree = Create a decision node that test a_est in the root
10:D_v = Induce sub-datasets from D based on a_best
11:for all D_v do
12: Tree_v = C4.5(D_v)
13: Attach Tree_v to the correspoding branch of Tree
14:end for
15:return Tree
Example
Gender | Age | Salary | Status | Number of times travel per year | interest travel promotion or not |
Male | 21-40 | 30001-50000 | Single | 1-3 | No |
Female | 21-40 | 10001-30000 | Married | 4-5 | Yes |
Female | > 60 | 10001-30000 | Single | 1-3 | No |
Female | 41-60 | 30001-50000 | Married | 4-5 | Yes |
Male | 41-60 | > 50001 | Single | 1-3 | Yes |
Female | 21-40 | > 50001 | Single | 1-3 | Yes |
Male | 41-60 | 30001-50000 | Married | 4-5 | Yes |
Male | > 60 | 10001-30000 | Single | 1-3 | No |
Female | 41-60 | 10001-30000 | Married | 1-3 | No |
Male | > 60 | 30001-50000 | Married | 1-3 | Yes |
a) Entropy using the frequency table of one attribute:
Yes of interest is 6.
No is 4.
E(travel promotion) = E(4/10, 6/10) = - (0.4log20.4) - (0.6log20.6) = 0.971
E original = 0.971
Gender | Yes | No |
Male | 3 | 2 |
Female | 3 | 2 |
E(Interest, Gender) = P(Male)*E(3/5, 2/5) + P(Female)*E(3/5, 2/5)
= 5/10*(- 0.6log20.6 - 0.4log20.4)) + 5/10*(- 0.6log20.6 - 0.4log20.4)) = 0.971
Gain(Original, Gender) = 0.971 - 0.971 = 0
SplitINFO(Interest, Gender) = - (5/10log25/10) - (5/10log25/10) = 1
GainRATIO(Interest, Gender) = 0/1 = 0
The C4.5 uses "Gain Ratio" measure which is Information Gain divided by SplitInfo. The Information Gain divided by SplitInfo is call GainRaTIO.
Step 1: Calculate entropy of the target.
Step 2: Split the dataset on the different attribute, and calculate GainRATIO of each attribute.
Step 3: Choose attribute that have largest GainRATIO as the decision node.
Step 4: The C4.5 algorithm is run recursively one the non-leaf branch, until all data is classified.
Step 1:
E(travel promotion) = E(4/10, 6/10) = - (0.4log20.4) - (0.6log20.6) = 0.971
Step 2:
Gender | Yes | No |
Male | 3 | 2 |
Female | 3 | 2 |
E(Interest, Gender) = P(Male)*E(3/5, 2/5) + P(Female)*E(3/5, 2/5)
= 5/10*(- 0.6log20.6 - 0.4log20.4) + 5/10*(- 0.6log20.6 - 0.4log20.4) = 0.971
Gain(Original, Gender) = 0.971 - 0.971 = 0
SplitINFO(Interest, Gender) = - (5/10log25/10) - (5/10log25/10) = 1
GainRATIO(Interest, Gender) = 0/1 = 0
Age | Yes | No |
21-40 | 2 | 1 |
41-60 | 3 | 1 |
> 60 | 2 | 1 |
E(Interest, Age) = P(21-40)*E(2/3, 1/3) + P(41-60)*E(3/4, 1/4) + P(> 60)*E(2/3, 1/3)
= 3/10*0.9183 + 4/10*0.8113 + 3/10*0.9183 = 0.876
Gain(Original, Age) = 0.971 - 0.876 = 0.095
SplitINFO(Interest, Age) =- (3/10log23/10) - (4/10log24/10) - (3/10log23/10) = 1.571
GainRATIO(Interest, Age) = 0.095/1.571 = 0.06
Salary | Yes | No |
10001-30000 | 1 | 3 |
30001-50000 | 3 | 1 |
> 50001 | 2 | 0 |
E(Interest, Age) = P(10001-30000)*E(1/4, 3/4) + P(30001-50000)*E(3/4, 1/4) + P(> 50001)*E(2/2, 0/2)
= 4/10*0.811 + 4/10*0.811 + 2/10*0 = 0.649
Gain(Original, Age) = 0.971 - 0.649 = 0.322
SplitINFO(Interest, Age) =- (4/10log24/10) - (4/10log24/10) - (2/10log22/10) = 2.376
GainRATIO(Interest, Age) = 0.322/2.376 = 0.136
Status | Yes | No |
Married | 4 | 1 |
Single | 2 | 3 |
E(Interest,Status) = P(Married)*E(4/5, 1/5) + P(Single)*E(2/5, 3/5)
= 5/10*0.722 + 5/10*0.971 = 0.847
Gain(Original,Status) = 0.971 - 0.847 = 0.124
SplitINFO(Interest,Status) =- (5/10log25/10) - (5/10log25/10) = 1
GainRATIO(Interest,Status) = 0.124/1 = 0.124
Number of times travel per year | Yes | No |
1-3 | 3 | 4 |
4-5 | 3 | 0 |
E(Interest, travel per year) = P(1-3)*E(3/7, 4/7) + P(4-5)*E(3/3, 0/3)
= 7/10*0.985 + 3/10*0 = 0.69
Gain(Original, travel per year) = 0.971 - 0.69 = 0.281
SplitINFO(Interest, travel per year) =- (7/10log27/10) - (3/10log23/10) = 0.881
GainRATIO(Interest, travel per year) = 0.281/0.881 = 0.319
Step 3: GainRATIO of Number of times travel per year is come to top, so number of times travel per year is the root node.
Step 4: Next we will interest only number of times travel per year that have 1-3 value.
Gender | Age | Salary | Status | interest travel promotion or not |
Male | 21-40 | 30001-50000 | Single | No |
Female | > 60 | 10001-30000 | Single | No |
Male | 41-60 | > 50001 | Single | Yes |
Female | 21-40 | > 50001 | Single | Yes |
Male | > 60 | 10001-30000 | Single | No |
Female | 41-60 | 10001-30000 | Married | No |
Male | > 60 | 30001-50000 | Married | Yes |
Step 5: Calculate GainRATIO that have travel per year = 1-3
E(Interest[1-3]) = E(3/7, 4/7) = - (3/7log23/7) - (4/7log24/7) = 0.985
Gender | Yes | No |
Male | 2 | 2 |
Female | 2 | 1 |
E(Interest[1-3], Gender) = P(Male)*E(2/4, 2/4) + P(Female)*E(2/3, 1/3)
= 4/7*1 + 3/7*0.918 = 0.965
Gain(Original[1-3], Gender) = 0.985 - 0.965 = 0.02
SplitINFO(Interest[1-3], Gender) =- (4/7log24/7) - (3/7log23/7) = 0.985
GainRATIO(Interest[1-3], Gender) = 0.02/0.985 = 0.02
Age | Yes | No |
21-40 | 1 | 1 |
41-60 | 1 | 1 |
> 60 | 1 | 2 |
E(Interest[1-3], Age) = P(21-40)*E(1/2, 1/2) + P(41-60)*E(1/2, 1/2) + P(> 60)*E(1/3, 2/3)
= 2/7*1 + 2/7*1 + 3/7*0.918 = 0.965
Gain(Original[1-3], Age) = 0.985 - 0.965 = 0.02
SplitINFO(Interest[1-3], Age) =- (2/7log22/7) - (2/7log22/7) - (3/7log23/7) = 1.556
GainRATIO(Interest[1-3], Age) = 0.02/1.556 = 0.013
Salary | Yes | No |
10001-30000 | 0 | 3 |
30001-50000 | 1 | 1 |
> 50001 | 2 | 0 |
E(Interest[1-3], Salary) = P(10001-30000)*E(3/3, 0/3) + P(30001-50000)*E(1/2, 1/2) + P(> 50001)*E(2/2, 0/2)
= 3/7*0 + 2/7*1 + 2/7*0 = 0.286
Gain(Original[1-3],Salary) = 0.985 - 0.286 = 0.699
SplitINFO(Interest[1-3],Salary) =- (3/7log23/7) - (2/7log22/7) - (2/7log22/7) = 1.556
GainRATIO(Interest[1-3],Salary) = 0.699/1.556 = 0.449
Status | Yes | No |
Married | 1 | 1 |
Single | 2 | 3 |
E(Interest[1-3], Status) = P(Married)*E(1/2, 1/2) + P(Single)*E(2/5, 3/5)
= 2/7*1 + 5/7*0.971 = 0.979
Gain(Original[1-3], Status) = 0.985 - 0.979 = 0.006
SplitINFO(Interest[1-3], Status) =- (2/7log22/7) - (5/7log25/7) = 0.863
GainRATIO(Interest[1-3], Status) = 0.006/0.863 = 0.007
Step 6: GainRATIO of salary is come to top, so salary is the child node of travel per year.
Step 7: we will interest only number of times travel per year that have 1-3 value and salary = 30001-50000.
Gender | Age | Status | interest travel promotion or not |
Male | 21-40 | Single | No |
Male | > 60 | Married | Yes |
Step 8: Calculate GainRATIO that have travel per year = 1-3 and salary = 30001-50000.
E(Interest[1-3][30001-50000]) = E(1/2, 1/2) = - (1/2log21/2) - (1/2log21/2) = 1
Gender | Yes | No |
Male | 1 | 1 |
Female | 0 | 0 |
E(Interest[1-3][30001-50000], Gender) = P(Male)*E(1/2, 1/2) + P(Female)*E(0, 0) = 2/2*1 + 0 = 1
Gain(Original[1-3][30001-50000], Gender) = 1 - 1 = 0
SplitINFO(Interest[1-3][30001-50000], Gender) =- (2/2log22/2) = 0
GainRATIO(Interest[1-3][30001-50000], Gender) = 0
Age | Yes | No |
21-40 | 0 | 1 |
41-60 | 0 | 0 |
> 60 | 1 | 0 |
E(Interest[1-3][30001-50000], Age) = P(21-40)*E(0, 1) + P(41-60)*E(0, 0) + P(> 60)*E(1, 0) = 0
Gain(Original[1-3][30001-50000], Age) = 1 - 0 = 1
SplitINFO(Interest[1-3][30001-50000], Age) = - (1/2log22/7) - (1/2log21/2) = 1
GainRATIO(Interest[1-3][30001-50000], Age) = 1/1 = 1
Status | Yes | No |
Married | 1 | 0 |
Single | 0 | 1 |
E(Interest[1-3][30001-50000], Status) = P(Married)*E(0, 1) + P(Single)*E(1, 0) = 0
Gain(Original[1-3][30001-50000], Status) = 1 - 0 = 1
SplitINFO(Interest[1-3][30001-50000], Status) =- (1/2log21/2) - (1/2log21/2) = 1
GainRATIO(Interest[1-3][30001-50000], Status) = 1
GainRATIO both age and status is equal, so we can use either age or status to child node of salary, but we should select a node that have least number of type of attribute. Therefore we got a decision tree like this.
After We constructed a decision tree completely, we can use post pruning in order to discard unreliable parts.
f is the error on the training data.
N is the number of instances covered by the left.
z from normal distribution.
e is error estimate for a node.
If the error estimate at children node greater than parent node, we do not want to keep the children.
The advantages of the C4.5 are:
- Builds models that can be easily interpreted and explained to executives.
- Easy to implement because construct a decision tree only once that can handle test sets.
- Can use both categorical and continuous values.
- Deals with noise.
The disadvantages are:
- Small variation in data can lead to different decision trees (especially when the variables are close to each other in value).
- Does not work very well on a small training set.
- If in the first time we use small training set to construct a tree, It can't handle a test set accurately.
Solution 1:
When test data has been predicted by decision tree, and get Yes in class interest travel promotion that means customer should have interest in a new promotion. We'll send e-mail to him/her. Since he/she rejected, the training set can't handle correctly. If number of rejection greater than threshold(incorrect predictions/correct prediction), we'll use test sets to become training sets combination with old training sets to construct a decision tree again for preparing to predict new test set.
Example
Since we define threshold=0.3, (incorrect predictions)/(correct prediction) is greater than 0.3, we'll construct decision tree immediately.
Solution 2:
Use C5.0 instead of C4.5. The advantages of C5.0
- Multi-threaded: can take advantage of multiple CPUs or cores that can improve speed to construct a decision tree.
- Boosting: that is a technique for generating and combining multiple classifiers to improve predictive accuracy.
- Decision trees generally smaller.
This information really worth saying, i think you are master of the content and thank you so much sharing that valuable information and get new skills after refer that post.
ReplyDeleteWebsphere Training in Chennai
This comment has been removed by the author.
ReplyDelete