Thursday, January 28, 2010

Thoughtful Thursday: The ID3 Machine Learning Algorithm

ID3 is an algorithm to generate a decision tree from a set of examples. It is used in machine learning.

Many legal codes are some sort of function of several parameters and are of a nature that can be defined as a decision tree. For example, consider a gun control law. Some of the parameters might be the type of gun (is it a BB gun, an antique gun, a non-functional gun, an assault weapon, a pistol, etc.?). Is the weapon concealed? Then there are the characteristics of the owner or possessor: are they mentally ill, convicted of a crime, or subject to an order of protection? Was the person's life threatened? Does the person's job require a gun, e.g., security guard? And lastly, what is the place: a school, a bar, or a place where beer is sold, such as some supermarkets?

The answer could be a simple boolean, "permitted" or "not permitted." Or we could get a classification, such that possession might be a misdemeanor or a felony. Many states have categories of felonies. For example, in the State of New York, the term for a class A felony is life imprisonment, the term for a class B felony does not exceed twenty-five years, etc. (There are other special rules for dispositions, such as those for people with prior convictions or youthful offenders.) Then the bulk of the penal code defines offenses to be a class D felony, a class B felony, etc.

A possible decision tree

Gun Type | Location | Age | Personal Status | Penalty, if any
Antique | - | - | - | No Offense
BB Gun | - | - | - | No Offense
Pistol | - | 15-20 | - | Class A Misdemeanor
Pistol | - | 20-30 | - | Class B Felony
Pistol | - | 30-50 | - | Felony
Pistol | - | 50-70 | - | Class A Misdemeanor
Rifle | Bar | - | - | Class B Felony
Rifle | School | - | - | Class C Felony
Rifle | Place where Beer is Sold | - | - | Class A Misdemeanor
Rifle | Other | - | - | No Offense
Assault | - | - | - | Class B Felony
Rifle | Other | - | Mentally Ill | Class D Felony
Rifle | Other | - | Convicted of a Felony | Class C Felony
Rifle | Other | - | Other | Class B Misdemeanor
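
For concreteness, a few rows of this table could be written as records for an ID3-style learner; the field names such as gun_type, and the use of None for the columns a row leaves blank, are illustrative choices only.

# A few rows of the table above, written as records an ID3-style learner
# could consume.  The field names and the None placeholders for columns a
# row leaves blank are illustrative choices, not part of the original table.
examples = [
    {"gun_type": "antique", "location": None, "age": None,
     "personal_status": None, "penalty": "No Offense"},
    {"gun_type": "pistol", "location": None, "age": "15-20",
     "personal_status": None, "penalty": "Class A Misdemeanor"},
    {"gun_type": "rifle", "location": "bar", "age": None,
     "personal_status": None, "penalty": "Class B Felony"},
    {"gun_type": "rifle", "location": "other", "age": None,
     "personal_status": "mentally ill", "penalty": "Class D Felony"},
]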

An ID3 algorithm takes a series of examples such as those in the table above and creates a decision tree that gives the same answers as the examples. It needs to pick which parameter to use for the root node; in the above tree, it was Gun Type. It picks the one that best divides the examples among the categories, and the usual measure is the information-gain formula from information theory. In the above example, the tree was precisely determined from the input data. That is not always true. In a diagnosis application, there would likely be lines that contradict each other, or there might be several lines with very similar attributes, as in the table below. All might have the same classification or penalty, except one. Is that due to a real difference, or just a random error?

A | B | C | D | Result
1 | 2 | 3 | 4 | ii
1 | 2 | 4 | 4 | ii
1 | 2 | 5 | 4 | ii
1 | 2 | 3 | 5 | ii
1 | 2 | 4 | 5 | ii
1 | 2 | 5 | 5 | ii
1 | 2 | 3 | 6 | ii
1 | 2 | 4 | 6 | ii
1 | 2 | 5 | 6 | iii
There is a danger of overfitting when there are lots of examples. Thus, a pruning process removes those divisions that don't have much information gain.
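
To make the selection and pruning steps concrete, here is a minimal Python sketch of entropy, information gain, and a recursive ID3 with a simple minimum-gain rule standing in for the pruning process described above; the function names and the min_gain thresholds are illustrative, not taken from any particular implementation. Run on the small A/B/C/D table, a strict enough threshold treats the lone "iii" row as noise and returns the majority answer instead of splitting further.

import math
from collections import Counter

def entropy(rows, target):
    """Shannon entropy (in bits) of the target column over a set of rows."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attributes, target, min_gain=0.0):
    """Recursive ID3.  min_gain is a simple pre-pruning rule: if no split
    improves the entropy by at least min_gain, stop and return the majority
    classification instead of dividing further."""
    classes = Counter(r[target] for r in rows)
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]             # leaf
    gains = {a: information_gain(rows, a, target) for a in attributes}
    best = max(gains, key=gains.get)
    if gains[best] < min_gain:
        return classes.most_common(1)[0][0]             # prune a weak split
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attributes if a != best],
                          target, min_gain)
                   for v in {r[best] for r in rows}}}

# The A/B/C/D table above, encoded as records.
rows = [dict(zip("ABCD", r[:4]), Result=r[4]) for r in [
    (1, 2, 3, 4, "ii"), (1, 2, 4, 4, "ii"), (1, 2, 5, 4, "ii"),
    (1, 2, 3, 5, "ii"), (1, 2, 4, 5, "ii"), (1, 2, 5, 5, "ii"),
    (1, 2, 3, 6, "ii"), (1, 2, 4, 6, "ii"), (1, 2, 5, 6, "iii"),
]]
print(id3(rows, ["A", "B", "C", "D"], "Result", min_gain=0.05))  # grows a tree for the lone "iii"
print(id3(rows, ["A", "B", "C", "D"], "Result", min_gain=0.25))  # prunes it away as noise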

Mr. Yeluri set up an example to illustrate these ideas. Since it is long, I put it at the end in an appendix.

In participatory democracy, we do it a little differently. For each node in the tree, the users vote on what the next parameter should be. Thus, at the start, the individuals would vote on which of the above questions regarding place, type of gun, and person should be asked first. Then, the group for each resulting category would be classified again. For any node, when it comes up, the individuals can vote to classify it once and for all. For example, if the category of assault weapon shows up, individuals might vote to always ban it, and for the category of BB gun, they may vote not to consider any other factors and always permit it. Such a node is a leaf node, and the people then vote on the classification from the possibilities available.

We established the following terminology for the input to our experiment:

title
a thing on which one chooses, like gun type

attribute
one of the possibilities for a title; thus gun type could be antique, BB, assault, rifle, or shotgun

node
a series of title-attribute pairs chosen so far. For example, the voting might bring us to making a decision on what to do with those caught with an antique gun, who have a history of mental illness, and who are sixteen to twenty years old. This is a list of titles (nature of gun, mental-illness status, and age range) together with the attribute chosen for each.

root node
the top of the tree, which corresponds to no titles and attributes selected so far.
The voters first vote on which title to choose as the root parameter. Then, for each attribute of that title, they vote on how to classify it next. As each of these achieves a quorum, the corresponding node becomes accessible; that is, the voters might vote on what title to use for gun type = "antique," choosing from the remaining titles of personal status, age range, and location. Then, they vote on what title to use for gun type = "BB," gun type = "pistol," gun type = "rifle," and gun type = "assault." Each creates a new node which is marked accessible, and individuals vote on each in turn. One of our Masters students, Mr. Chaitanya Yeluri, has just implemented this. More details are in his master's project.

After the root node, the users can assign a penalty. Observe in the figure that we immediately assign "no offense" to those possessing antique guns. On the other hand, there are three levels until we get to gun type = "rifle," location = "other," and personal status = "convicted of a felony." At each node, the users have the option to "vote now," and if that gets more votes than any title, the users then vote on a penalty for that node.
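
As a rough sketch of these voting mechanics (this is only an illustration of the idea, not the actual code from Mr. Yeluri's project; the class name VotingNode and its methods are invented here):

from collections import Counter

class VotingNode:
    """One accessible node in the participatory tree.  path records the
    (title, attribute) pairs chosen so far, e.g.
    [("gun type", "rifle"), ("location", "other")]."""

    def __init__(self, path, remaining_titles):
        self.path = path
        self.remaining_titles = remaining_titles
        self.title_votes = Counter()     # votes for the next title, or for "vote now"
        self.penalty_votes = Counter()   # only consulted if "vote now" wins

    def cast_title_vote(self, choice):
        self.title_votes[choice] += 1

    def cast_penalty_vote(self, penalty):
        self.penalty_votes[penalty] += 1

    def tally(self, attributes_for):
        """Called once the quorum (checked elsewhere) is reached: either the
        node becomes a leaf with the winning penalty, or one new accessible
        child node is created for each attribute of the winning title."""
        winner, _ = self.title_votes.most_common(1)[0]
        if winner == "vote now":
            return self.penalty_votes.most_common(1)[0][0]
        rest = [t for t in self.remaining_titles if t != winner]
        return {attribute: VotingNode(self.path + [(winner, attribute)], rest)
                for attribute in attributes_for[winner]}

# Example: the root vote picks "gun type", producing one node per gun type.
attributes_for = {
    "gun type": ["antique", "BB", "pistol", "rifle", "assault"],
    "location": ["bar", "school", "place where beer is sold", "other"],
    "age range": ["15-20", "20-30", "30-50", "50-70"],
    "personal status": ["mentally ill", "convicted of a felony", "other"],
}
root = VotingNode([], list(attributes_for))
root.cast_title_vote("gun type")
children = root.tally(attributes_for)    # e.g. children["antique"] is now accessible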

How would we apply this to tax law? The ID3 algorithm is not designed to deal with cases that produce a continuous function, like the amount of tax that the taxpayer should pay. (Note, ID3 can be used when there are input parameters that are numeric. The algorithm sorts the examples in numeric order of that parameter and then finds the decision point that gives the greatest information-theory gain.)
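
The parenthetical note can be sketched as follows: sort the examples on the numeric parameter, try a cut between each pair of adjacent distinct values, and keep the cut with the greatest gain. This is only an illustration; the ages and yes/no labels in the example are made up.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_threshold(values, labels):
    """Sort the examples by a numeric parameter and try a cut between each
    pair of adjacent distinct values, keeping the cut with the best gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                                  # no cut between equal values
        cut = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [lab for v, lab in pairs[:i]]
        right = [lab for v, lab in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best_gain:
            best_gain, best_cut = base - remainder, cut
    return best_cut, best_gain

# e.g. ages of possessors and whether an offense was charged (made-up data)
print(best_threshold([16, 18, 22, 35, 55, 63], ["yes", "yes", "yes", "no", "no", "no"]))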

Thus, for a tax system built by ID3, at each leaf node there would be a formula. The titles in a tax problem might be various income ranges, the number of children, and marital status. And then the users might vote, for those who are an unmarried head of household with three children and in the range $20,000 to $40,000, to make the tax = $500.00 + 0.10 * (Income - $20,000). That is, $500.00 on the first $20,000 plus ten percent of the remaining income. They might put other components into the formulae for the amount of state tax paid, the mortgage interest, etc.
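
As a tiny sketch, that voted-in leaf formula could be written as follows (the bracket and rate are just the figures from the example above):

def leaf_tax(income):
    """Illustrative leaf formula: $500.00 on the first $20,000 plus ten
    percent of income over $20,000, for the hypothetical bracket voted in
    above (unmarried head of household, three children, $20,000-$40,000)."""
    return 500.00 + 0.10 * (income - 20000.00)

print(leaf_tax(30000.00))   # -> 1500.0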

Reference

Quinlan, J. Ross, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.

Appendix: The Information Theory Formula

We have three titles, A, B, and C. The possible choices, or parameters, for each of these are:
A: F, G, H
B: I, J, K
C: M, N, P
Assume the input examples are:
A | B | C | Classification (Yes/No)
F | I | M | N
F | I | N | Y
F | I | P | N
G | I | M | N
G | I | P | Y
G | I | M | N
G | I | P | Y
G | I | N | Y
G | J | M | Y
G | J | P | Y
F | J | M | N
F | J | N | Y
F | J | P | Y
H | I | N | N
H | I | N | N
H | J | N | N
H | I | P | Y
H | J | M | Y
H | J | N | N
H | J | P | Y
H | K | M | N
H | K | N | Y
H | K | P | Y
F | K | M | N
F | K | N | Y
F | K | P | Y
The information content of the entire set is 0.3149. We try classifying it on the basis of A, and we get an information content of 1.3328. And it happens by coincidence that if we tried B for the root node, we would also get 1.3328. (The three nodes for A would have been:
A | B | C | Classification (Yes/No)
F | I | M | N
F | I | N | Y
F | I | P | N
F | J | M | N
F | J | N | Y
F | J | P | Y
F | K | M | N
F | K | N | Y
F | K | P | Y
giving an information content of -(4/9) log₂(4/9) - (5/9) log₂(5/9).
A | B | C | Classification (Yes/No)
G | I | M | N
G | I | P | Y
G | I | N | Y
G | J | M | Y
G | I | M | N
G | I | P | Y
G | J | P | Y
giving an information content of -(5/7) log₂(5/7) - (2/7) log₂(2/7).
A | B | C | Classification (Yes/No)
H | I | M | N
H | I | N | N
H | J | N | N
H | I | P | Y
H | J | M | Y
H | J | N | N
H | J | P | Y
H | K | M | N
H | K | N | Y
H | K | P | Y
giving an information content of -(5/10) log₂(5/10) - (5/10) log₂(5/10).
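
Evaluated numerically, these three expressions come out to about 0.991, 0.863, and exactly 1.0 bits respectively; a quick check in Python (the helper name h is just for this sketch):

import math

def h(p, q):
    # binary entropy -p*log2(p) - q*log2(q), in bits
    return -(p * math.log2(p) + q * math.log2(q))

print(h(4/9, 5/9))    # A = F subset, about 0.991
print(h(5/7, 2/7))    # A = G subset, about 0.863
print(h(5/10, 5/10))  # A = H subset, exactly 1.0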

By comparison, the sample sets for classifying by C are:

C="M"

A | B | C | Classification (Yes/No)
F | I | M | N
G | I | M | N
G | J | M | Y
F | J | M | N
H | I | M | N
H | I | N | N
H | J | M | Y
H | K | M | N
F | K | M | N
G | I | M | N
And, if we tried to split this by A, we would have:
A | B | C | Classification (Yes/No)
F | I | M | N
F | J | M | N
F | K | M | N

A | B | C | Classification (Yes/No)
G | I | M | N
G | J | M | Y
G | I | M | N

A | B | C | Classification (Yes/No)
H | J | M | Y
H | K | M | N
giving a total information content of 0.6438. Next, consider the subset with C="N":
A | B | C | Classification (Yes/No)
F | I | N | Y
G | I | N | Y
F | J | N | Y
H | I | N | N
H | J | N | N
H | K | N | Y
F | K | N | Y
Then there are two possibilities. Breaking it down by A gives:
A | B | C | Classification (Yes/No)
F | I | N | Y
F | J | N | Y
F | K | N | Y

A | B | C | Classification (Yes/No)
G | I | N | Y

A | B | C | Classification (Yes/No)
H | I | N | N
H | J | N | N
H | K | N | Y
This gives us an information value of 1.5579.

Or we can split it by B, giving us:

A | B | C | Classification (Yes/No)
F | I | N | Y
G | I | N | Y
G | I | M | N
H | I | N | N

A | B | C | Classification (Yes/No)
F | J | N | Y
G | J | P | Y
H | J | N | N

A | B | C | Classification (Yes/No)
H | K | N | Y
F | K | N | Y
This gives us an information value of 1.6253.

The resulting tree chosen would be given by the below:
