Notes from ML for Trading
# What's Involved
Features/Factors: X1, X2, X3
Labels: Y
Nodes
- feature used
- split value
- left link
- right link
Root
Leaves
# The Data Structure
Using a 2-dimensional array is actually quite efficient, compared to building an object-oriented tree
- The value in the left/right column points to the index in the array where the next node can be found, depending on whether we are going left or right.
- A factor of -1 indicates a leaf node.
- Instead of using absolute indexes in the left/right columns, we can also use a relative value to jump to the next relevant row, e.g. in row 1, instead of entering the absolute index 5 in the left column, you can write 4, since it takes a jump of +4 to get from row 1 to row 5.
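A minimal sketch of that flat-array representation in NumPy, assuming a column layout of [factor, split value, left, right], relative offsets in the left/right columns, and a factor of -1 to mark a leaf (whose split-value column is then read as the value to return):

```python
import numpy as np

# Column layout (assumed): factor | split value | left offset | right offset
# A factor of -1 marks a leaf; the split-value column then holds the leaf's value.
tree = np.array([
    [ 0, 10.0,      1,      2],  # row 0: split on feature 0 at 10.0
    [-1,  5.0, np.nan, np.nan],  # row 1: leaf, return 5.0
    [ 1,  3.0,      1,      2],  # row 2: split on feature 1 at 3.0
    [-1,  8.0, np.nan, np.nan],  # row 3: leaf, return 8.0
    [-1, 12.0, np.nan, np.nan],  # row 4: leaf, return 12.0
])

def query(tree, x):
    """Walk the array from the root (row 0) until a leaf is reached."""
    row = 0
    while tree[row, 0] != -1:                 # not a leaf yet
        factor, split_val = int(tree[row, 0]), tree[row, 1]
        if x[factor] <= split_val:
            row += int(tree[row, 2])          # relative jump to the left child
        else:
            row += int(tree[row, 3])          # relative jump to the right child
    return tree[row, 1]

print(query(tree, [9.0, 4.0]))   # left at the root  -> 5.0
print(query(tree, [11.0, 2.5]))  # right, then left  -> 8.0
```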
# Constructing the Decision Tree
- We use relative indexes instead of absolute indexes for the left/right columns in this algorithm (a rough sketch follows below)
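A rough sketch of such a construction, assuming the label Y sits in the last column of `data`, the split value is the median of the chosen feature, and a `best_feature` helper (sketched under the next heading) picks the feature to split on. The relative offsets fall out naturally because each child subtree is stacked directly below its parent row:

```python
import numpy as np

def build_tree(data, leaf_size=1):
    """Recursively build the flat-array tree. Each row is
    [factor, split value, relative left offset, relative right offset]."""
    # Stop: few records left, or all labels identical -> make a leaf
    if data.shape[0] <= leaf_size or np.all(data[:, -1] == data[0, -1]):
        return np.array([[-1, data[:, -1].mean(), np.nan, np.nan]])

    factor = best_feature(data)                  # assumed helper, see next section
    split_val = np.median(data[:, factor])

    left_mask = data[:, factor] <= split_val
    if left_mask.all() or (~left_mask).all():    # split separates nothing -> leaf
        return np.array([[-1, data[:, -1].mean(), np.nan, np.nan]])

    left = build_tree(data[left_mask], leaf_size)
    right = build_tree(data[~left_mask], leaf_size)

    # Left child is the very next row (+1); right child starts after the
    # entire left subtree (+1 + number of rows in the left subtree).
    root = np.array([[factor, split_val, 1, left.shape[0] + 1]])
    return np.vstack([root, left, right])
```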
# Determining Best Feature
While constructing the decision tree, we continuously split the data and work with a smaller subset at every subtree as we go down. As the dataset gets narrower and narrower, a different feature may turn out to best describe the relationship to the label.
When you get to the lowest subtree and have only 3-4 records, it is likely you will get the same correlation/entropy value for all features. In such a case, you can pick deterministically by order, or randomly.
Recommended: Randomly
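A minimal sketch of correlation-based feature selection with the random tie-break recommended above (the `best_feature` name and the layout, features first and label last, are assumptions):

```python
import numpy as np

def best_feature(data, rng=np.random.default_rng()):
    """Pick the feature with the highest |correlation| with the label (last column),
    breaking ties randomly."""
    y = data[:, -1]
    scores = []
    for i in range(data.shape[1] - 1):
        x = data[:, i]
        if np.std(x) == 0 or np.std(y) == 0:
            scores.append(0.0)            # correlation undefined -> treat as no signal
        else:
            scores.append(abs(np.corrcoef(x, y)[0, 1]))
    scores = np.array(scores)
    candidates = np.flatnonzero(scores == scores.max())
    return int(rng.choice(candidates))    # random pick among equally good features
```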
# Approaches
- Information gain: Entropy (measures the randomness and diversity of data)
- Information gain: Correlation (with the label)
- Information gain: Gini index
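For the entropy and Gini variants, a rough sketch of the impurity measures and the resulting information gain for a discrete label vector (function names are illustrative):

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a discrete label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    """Gini index (impurity) of a discrete label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y, mask, impurity=entropy):
    """Reduction in impurity from splitting labels y into y[mask] and y[~mask]."""
    n = len(y)
    left, right = y[mask], y[~mask]
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(y) - weighted
```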
# Random Tree Algorithm
The decision tree algorithm is great, but the bottleneck is computing the best feature to split on, and also computing the split value. That's where the random tree algorithm makes changes.
While a single random tree in itself is weak compared to a decision tree, if we adopt bagging and construct a bunch of random trees, they lead to better results.
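A hedged sketch of that idea, reusing the flat-array layout from above: the split feature is chosen at random and the split value is derived cheaply (here, the mean of the feature values of two randomly chosen rows, one common variant), and bagging then trains many such trees on bootstrap samples:

```python
import numpy as np

def build_random_tree(data, leaf_size=1, rng=np.random.default_rng()):
    """Like build_tree above, but with no best-feature search or median computation."""
    if data.shape[0] <= leaf_size or np.all(data[:, -1] == data[0, -1]):
        return np.array([[-1, data[:, -1].mean(), np.nan, np.nan]])

    factor = rng.integers(data.shape[1] - 1)        # random feature
    r1, r2 = rng.integers(data.shape[0], size=2)    # two random rows
    split_val = (data[r1, factor] + data[r2, factor]) / 2.0

    left_mask = data[:, factor] <= split_val
    if left_mask.all() or (~left_mask).all():
        return np.array([[-1, data[:, -1].mean(), np.nan, np.nan]])

    left = build_random_tree(data[left_mask], leaf_size, rng)
    right = build_random_tree(data[~left_mask], leaf_size, rng)
    root = np.array([[factor, split_val, 1, left.shape[0] + 1]])
    return np.vstack([root, left, right])

def bag_of_trees(data, n_trees=20, rng=np.random.default_rng()):
    """Bagging: each tree is trained on a bootstrap sample; queries are averaged."""
    bags = [data[rng.integers(data.shape[0], size=data.shape[0])] for _ in range(n_trees)]
    return [build_random_tree(bag, rng=rng) for bag in bags]
```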
# Strengths & Weaknesses
- Cost of learning is higher compared to instance-based algorithms like KNN or parametric algorithms like linear regression.
- Cost of query is lower than something like KNN, but higher than parametric algorithms like linear regression.
- No need to normalize the data as in algorithms like KNN, so features with different value ranges are all treated as equally important.
# Evaluating Trees
- The more nodes a tree has, the more perfectly the model fits the training data. As a result, overfitting might occur, causing accuracy on the training data to be high but accuracy on the test data to be low. So we should look to generalize.
- The closer the correlation between Y_predict & Y_test is to 1, the better.
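A small sketch of that evaluation, assuming NumPy arrays of predictions and true labels (RMSE is included alongside the correlation as a common companion metric):

```python
import numpy as np

def evaluate(y_test, y_predict):
    """Return (RMSE, correlation); correlation closer to 1 is better."""
    rmse = np.sqrt(np.mean((y_test - y_predict) ** 2))
    corr = np.corrcoef(y_predict, y_test)[0, 1]
    return rmse, corr

# Hypothetical usage: a large gap between in-sample and out-of-sample error
# suggests the tree is overfitting the training data.
# in_rmse, in_corr = evaluate(y_train, predictions_on_train)
# out_rmse, out_corr = evaluate(y_test, predictions_on_test)
```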
Notes from ML Course
# Expressiveness of Decision Trees
Analyzing the expressiveness of decision trees using simple logic gates
# OR
- Solving OR problems creates a decision tree that grows linearly with the number of attributes
# XOR
- Solving XOR problems creates a decision tree that grows exponentially with the number of attributes
- Odd parity means the answer is true if the number of true variables is odd (the n-attribute generalization of XOR)
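Roughly, for n boolean attributes:

$$\text{size}_{\text{OR}}(n) = O(n), \qquad \text{size}_{\text{parity}}(n) = O(2^n)$$

The OR tree is a single chain that tests one attribute per level and stops at the first true; odd parity has to test every attribute on every path, so the tree needs on the order of $2^n$ leaves.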
# Bias
# Restriction Bias
Bias due to the hypothesis set we care about. For decision trees, that would be the set of all possible decision trees. This means we're not considering linear or quadratic functions, or non-boolean functions; we only care about what can be represented by a decision tree.
# Preference Bias
Heart of inductive bias
What do we prefer out of all the possible hypotheses we care about? For ID3 decision trees, since we make decisions top-down, we prefer trees that
- make good splits near the top
- correct modeling
- shorter trees
# Considerations
- It isn't right to repeat an attribute down the same path of the tree if it is a binary/discrete attribute, since you would've already gotten its answer earlier in the path. But it is perfectly acceptable to repeat a continuous attribute down the same path, asking a different question each time. For example, we might group people under 20 earlier in the tree, and then identify teenagers by restricting the age range again further down (a tiny illustration follows).
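A tiny illustration of that age example as nested splits (the thresholds are illustrative only):

```python
def classify(age):
    # The continuous attribute "age" is legitimately tested twice on the same path,
    # each time against a different threshold.
    if age < 20:            # first split: under 20?
        if age >= 13:       # second split on the same attribute: teenager?
            return "teenager"
        return "child"
    return "adult"
```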
# Avoiding Overfitting
- Cross-validation: construct multiple trees and pick the one with the least error on the held-out set
- While building a tree, every time before expanding it, check against the validation set; if the error is already low enough, you can stop.
- Suggested: Pruning: first build the full tree without worrying about anything. Once it is done, go through it attempting to remove nodes, and check how removing each node would affect the overall accuracy of the tree. Only remove a node if it has a minimal effect on the accuracy (a rough sketch follows below).
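A rough sketch of that pruning pass, using a nested-dict tree for clarity instead of the flat array above; for simplicity the replacement leaf predicts the mean validation label reaching it (in practice the leaf value would typically come from the training data at that node):

```python
import numpy as np

def predict_one(node, x):
    """Query a nested-dict tree: {'factor', 'split_val', 'left', 'right'} or {'value': v}."""
    if 'value' in node:
        return node['value']
    child = node['left'] if x[node['factor']] <= node['split_val'] else node['right']
    return predict_one(child, x)

def prune(node, x_val, y_val, tolerance=0.0):
    """Bottom-up pruning: collapse a subtree into a leaf whenever doing so costs at
    most `tolerance` extra squared error on the validation data routed to it."""
    if 'value' in node or len(y_val) == 0:
        return node

    # Route the validation records down the tree and prune the children first.
    mask = x_val[:, node['factor']] <= node['split_val']
    node['left'] = prune(node['left'], x_val[mask], y_val[mask], tolerance)
    node['right'] = prune(node['right'], x_val[~mask], y_val[~mask], tolerance)

    # Error if we keep this subtree vs. replace it with a single leaf.
    keep_err = np.mean([(predict_one(node, x) - y) ** 2 for x, y in zip(x_val, y_val)])
    leaf_val = float(np.mean(y_val))
    cut_err = np.mean((y_val - leaf_val) ** 2)

    return {'value': leaf_val} if cut_err <= keep_err + tolerance else node
```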