# Feature selection

The problem is exponentially hard in general: with n features there are 2^n subsets to consider.

*Figure: approaches to feature selection (filtering vs. wrapping).*

L refers to the learning algorithm.

# Filtering

  • Fast
  • Ignores the learning problem
  • Looks at features in isolation, so it fails to account for cases where a feature is only valuable when combined with another feature for the particular learning problem

# Wrapping

  • Very slow
  • Takes the learner's bias into account, since candidate feature subsets are scored by actually training the learner

# Criterion to use for filtering

  • Information gain (like in Decision Trees)
  • Variance/Entropy/Gini Index
  • Independent features / non-correlated features
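As a minimal illustration of the variance criterion, here is a numpy sketch that drops features whose variance falls below a threshold. The data and the threshold of 0.1 are made up for the example:

```python
import numpy as np

# Toy data: 5 samples, 3 features; the middle feature is nearly constant.
X = np.array([
    [1.0, 0.50, 10.0],
    [2.0, 0.50, 20.0],
    [3.0, 0.51, 15.0],
    [4.0, 0.50, 25.0],
    [5.0, 0.50, 30.0],
])

def variance_filter(X, threshold=0.1):
    """Keep only the columns whose variance exceeds the threshold."""
    keep = X.var(axis=0) > threshold
    return X[:, keep], keep

X_filtered, kept = variance_filter(X)
# kept is [True, False, True]: the near-constant middle feature is dropped.
```

Note that the learner is never consulted, which is exactly what makes filtering fast and learner-agnostic.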

# Wrapping techniques

  • Randomized optimization
  • Forward search
    1. Score each feature one at a time with the learner and see which performs best.
    2. Add that feature to the bucket of selected features (initially empty).
    3. Pair the bucket of selected features with each remaining feature, one at a time, to find out which addition gives the best outcome.
    4. Add that best-performing feature to the bucket and repeat from step 3, until you stop seeing a significant boost in score.
  • Backward search
    1. Train the learner on every all-except-one subset of the current features and see which subset loses the least.
    2. Eliminate the feature that subset excluded.
    3. Repeat from step 1 with the remaining features, until eliminating any further feature causes a significant loss.
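The forward-search steps can be sketched as a greedy loop. This is an illustrative toy, not canonical code: the "learner" is a least-squares fit scored by R², the data is synthetic, and the 0.01 stopping threshold is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
# y depends only on features 0 and 2; features 1 and 3 are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=n)

def score(cols):
    """R^2 of a least-squares fit using only the given columns (the 'learner')."""
    if not cols:
        return 0.0
    A = X[:, cols]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

selected, best = [], 0.0
while True:
    candidates = [c for c in range(X.shape[1]) if c not in selected]
    if not candidates:
        break
    gains = {c: score(selected + [c]) for c in candidates}
    c_best = max(gains, key=gains.get)
    if gains[c_best] - best < 0.01:   # stop when the boost is insignificant
        break
    selected.append(c_best)
    best = gains[c_best]
# selected ends up containing exactly the two informative features, 0 and 2.
```

Because the learner is retrained for every candidate subset, the cost grows quickly with the number of features, which is why wrapping is slow.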

# Feature Relevance

*Figure: feature relevance definitions.*

A feature x_i is strongly relevant if removing it degrades the Bayes optimal classifier. It is weakly relevant if it is not strongly relevant but there exists some subset of features S such that adding x_i to S improves the Bayes optimal classifier. Otherwise it is irrelevant. Relevance is about how much information a feature carries.

# Feature Usefulness

*Figure: feature usefulness.*

Usefulness measures a feature's effect on the error of a particular predictor, not on the Bayes optimal classifier. A feature can be irrelevant yet useful: a constant feature carries no information, but it can be useful to a perceptron as a bias term.

# Feature transformation

Instead of selecting a subset of the original features, transform them into a new (usually smaller) set of features, typically via linear combinations of the originals.

# Principal Component Analysis (PCA)

*Figure: PCA.*

PCA is about correlation: it finds mutually orthogonal directions of maximal variance, which is exactly what minimizes squared reconstruction error when projecting down to fewer dimensions.
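A minimal PCA sketch in numpy, on synthetic 2-D data; the principal directions are the eigenvectors of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# 2-D data stretched along the direction (1, 1): most variance lies there.
t = rng.normal(size=500)
X = np.column_stack([t + 0.1 * rng.normal(size=500),
                     t + 0.1 * rng.normal(size=500)])

def pca(X, k):
    """Project onto the k eigenvectors of the covariance with largest eigenvalues."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    components = eigvecs[:, ::-1][:, :k]     # top-k directions, descending
    return Xc @ components, components

Z, components = pca(X, k=1)

# Reconstruction from the 1-D projection stays close to the centred data,
# because the dropped direction carries almost no variance.
Xc = X - X.mean(axis=0)
X_rec = Z @ components.T
err = np.mean((Xc - X_rec) ** 2)
```

The first component comes out close to (1, 1)/sqrt(2), the direction along which the two correlated coordinates vary together.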

# Independent Components Analysis (ICA)

ICA is about independence: it transforms the data into components that are statistically independent of one another.

Blind source separation / Cocktail party problem is an example of what it can solve.
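A toy blind-source-separation sketch, illustrative only: for two sources, whitening reduces the problem to finding a rotation, and here that rotation is found by brute-force search for maximal non-Gaussianity (excess kurtosis) rather than by a real ICA algorithm such as FastICA:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
# Two independent, non-Gaussian sources (the "speakers" at the cocktail party).
s1 = np.sign(rng.normal(size=n))        # binary source
s2 = rng.uniform(-1, 1, size=n)         # uniform source
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.5], [0.5, 1.0]])  # unknown mixing matrix ("microphones")
X = S @ A.T                             # observed mixtures

# Step 1: whiten the mixtures (decorrelate and scale to unit variance).
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Xw = (Xc @ eigvecs) / np.sqrt(eigvals)

# Step 2: after whitening, the sources differ from the data only by a
# rotation; pick the rotation whose outputs are maximally non-Gaussian
# (largest total |excess kurtosis|), which is where they are independent.
def unmixed(theta):
    c, s = np.cos(theta), np.sin(theta)
    return Xw @ np.array([[c, -s], [s, c]])

def non_gaussianity(theta):
    Z = unmixed(theta)
    return np.sum(np.abs(np.mean(Z ** 4, axis=0) - 3.0))

best = max(np.linspace(0, np.pi / 2, 181), key=non_gaussianity)
S_hat = unmixed(best)
```

Each recovered column ends up highly correlated with one of the true sources (up to sign and ordering, which ICA cannot determine).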

# PCA vs ICA

  • Mutually orthogonal: PCA (this is what makes PCA a global algorithm)

  • Mutually independent: ICA

  • Maximal variance: PCA

  • Maximal mutual information: ICA

  • Ordered features: PCA

  • Bag of features: ICA, PCA

  • ICA is highly directional, whereas PCA is not.

    • Therefore, ICA is great for sound waves, where direction is important.
    • ICA is also great for images such as faces: it ends up detecting local structure like noses, eyes, and contours, whereas PCA finds brightness, the average face, and other global properties.
    • For pictures of natural scenes, ICA detects edges, whereas PCA still behaves as it does for faces.
    • For documents, ICA yields topics.

# Random Components Analysis (RCA)

Projects the data onto randomly generated directions.

Big advantage of RCA: it is fast.
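A sketch of why random projections are cheap yet usable; the dimensions and sample here are arbitrary. Projecting onto k random directions costs only a matrix multiply, and by the Johnson-Lindenstrauss lemma it approximately preserves pairwise distances when k is large enough:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 1000))   # high-dimensional data

# RCA: project onto k random directions. No eigen-decomposition needed,
# so it is fast; scaling by 1/sqrt(k) keeps distances roughly unchanged.
k = 300
P = rng.normal(size=(1000, k)) / np.sqrt(k)
Z = X @ P

# Compare one pairwise distance before and after projection.
d_before = np.linalg.norm(X[0] - X[1])
d_after = np.linalg.norm(Z[0] - Z[1])
ratio = d_after / d_before   # close to 1
```

In contrast to PCA, the directions carry no guarantee of being informative individually; RCA compensates by typically keeping more dimensions than PCA would.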

# Linear Discriminant Analysis (LDA)

Finds a projection that best discriminates between the classes; unlike PCA, ICA, and RCA, it uses the labels.
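A sketch of Fisher's two-class LDA in numpy, on made-up Gaussian class data. The projection direction comes from the within-class scatter and the class means, so unlike PCA it depends on the labels:

```python
import numpy as np

rng = np.random.default_rng(4)
# Two classes that differ along the first axis only; the second axis is noise.
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[4.0, 0.0], scale=1.0, size=(200, 2))

# Fisher's LDA direction: w ~ Sw^{-1} (mu1 - mu0),
# where Sw is the within-class scatter.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
w = np.linalg.solve(Sw, mu1 - mu0)
w /= np.linalg.norm(w)

# Project to 1-D and classify with a midpoint threshold.
threshold = (mu0 + mu1) @ w / 2
acc = ((X0 @ w < threshold).mean() + (X1 @ w > threshold).mean()) / 2
```

Here w points almost entirely along the first axis, the direction that actually separates the labels; PCA on the pooled data could just as easily pick the noisy direction if it happened to have more variance.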