# Single Linkage Clustering

# Algorithm

  • Consider each object a cluster (n objects)
  • Define the intercluster distance as the distance between the closest two points in the two clusters. (There are also median and average linkage variants, which use the median or average of the distances between the two clusters instead.)
  • Merge the two closest clusters
  • Repeat n-k times to produce k clusters (a minimal sketch follows below)
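
A minimal sketch of the naive procedure in Python, assuming the data points are rows of a NumPy array (the function name `single_linkage` is just for illustration):

```python
import numpy as np

def single_linkage(points, k):
    """Naive single linkage: merge the two closest clusters n - k times."""
    n = len(points)
    # Pairwise distances between all points
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Start with every point in its own cluster
    clusters = [{i} for i in range(n)]

    while len(clusters) > k:
        best = (None, None, np.inf)
        # Intercluster distance = distance between the closest pair of points
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] |= clusters[b]   # merge the two closest clusters
        del clusters[b]
    return clusters

# Example: two obvious groups of 2-D points
pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
print(single_linkage(pts, k=2))
```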

# Runtime

O(N^3)

The merge step runs n-k times (roughly N when k is small), and each merge requires finding the closest pair of points across clusters, which costs O(N^2), giving O(N^3) overall.

# Drawback

Because single linkage merges clusters based only on the single closest pair of points, it tends to chain nearby points together, which can produce stretched-out, unintuitive clusterings that are not necessarily optimal, as in the following example.

(Figure: drawback of single linkage clustering)

# K-means Clustering

# Algorithm

  • Pick k centers (at random)
  • Each center claims its closest points
  • Recompute the centers by averaging the clustered points
  • Repeat until convergence (a minimal sketch follows below)
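
A minimal sketch of the above loop (Lloyd's algorithm) in NumPy; the function name `kmeans` and the convergence test are illustrative choices:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and center recomputation."""
    rng = np.random.default_rng(seed)
    # Pick k centers at random from the data
    centers = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iter):
        # Each center claims its closest points
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute the centers by averaging the clustered points
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return labels, centers

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
labels, centers = kmeans(pts, k=2)
print(labels, centers)
```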

# Properties

  • Each iteration is polynomial
  • Finite (exponential) iterations
  • Error monotonically decreases (provided ties are broken consistently when a point is equidistant from two centers)
  • Can get stuck if initial cluster centers are badly positioned.
    • Try random restarts
    • Try to find initial cluster centers that are spread out

# Soft Clustering

Sometimes when we're performing clustering on a dataset, there exist points which don't belong strongly to any given cluster. If we were to use something like k-means clustering, we're forced to make a decision as to which group an observation belongs to. However, it would be useful to be able to quantify the likelihood of a given observation belonging to any of the clusters present in our dataset. That's where soft clustering comes into play.

Formally, soft clustering (also known as fuzzy clustering) is a form of clustering in which observations may belong to multiple clusters.

# Algorithm

  • Identify the number of clusters you'd like to split the dataset into.
  • Define each cluster by generating a Gaussian model.
  • For every observation, calculate the probability that it belongs to each cluster (e.g. observation 23 has a 21% chance of belonging to Cluster A, a 0.1% chance of belonging to Cluster B, a 48% chance of Cluster C, and so forth).
  • Using the above probabilities, recalculate the Gaussian models.
  • Repeat until the observations' assignments more or less converge (a minimal sketch follows below).
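
A minimal sketch for 1-D data, assuming Gaussians with a fixed, shared variance and uniform mixing weights (the function name `soft_cluster` and the quantile-based initialization are illustrative assumptions):

```python
import numpy as np

def soft_cluster(x, k, var=1.0, n_iter=100):
    """EM for a mixture of k 1-D Gaussians with fixed variance and uniform weights."""
    # Spread-out initial guesses for the k means (simple quantile initialization)
    means = np.quantile(x, np.linspace(0.1, 0.9, k))

    for _ in range(n_iter):
        # E-step: probability that each observation belongs to each cluster
        densities = np.exp(-((x[:, None] - means[None, :]) ** 2) / (2 * var))
        resp = densities / densities.sum(axis=1, keepdims=True)
        # M-step: recompute each mean as a responsibility-weighted average
        means = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return resp, means

# Two overlapping 1-D groups centred near 0 and 4
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)])
resp, means = soft_cluster(x, k=2)
print(means)     # estimated cluster means
print(resp[:3])  # soft assignment probabilities for the first three observations
```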

# Alternative Explanation

Assume the data was generated by

  • Select one of the k Gaussians (with fixed, known variance) uniformly at random
  • Sample a point from that Gaussian
  • Repeat n times

Task: find the hypothesis (a set of k means, not to be confused with the k-means algorithm) that maximizes the probability of the data (maximum likelihood).

# Maximum Likelihood Gaussian (μ)

The maximum likelihood estimate of the Gaussian's mean is simply the sample mean of the data.
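
In symbols, for observations x_1, ..., x_n and known variance, maximizing the likelihood gives

```latex
\hat{\mu}_{\mathrm{ML}} = \arg\max_{\mu} \prod_{i=1}^{n} \mathcal{N}(x_i \mid \mu, \sigma^2) = \frac{1}{n} \sum_{i=1}^{n} x_i
```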

# Properties of Expectation Maximization (EM)

  • Every iteration has a monotonically non-decreasing likelihood
  • Does not converge in theory (in practice it effectively does, but because the parameters live in a continuous space of possible configurations, it can keep making ever-smaller improvements toward a limit without ever reaching it)
  • Will not diverge
  • Can get stuck
  • Works with any distribution (as long as the E and M steps are solvable)

# Clustering Properties

# Richness

For any assignment of objects to clusters, there is some distance matrix D such that the clustering scheme returns that assignment.

# Scale-invariance

A clustering algorithm should be invariant to the scale of the distances between the objects, since the relative distances between the points stay the same. For example, the clustering should not change when units are converted from kilometres to miles.
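
As a quick illustration, single linkage (here via SciPy) returns the same partition whether the coordinates are in kilometres or miles; the toy points and conversion factor below are just for demonstration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points, coordinates in kilometres
X_km = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
X_miles = X_km * 0.621371   # same points, rescaled units

labels_km = fcluster(linkage(X_km, method="single"), t=2, criterion="maxclust")
labels_miles = fcluster(linkage(X_miles, method="single"), t=2, criterion="maxclust")

print(labels_km, labels_miles)   # identical partitions
```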

# Consistency

Shrinking intra-cluster distances and expanding inter-cluster distances does not change the clustering.

(Figure: clustering properties quiz)

# Clustering Impossibility Theorem

No clustering scheme can achieve all three of richness, scale invariance, and consistency simultaneously.