Supervision
Introduction
Supervised learning methods are the most commonly used, but other approaches, such as unsupervised and semi-supervised learning, are actively developed and widely applied. Here we explore what supervision means, and when supervision might not be the best learning modality.
Developing Intuition
You will recall from the Starter Guide on the meaning of $f(X)$ in data science that the relationship $Y = f(X)$ has 3 constituent parts:
- The output, $Y$
- The function, $f$
- The data, $X$
Unsupervised learning is when we remove the output $Y$. Hence, we would merely have a function transforming our data. But what is the function doing to our data? How do we know it's the right function? One of the primary purposes of having $Y$ in the first place was to check whether our function works as expected. With no output to verify against, there is no supervision to confirm that the learning is accurate. So what use is this type of learning?
Another way to state this is that we are not attaching labels to data before training.
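To make the distinction concrete, here is a minimal sketch (using scikit-learn, which the notebooks below also rely on; the toy data is made up for illustration) contrasting a supervised fit, which receives both $X$ and $Y$, with an unsupervised fit, which receives only $X$:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
# toy data: 6 points in 2 dimensions (X), with labels (Y) for the supervised case
X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [9, 8], [8, 9]])
Y = np.array([0, 0, 0, 1, 1, 1])
# supervised: the model sees both X and Y, so we can verify its predictions
clf = LogisticRegression().fit(X, Y)
print(clf.predict(X))   # compare against Y to confirm the fit
# unsupervised: the model sees only X; there is no Y to verify against
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)       # group assignments, but no "right answer" to check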
Semi-Supervised Learning
Semi-supervised learning is when some of the data is labeled, but not all of it. Often this is because labels are valuable but expensive to produce at scale, so only a portion of the data gets labeled; the model then learns from the labeled and unlabeled portions together.
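As a minimal sketch of this idea (assuming scikit-learn; the toy data is made up for illustration), scikit-learn's semi-supervised tools mark unlabeled points with -1 and learn from both portions:
import numpy as np
from sklearn.semi_supervised import LabelPropagation
# toy data: 6 points, but only 2 of them have known labels
X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [9, 8], [8, 9]])
y = np.array([0, -1, -1, 1, -1, -1])   # -1 marks an unlabeled point
# LabelPropagation spreads the known labels to the unlabeled points
model = LabelPropagation(kernel="knn", n_neighbors=2).fit(X, y)
print(model.transduction_)             # inferred labels for all 6 points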
Types of Semi- or Unsupervised Learning
In general, the common approaches fall into a few categories, including:
- Clustering, including but not limited to K-means, hierarchical clustering, and mixture models
- Latent variable models, including but not limited to the method of moments, principal component analysis (PCA), and singular value decomposition (SVD)
- Anomaly identification, including but not limited to local outlier factor and isolation forest
When you want to factor in some labeled data, it becomes semi-supervised. If you want to use no labeled data at all, it is unsupervised.
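As a minimal sketch of each family (assuming scikit-learn; the synthetic data is generated for illustration):
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
# synthetic, unlabeled data: 100 points in 5 dimensions
X, _ = make_blobs(n_samples=100, n_features=5, centers=3, random_state=0)
# clustering (hierarchical): group similar points together
groups = AgglomerativeClustering(n_clusters=3).fit_predict(X)
# latent variable model (PCA): compress 5 dimensions down to 2
X_2d = PCA(n_components=2).fit_transform(X)
# anomaly identification (isolation forest): flag unusual points (-1 = anomaly)
flags = IsolationForest(random_state=0).fit_predict(X)
print(groups[:10], X_2d.shape, flags[:10])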
When Unsupervised Learning Could Be Used
- When you have lots of data that is hard to label at scale
- When you can't get accurate labels for your data
- When you have lots of computing power (unsupervised methods are often more computationally demanding than their supervised counterparts)
- When you are doing anomaly detection, building recommendation engines, or performing market segmentation
Code Examples
All of the code examples are written in Python, unless otherwise noted.
Containers
These code examples are Jupyter notebooks running in containers that come with all the data, libraries, and code you'll need to run them. Click here to learn why you should be using containers, along with how to do so.
Power of PCA
Power of PCA explores the basics of how PCA works by finding the principal components of a single image and reconstructing the image from those components. PCA is also an unsupervised learning technique, although this particular notebook explores its use as a dimension reduction technique more than its use for unsupervised learning.
#pull container, only needs to be run once
docker pull ghcr.io/thedatamine/starter-guides:power-of-pca
#run container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:power-of-pca
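If you want a feel for the technique before pulling the container, here is a minimal sketch of image reconstruction with PCA (this is not the notebook's code; it assumes scikit-learn and uses one of its bundled sample images):
import numpy as np
from sklearn.datasets import load_sample_image
from sklearn.decomposition import PCA
# load a bundled sample image and convert it to grayscale (rows x columns)
image = load_sample_image("china.jpg").mean(axis=2)
# keep only the top 30 principal components of the image's rows
pca = PCA(n_components=30)
compressed = pca.fit_transform(image)
# reconstruct an approximation of the image from those components
reconstructed = pca.inverse_transform(compressed)
print(image.shape, compressed.shape, reconstructed.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())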
K-means Clustering
This is an introductory notebook covering a common unsupervised learning technique, K-means clustering.
#pull container, only needs to be run once
docker pull ghcr.io/thedatamine/starter-guides:k-means-clustering
#run container
docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:k-means-clustering
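For a quick taste before opening the notebook, here is a minimal K-means sketch (assuming scikit-learn; the data is synthetic):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
# synthetic, unlabeled data: 200 points around 3 centers
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
# K-means finds 3 cluster centers and assigns every point to one of them
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)        # the 3 learned centers
print(km.predict([[0.0, 0.0]]))   # which cluster a new point falls into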
Need help implementing any of this code? Feel free to reach out to datamine-help@purdue.edu and we can help!