Theme 2: Maps or Clusters

October 23, 2022 (Last Modified: October 30, 2022)

In this theme, your objective is to create a 2D Map or Clustering of the ATUS data, to visualize that Map or Clustering, and to use visualization to show why the Map or Clustering is good.

In this theme, your goal is to create a 2D Map (embedding) of the sample population or a Clustering of the embedding. These are really similar, and you might think of both of them as embeddings: one is a continuous embedding from people to 2D, the other is from people to a discrete set of clusters.

You should create the clusters/maps based on how people use their time. People who use time similarly should be close together in the map / in the same cluster. Then you can use these clusters/maps to look for interesting things.

The naive way to do this is quite simple: each person is a 20- (or 400-) dimensional vector of time usage - we could throw these into our favorite clustering/embedding algorithm (e.g., k-Means clustering or UMAP embedding), and voilà! we’ve made a clustering or a map. Naive visualizations are obvious as well: you could show an embedding as a scatterplot, or a clustering as a tree-map.
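As a rough illustration of the naive pipeline, here is a minimal sketch using only NumPy. Everything in it is an assumption made for illustration: the data is a synthetic stand-in (not the ATUS data), `kmeans` is a plain Lloyd's-algorithm implementation, and `pca_2d` is a simple linear projection swapped in for UMAP so the example stays self-contained.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center (squared Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct people chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        # Move each center to the mean of its members (keep it if it has none).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return nearest_center(X, centers), centers

def pca_2d(X):
    """Project rows of X onto their top two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T

# Hypothetical stand-in for the real table: 300 people x 20 activity
# categories, each row summing to 1440 minutes (one day).
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(20), size=300) * 1440

labels, centers = kmeans(X, k=5)   # a clustering: person -> one of 5 clusters
coords = pca_2d(X)                 # a map: person -> point in 2D
```

The two outputs match the two kinds of embedding described above: `coords` could be drawn as a scatterplot and `labels` could color it. Nothing here says whether either result is any good.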

But the real problem: did we make a “good” embedding/clustering? I will divide good into (at least) three broad categories: “correct”, “meaningful”, and “interesting”. Correct refers to mathematical “correctness” (e.g., things in the same cluster really are numerically similar, and there aren’t too many weird outliers); meaningful refers to the map/clusters seeming to have some actual meaning (e.g., this cluster or region of the map is people who work a lot and sleep enough but still find time for hobbies); and interesting might refer to the ability to show interesting things, either about the variables you are using (e.g., who would have thought so many people spent so much time on education and shopping) or by connecting to other variables (e.g., this cluster was based on time usage, but seems to be all males in the southeast).
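One possible way to put a number on “correct” for a clustering is a per-person silhouette value: close to 1 when a person sits much nearer to their own cluster than to any other, negative when they look misplaced. The sketch below is one option among many, not a prescribed method; the data is synthetic, and one point is mislabeled on purpose so there is something to find.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette (b - a) / max(a, b), using Euclidean distance.

    a = mean distance to the other members of the point's own cluster
    b = smallest mean distance to the members of any other cluster
    Assumes at least two clusters; singleton clusters get a value of 0.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same <= 1:
            continue
        # D[i, i] is 0, so summing over the whole cluster excludes "self".
        a = D[i, same].sum() / (n_same - 1)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Hypothetical data: two well-separated groups of 20 people each.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 4)),
               rng.normal(10.0, 1.0, size=(20, 4))])
labels = np.array([0] * 20 + [1] * 20)
labels[0] = 1  # deliberately put person 0 in the wrong cluster

s = silhouette_values(X, labels)
suspects = np.where(s < 0)[0]  # candidates for "weird outliers"
```

Because the values are per person, they can be shown visually (for example as a color or a sorted bar chart per cluster) rather than collapsed into a single average.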

So, your real challenge in this theme isn’t just to make an embedding, but also to show that it is good (or to assess its goodness). This might be some thorough exploration that shows the properties we are seeking, or it might be building tools for interactively looking at the embedding/map that allow us to see how good it is, or some mix.

The problem of showing the clustering/embedding is closely related to showing that it is good. If you show the results in a meaningful way, you will probably need to convey some of the properties that make it good. A naive visualization is unlikely to convey either the utility or quality of the clustering/embedding.

Mixed in with this will be at least some work in actually trying to make a good embedding/clustering (through data curation, tuning and weighting, parameter adjustment, …).
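As one small example of what “tuning and weighting” could look like in practice, the sketch below rescales each activity column before clustering so that large categories (like sleep) do not dominate the distances, and then optionally weights columns. The function name, the toy numbers, and the weights are all hypothetical; whether to standardize at all is itself a design decision you would need to justify.

```python
import numpy as np

def standardize_and_weight(X, weights=None):
    """Z-score each column, then scale columns by optional weights."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # constant columns become all zeros instead of dividing by zero
    Z = (X - mu) / sd
    if weights is not None:
        Z = Z * np.asarray(weights, dtype=float)
    return Z

# Hypothetical minutes per day: sleep, work, volunteering, (an unused category).
X = np.array([
    [480.0, 500.0,  0.0, 0.0],
    [420.0, 540.0, 10.0, 0.0],
    [540.0,   0.0, 60.0, 0.0],
    [500.0, 120.0,  0.0, 0.0],
])

# Hypothetical choice: count volunteering twice as much as sleep or work.
Z = standardize_and_weight(X, weights=[1.0, 1.0, 2.0, 1.0])
```

After this step every varying column has a spread equal to its weight, so the weights directly control how much each activity contributes to Euclidean distance.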

You will almost certainly make multiple embeddings/clusterings, even if you are only looking for 1 or 2, and you might choose to do the assessment comparatively (how can you decide that one is better than another?).
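For maps, one way to compare candidates is to ask how many of each person's nearest neighbors in the original time-use space are still their neighbors in 2D. The sketch below is an assumption-laden illustration: the data is synthetic, the two "maps" are just different pairs of columns, and `neighbor_preservation` is a hypothetical helper, not a standard library function.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors (Euclidean) of every row of X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :k]

def neighbor_preservation(X_high, X_low, k=10):
    """Per-point fraction of high-dimensional k-NN that are also k-NN in the map."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(nn_high, nn_low)])

# Hypothetical data: 200 people, 10 variables, where the first two
# variables carry most of the variation.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X[:, :2] *= 20.0

map_a = X[:, :2]   # candidate map A: the two high-variance columns
map_b = X[:, 8:]   # candidate map B: two low-variance columns

score_a = neighbor_preservation(X, map_a)
score_b = neighbor_preservation(X, map_b)
better = "A" if score_a.mean() > score_b.mean() else "B"
```

The averages give a quick comparison, but the per-person scores are the more useful output for visualization: coloring each point of a map by its score shows where that map can and cannot be trusted.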

Realistically, you may not succeed at all aspects of this theme - you may focus on a few (you cannot do only #1 - you need some of the later ones):

1. Making a good embedding/clustering of the ATUS data
2. Displaying this embedding/clustering in a useful way (that allows the viewer to make use of the similarities, identify similar items, …)
3. Assessing (visually) that the embedding/clustering is “correct” (showing “regions” where things are right or wrong, identifying places where things are good/bad, …)
4. Assessing (visually) the meaningfulness (showing the meanings of clusters/map regions in terms of similar time usages)
5. Showing interesting things in the embedding/clustering (involving other variables in the visualizations)
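For the meaningfulness and “other variables” items in the list above, a simple starting point is to summarize each cluster by the activities where it differs most from the population average, and to count how a variable that was not used for clustering falls across clusters. This is only a sketch: the six people, the activity names, and the region codes below are made up for illustration.

```python
import numpy as np
from collections import Counter

def cluster_profiles(X, labels, names, top=2):
    """For each cluster, the activities whose mean differs most from the overall mean."""
    overall = X.mean(axis=0)
    profiles = {}
    for c in np.unique(labels):
        diff = X[labels == c].mean(axis=0) - overall
        order = np.argsort(-np.abs(diff))[:top]  # largest absolute difference first
        profiles[int(c)] = [(names[j], float(diff[j])) for j in order]
    return profiles

# Hypothetical minutes per day for six people (each row sums to 1440).
names = ["sleep", "work", "leisure", "education"]
X = np.array([
    [420, 600, 300, 120],
    [450, 570, 330,  90],
    [430, 590, 320, 100],
    [500, 100, 400, 440],
    [520,  60, 420, 440],
    [510,  80, 410, 440],
], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

# Hypothetical variable that was NOT used to build the clusters.
region = ["SE", "SE", "MW", "MW", "NE", "MW"]

profiles = cluster_profiles(X, labels, names)
# How many people fall in each (cluster, region) combination.
counts = Counter(zip(labels.tolist(), region))
```

The profile differences are the kind of thing that can become a cluster label ("works much more than average, little education time"), and the counts are raw material for a visualization that links clusters to other variables.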

Archive of the Fall 2022 Class

This web page is from the Fall 2022 CS765 (Data Visualization) class.
