Theme 2 (Maps and Cluster) Comments

Comments that apply to theme 2. Be sure to read the comments for “all themes” as well.

We had a variant of this theme for last year’s project. Here is a link (with some good references): https://pages.graphics.cs.wisc.edu/765-21/assigns/dc2-embeddings/.

Some extra readings from last year (found by students - I haven’t looked at all of these):

2021 Readings
  • L. G. Nonato and M. Aupetit, “Multidimensional Projection for Visual Analytics: Linking Techniques with Distortions, Tasks, and Layout Enrichment,” in IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 8, pp. 2650-2673, 1 Aug. 2019, doi: https://doi.org/10.1109/TVCG.2018.2846735.

  • Rene Cutura, Michaël Aupetit, Jean-Daniel Fekete, and Michael Sedlmair. 2020. Comparing and Exploring High-Dimensional Data with Dimensionality Reduction Algorithms and Matrix Visualizations. In Proceedings of the International Conference on Advanced Visual Interfaces (AVI ‘20). Association for Computing Machinery, New York, NY, USA, Article 10, 1–9. DOI: https://doi.org/10.1145/3399715.3399875

  • Svetlana Ovchinnikova and Simon Anders. “Exploring dimension-reduced embeddings with Sleepwalk.” 30:749–756. Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7263188/pdf/749.pdf

  • Alireza Vafaei Sadr, Bruce A. Bassett, and M. Kunz. “A flexible framework for anomaly detection via dimensionality reduction.” https://link.springer.com/content/pdf/10.1007/s00521-021-05839-5.pdf

  • Youngjoo Kim, Alexandru C. Telea, Scott C. Trager, and Jos B. T. M. Roerdink. “Visual Cluster Separation Using High-Dimensional Sharpened Dimensionality Reduction.” https://arxiv.org/abs/2110.00317

  • A. Chatzimparmpas, R. M. Martins, and A. Kerren. “t-viSNE: Interactive Assessment and Interpretation of t-SNE Projections.” IEEE Transactions on Visualization and Computer Graphics, 26(8):2696–2714, 2020. https://arxiv.org/pdf/2002.06910.pdf

  • “Latent Space Cartography: Visual Analysis of Vector Space Embeddings” https://idl.cs.washington.edu/papers/latent-space-cartography/

Some generic comments (not necessarily in response to any particular proposal):

  1. In my mind, clusters and maps are both kinds of the same thing: in one case we take a collection of high-dimensional items and place them into the “small space” of a few discrete categories; in the other, we target the small space of 2D.

  2. Simple clustering or embedding methods will probably require a more clever distance metric to get meaningful results. The data is not nicely distributed along all the axes. Some normalization is necessary (since different categories have wildly different magnitudes), but you don’t want to completely throw out those differences either (it means something that people sleep a lot more than they talk on the phone). Be sure to document the choices that you make.

  3. A counterpoint to #2: don’t get too hung up on trying to make a “good” embedding or map: the focus is visual tools to understand what you’ve made. In fact, a visual tool that lets you understand what is bad about doing a simple thing (like using L^2 norms in standard algorithms with simple normalization) might be a successful project. A huge win would be if your tool gave you enough insight into what went wrong with the simple thing that you came up with a better approach and an improved map/clustering.

  4. Think about what tasks the clusters/maps might be used for. Your tool might help the user do those tasks - but importantly, it should help them assess whether the map/clustering is appropriate for those tasks.

  5. One thing to think about: given a cluster (or a region of the map), can you “summarize” the set of items? Preferably in a visual way - but hopefully, you would see why things were grouped together (e.g., this is the cluster of people who sleep a lot).
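To make comment 2 concrete, here is a minimal sketch of per-category normalization before computing distances. The activity categories and numbers are invented for illustration; z-scoring each column is just one of several reasonable choices:

```python
import numpy as np

# Hypothetical time-use matrix: rows = respondents, columns = activity
# categories (minutes per day: sleep, phone, TV). Values are made up.
X = np.array([
    [480.0,  30.0, 120.0],
    [420.0,  90.0, 180.0],
    [510.0,  15.0,  60.0],
])

# Per-column z-scoring puts categories with wildly different scales
# (sleep vs. phone calls) on comparable footing, while still keeping
# the fact that some respondents sleep more than others.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sigma

def dist(a, b):
    """Euclidean distance in the z-scored space."""
    return float(np.linalg.norm(a - b))

d01 = dist(Z[0], Z[1])
```

Whatever transform you pick, the point stands: document it, because it bakes assumptions into every distance downstream.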

Specific numbered comments given to at least one student (these would appear as 2.X in your list). Again, the list given in the feedback may not be exhaustive.

  1. Using an embedding to evaluate clusters (or clusters to evaluate embeddings) creates a chicken-and-egg problem. Worse, if they are both based on the same metric, they will have similar problems if the metric is wrong.

  2. Applying reduction techniques such as PCA to the data may not preserve the important properties (since each row sums to 24 hours, it isn’t really a linear space). It may or may not give meaningful results.

  3. Even basic algorithms require lots of tweaking and tuning - you may want to use visualization to help you decide that you’ve gotten good values for the parameters.

  4. The goal is less to tell a specific story with the data than to have a tool for figuring out which stories might emerge from clustering/DR and validating that the stories do mean something.

  5. The project description suggested doing clustering/mapping based on the time measurements themselves. You could cluster based on the demographics variables instead, but be careful: these are non-numeric (they are categorical and ordinal variables encoded as numbers), and naive use of these as vectors will not work well. What I had envisioned was clustering/mapping based on the time measurements and then showing the other variables on top of that (to see if there are correlations).

  6. This project gets away from the spirit of the assignment. It seems that you are choosing pairs of variables as mappings (choosing 2 variables is an embedding - it reduces the dimensionality by throwing away the other variables) - so this is more a manual exploration for correlations across the different embeddings. This might be a different but feasible project - I would think the key here would be visualization tools that let you efficiently scan through all the possible plots to find interesting things.

  7. Trying to compare different choices (for example, is one clustering better than another) is a good way to approach the problem. Sometimes it is easier to show “why is X better than Y?” than “is X good?”. You could try this for different clustering algorithms, or parameter settings or …
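Regarding comment 2 above (PCA on rows that sum to 24 hours): one standard way to handle such compositional data is a centered log-ratio (CLR) transform before applying linear methods. A sketch with made-up hours:

```python
import numpy as np

# Rows are compositions: hours per activity, summing to 24.
# Values are illustrative, not from the actual dataset.
H = np.array([
    [8.0,  8.0, 8.0],
    [6.0, 10.0, 8.0],
    [9.0,  7.0, 8.0],
])

def clr(rows):
    """Centered log-ratio: map each composition off the simplex by
    taking logs and subtracting the per-row mean log."""
    logs = np.log(rows)
    return logs - logs.mean(axis=1, keepdims=True)

C = clr(H)
# In CLR space each row is mean-centered, removing the sum-to-24
# constraint, so PCA behaves more like it does on free vectors.
```

Note CLR needs strictly positive entries, so zero-hour activities would need handling (e.g., a small pseudocount) before this applies.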
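Regarding comment 3 (parameter tuning): a quick sweep over the number of clusters, scored with the silhouette coefficient, is exactly the kind of thing a small visualization could sit on top of. A sketch using scikit-learn on synthetic two-blob data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy stand-in for the (normalized) time-use matrix: two clear blobs,
# so the "right" number of clusters is known to be 2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2))])

# Sweep the number of clusters and score each choice; plotting these
# scores is a simple visual justification for a parameter setting.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On the real data the picture will be far less clean, which is where showing the whole score curve (rather than just the argmax) earns its keep.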
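Regarding comment 5 (demographic categories encoded as numbers): one-hot encoding is the usual way to avoid treating category codes as magnitudes. A sketch with invented region codes:

```python
import numpy as np

# Hypothetical demographic column: region codes stored as integers.
# Treating these as magnitudes (region 4 "greater than" region 1) is
# meaningless; one-hot encoding avoids that.
region = np.array([1, 3, 1, 4, 3])

categories = np.unique(region)                       # [1, 3, 4]
one_hot = (region[:, None] == categories).astype(float)
# one_hot has one column per category; distances between rows now only
# say "same region or not", with no spurious ordering.
```

For genuinely ordinal variables (e.g., education level), a monotone numeric coding can be defensible - the point is to choose deliberately, not to feed the raw codes to a distance metric by accident.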
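Regarding comment 7 (comparing clusterings): the adjusted Rand index is one off-the-shelf way to quantify how much two clusterings agree, which can complement a visual comparison. A sketch with toy labels:

```python
from sklearn.metrics import adjusted_rand_score

# Two hypothetical clusterings of the same six respondents. The label
# values themselves don't matter, only the grouping they induce.
labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [1, 1, 0, 0, 2, 2]   # same grouping as labels_a, relabeled
labels_c = [0, 1, 0, 1, 0, 1]   # a quite different grouping

agreement_same = adjusted_rand_score(labels_a, labels_b)  # perfect: 1.0
agreement_diff = adjusted_rand_score(labels_a, labels_c)  # much lower
```

A single number like this won't tell you *why* X beats Y, but it is a cheap sanity check to pair with a visual tool that shows where two clusterings disagree.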