Theme 1 (Subgroups) Comments

Comments that apply to theme 1. Be sure to read the comments for “all themes” as well.

A key thing in my mind is that the tool should help the user consider many different potential subgroups simultaneously - to help them figure out good ways to divide the population.

Last year, there was a similar project theme, which included some recommended readings: 2021’s Grouper Project Theme.

Here is an anti-design (exactly what I am not thinking of): the system could allow the user to set filters to define a group, and tell the user how many samples are in the group. In order to decide a good way to divide things into groups, the user needs to try everything.

That isn’t to say that filtering (especially interactive filters) can’t be a useful tool - or part of a strategy.

Here’s a different “straw man” design: a tool could allow the user to pick N variables. This forms 2^N groups (if they are binary variables), or more groups if the variables have more than two values. The tool could tell the user the minimum number of samples in any group. Or it could present a list of the groups (2^N or more of them), each with its number of elements. This might help me quickly decide if there are insufficient samples - but not necessarily to know what to do about it, or to find groups of N variables.
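To make the straw man concrete, here is a minimal sketch in pandas. It assumes a toy table with made-up column names (`sex`, `employed`, `has_kids` are illustrative, not actual ATUS variables), at least two chosen variables, and no missing values. The helper `intersection_counts` is hypothetical; it simply counts the samples in every combination of levels, keeping the empty combinations.

```python
import pandas as pd

def intersection_counts(df, variables):
    """Count samples in every combination of levels of `variables`.

    Combinations that never occur are kept with a count of 0.
    Assumes two or more variables and no missing values.
    """
    observed = df.groupby(variables).size()
    levels = [sorted(df[v].unique()) for v in variables]
    full_index = pd.MultiIndex.from_product(levels, names=variables)
    return observed.reindex(full_index, fill_value=0)

# Toy data with hypothetical column names (not real ATUS fields).
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F", "M"],
    "employed": ["yes", "no", "yes", "yes", "yes", "yes"],
    "has_kids": ["yes", "yes", "no", "no", "no", "no"],
})

counts = intersection_counts(df, ["sex", "employed", "has_kids"])
smallest_group = int(counts.min())      # the "minimum number of samples in any group"
n_voids = int((counts == 0).sum())      # how many groups are empty
print(counts)
print("smallest group:", smallest_group, "| empty groups:", n_voids)
```

This reproduces the weakness of the straw man: the output is a list of 2^N numbers, which answers “is there enough data?” but gives little help in deciding what to change.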

Here are some example tasks (these are just examples - you can/should think of others - and not all tools will address all tasks):

  1. I have variables in mind (maybe 4-5). How do the samples distribute over the many intersectional groups formed? Are things evenly distributed? Are there “voids” (bins with no samples)? Where are they?
  2. I have variables in mind (maybe 3-5). How do I need to adjust the bins so that I get enough samples in each bin? I might care about both balance in the bin values and quantities.
  3. I want to know what variables lead to acceptable groupings. (e.g., find sets of 4 variables where all combination groups have enough samples).
  4. Once I’ve picked a few variables to create groups… which variables are still possible to further divide things?
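As one illustration of task 3, the computation underneath could be sketched as a brute-force search over variable subsets. This is only a sketch under stated assumptions: the function name `acceptable_variable_sets`, the toy columns `a`, `b`, `c`, and the threshold are all made up for the example, and the exhaustive loop over subsets is the simplest possible strategy, not a recommendation. The interesting design problem is how to show the user the results.

```python
import itertools
import math
import pandas as pd

def acceptable_variable_sets(df, candidates, k, min_count):
    """Return (subset, smallest_group_size) for each k-variable subset
    whose smallest intersectional group has at least `min_count` samples."""
    results = []
    for subset in itertools.combinations(candidates, k):
        cols = list(subset)
        observed = df.groupby(cols).size()
        n_possible = math.prod(df[c].nunique() for c in cols)
        # A combination that never occurs is absent from `observed`,
        # so the smallest group is really 0 in that case.
        smallest = 0 if len(observed) < n_possible else int(observed.min())
        if smallest >= min_count:
            results.append((subset, smallest))
    return results

# Toy data with hypothetical variables.
toy = pd.DataFrame({
    "a": ["x", "x", "y", "y", "x", "x", "y", "y"],
    "b": ["p", "q", "p", "q", "p", "q", "p", "q"],
    "c": ["u", "u", "u", "u", "v", "v", "v", "u"],
})

pairs_ok = acceptable_variable_sets(toy, ["a", "b", "c"], k=2, min_count=2)
triples_ok = acceptable_variable_sets(toy, ["a", "b", "c"], k=3, min_count=1)
print(pairs_ok)
print(triples_ok)
```

In the toy data only the pair (`a`, `b`) has at least 2 samples in every group, and no triple works because one of the 8 three-way combinations never occurs.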

Some specific comments on students’ proposals:

  1. I am skeptical of building things on top of Tableau for this. It may be possible (Tableau never ceases to amaze me). The obvious strategies to explore this with Tableau involve having the user do too much work, and relying on limited encodings (to show 1 or 2D spaces). You may have other things in mind.

  2. Some proposals gave specific variables or groupings: remember the goal is a tool that helps the viewer see things for different groupings. You might show your tool with specific examples.

  3. Beware of relying too much on having the user specify too much.

  4. The goal is less to tell a specific story with the data than to have a tool for figuring out which stories can be researched (by having sufficient data). Of course, if good stories emerge as examples…

  5. The readings from “high-dimensional” week (Readings 07: High-Dimensional Data) may or may not be helpful. The issue is that the problem isn’t necessarily high dimensional (we’re talking 4 or 5 dimensions) - it’s that we’re trying to show a dense multi-dimensional matrix. I don’t know of much to read about this. If you find something good, please post on the discussion! There are some readings listed in last year’s project theme (listed below).

  6. Some groups noticed that the variables have different types (generally, they are categorical, but in some cases they are ordered, or even really numerical/interval). This does have a big impact! (I’m giving this comment to a group that observed it)

  7. Hopefully, the ideas from working on 4D will help with 5D, and 5D with 6D, … In fact, solutions might scale to many dimensions (maybe once you get to 5 or 6D, you end up with a general solution for N?).

  8. Glad that you are finding resources! Please consider sharing on the class discussion.

  9. It is good that you are concerned about efficiency - processing the whole ATUS dataset might be time consuming. But this shouldn’t be a key element (if you want to make things easier by subsetting the data, that’s OK). I recommend trying to make something that is so good that you care about wanting it to be faster. The dataset should be small enough to fit into memory if you use efficient mechanisms (like pandas in Python, or a database like SQLite).

  10. When combining groups to make bigger groups: consider that not all combinations may be meaningful.

  11. Some groups mentioned a mobile app (or maybe, multiple people from the same group) - this is cool, but might be harder than a simple web-based thing.

  12. The idea of thinking of this as exploring different groupings is good. But remember that the groupings need to be intersectional.

  13. Great that you have some specific design ideas! Think about how these ideas connect to task, and how you will decide if they are good enough to warrant implementation.

  14. Remember that our focus is on the quantity of samples in the groups, not necessarily on the values (like the averages, …).

  15. The entire data set is easy to hold in memory in a backend (desktop) application, but it might be a bit much to hold entirely in the front-end without some cleverness. If you want to build a web-based thing, you may need to consider having a backend that does some of the processing.

  16. Designs that show a single variable at a time (or bin a single variable at a time) may not work when intersections are present. You can have two variables that are both uniformly sampled, but their intersection can be really sparse. (for a trivial example, consider the case where the two variables have the same value - if you bin each into N bins, only N of the N^2 bins will have anything in them)
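The trivial example in comment 16 is easy to check numerically. The sketch below uses synthetic data (1000 uniform random values, 5 bins, and a fixed seed are arbitrary choices for illustration): it makes the second variable an exact copy of the first, bins each into N bins, and compares the per-variable counts with the joint counts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 5

x = rng.uniform(0.0, 1.0, size=1000)  # first variable, uniformly sampled
y = x.copy()                          # second variable has the same value

edges = np.linspace(0.0, 1.0, n_bins + 1)

# joint[i, j] = number of samples with x in bin i and y in bin j
joint, _, _ = np.histogram2d(x, y, bins=[edges, edges])

marginal_x = joint.sum(axis=1)  # counts per bin of x alone
marginal_y = joint.sum(axis=0)  # counts per bin of y alone
occupied = int((joint > 0).sum())

print("x bins:", marginal_x)
print("y bins:", marginal_y)
print("occupied joint bins:", occupied, "of", n_bins ** 2)
```

Each variable looks fine on its own (every bin is populated), yet only the N diagonal cells of the N^2 joint bins contain anything. A design that bins one variable at a time would never reveal this.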

Last year, I required all students to find 1 paper and post it to a discussion. Some readings from last year’s students (I have not checked these!):

Theme 1 References from 2021
  • It turns out that Googling “visualizing subgroups” can result in finding a paper called Subgroup Visualization. This is a conference paper that you can find here. The authors present a novel visualization method, which might help inspire some ideas if you are thinking through things, but I like the paper’s approach to the problem motivation. If you are wondering “why bother with visualizing subgroups when I could just use a statistical analysis method?” here is a quote from the paper:

    Since subgroup discovery is a task of descriptive induction, the visualization of results is crucial for presenting the results to the end user. The subgroup visualization task is to visualize the subgroups detected by subgroup discovery algorithms.

  • There is an interesting paper, called Visualizing Dynamic Hierarchies in Graph Sequences, discussing how to deal with dynamic changes when visualizing subgroups with many dimensions.

  • Automated Data Slicing for Model Validation: A Big data - AI Integration Approach link

  • “A critical review of graphics for subgroup analyses in clinical trials” by Ballarini et al. covers an array of graphics currently used in clinical trial subgroup analysis. link

  • On the quest to find papers other people hadn’t posted yet, I found two that discussed Subgroup mining and created interactive tools to facilitate subgroups. Mining and Visualizing the Evolution of Subgroups in Social Networks

  • Evaluation of Hierarchical Visualization for Large and Small Hierarchies https://ieeexplore.ieee.org/document/9373115