There was recently a paper made famous because a low-resolution Obama was “upsampled” to become a high-resolution white man. That paper captures a combination of bias in data sets and algorithmic choices to show one example image rather than the distribution of images. This paper created quite a bit of controversy, including a widely publicized exchange between Yann Lecun and Timnit Gebru. I encourage everyone in the lab to read about this exchange.
But the rest of this post goes in another direction. What else can we do with this “upsampling” ability? Here is one crazy idea, ripped from the reality of our modern lives — we spend a huge amount of time in front of screens, in zoom calls, for example. Are we unwittingly sharing what is on our screen? Or, to put this in the constructive setting, suppose I have a picture of you, taking from your webcam, while you are looking at a screen (that I can’t see directly). Could I reconstruct your screen based on what I can see through your webcam?
Probably there will be papers that try to do this based on direct reflections in your glasses or from your eye. I’m wondering if we can do something more, and the *first* thing I think about as I think about papers like this is to ask, “where could I get data for this? Well, I’ve been on a rampage about reaction videos for the last 2 years. These are videos that people (often video-bloggers, or vloggers) share, of them watching other videos, and a common tradition is to use video editing software to put the “being watched video” into the video of the vlogger who is watching and reacting to that video.
This offers an amazing dataset. Each reaction video shows: (1) video taken from the webcam (or a camera) of the vlogger, and (2) in the inset, the video that is on the screen the video logger is watching. So the question becomes, can we predict or estimate the inset image, based on subtle changes in the main video. The below youtube video gives one example (go to 6:10 if it doesn’t start there automatically):
Notice the reflection on the wall on the right captures the color of what is being shown on the screen. Also, the microphone itself changes color. Some of these are cues we might be able to learn geometrically (following, for example, the sparklevision paper):
Below is a list of other youtube videos where I see a clear “response” of something in the scene to the video that is being reacted to. In this case, in the glasses of the vlogger (which move around and might be difficult to use):
and then, as if to troll me, there’s literal glitter (sequins) in the background of the next vlogger. I don’t actually by eye see the glitter responding to the screen, but it must be, right? (and I’m going to write some code to check for this).
and in the below video I see reflections in the game controllers on the table:
So what are “practical” applications of this? Well, you could try to exactly reconstruct an unknown image, but the resolution of that is not likely to be good enough to, for example, read their e-mail. But I’m struck by noticing that we spend our days in front of many screens like this the one below. What is actually showing on the screen that each of these people are looking at?
This is a nice question because you can ask it without completely reconstructing the image. In zoom there is a default “big” picture that zoom selects to show based on who is talking (a) which of these 9 people are looking at that screen? (b) you could ask, for each person, are they showing the grid of all 9, or which 1 of the 9 are they looking at? (c) Can you improve your response to (b) by looking at a time-sequence? (d) Can you index other popular web-pages, live-streams, or media so that you can detect other things that might be showing on someone’s screen?
I’d be excited to see any other videos, or classes of videos anyone can find that show more of this. I think that watching reaction videos from gamers might be a win; they are used to sitting close to a large screen with an otherwise darkish room (sorry, stereotyping, i know!), which is likely to create the easiest datasets to use.