Paper Idea: Are You Unknowingly Sharing Your Screen?

There was recently a paper made famous because a low-resolution Obama was “upsampled” to become a high-resolution white man. That result reflects a combination of dataset bias and the algorithmic choice to show one example image rather than the distribution of possible images. The paper created quite a bit of controversy, including a widely publicized exchange between Yann LeCun and Timnit Gebru. I encourage everyone in the lab to read about this exchange.

But the rest of this post goes in another direction. What else can we do with this “upsampling” ability? Here is one crazy idea, ripped from the reality of our modern lives: we spend a huge amount of time in front of screens, in Zoom calls, for example. Are we unwittingly sharing what is on our screens? Or, to put this in the constructive setting, suppose I have a picture of you, taken from your webcam, while you are looking at a screen (that I can’t see directly). Could I reconstruct your screen based on what I can see through your webcam?

Probably there will be papers that try to do this based on direct reflections in your glasses or from your eye. I’m wondering if we can do something more, and the *first* thing I think about with papers like this is to ask, “where could I get data for this?” Well, I’ve been on a rampage about reaction videos for the last 2 years. These are videos that people (often video-bloggers, or vloggers) share of themselves watching other videos, and a common tradition is to use video editing software to put the “being watched” video as an inset inside the video of the vlogger who is watching and reacting to it.

This offers an amazing dataset. Each reaction video shows: (1) video taken from the webcam (or a camera) of the vlogger, and (2) in the inset, the video that is on the screen the vlogger is watching. So the question becomes: can we predict or estimate the inset image based on subtle changes in the main video? The YouTube video below gives one example (go to 6:10 if it doesn’t start there automatically):

Notice that the reflection on the wall on the right captures the color of what is being shown on the screen. Also, the microphone itself changes color. Some of these are cues we might be able to learn geometrically (following, for example, the SparkleVision paper):

Below is a list of other YouTube videos where I see a clear “response” of something in the scene to the video that is being reacted to. In this case, the response is in the glasses of the vlogger (which move around and might be difficult to use):

And then, as if to troll me, there is literal glitter (sequins) in the background of the next vlogger. By eye I don’t actually see the glitter responding to the screen, but it must be, right? (And I’m going to write some code to check for this; a sketch of what I have in mind is below.)
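
Here is a minimal sketch of the check I have in mind (using OpenCV and NumPy; the file name and region coordinates are hypothetical): track the mean color of the sequin patch and of the inset video over time, and see whether the two signals correlate.

```python
import cv2
import numpy as np

def region_signal(video_path, box):
    """Mean BGR color of a rectangular region (x, y, w, h) in every frame."""
    x, y, w, h = box
    cap = cv2.VideoCapture(video_path)
    signal = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        signal.append(frame[y:y + h, x:x + w].reshape(-1, 3).mean(axis=0))
    cap.release()
    return np.array(signal)  # [T, 3]

# Hypothetical boxes: the sequin background and the inset "being watched" video.
sequins = region_signal("reaction.mp4", (50, 200, 100, 100))
inset = region_signal("reaction.mp4", (900, 50, 320, 180))
T = min(len(sequins), len(inset))
for c, name in enumerate("BGR"):
    r = np.corrcoef(sequins[:T, c], inset[:T, c])[0, 1]
    print(f"{name} channel correlation: {r:.3f}")
```

If the correlation is clearly above chance, the sequins are responding to the screen even if my eye can’t see it.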

And in the video below I see reflections in the game controllers on the table:

So what are “practical” applications of this? Well, you could try to exactly reconstruct an unknown image, but the resolution is not likely to be good enough to, for example, read someone’s e-mail. But I’m struck by noticing that we spend our days in front of many screens like the one below. What is actually showing on the screen that each of these people is looking at?

This is a nice question because you can ask it without completely reconstructing the image. In Zoom there is a default “big” picture that Zoom selects to show based on who is talking. (a) Which of these 9 people are looking at that screen? (b) For each person, are they showing the grid of all 9, or which one of the 9 are they looking at? (c) Can you improve your answer to (b) by looking at a time-sequence? (d) Can you index other popular web pages, live-streams, or media so that you can detect other things that might be showing on someone’s screen? (A very simple baseline sketch for (b) and (c) is below.)
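
For (b) and (c), here is a very simple baseline sketch (all inputs are hypothetical): extract an ambient color signal from the webcam (as in the glitter check above), build a candidate signal for each possible screen content, and pick the candidate whose time-sequence correlates best.

```python
import numpy as np

def best_matching_screen(ambient, candidates):
    """
    ambient:    [T, 3] mean-color signal from the webcam (e.g., a wall patch)
    candidates: list of [T, 3] signals, one per possible screen content
    returns the index of the best-correlated candidate
    """
    def score(a, b):
        a = a - a.mean(axis=0)
        b = b - b.mean(axis=0)
        return (a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return int(np.argmax([score(ambient, c) for c in candidates]))
```

Longer time-sequences should make this discrimination easier, which is exactly question (c).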

I’d be excited to see any other videos, or classes of videos, that anyone can find that show more of this. I think that watching reaction videos from gamers might be a win; they are used to sitting close to a large screen in an otherwise darkish room (sorry, stereotyping, I know!), which is likely to create the easiest datasets to use.

Current Issue and Potential Solution in Dissimilarity Visualization

Issue:

The activation map and the pooled vector (before the fc layer) in ResNet-50 are non-negative, since they come after a ReLU. As a result, no component of the dot product can be negative.

Potential solutions:

1. Remove the ReLU layer before the pooling layer.

2. Based on the weight equation in Grad-CAM,

α_k^c = (1/Z) Σ_i Σ_j ∂Y^c / ∂A_ij^k,

where W is the weight matrix of the fc layer, Y^c is the score for category c, A^k is the activation map of channel k, and (i, j) is the spatial location.

This equation is the goal of Grad-CAM: find a weight for each activation map and combine them.

In similarity visualization for an embedding, the goal is the same: we try to find ∂D/∂A, the contribution of each activation map channel to the distance between two image vectors. Here D is the dot product of fc_vect_a and fc_vect_b. Since ∂D/∂A = (∂D/∂Y)(∂Y/∂A), where Y is the fc-layer vector, and based on the Grad-CAM equation above, we get ∂D/∂A = fc_vect · W, where the fc_vect and the A come from different images.

So we can use the vector after the fc layer and the W matrix to get a new vector (which can have negative values) to replace the original pooled activation vector (see the sketch after this list).

3. We can forward the vector at each spatial location of the activation map through the fc layer and use the output.
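
To make option 2 concrete, here is a minimal sketch (assuming PyTorch, a bias-free linear fc layer with weight W of shape [emb_dim, channels], and hypothetical variable names) of computing the signed per-channel and spatial contributions to the dot product D:

```python
import torch

def dissimilarity_contributions(act_a, pooled_a, emb_b, W):
    """
    act_a:    [C, H, W]  last-conv activation map of image a (non-negative)
    pooled_a: [C]        global-average-pooled vector of image a
    emb_b:    [E]        fc-layer embedding of image b
    W:        [E, C]     fc-layer weight matrix
    """
    # D = (W @ pooled_a) . emb_b, so dD/d(pooled_a) = W^T emb_b, which can be negative.
    channel_weights = W.t() @ emb_b                         # [C]
    per_channel = channel_weights * pooled_a                # signed contribution of each channel to D
    spatial_map = (channel_weights[:, None, None] * act_a).sum(dim=0)  # [H, W]
    return per_channel, spatial_map
```

Keeping only the channels (or spatial locations) with negative contributions, weighted by how negative they are, gives exactly the kind of signed map that the non-negative pooled features could not.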

Deep Wheat: t-SNE Plots of Different Coloring Methods & Std of Cultivars

This experiment was trained for 200 epochs; I also made the t-SNE plot at epoch 100 to check whether we over-trained.

exp_name: g_8_ep_200_R50
model: ResNet-50
group size: 8
epochs: 200
data: 97 cultivar testing data

For the 3 cultivars that other cultivars are most likely to be predicted as, I colored them in the t-SNE plot of epoch 200, and also made the “red vs. blue” t-SNE plot for the training and testing data.

I also computed the standard deviation of each cultivar’s feature vectors in both the training and testing data. The average std over the training data is 10.191, the minimum is 8.584, and the maximum is 17.640. The stds of the 3 special cultivars (217, 266 and 292) are 9.798, 10.427 and 11.361, so it’s hard to tell if they are “small and well clustered”.

For the training data:
array([10.46416187, 10.52323914, 10.40782642,  9.80121613, 10.91461849,
       10.59361935,  9.27608776, 10.63208294, 10.70333767,  9.64982319,
       10.78457069, 10.37274933,  9.98364162,  9.94653797,  9.72254467,
        9.2923727 , 10.25147915, 10.64735985, 10.51951218,  9.6249609 ,
       10.06981087,  9.56434917, 11.33858013, 10.03707123,  8.90496445,
       11.01442051, 10.0634861 , 10.67785645,  9.28373718,  9.8069067 ,
        8.74895096,  9.13527775, 10.45289898, 10.59641647, 10.1052494 ,
       12.39895058,  9.28281784, 10.66273308, 10.1367321 ,  9.09992504,
       10.6760788 ,  9.58821964, 10.77769184,  9.89109039,  9.76054573,
       10.47741604, 10.04706287, 10.15864277,  9.57283783,  9.16854095,
       10.62602806,  8.94312286,  9.55071926, 10.03430557,  9.92059803,
       10.5079565 , 10.24594975, 10.28643513, 11.30548   , 10.32584286,
       10.35098648,  9.36228371,  9.44154644, 10.26055527,  9.71262646,
        9.95606804, 10.14754963, 10.73396015,  8.83175564,  8.85721397,
        9.95898533, 10.3237114 ,  9.72339725, 10.00537491, 10.46583652,
        9.92854786, 10.02325344,  9.46943188, 12.16626072, 10.05992985,
        9.71245289,  9.77691841,  9.98790359,  9.84941006, 10.17197227,
        9.49926472, 10.93229866,  9.89042568, 10.04704285, 10.11202145,
        8.74681091,  9.77938843,  9.46885395,  9.62973976,  9.26524353,
       10.35689545, 10.31644821, 10.14037132,  8.87110901,  9.36151028,
       11.10306263, 11.20150185,  9.50447655,  9.8961916 , 10.21476841,
       10.30224895,  9.77637482, 10.64656258, 10.57614517, 10.79774189,
        9.24985695,  9.78373623, 11.80111408, 10.16897488,  9.2091856 ,
        9.65493393,  9.95381737, 11.11535645,  9.88694859,  9.53554153,
       10.16892147, 10.32589912, 10.41562176, 10.7139616 , 11.67827702,
        9.69061565,  9.33803082,  9.53231525, 10.19188213, 10.18854046,
        9.8159399 , 10.26011944, 10.29724884,  9.97103977,  9.67305756,
        9.7047348 , 11.54331398,  9.89892387,  9.73020649, 10.45486641,
        9.7187376 , 10.08184147, 10.53874493,  9.79509926, 10.14669132,
       10.35056114,  9.66332531,  9.39208794, 10.43498516, 10.5133009 ,
       10.57205105, 10.68231487, 10.11015797, 10.02883911,  9.97161293,
        9.02976131, 10.16350555, 11.41958523,  9.62627125, 10.69648933,
       11.61301708,  9.37502575,  9.04355717, 10.27496719,  9.30772495,
       10.48779106, 10.23538589, 10.22371006, 10.37346554,  9.86011314,
       10.34436226,  9.74267864, 10.44162178, 10.40956783,  9.29414654,
       10.05021477, 10.38432026, 10.26773739, 10.17112541,  9.01770878,
       10.28068447, 10.49031925, 11.42122269,  9.8182745 , 10.50577736,
       11.1862793 , 11.04720592, 10.3194685 , 10.29874992,  9.63099384,
       10.55593967,  9.96100903, 11.32965088, 10.1030798 , 10.15437603,
       10.72751522, 11.02680206, 10.17289352, 10.20807171, 10.5807972 ,
        9.82701492,  9.70990467,  9.47952938,  9.89400482,  9.90310192,
       10.88363266, 10.18148708,  9.6078186 , 11.3612957 ,  9.55789566,
       10.81572723, 10.1153183 ,  9.88606739,  9.82144833,  9.27096558,
       11.44052124, 10.06314659,  9.79760933, 10.63856602, 10.16060257,
        9.73304176, 10.69305706, 10.63834858,  9.93109226,  9.00567532,
        9.10076809, 11.36916447, 17.64044762,  9.75825214, 11.37934113,
        8.58387661,  9.7455349 ,  9.97937489, 11.80943584,  8.99349594,
       10.65017414, 11.27319622, 10.55690765,  9.8085041 ,  9.57181549,
       10.37087154, 10.07132339, 11.51591396, 10.98816109,  9.99708939,
        9.5718689 , 10.48200989,  9.7787447 , 11.26506901, 12.72904587,
        8.60349369, 10.42990398, 10.61195087, 10.49966335, 10.79094791,
        9.94118023, 10.96340847,  9.98089409, 10.97180176,  8.98347187,
        9.96456432,  8.73169136, 11.55494595,  9.42425919,  9.74652672,
       11.18292904, 10.42725754,  9.73459339, 12.27190781,  9.84189034,
        9.34319305,  9.17745876, 10.32913113, 10.85206032, 10.1724968 ,
       10.41303253, 10.36054325, 10.0679493 ,  9.92787743, 10.6046648 ,
       11.96834373, 12.10933113,  9.40503407, 10.526577  , 10.17176819,
       10.24804115,  9.4227066 , 10.24769497,  9.85063839, 10.28357792,
        9.91055489, 10.07267857, 11.36082458, 11.16549587,  9.50157356,
       10.33835506, 10.00465393])
For the testing data:
array([10.09661198,  9.5269556 ,  9.68397427, 10.92653275, 10.73310947,
        9.59925747,  9.71199703,  9.77644825, 10.23285961,  9.02428913,
        9.21716785, 11.85523224, 10.9337101 , 10.03600311,  9.44603062,
        9.91543293,  9.87280655, 10.11535931,  9.87183857, 10.0390377 ,
       10.87068272, 10.06609154, 10.34671307, 10.13275623,  9.83918476,
       10.12144756, 11.60314655, 13.00332069,  9.01407719,  9.10905933,
       10.55102444,  9.25766277, 10.28547096,  9.96294785,  9.73879814,
       10.35756779,  9.76302624, 10.07028675, 10.43801594,  9.91042614,
        9.38255119,  9.87288189, 11.13052464,  9.49461555, 10.38577747,
       10.26436234,  9.97051048,  9.2813549 , 10.22738838, 12.92765808,
       10.01099396, 12.18794727, 10.67965221, 11.53987885, 10.14023495,
       10.21092129, 10.68487072, 10.03306675, 10.95620728,  9.85312748,
       10.13000202, 10.13410473, 10.42385292, 10.51057148, 11.2904911 ,
       11.14164257, 10.1654644 ,  9.7957859 ,  9.86104488, 10.19400597,
        9.0341959 ,  9.65091038,  9.7457428 ,  9.54844666, 10.03637695,
       10.9693718 ,  9.86671925, 10.50609016,  9.17971325,  9.47025013,
       11.07761288, 11.22717667, 10.00571537, 10.14381218, 10.27927113,
        9.42927647, 10.02052784,  9.81192493,  9.4473629 ,  9.54242897,
        9.79098225, 10.72505283, 10.59535599, 10.49421978,  9.32136059,
        9.61936569,  9.69990921])
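
For reference, here is one way the per-cultivar stds above could be computed (a sketch with hypothetical variable names; I’m assuming “std” here means the spread of each cultivar’s feature vectors around the cultivar mean, and the actual computation may differ):

```python
import numpy as np

def per_cultivar_std(features, labels):
    """features: [N, D] embedding matrix; labels: [N] cultivar ids."""
    stds = []
    for c in np.unique(labels):
        vecs = features[labels == c]             # all feature vectors of one cultivar
        center = vecs.mean(axis=0)
        # RMS distance to the cultivar center, i.e. the square root of the total variance
        stds.append(np.sqrt(((vecs - center) ** 2).sum(axis=1).mean()))
    return np.array(stds)
```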

Some Results of Dissimilarity Visualization

I tried one potential method to visualize how two image embeddings are different. Basically, it finds which channels of the embedding vector make negative contributions to the dot product of the two embeddings, then shows the combination (weighted by how negative they are) of the activation maps (before the pooling layer) on those channels.
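
For reference, here is a minimal sketch of that negative-channel heatmap (assuming the embedding is the globally pooled conv feature, so embedding channel k corresponds to activation-map channel k; names are hypothetical):

```python
import torch

def negative_channel_heatmap(emb_a, emb_b, act_a):
    """
    emb_a, emb_b: [C]       embeddings of the two images
    act_a:        [C, H, W] pre-pooling activation map of image a
    """
    contrib = emb_a * emb_b                     # per-channel terms of the dot product
    weights = (-contrib).clamp(min=0.0)         # keep only negative contributions, as positive weights
    heatmap = (weights[:, None, None] * act_a).sum(dim=0)   # [H, W]
    return heatmap / (heatmap.max() + 1e-8)     # normalize for display
```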

Here are some interesting examples on the HOTEL dataset:

This one clearly catches the difference at the sink.
This one catches the headboard.
And this one focuses on the lights.

Deep Wheat: Montages of Certain Testing Cultivars

I used the feature vectors of only 3 testing cultivars to make a t-SNE plot, then took out the images from the top two arms.

I dropped the cultivars that are only in the training data and got a new confusion matrix for the testing results. There are 3 obvious “lines” for cultivars 235, 242 and 249, which means that many other cultivars have been predicted as these 3 cultivars. So I chose some “hot spots” that have more than 100 images and made montages of them.
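
For reference, a minimal sketch (hypothetical names) of how such hot spots can be pulled out of the confusion matrix:

```python
import numpy as np

def confusion_hot_spots(conf, threshold=100):
    """conf: [K, K] confusion matrix, conf[i, j] = # images of cultivar i predicted as cultivar j."""
    hot = [(i, j, int(conf[i, j]))
           for i in range(conf.shape[0])
           for j in range(conf.shape[1])
           if i != j and conf[i, j] > threshold]
    return sorted(hot, key=lambda t: -t[2])    # most-confused pairs first
```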

Upcoming Conference Deadlines and Possible Papers

There are a number of upcoming conferences towards which we could target papers, listed with their paper deadlines below. Usually, part of the draw is where the conference is, but I expect most conferences over the next year will be virtual. Here are the ones that I think might be relevant, and I’ll share possible paper ideas towards these conferences below:

  • WACV Round 2 (Vision), August 26
  • AAAI (AI/ML), ~Sept 1, 2021
  • IEEE Virtual Reality, ~Sept 1, 2020
  • ICRA (Robotics), Sept 15, 2020
  • CVPR (Vision), Nov. 15
  • ICCP (Computational Photography), ~Dec. 1
  • RSS (Robotics), Jan 30, 2021
  • ICML (AI/ML), ~Jan 2021
  • SIGKDD (ML), ~Jan 2021

Some of the papers I know of that we could push towards writing, along with potentially relevant deadlines, include:

  • Metric Learning for Time-Varying Data — AAAI or CVPR (or both)
    • I really like the story of our approach (what we’ve been calling “center of mass”, but that is a terrible name). The limiting factor as I think about this paper is which exact datasets we are going to use. Aging people? Are there Kaggle challenges that include “classification or image retrieval” problems that have something aging or progressing through time… but where that temporal change is not really the point?
  • 2-SNE++ (WACV or CVPR)
    • If we just write 2-SNE we are double dipping. But what are other things we could do with 2-SNE? Is there a “right” way to do long-term video instead of locking in one frame and solving for the next? Are there ways to better explain the variations between the embeddings? Can we push on why the joint t-SNE embedding almost always finds a lower error solution? What about a variant of t-SNE that is focused on a particular class (for example, Porsche!) and embeds that at the center, keeps the similar classes nearby, and doesn’t really care about the rest; is there a 2-SNE variant of that? (Terrible title idea: 2-SNE or not 2-SNE?)
  • Dissimilarity Visualization (WACV or CVPR or AAAI)
    • We have some very initial results, but they aren’t (quite) yet compelling. I think we need to find a few datasets that we think will be compelling (including the sorghum datasets and the hotel datasets), and run our triplet loss on bigger images to get 16 x 16 final convolutional layers … so that we get more fine-grained representations of images and more refined visualizations.
    • Is there a way to combine the Dissimilarity Visualization with semantic segmentation and automatically create explanations of why 2 images are different?
    • Can we push on the “PCA of the last conv layer vectors” to better understand what parts of the images are the same, different, and independent?
  • Single Image Complete Geometric Camera Calibration including aperture, skew, and everything — CVPR
    • The key to this is to reconstruct the glitter sheet and the calibration setup so that we *REALLY* know everything to less than 1mm. There should be nothing that we measure with a ruler; let’s figure out how to get everything really really nailed down.
  • Single image Stereo Camera Calibration — RSS? ICRA?
    • “How low can we go?” … how small of images can we get and still make things work? Can we make things real time?
  • Single Image Geometric + Photometric Camera Calibration using the holographic glitter (ICCP?)
  • Semantic Pooling for Improved Image Search (joint w/ SLU + Temple) — WACV or CVPR

Deep Wheat: t-SNE Plots of 3 Certain Cultivars in Testing Data

I read the EPSHN code carefully this week, especially the Sampler part, and I think the method we use now will not have overlapping images in one batch. I also ran many experiments with different hyper-parameters such as learning rate and batch size, but none of the results are better than before.

Here are some interesting images: I used the feature vectors of only 3 testing cultivars to get a t-SNE plot.

I also tried dropping the cultivars that are only in the training data, and got a new confusion matrix for the testing results, but it still doesn’t have a diagonal.

New AMOS Website

I will focus this blog post on a quick summary of the new AMOS website: https://amostest.seas.gwu.edu/

I have designed the website to highlight the unique qualities of AMOS, and to encourage researchers, students, and scientists to use the dataset as a resource. With the intention of creating a strong and positive user experience, I have created buttons and carefully positioned them in key locations so that someone can quickly navigate to a particular page for more information, explore images from the dataset, or get information on how to access the AMOS dataset.

The AMOS website provides detailed information about the project, the dataset, publications, participants, and how to access and use the dataset. I have written text throughout the website: an introduction to AMOS, the focus of the project, current research, benefits and use/past use of the dataset, background information, webcam information, image information, funding sources and acknowledgement information, and more.

In addition to the main navigation at the top of each page, I added links to the pages in a footer. In the footer I have also created a “Back to Top” button for each page. I worked to keep the main navigation menu visible while scrolling, so it never disappears!

I chose to make the feedback color on mouse over a blue shade for better contrast, and applied this to all menu options, buttons, and links across the website. I also changed the size to be larger on mouse over for better readability. As for the names of the buttons (I chose an orange color for the buttons), I chose names that were motivational, exciting, directional, and scientific (ex: “Explore The Globe”, “Discover AMOS More”).

The button “Observe AMOS Images” ties in to what I wrote for the map legend and the text I edited for the popups on the “Map” page. I created the legend and placed it at the top of the page so that the user could clearly understand what each cluster represented. I also made a one click “Zoom Out” button on the “Map” page to restore the page to the default zoom I set up on the page opening, when I was working to constrain the initial zoom and the maximum zoom out to prevent the excessive repeating of the map.

For the menu options I chose descriptive and clear names. On the “Publications” page I have created a link for each of the 32 papers.

The layout is consistent across the entire website, down to every detail: formatting, alignment, size, spacing, white space, color choices, number and size of images, headings, logo appearance and placement, footer, menu, wording, structure, etc. One example is the blue/violet color choice I made to emphasize certain words. I want to draw the viewer’s attention to particular words at a given moment.

As I mentioned in a previous blog post, a major focus of my vision for the new website was to integrate multiple images that I carefully chose and positioned in order to create a showcase and overall impression of the dataset. I took great care in selecting images with different time stamps and images that show the following: roads, vehicles (cars, trucks, boats, airplanes), public gathering places, people interacting, natural outdoor settings (with and without water), urban settings, animals, residential areas, skylines, known landmarks, etc. I chose images taken at different times of day, different times of year, and in different locations all around the world.

In my opinion the “Overview of Images” page I made is an especially cool page: together with an interactive component and a mixture of static and dynamic images (archived images and the most recently captured images), it provides a comprehensive overview of what can be found in the dataset (details about this page are in a previous blog post). Every page is rich in images, including the “Publications” page, where I selected and cut images from all the papers to enhance the listing of the titles. I even decided to add an image from one of the papers to the footer, to remind the user of the geo-located outdoor webcams.

The “Map” page displays the most recently captured image from a camera when the user clicks on an individual camera indicator (this functionality already existed), but I did manually add images to 118 of the individual camera indicators’ pop-ups to display the most recently captured image from that particular webcam. I have done this around the entire map, covering many different locations (and changed the color of the individual camera indicators from green to purple with a white outline for better visibility on green terrain).

I changed all the URLs so that they reflect the names of the new pages. I also changed the names that appear on the browser tab, and placed the AMOS logo as the browser tab icon.

The AMOS website has the following pages: Home, About AMOS, Publications, Project Participants, Overview of Images, Dataset Access, and Map. I have some ideas for future development that could enhance the website even further. The website is operational, I have fixed all bugs, and I think the interface meets usability goals and design principles. Hopefully the new website will contribute to the AMOS dataset being considered as a resource that supports the needs of scientists, students, and researchers.

Proxy method or EPHN method? Global feature or Local feature?

Recently, many new deep metric learning papers were released at CVPR 2020. I read most of them.

The following papers are worth reading:

  1. Moving in the Right Direction: A Regularization for Deep Metric Learning

The main idea of this paper is to make the two vectors (fa – fp) and (fa – fn) orthogonal, as a regularizer in addition to the triplet loss.

Orthogonality was a point of interest in my research last year (2019). Although I found nothing that directly helped deep metric learning, I gained a lot of insight about feature orthogonality.

Suppose we have a 3-dimensional embedding (x, y, z), and the anchor and positive live on the xy-plane. The orthogonality constraint requires the negative to live on the z-axis, with no xy component. In a high-dimensional space, the constraint similarly requires the negative to have a small projection onto the subspace spanned by the anchor and positive. In more detail, suppose the embedding dimension is 6; orthogonality means the anchor and positive features should look like [a, a, a, 0, 0, 0] and the negative should look like [0, 0, 0, b, b, b].

At a high level, this means the features of two different classes cannot share the same activation pattern. Maybe orthogonality also lets the network learn more discriminative features (a rough sketch of adding such a regularizer is below).
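
For concreteness, here is a minimal sketch of adding such a direction-orthogonality regularizer on top of a standard triplet loss (this is my reading of the idea, not the paper's exact formulation; names and the weight lam are hypothetical):

```python
import torch
import torch.nn.functional as F

def triplet_with_orthogonality(fa, fp, fn, margin=0.2, lam=0.1):
    """fa, fp, fn: [B, D] anchor / positive / negative embeddings."""
    triplet = F.triplet_margin_loss(fa, fp, fn, margin=margin)
    d_ap = F.normalize(fa - fp, dim=1)              # direction anchor -> positive
    d_an = F.normalize(fa - fn, dim=1)              # direction anchor -> negative
    ortho = ((d_ap * d_an).sum(dim=1) ** 2).mean()  # squared cosine; 0 when orthogonal
    return triplet + lam * ortho
```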

  2. Proxy Anchor Loss for Deep Metric Learning

The analysis of the gradient indicates why the proxy method converges fast (equation 6). So my second-order paper can say the same thing in the next submission. The proxy method still looks different from the pair-wise embedding methods and shows great improvements.

Now let me go back to the title.

Proxy methods pre-assign proxies. These proxies are randomly initialized and static during training, so their elements are non-zero in every dimension, which means the proxy features are not sparse. In my current experiments on background suppression, the raw feature norm of the proxy method is always much higher than the raw feature norm of the EPSHN method. I guess the reason is the following.

The proxies themselves are not sparse. During training, every element of a positive’s feature vector can activate highly to make the dot product large. Therefore the feature embedding is also not sparse, which leads to a large raw norm.

The triplet loss gradient for the anchor and negative tends to make both features sparse. As we explored in the previous gradient paper (the second-order / SCT paper), the way to push the anchor and negative apart is to subtract the anchor feature vector from the negative feature vector (and vice versa). In detail, this means the largest elements shared by the two features are reduced toward zero, and gradually the features become sparse for most images.
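
As a quick reminder of where that subtraction comes from (a sketch using the dot-product/similarity form of the triplet loss as an example; the exact loss in our papers may differ in details): with L = max(0, f_a·f_n − f_a·f_p + m), the gradients when the margin is violated are ∂L/∂f_n = f_a, ∂L/∂f_p = −f_a, and ∂L/∂f_a = f_n − f_p. A gradient step therefore updates f_n ← f_n − η·f_a, literally subtracting the anchor from the negative; on non-negative (post-ReLU) features this drives the dimensions where the anchor is large toward zero, which is one way to see why the features gradually become sparse.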

The proxy method also has negatives, so why aren’t its features sparse? Maybe there is a temporary phase where the features become sparse, but in the final phase the features need to cluster closely around their proxy, and so they become non-sparse again.

Then another related question comes up: global feature or local feature? Since the proxy activates many elements of the feature, the feature should be more global. In contrast, the EPHN method leads to sparse features, meaning few elements of the feature vector are activated, so this method may only capture some local/small information from the images.

If the above hypothesis is correct, then the following directions could push the SOTA higher.

  1. Make the features richer with the EPHN method. One idea is to chunk the feature, send each chunk to the EPHN loss, and normalize each chunk individually (a small sketch of this chunking idea is below).
  2. Pushing negatives leads to sparse features in the EPHN method, so how about turning the push force into a pull force? Let me use “ping” instead of “proxy” here to distinguish it from the proxy idea. Given a triplet (Anc, Pos, Neg):
    we still pull Anc and Pos together, but when we need to push Neg away, we set a ping for Neg that is already far away from the anchor, and pull Neg and its ping close together.
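
For idea 1, here is a minimal sketch (hypothetical interface, using a plain triplet margin loss as a stand-in for the EPHN loss) of chunking the embedding and normalizing each chunk individually:

```python
import torch
import torch.nn.functional as F

def chunked_triplet_loss(fa, fp, fn, num_chunks=4, margin=0.2):
    """fa, fp, fn: [B, D] embeddings, with D divisible by num_chunks."""
    loss = 0.0
    for a, p, n in zip(fa.chunk(num_chunks, dim=1),
                       fp.chunk(num_chunks, dim=1),
                       fn.chunk(num_chunks, dim=1)):
        # normalize each chunk on its own so every chunk has to carry information
        a, p, n = F.normalize(a, dim=1), F.normalize(p, dim=1), F.normalize(n, dim=1)
        loss = loss + F.triplet_margin_loss(a, p, n, margin=margin)
    return loss / num_chunks
```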

Deep Wheat: Montages of Cultivars

4 by 4 montage examples
10 by 15 montage examples (containing blurry days)

The montages below are the images in the hot spots (1-NN match count higher than 50) of the training data confusion matrix. As we can see in the images, cultivars with very long hair, or that are awnless, have better performance. Cultivar 35 is awnless; cultivars 117, 132 and 152 are confused with each other. And cultivar 227 contains all soil images.

cultivar: 4
cultivar: 35
cultivar: 54
cultivar: 109
cultivar: 117
cultivar: 132
cultivar: 152
cultivar: 173
cultivar: 185
cultivar: 195
cultivar: 227
cultivar: 249