In one of my last blog posts, I discussed the preliminary results of trying to optimize for both the camera & light locations simultaneously. I have been working on a simulation of our geometry to try and understand this better. To recap, it seems that there is a slight ambiguity in the position of the light and camera, in that they can move slightly along two rays and still give a lower error. One thought is that it is an error in how we are modeling the light and the gaussian receptive field of the glitter. Another thought is that it may have to do with the fact that our light is not a point light source, but rather a small square, so we don’t quite know what part of the light a glitter piece sees. More likely, it is some combination of these two issues.

In this blog post I want to highlight where I am in the simulation, and what it shows so far. I simulate 10,000 pieces of glitter that lie on a plane, each with a surface normal corresponding to some random screen location. 500 of them (chosen at random), however, have surface normals that allow the glitter to see the simulated point light and the simulated camera. For those that ‘see’ the light, I give them intensity values of 1, and give intensity values of 0 for all of the rest.
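A minimal sketch of this kind of setup in Python/NumPy. The geometry is an assumption made for illustration (glitter plane at x = 0, monitor plane at x = 500, made-up camera and light positions), and each normal is taken to be the half-vector between the directions to the camera and to the piece's target point on the screen:

```python
import numpy as np

def unit(v):
    """Normalize vectors along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def simulate_glitter(n=10_000, n_lit=500, seed=0):
    rng = np.random.default_rng(seed)
    camera = np.array([600.0, 400.0, 150.0])  # hypothetical camera position
    light = np.array([500.0, 100.0, 200.0])   # hypothetical point light on the monitor plane

    # Glitter pieces lie on the plane x = 0.
    pos = np.column_stack([np.zeros(n),
                           rng.uniform(0, 300, n),
                           rng.uniform(0, 300, n)])

    # Every piece reflects the camera ray towards a random screen location...
    target = np.column_stack([np.full(n, 500.0),
                              rng.uniform(-300, 600, n),
                              rng.uniform(-100, 500, n)])
    # ...except n_lit randomly chosen pieces, which reflect it onto the light.
    lit = rng.choice(n, size=n_lit, replace=False)
    target[lit] = light

    # Mirror reflection: the normal bisects the camera and target directions.
    normals = unit(unit(camera - pos) + unit(target - pos))

    intensity = np.zeros(n)
    intensity[lit] = 1.0
    return pos, normals, intensity, camera, light
```

Reflecting the camera ray about each returned normal recovers the direction to that piece's target, which is a quick sanity check on the geometry.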

I then compute the predicted intensities for each glitter piece using a few different gaussian values for the light and the glitter receptive field. Below, the non-yellow rays are the pieces of glitter that have actual intensity values of 0, but predicted intensity values between 0.1 and 1.

The above images are made using a light gaussian size of 5 and a glitter receptive field of 40. As the predicted intensity gets higher, the rays get lighter (and farther away from the camera), which is exactly what we would expect to happen in this system.
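One way such a predicted intensity could be computed is sketched below. The model is an assumption, not necessarily the one used for the images above: the camera ray is mirrored off each glitter piece, intersected with the monitor plane (taken to be `x = monitor_x`), and the intensity falls off as a gaussian of the distance between that hit point and the light, with the two gaussian sizes combined as if by convolution. The sketch also assumes every reflected ray actually travels towards the monitor.

```python
import numpy as np

def predicted_intensity(pos, normals, camera, light, monitor_x,
                        sigma_light=5.0, sigma_glitter=40.0):
    """Gaussian falloff with the distance between each reflected ray's hit
    point on the monitor plane and the light centre (assumed model)."""
    d = pos - camera
    d = d / np.linalg.norm(d, axis=1, keepdims=True)      # camera -> glitter
    # Mirror reflection of the incoming direction about the surface normal.
    r = d - 2.0 * np.sum(d * normals, axis=1, keepdims=True) * normals
    t = (monitor_x - pos[:, 0]) / r[:, 0]                 # ray parameter at the plane
    hit = pos + t[:, None] * r
    dist2 = np.sum((hit - light) ** 2, axis=1)
    # Assumption: the light and receptive-field gaussians add in variance.
    sigma2 = sigma_light ** 2 + sigma_glitter ** 2
    return np.exp(-dist2 / (2.0 * sigma2))
```

A piece aimed exactly at the light gets intensity 1, and the value decays smoothly as its reflected ray lands farther from the light, which matches the qualitative behaviour described above.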

Next, I am adding in the actual ‘optimization’ part of the simulation. As a first pass, I will just optimize for everything (including light and camera) and see what happens. In doing that, I hope to understand what parameters to tune, and I can start fiddling with the way in which I am predicting intensities and how I am modeling receptive fields.

The second thing I am spending time on this week is adding the light location as a parameter that can be optimized over in the calibration. This involved adding a new flag indicating whether we treat the light location as known or not. If we treat it as unknown, then we optimize for the light location in steps 1 & 3 of the 3-step optimization.

In order to make this task as ‘easy’ as possible, I am only optimizing for translation, rotation, focal, gaussian and light location (everything I did for WACV + light location), and I have not changed the RANSAC protocol to include the fact that the light is unknown. I also initialized the light location to be the correct light location. This blog post will focus on the results of this calibration run.

The main thing I notice is that the camera and light seem to be able to both move slightly, as if they were each attached to a string, one on each end, that was on a pulley in the middle.

The upper image shows a birds-eye view of the setup, with the monitor on the left and the glitter on the right. The bottom image shows the view of the setup looking at it from the side of the table (looking towards the drones). The dotted-line frustum is the GT camera location and zoom, while the red-outlined frustum is the optimized camera location and zoom. The blue circle is the GT location of the center of the square of light displayed on the monitor, and the magenta circle is the optimized location of the center of the light.

The main thing I notice here is that the optimized camera is millimeters closer to the glitter, along the axis in line with the lens. Meanwhile, the optimized light is millimeters farther from the glitter, almost straight back (and slightly lower).

Here we can maybe see why this is the case (again from a birds-eye view of the setup). We are treating the light location as a point light source with some gaussian (which may not be the most accurate representation of our light). So, while all of the rays may fall within some small ‘square of light’ on the monitor, they actually more-or-less intersect close to a point a little further behind, and in this case even below, the point on the monitor where the light actually was.
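The point that a bundle of rays "more-or-less" intersects at can be made concrete with a least-squares fit: the point minimizing the summed squared perpendicular distance to all rays. The sketch below is a generic version of that technique (the function name is hypothetical, and this is not necessarily how the calibration code computes it):

```python
import numpy as np

def nearest_point_to_rays(origins, directions):
    """Least-squares point closest to a set of 3D lines.

    origins, directions: (N, 3) arrays, one line per row.
    """
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    # Projector onto the plane orthogonal to each line: I - d d^T
    P = np.eye(3)[None, :, :] - d[:, :, None] * d[:, None, :]
    A = P.sum(axis=0)
    b = np.einsum("nij,nj->i", P, origins)
    # Requires at least two non-parallel lines so that A is invertible.
    return np.linalg.solve(A, b)
```

Comparing this point with the GT light location would show directly how far behind (and below) the monitor the rays converge.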

This leads me to believe that I need to take a look at the method by which I am using the gaussian representation of our light, and think about how we can represent our light more accurately. Or maybe we need an even smaller light source (more like an actual point light). Something to think about for the re-implementation of the setup!

One of the potential (definite) sources of error that we found in our exploration of camera calibration over the last several months is the thickness of the checkerboard – the checkerboard does not lie in the same plane as the glitter sheet, even when we place the checkerboard directly on the glitter. This is a factor that I had not considered at all, and it was brought to my attention recently. This is something that we will address in the next iteration of the calibration setup, but for now I want to try to correct for this in the measurements I have. In order to compute the camera location in 3D world coordinates, we implement the following process…

1. Take ~25 pictures of the checkerboard in various/random orientations, 1 of which is a picture of the checkerboard sitting straight and flush against the glitter:

2. Find 3 orthogonal points on the checkerboard in order to establish a coordinate system for the checkerboard, shown in green/red (red just to help me know which is the upper left):

3. Map these 3 points into our glitter coordinate system using a homography:

4. Compute the 3D world coordinates of these 3 points – now these 3D coordinates are in the plane of the glitter, which is ~3mm behind the plane of the checkerboard. Herein lies the problem.

Gut Check: We know the squares are 24.5mm x 24.5mm. When I compute the 3D locations of the 3 points shown above, I get the x-distance from upper left to upper right to be 98.8 (the actual distance on the checkerboard is 98). I get the y-distance from upper left to lower left to be 74.0653 (the actual distance on the checkerboard is 73.5). I also compute the dot product between the two ‘axis’ vectors formed by the 3 points, and get -0.0021 (so they are almost orthogonal).
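The numbers in this gut check can be put in relative terms with a few lines of arithmetic. The angle at the end assumes the dot product was taken between unit-length axis vectors, which the post does not state:

```python
import math

square_mm = 24.5
expected_x = 4 * square_mm   # 98.0, i.e. the x-axis spans 4 squares
expected_y = 3 * square_mm   # 73.5, i.e. the y-axis spans 3 squares

measured_x, measured_y = 98.8, 74.0653
rel_err_x = (measured_x - expected_x) / expected_x   # about 0.8%
rel_err_y = (measured_y - expected_y) / expected_y   # about 0.8%

# Assumption: -0.0021 is the dot product of two unit vectors.
angle_deg = math.degrees(math.acos(-0.0021))         # about 90.12 degrees

print(f"x error: {rel_err_x:.2%}, y error: {rel_err_y:.2%}, angle: {angle_deg:.2f} deg")
```

Both distances are off by under 1% and in the same direction, so the recovered points are consistent with the known square size.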

Solution

The MATLAB Camera Calibration Toolbox gives me a location of the camera relative to the checkerboard. I then do a change of coordinate system in order to get this camera location in world coordinates. So, if the checkerboard is being treated as being in the plane of the glitter sheet (which is how it is currently treated), then my relative location of the camera will be computed to be ~3mm closer to the glitter sheet than it actually is.

I think I can just subtract ~3mm from the x coordinate of the 3D location of the checkerboard points, and then compute the camera location in 3D world coordinates as I was before. The y-coordinate and the z-coordinate don’t change since the checkerboard is flush with the glitter sheet (and the glitter sheet lies flat in the y-z plane). So then this effectively gives a camera location that is ~3mm behind where the checkerboard gives as the location.
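A small sketch of why shifting the three checkerboard points carries straight through to the camera location. It assumes the change of coordinates is built from the three points (origin, a point along the board's x-axis, a point along its y-axis), and that the camera position is given in that board frame. All function names and the example numbers are hypothetical:

```python
import numpy as np

def board_frame(p_origin, p_x, p_y):
    """Orthonormal board axes (as columns, in world coordinates) from 3 points."""
    ex = (p_x - p_origin) / np.linalg.norm(p_x - p_origin)
    ey = p_y - p_origin
    ey = ey - ex * (ey @ ex)          # remove any small non-orthogonality
    ey = ey / np.linalg.norm(ey)
    ez = np.cross(ex, ey)
    return np.column_stack([ex, ey, ez])

def camera_in_world(cam_in_board, p_origin, p_x, p_y):
    """Map a camera position from board coordinates to world coordinates."""
    return p_origin + board_frame(p_origin, p_x, p_y) @ cam_in_board

# Three made-up board points in the glitter plane (the y-z plane, x = 0).
pts = np.array([[0.0, 10.0, 80.0], [0.0, 108.0, 80.0], [0.0, 10.0, 6.5]])
cam_board = np.array([50.0, 40.0, -600.0])

before = camera_in_world(cam_board, *pts)
shifted = pts - np.array([3.0, 0.0, 0.0])   # subtract ~3mm from the x coordinate
after = camera_in_world(cam_board, *shifted)
print(after - before)                        # the camera moves by the same offset
```

Because the shift is a pure translation along x, the board axes are unchanged and only the origin moves, so the camera location moves by exactly the same ~3mm.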

My understanding then is that I will need to re-run characterization using this new camera location, and then use this new characterized data to run calibration. This seems too simple — I’ll end my post here and leave it open to comments about my reasoning.

Translation: Ray intersection of consistent glitter

Rotation: Looking along the world coordinate system x-axis ()

Focals: [10000, 10000] — this is the middle of the range of the lens we are using

Image Center: The actual center of the calibration test image

Distortion: [0, 0]. Another idea is to do a quick & dirty grid search over possible distortion values and choose the best as the initialization

Skew: Trying various fixed values…the next portion of this post focuses on this

Below is a table with results for the calibration runs using different fixed values for the skew. All other initializations are as described above.

As you can see, the values for translation, rotation, focal and gaussian (all of the parameters from the original calibration) are relatively consistent for each experiment. However, there is quite a bit of variation in the skew and the distortion parameters.

During the second and last steps of the optimization, I am computing the ‘CALTag Error’, in which I compute the reprojected image coordinates of the CALTags and the consistent glitter, undistort them, and then compare these to their actual image coordinates. This is where the distortion parameter gets used. During this same ‘CALTag Error’ process, I am using the calibration matrix K = [fx s cx; 0 fy cy; 0 0 1], where (fx, fy) = focals, s = skew, (cx, cy) = image center.
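A hedged sketch of a reprojection error of this kind, using the standard pinhole calibration matrix with a skew term and a two-coefficient radial distortion model. Note one deliberate difference from the description above: this version distorts the projected points before comparing, rather than undistorting them, and the function names and extrinsics convention (`R`, `t` mapping world to camera) are assumptions:

```python
import numpy as np

def calibration_matrix(fx, fy, s, cx, cy):
    """Pinhole intrinsics with skew s and image center (cx, cy)."""
    return np.array([[fx, s, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

def reprojection_error(K, R, t, points_world, observed_px, k1=0.0, k2=0.0):
    """Mean pixel distance between projected 3D points and observed pixels."""
    cam = points_world @ R.T + t               # world -> camera coordinates
    xy = cam[:, :2] / cam[:, 2:3]              # normalized image coordinates
    r2 = np.sum(xy ** 2, axis=1, keepdims=True)
    xy = xy * (1.0 + k1 * r2 + k2 * r2 ** 2)   # radial distortion
    px = xy @ K[:2, :2].T + K[:2, 2]           # u = fx*x + s*y + cx, v = fy*y + cy
    return float(np.mean(np.linalg.norm(px - observed_px, axis=1)))
```

The skew only enters through the `s*y` term in the horizontal pixel coordinate, which is one reason it can trade off against the distortion parameters when both are free.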

For now, I am only considering the CALTags in the ‘CALTag Error’, not the actual glitter coordinates as well. The reprojections of the CALTags and the glitter are shown below for the calibration with a fixed skew of 9. The two numbers on top are the average differences between the GT and calculated CALTags & glitter, respectively.

…the final CALTag error (with only CALTag reprojections) came out to be 3.9078.

Next, I tried considering both the CALTags and the glitter in the ‘CALTag Error’. The reprojections of the CALTags and the glitter are shown below for the calibration with a fixed skew of 9. The two numbers on top are the average differences between the GT and calculated CALTags & glitter, respectively.

The results for this are as follows: T: [632.2, 426.1, 177.8] R: [1.3681, -1.0523, 1.0032] F: [10517, 10577] 58.69 k1,k2: [-0.0107, 0.0651] cx,cy: [4131, 2879]

…and the final CALTag error (including both caltag reprojections and glitter reprojections) came out to be 0.576 (average pixel difference across all points).

I reviewed the Deep Wheat dataset last week and rearranged the training and testing sets. In the original dataset, many cultivars are missing data for dates 4, 5, 7, 8, 10, and 12, so I removed the data from these dates. Some cultivars were also missing too much data overall, so I removed them as well. Then I removed the boundary cultivars with too much data, which finally gives a new dataset with 264 cultivars and 9 dates. I took half of the cultivars, plus the 1st rep of the other half, as training data, and the 2nd rep of that other half as testing data.
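The split described above can be sketched as follows, assuming each sample is tagged with its cultivar and rep (1 or 2). The tuple layout and function name are hypothetical:

```python
import random

def split_by_cultivar(samples, seed=0):
    """samples: list of (cultivar, rep, item) tuples with rep in {1, 2}.

    Half of the cultivars go entirely to training. For the other half,
    rep 1 goes to training and rep 2 to testing.
    """
    cultivars = sorted({c for c, _, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(cultivars)
    train_only = set(cultivars[: len(cultivars) // 2])

    train, test = [], []
    for cultivar, rep, item in samples:
        if cultivar in train_only or rep == 1:
            train.append((cultivar, rep, item))
        else:
            test.append((cultivar, rep, item))
    return train, test
```

With 264 cultivars this leaves 132 cultivars in the test set, each of which also appears in training through its 1st rep.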

exp_name: g_8_ep_200_R50_132tra_132tes
model: ResNet-50
loss function: EPSHN loss
group size: 8
epochs: 200
data: 132 cultivar training & testing data

testing data

training data

The 1-NN result of the testing data on cultivar is 0.009 (just a bit better than chance, 1/132 = 0.0076), and 0.20 on date (a bit under twice chance, 1/9 = 0.11). It seems that after removing the cultivars and dates which have incomplete data, it becomes harder for the network to distinguish different cultivars and dates. The confusion matrix of the training data shows an obvious diagonal, but the confusion matrix of the testing data is kind of a mess.
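For reference, a minimal sketch of a 1-NN evaluation on embeddings, assuming test embeddings are matched against a labelled gallery (for example, the training embeddings) with Euclidean distance. The post does not say which gallery or distance was used, so both are assumptions:

```python
import numpy as np

def one_nn_accuracy(gallery_emb, gallery_labels, query_emb, query_labels):
    """Fraction of queries whose nearest gallery embedding has the same label."""
    # Pairwise squared Euclidean distances, shape (n_query, n_gallery).
    d2 = ((query_emb[:, None, :] - gallery_emb[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return float(np.mean(gallery_labels[nearest] == query_labels))

def chance_level(n_classes):
    """Expected accuracy of a uniform random guess over balanced classes."""
    return 1.0 / n_classes
```

Dividing the accuracy by `chance_level(132)` or `chance_level(9)` makes the two results directly comparable: roughly 1.2x chance for cultivar versus roughly 1.8x chance for date.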

confusion matrix of 1NN results of testing data

confusion matrix of 1NN results of training data

Wide Residual Network (WRN) is an improved version of ResNet with fewer but wider layers. According to its repo, it outperforms ResNet-1001 (yes, 1001, not 101) on several datasets, so I decided to give it a try on Sorghum. At first I just tried it with uncontrolled settings, but got contradictory results (better loss, but lower validation recall). Then I ran a very small grid search to compare the two networks. The setting that varies is the learning rate decay.

The validation recall, loss and 2 loss terms over epoch

The line that stands out most is the baby blue one, which is ResNet-101 without learning rate decay. But that is not reasonable, since the two settings are identical for the first 30 epochs apart from the random part (selection of mini-batches), yet the recall@5 differs by about 10%. And in my previous training runs, ResNet-101 had roughly the same recall as the red line. Did it just get lucky?

Setting the baby blue line aside, the WRN (wide residual network) does have a lower loss, but it doesn’t outperform on the validation set. This confirms the initial training result. It is probably overfitting. The WRN-50-2 has 68.9M parameters, and ResNet-101 has 44.5M parameters. Further work could be running a larger grid search on Pegasus. Maybe a smaller WRN would give a better result.
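Since the comparison above hinges on recall@5, here is a sketch of how recall@k is commonly computed for embedding models: a query counts as a hit if any of its k nearest neighbours (excluding itself) shares its label. This is the usual metric-learning definition, assumed here rather than taken from the training code:

```python
import numpy as np

def recall_at_k(emb, labels, k=5):
    """Fraction of samples with at least one same-label item among their
    k nearest neighbours (Euclidean distance, self excluded)."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    d2 = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                 # a sample may not retrieve itself
    nearest = np.argsort(d2, axis=1)[:, :k]
    hits = (labels[nearest] == labels[:, None]).any(axis=1)
    return float(hits.mean())
```

Because the metric depends on which validation samples sit near class boundaries, run-to-run differences from mini-batch selection alone can move it, so repeating each setting with a few seeds would help tell luck from a real effect.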

There was recently a paper made famous because a low-resolution Obama was “upsampled” to become a high-resolution white man. That paper captures a combination of bias in data sets and algorithmic choices to show one example image rather than the distribution of images. This paper created quite a bit of controversy, including a widely publicized exchange between Yann LeCun and Timnit Gebru. I encourage everyone in the lab to read about this exchange.

But the rest of this post goes in another direction. What else can we do with this “upsampling” ability? Here is one crazy idea, ripped from the reality of our modern lives: we spend a huge amount of time in front of screens, in zoom calls, for example. Are we unwittingly sharing what is on our screen? Or, to put this in the constructive setting, suppose I have a picture of you, taken from your webcam, while you are looking at a screen (that I can’t see directly). Could I reconstruct your screen based on what I can see through your webcam?

Probably there will be papers that try to do this based on direct reflections in your glasses or from your eye. I’m wondering if we can do something more, and the *first* thing I think about as I think about papers like this is to ask, “where could I get data for this?” Well, I’ve been on a rampage about reaction videos for the last 2 years. These are videos that people (often video-bloggers, or vloggers) share, of them watching other videos, and a common tradition is to use video editing software to put the “being watched video” into the video of the vlogger who is watching and reacting to that video.

This offers an amazing dataset. Each reaction video shows: (1) video taken from the webcam (or a camera) of the vlogger, and (2) in the inset, the video that is on the screen the video logger is watching. So the question becomes, can we predict or estimate the inset image, based on subtle changes in the main video. The below youtube video gives one example (go to 6:10 if it doesn’t start there automatically):

Notice the reflection on the wall on the right captures the color of what is being shown on the screen. Also, the microphone itself changes color. Some of these are cues we might be able to learn geometrically (following, for example, the sparklevision paper):

Below is a list of other youtube videos where I see a clear “response” of something in the scene to the video that is being reacted to. In this case, in the glasses of the vlogger (which move around and might be difficult to use):

and then, as if to troll me, there’s literal glitter (sequins) in the background of the next vlogger. I don’t actually by eye see the glitter responding to the screen, but it must be, right? (and I’m going to write some code to check for this).

and in the below video I see reflections in the game controllers on the table:

So what are “practical” applications of this? Well, you could try to exactly reconstruct an unknown image, but the resolution of that is not likely to be good enough to, for example, read their e-mail. But I’m struck by noticing that we spend our days in front of many screens like the one below. What is actually showing on the screen that each of these people is looking at?

This is a nice question because you can ask it without completely reconstructing the image. In zoom there is a default “big” picture that zoom selects to show based on who is talking. (a) Which of these 9 people are looking at that screen? (b) You could ask, for each person, are they showing the grid of all 9, or which 1 of the 9 are they looking at? (c) Can you improve your response to (b) by looking at a time-sequence? (d) Can you index other popular web-pages, live-streams, or media so that you can detect other things that might be showing on someone’s screen?

I’d be excited to see any other videos, or classes of videos anyone can find that show more of this. I think that watching reaction videos from gamers might be a win; they are used to sitting close to a large screen with an otherwise darkish room (sorry, stereotyping, I know!), which is likely to create the easiest datasets to use.
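A simple first check for whether some part of the scene (a wall, a microphone, a patch of sequins) responds to the screen is to correlate that region's mean color over time with the mean color of the inset video. The sketch below assumes the frames are already loaded as a `(T, H, W, 3)` array and that both rectangles are picked by hand. Function names are hypothetical:

```python
import numpy as np

def region_mean_color(frames, top, bottom, left, right):
    """frames: (T, H, W, 3) array. Returns the (T, 3) mean color of a rectangle."""
    return frames[:, top:bottom, left:right, :].mean(axis=(1, 2))

def response_correlation(region_means, inset_means):
    """Per-channel Pearson correlation between two (T, 3) color time series."""
    a = region_means - region_means.mean(axis=0)
    b = inset_means - inset_means.mean(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return (a * b).sum(axis=0) / denom
```

A correlation near 1 in a region suggests it is picking up light from the screen. Comparing against a region that should not respond (or against a time-shifted inset signal) gives a baseline for what counts as a real response.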

The activation map and the pooled vector (before the fc layer) in ResNet-50 are non-negative, so no component can contribute negatively to the dot product.

Potential solutions:

1. Remove the ReLU layer before the pooling layer.

2. Based on the equation in Grad-CAM:

w is the weights of the fc layer. Yc is the score for category c. A is the activation map, k is the channel, and i, j is the spatial location.

The equation is the goal of Grad-CAM: to find weights for each activation map and combine them.

In similarity visualization for embeddings, the goal is the same: we try to find ∂D/∂A, which is the contribution of each activation map to the distance between two image vectors. D is the dot product of fc_vect_a and fc_vect_b. Since ∂D/∂A = ∂D/∂Y * ∂Y/∂A (where Y is the fc layer vector), and based on the equation in Grad-CAM, we get ∂D/∂A = fc_vect*W, where fc_vect and A come from different images.

So, we can use the vector after the fc layer and the W matrix to get a new vector (which can have negative values) to replace the original pooled activation map.

3. We can forward the vector at each location in the activation map through the fc layer, and use the output.
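Options 2 and 3 can be sketched together for the simple case of global average pooling followed by a bias-free fc layer. Both are assumptions, and the names `similarity_map`, `act_a`, `W`, and `fc_b` are hypothetical. In that case the CAM-style score is Y^c = sum_k w^c_k (1/Z) sum_{i,j} A^k_ij, with Z the number of spatial locations, so ∂D/∂A^k_ij = (W^T fc_vect_b)_k / Z. Weighting image a's activation map by that vector gives a per-location map whose entries sum to D:

```python
import numpy as np

def similarity_map(act_a, W, fc_b):
    """Per-location contribution of image a's activations to D = fc_a . fc_b.

    act_a: (H, W_sp, K) activation map of image a (before pooling)
    W:     (D, K) fc weight matrix, bias ignored (assumption)
    fc_b:  (D,) fc output vector of image b
    """
    rows, cols, _ = act_a.shape
    channel_weights = W.T @ fc_b            # (K,), can be negative
    # Same as forwarding each location through the fc layer and taking the
    # dot product with fc_b, scaled by 1/Z for average pooling.
    return act_a @ channel_weights / (rows * cols)
```

Even though `act_a` is non-negative, `channel_weights` is not, so the resulting map can contain negative entries, which is what the ReLU'd activations alone could not express.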

This experiment was trained for 200 epochs. I made the t-SNE plot of epoch 100 to check whether we had overtrained.

exp_name: g_8_ep_200_R50
model: ResNet-50
group size: 8
epochs: 200
data: 97 cultivar testing data

t-SNE Plots of Epoch 200 for Testing Data

t-SNE Plots of Epoch 100 for Testing Data

t-SNE Plots of Epoch 200 for Training Data

In the t-SNE plot of epoch 200, I colored the 3 cultivars that samples are most likely to be predicted as. I also made the “red vs. blue” t-SNE plot for the training and testing data.

I also computed the standard deviation of each cultivar in both training and testing data. The average std of training data is 10.191, the min of it is 8.584 and the max of it is 17.640. The std of those 3 special cultivars (217, 266 and 292) are 9.798, 10.427 and 11.361. It’s hard to tell if they are “small and well clustered”.