Daniel Taylor

== Nopelepsy Update: OpenCV and OpenCL ==

Updating Nopelepsy to use image and GPGPU libraries, and the algorithm changes involved.

Hello again! This one is a doozy, but that’s okay. We’ll get through it together.

What’s changed

For one, Nopelepsy is link(https://github.com/CulDeVu/Nopelepsy)[public on Github now]! So that’s cool.

There’s been a big rewrite since last time. Firstly, the Java implementation was ported to C++, and I figured out how to re-cast a lot of the single-threaded functionality into algebraic operations on wide vectors. So that’s cool too. In fact, that’s what most of this post is going to be about.

But first! Here are some videos. Note how much less jarring they look compared to the last set of videos I showed off.

General Changes

One of the big algorithmic changes is how I’m doing the flash removal. Previously, if a domain $[a,b]$ was marked for removal, then any pixel value at a time $t$ between $a$ and $b$ would be set to a value linearly interpolated between the endpoint pixel values, i.e. $p_t = p_a + \frac{t - a}{b - a}(p_b - p_a)$. However, this is very wasteful in terms of the “video metric”.

Any fool could turn a video with flashes into one without them just by overwriting the entire video with zeros. The most desirable thing for a detector is the ability to modify the video as little as possible, while still completely removing the flash. The precise metric being used is the sum of Lab-space distances between the original and modified pixels, $\sum_{t} \sum_{x,y} \lVert \mathrm{Lab}(p_{t,x,y}) - \mathrm{Lab}(p'_{t,x,y}) \rVert$, summed over all pixels and over all frames. It’s pretty easy to see how this sort of criterion, if followed to the letter, could go very badly. For example, take a look at this image:

img(dots.png)[]

Nearly half of this 100x100 pixel image’s pixels are white, but every white pixel sees less than 25% of its 50x50 square of neighbors as being white. If every white pixel were a square wave, this would technically pass the W3C criteria. However, this is still clearly harmful: the eye isn’t able to discern individual pixels, and instead perceives the average luminance, which in this case still flashes. By adhering too closely to the metric, you could end up in a situation like this. The two options are to (1) change the metric or (2) just not go overboard.

In Nopelepsy, I’ve chosen option (2). The way it’s done now is by clamping the brightness of each pixel $p_c$ inside the domain to $\max(p_a, p_b) + \text{lum\_thresh}$. This isn’t optimal, or even locally optimal per pixel, but it’s a good first step.
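
As a rough sketch, that clamp looks something like this in OpenCV terms (the names here are illustrative, not the actual Nopelepsy code):

#include <opencv2/core.hpp>
using namespace cv;

// clamp the in-between frame "mid" against the flash endpoints "a" and "b";
// all three are CV_32FC1 luminance frames (illustrative names)
void clampFlash(const Mat& a, const Mat& b, Mat& mid, float lum_thresh)
{
    Mat limit = max(a, b) + lum_thresh; // brightest value that can't flash
    limit.copyTo(mid, mid > limit);     // only touch pixels that exceed it
}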

Another, smaller change is the removal of the Center of Mass domain equality. Empirically, it had no discernible effect on the output videos.

OpenCV Implementation

In the last post, I listed the algorithm as follows:

function detectAndModify()
  F := empty list of potential flash "ballots"

  for every pixel in the video dimensions
    look at its histogram and determine all potential flashes that could happen

    for every potential detected flash
      if it flashes concurrently with some potential flash in F, let that pixel "vote" on that flash in F
      if not, create a new "ballot", vote on it, and then add it to F

  optionally reduce F by looking at its neighbors

  for every potential flash ballot in F
    for every voter on that ballot
      check to see if there are enough other voters around it (25% of the 0.024 steradian region around it)
      if not, continue

      there is now determined to be a general flash that corresponds to some domain of the voter pixel (the necessary information is all on the ballot)
      modify the necessary frames of that voter to get rid of it

This algorithm has changed very little in spirit, but very much in implementation. The “ballots” described above were originally implemented as lists of pixel coordinates, but are now described with an OpenCV bitmask. This is made possible by getting rid of the Center of Mass equality and focusing on Domain equality.

A more representative algorithm looks something like this:

function detectAndModify()
    B := array of bitmask "ballots", one for every possible flash domain

    for every frame F in video
        for every possible domain that F can be a part of
            add the pixels that flash on this domain to the correct bitmask in B

    reduce B at least once

    for every frame F in video
        for every pixel masked in some valid bitmask in B, clamp brightness accordingly

The main steps are basically the same; the difference is that now the entire algorithm can be expressed in terms of common image processing functions operating on ballots and frames.

The first block looks like this in code:

Mat s = frame at beginning of flash domain
Mat e = frame at end of flash domain

// the value to threshold by
Mat truncate = max(s,e);
truncate += luminance_thresh;

// build the ballot
Mat ballot = Mat::zeros(lum_stack[0].size(), CV_8UC1);
Mat comp;
for (int mid = start + 1; mid < end; ++mid)
{
    Mat m = lum_stack[mid];
    compare(m, truncate, comp, CMP_GT); // which pixels exceed the threshold?
    bitwise_or(ballot, comp, ballot);   // accumulate them onto the ballot
}

This is much less costly than manually pushing and popping coordinates into a ballot list. A separate thread is launched for each potential flash domain, so these ballots are built asynchronously. There is a faster way to build them synchronously, as is already done in the flash removal phase, but I haven’t gotten around to testing it.
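
The fan-out is roughly this shape (a minimal sketch; buildBallot, start, and stack_size are stand-ins for the real code):

#include <thread>
#include <vector>

// one worker per potential flash domain [start, end] (stand-in names)
std::vector<std::thread> workers;
for (int end = start + 2; end < stack_size; ++end)
    workers.emplace_back([=]() { buildBallot(start, end); });

for (std::thread& t : workers)
    t.join();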

Speaking of flash removal, here’s the new version of that:

// this has to be done in sequence due to differences in se_max
Mat s = start frame of current video slice

// running minimum of the domain end frames, seeded above any valid luminance
Mat cur_min = Mat(height, width, CV_32FC1, Scalar(1.0));

// doing this backwards is important: walk the nested domains
// from the largest one down to the smallest
for (int cur = stack_size - 1; cur >= 1; --cur)
{
    Mat m = the cur'th frame in current video slice

    // fold this domain's end frame into the running minimum,
    // but only where the pixel actually flashes on [0, cur]
    Mat ballot = ballot for domain [0, cur]
    m.copyTo(cur_min, (m < cur_min) & ballot);

    // the tightest clamp allowed by every domain containing this frame
    Mat se_max = max(cur_min, s);
    se_max += user_settings.luminance_thresh;

    se_max.copyTo(m, m > se_max);
}

The variable s is set at the start due to the optimizations discussed last time. This algorithm is able to remove all flashes from all frames in the current video slice in a single sweep through the domain list.

Notice that the threshold value described above only depends on the pixel values at the ends of the domain, and it is the maximum value a pixel can be set to while still guaranteeing that no flash occurs after the modification. So that means if pixel $p$ belongs to flashes on domains $[a,b]$ and $[c,d]$, then the maximum value that $p$ can take after modification is $\min(\max(p_a, p_b), \max(p_c, p_d)) + \text{lum\_thresh}$, and the pattern continues for 3 or more flash domains. Since all the domains are nested (e.g. $[0, 2], [0, 3], [0, 4], \ldots$), all we have to do is keep track of the minimum value of $\max(p_0, p_a)$ (or even better, just the minimum value of $p_a$) to get the maximum allowable pixel brightness, going backwards from the largest domain down to the smallest. This is what is happening in the loop above.
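
Tracking just the minimum of $p_a$ works because $\max$ distributes over $\min$ (a one-line identity, easy to check case by case):

$$\min\big(\max(p_0, p_a),\, \max(p_0, p_b)\big) = \max\big(p_0,\, \min(p_a, p_b)\big)$$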

The reduction phase is uninteresting. All the reduction does is make sure that flashes are surrounded by a certain number of other flashes; otherwise they’re ignored. This is done using box filters, with masks to early out. A simple optimization is done in this step by downsampling the ballots for the reduction, and resampling back up.
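
In OpenCV terms, the core of that reduction looks something like this (a sketch with assumed names and a made-up kernel size, not the exact code):

#include <opencv2/imgproc.hpp>
using namespace cv;

Mat shrunk, density, keep_full;
resize(ballot, shrunk, Size(), 0.25, 0.25, INTER_AREA); // downsample the ballot

// average occupancy of each pixel's neighborhood (ballot pixels are 0 or 255)
boxFilter(shrunk, density, CV_32F, Size(9, 9));

// a voter survives only if at least 25% of its neighborhood also voted
Mat keep = density > 0.25 * 255.0;

resize(keep, keep_full, ballot.size(), 0, 0, INTER_NEAREST); // back to full size
bitwise_and(ballot, keep_full, ballot);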

The three big things on the menu right now are

OpenCL Rewrite

So. OpenCL.

A couple weeks ago I started a rewrite of the core algorithm of Nopelepsy in OpenCL. The goal was to get Nopelepsy running on all platforms and all devices at full speed, hopefully faster than real-time.

You’d think I would have learned by now. It’s not like OpenCL drivers are known for implementing literally anything correctly.

On my work MacBook, things went fairly well. Even with the reduction phase (which, unfortunately, is the most time-consuming part of both the CV and CL versions) still being done on the CPU, I was able to get it running slightly faster than the CV version while only running on one device. I wasn’t able to get it running faster by using multiple devices, and that’s something I’m going to be investigating at some point in the future. I suspect it has something to do with behind-the-scenes memory management.

On my Windows machine, it just straight up doesn’t work. Every device on my Windows machine outputs a completely different result when running Nopelepsy, and at most one of them can be correct. This isn’t just a little infuriating. All I’m doing is moving data around in buffers! How the hell do you mess that up? All my kernels together don’t even take up 100 lines of code. One particularly maddening thing that’s been happening recently is clEnqueueFillBuffer returning CL_INVALID_OPERATION, which link(https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clEnqueueFillBuffer.html)[isn’t even a valid error code]!

EDIT (2020-06-05): So this was actually my fault. It turns out that the platform and device OpenCL versions aren’t guaranteed to be the same. So before using an OpenCL feature, you need to check BOTH versions.
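
For the curious, the check looks something like this (a minimal sketch; clEnqueueFillBuffer is an OpenCL 1.2 feature):

#include <CL/cl.h> // <OpenCL/opencl.h> on macOS
#include <cstdio>

// version strings look like "OpenCL <major>.<minor> <vendor stuff>"
static bool versionAtLeast(const char* ver, int major, int minor)
{
    int maj = 0, min = 0;
    if (sscanf(ver, "OpenCL %d.%d", &maj, &min) != 2)
        return false;
    return maj > major || (maj == major && min >= minor);
}

// the platform and device versions are NOT guaranteed to match,
// so a 1.2 feature like clEnqueueFillBuffer needs both to be >= 1.2
bool canUseFillBuffer(cl_platform_id platform, cl_device_id device)
{
    char pver[128], dver[128];
    clGetPlatformInfo(platform, CL_PLATFORM_VERSION, sizeof(pver), pver, NULL);
    clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(dver), dver, NULL);
    return versionAtLeast(pver, 1, 2) && versionAtLeast(dver, 1, 2);
}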

When I’m not having to deal with broken drivers or implementation-defined behavior, I’m (very slowly) making progress on this re-write. Even if it is making my hair turn gray before my time.

Conclusion

So yeah, that’s all I have to say at the moment. If you want to check out Nopelepsy, the source is on Github.