Can someone help me with the use of prefix sums?

**MutantJohn** · 05-29-2014

Hey all,

So, I'm right back at it again. But this time, it's parallelized... from outer space!!!!

All jokes aside, let's say that we're creating a tetrahedral mesh. A tetrahedron is a geometric object with 4 vertices, 4 faces (each is a triangle). Here's a picture : http://www.kidsmathgamesonline.com/i...etrahedron.jpg

Now, I'm reading this paper found here : http://www.comp.nus.edu.sg/~tants/gd...appaThesis.pdf

I think it's absolutely brilliant but I'm having some trouble with it.

Namely, how he gets the index of where to construct new tetrahedra in a previously existing buffer.

If you don't want to look at the paper (I'm having trouble on page 61, the section titled "Splitting Tetrahedra"), here's the quote :

The first new tetrahedron t_1 can be stored at the same location as that of the original tetrahedron
t_0 . A thread performing the insertion also needs the space and location to store the other three new
tetrahedra. This can be achieved by performing a parallel prefix sum of the insertion points before the
insertion is done. This produces an index for every tetrahedron that has an insertion. From this the
total number of point insertions can also be obtained. This information can be used to expand the
tetrahedron array enough to accommodate the new tetrahedra. The insertion thread can use the prefix
sum value to find the place to store the three new tetrahedra.

I'm imagining that what he's saying is, we flip in rounds and each round, we insert multiple points. Each point to be inserted is marked. Each point is already stored in a point array. So point_array[4] is a point to be inserted, etc.

My problem with this is, I don't think it works.

If you have points 1, 2, 3 and 4 being inserted, the prefix sum is 1, 3, 6 and 10 or 0, 1, 3, 6 if you prefer the exclusive scan operation. Those would clearly overlap.

So how does this guy use a prefix sum to get indices in the tetrahedra array that don't overlap with each other? Also, I can't just overwrite existing data either. Basically, I'm very confused.

The reason this needs to be done is, I can't use a mutex otherwise the algorithm "fails" so I can't use a convenient "back of array" iterator. I need to modify an array to be written into by hundreds of threads and I need to ensure no threads write over each other. I don't think prefix sums do this. Unless I'm missing something from the paper.

**phantomotap** · 05-29-2014

O_o

Your example seems flawed.

The quote speaks of areas in the expansion, overwrites, and one assumes we are dealing with one tetrahedron at a time.

How are the points 1, 2, 3, and 4 valid for an insertion?

Soma

**MutantJohn** · 05-30-2014

Ah, my apoligies, Sr. Caballito Rosado. (I hope that language compiles for you fine but if there is a read error, it means, Mr. Pink Pony.)

The crux of the algorithm is this :

There is a mesh, T, which contains every tetrahedron.

There is also a set of uninserted points. To understand this better, take the image I posted and imagine inserting a point into it and treating it like a vertex. This yields 4 child tetrahedra. The paper has better images of this. But it's because each face forms a unique tetrahedron with the insertion point.

So, how does CUDA make a good choice of language for this type of programming?

Simply put, we launch a thread for every uninserted point and find which point is closest to which tetrahedron's circumcenter. Here circumcenter means closest to the center of the sphere circumscribed by the vertices of the tetrahedron.

Using this information, we have a large set of independent insertions. In fact, we'll be inserting 1000 points into 1000 tetrahedra at once.

And now this is where I get lost. The idea is, we use a pre-allocated buffer and simply construct the tetrahedra at specific locations.
It's the same thing as :

Code:

std::vector<tetra> t_swift_mesh;
t_swift_mesh.reserve(a_million_bajillion);

for ( /* every unique thread */) {
    new(t_swift_mesh.data() + proper_index) tetra( /* vertices 0 - 3 */ );
}

My biggest problem is, the paper doesn't really seem to explain how he avoids collisions.

The idea is, we're only inserting a subset of the entire uninserted point set at once so we mark each point that is going to be inserted. This is fair. Keep in mind, each point is also stored in an array as well so each point has an array index. His paper cites this as well when he explains how the Tet structure works, it stores the indices of the points in the buffer using a signed int. Which is appropriate considering how small GPU memory is.

I just don't follow how a prefix sum of the indices of points to be inserted guarantees any sort of safety. I'm imagining that he uses the initial tetrahedral index as an offset to the sum but what if the difference between points to be inserted is small? What if the index given writes into a previously allocated space? I have many doubts about the security of this algorithm be claims it works...

**phantomotap** · 05-30-2014

O_o

No. Your "the same as this" is flawed.

Code:

std::vector<tetra> sSurface(GetSurface());
/* ... */
ScanResult sOffset(Scan(sSurface, NeedsSplit)); // We are semantically overloading `operator []' for the sake of example.
/* ... */
sSurface.resize(sSurface.size() + sOffset.splits()); // Expand the array with information obtained from--the "prefix sum"--`NeedsSplit' scan.
/* ... */
forall([/*...*/](/*...*/)
{
    tetra * s(new (sSurface.data() + sOffset[/* The "job id" or "target tetra" which is a "per thread" data. */]));
});

The points 1, 2, 3, and 4 are not valid points for insertion, but the points 1, 2, 3, and 4 may need to be split. However, the "decision array" is 1, 1, 1, and 1 instead of 1, 2, 3, and 4 because where the points currently live is incidental in determining where the split should live.

The example, in purity, has a flaw so consider the points 1, 2, 3, and 4 while knowing that only 1 and 3 require a split. The decision array is 1, 0, 1, and 0 which can now be counted for the 1, 1, 2, and 2 "prefix array". The calculated offset is `pointer + original_length + prefix[target]' for each thread. Of course, you also know that we start counting from zero and each split needs three blocks so you account for that in determining the values. You get +0, +0, +3, +3 where each thread uses the associated offset which, yes, does overlap. However, the overlap is irrelevant because threads working on points 2 and 4 have not work to do so will not be storing anything.

You can, in practice, combine a lot of this adjustment into a single pass which can be done in parallel.

I am unsure the nature of the name comment so I will assume poor intent in the future; you may use phantomotap, Soma, or an implied if you wish to avoid my such assuming.

Soma

**MutantJohn** · 05-30-2014

I would first like to begin this by saying, I'm sorry if I offended you. You're the only one helping me and I'm very grateful for that. I'm just confused as to why you would think that I'd ever have a poor intention regarding you. My comment was the logical equivalent of you calling me Mr. Blue Space or Stars or something to such effect. Irregardless, I'll stick to what you prefer. And about me, you can assume this, I like you, phantom; I mean no disrespect.

I also hate to be a jerk and ask, did you read the paper? I only ask because I'm not sure if we're talking about the same thing.

I think your assertion about points 1, 2, 3 and 4 is wrong. The idea is, we perform the prefix sum on points marked for insertion. It is possible for those 4 to be marked because those points can have any coordinate values. Even if they did follow some sort of spatial proximity criteria, it would still be possible assuming small enough tetrahedra.

A thread performing the insertion also needs the space and location to store the other three new
tetrahedra. This can be achieved by performing a parallel prefix sum of the insertion points before the
insertion is done.

I'm assuming when the author says a prefix sum of the insertion points, he means the indices of the points in the array that are all marked for insertion. If this is the flaw in my reasoning then I could believe it.

The insertion thread can use the prefix
sum value to find the place to store the three new tetrahedra.

So assuming 1, 2, 3, 4 are valid points marked for insertion (which I still maintain is a possibility) then the prefix sum (using thrust's exclusive_scan [thrust is the std namespace from nVidia ( well, it's a partial port)]) yields, 0, 1, 3, 6. Assuming a 'normal' prefix sum, we have 1, 3, 6, 10.

This would mean that the threads would be writing over themselves if we don't use the initial tetrahedral indices but like I said, this has no guarantees about overwriting because where the tetrahedra can be can be quite variable.

Wait... I think I may be a noob. He keeps mentioning "expanding" the array. If we did have 4 insertions, we'd have a total increase of 12 extra tetrahedra, assuming overwrites. It's really 16 writes but we can re-use 4 addresses so we only really need 16 - 4 = 12 new addresses... How do I use this efficiently... But then the author also notes pre-allocation which is exactly what I was originally thinking of. The idea is, we don't always use resize() here. Let's just say we really do reserve space for many tetrahedra before we even begin the insertion procedure.

Hmm... Do you think you could explain your "decision arrays" a tad more? I'm very confused here and I do appreciate the help. Well, if we really did have 1, 3, 6, 10 as the prefix sum, it might be possible to just multiply or add each element by 3 so we'd have 3, 9, 18, 30 or 4, 7, 9, 13 and that might be workable. But I'm still convinced this algorithm will fail due to overwrites.

**phantomotap** · 05-30-2014

So assuming 1, 2, 3, 4 are valid points marked for insertion (which I still maintain is a possibility) then the prefix sum (using thrust's exclusive_scan [thrust is the std namespace from nVidia ( well, it's a partial port)]) yields, 0, 1, 3, 6.

O_o

No. I have not read the paper. I do not really care about the paper. I also, admittedly, recall basically nothing from my time with "CUDA".

I do know quite a bit about `Scan'--"all prefix sum"--from functional programming. I have a "filter" iterator which internally uses the mechanic.

You do not mark points 1, 2, 3, and 4 for insertion by passing those values through--whatever mechanic--`exclusive_scan' if `exclusive_scan' is "all prefix sum" and the "question" is unrelated to those values. Sure, if the question is related to those values you use those values, but you are asking the question "Is the tetra to be split?". You instead create a temporary array of a specific size--related to the available threads of execution/memory--which stores the values related to the question--which is your "Is this tetra marked for insertion?". You do not mark "point #1" as interesting by inserting a 1 integer literal value into the temporary array. As I said, where datum of interest currently lives is irrelevant in deciding where the "expanded" data is to live. By considering that, you are deluding yourself into thinking 0, 1, 3, and 6 is the correct result for the "all prefix sum". A bit of datum either is or is not a point of interest; the value is boolean not integer. The values which answers your question will always be either "yes" or "no".

Consider the datum 1, 2, 3, and 4 requires processing. (The relevant datum is, I understand, the tetra of the target surface.) The values stored in the temporary, "decision", array is "True", "True", "True", and "True" which is saying "The datum at indice 1 is interesting.", "The datum at indice 2 is interesting.", "The datum at indice 3 is interesting.", and "The datum at indice 4 is interesting.". You use the values within the temporary array as the values to build the "all prefix sum" array. The array `(1, 2, 3, 4)' does produce `(0, 1, 3, 6)', but we are not interested in the indice for the "target" datum. We already have that information by definition. We are interested in whether or not the "target" datum needs to be processed. The array `(true, true, true, true)', as `(1, 1, 1, 1)', produces `(1, 2, 3, 4)'; we are interested in that data. Each job, as in thread #1 processes datum #1, may look at the "all prefix sum", `(1, 2, 3, 4)', as derived from the "decision array", `(1, 1, 1, 1)', to determine if the datum is interesting and where to record new material from processing the same datum in the original input.

The `exclusive_scan' is essentially being used with a predicate to filter interesting data from uninteresting data, count the instances of interesting data, and produce an offset for storing newly generated data all at the same time.

As I said earlier, the example is flawed due to the purity; you can't really see how the example works because "Everything is false.".

Code:

int sData = 4; /* How many datum do we have? */
/* ... */
int * sInput = malloc(sData * sizeof(*sInput)); /* The actual `tetra' from the surface. */
/* sInput[] = {23, 44, 65, 86}; */
bool * sDecision = Consider(sInput + 0, sInput + sData, [](int f){return(f % 2);}); /* Determine if a bit of data should be processed. */
/* sDecision[] == {false, true, false, true} */
size_t * sOffset = Sum(sDecision + 0, sDecision + sData); /* Use "all prefix sum" algorithm(s) to generate offset values. */
/* sDecision[] == {0, 1, 1, 2} */
int sExtra = *(sDecision + 3) * 3; /* How much more space do we need? */
/* ... */
sInput = realloc((sExtra + sData) * sizeof(*sInput));
/* sInput[] = {23, 44, 65, 86, ?, ?, ?, ?, ?, ?}; */
Adjust(sDecision + 0, sDecision + sData); /* Adjust the offset to account for each split needing three extra storage spaces according to "start counting at zero" rule. */
/* sDecision[] == {0, 0, 0, 3} */
forall(sInput + 0, sInput + sData, [&]()
{
    int sThreadNumber = /* API */; /* Get the ordered thread number from some provided interface. */
    int * sPrivateStorage = sInput + sData + sDecision[sThreadNumber];
    if(sInput[sThreadNumber] % 2)
    {
        /* We have nothing to do. */
    }
    else
    {
        /* We may safely write to `*sPrivateStorage'. */
        /* The "overlapping" storage is irrelevant. The threads which share the same "overlapping" region have no work to do by definition. */
        /* Threads one and three do nothing. Threads two and four do something, but threads two and four do not have "overlapping" storage. */
    }
});

As I said, you can practically do all of that in a single pass; the code here is only intended as an explanation.

Soma

**MutantJohn** · 06-02-2014

phantom, you're a genius, you know that, right?

Sorry this reply is so late. I totally swamped with stuff irl. But I finally took the time to read your post and wow, I'm an idiot. Your method, though, is absolutely brilliant and is 100% what the author meant. This actually makes sense.

I only have one question though, don't we actually want to process the even elements of sInput? Otherwise assuming thread IDs 0, 1, 2 and 3, we don't wind up writing using the correct offsets (sThreadNumber[3] has an offset associated with 3).

Edit : Oh wait, I'm an idiot. It's because n % 2 == 0 for even numbers and > 0 for odd. Using n % 2 would be true if n is odd and false if n is even. Duh. Nevermind.

Anyway, thank you so much for all of your help. I think I can actually proceed with my algorithm now!

CUDA!!!!!

And just 'cause you said you were rusty, it should be

Code:

int thread_id = blockIdx.x * blockDim.x + threadIdx.x;

**MutantJohn** · 06-02-2014

But out of curiosity, phantom, how do you know all this stuff? Like, you didn't even read the paper and yet you know the technique. Is this popular in real world programming? Did you learn it in school? How do you know this? For me, this is literally the first time I've ever seen anything like this done before.

Thread: Can someone help me with the use of prefix sums?

Thread Tools

Search Thread

Display

Can someone help me with the use of prefix sums?

Similar Threads

Can someone help me with Evaluating Sums?

Win32 sums

Arrays & Sums

Stringy Sums

Sums...