So assuming 1, 2, 3, 4 are valid points marked for insertion (which I still maintain is a possibility) then the prefix sum (using thrust's exclusive_scan [thrust is the std namespace from nVidia ( well, it's a partial port)]) yields, 0, 1, 3, 6.
O_o
No. I have not read the paper. I do not really care about the paper. I also, admittedly, recall basically nothing from my time with "CUDA".
I do know quite a bit about `Scan'--"all prefix sum"--from functional programming. I have a "filter" iterator which internally uses the mechanic.
You do not mark points 1, 2, 3, and 4 for insertion by passing those values through--whatever mechanic--`exclusive_scan' if `exclusive_scan' is "all prefix sum" and the "question" is unrelated to those values. Sure, if the question is related to those values you use those values, but you are asking the question "Is the tetra to be split?". You instead create a temporary array of a specific size--related to the available threads of execution/memory--which stores the values related to the question--which is your "Is this tetra marked for insertion?". You do not mark "point #1" as interesting by inserting a 1 integer literal value into the temporary array. As I said, where datum of interest currently lives is irrelevant in deciding where the "expanded" data is to live. By considering that, you are deluding yourself into thinking 0, 1, 3, and 6 is the correct result for the "all prefix sum". A bit of datum either is or is not a point of interest; the value is boolean not integer. The values which answers your question will always be either "yes" or "no".
Consider the datum 1, 2, 3, and 4 requires processing. (The relevant datum is, I understand, the tetra of the target surface.) The values stored in the temporary, "decision", array is "True", "True", "True", and "True" which is saying "The datum at indice 1 is interesting.", "The datum at indice 2 is interesting.", "The datum at indice 3 is interesting.", and "The datum at indice 4 is interesting.". You use the values within the temporary array as the values to build the "all prefix sum" array. The array `(1, 2, 3, 4)' does produce `(0, 1, 3, 6)', but we are not interested in the indice for the "target" datum. We already have that information by definition. We are interested in whether or not the "target" datum needs to be processed. The array `(true, true, true, true)', as `(1, 1, 1, 1)', produces `(1, 2, 3, 4)'; we are interested in that data. Each job, as in thread #1 processes datum #1, may look at the "all prefix sum", `(1, 2, 3, 4)', as derived from the "decision array", `(1, 1, 1, 1)', to determine if the datum is interesting and where to record new material from processing the same datum in the original input.
The `exclusive_scan' is essentially being used with a predicate to filter interesting data from uninteresting data, count the instances of interesting data, and produce an offset for storing newly generated data all at the same time.
As I said earlier, the example is flawed due to the purity; you can't really see how the example works because "Everything is false.".
Code:
int sData = 4; /* How many datum do we have? */
/* ... */
int * sInput = malloc(sData * sizeof(*sInput)); /* The actual `tetra' from the surface. */
/* sInput[] = {23, 44, 65, 86}; */
bool * sDecision = Consider(sInput + 0, sInput + sData, [](int f){return(f % 2);}); /* Determine if a bit of data should be processed. */
/* sDecision[] == {false, true, false, true} */
size_t * sOffset = Sum(sDecision + 0, sDecision + sData); /* Use "all prefix sum" algorithm(s) to generate offset values. */
/* sDecision[] == {0, 1, 1, 2} */
int sExtra = *(sDecision + 3) * 3; /* How much more space do we need? */
/* ... */
sInput = realloc((sExtra + sData) * sizeof(*sInput));
/* sInput[] = {23, 44, 65, 86, ?, ?, ?, ?, ?, ?}; */
Adjust(sDecision + 0, sDecision + sData); /* Adjust the offset to account for each split needing three extra storage spaces according to "start counting at zero" rule. */
/* sDecision[] == {0, 0, 0, 3} */
forall(sInput + 0, sInput + sData, [&]()
{
int sThreadNumber = /* API */; /* Get the ordered thread number from some provided interface. */
int * sPrivateStorage = sInput + sData + sDecision[sThreadNumber];
if(sInput[sThreadNumber] % 2)
{
/* We have nothing to do. */
}
else
{
/* We may safely write to `*sPrivateStorage'. */
/* The "overlapping" storage is irrelevant. The threads which share the same "overlapping" region have no work to do by definition. */
/* Threads one and three do nothing. Threads two and four do something, but threads two and four do not have "overlapping" storage. */
}
});
As I said, you can practically do all of that in a single pass; the code here is only intended as an explanation.
Soma