# Thread: how to compute set intersection efficiently?

1. ## how to compute set intersection efficiently?

Hello everyone,

I need to compute the intersection of two (or more) sets. I will use C or C++, but STL algorithms are not enabled to be used. I have Googled, but it seems that there are only union algorithms and no intersection algorithms.

It is not my homework, and it is my interest to improve the application I am working on now. Does anyone have any reference materials or any ideas of how to design the intersection operation of sets efficiently.

George

2. Well if sets were bits, then

a & b is intersection
a | b is union

3. Hi Salem,

Originally Posted by Salem
Well if sets were bits, then

a & b is intersection
a | b is union
It is composed of integers. Any ideas?

regards,
George

4. > It is composed of integers. Any ideas?
How many integers
What range of integers

> how to design the intersection operation of sets efficiently.
Try not to engage in premature optimisation.
If your design and interfaces are clean, then rewriting bits of code AFTER you've done a profile run on the completed project will tell you what you really need to be concentrating on.

5. If the sets are sorted you'll be able to optomize your intersection algorithm pretty easily. Of course, that means a performance hit when adding numbers to your sets.

But if your sets are sorted you can use binary searches for the intersection (and for insertions) which are pretty speedy. Binary searches also scale really well.

Here's a start I came up with:
Code:
```typedef struct
{
int *set;
size_t size;
} set_t;

size_t binsearch(int *haystack, int needle, size_t size)
{
size_t half = size / 2;
size_t res;

if(!size)
return (size_t)-1;

if(needle == haystack[half])
return half;

if(size < 2)
return (size_t)-1;

if(needle < haystack[half])
return binsearch(haystack, needle, half);

res = binsearch(haystack + half, needle, size - half);

return res == (size_t)-1 ? res : res + half;
}

size_t insertionpoint(int *haystack, int needle, size_t size)
{
size_t half = size / 2;

if(size < 2)
return haystack[0] < needle;

if(haystack[half] > needle)
return insertionpoint(haystack, needle, half);
return insertionpoint(haystack + half, needle, size - half) + half;
}

set_t *set_init(void)
{
set_t *set;

if(!(set = malloc(sizeof(set_t))))
return NULL;

set->set = NULL;
set->size = 0;

return set;
}

void set_free(set_t *set)
{
if(set->set)
free(set->set);
free(set);
}

size_t set_insert(set_t *set, int num)
{
size_t i;

if(binsearch(set->set, num, set->size) == (size_t)-1)
{
int *temp;

if(!(temp = realloc(set->set, sizeof(int) * (set->size + 1))))
return 0;

i = set->size ? insertionpoint(temp, num, set->size) : 0;

memmove(temp + i + 1, temp + i, sizeof(int) * (set->size - i));
temp[i] = num;

set->set = temp;
set->size++;
}

return set->size;
}

set_t *set_intersect(set_t *set1, set_t *set2)
{
set_t *new;
size_t i;

if(!(new = set_init()))
return NULL;

for(i = 0;i < set1->size;++i)
if(binsearch(set2->set, set1->set[(int)i], set2->size) != (size_t)-1)
if(!(set_insert(new, set1->set[(int)i])))
{
set_free(new);
return NULL;
}

return new;
}```
I'm sure there's still lots of room for improvement.