# Thread: searching problem

1. ## searching problem

Hi there,

I have a little problem and would like to hear some opinions about it:

Well let's consider the simplest way of my problem:
given is a set of Points in an interval (starting with 0) - let P be the Number of these points.
And given is a partitioning of that interval in M partitions - for example an array whose entries are the (sorted) endpoints of the partitions.

My problem is to find an efficient way to compute for every point in P the number of the partition in which he lies.

(i haven't decided yet if the points are given by a list or an array)

really tricky is : the number of M and P may vary in scale
P and M can be really small but also VERY large

my options are :
-binary search for each point on the partitioning
or
-sort the points and then
( -one-sided-binary search on the partitioning
or
-interpolation search on the partitioning )

Do anyone see any other option?

And now the main problem :
Is there is best all-over-choice for all possible numbers P and M ?
if not : when would pay each option above?

(for little M and large P binary search would do best, i think, but what does "little" mean in dependence to P ?)

I know my English sucks but i hope someone is able to understand my problem.

thx in return for answers !
greetz

2. Can you store extra information in the data structure or create a helper data structure? If I understand your problem then you can just store the partition number for each point with that point and then do a binary search for the point. Once you've found it, you've also found the partition number.

3. Hi,

Originally Posted by Narf
Can you store extra information in the data structure or create a helper data structure?

I could both, yes.

Originally Posted by Narf
If I understand your problem then you can just store the partition number for each point with that point and ...
That IS the main problem...
(not to store but to FIND this number)

Suppose you have a partitioning of the interval [0,1000]
in say M=20000 partitions (NOT equally allocated)

and then u get an unsorted set of points - lets say 10^6 points within the interval [0,1000]

and the problem is to find a efficient way to compute for each point in which partition he lies.

for little M and large P it would be binary search without sorting the points first

but for little P and large M sorting the points doesn't cost that much and interpolation search on the large scale of M would do the rest...

am i wrong?

but what if not these easy conditions?
say both M=P=10^5 ? what then?

greets

4. but for little P and large M sorting the points doesn't cost that much and interpolation search on the large scale of M would do the rest...
I have a feeling that this is a case of having to choose the correct data structure. Anyway, have you actually tested sorting the points for large P?

5. Originally Posted by laserlight
I have a feeling that this is a case of having to choose the correct data structure. Anyway, have you actually tested sorting the points for large P?
yeah - the choice of data structure is important here.
I think, I'll use a list for the points and an array for the (sorted) partitions-endpoints.

And : no, i didn't have tested the sorting of the points yet.
Especially for large P there might be no use of it...

But that IS the question here !
Do anybody know, when to use it?

or do i really have to test a possible two dimensional M-P-grid of Numbers with each technique ?

greets

6. That IS the main problem...
(not to store but to FIND this number)
If you store the data the right way, efficient search becomes a non-issue.
am i wrong?
Probably not, but you're not explaining the problem to the point where I can fully understand what you're doing. I don't know what the requirements of this algorithm are. I don't know what the algorithm is going to be used for. I don't know where you're getting the points, or how they're placed into partitions when you get them. As far as I can tell, you want to find out which partition a point is in. But there aren't enough details for me to give you a useful answer.

7. Originally Posted by Narf
I don't know where you're getting the points, or how they're placed into partitions when you get them. As far as I can tell, you want to find out which partition a point is in. But there aren't enough details for me to give you a useful answer.
the points are simply given to me - and they are not placed in the partitioning of course.
But you already see the point : I want to know in which partition every point is and that as efficient as possible.

Given a sorted array of endpoints of the partitions and an unsorted list of points.

Output : each point gets an integer flag which represent the partition it is in (the array-index).

Question : what is the most efficient way from Input to Output?

and as I said before : M and P vary in size, so (IMHO) there is not a good all-over-technique.

The problem is to find a strategy when to use which technique.

And perhaps another idea for a technique not mentioned?

What details are missing?

greets

8. Use binary search for each point of the partitioning!

I'll list some possible algorithms below. I hope you understand Big O notation (if you read the rest of this).

I saw two options before I looked at yours. Option 1 you already gave, so forgive me for being redundant. Option 2 is partly what you gave as your second option, but I considered one of your ideas and one of mine.

Option 1: Dealing with each point separately, use binary search to find the partition in your array of size M. This takes O(P log M) time, since each binary search is O(log M), and there are P of them.

Option 2: Sort the list of points (in O(P log P) time), and then use an algorithm walking forward through the arrays, not too unsimilar from the merging step in mergesort, to assign the partition numbers (in O(M + P) time). The overall runtime is O(P log P + M). If you used one-sided binary search naively, you'd get O(P log P + P log M) time. If you used one-sided binary search, updating your idea of the "front" of the array of partitions to start at the last found partition, you'd get O(P log P + P log (1 + M / P))) runtime. (If the points are relatively evenly distributed.) I would call the latter form of one-sided binary search "progressive" one-sided binary search, I guess; I've never really dealt with it.

O(P log P + P log (1 + M / P)) = O(P (log P + log (1 + M / P)))
= O(P log (P * (1 + M / P))) = O(P log (P + M)).

So the regular old binary search solution runs in O(P log M), and the "sort points first" solutions run in O(P log P + M) or O(P log (P + M)). The O(P log (P + M)) "progressive one-sided binary search" solution loses to binary search, obviously. This leaves the "sort P first mergesort-like algorithm" method.

If M < P, then P log M < P log P + M.
If M = P, then O(P log P + M) = O(P log P) = O(P log M), so asymptotically, they tie. However, binary search would be less complex and generally run faster.
If P < M, then things get more interesting. If M is large, binary search wins, but if M is relatively close to P, it might not? Some more mathematics would show that binary search wins all the time, unless I made a mistake. I would not like to type it just yet, since I might have made a mistake and my work is all over the place.

[Edit: Here it is.]
If P < M, then either P > M / log(M/P) or P <= M / log(M/P).

If P <= M / log(M/P), then since log(M/P) > 0, we know that P log (M/P) <= M. Then P(log M - log P) <= M. Then P log M <= P log P + M.

Suppose that M / log(M/P) < P. It follows that M / P < log(M/P). Let R = M/P. Substituting, we see that R < log(R). But there is no value of R for which this is true! (Hence, there are no values of M and P for which this is true.) So we always have the case P <= M / log(M/P).
[end edit]

So for all positive values of P, M, it's true that P log M < P log P + M. I hope.

Edit:
If the points are always far from evenly distributed (if they are heavily clumped into groups), the "progressive one-sided binary search" algorithm performs better than before, with O(P log P + log M) running time. The value log M could practically be considered a constant, since M will always be less than the amount of available storage. If P is significantly less than M, here P log P might beat binary search asymptotically -- but this would still require M to be huuuuuuuuuuuuuuuuge (growing superexponentially) compared to P. I bet you that binary search would still end up winning in terms of actual microseconds used.

Also, if all the points are heavily clumped in or near the first interval, the algorithm that walks through both arrays will not have to walk through the M intervals, so running time will be O(P log P) in those special cases.

9. Also, if you use 'interpolation search', however fast that runs (you'd probably lose to regular old binary search), you'll still have to sort the container of points, which takes O(P log P) time. Remember that even if M is "exponentially" larger than P, log M will only be a constant multiple of log P, which is not very much of a difference, and if M and P are multiples of one another, log P and log M are practically the same. So only if you have a relationship as extreme as M = P! or P = M! (factorial) or so will an O(P log M) and an O(P log P) give you an important performance difference.

10. Hello Rashakil Fol,

thank you very very very much for your answer !
That was exactly what I'm looking for !

I do have a question on the "progressive" one-sided binary search : How do u get P*log (1 + M / P) for the search (after sorting the points) ?

EDIT:
another question:

If M < P, then P log M < P log P + M.
If M = P, then O(P log P + M) = O(P log P) = O(P log M), ...
you meant P log M < P log(P + M) , right ?
then with M=P and log-base=2
O(P log(P + M))=O(P*log(2P))=O(P*logP + P)

but that would still be greater than O(P*logM)

But i think you are right - it's early and I'll check it later again, but it seems that binary search is the best.

But as I said before : this was only the simplest way of the problem, consider this in four dimensions (makes a constant factor 4) and a linear Variable Z (not that big) - then I'll have to compute with some values if logP and logM gives quite a difference on a very large scale...

(BTW: for binary search we should use base=2 for the logarithm, that represents the number of "if" queries)

But I'll ask again if I have any question about it.

Again : Thank you very much ! You made my day a lot easier !!

greets and much thx
DaMenge

Popular pages Recent additions