Optimizing searching through a string array

lewist57 · Apr 1, 2010

First of all, let me apologize for asking such a simplistic question, but here goes.

Suppose one wants to search an array of 100 strings, for three words. The search for the words is conducted one at a time (word 1, word 2 and then word 3). And the words are exclusive, in that only one of the three words will appear only once in a line, if at all (ie - no line will have more than one of the words in it). In searching through the array for the first word, suppose you find it in 23 lines. Now, if a word is found in a string, that string does not need to be searched again, so when searching for the second word, one would search through 77 strings (100 - 23). And if you found the second word in 40 strings, then you would want to search only the remaining 37 (77 - 40) strings for the 3rd word.

One could simply search through all 100 strings each time a word is being searched for, but is there a clean and efficient way to avoid searching strings that have already been identified as containing other search words? Although this sounds trivial, for my purposes, my array could approach 10,000 strings, so every little bit helps.

Thanks in advance for your input

snarfblam · Apr 1, 2010

There are plenty of viable solutions. Here are a few off the top of my head, each of which amounts to tracking which strings have been searched so you can skip them next time.

If you are storing the strings in a list-like collection (array, ArrayList, List<T>) you can use a BitArray class the same size as the list. This would use one bit per item to track which strings have been searched.

You could also use a collection of structs similar to what I have below.

Code:

struct stringListItem {
    public string value;
    public bool wasSearched;
}

This couples the related data more closely and might be a little more intuitive to work with, but uses 32 extra bits per string. Still, with 10,000 lines thats only 40k of data, which isn't extraordinary.

You could also clone the array and remove strings as they are found (replacing them with null).

If you wanted to be more elaborate, you could create a custom collection class that is represented externally by a struct similar to the one above but backed by a string array/list and a BitArray.

But I think the best approach would be the simplest and easiest to understand. I wouldn't really worry that much about optimizing unless you actually experience performance issues or excessive memory consumption. The actual bottlenecks in an application often aren't where you expect them to be, and premature optimization often amounts to fixing a problem that doesn't exist.

lewist57 · Apr 2, 2010

Many thanks for your reply. I agree in the era of 3.0+ GHZ microprocessors with 4G+ of RAM available, some optimization techniques are no longer relevant.

However, I will document my solution, noting that it might not be optimum, but it works, in psuedo code:

CLUMSY WAY:

For i = 1 to 100
Is word1 in the string(i)? If so, set flag1 for string(i)
next i

For i = 1 to 100
Is flag1 set? If so, skip this string
Is word2 in the string(i)? If so, set flag2 for string(i)
next i

For i = 1 to 100
Is flag1 set? If so, skip this string
Is flag2 set? If so, skip this string
Is word3 in the string(i)? if so, set flag3 for string(i)
next i

BETTER WAY:

For i = 1 to 100
Is word1 in string(i)? if so, set flag1 for string(i) and go to next string
Is word2 in string(i)? If so, set flag2 for string(i) and go to next string
Is word3 in string(i)? If so, set flag3 and continue
next i

If one knows the probability of which of the three words will appear most often, you can also set word1 = most likely, word2 = second likely, word3 = least likely in the search order. I guess this is why my computer science professor was always stressing the importance of flow charts and algorithms.

snarfblam · Apr 2, 2010

Your second method there seems to eliminate the need for a flag at all, but maybe an integer to identify the index of the word that was found, if you really need to cache the results. I was working on the assumption that the list of strings to search for was arbitrary and could be of any length. Even so an inner loop could easily account for a variable length search list.

Your algorithm is probably better than mine. I don't know why it didn't occur to me that the lines shouldn't need to be iterated over more than once. Anyhow, algorithmic optimizations are generally the most significant. The methods I suggested represented the same algorithm with different data storage, whereas your better method can probably do the same chore with less memory and less processing, depending on exactly what you are doing.

lewist57 · Apr 8, 2010

Gee, I am blushing! Too bad I did not think about it before posting.

In any case, the "weird" thing about this situation is that only one of the three words may appear in each line (or none may appear), so that kind of throws the search algorithm off the typical solution path.

Thanks for your kind input and feedback!

Optimizing searching through a string array

lewist57

Newcomer

snarfblam

Ultimate Contributor

lewist57

Newcomer

snarfblam

Ultimate Contributor

lewist57

Newcomer