regexp using a ton of memory

mattsrebatespam

I am trying to search through a bunch of text files (~25K-50K files) using regular expressions. I have a FileContentsArray array that contains each line of the file. I also have a SearchExpression array that has usually < 10 regular expressions. The code is working, but the memory usage is enormous. The program ties up all the system resources and takes forever to finish. I have managed to isolate the huge memory usage to the second line where I declare the Regexp. Is there any way to not declare it as a new object or to purge the object each time? I have tried to set Regexp = nothing but the memory used remains the same. Any help is greatly appreciated.

For j = 0 To FileContentsArray.Count - 1
    Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
    If Regexp.Match(FileContentsArray.Item(j)).Success Then
        'add the match to a dataset
    End If
Next
 
Another thing I would like to add: if I comment out the If...Then part of the code where the match is performed (leaving only the declaration of the Regex object), the memory usage is still high.

 
Don't open all the files at once(!).

I'm not at all familiar with VB, so here's an algorithm in pseudocode that should be slightly more efficient:
Code:
'build an array, MyFilenamesArray, of valid file names
'build an integer, sexpcount, equal to the last valid index value of your regular 
'   expression array

count = (number of entries in MyFilenamesArray) - 1
for (iteration = 0..count) do
    InnerStreamReader = new sr (MyFilenamesArray.Item(iteration))
    InnerTextblock = read_to_end_of_file InnerStreamReader
    close InnerStreamReader
    for (rexp_iter = 0..sexpcount)
        Regexp = new Regex(SearchExpression.Item(rexp_iter))
        'regex work here
        'close/garbage collect regexp
    end_inner_for
    'close/garbage collect innertextblock
end_for
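
For anyone who wants something closer to VB, here is a rough sketch of the same idea: open one file at a time and release it before moving on. It is only a sketch under assumptions. `FindMatchingLines` is a hypothetical helper name, it reads line by line instead of reading to the end of the file, it expects the Regex objects to be built by the caller, and it assumes .NET 2.0 or later for `Using` and generics.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Text.RegularExpressions

Module FileSearchSketch
    ' Hypothetical helper: returns the lines of one file that match any pattern.
    ' The Regex objects are passed in, so none are constructed per file or per line.
    Function FindMatchingLines(ByVal path As String, ByVal patterns As IList(Of Regex)) As List(Of String)
        Dim hits As New List(Of String)()
        ' Using closes the reader as soon as this file has been scanned.
        Using reader As New StreamReader(path)
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                For Each rx As Regex In patterns
                    If rx.IsMatch(line) Then
                        hits.Add(line)
                        Exit For ' one hit per line is enough for this sketch
                    End If
                Next
                line = reader.ReadLine()
            End While
        End Using
        Return hits
    End Function
End Module
```

Only one file's reader and one line are alive at a time, so memory should stay roughly flat no matter how many files are in the list.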
 
I'm not too familiar with regular expressions, but looking at your code I noticed that you declare a new Regex on every iteration. That's a lot of iterations, so of course you will generate a lot of objects. If it is not necessary to create a new one each time, then reusing one can save you oodles of memory.

I also don't know a ton about garbage collection, but I'm not sure that it will occur in the middle of a function like that. I haven't the slightest idea whether this will work, but if you call DoEvents every now and then it might give the app a chance to do garbage collection. With an operation this big you want to call DoEvents every now and then anyway, so as not to hog 100% of the CPU. You could also call GC.Collect(). Calls to GC.Collect can cause memory to be managed worse later in the app, though, because you are forcing any objects still in use into later generations, so be careful with that one.
 
I see at least three things that can be made faster:
1. Use a For Each loop. A regular For loop can incur extra overhead compared with For Each, as .NET will do an array bounds check on each indexed access. This is small, but since For Each is *easier*, you might as well use it:
Visual Basic:
Dim s As String
For Each s In FileContentsArray
    Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
    If Regexp.Match(s).Success Then
        'add the match to a dataset
    End If
Next

2. Create the Regex objects once, outside of the loop. I'm not sure what k is; I'm guessing you loop through the regular expressions? In your version a new Regex has to be allocated on each iteration and the GC then has to collect it. That's a tremendous amount of overhead that you can avoid. Below is a new version that just moves the Regex out of the loop:
Visual Basic:
Dim s As String
Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
For Each s In FileContentsArray
    If Regexp.Match(s).Success Then
        'add the match to a dataset
    End If
Next

3. Put the match into a Match object so you can use it inside the If. This will help if you're testing for Success then getting the Match object into a variable to do something with the matches. You didn't show what you did with the data.
Visual Basic:
Dim s As String
Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
Dim m As Match
For Each s In FileContentsArray
    m = Regexp.Match(s)
    If m.Success Then
        'add the match to a dataset
        ds.Tables... = m.Groups("Col1").Value
    End If
Next
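
Since k looks like an index into several expressions, points 2 and 3 can be combined roughly as follows. This is a sketch, not your actual code: `BuildPatterns` and `CollectCol1` are made-up names, the DataTable layout is invented for illustration, and it assumes your expressions define a named group called Col1 and that you are on .NET 2.0 or later (for generics).

```vbnet
Imports System.Collections
Imports System.Collections.Generic
Imports System.Data
Imports System.Text.RegularExpressions

Module RegexReuseSketch
    ' Construct one Regex per search expression, once, before any line is scanned.
    Function BuildPatterns(ByVal searchExpressions As IEnumerable) As List(Of Regex)
        Dim patterns As New List(Of Regex)()
        For Each expression As String In searchExpressions
            patterns.Add(New Regex(expression))
        Next
        Return patterns
    End Function

    ' Scan every line with every prebuilt Regex and store the named group "Col1".
    ' The table layout here is an assumption made for the example.
    Function CollectCol1(ByVal fileContents As IEnumerable, ByVal patterns As List(Of Regex)) As DataTable
        Dim table As New DataTable("Matches")
        table.Columns.Add("Col1", GetType(String))
        Dim m As Match
        For Each line As String In fileContents
            For Each rx As Regex In patterns
                m = rx.Match(line)
                If m.Success Then
                    ' Reuse the Match object instead of running the match twice.
                    table.Rows.Add(m.Groups("Col1").Value)
                End If
            Next
        Next
        Return table
    End Function
End Module
```

With fewer than 10 expressions, this constructs fewer than 10 Regex objects in total, no matter how many lines are scanned.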

There could be lots of other optimizations we'd recommend. The best thing to do is profile the code yourself and determine EXACTLY what the slowest part of the code is. If you don't have a profiling tool, some scattered Debug.Write statements should help. It's always better to measure first before diving into code changes where you merely think it's slow; finding out for sure is too easy to skip.
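
If you don't have a profiler handy, a crude timing harness along these lines might be enough to compare the two approaches. It is a sketch only: `TimeScan` is a hypothetical name, and it assumes .NET 2.0 or later for the Stopwatch class (on older versions you could subtract two Environment.TickCount readings instead).

```vbnet
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module TimingSketch
    ' Scans the lines with either one reused Regex (reuse = True) or a fresh
    ' Regex per line (reuse = False) and returns the elapsed milliseconds.
    Function TimeScan(ByVal lines As String(), ByVal pattern As String, ByVal reuse As Boolean) As Long
        Dim hits As Integer = 0
        Dim prebuilt As Regex = Nothing
        Dim watch As Stopwatch = Stopwatch.StartNew()
        If reuse Then prebuilt = New Regex(pattern)
        For Each line As String In lines
            Dim rx As Regex = prebuilt
            If rx Is Nothing Then rx = New Regex(pattern)
            If rx.IsMatch(line) Then hits += 1
        Next
        watch.Stop()
        ' Shows up in the debugger's output window in a Debug build.
        Debug.WriteLine(String.Format("reuse={0} hits={1} elapsed={2} ms", reuse, hits, watch.ElapsedMilliseconds))
        Return watch.ElapsedMilliseconds
    End Function
End Module
```

Run it on a realistic sample of your lines with reuse set both ways. The numbers you get on your own data are what matter, not anyone's guess here.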

The first two above are pretty easy to do though, so you might try them first.

-ner
 
Thanks for the advice. I knew about garbage collection but did not know how to use it. In my larger datasets I am doing 10 million+ searches (at about 10-15 thousand per second). It is just taking too long, and I knew that there was a better way to manage the resources.
I will try out your suggestions and see if it helps, thanks again!
 