Xtreme .Net Talk

Posted

I am trying to search through a large set of text files (~25K-50K files) using regular expressions. I have a FileContentsArray that contains each line of a file, and a SearchExpression array that usually holds fewer than 10 regular expressions. The code works, but the memory usage is enormous: the program ties up all the system resources and takes forever to finish. I have isolated the memory usage to the second line below, where I declare the Regexp. Is there any way to avoid declaring it as a new object, or to purge the object each time? I have tried setting Regexp = Nothing, but the memory used remains the same. Any help is greatly appreciated.

 

For j = 0 To FileContentsArray.Count - 1
    Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
    If Regexp.Match(FileContentsArray.Item(j)).Success Then
        'add the match to a dataset
    End If
Next
Posted

Another thing I would like to add: if I comment out the If...Then block where the match is performed (leaving only the declaration of the Regex object), the memory usage is still high.

 

Posted

Don't open all the files at once(!).

 

I'm not at all familiar with VB, so here's an algorithm in pseudocode that should be slightly more efficient:

'build an array, MyFilenamesArray, of valid file names
'build an integer, sexpcount, equal to the last valid index value of your regular 
'   expression array

count = last valid index of MyFilenamesArray
for (iteration = 0..count) do
   InnerStreamReader = new sr (MyFilenamesArray.Item(iteration))
   InnerTextblock = read_to_end_of_file InnerStreamReader
   close InnerStreamReader
   for (rexp_iter = 0..sexpcount)
       Regexp = new Regex(SearchExpression.Item(rexp_iter))
       'regex work here
       'close/garbage collect regexp
   end_inner_for
   'close/garbage collect innertextblock
end_for
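
For anyone who wants the outline above in VB.NET, here is a minimal sketch under a few assumptions. The module and function names (FileSearchSketch, FindMatchingLines) are made up for illustration, and it assumes .NET 2.0 or later for generics and Using. It reads each file line by line rather than to the end of the file, to match the per-line search in the original question, and it builds the Regex objects once up front instead of inside the loop:

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Text.RegularExpressions

Module FileSearchSketch
    ' Hypothetical helper: returns every line in the given files that matches any pattern.
    Function FindMatchingLines(ByVal fileNames As IEnumerable(Of String), ByVal patterns As IEnumerable(Of String)) As List(Of String)
        ' Build each Regex once, before touching any file.
        Dim expressions As New List(Of Regex)
        For Each pattern As String In patterns
            expressions.Add(New Regex(pattern))
        Next

        Dim hits As New List(Of String)
        For Each fileName As String In fileNames
            ' Only one file is open at a time; Using closes the reader when the block ends.
            Using reader As New StreamReader(fileName)
                Dim line As String = reader.ReadLine()
                While line IsNot Nothing
                    For Each expression As Regex In expressions
                        If expression.IsMatch(line) Then
                            hits.Add(line)
                            Exit For ' record the line once, even if several patterns match
                        End If
                    Next
                    line = reader.ReadLine()
                End While
            End Using
        Next
        Return hits
    End Function
End Module
```

Collecting hits in a List(Of String) is only a stand-in for "add the match to a dataset"; the real code would write to whatever structure it already uses.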

Posted

I'm not too familiar with regular expressions, but looking at your code I noticed that you declare a new Regex on every iteration. That's a lot of iterations, so of course you will generate a lot of objects. If it is not necessary to create a new one each time, reusing a single instance can save you oodles of memory.

 

I also don't know a ton about garbage collection, and I'm not sure whether it will occur in the middle of a function like that. I haven't the slightest idea whether this will work, but calling Application.DoEvents every now and then might give the app a chance to do garbage collection. With an operation this big, you want to call DoEvents periodically anyway so as not to hog 100% of the CPU. You could also call GC.Collect(). Be careful with that one, though: forcing a collection promotes any objects still in use into later generations, which can make memory management worse later in the app.

Posted

I see at least three things that can be made faster:

1. Use a For Each loop. A regular For loop can incur extra overhead vs. a For Each, as .NET may do a bounds check on each indexed access. The difference is small, but since a For Each is *easier*, you might as well use it:

Dim s As String
For Each s In FileContentsArray
   Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
   If Regexp.Match(s).Success Then
       'add the match to a dataset
   End If
Next

 

2. Create the Regex objects once, outside of the loop. Not sure what k is - I'm guessing you loop through the regular expressions? Below is a new version that just moves the Regex out of the loop. In your version, the runtime has to create the object on each iteration and the GC then has to collect it. That's a tremendous amount of overhead that you can avoid.

Dim s As String
Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
For Each s In FileContentsArray
   If Regexp.Match(s).Success Then
       'add the match to a dataset
   End If
Next

 

3. Put the match into a Match object so you can use it inside the If. This will help if you're testing for Success then getting the Match object into a variable to do something with the matches. You didn't show what you did with the data.

Dim s As String
Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
Dim m As Match
For Each s In FileContentsArray
   m = Regexp.Match(s)
   If m.Success Then
       'add the match to a dataset
       ds.Tables... = m.Groups("Col1").Value
   End If
Next
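
If k does turn out to be an index into the list of search expressions, points 1 and 2 can be combined roughly as follows. This is only a sketch: the names RegexReuseSketch and CountMatches are hypothetical, the inputs are assumed to be lists of strings, and counting matches per pattern stands in for whatever is done with the dataset. RegexOptions.Compiled is optional; it trades a slower construction for faster matching, which may or may not pay off here.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexReuseSketch
    ' Hypothetical helper: counts how many lines each pattern matches.
    Function CountMatches(ByVal lines As IList(Of String), ByVal patterns As IList(Of String)) As Integer()
        ' Construct one Regex per pattern, once, before the line loop starts.
        Dim expressions(patterns.Count - 1) As Regex
        For k As Integer = 0 To patterns.Count - 1
            expressions(k) = New Regex(patterns(k), RegexOptions.Compiled)
        Next

        ' counts(k) holds the number of lines matched by patterns(k).
        Dim counts(patterns.Count - 1) As Integer
        For Each s As String In lines
            For k As Integer = 0 To expressions.Length - 1
                If expressions(k).IsMatch(s) Then
                    counts(k) += 1
                End If
            Next
        Next
        Return counts
    End Function
End Module
```

With fewer than 10 expressions, this creates fewer than 10 Regex objects for the whole run instead of one per line per expression.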

 

There could be lots of other optimizations we'd recommend. The best thing to do is profile the code yourself and determine EXACTLY what the slowest part of the code is. If you don't have a profiling tool, some scattered Debug.Write statements should help. It's always better to measure first before diving into code changes where you merely think it's slow - it's just too easy to find out for sure.
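
As one way to measure without a profiler, the sketch below times the two approaches with System.Diagnostics.Stopwatch. Stopwatch is my substitution for the Debug.Write idea and assumes .NET 2.0 or later; the sample line, the pattern, and the iteration count are made up, and the numbers it prints will vary by machine.

```vbnet
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module TimingSketch
    Sub Main()
        Dim sampleLine As String = "order 12345 shipped" ' made-up sample data
        Dim pattern As String = "\d+"                    ' made-up pattern
        Dim iterations As Integer = 100000
        Dim hits As Integer = 0

        ' Variant 1: construct a new Regex on every iteration.
        Dim watch As Stopwatch = Stopwatch.StartNew()
        For i As Integer = 1 To iterations
            Dim perIteration As New Regex(pattern)
            If perIteration.IsMatch(sampleLine) Then hits += 1
        Next
        watch.Stop()
        Console.WriteLine("New Regex per iteration: {0} ms", watch.ElapsedMilliseconds)

        ' Variant 2: construct the Regex once and reuse it.
        watch = Stopwatch.StartNew()
        Dim reused As New Regex(pattern)
        For i As Integer = 1 To iterations
            If reused.IsMatch(sampleLine) Then hits += 1
        Next
        watch.Stop()
        Console.WriteLine("Reused Regex: {0} ms", watch.ElapsedMilliseconds)
        Console.WriteLine("Total matches: {0}", hits)
    End Sub
End Module
```

Running the same kind of comparison against real files and real expressions is what will show whether Regex construction, file reading, or the dataset inserts dominate.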

 

The first two above are pretty easy to do though, so you might try them first.

 

-ner

Posted

Thanks for the advice. I knew about garbage collection but did not know how to use it. In my larger datasets I am doing 10 million+ searches (at about 10-15 thousand per second). It is just taking too long, and I knew that there had to be a better way to manage the resources.

I will try out your suggestions and see if it helps, thanks again!
