mattsrebatespam Posted February 15, 2005

I am trying to search through a large set of text files (~25,000-50,000 files) using regular expressions. I have a FileContentsArray that contains each line of the file, and a SearchExpression array that usually holds fewer than 10 regular expressions. The code works, but the memory usage is enormous: the program ties up all the system resources and takes forever to finish. I have managed to isolate the huge memory usage to the second line, where I declare the Regex. Is there any way to avoid declaring it as a new object, or to purge the object each time? I have tried setting Regexp = Nothing, but the memory used remains the same. Any help is greatly appreciated.

```vb
For j = 0 To FileContentsArray.Count - 1
    Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
    If Regexp.Match(FileContentsArray.Item(j)).Success Then
        'add the match to a dataset
    End If
Next
```
mattsrebatespam (Author) Posted February 16, 2005

One thing I would like to add: even if I comment out the If...Then block where the match is performed (leaving only the declaration of the Regex object), the memory usage is still high.
seve7_wa Posted February 23, 2005

Don't open all the files at once(!). I'm not at all familiar with VB, so here's an algorithm that should be slightly more efficient:

```
' build an array, MyFilenamesArray, of valid file names
' build an integer, sexpcount, equal to the last valid index value of your
' regular-expression array

count = index of MyFilenamesArray - 1
for (iteration = 0..count) do
    InnerStreamReader = new StreamReader(MyFilenamesArray.Item(iteration))
    InnerTextblock = read_to_end_of_file InnerStreamReader
    close InnerStreamReader
    for (rexp_iter = 0..sexpcount)
        Regexp = new Regex(SearchExpression.Item(rexp_iter))
        'regex work here
        'close/garbage collect Regexp
    end_inner_for
    'close/garbage collect InnerTextblock
end_for
```
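The pseudocode above can be sketched in VB.NET. This is a minimal, hypothetical sketch, not the original poster's code: the parameter names (fileNames, patterns) are made up, and the key points are that each Regex is built once before the file loop, and only one file is held in memory at a time.

```vb
Imports System.IO
Imports System.Text.RegularExpressions

Module Searcher
    Sub SearchFiles(ByVal fileNames As List(Of String), ByVal patterns As List(Of String))
        ' Build each Regex once, up front, instead of once per line.
        Dim regexes As New List(Of Regex)
        For Each pattern As String In patterns
            regexes.Add(New Regex(pattern))
        Next

        ' Open one file at a time so only that file's text is in memory.
        For Each fileName As String In fileNames
            Using reader As New StreamReader(fileName)
                Dim line As String = reader.ReadLine()
                While line IsNot Nothing
                    For Each re As Regex In regexes
                        Dim m As Match = re.Match(line)
                        If m.Success Then
                            'add the match to a dataset
                        End If
                    Next
                    line = reader.ReadLine()
                End While
            End Using
        Next
    End Sub
End Module
```

Reading line by line with ReadLine, rather than ReadToEnd, also keeps the working set small when individual files are large.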
Leaders snarfblam Posted February 24, 2005

I'm not too familiar with regular expressions, but looking at your code I noticed that you declare a new Regex on every iteration. That's a lot of iterations, so of course you will generate a lot of objects. If it is not necessary to create a new one each time, avoiding that can save you oodles of memory. I also don't know a ton about garbage collection, but I'm not sure that it will occur in the middle of a function like that. I haven't the slightest idea whether this will work, but calling DoEvents every now and then might give the app a chance to do garbage collection; with an operation this big you want to call DoEvents periodically anyway, just so you don't hog 100% of the CPU. You could also call GC.Collect(). Be careful with that one, though: forcing a collection can make memory management worse later in the app, because any objects still in use get promoted into later generations.
*Experts* Nerseus Posted February 24, 2005

I see at least three things that can be made faster:

1. Use a For Each loop. A regular For loop incurs extra overhead versus For Each, as .NET will do an array bounds check on each iteration. This is small, but since For Each is *easier*, you might as well use it:

```vb
Dim s As String
For Each s In FileContentsArray
    Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
    If Regexp.Match(s).Success Then
        'add the match to a dataset
    End If
Next
```

2. Create the Regex objects once, outside of the loop. I'm not sure what k is; I'm guessing you loop through the regular expressions somewhere. Below is a version that just moves the Regex out. Otherwise the runtime has to create the object on each iteration and then garbage-collect it, which is a tremendous amount of overhead you can avoid:

```vb
Dim s As String
Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
For Each s In FileContentsArray
    If Regexp.Match(s).Success Then
        'add the match to a dataset
    End If
Next
```

3. Put the match into a Match object so you can use it inside the If. This will help if you're testing for Success and then getting the Match object into a variable to do something with the matches. You didn't show what you do with the data:

```vb
Dim s As String
Dim Regexp As Regex = New Regex(SearchExpression.Item(k))
Dim m As Match
For Each s In FileContentsArray
    m = Regexp.Match(s)
    If m.Success Then
        'add the match to a dataset
        ds.Tables... = m.Groups("Col1").Value
    End If
Next
```

There could be lots of other optimizations we'd recommend. The best thing to do is profile the code yourself and determine EXACTLY what the slowest part of the code is. If you don't have a tool, some scattered Debug.Write statements should help. It's always better to analyze first before diving into code changes where you *think* it's slow; it's just too easy to find out for sure. The first two above are pretty easy to do, though, so you might try them first.
-ner
"I want to stand as close to the edge as I can without going over. Out on the edge you see all the kinds of things you can't see from the center." - Kurt Vonnegut
mattsrebatespam (Author) Posted February 24, 2005

Thanks for the advice. I knew about garbage collection but did not know how to use it. In my larger datasets I am doing 10 million+ searches (at about 10-15 thousand per second). It is just taking too long, and I knew there had to be a better way to manage the resources. I will try out your suggestions and see if they help. Thanks again!
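One more lever worth noting for a workload of 10 million+ matches (this was not raised in the thread, so treat it as a hedged aside): the .NET Regex constructor accepts RegexOptions.Compiled, which compiles the pattern to IL at a higher one-time construction cost in exchange for faster matching afterwards. With fewer than ten patterns reused millions of times, that trade-off usually pays off. A minimal sketch, with a made-up pattern and input:

```vb
Imports System.Text.RegularExpressions

Module CompiledExample
    Sub Main()
        ' One-time cost: the pattern is compiled to IL instead of interpreted.
        Dim re As New Regex("\brebate\b", RegexOptions.Compiled Or RegexOptions.IgnoreCase)

        ' Each subsequent Match call on this instance is then cheaper,
        ' which matters when the same pattern runs millions of times.
        If re.Match("Mail-in REBATE forms enclosed").Success Then
            Console.WriteLine("matched")
        End If
    End Sub
End Module
```

Because compilation itself is expensive, this only helps when the Regex instance is created once and reused, which is exactly the restructuring suggested above.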