Thursday, January 28, 2010

Alfresco is taking too much memory: Memory Leak

I was hit hard by a probable memory leak in Alfresco. This post is about how I tracked down the issue and found a solution. Sometimes we need to go the extra mile to get there.

I was working on a wrapper on top of Alfresco 3.1 Stable Edition. We wrote a plug-in (a webscript) for Alfresco. Ours is a J2EE app which talks to Alfresco through this plug-in. Our app and the Alfresco WAR are both deployed in JBoss 4.2. I was hit by high memory usage, which could possibly be a memory leak/issue in Alfresco.

Our app extends the folder type to include around 20 additional metadata properties. The 'file' (content) type has also been extended to include roughly 20 metadata properties.

So, we have defined our own custom folder and file types.

My system configuration was:
A 64-bit JVM 1.6 running on Windows Server 2008 R2, on a Core 2 machine with 6 GB of physical RAM. SQL Server 2008 is the DB. I was using JMeter to drive our app, which in turn talks to our Alfresco plug-in through webscripts. As recommended by Alfresco, I had given Alfresco a 2 GB heap.
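(For JBoss 4.2 on Windows this comes down to the JAVA_OPTS line in bin\run.bat; the exact values below are illustrative only, not official Alfresco guidance:)

rem bin\run.bat -- illustrative heap settings
set JAVA_OPTS=%JAVA_OPTS% -Xms512m -Xmx2048m -XX:MaxPermSize=256m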

In the following discussion, 'memory' means old gen memory only.

I attached JConsole to the JVM running JBoss. Initial memory consumption was 160 MB. I added 1000 custom folders and a single XML file in each folder. The files differ only by a few strings (in content and name) and are very small (0.6 KB). Memory consumption started to increase, went up to 1.5 GB, and settled down at around 500 MB. By 'settled down' I mean I triggered GC manually quite a few times from JConsole. I added 1000 more folders, with one similar file in each, and the memory settled at 800 MB. I added the same amount of data again, and it settled at 1 GB. More of the same data, and it settled at 1.16 GB. As I said earlier, all files added so far differ only by a few strings (in content and name).

I took a heap dump and analyzed it with 'jhat' (the 'show instance counts for all classes, excluding platform' view). There were objects of many different classes, but what caught my attention was the large number of objects from classes related to Lucene indexing. Should they be there? There were objects of lucene.index.FieldInfo, lucene.index.Term, lucene.index.TermInfo, lucene.index.TermBuffer, lucene.store.RAMFile, lucene.store.RAMDirectory, lucene.index.CompoundFileReader$CSIndexInput, lucene.store.RAMInputStream, etc., with the counts for some of these classes reaching 200,000. Is Lucene eating up the memory? Another user has also reported a similar problem: http://forums.alfresco.com/en/viewtopic.php?f=9&t=18791&p=61675&hilit=leak#p61675. The wiki has a page on a similar problem: http://wiki.alfresco.com/wiki/Lucene_Extensions_and_Issues, but there is no update on it (whether the issue has been fixed in 3.1). I have scanned my app's code thoroughly and am fairly confident there is no memory leak on our side. The issue seems to be on the Alfresco side.
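In case you want to do the same: a heap dump can be taken with jmap and browsed with jhat (both ship with the JDK), roughly like this, with the PID and file name being placeholders:

jmap -dump:format=b,file=alfresco-heap.bin <pid-of-jboss>
jhat -J-Xmx2g alfresco-heap.bin

jhat then serves the dump on http://localhost:7000, and its 'Show instance counts for all classes (excluding platform)' link is the view I am referring to.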

I ran my test case again. This time I added around 3000 folders and one file in each folder. The JVM started with around 200 MB of memory, and after I was done adding content it settled at 1050 MB. The 'jhat' heap dump (instance counts of all classes, excluding platform) showed more than 1,000,000 (1 million) objects from Lucene alone; here is the breakdown:

277571 instances of class org.apache.lucene.index.FieldInfo
237693 instances of class org.apache.lucene.index.CompoundFileReader$CSIndexInput
174082 instances of class org.apache.lucene.index.SegmentReader$Norm
102453 instances of class org.apache.lucene.index.TermBuffer
77363 instances of class org.apache.lucene.index.Term
55351 instances of class org.apache.lucene.index.TermInfo
42776 instances of class org.apache.lucene.index.CompoundFileReader$FileEntry
35981 instances of class org.apache.lucene.index.SegmentTermEnum
15373 instances of class org.apache.lucene.store.RAMFile
9813 instances of class org.alfresco.repo.search.impl.lucene.index.ReferenceCountingReadOnlyIndexReaderFactory$ReferenceCountingReadOnlyIndexReader$MultipleValueFieldAccessor
9102 instances of class org.apache.lucene.index.FieldsReader$FieldForMerge
6542 instances of class org.alfresco.repo.search.impl.lucene.index.ReferenceCountingReadOnlyIndexReaderFactory$ReferenceCountingReadOnlyIndexReader$SingleValueFieldAccessor
5828 instances of class org.apache.lucene.store.RAMInputStream
5789 instances of class org.apache.lucene.index.FieldInfos
5380 instances of class org.apache.lucene.index.SegmentInfo
5376 instances of class org.apache.lucene.index.FieldsReader
5376 instances of class org.apache.lucene.index.SegmentReader
5376 instances of class org.apache.lucene.index.TermInfosReader
5375 instances of class org.apache.lucene.index.CompoundFileReader
5359 instances of class [Lorg.apache.lucene.index.Term;
5359 instances of class [Lorg.apache.lucene.index.TermInfo;
3698 instances of class org.apache.lucene.store.RAMDirectory
3698 instances of class org.apache.lucene.store.SingleInstanceLockFactory
3610 instances of class org.apache.lucene.index.SegmentInfos
3271 instances of class org.alfresco.repo.search.impl.lucene.index.ReferenceCountingReadOnlyIndexReaderFactory$ReferenceCountingReadOnlyIndexReader
3271 instances of class org.alfresco.repo.search.impl.lucene.index.ReferenceCountingReadOnlyIndexReaderFactory$ReferenceCountingReadOnlyIndexReader$ListFieldAccessor
3089 instances of class org.apache.lucene.store.FSDirectory
2529 instances of class [Lorg.apache.lucene.index.IndexReader;
2528 instances of class org.apache.lucene.index.MultiReader
2356 instances of class org.apache.lucene.util.BitVector
2132 instances of class org.apache.lucene.document.Field
691 instances of class org.apache.lucene.document.Document
...........

I was not even sure whether it was a leak or just high memory consumption. If it is not a leak, the old gen memory should stabilize somewhere. I kept adding folders and files, and memory usage kept increasing up to the point where the JVM ceased to respond, as it was busy doing full GCs only, the old gen being at its maximum. Even then I did not know whether it was a leak, as I had not tested with 4 GB or 8 GB of heap assigned to the JVM (I did not have the resources for that either). So, for me, with a 2 GB heap, it was a leak.

My next step was an experiment. I did a fresh install of Alfresco on a separate machine and wrote a Java program to add 100 (plain) folders and 1000 (plain) files in each folder, i.e. a total of 100,000 files (a tenth of a million), and that too on a JVM to which I had allocated only 1 GB of heap. I did not deploy my app on that JVM. Again, each file was an XML file, and any two files' contents differed only by a GUID. The Java program used the Web Services interface to call the Alfresco APIs for adding files and folders (a sketch of it follows this paragraph). When I started the JVM, the old gen memory was 90 MB. I ran the program and it completed successfully. The old gen memory went up to 600 MB but settled at 240 MB. What did it mean? It meant to me that something was wrong; what exactly, I did not know, and I had no clue at all. Alfresco itself was clearly working fine, as I was able to add 100,000 files easily, whereas in my app's case it was taking too much memory for just 4000 files. The differences between the two cases were: a) in my app's case, I was also creating one folder per file; b) in my app's case, I was communicating with Alfresco through our plug-in (webscript), while in the experiment I was using the Web Services API; c) in my app's case, everything was done in a transaction (inside our plug-in, each request is serviced within a transaction); d) in my app's case, we have additional metadata and aspects on files and folders.
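For reference, my experiment program was along these lines (a rough sketch in the style of the Alfresco 3.x Web Services client samples; the endpoint, credentials and names here are placeholders, and the exact signatures may differ slightly in your SDK version):

import org.alfresco.webservice.repository.RepositoryServiceSoapBindingStub;
import org.alfresco.webservice.types.CML;
import org.alfresco.webservice.types.CMLCreate;
import org.alfresco.webservice.types.ContentFormat;
import org.alfresco.webservice.types.NamedValue;
import org.alfresco.webservice.types.ParentReference;
import org.alfresco.webservice.types.Reference;
import org.alfresco.webservice.types.Store;
import org.alfresco.webservice.util.AuthenticationUtils;
import org.alfresco.webservice.util.Constants;
import org.alfresco.webservice.util.Utils;
import org.alfresco.webservice.util.WebServiceFactory;

public class BulkAddExperiment {
    public static void main(String[] args) throws Exception {
        // point the client at the repository and log in (placeholders)
        WebServiceFactory.setEndpointAddress("http://localhost:8080/alfresco/api");
        AuthenticationUtils.startSession("admin", "admin");
        try {
            RepositoryServiceSoapBindingStub repo = WebServiceFactory.getRepositoryService();
            Store store = new Store(Constants.WORKSPACE_STORE, "SpacesStore");

            for (int f = 0; f < 100; f++) {
                // create a plain cm:folder under Company Home
                String folderName = "folder-" + f;
                ParentReference companyHome = new ParentReference(store, null, "/app:company_home",
                        Constants.ASSOC_CONTAINS,
                        Constants.createQNameString(Constants.NAMESPACE_CONTENT_MODEL, folderName));
                NamedValue[] folderProps = { Utils.createNamedValue(Constants.PROP_NAME, folderName) };
                CML folderCml = new CML();
                folderCml.setCreate(new CMLCreate[] { new CMLCreate("1", companyHome, null, null, null,
                        Constants.TYPE_FOLDER, folderProps) });
                Reference folderRef = repo.update(folderCml)[0].getDestination();

                for (int i = 0; i < 1000; i++) {
                    // create a plain cm:content node inside the folder ...
                    String fileName = "file-" + i + ".xml";
                    ParentReference parent = new ParentReference(store, folderRef.getUuid(), null,
                            Constants.ASSOC_CONTAINS,
                            Constants.createQNameString(Constants.NAMESPACE_CONTENT_MODEL, fileName));
                    NamedValue[] fileProps = { Utils.createNamedValue(Constants.PROP_NAME, fileName) };
                    CML fileCml = new CML();
                    fileCml.setCreate(new CMLCreate[] { new CMLCreate("1", parent, null, null, null,
                            Constants.TYPE_CONTENT, fileProps) });
                    Reference fileRef = repo.update(fileCml)[0].getDestination();

                    // ... and write a tiny XML payload that differs only by a GUID
                    String xml = "<doc><id>" + java.util.UUID.randomUUID() + "</id></doc>";
                    WebServiceFactory.getContentService().write(fileRef, Constants.PROP_CONTENT,
                            xml.getBytes("UTF-8"), new ContentFormat("text/xml", "UTF-8"));
                }
            }
        } finally {
            AuthenticationUtils.endSession();
        }
    }
}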

I analyzed the heap dump of the experimental JVM and found the instance counts of Lucene objects to be as below:


4279 instances of class org.apache.lucene.index.Term
4171 instances of class org.apache.lucene.index.TermInfo
1723 instances of class org.apache.lucene.index.FieldInfo
533 instances of class org.apache.lucene.index.CompoundFileReader$CSIndexInput
373 instances of class org.apache.lucene.index.TermBuffer
302 instances of class org.apache.lucene.index.SegmentReader$Norm
234 instances of class org.apache.lucene.store.RAMFile
85 instances of class org.apache.lucene.store.RAMDirectory
63 instances of class org.apache.lucene.index.FieldInfos
...........

I don't know why there is this difference, but one thing was clear to me: it is something related to indexing/Lucene.

I now used the NetBeans profiler. It took me quite some time to get it working with my JVM.
Here is the initial snapshot:

This time I added just 10 custom folders and files. NetBeans showed that after the folders and files had been added, some amount of memory was retained (even after running GC a few times). Much of this retained memory was attributed to byte[].
Here is how it looked:



I looked at the stack trace of the byte[] allocations:



So it looked to me like mergeIndexes() should be investigated further. This was inside Alfresco. I opened the Alfresco source code and (statically) traversed the call chain upwards to the buildReferenceCountingIndexReader function. Inside that function, what caused a new RAMDirectory to be created was the check size > config.getMaxDocsForInMemoryMerge() returning false; in other words, getMaxDocsForInMemoryMerge was returning a high value. I searched for the string "maxDocsForInMemoryMerge" in the Alfresco source code and indeed found a variable of that name, default-initialized to 10000. I now wanted to set it to some small value (around 100), but I did not find any documentation for it in the Lucene configuration section (http://wiki.alfresco.com/wiki/Full-Text_Search_Configuration), so I thought of rebuilding the class with the initializer set to 100 rather than 10000. My source tree would not build, though; there were many errors. Then I thought of simply googling the string "maxDocsForInMemoryMerge", and I found exactly what I was looking for: http://issues.alfresco.com/jira/browse/ETHREEOH-1147
There is a memory issue in Alfresco and the link gives the workaround.
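To make the effect of that check concrete, here is a simplified, hypothetical sketch (not the actual Alfresco source) of the choice inside buildReferenceCountingIndexReader: any segment at or below the threshold gets copied into a heap-backed RAMDirectory, which is exactly where all those RAMFile, RAMInputStream and CompoundFileReader objects in my heap dumps come from.

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical illustration only -- not the Alfresco source code.
class InMemoryMergeSketch {
    static IndexReader openSegment(File location, int docCount, int maxDocsForInMemoryMerge)
            throws IOException {
        Directory onDisk = FSDirectory.getDirectory(location, false);
        Directory directory = (docCount > maxDocsForInMemoryMerge)
                ? onDisk                    // big segment: read it straight from disk
                : new RAMDirectory(onDisk); // small segment: copy the whole thing into the heap
        return IndexReader.open(directory);
    }
}

With the default of 10000, practically every segment falls into the in-memory branch; with maxDocsForInMemoryMerge=0, every non-empty segment stays on disk.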

So here is the workaround:

For a default Alfresco installation, add lucene.indexer.maxDocsForInMemoryMerge=0 to the file tomcat/webapps/alfresco/WEB-INF/classes/alfresco/repository.properties. For a JBoss installation, add the line to the conf/alfresco/extension/repository.properties file. For example:
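# repository.properties
lucene.indexer.maxDocsForInMemoryMerge=0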

Also create a file named custom-lucene-indexer-context.xml in the same directory as above with the following content:


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<!-- Core and miscellaneous bean definitions -->
<beans>
    <bean id="admLuceneIndexerAndSearcherFactory"
          class="org.alfresco.repo.search.impl.lucene.ADMLuceneIndexerAndSearcherFactory">
        <property name="nodeService">
            <ref bean="mlAwareNodeService" />
        </property>
        <property name="tenantService">
            <ref bean="tenantService" />
        </property>
        <property name="dictionaryService">
            <ref bean="dictionaryService" />
        </property>
        <property name="nameSpaceService">
            <ref bean="namespaceService" />
        </property>
        <property name="indexRootLocation">
            <value>${dir.indexes}</value>
        </property>
        <property name="contentService">
            <ref bean="contentService" />
        </property>
        <property name="queryRegister">
            <ref bean="queryRegisterComponent" />
        </property>
        <property name="maxAtomicTransformationTime">
            <value>${lucene.maxAtomicTransformationTime}</value> <!-- milliseconds allowed for inline text transformation -->
        </property>
        <property name="queryMaxClauses">
            <value>${lucene.query.maxClauses}</value>
        </property>
        <property name="indexerBatchSize">
            <value>${lucene.indexer.batchSize}</value>
        </property>
        <property name="lockDirectory">
            <value>${dir.indexes.lock}</value>
        </property>
        <property name="indexerMaxFieldLength">
            <value>${lucene.indexer.maxFieldLength}</value>
        </property>
        <property name="writeLockTimeout">
            <value>${lucene.write.lock.timeout}</value>
        </property>
        <property name="commitLockTimeout">
            <value>${lucene.commit.lock.timeout}</value>
        </property>
        <property name="lockPollInterval">
            <value>${lucene.lock.poll.interval}</value>
        </property>
        <property name="defaultMLIndexAnalysisMode">
            <value>EXACT_LANGUAGE_AND_ALL</value>
        </property>
        <property name="defaultMLSearchAnalysisMode">
            <value>EXACT_LANGUAGE_AND_ALL</value>
        </property>
        <property name="threadPoolExecutor">
            <ref bean="indexThreadPoolExecutor"></ref>
        </property>
        <property name="bulkLoader">
            <ref bean="hibernateL1CacheBulkLoader"></ref>
        </property>
        <property name="maxDocsForInMemoryMerge">
            <value>0</value>
        </property>
    </bean>
</beans>


This should fix your memory leak problem (if you are suffering from the same problem).

I hope this fix doesn't break anything else.

Now, if only someone could figure out why this issue doesn't arise when Alfresco is load-tested with over a million documents.
To me, the following are the key differences between that test case and mine (in case someone can get a clue from them):
- we have derived our own types from the file and folder types
- we use a good amount of metadata (15-20 additional properties on each file/folder node)
- we call the Alfresco APIs through our webscript plug-in to Alfresco
- inside the webscript plug-in, we wrap each call in a transaction (sketched below)
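For completeness, the transaction wrapping in the last point is done roughly like this (a minimal sketch assuming Alfresco's standard RetryingTransactionHelper; the wrapper class and method names are placeholders, not our actual code):

import org.alfresco.repo.transaction.RetryingTransactionHelper.RetryingTransactionCallback;
import org.alfresco.service.transaction.TransactionService;

// Minimal sketch of how each webscript request is wrapped in a transaction.
// transactionService is injected by Spring; doWork() stands in for the real request handling.
public class TransactionalRequestWrapper {
    private TransactionService transactionService;

    public void setTransactionService(TransactionService transactionService) {
        this.transactionService = transactionService;
    }

    public void handleRequest(final Runnable doWork) {
        RetryingTransactionCallback<Object> callback = new RetryingTransactionCallback<Object>() {
            public Object execute() throws Throwable {
                doWork.run(); // create folder node, file node, set custom properties/aspects ...
                return null;
            }
        };
        // readOnly = false, requiresNew = true: each request gets its own transaction
        transactionService.getRetryingTransactionHelper().doInTransaction(callback, false, true);
    }
}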