Another Stranger Me

my comments, my projects, my resources…

Alfresco – Lucene index gone bad and storage considerations

Background

At IT Sistemi we have a couple of clients that have really big Alfresco repositories. We started working with Alfresco years ago, I believe our first version was 2.1. Due to keeping costs low for our clients we have always used the Community edition. Over the years we have become true experts in Alfresco internals, mainly due to the fact that we never had the access to the support that comes with the Enteprise edition.

Basically we had to learn all the pits of scaling Alfresco ourselves, often the hard way.

One of our clients uses Alfresco from the version 2.1 (upgraded to 3.0 Labs C, and then to 3.4.c) and over the years their repository became huge. To give you the sense of scale, they have about 2,5 M documents in their repository, alf_node_properties table in database has over 120M rows (MySQL database server). Customer indexes thousands of scanned documents into Alfresco each day to make a total of about 600 GB of documents ATM. And of course, everything needs to run smoothly. Entire setup runs on just two VMware machines (Alfresco server + MySQL server).

Performance issues

Over the years, as repository grew, also grew the number of performance issues. Since Alfresco was for a long time a black box for us at first we tackled these performance issues mainly by raising the JVM heap size which was fairly easy thing for us to do since we were on VMware. Also, most of the “server is stuck” issues could be solved with the simple service restart. But, about a year ago, running Alfresco 3.0 we started to receive a lot of complaints about the system speed.

Now, when this started happening our Alfresco server was running on 4 CPUs, RAID 10 for lucene, RAID 5 for data, 12 GB RAM. MySQL was also on RAID 10 disks. Looking at hardware, system should preform well. But problems continued. We were receiving either “search is slow” or “indexing is slow” tickets every couple of days, in ratio 5:1 for search.

We simply could not find out what is the problem, we would simply see a 100% CPU spike out of nowhere, with normal disk behavior and no bottlenecks than CPU. Now, problem with CPU is that you can’t just throw it more in like memory or disks as if you do that on VMware and you still end up using 100% CPU then you might spread the problem on other VMs. Problem had to be solved on the software side. But, where does one start after optimizing database, rebuilding lucene index, defragmenting all disks, optimizing JVM settings, and clearing all the junk from the repository?

Shit hits the fan

Things started to look really nasty. Users would click save after editing document properties and it would take 5-10min to complete the action. But finally we saw the other symptom other than the CPU – average disk queue on lucene disks was over 120 (everything above 2 indicates that disks can’t keep up). But we are speaking about RAID 10 here. How can this be? Is something else on VMware that runs on same disks killing it?

To isolate possible VMware problems with disk we came up with another nasty strategy to give us some breathing space. We ended up halving the JVM heap size and using a great little program Imdisk to create a RAM disk and we put the lucene index on it. This was safe to do as Alfresco backs up lucene index every night so in case of the unexpected shutdown we would always be able to quickly bring back the index without the need for full index rebuild. Nasty, ugly, silly but it worked… Unfortunately, only for a couple of days.

Users started to complain again. We couldn’t believe it – even the RAM disk couldn’t keep up with the IO demand of the lucene index.

Light at the end of the tunnel

While struggling with the production environment we were also looking at our options. Searching high and low on Google we ended up on this article: http://onjava.com/onjava/2003/03/05/lucene.html. So we tested on the customers servers using IndexTuningDemo class and using different MergeFactor values. We could see some performance improvements but again it turned out to be a dead end. All the cool properties to change to optimize the indexing were not available on this Alfresco version. And then it dawned on us and we used the IndexTuningDemo class with lucene JAR-s from Alfresco version 3.4.c. Results were amazing:

C:\temp\itsistemi – lucene – 3.4.c>”C:\Program Files\Java\jdk1.6.0_11\bin\java” -Xmx1536M -Xms1024M IndexTuningDemo 10000 10 100000
C:\DOCUME~1\dms\LOCALS~1\Temp\4\\index
Total time: 1156 ms

C:\temp\itsistemi – lucene>”C:\Program Files\Java\jdk1.6.0_11\bin\java” -Xmx1536M -Xms1024M IndexTuningDemo 10000 10 100000
C:\DOCUME~1\dms\LOCALS~1\Temp\4\\index
Total time: 77424 ms

Yup, that’s right. Approximately 6.700% performance boost in indexing speed. Upgrade was the path to go no matter how much we dislike upgrading repository with 2.4M documents inside.

Getting mad

So we’ve upgraded the repository, got rid of the RAM disk, did another full index rebuild just in case, and life was good again…NOT. Only one week passed and the enemy was at the gates again. At this time I was cursing Alfresco, Lucene and everyone that built them. CPU again at 100%, now with some ugly blocking, disks are again slow.

So I fired up the HotThreads to see what’s going on again and guess what – lucene index:

uto 02.08.2011
09:06

106.3% CPU Usage by Thread ‘indexThread7’
4/10 snapshots sharing following 19 elements
java.io.RandomAccessFile.writeBytes(Native Method)
java.io.RandomAccessFile.write(RandomAccessFile.java:482)
org.apache.lucene.store.FSIndexOutput.flushBuffer(FSDirectory.java:588)
org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85)
org.apache.lucene.store.BufferedIndexOutput.writeByte(BufferedIndexOutput.java:35)
org.apache.lucene.store.IndexOutput.writeChars(IndexOutput.java:106)
org.apache.lucene.index.TermInfosWriter.writeTerm(TermInfosWriter.java:133)
org.apache.lucene.index.TermInfosWriter.add(TermInfosWriter.java:108)
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:322)
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:289)
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:253)
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:1247)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.mergeIndexes(IndexInfo.java:3460)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.runImpl(IndexInfo.java:2875)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$AbstractSchedulable.run(IndexInfo.java:2711)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
3/10 snapshots sharing following 21 elements
java.io.RandomAccessFile.readBytes(Native Method)
java.io.RandomAccessFile.read(RandomAccessFile.java:338)
org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:537)
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
org.apache.lucene.store.IndexInput.readChars(IndexInput.java:107)
org.apache.lucene.index.TermBuffer.read(TermBuffer.java:67)
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:118)
org.apache.lucene.index.SegmentMergeInfo.next(SegmentMergeInfo.java:65)
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:293)
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:253)
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:1247)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.mergeIndexes(IndexInfo.java:3460)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.runImpl(IndexInfo.java:2875)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$AbstractSchedulable.run(IndexInfo.java:2711)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
2/10 snapshots sharing following 10 elements
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:289)
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:253)
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:1247)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.mergeIndexes(IndexInfo.java:3460)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.runImpl(IndexInfo.java:2875)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$AbstractSchedulable.run(IndexInfo.java:2711)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
Unique snapshot
java.io.RandomAccessFile.readBytes(Native Method)
java.io.RandomAccessFile.read(RandomAccessFile.java:338)
org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:537)
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
org.apache.lucene.store.IndexInput.readVLong(IndexInput.java:77)
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:121)
org.apache.lucene.index.SegmentMergeInfo.next(SegmentMergeInfo.java:65)
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:293)
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:253)
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:1247)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.mergeIndexes(IndexInfo.java:3460)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.runImpl(IndexInfo.java:2875)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$AbstractSchedulable.run(IndexInfo.java:2711)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

20.3% CPU Usage by Thread ‘RMI TCP Connection(1287)-10.211.0.116’
10/10 snapshots sharing following 32 elements
sun.management.ThreadImpl.getThreadInfo1(Native Method)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:154)
sun.reflect.GeneratedMethodAccessor412.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:167)
com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:96)
com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:33)
com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
javax.management.StandardMBean.invoke(StandardMBean.java:391)
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
sun.reflect.GeneratedMethodAccessor599.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
sun.rmi.transport.Transport$1.run(Transport.java:159)
java.security.AccessController.doPrivileged(Native Method)
sun.rmi.transport.Transport.serviceCall(Transport.java:155)
sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

1.6% CPU Usage by Thread ‘Store org.alfresco.cache.ticketsCache Spool Thread’
10/10 snapshots sharing following 4 elements
java.lang.Thread.sleep(Native Method)
net.sf.ehcache.store.DiskStore.spoolAndExpiryThreadMain(DiskStore.java:588)
net.sf.ehcache.store.DiskStore.access$800(DiskStore.java:64)
net.sf.ehcache.store.DiskStore$SpoolAndExpiryThread.run(DiskStore.java:1074)
uto 02.08.2011
09:28

109.4% CPU Usage by Thread ‘indexThread2’
4/10 snapshots sharing following 14 elements
java.io.RandomAccessFile.readBytes(Native Method)
java.io.RandomAccessFile.read(RandomAccessFile.java:338)
org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:537)
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java:211)
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:169)
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:153)
org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:1280)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.mergeIndexes(IndexInfo.java:3460)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.runImpl(IndexInfo.java:2875)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$AbstractSchedulable.run(IndexInfo.java:2711)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
6/10 snapshots sharing following 15 elements
java.io.RandomAccessFile.writeBytes(Native Method)
java.io.RandomAccessFile.write(RandomAccessFile.java:482)
org.apache.lucene.store.FSIndexOutput.flushBuffer(FSDirectory.java:588)
org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:85)
org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:75)
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java:212)
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:169)
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java:153)
org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:1280)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.mergeIndexes(IndexInfo.java:3460)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$Merger.runImpl(IndexInfo.java:2875)
org.alfresco.repo.search.impl.lucene.index.IndexInfo$AbstractSchedulable.run(IndexInfo.java:2711)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

104.7% CPU Usage by Thread ‘Sess_W23_LSN24’
8/10 snapshots sharing following 4 elements
org.alfresco.jlan.netbios.win32.Win32NetBIOS.Receive(Native Method)
org.alfresco.jlan.smb.server.win32.Win32NetBIOSPacketHandler.readPacket(Win32NetBIOSPacketHandler.java:133)
org.alfresco.jlan.smb.server.SMBSrvSession.run(SMBSrvSession.java:1232)
java.lang.Thread.run(Thread.java:662)
2/10 snapshots sharing following 1 elements
java.lang.Thread.run(Thread.java:662)

20.3% CPU Usage by Thread ‘RMI TCP Connection(1323)-10.211.0.116’
10/10 snapshots sharing following 32 elements
sun.management.ThreadImpl.getThreadInfo1(Native Method)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:154)
sun.reflect.GeneratedMethodAccessor412.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:167)
com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:96)
com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:33)
com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
javax.management.StandardMBean.invoke(StandardMBean.java:391)
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
sun.reflect.GeneratedMethodAccessor612.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
sun.rmi.transport.Transport$1.run(Transport.java:159)
java.security.AccessController.doPrivileged(Native Method)
sun.rmi.transport.Transport.serviceCall(Transport.java:155)
sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)


Solution

Well, at least now we had all those cute lucene configuration options at our disposal. But which ones should we use? Whatever we tried before made no real improvements. And then finally, in the obscure universe of Alfresco documentation I found a real gem: http://wiki.alfresco.com/wiki/Index_Merging_Performance:

It is usualy a good practice that the highest-numbered INDEX entries (which contains the least documents, number 4 in the examples above) do not contain more than a few hundred documents. It it’s not the case, it could lead to massive amount of IO pressure on the index directories for merging operations.

Few hundred documents? Our highest-numbered INDEX entry had 250.000 documents in it!

Default value for parameter lucene.indexer.mergerTargetIndexCount in Alfresco is 5. So we changed that to 8 (you don’t want to go very high with this number as it affects search speed). And voila, after changing this our highest-numbered INDEX entry dropped to about 2.000 and performance was finally normal.

Summary

To summarize some best practices:

  • Greatest impact on user-experience in Alfresco is achieved by optimizing search/indexing behavior
  • Searching lucene is FAST no matter how many document you search
  • Checking permissions on documents found using lucene is SLOW
  • Considering the above, make sure your database is working as fast as possible
  • Use RAID 10 for database files, and for lucene index
  • Use SSD if possible
  • Make sure your lucene index is configured optimally, specifically mergerMergeFactor and mergerTargetIndexCount parameters
  • Alfresco content store can be on slow disks and it generally won’t impact user-experience
  • CPU is rarely a real problem
  • JVM heap size is rarely a real problem
  • Content transformations (for instance document preview) can be a resource hog – offload if possible
  • HotThreads is your friend
  • org.alfresco.repo.search.impl.lucene.index.IndexInfo is your friend

5 Comments

  1. You should write Alfresco’s manuals.

    This is the right way to explain, simply and readable.

    congrats!

  2. Thank you for this great article. It documents very well the many struggles one faces when dealing with Alfresco performance issues… You know there’s a solution, but where to look in the myriad of possible configurations.

  3. One more confirmation that all those fat layered Spring interlaced frameworks are recepies for disaster.
    Lucene is ok if you keep things simple, but Alfresco with Activiti BPM is ultimate crap. I have to support it and every day we have some kind of problems with it: starting with JDBC exceptions – “Your process has been chosen as deadlock victim” – ending with OutOfMemory Exception.

  4. hi
    we have a problem with alfresco community on weblogic
    if we load some documents alfresco/lucene lose index references and so our loading is going bad
    any experience about this?

    thank you
    maria teresa

  5. admin

    April 23, 2014 at 19:41

    If I recall correctly, we had a bad time even getting community to run on weblogic so we never really did much with that combination. I’m sorry but without more details on the process and error logs I can’t help you much.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

© 2024 Another Stranger Me

Theme by Anders NorenUp ↑