Our customers almost always have a business requirement that is both reasonable and a real performance killer: sorting search results by last modification time, or cm:modified in the Alfresco model.
Let's first explain how sorting in Lucene actually works. Imagine that we have 10 million documents in our index. Of these, 100,000 were updated within the last day. Searching for the subset of documents modified within the past day works quite fast. But when you add a sort by date to the query, things get dead slow.
This is because Lucene will first load all the different modification timestamps for the subset that matches the query, potentially all 100,000 of them if you were unfortunate enough to allow the ‘*’ wildcard, sort them, and only then complete the search for a subset of documents on that ordered list. And sorting a huge number of date strings sucks even with modern CPUs. When you add a couple of requests from multiple users at the same time, you can end up with a dead slow application until the sorting completes.
Colleagues even did a simple test search by noderef. This search always returns a single result. When they added sorting by any property, such as title, getting the result took twice as long.
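The sort-then-page behaviour described above can be illustrated with a toy model in plain Python. This is only a sketch and not how Lucene is implemented internally: the function name `sorted_page`, the document layout, and the generated timestamps are all made up for illustration. What it shows is that the number of sort values handled depends on how many documents match, not on how many results you actually display.

```python
from datetime import datetime, timedelta


def sorted_page(matches, page_size):
    """Toy model of sort-then-page.

    Returns the ids on the first page and the number of sort values
    that had to be loaded to produce it.
    """
    # One sort value (a date string) is loaded for every matching document.
    loaded = [(doc["modified"], doc["id"]) for doc in matches]
    # The whole list is sorted, newest first, before any paging happens.
    loaded.sort(reverse=True)
    page = [doc_id for _, doc_id in loaded[:page_size]]
    return page, len(loaded)


# Hypothetical data: 100,000 matching documents, one second apart.
base = datetime(2024, 1, 1)
matches = [
    {"id": i, "modified": (base + timedelta(seconds=i)).isoformat()}
    for i in range(100_000)
]

page, loaded = sorted_page(matches, page_size=10)
print(f"{len(page)} results shown, {loaded} sort values loaded")
```

Even though only ten results are returned, all 100,000 date strings are loaded and sorted first.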
There are several approaches to this problem:
1) Avoid sorting on a property that can have a huge number of distinct values. If you can, avoid sorting altogether.
2) With Alfresco, we found that whenever we change a document, a new row is inserted into the database. We also noticed that the Lucene index contains the node-dbid property, which is an integer, so we sort on the node-dbid field instead. While the results may not match the last modification datetime 100% in all cases, the result sets were the same in all our tests, and performance was much better because we are now sorting integers.
3) Implement your own metadata property that holds a timestamp, keep it updated, and use it for sorts.
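To see why approach 2 can give the same ordering as sorting by modification date, here is a small toy model in Python. It assumes, as we observed above, that every create or update gets a new, larger integer id. The names `save`, `dbid_sequence`, and the in-memory `index` dictionary are hypothetical stand-ins and have nothing to do with the actual Alfresco API or schema.

```python
from datetime import datetime, timedelta
from itertools import count

# Hypothetical stand-in for the database row id sequence.
dbid_sequence = count(1)
start = datetime(2024, 1, 1)


def save(index, name, tick):
    """Record a create or update: a new integer id plus a modification timestamp."""
    index[name] = {
        "dbid": next(dbid_sequence),
        "modified": (start + timedelta(minutes=tick)).isoformat(),
    }


index = {}
# Documents "a" and "b" are each saved twice; only the latest entry is kept.
for tick, name in enumerate(["a", "b", "c", "a", "d", "b"]):
    save(index, name, tick)

# Newest first, once by date string and once by integer id.
by_modified = sorted(index, key=lambda n: index[n]["modified"], reverse=True)
by_dbid = sorted(index, key=lambda n: index[n]["dbid"], reverse=True)

print(by_modified)
print(by_dbid)
```

Under that assumption both orderings agree, while the second one only compares integers. If ids were ever assigned out of modification order, the two lists would differ, which is why the results may not be 100% accurate in all cases.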