On-premise email archive indexing nightmares
Posted by Nick Mehta on Mon, Feb 09, 2009 @ 09:32 AM
One of the top reasons organizations deploy email archiving solutions is to find important email when they need it - whether for email discovery, email compliance or simply knowledge management.
Unfortunately, as the billions of dollars in CapEx that Google spends on its search infrastructure show, searching (and indexing, the process that makes searching possible) is easier said than done.
Many customers who try to deploy their own archives find that the indexes become corrupt (unusable), slow or, worse, inconsistent.
For example, witness this thread on a Google Groups forum about Autonomy's Zantaz EAS product:
We have a similar problem. We have been struggling for 8 months to build an idol index. We start building the index from scratch and everything runs at a reasonable pace initially and the idx files are processed. As the index grows the speed at which it processes the idx files slows down considerably, eventually it almost grinds to a halt.
Our vendor has tried various configurations for us over the last 9 months and we have still not succeeded in building a complete index. We have about 21 million docs to index and the best we get too is about 5 million docs indexed.
Quite honestly this product is not doing more for us other than reduce the size of our mailfiles. Even on the archiving side we continually experience cases where users are unable to retrieve archived mails. I could spend time on webbex's with our vendor trying to sort each of these issues out, but there are so many and my perception is that the support from autonomy is so poor that I do not waste my time anymore, I just restore from tape.
This isn't an issue with Autonomy per se. You'll find similar issues with nearly all on-premise products. The fact is that indexing technology is notoriously complex:
- You need to make sure that indexes have consistent access to high-speed storage.
- You need to make sure that index servers have appropriate RAM and RAM configuration.
- You need to keep adding indexing nodes to keep pace with unpredictable search volume, and troubleshooting performance problems along the way is really challenging.
- Even if you have it down to a science, you need to figure out how to handle the once-a-year HUGE search without always over-provisioning the system and wasting capacity the rest of the year.
- You need to diagnose missing or inconsistent results if you find them (and you will).
- You need to make sure you have full-time staff who can handle all of the issues above.
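To make the "missing or inconsistent results" point concrete, here is a minimal sketch of the kind of reconciliation check an administrator might run. It assumes you can export a list of message IDs from the archive store and another from the search index; the function name `find_index_gaps` and the sample IDs are hypothetical and do not correspond to any vendor's API.

```python
def find_index_gaps(archived_ids, indexed_ids):
    """Compare IDs held in the archive with IDs known to the index.

    Both arguments are plain iterables of message-ID strings
    (an assumption: real products expose these in different ways).
    """
    archived = set(archived_ids)
    indexed = set(indexed_ids)
    return {
        # Archived but not searchable: these messages will never show up in results.
        "missing_from_index": sorted(archived - indexed),
        # Searchable but not in the archive: hits that cannot be retrieved.
        "orphaned_in_index": sorted(indexed - archived),
    }


# Hypothetical sample data for illustration only.
report = find_index_gaps(
    ["msg-001", "msg-002", "msg-003", "msg-004"],
    ["msg-001", "msg-003", "msg-999"],
)
print(report)
```

A check like this only tells you that the archive and the index disagree; working out why, and rebuilding the affected portion of the index, is where the real effort goes.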
In the end, many customers are left like the one above - using the on-premise email archive for mailbox management but not getting the e-discovery benefits for which they originally bought the product.