In the Beginning, There Was the Word...
... and the word was written... and ever since the dawn of the internet, the written word has become a critical part of our lives... yes... that's right... email. So, as a domain registrar, Gandi provides email services for use with the domains registered with us. So far so good...
The original platform was designed to be horizontally scalable to support several tens or even hundreds of thousands of mailboxes. One of the challenges faced was an architecture that had scalable storage capability whilst at the same time allowing the user to access his or her mailbox irrespective of which storage system or access server he or she connected to. Sounds like a job for the good old Network File System (NFS). As a result, the original platform was based on an NFS storage infrastructure for the mailboxes.
Incoming mail was received on any one of a number of inbound spool servers running Postfix. Once the mail was received and passed through a number of anti-spam and other filters, the spool would then identify which storage filer contained the mailbox in question, and forward the mail using SMTP to the back-end storage server (itself also running Postfix) for local delivery.
When the user wanted to access his or her mailbox, he or she would connect to a front-end server running Dovecot. The access servers had "local" access to all of the mailboxes through the use of NFS mounts. The user would simply use a POP or IMAP client to connect to the server and access his or her mailbox.
Outgoing mail is quite simple; basically an SMTP relay using SASL authentication.
The following diagram shows a very high level overview of the original version of the mail platform.
So What Has Changed?
Okay - before we get to the "what" has changed - let's first look at "why" we had to change it.
The original platform was great and the architecture works very well for moderate traffic levels, irrespective of the number of mailboxes. The risk with scaling a platform based on the number of mailboxes is that it is easy to overlook, or in some cases misinterpret, the knock-on effects of that increased scale. As the amount of traffic increased over the years, from time to time the front-end access servers would have to contend for access to the NFS filesystems, which use a system of locks to avoid the corruption that may occur when there are multiple read/write operations on the same file or block.
This diagram outlines the average volumes for the past year. (note that the graph is not "stacked", so the elements are cumulative.) The vertical axis is "messages per minute", while the horizontal axis is by month.
As these locks increased over time (and remember that all of the servers had access across all of the filesystems), the result was a snowball effect that caused severe performance degradation of the whole platform -- and not only for the mailboxes on the storage server with the lock in place. During this time, users would attempt to connect to their mailboxes at which time the server would accept the connection and simply wait for the lock to free in order to access the mailbox.
So the challenges were simple:
- How to eliminate the need for NFS and still allow horizontal scalability.
- How to avoid impacting the entire mail platform in case of a difficulty on just one storage server, and how to minimise the impact to customers in this case.
- How to maximise the performance of the platform to allow vertical scalability as well as horizontal.
Since there is very little change on the incoming SMTP spool elements, and the majority of the load was associated with NFS, let's look straight at the access elements.
Where Is My Mailbox?
Okay, so the user connects to his mailbox with his mail client (Thunderbird, Outlook Express, Mail.app, Evolution, or anything else for that matter...). The client connection arrives on one of a number of front-end mail access servers running Dovecot. How, then, does the server know where to look to find the mailbox? Originally, the mailbox was "local" because it was mounted via NFS. Dovecot made a simple database lookup to determine the filesystem path that the mailbox was mounted under. With the new system, there is no NFS, so there is no "local" filesystem for Dovecot to look under.
This is where a very useful feature of Dovecot comes into play -- the proxy function. Using this, the front-end server performs the authentication of the user, checks which storage server the mailbox is located on, and then initiates a proxy "client" connection directly to the storage server, which is itself running Dovecot. If the client connects using IMAP, then the proxy connection is also IMAP. Similarly, if the client is using POP3, then the proxy is also POP3. The storage server does not need to re-authenticate the connection.
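The routing decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not our Dovecot configuration: the in-memory user table, the hostnames, and the function name route_login are all hypothetical stand-ins for the real database lookup, and the returned dictionary is just a convenient way to show what the front end needs to know before it opens the proxy connection.

```python
import hashlib
import hmac

# Hypothetical user table: mailbox -> (SHA-256 password hash, storage server).
# On the real platform this information lives in a database.
USERS = {
    "alice@example.org": (hashlib.sha256(b"s3cret").hexdigest(), "filer03.example.net"),
    "bob@example.org": (hashlib.sha256(b"hunter2").hexdigest(), "filer11.example.net"),
}


def route_login(user, password, protocol):
    """Authenticate on the front end, then say where to proxy the session.

    Returns None when authentication fails, otherwise a dict naming the
    storage server and the protocol to speak to it (same as the client's).
    """
    if protocol not in ("imap", "pop3"):
        raise ValueError(f"unsupported protocol: {protocol}")

    record = USERS.get(user)
    if record is None:
        return None

    stored_hash, storage_host = record
    supplied_hash = hashlib.sha256(password.encode()).hexdigest()
    if not hmac.compare_digest(stored_hash, supplied_hash):
        return None

    # The front end has already authenticated the user, so the storage
    # server does not need to do it again.
    return {"proxy": True, "host": storage_host, "protocol": protocol}


print(route_login("alice@example.org", "s3cret", "imap"))
```

The point of the design is that the front end never touches the mailbox itself: it only needs a hostname, so it has no filesystem to mount and nothing to lock.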
There are a few benefits of this architecture:
- Elimination of NFS also eliminates the side-effect of NFS locks.
- Since the back-end storage server actually has the mailbox physically locally attached, there is no contention on the filesystem, and no need for locks. Plus, since the storage arrays are high performance anyway, access to the mailbox is much faster.
- The front-end servers no longer have to perform local disk I/O operations, and thus consume considerably less CPU. (In fact, technically, there is no real reason for the front-end servers to even have disks of their own -- this could enable lower cost horizontal access scaling by being able to use diskless servers...)
What Happens if a Filer Breaks?
To respond to the other part of the challenge, and to limit the impact in case of component failure to as few customers as possible, the original concept of scaling storage to hold as many mailboxes as possible had to be discarded. After all, if a filer happened to fall over, all the mailboxes on that filer would also be offline.
So the idea here is to increase the number of storage servers, and spread the mailboxes more thinly across them. In this way, in case of a failure of the storage server, fewer mailboxes are affected.
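A quick worked example shows why spreading mailboxes more thinly helps. The sketch below assumes mailboxes are distributed evenly across the storage servers, which is a simplification; the function name affected_fraction and the filer counts of 4 and 8 are chosen purely for illustration.

```python
def affected_fraction(filers, failed=1):
    """Share of mailboxes offline when `failed` of `filers` servers are down,
    assuming mailboxes are spread evenly across all storage servers."""
    if filers <= 0 or not 0 <= failed <= filers:
        raise ValueError("need filers > 0 and 0 <= failed <= filers")
    return failed / filers


for count in (4, 8, 16):
    print(f"{count} filers: {affected_fraction(count):.2%} of mailboxes offline")
```

Under that assumption, losing one of 4 filers takes out a quarter of all mailboxes, while losing one of 16 takes out 6.25%.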
The second aspect to minimising the impact of a component failure is fairly simple as well. With the previous version of the platform (yes, remember the NFS locks?), a client connection to a mailbox would be answered by the access server and simply wait for the filesystem access. The effect for the user is that his client would just "sit there" and eventually time out.
Using the IMAP/POP3 proxy arrangement, if the actual storage server is down, the front-end server will reply immediately to the mail client with a "Temporarily Unavailable" message, and the TCP connection is closed.
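The fail-fast behaviour can be illustrated with a small Python sketch. It is not how Dovecot implements its proxy; the function name connect_backend and the two-second timeout are assumptions made for the example. The idea is simply that the front end either gets a connection to the storage server quickly or gives the client an immediate answer.

```python
import socket


def connect_backend(host, port, timeout=2.0):
    """Try to reach the storage server; never leave the client hanging.

    Returns (connection, None) on success, or (None, message) when the
    storage server cannot be reached within `timeout` seconds.
    """
    try:
        connection = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts alike.
        return None, "Temporarily Unavailable"
    return connection, None
```

Compare this with the old behaviour, where the access server accepted the connection and then waited on the filesystem for as long as the lock was held.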
The disk arrays themselves are, of course, redundant. The only real potential single point of failure is the server that controls the disk arrays: due to technical limitations of the disk arrays, it is not possible to have dual controllers when using split and mirrored RAID volumes across two disk arrays. It would have been possible if the volumes weren't mirrored across arrays, but this would have been more risky as there would be no "backup" copy of the data volume in case of an array failure... we thus considered that the single controller server is an acceptable risk provided a spare is available and can be easily swapped in. The following image depicts the mail storage solution.
But... I Didn't Notice the Migration
If you are in this category, then all I can say is "super!" -- that's what we intended. Though we did have one or two hiccups along the way, and a very small minority of customers noticed, at no point was data lost, jeopardised, or otherwise endangered :)
So, how did we do the migration? Well, over the course of a number of weeks, and mostly during off-peak times, our admins worked on one filer at a time, migrating all customer mailboxes to the new filer structure using rsync... several iterations of it, in fact. As soon as the last iteration had fully synchronised a mailbox, the database was immediately updated to reflect the new filer as the storage location of the mailbox. All new deliveries, access requests, etc., were then made to the new filer.
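The shape of that procedure, repeated synchronisation passes followed by an immediate switch of the stored location, can be sketched in Python as follows. This is not the script our admins used: the sync callable is a hypothetical wrapper around an rsync run that reports how many files it transferred, and the max_passes and small_enough thresholds are invented for the example.

```python
def migrate_mailbox(mailbox, old_filer, new_filer, sync, locations,
                    max_passes=5, small_enough=10):
    """Copy a mailbox to its new filer, then switch its recorded location.

    sync(mailbox, source, destination) is a hypothetical wrapper around an
    rsync run and returns the number of files it transferred.
    locations stands in for the database mapping mailbox -> filer.
    """
    # Bulk passes: repeat until only a handful of files still change.
    for _ in range(max_passes):
        transferred = sync(mailbox, old_filer, new_filer)
        if transferred <= small_enough:
            break

    # Final pass to fully synchronise, then update the location straight
    # away so new deliveries and logins go to the new filer.
    sync(mailbox, old_filer, new_filer)
    locations[mailbox] = new_filer


# Usage with a fake sync that transfers fewer files on every pass.
remaining = iter([1000, 50, 3, 0])
calls = []


def fake_sync(mailbox, source, destination):
    calls.append((mailbox, source, destination))
    return next(remaining)


locations = {"alice@example.org": "old-filer"}
migrate_mailbox("alice@example.org", "old-filer", "new-filer", fake_sync, locations)
print(locations, len(calls))
```

Because each pass only has to move what changed since the previous one, the final pass is short, which keeps the window between the last copy and the database update as small as possible.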
This process went on filer by filer over the course of a number of weeks. An interesting side-effect of this migration and the gradual removal of NFS from the architecture is the gradual reduction of average CPU load of the access servers over the migration period, as can be seen in the graphs below. Of course now the CPU load is pretty much negligible since NFS has been eliminated. The two graphs are relative scale based on the average of the time period. The first graph is the past six months, while the second graph is the average over the past five weeks.
Some Interesting Figures
Here are just a few interesting facts about the Gandi Mail platform.
- Average 60 million emails per day via the SMTP incoming spools and outgoing relays.
- 8 outgoing SMTP relays
- 8 incoming SMTP spools
- 10 mail filters (anti-spam, etc.)
- 7 front-end access servers (POP3 / IMAP)
- 16 mailbox storage filers
- 4 database servers (2 read plus 2 master with replication)
- Hardware distributed among multiple datacentres
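To put the daily figure into the "messages per minute" terms used for the volume graph earlier, here is a back-of-the-envelope conversion in Python. It assumes traffic is spread perfectly evenly across the day, which real mail traffic is not, so treat the results as averages rather than peaks.

```python
EMAILS_PER_DAY = 60_000_000  # average quoted in the list above

per_minute = EMAILS_PER_DAY / (24 * 60)
per_second = EMAILS_PER_DAY / (24 * 60 * 60)

print(f"about {per_minute:,.0f} messages per minute")
print(f"about {per_second:,.0f} messages per second")
```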
Just a Quick Stats Update Three Days Later
Just wanted to add a quick update to the IO-Wait CPU load for the front-end servers now three days on. In the following graph showing the CPU load on the front-end access servers for the past seven days, you can see the significant difference before and after the final migration ;)
I hope that this article has given a useful insight into the new Gandi Mail platform.