Mirroring (Server Replication) FAQ

Mirroring the server - What is mirroring?

The SurgeMail 'Mirror' system allows you to link two systems together so that you can read or deliver email on either one, while both systems continually 'match' each other. This can be used in several ways:

  • Keep a live backup system for 'hot swapping' in case of failure or upgrade requirements on your live system.
  • Move a system from one geographic location to another with no downtime.

Mirroring works over a LAN or WAN connection and can be encrypted. Unlike a setup using shared NFS drives, a SurgeMail mirrored system has no single point of failure, so you have genuine failover capability.

 

Should I be setting up mirroring in my environment or is it a bad idea?

In almost all cases you should be running a mirror of your mail server; it's the cheapest and most efficient way to keep a live backup of your system. The only cases we can think of where you don't need a mirror are if:

  • You are running a home system, and you don't mind if you lose your mail folders.
  • You have a working daily backup of your mail store, and you are happy to lose up to 24 hours of email when your disk fails.
  • You are running a free service and you have warned your customers that the service may vanish 'at any time'.

Some people forget that disk drives fail. They do: your mail server's disk will fail approximately once in the next 2-3 years. Some people think RAID 5 or similar systems provide protection from disk failure. They do not. We've had so many customers lose RAID 5 arrays (and we've lost so many ourselves) that we actually consider them less reliable than non-RAID 5 disk arrays. (Speaking of which, where possible always use RAID 10, NOT RAID 5, for a mail server that needs high performance and reliability.)

Do not LOAD BALANCE mirrored servers

Many people think mirroring should be combined with a load balancer, as you would do with a web server. This is NOT the case. A simple load balancer creates a serious risk when used with mirroring: if the mirroring fails even briefly and a user accesses both servers during that time, new messages on each server could be assigned identical UID values. One of those messages will then be invisible, and lost, to the end user.

To avoid this and still have a fault-redundant system you can do any of the following:

  • Only deliver new mail to one host,
  • or only fail over rather than load balance,
  • or make the load balancer balance on incoming IP address where it can, so it doesn't randomly move users between hosts.
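
The UID collision described above can be illustrated with a small simulation. This is only a sketch in Python, not SurgeMail code: the Mailbox class and its counter-based UID assignment are assumptions made for illustration, modelling each server handing out the next free UID for its own copy of a folder.

```python
class Mailbox:
    """Hypothetical model of one server's copy of a user's folder."""

    def __init__(self, next_uid=1):
        self.next_uid = next_uid
        self.messages = {}  # uid -> message text

    def deliver(self, text):
        # Each server independently hands out its next free UID.
        uid = self.next_uid
        self.messages[uid] = text
        self.next_uid += 1
        return uid


def colliding_uids(a, b):
    """Return UIDs that both copies assigned to different messages."""
    shared = a.messages.keys() & b.messages.keys()
    return sorted(uid for uid in shared if a.messages[uid] != b.messages[uid])


# Both copies are in sync, so the next UID on each is 42.
server_a = Mailbox(next_uid=42)
server_b = Mailbox(next_uid=42)

# The mirror link is down and a load balancer sends one delivery to each host.
server_a.deliver("invoice from supplier")
server_b.deliver("reply from customer")

print(colliding_uids(server_a, server_b))  # [42]
```

If both deliveries had gone to the same host (failover only, or a single delivery host), the two messages would have received different UIDs and nothing would collide once the mirror link returned.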

Authent modules that support mirroring.

In general nwauth is the only module that natively supports mirroring, but some other modules work when both servers can access a common back-end server (MySQL, LDAP, etc.). ntauth doesn't work because it relies on local files to fill in fields that are not available in the Windows database.

Module      Support?
nwauth      Yes
ntauth      No
mysqlauth   Yes, but both servers must point to the same MySQL database back end
ldapauth    Yes, but again both servers must typically point to the same LDAP back end

How do I turn it on?

In brief:

  1. Install surgemail on the mirror (slave) system.
  2. Copy surgemail.ini to the slave system and adjust any system-specific settings.
  3. Turn on the mirror settings on both servers: set the mode to "master" on the master and "slave" on the second system.
  4. If you are mirroring configuration, issue tellmail resync_config on the master system.
  5. Then issue tellmail resync_fast on the master system.
  6. You may also wish to issue tellmail resync_mkdir if you want empty folders mirrored across.

If you want to add mirroring to an existing server, also read "Adding a mirror to an existing system" below.

Simply set up two mail servers in a similar manner. We recommend you copy the config from one to the other and then adjust any system-specific settings (mail paths etc.). It's important that both configs have the same domains, the same forward rules and the same g_mirror_secret.

Example: (adding these settings to surgemail.ini)

Server 1: ip 10.0.0.1 (master)

g_mirror_nossl "TRUE"
g_mirror_mode "master"
g_mirror_host "10.0.0.2"
g_mirror_secret "testing"
g_mirror_config "true" (if you want to mirror config changes as well)

Server 2: ip 10.0.0.2 (slave)

g_mirror_nossl "TRUE"
g_mirror_mode "slave"
g_mirror_host "10.0.0.1"
g_mirror_secret "testing"
g_mirror_config "true" (if you want to mirror config changes as well)
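
Before restarting the servers, the two fragments above can be sanity-checked for the properties this FAQ asks for: one master and one slave, a shared g_mirror_secret, and each g_mirror_host pointing at the other box. The following is a minimal sketch, assuming every setting sits on its own line in the form name "value" as in the example; parse_settings and check_mirror_pair are hypothetical helper names, not part of SurgeMail.

```python
import re

# Matches lines such as: g_mirror_mode "master"
SETTING = re.compile(r'^\s*(g_\w+)\s+"([^"]*)"')


def parse_settings(ini_text):
    """Collect name/value pairs from surgemail.ini-style text."""
    settings = {}
    for line in ini_text.splitlines():
        match = SETTING.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def check_mirror_pair(ini_a, ip_a, ini_b, ip_b):
    """Return a list of problems found in a pair of mirror configs."""
    a, b = parse_settings(ini_a), parse_settings(ini_b)
    problems = []
    modes = sorted([a.get("g_mirror_mode", ""), b.get("g_mirror_mode", "")])
    if modes != ["master", "slave"]:
        problems.append("one server must be master and the other slave")
    if a.get("g_mirror_secret") != b.get("g_mirror_secret"):
        problems.append("g_mirror_secret differs")
    if a.get("g_mirror_host") != ip_b or b.get("g_mirror_host") != ip_a:
        problems.append("g_mirror_host must point at the other server")
    return problems


master_ini = '''g_mirror_nossl "TRUE"
g_mirror_mode "master"
g_mirror_host "10.0.0.2"
g_mirror_secret "testing"
'''
slave_ini = '''g_mirror_nossl "TRUE"
g_mirror_mode "slave"
g_mirror_host "10.0.0.1"
g_mirror_secret "testing"
'''

print(check_mirror_pair(master_ini, "10.0.0.1", slave_ini, "10.0.0.2"))  # []
```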

Commands to issue after adding a new SLAVE to an existing system:
issue "tellmail resync_config" on master (if using g_mirror_config)
issue "tellmail resync_nwauth" on master (if using nwauth)
issue "tellmail resync_fast" on master

The above are the settings that go into each server's surgemail.ini. That will give you a mirror; it's that simple.
You may wish to add g_mirror_trash "true" if you want the trash folder to mirror as well.

Now you need to consider how users get to the server and how you can easily allow them to get to the 'working' server in the event of a failure.

For incoming messages you can just set up 'MX' records so that the backup server is listed as a lower-priority host, e.g.:

your.domain MX=10 mail.your.domain
your.domain MX=20 mail2.your.domain
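
A sending mail server tries the MX host with the lowest preference number first and only falls back to the next one when that host cannot be reached. A minimal sketch of that ordering, using the two example records above (delivery_order is a hypothetical helper, and real senders randomize among hosts with equal preference, which this sketch ignores):

```python
def delivery_order(mx_records):
    """Return hosts in the order a sender would try them (lowest preference first)."""
    return [host for preference, host in sorted(mx_records)]


records = [(20, "mail2.your.domain"), (10, "mail.your.domain")]
print(delivery_order(records))  # ['mail.your.domain', 'mail2.your.domain']
```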

But for user access to the server you have several options:

  1. Tell users to use 'mail2.your.domain' in the event of a failure (suitable for office mail servers)
  2. Manually change the IP number of the systems in the event of a catastrophic failure.
  3. Invent an 'extra' IP number for the mail server and assign it to the 'working' box. Then manually add that IP number to the other system during a failure of the main system.
  4. Use system 3, but then use some scripts to automatically change the IP number during failures (not recommended).
  5. Use system 3 but then use a router that can do the failover on the fly for you. (expensive but reliable)

Using config setting mirroring (Requires SurgeMail 3.1 or later)

You can choose to enable config setting mirroring. This causes SurgeMail to send its config from master to slave, and vice versa, whenever config changes are made in the web interface (it does not notice manual changes made by editing the config file).

First make a backup of both ini files, just in case :-)

To enable it set this on BOTH machines:

g_mirror_config "TRUE"

then:

(ON THE MASTER) tellmail resync_config

Of course, you do not always want to mirror all the settings, especially the settings that control mirroring itself, such as g_mirror_host. You may use g_mirror_config_except to specify settings to be ignored when processing an incoming config. In addition, a number of settings are ignored by default; see g_mirror_config_except for details.

Adding a mirror to an existing system

Install surgemail on the new system, then follow the instructions above in "How do I turn it on?" to add the correct mirror settings to both ini files (old system and new system). Set the new system up as the SLAVE, then:

issue "tellmail resync_config" on master (if using g_mirror_config)
issue "tellmail surgehost_update" on master
issue "tellmail resync_nwauth" on master (if using nwauth)
issue "tellmail resync_fast" on master

Note: Although this feature exists in earlier versions of surgemail, we recommend upgrading to 3.1 before using it as we made significant improvements to the fault tolerance of this feature (it's more idiot proof in version 3.1 :-)
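
The order of the commands listed above depends on whether you mirror the config and whether you use nwauth. A small sketch that builds the list for a given setup; it only produces the command strings, to be run by hand on the master, and the function name resync_commands and its parameters are made up for illustration:

```python
def resync_commands(mirror_config=False, uses_nwauth=False):
    """Return the tellmail commands to run on the master, in order."""
    commands = []
    if mirror_config:  # only when g_mirror_config is enabled
        commands.append("tellmail resync_config")
    commands.append("tellmail surgehost_update")
    if uses_nwauth:  # only when nwauth is the user database
        commands.append("tellmail resync_nwauth")
    commands.append("tellmail resync_fast")
    return commands


for command in resync_commands(mirror_config=True, uses_nwauth=True):
    print(command)
```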

What is not mirrored?

Only the users' mailboxes/folders, nwauth, and surgemail.ini are mirrored. Any files external to these will not be mirrored. On a new system it is sometimes wise to start by duplicating the surgemail root directory, /usr/local/surgemail (c:\surgemail), to pick up other odd files you may have tailored. This depends a lot on how much tailoring you've done on your system.

How do I know it is working? Is it in sync yet? Is it keeping up?

Always check in two ways, first check the status as below, then compare two directories manually to be 'absolutely' sure.

In the status window (near the end) you will see the following information:

Mirror out: Que/sent add=612/611 (3343432 bytes) del=612/610   

Mirror  in: Received add=0 del=0 rename=0 failed=0

This shows both halves of the mirroring operation. On the "Mirror out:" line, each pair of numbers is queued/sent. Here add=612/611 means 612 messages have been queued for the other system and 611 have been sent successfully, so one is still queued. Likewise del=612/610 means 612 delete operations have been queued and 610 sent, so two are still queued. Normally the two numbers in each pair should match.

The second line, "Mirror in:" shows how many new items or deletions have arrived from the other system.
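
If you want to watch the backlog from a script, the queued/sent pairs can be pulled out of the "Mirror out:" line with a regular expression. This sketch assumes the line looks exactly like the sample above; the format may differ between SurgeMail versions, and mirror_backlog is a hypothetical helper name.

```python
import re

# Captures the queued/sent pairs for adds and deletes.
MIRROR_OUT = re.compile(r"add=(\d+)/(\d+).*?del=(\d+)/(\d+)")


def mirror_backlog(status_line):
    """Return (adds still queued, deletes still queued) for a 'Mirror out:' line."""
    match = MIRROR_OUT.search(status_line)
    if match is None:
        raise ValueError("not a recognised 'Mirror out:' status line")
    add_queued, add_sent, del_queued, del_sent = map(int, match.groups())
    return add_queued - add_sent, del_queued - del_sent


line = "Mirror out: Que/sent add=612/611 (3343432 bytes) del=612/610"
print(mirror_backlog(line))  # (1, 2)
```

A backlog that stays at or near zero means the mirror is keeping up; a backlog that keeps growing means it is falling behind.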

To compare directories, do this on both servers and compare the directory listings:

tellmail path user@domain.name
dir [path it returns]/mdir/new
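
For larger mailboxes, comparing the two listings by eye gets tedious. A minimal sketch that reports names present on only one side, assuming mirrored message files carry the same file names on both servers (which the manual comparison above implies, but which you should confirm on your own system). The listings are passed in as plain lists of names, for example saved from the dir output on each server; compare_listings is a hypothetical helper.

```python
def compare_listings(names_a, names_b):
    """Return (names only on server A, names only on server B)."""
    set_a, set_b = set(names_a), set(names_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


listing_a = ["msg1", "msg2", "msg3"]
listing_b = ["msg1", "msg3"]
print(compare_listings(listing_a, listing_b))  # (['msg2'], [])
```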

Lastly, issue a 'tellmail resync_fast' and check in the status to see how many corrections it needs to send.

What's the deal with master/slave, can I swap them over?

For internal reasons we needed to establish a master/slave concept, although in almost all respects the two servers are identical and neither is the 'master' in any behavioral sense. For example, if you change something on the slave the change will appear on the master, and vice versa. The one thing you should never do is swap the master/slave settings over, as this will confuse the mirroring software! (It can be done reasonably safely if both servers are stopped at the time, but it's best avoided :-)

We do recommend that you generally avoid doing things on both servers randomly. It's best to make everything go through one server and do all changes etc. on that server, then use the other server purely as a 'hot' backup. That way, if something does go wrong but goes unnoticed (e.g. they get unplugged from each other), you will know which one is in a 'good' state and which one is 'out of date'.

Note: DLIST runs only on the 'master', so in the situation where your master is going to be down for several days, you will need to swap master and slave so that the dlist on the 'slave' will come to life.

How long can I expect mirroring actions to take in a real environment (100 users + 1GB mail / 5000 users + 50GB mail)?

How long is a piece of string? :-) The time depends mainly on the 'number' of messages stored on the server, so it is not directly related to the number of users or the size of your mail store.

But as a rough guide, to resync from scratch, you would expect it to take something like:

  • 10 minutes for 100 users first time, 1 minute to resync
  • 3 hours for 1000 users first time, 4 minutes to resync
  • 3 days for 40,000 users first time, 3-4 hours to resync

What happens if I stop surgemail / surgemail crashes / change my mirroring configuration etc. halfway through...

The mirroring is very forgiving. It will try to continue after a crash, and while one server is down changes are 'queued' until it reappears. The only time you must issue commands is when one server's disk is lost or reformatted; then you must issue a 'tellmail resync' on the 'good' system.

If mirroring is really such a good idea, why do none of the competing products offer this capability?

Mostly because they can't. To implement mirroring it's essential to integrate it into the core mail server code at the design stage, so they are too far down the path to add it.

Most other suppliers offer one of two alternatives instead. They either provide 'file system' level mirroring, which at best is much less efficient and likely to be minutes or hours 'out of date', or they promote the 'shared network drive' approach, even though this clearly fails to duplicate the 'data' and thus is completely ineffective as a fault-redundant solution.

I have two live mail systems that I would like to mirror each other. Can I, and how do I configure this?

Assuming these systems currently run different domains, yes you can. First add the other system's domains to each of the servers, then turn on the mirroring settings. Then issue a 'tellmail resync' on 'both' servers so that each one sends its own domains to the 'other' system.

Can I check it's always working?

The best test is to send an email to one system, then read it from the other. You can set up our 'watchdog' utility to do this automatically once an hour so that you will always know if anything goes wrong.

You can also check the mirror section of the 'status' page. The cryptic errors there are often not that important; the key thing to look for is whether the counters of successfully mirrored items are ticking over.

Is there any functionality / settings etc. that are not mirrored?

The best answer we can give here is 'probably' :-). We've tried to identify everything, but there may be things we've missed. In particular, if you add settings to your config file which refer to non-standard files, then those files may not get mirrored. (An alias file setting for a domain would be one example.)

DLIST currently only runs on 'one' of the servers (the master). This is to avoid mailing list messages being sent twice by mistake :-) Its files are mirrored though, so the data is duplicated.

The user database is not mirrored unless you are using NWAUTH and you turn on the setting. If you are using some other user database then you will need to consider if it needs mirroring in some way. Usually in this situation it won't be an issue as it will be a network accessed database anyway.

Please also note that the config mirroring is new and requires SurgeMail 3.1 or later for best results.

Any Performance Impact?

There is of course some, since the data does need to be sent between the two systems. However, the load is by no means doubled, as the mirroring occurs at the delivery stage after much pre-processing has occurred (e.g. spam and virus filtering). Also, most mail servers run at about 98% idle, so the extra load is really of no relevance (even for quite large ISP operations). We run mirroring on the servers we host with 40,000 plus users on a system. So far we've had about 3 RAID/system failures on our own hosting systems where mirroring has 'saved the day' and resulted in no significant loss of data.

 

You can mirror over a WAN connection, but the round-trip time may slow the mirroring down a bit, so a very heavily used system may struggle to keep in sync. On most systems this is not a problem, but on very large, busy systems mirroring over a WAN would be a mistake.

Swap Master/Slave when a system dies or is replaced.

For short periods there is no need to swap master/slave. The only thing that doesn't function on the slave is mailing lists.

If the master is dead or is being replaced, and this may take a week or so, you may choose to swap them. Do it like this:

1) Stop both servers.

2) On the slave, change the g_mirror_mode setting from "slave" to "master".

3) If needed, change/swap the IP addresses of the servers.

4) Start the new master server (which was the slave).

5) When the old master server is repaired, be sure to set its g_mirror_mode to "slave" before starting it!!!

6) Issue 'tellmail resync_config' and 'tellmail resync' on the MASTER (which was the slave) once the new 'slave' is running.
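
The g_mirror_mode change in steps 2 and 5 is a one-line edit to surgemail.ini, so it can be scripted to reduce the chance of a typo. The following is a minimal sketch, assuming the setting appears exactly once in the form g_mirror_mode "slave" or g_mirror_mode "master"; set_mirror_mode is a hypothetical helper that works on the file's text, and it should only be applied to a backed-up ini file while the server is stopped.

```python
import re

# Matches the whole setting line prefix plus its current quoted value.
MODE_LINE = re.compile(r'^([ \t]*g_mirror_mode[ \t]+)"(master|slave)"', re.MULTILINE)


def set_mirror_mode(ini_text, new_mode):
    """Return ini_text with g_mirror_mode set to new_mode."""
    if new_mode not in ("master", "slave"):
        raise ValueError("mode must be 'master' or 'slave'")
    updated, count = MODE_LINE.subn(
        lambda m: f'{m.group(1)}"{new_mode}"', ini_text
    )
    if count != 1:
        raise ValueError("expected exactly one g_mirror_mode line")
    return updated


old_slave_ini = 'g_mirror_nossl "TRUE"\ng_mirror_mode "slave"\ng_mirror_host "10.0.0.1"\n'
print(set_mirror_mode(old_slave_ini, "master"))
```

Refusing to continue unless exactly one g_mirror_mode line is found is deliberate: a missing or duplicated setting is a sign the file is not what the script expects, and a wrong mode on restart is exactly what this procedure tries to avoid.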